Why do we use Poisson distribution or Negative Binomial distribution for regression?

Let’s say we have predictor variables (features) denoted as $latex X \in \mathbb{R}^n$ and response variable (label) $latex y$ whose underlying random variable is $latex Y$. If we want to fit an Ordinary Least Square (OLS) regression such that $latex y=WX+\epsilon$ where $latex \epsilon$ is an error term, then we have the following assumptions: 

  1. strict exogenous: $latex E(\epsilon|X)=0$ (error is not correlated with data itself, i.e., error comes from outside system)
  2. spherical error: $latex Var(\epsilon|X)=\sigma^2 I_n$ (variance on each dimension should be the same.)
  3. … More on wikipedia 

If we think $latex Y$, rather than growing linearly as X does, actually grows exponentially, then we have $latex E[log (Y)|X] = WX$. We can treat $latex y’ = log(y)$ and fit a new OLS regression. However, the problem is that when $latex WX$ is smaller than 0, it causes predicted $latex \hat{y}’$ smaller than 0, which makes no sense given it is the result of log.

Although we can ignore this in practice if fit is already good, the alternative is to fit a Poisson regression. $latex Y \sim \text{Poisson Distribution}(WX)$, i.e.,

$latex \lambda := E(Y|X;W)=e^{WX} \\ p(y|X;W)=\frac{\lambda^y}{y!}e^{-\lambda}= \frac{e^{yWX} \cdot e^{-e^{WX}}}{y!} &s=2$

 

We can learn $latex W$ by doing optimization to maximize the overall likelihood. 

The advantage of using Poisson distribution/regression is that it automatically deals with non-negative properties. No matter how small $latex Wx$ is, $latex y$ will always be positive. This can be illustrated through the following plot from wikipedia:

 Screenshot from 2016-11-06 09:44:53

Replace y axis’ title with $latex p(y|x;W)$. We find that if $latex \lambda$ is small (e.g., close to 1), $latex p(y|x;W)$’s probability mass will be concentrated around 0 but still strictly larger than 0. When $latex \lambda$ is close to 4, $latex p(y|x;W)$’s probability mass is a skewed peak. When $latex \lambda$ is becoming larger, then the shape of $latex p(y|x;W)$’s distribution is similar to a normal distribution.

The strong assumption brought by Poisson regression is that $latex E(Y|X;W)=Var(Y|X;W)$. If this assumption does not hold, you may have a poor fit. For example, there is a phenomenon called overdispersion: When $latex WX$ is large, $latex Var(Y|X;W)>E(Y|X;W)$ due to the inability of the model to account for the complication when data values are extreme. People have applied various models, such as negative binomial model or zero-inflated model, to fit the data if Poisson regression dose not suit. See: http://www.ats.ucla.edu/stat/stata/seminars/count_presentation/count.htm

Now, let’s look at one of the models, negative binomial model. In this model, the mean of $latex log(Y)$ is accounted by both $latex WX$ and an unobserved error term $latex \epsilon$: 

$latex E[Y|X, \epsilon;W] = e^{WX+\epsilon}= e^{WX} \cdot e^{\epsilon} = \mu \cdot \tau \\ p(y|X,\tau;W)= \frac{e^{-\mu\tau} \cdot (\mu\tau)^y}{y!}&s=2$

 

If we assume $latex \tau$ follows a gamma distribution, i.e. $latex p(\tau;\theta) \sim gamma(\theta, \theta)$, then:

$latex p(y|X;W) = \int_{0}^{\infty} p(y|X, \tau;W) p(\tau;\theta) d\tau \\= \frac{\Gamma(y+\theta)}{y!\Gamma(\theta)}(\frac{\theta}{\theta+\mu})^\theta (\frac{\mu}{\theta+\mu})^y &s=2$ 

 

We integrate out $latex \tau$ resulting to a negative binomial distribution, with:

$latex E[Y|X] = \mu \\ Var[Y|X] = \mu + \frac{\mu^2}{\theta} &s=2$

 

Therefore, the negative binomial model bypasses the restriction that $latex E[Y|X] = Var[Y|X]$ by introducing a new parameter $latex \theta$ to make the model more flexible when fitting to data.

 

Although Poisson regression originally assumes y should be positive integers (that’s why we often hear that Poisson regression is for count data), there are techniques to allow y to be positive real values. See http://stats.stackexchange.com/questions/70054/how-is-it-possible-that-poisson-glm-accepts-non-integer-numbers

 

Reference: 

http://stats.stackexchange.com/questions/3024/why-is-poisson-regression-used-for-count-data

http://maider.blog.sohu.com/304621504.html

http://www.karlin.mff.cuni.cz/~pesta/NMFM404/NB.html

Leetcode 198: House Robber

https://leetcode.com/problems/house-robber/

198. House Robber 

  • Total Accepted: 101581
  • Total Submissions: 275668
  • Difficulty: Easy
  • Contributors: Admin

You are a professional robber planning to rob houses along a street. Each house has a certain amount of money stashed, the only constraint stopping you from robbing each of them is that adjacent houses have security system connected and it will automatically contact the police if two adjacent houses were broken into on the same night.

Given a list of non-negative integers representing the amount of money of each house, determine the maximum amount of money you can rob tonight without alerting the police.

 

Code (Naive)

class Solution(object):
    def rob(self, nums):
        """
        :type nums: List[int]
        :rtype: int
        """
        if not nums:
            return 0
        return max(nums[0] + self.helper(nums, 2), self.helper(nums, 1))
    
    def helper(self, nums, idx):
        if idx >= len(nums):
            return 0
        else:
            return max(nums[idx] + self.helper(nums, idx+2), self.helper(nums, idx+1))

 

Idea

This is my naive code. While this seems working, it actually takes O(2^N) time and O(N) space.

 

Code

class Solution(object):
    def rob(self, nums):
        """
        :type nums: List[int]
        :rtype: int
        """
        if not nums:
            return 0
            
        prev, cur = 0, 0
        for n in nums:
            new_cur = max(prev+n, cur)
            prev = cur
            cur = new_cur
        
        return cur

 

Idea

It uses two caches: prev, cur. For every n in nums, prev is the maximum stolen amount until the element which is two elements apart from n. cur is the maximum stolen amount until the element before n. Therefore, the maximum stolen amount until n, new_cur, is max(prev+n, cur)

 

Reference: https://discuss.leetcode.com/topic/17199/python-solution-3-lines

Leetcode 298: Binary Tree Longest Consecutive Sequence

https://leetcode.com/problems/binary-tree-longest-consecutive-sequence/

Given a binary tree, find the length of the longest consecutive sequence path.

The path refers to any sequence of nodes from some starting node to any node in the tree along the parent-child connections. The longest consecutive path need to be from parent to child (cannot be the reverse).

For example,

   1
    \
     3
    / \
   2   4
        \
         5

Longest consecutive sequence path is 3-4-5, so return 3.

   2
    \
     3
    / 
   2    
  / 
 1

Longest consecutive sequence path is 2-3,not3-2-1, so return 2.

 
 
 
 
Code
# Definition for a binary tree node.
# class TreeNode(object):
#     def __init__(self, x):
#         self.val = x
#         self.left = None
#         self.right = None

class Solution(object):
    def longestConsecutive(self, root):
        """
        :type root: TreeNode
        :rtype: int
        """
        if root is None:
            return 0
            
        self.res = 0
        self.helper(root, root.left, 1)
        self.helper(root, root.right, 1)
        return self.res
        
    def helper(self, prev, cur, length):
        if cur is None:
            self.res = max(self.res, length)
            return
            
        if cur.val == prev.val + 1:
            self.helper(cur, cur.left, length+1)
            self.helper(cur, cur.right, length+1)
        else:
            self.res = max(self.res, length)
            self.helper(cur, cur.left, 1)
            self.helper(cur, cur.right, 1)
            

 

Idea
 
There are three parameters of helper function, prev (the previous node), cur (the current node) and length (the accumulated length before cur). If cur.val == prev.val + 1, then we can continue exploring the current node’s children with length increased by 1 (line 28, 29). Otherwise, refresh length. (line 32, 33)
 
The time complexity in the worst case will always be O(N) where N is the number of tree nodes.

 

Of course you can implement one that does not require a global variable: https://discuss.leetcode.com/topic/29205/simple-recursive-dfs-without-global-variable

Leetcode 70: Climbing Stairs

70. Climbing Stairs

  • Total Accepted: 139063
  • Total Submissions: 363734
  • Difficulty: Easy
  • Contributors: Admin

You are climbing a stair case. It takes n steps to reach to the top.

Each time you can either climb 1 or 2 steps. In how many distinct ways can you climb to the top?

 

Code

from math import sqrt

class Solution(object):
    def climbStairs(self, n):
        """
        :type n: int
        :rtype: int
        """
        if n < 0:
            return -1
        elif n == 0:
            return 0
        else:
            return int((((1 + sqrt(5))/2)**(n+1) - ((1 - sqrt(5))/2)**(n+1))/sqrt(5))

 

Idea

Essentially a fibonacci sequence.

Leetcode 258: Add Digits

258. Add Digits

  • Total Accepted: 134747
  • Total Submissions: 269960
  • Difficulty: Easy
  • Contributors: Admin

Given a non-negative integer num, repeatedly add all its digits until the result has only one digit.

For example:

Given num = 38, the process is like: 3 + 8 = 11, 1 + 1 = 2. Since 2 has only one digit, return it.

Follow up:
Could you do it without any loop/recursion in O(1) runtime?

Hint:

  1. A naive implementation of the above process is trivial. Could you come up with other methods?
  2. What are all the possible results?
  3. How do they occur, periodically or randomly?
  4. You may find this Wikipedia article useful.

 

Code

class Solution(object):
    def addDigits(self, num):
        """
        :type num: int
        :rtype: int
        """
        if num == 0:
            return num
        elif num % 9 != 0:
            return num % 9
        else:
            return 9

 

Idea

The result is called Digital root. Digital root will be equal to num % 9 except when num equals to 0 or num is itself 9’s multiples.

 

 

Leetcode 262: Trips and Users

262. Trips and Users

  • Total Accepted: 6774
  • Total Submissions: 43127
  • Difficulty: Hard
  • Contributors: Admin

The Trips table holds all taxi trips. Each trip has a unique Id, while Client_Id and Driver_Id are both foreign keys to the Users_Id at the Users table. Status is an ENUM type of (‘completed’, ‘cancelled_by_driver’, ‘cancelled_by_client’).

+----+-----------+-----------+---------+--------------------+----------+
| Id | Client_Id | Driver_Id | City_Id |        Status      |Request_at|
+----+-----------+-----------+---------+--------------------+----------+
| 1  |     1     |    10     |    1    |     completed      |2013-10-01|
| 2  |     2     |    11     |    1    | cancelled_by_driver|2013-10-01|
| 3  |     3     |    12     |    6    |     completed      |2013-10-01|
| 4  |     4     |    13     |    6    | cancelled_by_client|2013-10-01|
| 5  |     1     |    10     |    1    |     completed      |2013-10-02|
| 6  |     2     |    11     |    6    |     completed      |2013-10-02|
| 7  |     3     |    12     |    6    |     completed      |2013-10-02|
| 8  |     2     |    12     |    12   |     completed      |2013-10-03|
| 9  |     3     |    10     |    12   |     completed      |2013-10-03| 
| 10 |     4     |    13     |    12   | cancelled_by_driver|2013-10-03|
+----+-----------+-----------+---------+--------------------+----------+

The Users table holds all users. Each user has an unique Users_Id, and Role is an ENUM type of (‘client’, ‘driver’, ‘partner’).

+----------+--------+--------+
| Users_Id | Banned |  Role  |
+----------+--------+--------+
|    1     |   No   | client |
|    2     |   Yes  | client |
|    3     |   No   | client |
|    4     |   No   | client |
|    10    |   No   | driver |
|    11    |   No   | driver |
|    12    |   No   | driver |
|    13    |   No   | driver |
+----------+--------+--------+

Write a SQL query to find the cancellation rate of requests made by unbanned clients between Oct 1, 2013 and Oct 3, 2013. For the above tables, your SQL query should return the following rows with the cancellation rate being rounded to two decimal places.

+------------+-------------------+
|     Day    | Cancellation Rate |
+------------+-------------------+
| 2013-10-01 |       0.33        |
| 2013-10-02 |       0.00        |
| 2013-10-03 |       0.50        |
+------------+-------------------+

 

Code

# Write your MySQL query statement below
select 
    Request_at as Day,
    round(count(IF(Status != 'completed', 1, NULL) ) / count(*), 2) as 'Cancellation Rate'
from Trips
join Users
on Users.Users_id = Trips.Client_Id
where 
    Banned = 'No' 
    and Request_at between '2013-10-01' and '2013-10-03'
group by Request_at

 

Idea

In order to calculate cancellation rate per day, you need to do group by Request_at. To count cancellation, we need to have an IF clause for counting. 

 

Reference: https://discuss.leetcode.com/topic/42188/solution-without-join/3

Leetsql 185: Department Top Three Salaries

185. Department Top Three Salaries

  • Total Accepted: 8926
  • Total Submissions: 57724
  • Difficulty: Hard
  • Contributors: Admin

The Employee table holds all employees. Every employee has an Id, and there is also a column for the department Id.

+----+-------+--------+--------------+
| Id | Name  | Salary | DepartmentId |
+----+-------+--------+--------------+
| 1  | Joe   | 70000  | 1            |
| 2  | Henry | 80000  | 2            |
| 3  | Sam   | 60000  | 2            |
| 4  | Max   | 90000  | 1            |
| 5  | Janet | 69000  | 1            |
| 6  | Randy | 85000  | 1            |
+----+-------+--------+--------------+

The Department table holds all departments of the company.

+----+----------+
| Id | Name     |
+----+----------+
| 1  | IT       |
| 2  | Sales    |
+----+----------+

Write a SQL query to find employees who earn the top three salaries in each of the department. For the above tables, your SQL query should return the following rows.

+------------+----------+--------+
| Department | Employee | Salary |
+------------+----------+--------+
| IT         | Max      | 90000  |
| IT         | Randy    | 85000  |
| IT         | Joe      | 70000  |
| Sales      | Henry    | 80000  |
| Sales      | Sam      | 60000  |
+------------+----------+--------+

 

Code

# Write your MySQL query statement below
select d1.Name as Department, e1.Name as Employee, e1.Salary as Salary 
from Employee e1
join Department d1
on d1.Id = e1.DepartmentId
where 3 > (select count(distinct e2.Salary) 
           from Employee e2
           where e2.DepartmentId = e1.DepartmentId
           and e2.Salary > e1.Salary)

 

Idea

line 2~5 is just joining two tables, Employee and Department. line 6-9 makes sure the selection includes employees who earns top three.

 

Reference: https://discuss.leetcode.com/topic/8562/accepted-solution-without-group-by-or-order-by

LeetSql 177: Nth Highest Salary

177. Nth Highest Salary

  • Total Accepted: 12870
  • Total Submissions: 80808
  • Difficulty: Medium
  • Contributors: Admin

Write a SQL query to get the nth highest salary from the Employee table.

+----+--------+
| Id | Salary |
+----+--------+
| 1  | 100    |
| 2  | 200    |
| 3  | 300    |
+----+--------+

For example, given the above Employee table, the nth highest salary where n = 2 is 200. If there is no nth highest salary, then the query should return null.

 

Code

CREATE FUNCTION getNthHighestSalary(N INT) RETURNS INT
BEGIN
  declare M INT;
  set M=N-1;
  RETURN (
      # Write your MySQL query statement below.
      select distinct Salary from Employee order by Salary desc limit M, 1
  );
END

 

Idea

limit M, 1 means return one element since M element.

Leetcode 40: Combination Sum II

40. Combination Sum II

  • Total Accepted: 89506
  • Total Submissions: 294097
  • Difficulty: Medium
  • Contributors: Admin

Given a collection of candidate numbers (C) and a target number (T), find all unique combinations in C where the candidate numbers sums to T.

Each number in C may only be used once in the combination.

Note:

  • All numbers (including target) will be positive integers.
  • The solution set must not contain duplicate combinations.

For example, given candidate set [10, 1, 2, 7, 6, 1, 5] and target 8,
A solution set is:

[
  [1, 7],
  [1, 2, 5],
  [2, 6],
  [1, 1, 6]
]

 

Code

class Solution(object):
    def combinationSum2(self, candidates, target):
        """
        :type candidates: List[int]
        :type target: int
        :rtype: List[List[int]]
        """
        # no duplication in result. so sort candidates first.
        candidates = sorted(candidates)
    
        path, res = [], []
        self.dfs(0, candidates, path, res, target)
        return res
    
    def dfs(self, start_idx, candidates, path, res, target):
        # original target and all candidates are positive
        if target < 0:
            return
        elif target == 0:
            res.append(list(path))
            return
        
        for i in xrange(start_idx, len(candidates)):
            # remove duplication
            if i > start_idx and candidates[i] == candidates[i-1]:
                continue
            path.append(candidates[i])
            self.dfs(i+1, candidates, path, res, target-candidates[i])
            path.pop()
        

 

Idea

Essentially it is a DFS. In order to avoid duplicate combinations, we sort the candidates first (line 9) and also apply some mechanism at line 25. Then, we try adding `candidates[i]` to path, modifying target to `target-candidates[i]` and recurring on `dfs` method. 

Reference: https://discuss.leetcode.com/topic/19845/java-solution-using-dfs-easy-understand