LambdaMART

LambdaMART [7] is one of the well-known learning-to-rank algorithms. It focuses on fitting the correct order of a list, which contains all documents returned for a query, each marked with a relevance grade. It is a derivation/combination of RankNet, LambdaRank and MART (Multiple Additive Regression Trees). 

For me, the most helpful order in which to approach LambdaMART is to understand: 1. MART; 2. RankNet; 3. LambdaRank; and 4. LambdaMART. [4] actually follows this order when introducing LambdaMART. (Sidenote: another good resource for learning MART is [8].)

1. MART

[Figure: illustration of a single regression tree]

Above is an illustration plot for one regression tree. If data of different labels actually come from disjoint feature spaces, then one regression tree would be enough. However, there are many situations where data do come from interleaved feature spaces. Hence, it is natural to propose using multiple regression trees to differentiate irregular feature subspaces and make predictions.

Now, suppose MART’s prediction function is F(x)=\sum\limits_{m=0}^{M} h(x;a_m), where each h(x;a_m) is one regression tree’s output and a_m are the parameters of that tree. Generally speaking, each regression tree has K terminal nodes. Hence a_m contains \{R_{km} \}_1^K, which are the K feature subspaces that divide the data, and \{\gamma_{km}\}_1^K, which are the predicted outputs for data belonging to \{R_{km} \}_1^K. h(x;a_m) can be represented as: h(x;a_m) = \sum\limits_{k=1}^K \gamma_{km} \mathbb{I}(x \in R_{km})
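As a minimal sketch of this piecewise-constant view (the regions and leaf values below are made up for illustration, not taken from [4]):

```python
# Sketch: one regression tree h(x; a_m) as a piecewise-constant function.
# The regions R_km and leaf values gamma_km here are hypothetical.

def tree_output(x, regions, gammas):
    """Return gamma_km for the region R_km that contains x (1-D case)."""
    for (lo, hi), gamma in zip(regions, gammas):
        if lo <= x < hi:
            return gamma
    raise ValueError("x falls outside all regions")

# Two terminal nodes (K = 2): a decision stump with cut at v = 0.5.
regions = [(float("-inf"), 0.5), (0.5, float("inf"))]
gammas = [1.0, 3.0]

print(tree_output(0.2, regions, gammas))  # 1.0
print(tree_output(0.9, regions, gammas))  # 3.0
```

The indicator sum \sum_k \gamma_{km} \mathbb{I}(x \in R_{km}) collapses to exactly this lookup, because the regions are disjoint.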

\{a_m\}_0^M are iteratively learned to fit to the training data:

  1. Set F_0(x) to initial guess
  2. for each m=1,2,\cdots, M, we learn optimal feature space divisions \{R_{km}\}_1^K
  3. Based on the learned \{R_{km}\}_1^K, the prediction value of each R_{km} (leaf) is deterministically fixed as:  \gamma_{km}=argmin_{\gamma} \sum\limits_{x_i \in R_{km}} L(y_i, F_{m-1}(x_i) + \gamma)

Although [4] doesn’t provide details for learning \{R_{km}\}_1^K, I guess the procedure should be very similar to what is adopted in normal decision trees. One possible criterion could be to minimize the total loss: \{R_{km}\}_1^K = argmin_{\{R_{km}\}_1^K} \sum\limits_{k=1}^K \sum\limits_{x_i \in R_{km}} L(y_i, F_{m-1}(x_i) + \gamma_{km}). (If using this criterion, \{R_{km}\}_1^K and \{\gamma_{km}\}_1^K are actually learned simultaneously.)
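For square loss, the leaf-value argmin in step 3 has a closed form: the mean residual within the leaf. A small sketch (with hypothetical labels and predictions):

```python
# Sketch: gamma_km = argmin_gamma sum_{x_i in R_km} L(y_i, F_{m-1}(x_i) + gamma).
# For square loss L(y, yhat) = (1/2)(y - yhat)^2, setting the derivative with
# respect to gamma to zero gives gamma = mean of the residuals y_i - F_{m-1}(x_i).

def leaf_value_square_loss(ys, preds):
    """gamma minimizing sum (y_i - (F_{m-1}(x_i) + gamma))^2 over one leaf."""
    residuals = [y - p for y, p in zip(ys, preds)]
    return sum(residuals) / len(residuals)

# Hypothetical labels and current predictions F_{m-1}(x_i) for points in one leaf:
ys = [1.0, 2.0, 2.0]
preds = [0.5, 0.5, 0.5]
print(leaf_value_square_loss(ys, preds))  # (0.5 + 1.5 + 1.5) / 3
```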

[4] provides an example to illustrate the iterative learning process. The dataset contains data points in 2 dimensions and labels \{1, 2, 3, 4\}. The loss function L(\cdot) is least square loss. Each regression tree is only a decision stump.

[Figure: the example dataset from [4]]

F_0(x)=h(x;a_0) is the simplest predictor that predicts a constant value minimizing the error for all training data. 

[Figure: F_0(x), the constant initial predictor]

After determining h(x;a_0), we start to build h(x;a_1). The cut v_1 is determined by some criterion not mentioned in [4]. But assuming that v_1 has already been obtained, we can then determine each leaf’s prediction output, i.e., \gamma_{k1}, k=1,2:

[Figure: fitting h(x;a_1), a decision stump with cut v_1]

After determining the first and second trees, F_1(x)=h(x;a_0)+h(x;a_1). In other words, for x<v_1, the predicted value will be 1.444; for x \geq v_1, the predicted value will be 3.625. 

[Figure: F_1(x) = h(x;a_0) + h(x;a_1)]

Next, determining h(x;a_2) is similar: find a cut v_2, and then determine \gamma_{k2}, k=1,2 as in step 3.

[Figure: fitting h(x;a_2)]

Interestingly, when we fit h(x;a_2), it looks as if we are fitting a tree with each data point’s label updated to y_i - F_1(x_i); that is, we are fitting the residual that the previous trees were not able to predict. Below is the plot showing each data point’s original label and its updated label before fitting h(x;a_2).

[Figure: original labels vs. residual labels y_i - F_1(x_i)]

In fact, the fitting process described above is based on the assumption that L is a square loss function. When fitting the m-th tree, we fit it to the residual y - F_{m-1}(x). Interestingly, y - F_{m-1}(x) = - \frac{\partial L(y, F_{m-1}(x))}{\partial F_{m-1}(x)}. That’s why MART is also called Gradient Boosted Decision Tree: each tree fits the gradient of the loss.
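The whole iterative process under square loss can be sketched in a few lines. This is a toy reimplementation on made-up 1-D data (not the dataset from [4]), using decision stumps and fitting each new stump to the residual y - F_{m-1}(x):

```python
# Sketch of MART with square loss: each stump fits the residual
# y - F_{m-1}(x), which equals the negative gradient of the loss.

def fit_stump(xs, rs):
    """Find the cut v and leaf means minimizing squared error on residuals rs."""
    best = None
    for v in sorted(set(xs))[1:]:            # candidate cuts between points
        left = [r for x, r in zip(xs, rs) if x < v]
        right = [r for x, r in zip(xs, rs) if x >= v]
        gl, gr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - gl) ** 2 for r in left) + sum((r - gr) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, v, gl, gr)
    _, v, gl, gr = best
    return lambda x: gl if x < v else gr

xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.0, 2.0, 3.0, 4.0]
F = [0.0] * len(xs)                          # F_0 = 0 for simplicity
for m in range(3):
    residuals = [y - f for y, f in zip(ys, F)]   # -dL/dF for square loss
    h = fit_stump(xs, residuals)
    F = [f + h(x) for f, x in zip(F, xs)]

print(F)  # predictions approach [1, 2, 3, 4] as trees are added
```

Each round shrinks the residual, so the ensemble of coarse stumps gradually approximates the labels, exactly as in the stump example above.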

To be more general and mathematical,  when we learn \{R_{km}\}_1^K and \{\gamma_{km}\}_1^K in the m-th iteration, our objective function is:

min \;\; \sum\limits_i L(y_i, F_{m-1}(x_i) + h(x_i;a_m))

If we take the Taylor series of the objective function up to order 2 around h = 0, we get [10]:

min \;\; \sum\limits_i L(y_i, F_{m-1}(x_i) + h(x_i;a_m)) \newline =\sum\limits_i L(y_i,F_{m-1}(x_i)) + L'(y_i,F_{m-1}(x_i)) h(x_i;a_m) +\frac{1}{2} L''(y_i,F_{m-1}(x_i)) h(x_i;a_m)^2 

Each time, the new regression tree (a_m = \{R_{km}\}_1^K and \{\gamma_{km}\}_1^K) should minimize this objective function, which is a quadratic function with regard to h(x_i;a_m). As long as we know \frac{\partial L}{\partial F_{m-1}} and \frac{\partial^2 L}{\partial F_{m-1}^2}, we can learn the best a_m to minimize the objective function. 

Now, back to the previous example, if L is a square loss function L(y, \hat{y})=\frac{1}{2}(y - \hat{y})^2, then L'(y_i,F_{m-1}(x_i))=-(y_i-F_{m-1}(x_i)) and L''(y_i,F_{m-1}(x_i)) = 1. Thus, the objective function becomes:

min \;\; \sum\limits_i L(y_i, F_{m-1}(x_i) + h(x_i;a_m)) \newline = \sum\limits_i -(y_i-F_{m-1}(x_i)) h(x_i;a_m) + \frac{1}{2} h(x_i;a_m)^2 + constant

This is no different from min \;\; \sum\limits_i L(y_i - F_{m-1}(x_i), h(x_i;a_m)). That’s why in the above fitting process it seems as if we are fitting each new tree to the residual between the real label and the previous trees’ output. From another, equivalent perspective, we are fitting each new tree to the negative gradient - \frac{\partial L(y, F_{m-1}(x))}{\partial F_{m-1}(x)}.

The loss function L can take many kinds of forms, making MART a general supervised learning model. When MART is applied to (binary) classification problems, the loss function for each data point (x_i, y_i) is L(y_i, p_i)=- [y_i \log (p_i) + (1-y_i) \log (1-p_i)], where p_i is the predicted probability p_i=\Pr(y_i=1|x_i). The regression trees’ output is used to fit the logit of p_i, i.e.,

p_i=\frac{1}{1+\exp{(-F_{m-1}(x_i)-h(x_i;a_m))}}

If we formulate everything in terms of P_i=logit(p_i), then we have: 

P_i = logit(p_i) = \log (\frac{p_i}{1-p_i}) = F_{m-1}(x_i)+ h(x_i;a_m) (See [11])

L(y_i, P_i) = - y_i P_i + \log (1 + \exp(P_i)) (See [12])

After such formulation, it is easy to calculate L'(y_i,F_{m-1}(x_i)) = \frac{\partial L}{\partial P_i} \cdot \frac{\partial P_i}{\partial F_{m-1}(x_i)}=\frac{\partial L}{\partial P_i} and L''(y_i,F_{m-1}(x_i)) =\frac{\partial^2 L}{\partial P_i^2} \cdot \left(\frac{\partial P_i}{\partial F_{m-1}(x_i)}\right)^2=\frac{\partial^2 L}{\partial P_i^2}, since \frac{\partial P_i}{\partial F_{m-1}(x_i)} = 1.
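Working these two derivatives out from L(y_i, P_i) = -y_i P_i + \log(1 + \exp(P_i)) gives L' = p_i - y_i and L'' = p_i(1 - p_i). A small sketch:

```python
import math

# Sketch: first and second derivatives of the logistic loss with respect to
# the logit P_i, which MART's second-order (Newton) step needs.
# L(y, P) = -y*P + log(1 + exp(P)), with p = sigmoid(P).

def logistic_derivatives(y, P):
    """Return (L', L'') with respect to P."""
    p = 1.0 / (1.0 + math.exp(-P))   # predicted probability
    return p - y, p * (1.0 - p)      # dL/dP, d2L/dP2

g, h = logistic_derivatives(1, 0.0)
print(g, h)  # -0.5 0.25
```

At P = 0 the model predicts p = 0.5, so for a positive label the gradient is -0.5 (push the score up) and the curvature is at its maximum, 0.25.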

2. RankNet

We have finished introducing MART. Now let’s talk about RankNet and LambdaMART. Starting from now, we follow the notation of [7]:

[Table: notation used in [7]]

(Table from [9])

After reading the first three chapters of [7], you will probably reach this plot: 

[Figure: RankNet gradients vs. the gradients desired for top-heavy metrics, from [7]]

The plot illustrates the motivation for the invention of LambdaRank based on RankNet. In RankNet, the cost function is an easily defined, differentiable function:

C=\frac{1}{2}(1-S_{ij})\sigma(s_i-s_j)+\log(1+e^{-\sigma(s_i - s_j)})

If S_{ij}=1, C=\log(1+e^{-\sigma(s_i - s_j)}); if S_{ij}=-1, C=\log(1+e^{-\sigma(s_j-s_i)}). The intuition behind this cost function is that the more consistent the relative order between s_i and s_j is with S_{ij}, the smaller C will be.
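A quick numeric sketch of this intuition (the scores below are made up):

```python
import math

# Sketch of the RankNet pairwise cost: small when the relative order of
# s_i and s_j agrees with the label S_ij, large when it disagrees.

def ranknet_cost(s_i, s_j, S_ij, sigma=1.0):
    d = sigma * (s_i - s_j)
    return 0.5 * (1.0 - S_ij) * d + math.log(1.0 + math.exp(-d))

# S_ij = 1 means U_i should rank higher than U_j.
print(ranknet_cost(3.0, 1.0, 1))   # consistent order: ~0.127
print(ranknet_cost(1.0, 3.0, 1))   # inconsistent order: ~2.127
```

Note that only the score difference s_i - s_j matters, not where the pair sits in the list; that observation is the seed of LambdaRank below.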

Since the scores s_i and s_j are from a score function of parameters w, the cost function C is a function of w too. Our goal is to adjust w to minimize C. Therefore, we arrive at the gradient of C regarding w (using just one parameter w_k as example):

\frac{\partial C}{\partial w_k}=\lambda_{ij}(\frac{\partial s_i}{\partial w_k} - \frac{\partial s_j}{\partial w_k}), where \lambda_{ij} \equiv \sigma (\frac{1}{2}(1-S_{ij})-\frac{1}{1+e^{\sigma(s_i-s_j)}}). (This is from Eqn. 3 in [7].)

\lambda_{ij} is derived from a pair of URLs U_i and U_j. For each document, we can aggregate these into \lambda_i by taking into account all its interactions with other documents:

 \lambda_i = \sum\limits_{j:\{i,j\}\in I} \lambda_{ij} - \sum\limits_{j:\{j,i\}\in I} \lambda_{ij}

Based on the derivation in Section 2.1 of [7], the update rule for the score function’s parameters w can eventually be expressed in terms of \lambda_i:

w_k \rightarrow w_k - \eta \frac{\partial C}{\partial w_k} = w_k - \eta\sum\limits_i \lambda_i \frac{\partial s_i}{\partial w_k} 

\lambda_i can be seen as the force from document U_i that drives w_k up or down. Here is how one can understand the effect of \lambda_i through a simple but wordy example. Suppose document U_i has only one pair U_i \triangleright U_j (S_{ij}=1). Assume that in one particular update step we have \frac{\partial s_i}{\partial w_k}>0, i.e., an increase of w_k will cause s_i to increase if everything else is fixed. Then \lambda_i=\lambda_{ij}=-\frac{1}{1+e^{(s_i-s_j)}}<0, assuming \sigma=1. From the update rule w_k \rightarrow w_k - \eta\sum\limits_i \lambda_i \frac{\partial s_i}{\partial w_k}, we immediately know that w_k will increase further in order to increase s_i. Nevertheless, depending on whether s_i > s_j or s_j > s_i, the scale of \lambda_i will differ. If s_i > s_j, that is, the currently calculated scores have a relative order consistent with the label, then -0.5 < \lambda_i < 0. On the other hand, if s_i < s_j, that is, the currently calculated scores are inconsistent with the label, |\lambda_i| will be relatively large (\approx 1). 
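The aggregation of \lambda_{ij} into \lambda_i can be sketched directly from the formula above (scores and pairs below are hypothetical):

```python
import math

# Sketch: aggregating pairwise lambda_ij into per-document lambda_i.
# I contains pairs (i, j) with U_i labeled more relevant than U_j (S_ij = 1).

def lambda_ij(s_i, s_j, sigma=1.0):
    # Eqn. 3 of [7] with S_ij = 1: lambda_ij = -sigma / (1 + e^{sigma(s_i - s_j)})
    return -sigma / (1.0 + math.exp(sigma * (s_i - s_j)))

def aggregate_lambdas(scores, I, sigma=1.0):
    lam = [0.0] * len(scores)
    for i, j in I:
        l = lambda_ij(scores[i], scores[j], sigma)
        lam[i] += l      # sum over pairs where the document is the "better" one
        lam[j] -= l      # minus sum over pairs where it is the "worse" one
    return lam

scores = [2.0, 1.0, 0.0]
I = [(0, 1), (0, 2), (1, 2)]   # ideal order: doc 0 > doc 1 > doc 2
print(aggregate_lambdas(scores, I))
```

With scores already in the ideal order, the middle document's forces cancel exactly, while the top and bottom documents receive equal and opposite pushes.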

3. LambdaMART

The problem with RankNet is that |\lambda_i| does not always correlate with the force with which we want to drag a document toward its ideal position. On the right of the example plot, the black arrows denote the |\lambda_i| assigned to the blue documents. However, if we want to use metrics that emphasize the top few results (e.g., NDCG/ERR), we want |\lambda_i| to correlate with the red arrows instead. Unfortunately, for many metrics like NDCG and ERR it is not easy to come up with a well-formed cost function C. Recall the cost function of RankNet: as long as s_i and s_j are consistent with S_{ij}, C will be low regardless of the absolute positions at which U_i and U_j appear. The fortunate part is that we do not need to know C when we train the model: all we need to know is \lambda_{ij}, and we can design our own \lambda_{ij} for those ill-formed cost functions. Thus, the core idea of LambdaRank is to define \lambda_{ij} as:

\lambda_{ij} = \frac{\partial C(s_i- s_j)}{\partial s_i} = \frac{-\sigma}{1+e^{\sigma(s_i - s_j)}}|\triangle Z_{ij}|

where \triangle Z_{ij} is the change in whatever measure you apply caused by swapping U_i and U_j. This swap takes place in the list of all documents sorted by their current calculated scores, with all other documents’ positions kept fixed.
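Taking NDCG as the measure, |\triangle Z_{ij}| can be computed from only the two swapped positions, since everything else stays fixed. A sketch (the relevance labels below are hypothetical):

```python
import math

# Sketch: |delta NDCG_ij| from swapping the documents at ranks i and j in the
# list currently sorted by score. Only the two swapped DCG terms change.

def dcg_term(label, rank):          # rank is 0-based
    return (2 ** label - 1) / math.log2(rank + 2)

def delta_ndcg(labels_in_score_order, i, j, max_dcg):
    before = dcg_term(labels_in_score_order[i], i) + dcg_term(labels_in_score_order[j], j)
    after = dcg_term(labels_in_score_order[j], i) + dcg_term(labels_in_score_order[i], j)
    return abs(after - before) / max_dcg

labels = [0, 2, 1]                  # relevance of docs at current ranks 0, 1, 2
ideal = sorted(labels, reverse=True)
max_dcg = sum(dcg_term(l, r) for r, l in enumerate(ideal))
print(delta_ndcg(labels, 0, 1, max_dcg))   # swapping ranks 0 and 1
```

Because the discount shrinks with rank, a swap near the top of the list yields a much larger |\triangle Z_{ij}| than the same swap lower down, which is exactly the top-heavy behavior the red arrows ask for.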

Now, if the model we are training contains parameters that form a differentiable score function, then we can simply apply the update rule on those parameters as discussed in <2. RankNet>: w_k \rightarrow w_k - \eta \frac{\partial C}{\partial w_k} = w_k - \eta \sum\limits_i \lambda_i \frac{\partial s_i}{\partial w_k}

However, if we are going to train MART, then we need to train it according to its iterative process, which only requires knowing \lambda_{ij}. This finally brings us to the LambdaMART algorithm.

Note that in LambdaMART we usually define \lambda_i in such a way that fitting it is equivalent to maximizing some utility function C. Therefore, the Newton step for calculating \gamma_{lk} is actually doing gradient ascent. Recall that in the <1. MART> section, the objective that each MART tree optimizes is:

min \;\; \sum\limits_i L(y_i, F_{m-1}(x_i) + h(x_i;a_m)) \newline=\sum\limits_i L(y_i,F_{m-1}(x_i)) + L'(y_i,F_{m-1}(x_i)) h(x_i;a_m) +\frac{1}{2} L''(y_i,F_{m-1}(x_i)) h(x_i;a_m)^2

In the LambdaMART algorithm, y_i corresponds to L'(y_i,F_{m-1}(x_i)) and w_i corresponds to L''(y_i,F_{m-1}(x_i)).
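Minimizing that quadratic objective within one leaf gives the Newton step for the leaf value: the sum of the per-document first derivatives (\lambda_i) over the leaf, divided by the sum of the second derivatives (w_i). A sketch with hypothetical values:

```python
# Sketch: Newton step for one LambdaMART leaf value,
# gamma_lk = sum_i lambda_i / sum_i w_i over documents falling in the leaf,
# where w_i is the second derivative of the cost with respect to the score.

def leaf_newton_step(lambdas, ws):
    return sum(lambdas) / sum(ws)

# Hypothetical per-document lambdas and second derivatives in one leaf:
lambdas = [0.4, 0.1, -0.2]
ws = [0.24, 0.09, 0.16]
print(leaf_newton_step(lambdas, ws))  # 0.3 / 0.49
```

This is the same argmin as in step 3 of the MART procedure, just solved with the second-order approximation instead of a closed form.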

 

Reference

[1] http://www.cnblogs.com/wowarsenal/p/3900359.html

[2] https://www.quora.com/search?q=lambdamart

[3] https://liam0205.me/2016/07/10/a-not-so-simple-introduction-to-lambdamart/

[4]  http://www.ccs.neu.edu/home/vip/teach/MLcourse/4_boosting/materials/Schamoni_boosteddecisiontrees.pdf

[5] https://www.slideshare.net/Julian.Qian/learning-to-rank-an-introduction-to-lambdamart

[6] https://en.wikipedia.org/wiki/Newton%27s_method

[7] https://www.microsoft.com/en-us/research/publication/from-ranknet-to-lambdarank-to-lambdamart-an-overview/

[8] https://www.youtube.com/watch?v=IXZKgIsZRm0

[9] https://www.slideshare.net/Julian.Qian/learning-to-rank-an-introduction-to-lambdamart

[10] http://xgboost.readthedocs.io/en/latest/model.html

[11] https://stats.stackexchange.com/questions/157870/scikit-binomial-deviance-loss-function

[12] https://stats.stackexchange.com/questions/204154/classification-with-gradient-boosting-how-to-keep-the-prediction-in-0-1

[13] http://dasonmo.cn/2019/02/08/from-ranknet-to-lambda-mart/

Update 2020/02/22:

This reference contains an example of computing \lambda_i using NDCG metric:

[14] http://dasonmo.cn/2019/02/08/from-ranknet-to-lambda-mart/
