LambdaMART [7] is one of the Learning to Rank algorithms. It emphasizes fitting the correct order of a list, which contains all documents returned for a query, each marked with a relevance level. It is a derivation/combination of RankNet, LambdaRank, and MART (Multiple Additive Regression Trees).
For me, the most helpful order to approach LambdaMART is to understand: 1. MART; 2. RankNet; 3. LambdaRank; and 4. LambdaMART. [4] actually follows the same order to introduce LambdaMART. (Sidenote: another good resource for learning MART is [8].)
1. MART
Above is an illustrative plot of one regression tree. If data of different labels actually came from disjoint feature spaces, then one regression tree would be enough. However, there are many situations where data come from interleaved feature spaces. Hence, it is natural to propose using multiple regression trees to differentiate irregular feature subspaces and make predictions.
Now, suppose MART's prediction function is $F_N(x) = \sum_{n=1}^{N} f_n(x; \beta_n)$, where each $f_n(x; \beta_n)$ is one regression tree's output and $\beta_n$ are the parameters of that tree. Generally speaking, each regression tree has $K$ terminal nodes. Hence $\beta_n$ contains $\{R_{kn}\}_{k=1}^{K}$, which are the $K$ feature subspaces that divide the data, and $\{\gamma_{kn}\}_{k=1}^{K}$, which are the predicted outputs for data belonging to $R_{kn}$. $f_n(x; \beta_n)$ can be represented as:

$$f_n(x; \beta_n) = \sum_{k=1}^{K} \gamma_{kn} \, \mathbb{1}(x \in R_{kn})$$
$\{f_n\}_{n=1}^{N}$ are iteratively learned to fit the training data $\{(x_i, y_i)\}$:
- Set $F_0(x)$ to an initial guess (e.g., a constant).
- For each $n = 1, \dots, N$, learn the optimal feature space division $\{R_{kn}\}_{k=1}^{K}$.
- Based on the learned $\{R_{kn}\}_{k=1}^{K}$, the prediction value $\gamma_{kn}$ of each $R_{kn}$ (leaf) is deterministically fixed as:

$$\gamma_{kn} = \arg\min_{\gamma} \sum_{x_i \in R_{kn}} L\big(y_i, F_{n-1}(x_i) + \gamma\big)$$
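To make these steps concrete, here is a minimal sketch in Python. It assumes a square loss and borrows scikit-learn's `DecisionTreeRegressor` for the per-iteration tree; the function names are mine, not from [4].

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_mart(X, y, n_trees=3, max_depth=1):
    """Minimal MART loop under a square loss.

    Each iteration fits a small regression tree to the residual y - F(x),
    which (as discussed below) is the negative gradient of the square loss."""
    f0 = y.mean()                  # F_0: the constant minimizing the square loss
    F = np.full(len(y), f0)
    trees = []
    for _ in range(n_trees):
        residual = y - F
        tree = DecisionTreeRegressor(max_depth=max_depth)  # a stump when max_depth=1
        tree.fit(X, residual)      # learns both the regions R_kn and the leaf values gamma_kn
        F = F + tree.predict(X)
        trees.append(tree)
    return f0, trees

def predict_mart(X, f0, trees):
    return f0 + sum(t.predict(X) for t in trees)
```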
Although [4] doesn't provide details on learning $\{R_{kn}\}$, I guess it should be very similar to what is adopted in normal decision trees. One possible criterion could be to minimize the total loss: $\sum_{k=1}^{K} \sum_{x_i \in R_{kn}} L\big(y_i, F_{n-1}(x_i) + \gamma_{kn}\big)$. (If using this criterion, $\{R_{kn}\}$ and $\{\gamma_{kn}\}$ are actually learned simultaneously.)
[4] provides an example to illustrate the iterative learning process. The dataset contains data points in 2 dimensions with labels $y_i$. The loss function is the least squares loss. Each regression tree is only a decision stump.
$F_0(x)$ is the simplest predictor: it predicts a single constant value that minimizes the error over all training data (for the square loss, this is the mean of the labels).
After determining $F_0$, we start to build $f_1$. The cut (i.e., $\{R_{k1}\}$) is determined by some criterion not mentioned in [4]. But assuming that $\{R_{k1}\}$ has already been obtained, we can then determine each leaf's prediction output, i.e., $\gamma_{k1}$:
After determining the first and second trees, we have $F_1(x) = F_0(x) + f_1(x)$. In other words, data falling into one leaf of the stump will be predicted as 1.444, and data falling into the other leaf as 3.625.
Next, determining $f_2$ is similar: find a cut $\{R_{k2}\}$, and then determine $\gamma_{k2}$ as in step 3.
Interestingly, when we fit $f_2$, it looks like we are fitting a tree whose target for each data point is the updated label $y_i - F_1(x_i)$; that is, we are fitting the residual that the previous trees were not able to predict. Below is the plot showing each data point's original label and its updated label before fitting $f_2$.
In fact, the fitting process described above is based on the assumption that $L$ is the square loss $L\big(y_i, F(x_i)\big) = \frac{1}{2}\big(y_i - F(x_i)\big)^2$. When fitting the $n$-th tree, we are fitting the tree to the residual $y_i - F_{n-1}(x_i)$. Interestingly, $y_i - F_{n-1}(x_i) = -\frac{\partial L(y_i, F_{n-1}(x_i))}{\partial F_{n-1}(x_i)}$. That's why MART is also called Gradient Boosted Decision Tree: each tree fits the (negative) gradient of the loss.
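For completeness, here is the one-line check of that identity under the square loss:

$$-\frac{\partial L\big(y_i, F_{n-1}(x_i)\big)}{\partial F_{n-1}(x_i)} = -\frac{\partial}{\partial F_{n-1}(x_i)}\,\frac{1}{2}\big(y_i - F_{n-1}(x_i)\big)^2 = y_i - F_{n-1}(x_i)$$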
To be more general and mathematical, when we learn $\{R_{kn}\}$ and $\{\gamma_{kn}\}$ in the $n$-th iteration, our objective function is:

$$\mathcal{L}_n = \sum_i L\big(y_i, F_{n-1}(x_i) + f_n(x_i)\big)$$
If we take the Taylor series of the objective function up to order 2 around $f_n(x_i) = 0$, we get [10]:

$$\mathcal{L}_n \approx \sum_i \Big[ L\big(y_i, F_{n-1}(x_i)\big) + g_i f_n(x_i) + \frac{1}{2} h_i f_n(x_i)^2 \Big],$$

where $g_i = \frac{\partial L(y_i, F_{n-1}(x_i))}{\partial F_{n-1}(x_i)}$ and $h_i = \frac{\partial^2 L(y_i, F_{n-1}(x_i))}{\partial F_{n-1}(x_i)^2}$.
Each time, the new regression tree ($\{R_{kn}\}$ and $\{\gamma_{kn}\}$) should minimize this objective function, which is a quadratic function with respect to $f_n(x_i)$. As long as we know $g_i$ and $h_i$, we can learn the best tree to minimize the objective function.
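Concretely, because the Taylor-expanded objective is separable over leaves and quadratic in each leaf's output, the best leaf value has a closed form (this mirrors the derivation style of [10]; the notation follows section 1):

$$\gamma_{kn}^{*} = \arg\min_{\gamma}\sum_{x_i \in R_{kn}}\Big[ g_i\,\gamma + \frac{1}{2}\, h_i\,\gamma^2 \Big] = -\frac{\sum_{x_i \in R_{kn}} g_i}{\sum_{x_i \in R_{kn}} h_i}$$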
Now, back to the previous example: if $L$ is the square loss $L\big(y_i, F(x_i)\big) = \frac{1}{2}\big(y_i - F(x_i)\big)^2$, then $g_i = F_{n-1}(x_i) - y_i$ and $h_i = 1$. Thus, the objective function becomes:

$$\mathcal{L}_n \approx \sum_i \Big[ L\big(y_i, F_{n-1}(x_i)\big) + \big(F_{n-1}(x_i) - y_i\big) f_n(x_i) + \frac{1}{2} f_n(x_i)^2 \Big] = \sum_i \frac{1}{2}\Big( f_n(x_i) - \big(y_i - F_{n-1}(x_i)\big) \Big)^2 + \text{const}$$
This is no different from fitting each $f_n(x_i)$ to the residual $y_i - F_{n-1}(x_i)$ under a square loss. That's why in the above fitting process it seems like we are fitting each new tree to the residual between the real label and the previous trees' output. From another, equivalent perspective, we are fitting each new tree to the negative gradient $-g_i$.
The loss function $L$ can take many forms, making MART a general supervised learning model. When MART is applied to (binary) classification problems, the loss function for each data point is $L(y_i, p_i) = -\big[y_i \log p_i + (1 - y_i)\log(1 - p_i)\big]$, where $p_i$ is the predicted probability $P(y_i = 1 \mid x_i)$. The regression trees' output $F(x_i)$ is used to fit the logit of $p_i$, i.e.,

$$F(x_i) = \log \frac{p_i}{1 - p_i}$$
If we formulate everything in terms of $F(x_i)$, then we have:

$$p_i = \frac{1}{1 + e^{-F(x_i)}} \quad \text{and} \quad L\big(y_i, F(x_i)\big) = -y_i F(x_i) + \log\big(1 + e^{F(x_i)}\big)$$

(See [11] and [12].)
After such a formulation, it is easy to calculate $g_i$ and $h_i$.
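For instance, under the binomial deviance above, the derivatives with respect to $F(x_i)$ come out to $g_i = p_i - y_i$ and $h_i = p_i(1 - p_i)$. A tiny sketch (my own helper, not code from [4]):

```python
import numpy as np

def logistic_grad_hess(y, F):
    """g and h of the binomial deviance w.r.t. the raw score F (the logit)."""
    p = 1.0 / (1.0 + np.exp(-F))   # predicted probability
    g = p - y                      # dL/dF
    h = p * (1.0 - p)              # d^2L/dF^2
    return g, h

# A confidently wrong prediction (y=1 but F very negative) gets a large |g|.
print(logistic_grad_hess(np.array([1.0, 0.0]), np.array([-3.0, -3.0])))
```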
2. RankNet
We have finished introducing MART. Now let's talk about RankNet and LambdaMART. From now on, we follow the notation from [7]:
(Table from [9])
After reading the first three sections of [7], you will probably arrive at this plot:
The plot illustrates the motivation for the invention of LambdaRank based on RankNet. In RankNet, the cost function for a pair of documents $(U_i, U_j)$ is an easily defined, differentiable function:

$$C = \frac{1}{2}\big(1 - S_{ij}\big)\,\sigma\,(s_i - s_j) + \log\big(1 + e^{-\sigma(s_i - s_j)}\big)$$
If $S_{ij} = 1$, then $C = \log\big(1 + e^{-\sigma(s_i - s_j)}\big)$; if $S_{ij} = -1$, then $C = \log\big(1 + e^{-\sigma(s_j - s_i)}\big)$. The intuition behind this cost function is that the more consistent the relative order of $s_i$ and $s_j$ is with $S_{ij}$, the smaller $C$ will be.
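A quick numeric check of that intuition, with $\sigma = 1$ (variable names follow the notation table from [7]):

```python
import numpy as np

def ranknet_cost(s_i, s_j, S_ij, sigma=1.0):
    """Pairwise RankNet cost C; S_ij is 1, 0, or -1."""
    diff = sigma * (s_i - s_j)
    return 0.5 * (1.0 - S_ij) * diff + np.log1p(np.exp(-diff))

print(ranknet_cost(2.0, 0.0, S_ij=1))   # order agrees with the label    -> ~0.13
print(ranknet_cost(0.0, 2.0, S_ij=1))   # order contradicts the label    -> ~2.13
```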
Since the scores $s_i$ and $s_j$ come from a score function with parameters $w$, the cost function $C$ is a function of $w$ too. Our goal is to adjust $w$ to minimize $C$. Therefore, we arrive at the gradient of $C$ with respect to $w$ (using just one parameter $w_k$ as an example):

$$\frac{\partial C}{\partial w_k} = \frac{\partial C}{\partial s_i}\frac{\partial s_i}{\partial w_k} + \frac{\partial C}{\partial s_j}\frac{\partial s_j}{\partial w_k} = \lambda_{ij}\Big(\frac{\partial s_i}{\partial w_k} - \frac{\partial s_j}{\partial w_k}\Big),$$

where $\lambda_{ij} = \frac{\partial C}{\partial s_i} = -\frac{\partial C}{\partial s_j} = \sigma\Big(\frac{1}{2}\big(1 - S_{ij}\big) - \frac{1}{1 + e^{\sigma(s_i - s_j)}}\Big)$. (This is Eqn. 3 in [7].)
$\lambda_{ij}$ is derived from a pair of URLs $U_i$ and $U_j$. For each document we can aggregate the $\lambda_{ij}$ into a single $\lambda_i$ by taking into account all of its interactions with other documents:

$$\lambda_i = \sum_{j:(i,j) \in I} \lambda_{ij} - \sum_{j:(j,i) \in I} \lambda_{ij},$$

where $I$ is the set of pairs $(i, j)$ in which $U_i$ is labeled more relevant than $U_j$.
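A minimal sketch of that aggregation, assuming $\sigma = 1$ and that `pairs` enumerates the set $I$, i.e., pairs $(i, j)$ where document $i$ is labeled more relevant than document $j$:

```python
import numpy as np

def aggregate_lambdas(scores, pairs, sigma=1.0):
    """Turn pairwise lambda_ij into one lambda_i per document.

    For (i, j) in I we have S_ij = 1, so
    lambda_ij = -sigma / (1 + exp(sigma * (s_i - s_j)))."""
    lambdas = np.zeros(len(scores))
    for i, j in pairs:
        lam_ij = -sigma / (1.0 + np.exp(sigma * (scores[i] - scores[j])))
        lambdas[i] += lam_ij   # first sum: pairs where doc i should rank higher
        lambdas[j] -= lam_ij   # second sum: pairs where doc j should rank lower
    return lambdas

scores = np.array([0.5, 2.0, 1.0])   # current model scores
pairs = [(0, 1), (0, 2)]             # doc 0 is labeled more relevant than docs 1 and 2
print(aggregate_lambdas(scores, pairs))
```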
Based on the derivation in Section 2.1 of [7], the update rule for a score function parameter $w_k$ can eventually be expressed through $\lambda_i$:

$$w_k \leftarrow w_k - \eta \sum_i \lambda_i \frac{\partial s_i}{\partial w_k}$$
$\lambda_i$ can be seen as the force that drives the score $s_i$ of document $U_i$ up or down. Here is how one can understand the effect of $\lambda_i$ through a simple but wordy example. Suppose document $U_i$ appears in only one pair $(U_i, U_j)$ with $S_{ij} = 1$, so $\lambda_i = \lambda_{ij} < 0$. Assume that in one particular update step we have $\frac{\partial s_i}{\partial w_k} > 0$, i.e., an increase of $w_k$ will cause $s_i$ to increase if everything else is fixed. According to the update rule $w_k \leftarrow w_k - \eta \lambda_i \frac{\partial s_i}{\partial w_k}$, we immediately know that $w_k$ is going to increase, which in turn increases $s_i$. Nevertheless, depending on whether $s_i > s_j$ or $s_i < s_j$, the magnitude of $\lambda_{ij}$ will be different. If $s_i > s_j$, that is, the currently calculated scores have a relative order consistent with the label, then $|\lambda_{ij}| = \frac{\sigma}{1 + e^{\sigma(s_i - s_j)}}$ will be small (close to 0). On the other hand, if $s_i < s_j$, that is, the currently calculated scores are inconsistent with the label, $|\lambda_{ij}|$ will become relatively large (close to 1 when $\sigma = 1$).
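Numerically, with $\sigma = 1$ and $S_{ij} = 1$:

```python
import numpy as np

def lambda_ij(s_i, s_j, sigma=1.0):
    # S_ij = 1 case of Eqn. 3: lambda_ij = -sigma / (1 + exp(sigma * (s_i - s_j)))
    return -sigma / (1.0 + np.exp(sigma * (s_i - s_j)))

print(lambda_ij(3.0, 0.0))   # consistent order   -> ~ -0.05, a gentle push
print(lambda_ij(0.0, 3.0))   # inconsistent order -> ~ -0.95, a strong push
```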
3. LambdaMART
The problem with RankNet is that $\lambda_i$ sometimes does not correlate well with the force with which we want to drag the document toward its ideal place. On the right of the example plot, the black arrows denote the $\lambda$s the blue documents are assigned. However, if we want to use metrics that emphasize the top few results (e.g., NDCG/ERR), we want $\lambda$ to correlate with the red arrows instead. Unfortunately, for many metrics such as NDCG and ERR it is not easy to come up with a well-formed cost function $C$. Recall the cost function of RankNet: as long as $s_i$ and $s_j$ are consistent with $S_{ij}$, $C$ will be low regardless of the absolute positions at which $U_i$ and $U_j$ appear. The fortunate part is that we do not need to know $C$ when we train the model: all we need to know is $\lambda_{ij}$, and we can design our own $\lambda_{ij}$ for those not-well-formed cost functions. Thus, the core idea of LambdaRank is to define $\lambda_{ij}$ directly as:

$$\lambda_{ij} = \frac{-\sigma}{1 + e^{\sigma(s_i - s_j)}}\,\big|\Delta Z_{ij}\big|,$$
where $|\Delta Z_{ij}|$ is the change, after swapping $U_i$ and $U_j$, of whatever measure $Z$ you apply (e.g., NDCG or ERR). This swap takes place in the list of all documents sorted by their current calculated scores, and all documents' positions are kept fixed except for the two documents being swapped.
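As a sketch of what $|\Delta Z_{ij}|$ looks like when $Z$ is NDCG (my own illustrative code, not from [7]; `ranks` are 1-based positions in the score-sorted list and `max_dcg` is the ideal DCG used for normalization):

```python
import numpy as np

def dcg_gain(label, rank):
    """One document's contribution to DCG at 1-based position `rank`."""
    return (2.0 ** label - 1.0) / np.log2(rank + 1.0)

def delta_ndcg(labels, ranks, i, j, max_dcg):
    """|change in NDCG| if documents i and j swap positions, everything else fixed."""
    before = dcg_gain(labels[i], ranks[i]) + dcg_gain(labels[j], ranks[j])
    after = dcg_gain(labels[i], ranks[j]) + dcg_gain(labels[j], ranks[i])
    return abs(after - before) / max_dcg

def lambdarank_lambda(s_i, s_j, delta_z, sigma=1.0):
    """LambdaRank's hand-designed lambda_ij for a pair with S_ij = 1."""
    return -sigma / (1.0 + np.exp(sigma * (s_i - s_j))) * delta_z

labels = np.array([2, 0, 1])            # graded relevance labels
scores = np.array([0.3, 0.9, 0.5])      # current model scores
order = np.argsort(-scores)             # sort documents by score, descending
ranks = np.empty_like(order)
ranks[order] = np.arange(1, len(order) + 1)
ideal = sum(dcg_gain(l, r) for r, l in enumerate(sorted(labels, reverse=True), 1))
print(delta_ndcg(labels, ranks, 0, 1, ideal))   # swapping docs 0 and 1 changes NDCG a lot
```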
Now, if the model we are training contains parameters that form a differentiable score function, then we can easily apply the update rule on those parameters as we discussed in <2. RankNet>:

$$w_k \leftarrow w_k - \eta \sum_i \lambda_i \frac{\partial s_i}{\partial w_k}$$
However, if we are going to train MART, then we need to train it according to its iterative process, which only requires knowing the first and second derivatives with respect to each score, i.e., $\lambda_i$ and $\frac{\partial \lambda_i}{\partial s_i}$. This finally arrives at the LambdaMART algorithm:
Note that in LambdaMART we usually define $\lambda_i$ in such a way that fitting it is equivalent to maximizing some utility function $Z$. Therefore, the Newton step used to calculate each leaf value $\gamma_{kn}$ is actually doing gradient ascent. Recall from the <1. MART> section that the objective each MART tree optimizes is:

$$\sum_i \Big[ L\big(y_i, F_{n-1}(x_i)\big) + g_i f_n(x_i) + \frac{1}{2} h_i f_n(x_i)^2 \Big]$$
In the LambdaMART algorithm, $\lambda_i$ plays the role of $g_i$ and $\frac{\partial \lambda_i}{\partial s_i}$ plays the role of $h_i$ (with the sign flipped because we are maximizing a utility instead of minimizing a loss).
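Putting the pieces together, here is a rough sketch of a single LambdaMART boosting round, assuming per-document arrays `g` and `h` already computed from the $\lambda$s (helper names and the scikit-learn tree are my choices; production implementations such as RankLib, XGBoost, or LightGBM differ in many details):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def lambdamart_round(X, scores, g, h, max_depth=4, lr=0.1):
    """One boosting round driven only by first/second derivatives (g, h).

    The tree's regions are fit to the negative gradient; each leaf value is
    then recomputed with the Newton step gamma = -sum(g) / sum(h)."""
    tree = DecisionTreeRegressor(max_depth=max_depth)
    tree.fit(X, -g)                         # fit regions R_k to the negative gradient
    leaf_of = tree.apply(X)                 # leaf id for every document
    gammas = {}
    for leaf in np.unique(leaf_of):
        idx = leaf_of == leaf
        gammas[leaf] = -g[idx].sum() / (h[idx].sum() + 1e-12)   # Newton step per leaf
    update = np.array([gammas[l] for l in leaf_of])
    return scores + lr * update, (tree, gammas)
```

To score new documents, you would route them through `tree.apply`, look up the stored `gammas`, and accumulate the updates over all boosting rounds.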
Reference
[1] http://www.cnblogs.com/wowarsenal/p/3900359.html
[2] https://www.quora.com/search?q=lambdamart
[3] https://liam0205.me/2016/07/10/a-not-so-simple-introduction-to-lambdamart/
[5] https://www.slideshare.net/Julian.Qian/learning-to-rank-an-introduction-to-lambdamart
[6] https://en.wikipedia.org/wiki/Newton%27s_method
[8] https://www.youtube.com/watch?v=IXZKgIsZRm0
[9] https://www.slideshare.net/Julian.Qian/learning-to-rank-an-introduction-to-lambdamart
[10] http://xgboost.readthedocs.io/en/latest/model.html
[11] https://stats.stackexchange.com/questions/157870/scikit-binomial-deviance-loss-function
[13] http://dasonmo.cn/2019/02/08/from-ranknet-to-lambda-mart/
Update 2020/02/22:
This reference contains a worked example of computing the $\lambda$s using the NDCG metric:
[14] http://dasonmo.cn/2019/02/08/from-ranknet-to-lambda-mart/