Causal Inference in Recommendation Systems

We briefly touched on some concepts of causal inference in [1, 2]. This post introduces more specific works that apply causal inference to recommendation systems. Some of these works require background on the backdoor and frontdoor adjustments, so we will introduce those first.

Backdoor and frontdoor adjustment 

Suppose we have a causal graph like the one below, and we want to know the causal effect of X on Y. This can be translated to computing P(y|do(X=x)).

[Figure: the original causal graph, with treatment X, outcome Y, a mediator Z on the path X → Z → Y, and a confounder U affecting both X and Y]

P(y|do(X=x)) can be seen as the post-intervention distribution on a modified causal graph, in which we intervene on X by setting it to a specific value x. Therefore, the parents of X no longer affect its value, which corresponds to removing all incoming edges to X in the original causal graph:

[Figure: the modified causal graph after intervening on X, with all edges into X removed]

When the backdoor or frontdoor criterion is satisfied, we can apply the backdoor or frontdoor adjustment to express P(y|do(X=x)) using only the observational distribution on the original causal graph:

Backdoor:

    \[P(y|do(X=x)) = \sum\limits_u P(y|x,u)P(u)\]

Frontdoor:

    \[P(y|do(X=x)) = \sum\limits_z P(z|x) \sum\limits_{x'} P(y|x', z)P(x')\]

The difference between the two is that the backdoor adjustment assumes all confounding factors are observed, while the frontdoor adjustment does not (note that u appears in the backdoor formula but not in the frontdoor formula, which instead relies on an observed mediator z).
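
To make the two formulas concrete, here is a minimal numeric sketch on a toy binary graph U → X, X → Z, Z → Y, U → Y, where U satisfies the backdoor criterion and Z the frontdoor criterion. All probability tables are made-up numbers; the point is only that both adjustments, computed purely from the observational joint, recover the true P(y|do(X=x)):

```python
# Toy binary SCM: U -> X, X -> Z, Z -> Y, U -> Y (U: confounder, Z: mediator).
import numpy as np

p_u = np.array([0.6, 0.4])                          # P(u)
p_x_u = np.array([[0.7, 0.3], [0.2, 0.8]])          # P(x|u), indexed [u, x]
p_z_x = np.array([[0.9, 0.1], [0.3, 0.7]])          # P(z|x), indexed [x, z]
p_y_zu = np.array([[[0.8, 0.2], [0.5, 0.5]],        # P(y|z,u), indexed [z, u, y]
                   [[0.4, 0.6], [0.1, 0.9]]])

# Observational joint P(u, x, z, y) implied by the SCM.
joint = np.einsum('u,ux,xz,zuy->uxzy', p_u, p_x_u, p_z_x, p_y_zu)

x0 = 1  # intervene: do(X = 1)

# Ground truth from the SCM: P(y|do(x0)) = sum_{u,z} P(u) P(z|x0) P(y|z,u)
truth = np.einsum('u,z,zuy->y', p_u, p_z_x[x0], p_y_zu)

# Backdoor adjustment (confounder U observed): sum_u P(y|x0, u) P(u)
p_uxy = joint.sum(axis=2)                            # P(u, x, y)
p_y_xu = p_uxy / p_uxy.sum(axis=2, keepdims=True)    # P(y|u, x)
backdoor = np.einsum('uy,u->y', p_y_xu[:, x0, :], p_u)

# Frontdoor adjustment (U unobserved, mediator Z observed):
# sum_z P(z|x0) sum_{x'} P(y|z, x') P(x')
p_x = joint.sum(axis=(0, 2, 3))                      # P(x)
p_xzy = joint.sum(axis=0)                            # P(x, z, y)
p_y_xz = p_xzy / p_xzy.sum(axis=2, keepdims=True)    # P(y|x, z)
p_xz = joint.sum(axis=(0, 3))                        # P(x, z)
p_z_given_x = p_xz / p_xz.sum(axis=1, keepdims=True)
frontdoor = np.einsum('z,xzy,x->y', p_z_given_x[x0], p_y_xz, p_x)

print(truth, backdoor, frontdoor)  # all three distributions over y should match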

I found two places that shed light on how the backdoor and frontdoor adjustment formulas are derived. Let's first look at how the backdoor adjustment formula is derived, following [3]:

    \[\begin{aligned}
    P(Y|do(X=x)) &:= P^*(Y|x) \quad \text{(where } P^* \text{ is the post-intervention distribution on the modified causal graph)} \\
    &= \sum\limits_u P^*(Y|x,u)P^*(u|x) \\
    &= \sum\limits_u P^*(Y|x,u)P^*(u) \quad \text{(because } P^*(U|X)=P^*(U) \text{ in the modified causal graph)} \\
    &= \sum\limits_u P(Y|x,u)P(u) \quad \text{(because the observational and post-intervention distributions have the same causal structure for } Y \text{ and } U\text{)}
    \end{aligned}\]

Note that the backdoor adjustment assumes we can observe the confounder; that's why the summation over u appears in P(y|do(X=x)) = \sum\limits_u P(y|x,u)P(u). The frontdoor adjustment assumes we cannot observe u, so we have to use other observational distributions to decompose P(y|do(X=x)). While [3] also explains how to derive the frontdoor adjustment for P(y|do(X=x)), I feel [4] does a better job of explaining it:

    \[\begin{aligned}
    P(Y|do(X=x)) &= \sum\limits_z P(Y|z, do(X=x))\,P(z|do(X=x)) \\
    &= \sum\limits_z P(Y|z, do(X=x))\,P(z|x) \\
    &\quad \text{(} P(z|do(X=x))=P(z|x) \text{ because there is no confounder between } X \text{ and } Z\text{;} \\
    &\quad \text{however, we cannot change } P(Y|z, do(X=x)) \text{ to } P(Y|z, x) \text{ because there is an unobserved confounder between } X \text{ and } Y \text{)} \\
    &= \sum\limits_z P(Y|do(Z=z), do(X=x))\,P(z|x) \\
    &\quad \text{(a trick: change } P(Y|z, do(X=x)) \text{ to } P(Y|do(Z=z), do(X=x)) \text{)} \\
    &= \sum\limits_z P(Y|do(Z=z))\,P(z|x) \\
    &\quad \text{(} do(X=x)\text{'s effect on } Y \text{ is completely blocked by } do(Z=z)\text{, hence } P(Y|do(Z=z), do(X=x))=P(Y|do(Z=z)) \text{)} \\
    &= \sum\limits_z P(z|x) \left[\sum\limits_{x'}P(Y|z, x')\,P(x')\right] \\
    &\quad \text{(applying the backdoor adjustment between } do(Z=z) \text{ and } Y \text{)}
    \end{aligned}\]

Note that in the last equation, P(Y|z, x') cannot be further simplified to P(Y|z) because of the presence of U which can affect both X and Y [5].

Backdoor adjustment in real-world papers

After introducing the backdoor/frontdoor adjustments, let's examine how they are used in practice. I came across two papers [9, 10], both of which apply the backdoor adjustment.

[9] tries to account explicitly for popularity bias, which many other models fail to take into account. In the diagram below, I represents items, U represents users, C represents user interactions with specific items, and Z represents popularity. Popularity is a confounder for items because more popular items tend to get exposed more often, creating a feedback loop that keeps biasing the model. Popularity is also a confounder for user interaction because users tend to have a “herd mentality” and follow the majority in consuming popular items.

To really know the intrinsic value of an item to a user, we have to estimate P(C|do(U,I)). By applying the backdoor adjustment, we have P(C|do(U,I))=\sum\limits_z P(C|U, I, z) P(z). P(C|U, I, z) can be estimated by any deep learning model that takes user-side features, item-side features, and popularity as input. But the paper has a nice trick to decompose P(C|U, I, z) = intrinsic\_value_{\theta}(U,I) \cdot pop\_adjust(z), where \theta denotes the model parameters. They can then use intrinsic\_value_{\theta}(U,I) to rank items by their real intrinsic value.
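
The decomposition can be sketched roughly as below. This is a hedged sketch, not the paper's exact parameterization: the embedding dot-product scorer, the sigmoid, and the learnable exponent on popularity are my own assumptions.

```python
# Sketch of decomposing the click score into an intrinsic user-item match
# and a popularity adjustment; rank at serving time by the intrinsic part only.
import torch
import torch.nn as nn

class DeconfoundedScorer(nn.Module):
    def __init__(self, n_users, n_items, emb_dim=32):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, emb_dim)
        self.item_emb = nn.Embedding(n_items, emb_dim)
        self.gamma = nn.Parameter(torch.tensor(0.1))  # strength of popularity adjustment (assumed form)

    def intrinsic_value(self, u, i):
        # intrinsic_value_theta(U, I): popularity-free matching score
        return torch.sigmoid((self.user_emb(u) * self.item_emb(i)).sum(-1))

    def forward(self, u, i, popularity):
        # training-time score: intrinsic value scaled by a popularity adjustment pop_adjust(z)
        pop_adjust = popularity.clamp_min(1e-6) ** self.gamma
        return self.intrinsic_value(u, i) * pop_adjust

# Training fits forward() against observed interactions (e.g., with a pairwise
# BPR-style loss); serving ranks items by intrinsic_value() alone.
```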

[10] provides another way to apply the backdoor adjustment. It tackles video-length bias, which is commonly seen in modern video recommendation systems. The bias exists because video watch time is often a topline metric that video recommendation systems try to optimize. As a result, long videos tend to be over-exposed, which further amplifies the video-length bias feedback loop. The causal graph the authors model is shown below, where U, V, D, and W represent user, video, duration, and watch time, respectively:

The backdoor adjustment to estimate the real intrinsic value E[W|do(U,V)], with duration D as the confounder, is

    \[E[W|do(U=u,V=v)] = \sum\limits_d E[W|u,v,d]\, P(d).\]

Basically, the formula says that we group videos into different duration buckets. The real intrinsic value h(u, v) is an estimator of the watch-time percentile within a video's duration bucket; h(u, v) can be a single shared model that predicts the watch-time percentile across different buckets.
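
As a rough illustration of the label construction (the column names, bucket count, and pandas-based pipeline are my assumptions, not the paper's code):

```python
# Build D2Q-style training labels: bucket videos by duration, then convert
# watch time into a within-bucket percentile that h(u, v) learns to predict.
import pandas as pd

def make_percentile_labels(df, n_buckets=10):
    # df is assumed to have columns "duration" and "watch_time"
    df = df.copy()
    # equal-frequency duration buckets
    df["duration_bucket"] = pd.qcut(df["duration"], q=n_buckets,
                                    labels=False, duplicates="drop")
    # watch-time percentile within each duration bucket -> regression label for h(u, v)
    df["watch_pct"] = df.groupby("duration_bucket")["watch_time"].rank(pct=True)
    return df
```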

It is interesting to see that, during inference, they still rank based on predicted watch time, not the intrinsic value h(u,v).

A natural question is: why couldn't we just use a deep learning model that takes user and video features, as well as the confounder duration, as input? This goes back to the motivation of the paper [10] itself. Such a model tends to predict high watch time for long videos simply because they are over-exposed (i.e., many of them are placed in top positions, so long videos naturally accumulate more watch time). However, by bucketing durations and changing the labels to percentiles within each bucket, the model's predictions become less biased.

Causal Discovery

The paper [7] introduces a cool way to trim user history in user sequence modeling. The context is that in sequential recommendation problems, the objective function can be formulated as follows:

    \[min -\sum\limits_{k=1}^N\sum\limits_{j=1}^{l_k} log f(\vec{v}_k^j | \mathbf{H}_k^j),\]

where \vec{v}_k^j is user k‘s j-th interacted item in history, represented as a one-hot vector, and \mathbf{H}_k^j is the user's interacted-item history before the j-th item. f(\cdot) can be a softmax function, so that minimizing the objective encourages the model to accurately predict which item the user will most likely interact with next, given the history.
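
For concreteness, here is a minimal sketch of this objective with an assumed GRU encoder (any sequential backbone would do; the sizes are arbitrary):

```python
# Next-item prediction: encode the history H_k^j, score all items with a softmax
# head, and minimize the negative log-likelihood of the actually interacted item.
import torch
import torch.nn as nn

class NextItemModel(nn.Module):
    def __init__(self, n_items, emb_dim=64):
        super().__init__()
        self.item_emb = nn.Embedding(n_items, emb_dim, padding_idx=0)
        self.encoder = nn.GRU(emb_dim, emb_dim, batch_first=True)
        self.out = nn.Linear(emb_dim, n_items)

    def forward(self, history):              # history: (batch, seq_len) item ids
        h, _ = self.encoder(self.item_emb(history))
        return self.out(h[:, -1])             # logits over the next item

model = NextItemModel(n_items=1000)
history = torch.randint(1, 1000, (8, 20))     # toy batch of user histories
next_item = torch.randint(1, 1000, (8,))      # the item to predict
loss = nn.functional.cross_entropy(model(history), next_item)  # -log f(v | H)
```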

However, user history may contain many irrelevant items, which makes prediction hard. [7] proposes using causal discovery to identify causally related items, so that only a purified history is used to predict the next items users may interact with. The sequential model then works with a more focused history and can potentially make more accurate predictions.

Denoting all users’ histories as \mathbf{V}=\{\vec{v}_k^j\}_{k=1\cdots N,j=1\cdots l_k}, the objective becomes:

    \[\begin{aligned}
    & \min\; -\ell(\mathbf{V}, W) + \lambda ||W||_1 \\
    =\; & \min\; -\sum\limits_{k=1}^N\sum\limits_{j=1}^{l_k}\sum\limits_{b=1}^{|\mathcal{V}|} log f\left(\left[\vec{v}_k^j\right]_b \,\Big|\, \mathbf{\tilde{H}}_{kb}^j(W)\right) + \lambda ||W||_1 \\
    & s.t. \quad trace(e^{W \odot W})=\vert\mathcal{V}\vert,
    \end{aligned}\]

where W \in \mathbb{R}^{\vert\mathcal{V}\vert \times \vert\mathcal{V}\vert} is the discovered causal graph between items, and \mathbf{\tilde{H}}_{kb}^j(W) is the purified history for each potential item b at the time of prediction. W is a real-valued matrix that approximates the actual binarized causal DAG, i.e., binarize(W_{ij}) = 1 means that a change in item i causes a change in item j.

Note that the form min - \ell(\mathbf{V}, W) + \lambda ||W||_1 \; s.t. \; trace(e^{W \odot W})=\vert\mathcal{V}\vert is from NOTEARS [8], the work that introduced differentiable causal discovery. Originally in [8], \mathbf{V} \in \{0,1\}^{N\times \vert \mathcal{V} \vert}, which can be thought of as a matrix recording all interacted items for each user without temporal order, and the loss \ell(\mathbf{V}, W)=\sum\limits_{k=1}^N\sum\limits_{b=1}^{\vert\mathcal{V}\vert}\left(\mathbf{V}_k^b - \mathbf{V}_k^{T} W_{\cdot b}\right)^2 is a least-squares error (which [8] minimizes directly, without the negation, since it is an error rather than a likelihood). Intuitively, it means that each observed interacted item should be explained by all other observed interacted items together with the causal graph W. [8] shows that, if B \in \{0, 1\}^{\vert\mathcal{V}\vert \times \vert\mathcal{V}\vert} is the binarized version of W (the actual causal directed acyclic graph), then trace\left( e^{B} \right) = \vert\mathcal{V}\vert. W, as a relaxed version of B, can be constrained in a similar way, but the constraint needs a slight modification: trace(e^{W \odot W})=\vert\mathcal{V}\vert.
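
As a quick numeric illustration of these two ingredients (the toy data, dimensions, and scaling are assumptions, not [8]'s code):

```python
# NOTEARS ingredients: the least-squares reconstruction loss and the
# acyclicity measure trace(e^{W ⊙ W}) - |V|, which is 0 exactly for DAGs.
import numpy as np
from scipy.linalg import expm

def least_squares_loss(V, W):
    # each observed interaction vector should be explained by the others via W
    return ((V - V @ W) ** 2).sum()

def acyclicity(W):
    # 0 iff the weighted graph induced by W ⊙ W is a DAG
    d = W.shape[0]
    return np.trace(expm(W * W)) - d

d = 4
V = np.random.randint(0, 2, size=(100, d)).astype(float)   # toy user-item matrix
W_dag = np.triu(np.random.rand(d, d), k=1)                  # strictly upper triangular => acyclic
W_cyc = W_dag + 0.5 * np.eye(d)                             # self-loops => cyclic

print(acyclicity(W_dag))   # ~0: satisfies trace(e^{W ⊙ W}) = |V|
print(acyclicity(W_cyc))   # > 0: violates the constraint
print(least_squares_loss(V, W_dag))
```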

The novelty of [7] is that it puts the causal discovery framework proposed in [8] in a sequential recommendation context such that temporal information is taken into account and \mathbf{V} is no longer just a matrix of all interacted items for each user. Instead, \mathbf{V}=\{\vec{v}_k^j\}_{k=1\cdots N,j=1\cdots l_k}.

On a side note, the paper [7] introduces a differentiable clustering method as well, which is also pretty cool.

 

References

[1] Causal Inference https://czxttkl.com/2020/10/13/causal-inference/

[2] Tools needed to build an RL debugging tool: https://czxttkl.com/2020/09/08/tools-needed-to-build-an-rl-debugging-tool/

[3] https://stats.stackexchange.com/questions/312992/causal-effect-by-back-door-and-front-door-adjustments/313127#313127

[4] http://www.degeneratestate.org/posts/2018/Sep/03/causal-inference-with-python-part-3-frontdoor-adjustment/

[5] https://stats.stackexchange.com/questions/448004/intuition-behind-conditioning-y-on-x-in-the-front-door-adjustment-formula/

[6] https://stats.stackexchange.com/questions/538477/causal-inference-difference-between-blocking-a-backdoor-path-and-adding-a-vari

[7] Sequential Recommendation with Causal Behavior Discovery: https://arxiv.org/abs/2204.00216

[8] DAGs with NO TEARS Continuous Optimization for Structure Learning: https://arxiv.org/abs/1803.01422

[9] Causal Intervention for Leveraging Popularity Bias in Recommendation: https://arxiv.org/abs/2102.11107

[10] Deconfounding Duration Bias in Watch-time Prediction for Video Recommendation: https://arxiv.org/abs/2206.06003
