Many ways towards recommendation diversity

Diversity of recommendations keeps users engaged and prevents boredom [1]. In this post, I will introduce several machine learning-based methods to achieve diverse recommendations. The literature in this post is mostly retrieved from the overview paper “Recent Advances in Diversified Recommendation” [6]. 

Determinantal Point Process

Let’s first review what the determinant of a matrix is [2]. The determinant of an n-by-n matrix A can be understood as the (signed) volume of the parallelepiped obtained by transforming the n-dimensional unit cube by A. This parallelepiped is spanned by A‘s column vectors. Let’s look at a 2D example:

(1)   \begin{align*}|A| = \begin{vmatrix} a & b\\c & d \end{vmatrix}=ad - bc \end{align*}


A transforms the unit square on the 2D plane into a parallelogram whose sides are the vectors (a,c) and (b,d) (A‘s first and second columns). According to basic geometry [3], this parallelogram’s area is ad - bc, which is exactly the determinant of A.
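This area interpretation is easy to check numerically; a minimal sketch (the matrix entries here are just example values):

```python
import numpy as np

# Columns of A are the parallelogram's side vectors (a, c) and (b, d).
a, b, c, d = 3.0, 1.0, 1.0, 2.0
A = np.array([[a, b],
              [c, d]])

# The signed area of the parallelogram spanned by the columns equals det(A).
area = a * d - b * c
assert np.isclose(np.linalg.det(A), area)
print(area)  # 5.0
```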

A point process \mathcal{P} on an N-cardinality set S is a probability distribution over the power set of S (which has 2^{N} elements). A determinantal point process is a member of a family of such distributions in which, for any subset s \subseteq S, the probability density has the closed form:

(2)   \begin{align*}P(s \subseteq S)=\frac{det(L_s)}{det(L+I)} \propto det(L_s),\end{align*}


where L is a constructed N\times N matrix indexed by the elements of S, I is an identity matrix of the same size, and L_s is a sub-matrix of L: L_s \equiv [L_{ij}]_{i,j \in s}.

L can be constructed in a way that gives the determinantal point process a beautiful interpretation balancing individual items’ quality against inter-item diversity. First, we rewrite L as a Gram matrix: L=B^T B, where B has size K \times N. Each column of B can be written as B_i=q_i \cdot \phi_i, i=1,\dots,N, which can be thought of as the normalized feature vector of item i (\Vert\phi_i \Vert_2=1) scaled by a quality scalar q_i. Next, since P(s \subseteq S) \propto det(L_s) and L_s = B_s^T B_s (where B_s collects the columns of B indexed by s), det(L_s) equals the squared volume of the parallelepiped spanned by \{B_i\}_{i\in s}. That volume is affected by both q_i and \phi_i: the larger q_i is, the larger the volume; and the larger <\phi_i, \phi_j> is (for any i,j\in s), the more similar the two items are, and consequently the smaller the volume. This is easiest to visualize when |s|=2 or 3.
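The quality/diversity trade-off is easy to see in code. A minimal sketch (the qualities and feature vectors below are made-up examples): two items with orthogonal features span a large parallelogram, while two items with highly similar features span a thin one, shrinking det(L_s) and hence the subset’s probability.

```python
import numpy as np

def dpp_kernel(qualities, features):
    """Build L = B^T B, where column B_i = q_i * phi_i with unit-norm phi_i."""
    phi = features / np.linalg.norm(features, axis=0)  # normalize each column
    B = phi * qualities                                # scale column i by q_i
    return B.T @ B

q = np.array([1.0, 1.0])
orthogonal = np.array([[1.0, 0.0],
                       [0.0, 1.0]])          # <phi_1, phi_2> = 0
similar = np.array([[1.0, 0.9],
                    [0.0, np.sqrt(0.19)]])   # <phi_1, phi_2> = 0.9

print(np.linalg.det(dpp_kernel(q, orthogonal)))  # 1.0  (diverse pair)
print(np.linalg.det(dpp_kernel(q, similar)))     # 0.19 (similar pair)
```

With equal qualities, det(L_s) = 1 - <\phi_1, \phi_2>^2 for a pair, so similarity directly discounts the subset’s probability.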

We denote by s^* the subset of S that achieves the highest P(s \subseteq S). Now we introduce a concept: P(s \subseteq S) is log-submodular (Section 2.2 in [7]). Let f(s)=log P(s \subseteq S). Then f(s) is submodular because f(s \cup i) - f(s) \geq f(s' \cup i) - f(s') for any s \subseteq s' \subseteq S \setminus \{i\}. Intuitively, adding elements to s yields diminishing returns in f(s) as s grows. As [5] reveals (around Eqn. 74), f(s) is also non-monotone: the addition of even a single element can decrease the probability. Submodular maximization corresponds to finding the s^* with the highest P(s \subseteq S) (or log P(s \subseteq S)). Submodular maximization is NP-hard (i.e., every problem in NP can be reduced to it in polynomial time). Interestingly, submodular minimization can in contrast be solved exactly in polynomial time. Therefore, in practice we turn to approximate methods for maximizing P(s \subseteq S); [4] introduces one simple greedy algorithm (Eqn. 15).
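The greedy idea is simply to grow the set one item at a time, always adding the item with the largest marginal gain in log det(L_s). A naive sketch (note that [4] derives a much faster incremental update; this brute-force version recomputes the determinant each step and is only for illustration):

```python
import numpy as np

def greedy_dpp_map(L, k):
    """Naive greedy MAP inference for a DPP: at each step, add the item
    that most increases log det(L_s). Skips items that would make the
    submatrix singular (det = 0)."""
    n = L.shape[0]
    selected = []
    for _ in range(k):
        best, best_logdet = None, -np.inf
        for i in range(n):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_logdet:
                best, best_logdet = i, logdet
        if best is None:
            break
        selected.append(best)
    return selected

# Items 0 and 1 point the same way; item 2 is orthogonal. Greedy picks 0
# (highest quality), then 2 (most diverse), skipping near-duplicate item 1.
B = np.array([[2.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
L = B.T @ B
print(greedy_dpp_map(L, 2))  # [0, 2]
```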

Simple Submodular Diversification

Before DPPs were introduced to solve the diversification problem, Amazon proposed a simpler idea to model diversification [9], which is also a form of submodular maximization.

Here, \mathcal{A}_k is an item candidate set, \mathbf{a}_i is each item’s category (a one-hot vector), s(\mathbf{a}_i) is the score of each item, and \mathbf{w} is a utility vector governing the total utility of the final selected item set. The optimal item set maximizes this utility, and in practice it is found by a greedy method. \mathbf{w} is the only learnable parameter, with the same dimension as the number of categories. If we want to learn a globally optimal \mathbf{w}, we need to estimate each category’s CTR by Bayesian probability. If we want to learn a personalized \mathbf{w}, we need to estimate each user’s CTR on each category.
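The exact objective from [9] is not reproduced above, but the greedy mechanics can be sketched with an illustrative submodular utility (the function name and the "pay the category bonus only once" rule are my simplification, not the paper’s exact formula): each item contributes its score plus a category reward w[c] that exhibits diminishing returns once category c is already covered.

```python
def greedy_diversify(scores, categories, w, k):
    """Illustrative greedy selection: marginal gain = item score plus a
    category bonus w[c], paid only the first time category c appears."""
    selected, covered = [], set()
    for _ in range(k):
        best, best_gain = None, -float("inf")
        for i in range(len(scores)):
            if i in selected:
                continue
            bonus = w[categories[i]] if categories[i] not in covered else 0.0
            gain = scores[i] + bonus
            if gain > best_gain:
                best, best_gain = i, gain
        selected.append(best)
        covered.add(categories[best])
    return selected

# Item 1 scores well but repeats item 0's category; item 2 adds a new one.
print(greedy_diversify([0.9, 0.8, 0.5], [0, 0, 1], [0.1, 0.5], 2))  # [0, 2]
```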

Calibrating recommendations to user history

[8] introduces another idea: calibration. Its tenet is that the recommended list should have a topic distribution similar to the user’s watch history. See Section 3 of [8] for the exact formulas; in short, they use the KL divergence to measure the difference between the recommendation list and the user history.
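A minimal sketch of that metric, following Section 3 of [8] (the smoothing constant alpha and the example distributions are illustrative): p is the topic distribution of the watch history, q of the recommended list, and q is smoothed toward p so the divergence stays finite when a topic is absent from the list.

```python
import numpy as np

def calibration_kl(user_hist, rec_list, alpha=0.01):
    """KL(p || q~), where q~ = (1 - alpha) * q + alpha * p smooths the
    recommendation distribution toward the history distribution."""
    p = np.asarray(user_hist, dtype=float)
    q = np.asarray(rec_list, dtype=float)
    q_tilde = (1 - alpha) * q + alpha * p
    mask = p > 0  # terms with p = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q_tilde[mask])))

p = [0.7, 0.3]                         # e.g., 70% comedy, 30% drama watched
print(calibration_kl(p, [0.7, 0.3]))   # 0.0: perfectly calibrated list
print(calibration_kl(p, [1.0, 0.0]))   # > 0: list over-concentrated on comedy
```

A lower value means the list mirrors the user’s historical topic mix more faithfully; [8] then trades this term off against accuracy when re-ranking.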

References

[1] https://about.instagram.com/blog/engineering/on-the-value-of-diversified-recommendations

[2] https://en.wikipedia.org/wiki/Determinant#Geometric_meaning

[3] https://socratic.org/questions/how-do-you-find-the-area-of-a-parallelogram-with-vertices

[4] Practical Diversified Recommendations on YouTube with Determinantal Point Processes

[5] Determinantal point processes for machine learning

[6] Recent Advances in Diversified Recommendation

[7] Fast Greedy MAP Inference for Determinantal Point Process to Improve Recommendation Diversity

[8] Calibrated Recommendations

[9] Adaptive, Personalized Diversity for Visual Discovery
