Reinforcement learning (RL) is an area of machine learning concerned with optimizing a notion of cumulative reward. Although it has been applied in video game AI, robotics, and control optimization for years, we have seen less of its presence in web products. In this post, I am going to introduce some works that apply RL in web products. More generally, this post will show how web products can benefit from a sequential-decision-making view. Indeed, the order of news read, songs listened to, or videos watched all affects the user experience online. Thus, how to model and optimize sequential user experience becomes more and more crucial.
RL / Monte Carlo Tree Search on Song List Recommendation
In music services, a song list is a sequence of songs curated for a listener to play in order. This is a case in point where sequential decision making may make a difference. This line of work on song-list recommendation consists of two papers [3, 4]. [3] details how to extract features of single songs and song transitions, and how to learn a parameterized linear reward function. Here, the observed reward is a human listener’s like/dislike, fitted as a linear combination of the reward function’s weights and the song and transition features. After learning the reward function, [3] uses heuristic-based planning to recommend song lists. [3] shows in simulated and real-world evaluations that its recommended song lists result in higher long-term rewards for listeners (either synthetic or real humans).
What I don’t understand well in [3] is Algorithm 3, in which the reward function weights are updated according to observed rewards. Note that this is different from function-approximation-based TD learning, which I know better, where weights are updated to estimate state or state-action values. Algorithm 3 looks like inverse reinforcement learning, but the authors do not frame it that way. It may just be a simple linear regression fitting procedure based on gradient descent.
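To make my guess concrete, here is a minimal sketch of what a gradient-descent fit of the linear reward weights could look like. This is my interpretation rather than Algorithm 3 itself; the learning rate, feature dimension, and the synthetic "listener" in the usage example are all made up.

```python
import numpy as np

def update_reward_weights(weights, features, observed_reward, lr=0.01):
    """One gradient-descent step on the squared error between the linear
    reward estimate and an observed like/dislike signal.

    weights:         current reward-function weights, shape (d,)
    features:        concatenated song + transition features, shape (d,)
    observed_reward: scalar feedback from the listener
    """
    predicted = weights @ features        # linear reward estimate
    error = predicted - observed_reward   # residual
    grad = error * features               # gradient of 0.5 * error^2
    return weights - lr * grad

# toy usage: fit weights online from a stream of (features, reward) pairs
rng = np.random.default_rng(0)
w = np.zeros(8)
for _ in range(1000):
    x = rng.normal(size=8)
    r = x @ (np.arange(8) * 0.1) + rng.normal(scale=0.05)  # synthetic listener
    w = update_reward_weights(w, x, r)
```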
[4] improves the planning with a more advanced search algorithm, Monte Carlo Tree Search (MCTS). This planning takes place after the reward function is fully known, i.e., the weights have already been learned. The goal of planning is to generate a song list achieving the highest accumulated reward. [4] uses a specific version of MCTS called $latex MaxMCTS(\lambda)$, which employs a more sophisticated Q-value backpropagation strategy. Using MCTS is a natural choice here, since one cannot exhaust all possible song lists in a limited time.
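To illustrate why search is feasible even though enumeration is not, here is a minimal vanilla UCT sketch for picking the next song under an already-learned reward function. It uses a plain average-return backup rather than the $latex MaxMCTS(\lambda)$ backup from [4], and the candidate pool, horizon, and depth cap are arbitrary choices of mine.

```python
import math, random

def plan_next_song(candidates, reward_fn, horizon=10, n_sims=2000, c=1.4):
    """Simplified UCT: reward_fn(history, song) -> float is the learned reward
    for appending `song` after `history`.  Returns the best next song."""
    stats = {}  # (tuple(history), song) -> (visit_count, total_return)

    def rollout(history):
        # random playout to the horizon, accumulating reward
        total, hist = 0.0, list(history)
        while len(hist) < horizon:
            song = random.choice([s for s in candidates if s not in hist])
            total += reward_fn(hist, song)
            hist.append(song)
        return total

    def select(history):
        # UCB1 over songs not yet in the playlist
        legal = [s for s in candidates if s not in history]
        untried = [s for s in legal if (tuple(history), s) not in stats]
        if untried:
            return random.choice(untried)
        n_total = sum(stats[(tuple(history), s)][0] for s in legal)
        def ucb(s):
            n, total = stats[(tuple(history), s)]
            return total / n + c * math.sqrt(math.log(n_total) / n)
        return max(legal, key=ucb)

    for _ in range(n_sims):
        history, path, ret = [], [], 0.0
        # descend a shallow tree (depth capped at 3), then finish with a rollout
        while len(history) < horizon and len(path) < 3:
            song = select(history)
            ret += reward_fn(history, song)
            path.append((tuple(history), song))
            history.append(song)
        ret += rollout(history)
        for key in path:  # back up the simulated return
            n, total = stats.get(key, (0, 0.0))
            stats[key] = (n + 1, total + ret)

    root = [s for s in candidates if ((), s) in stats]
    return max(root, key=lambda s: stats[((), s)][1] / stats[((), s)][0])

# toy usage: 20 synthetic songs, reward favors smooth "tempo" transitions
tempos = {f"song_{i}": random.random() for i in range(20)}
def toy_reward(history, song):
    prev = tempos[history[-1]] if history else tempos[song]
    return 1.0 - abs(tempos[song] - prev)
print(plan_next_song(list(tempos), toy_reward, horizon=8, n_sims=500))
```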
Q-Learning on Recommendation
Applying Q-learning to recommendation is sometimes straightforward. You define user features as states, things to recommend as actions, and metrics you want to improve as rewards. [8] is one example in this vein, where actions are news articles and rewards are clicks plus user activeness. It uses Dueling Bandit Gradient Descent for exploration so that exploration of news articles focuses more on relevant ones. It also divides the Q-value function into a value function $latex V(s)$ and an advantage function $latex A(s,a)$, much like dueling network architectures.
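Below is a minimal PyTorch sketch of that dueling decomposition, $latex Q(s,a) = V(s) + A(s,a) - \frac{1}{|A|}\sum_{a'} A(s,a')$. The layer sizes and feature dimensions are placeholders I picked, not the architecture from [8].

```python
import torch
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    """Q(s, a) = V(s) + A(s, a) - mean_a A(s, a), as in dueling DQN."""
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)              # V(s)
        self.advantage = nn.Linear(hidden, n_actions)  # A(s, a)

    def forward(self, state):
        h = self.shared(state)
        v = self.value(h)                               # (batch, 1)
        a = self.advantage(h)                           # (batch, n_actions)
        return v + a - a.mean(dim=1, keepdim=True)      # (batch, n_actions)

# toy usage: score a batch of user-state vectors against 100 candidate articles
q_net = DuelingQNetwork(state_dim=32, n_actions=100)
q_values = q_net(torch.randn(4, 32))   # shape (4, 100)
```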
Multi-Agent RL on Recommendation
[11] works on a holistic way to do recommendation across the multiple stages of a user’s interaction with a website. For example, the website needs to recommend diverse items on the entrance page. Once the user clicks on an item of interest, they enter an item detail page and the website needs to recommend more related items. And after the user decides to buy some item, the website needs to recommend more items post-purchase to incentivize future purchases. Traditionally, each stage has one dedicated model that optimizes a stage-wise reward. But you can utilize information across stages and make a decision at each stage such that it benefits all future stages. That is the motivation behind [11], but to be honest I am not sure how much room for improvement there is in a real application, given how much complexity is introduced.
Here are some technical details about [11]. Its high-level model is DDPG, an actor-critic architecture. The actor is a policy network that outputs the action based on the state. The state is a low-dimensional, dense vector that embeds user history using a memory-based network like an RNN. The critic is a value network that outputs state-action values, in which the state is embedded in the same way as in the actor. Note that there are multiple actors, one per stage of recommendation, but there is only one critic, which outputs a global state-action value. Training the critic requires defining a target state-action value, usually of the form $latex y=R+\gamma Q(s’,\pi(s’))$. The authors argue that fitting this target would need a huge sample size, and that decomposing $latex y$ into several parts may make the fitting easier. (Personally, I am not sure if that is true.) They propose to decompose $latex y$ into the expectation of state-action values over the probability distribution of each action, where the probability of each action can be estimated by a supervised learning model.
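The following is a minimal sketch of how I read that decomposition, shown next to the usual DDPG target for contrast. The function names and array shapes are placeholders, not the paper's notation.

```python
import numpy as np

def decomposed_td_target(reward, next_state_q, next_action_probs, gamma=0.99):
    """y = r + gamma * sum_a p(a | s') * Q(s', a)

    next_state_q:      Q(s', a) for each candidate action, shape (n_actions,)
    next_action_probs: p(a | s') from a supervised model, shape (n_actions,)
    """
    return reward + gamma * np.dot(next_action_probs, next_state_q)

def ddpg_td_target(reward, q_next_at_pi, gamma=0.99):
    """The usual DDPG target y = r + gamma * Q(s', pi(s')), which bootstraps
    through the single action chosen by the actor."""
    return reward + gamma * q_next_at_pi
```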
Ads
There are two classic works on applying RL to ads, bidding in particular. The state space is real-time auction/campaign information, while the action is the next bid price to set. [14] was published earlier; it defines the full MDP (transition and reward) so that the optimal policy can be learned by dynamic programming. [13] uses DQN to learn the Q-function, where the action space is a parameter controlling which bidding model to use, rather than the price to bid directly. I really admire the care put into designing the MDP in [13]. For example, in Figure 2 of [13], they found that bidding patterns are similar day by day, suggesting that one episode should be one day, with <state, action> pairs collected per hour.
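As an illustration of that episode design, here is a small sketch that groups logged records into one-day episodes with hourly steps. The record field names (timestamp, campaign_state, bid_param, hourly_reward) are placeholders I made up, not the schema from [13].

```python
from collections import defaultdict

def build_bidding_episodes(log_records):
    """Group logged auction records into one episode per day, one step per
    hour.  Returns a list of episodes, each a list of (state, action, reward)
    tuples ordered by hour."""
    episodes = defaultdict(dict)
    for rec in log_records:
        day, hour = rec["timestamp"].date(), rec["timestamp"].hour
        episodes[day][hour] = (rec["campaign_state"], rec["bid_param"],
                               rec["hourly_reward"])
    return [[ep[h] for h in sorted(ep)] for _, ep in sorted(episodes.items())]
```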
Ranking
There has been increasing interest in using reinforcement learning for list-wise ranking. The idea is that traditional ranking usually acts like a greedy policy and does not consider the interaction between ranked items. For example, if a user likes soccer, all soccer-related posts are predicted with a high click-through rate and thus placed higher, even though the user could get fatigued by seeing posts of the same category consecutively. Showing posts of interleaving categories, such as soccer and entertainment, could foster a healthier cadence of reading and make the user more satisfied.
[7] proposes an algorithm for the following setting: a user is shown a page of k items and provides feedback by clicking on one or none of them, then the system recommends a new page of k items. The first part of the paper proposes to use a Generative Adversarial Network (GAN) to train a user model that learns which item a user would click given a page and what their satisfaction would be upon clicking. In my opinion, this part could be done with other standard ML models. The user model is then used as a simulator to provide data for RL training. In the second part, [7] proposes cascading DQN (CDQN). Suppose the total number of items is $latex \mathcal{I}$. There are $latex P(\mathcal{I}, k)$ (permutation) ways to form a page of k items (the order of the k items within a page also matters), yet CDQN only needs $latex \mathcal{O}(k\mathcal{I})$ argmax calls. My understanding of CDQN is that it creates $latex k$ intermediate sub-states between $latex s_t$, the state when the user is about to receive a page, and $latex s_{t+1}$, the state when the user is about to receive the next page, and learns Q-values at each of the $latex k$ sub-states under the assumption that the optimal Q-value at each sub-state equals the optimal Q-value at the last sub-state. I think the $latex \mathcal{O}(k\mathcal{I})$ complexity is still too expensive for real products, and I would like to see an $latex \mathcal{O}(k)$ or $latex \mathcal{O}(1)$ algorithm.
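Here is my rough reading of why the cost stays at $latex \mathcal{O}(k\mathcal{I})$, written as code. The q_fn interface and the greedy position-by-position construction are my own simplification, not the paper's exact formulation.

```python
def build_page_cascade(user_state, candidate_items, q_fn, k):
    """Fill the k positions of a page one at a time.

    q_fn(user_state, chosen_so_far, item) scores placing `item` in the next
    position given the items already placed.  Each position scans the
    remaining candidates once, so the total cost is O(k * |candidates|)
    Q-evaluations instead of enumerating all ordered k-item pages.
    """
    chosen, remaining = [], list(candidate_items)
    for _ in range(k):
        best = max(remaining, key=lambda item: q_fn(user_state, chosen, item))
        chosen.append(best)
        remaining.remove(best)
    return chosen
```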
Another work, SlateQ [9], operates in the same setting but uses a different way to learn Q-values and construct slates. The paper makes two reasonable assumptions: (1) single choice (SC): a user consumes one or zero items from each slate; (2) the reward and transition functions only depend on the consumed item. With these two assumptions, $latex Q^*(s, A) = \sum_{i\in A} P(i|s,A)Q^*(s,i)$, meaning that the optimal Q-value of a state $latex s$ and a slate $latex A$ equals the expected optimal “decomposed” Q-value of $latex s$ and an item $latex i$, where the expectation is taken over the user choice probability. $latex Q^*(s,i)$ can be learned through schemes very similar to normal SARSA and Q-learning (see Eqn. (19) and (20)), and $latex P(i|s,A)$ is usually available from the CTR prediction models most commercial recommendation systems already have. Using decomposed Q-values simplifies the learning a lot, but one thing remains to be resolved: in Q-learning, you need to compute the max over all possible next slates; in the policy improvement phase of SARSA, you need to come up with a better slate. The brute-force method is to enumerate all possible slates, which is a combinatorial problem. But since we know $latex P(i|s,A)$ and $latex Q^*(s,i)$, the authors propose linear programming for exact optimization, or greedy/top-k heuristics for approximate optimization. SlateQ has been tested on YouTube and showed a 1% improvement in engagement.
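Here is a small sketch of the decomposition and a top-k-style slate builder, assuming a conditional-logit user choice model with per-item scores v(s, i). The ranking-by-v*Q heuristic is my paraphrase of the approximate optimizers discussed in [9]; all names and the random scores are placeholders.

```python
import numpy as np

def slate_q_value(item_scores, item_q_values, null_score=1.0):
    """Decomposed slate value Q(s, A) = sum_i P(i | s, A) * Q(s, i), with a
    conditional-logit choice model P(i | s, A) = v(s, i) / (v_null + sum_j v(s, j))."""
    denom = null_score + item_scores.sum()
    return np.dot(item_scores / denom, item_q_values)

def top_k_slate(item_scores, item_q_values, k):
    """Approximate slate optimization: rank items by v(s, i) * Q(s, i) and
    take the top k."""
    ranking = np.argsort(-(item_scores * item_q_values))
    return ranking[:k]

# toy usage with random scores for 50 candidate items
rng = np.random.default_rng(0)
v, q = rng.random(50), rng.random(50)
slate = top_k_slate(v, q, k=5)
value = slate_q_value(v[slate], q[slate])
```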
Rather than using value-based models, [10] uses a policy-gradient method for slate recommendation. Its application context is very similar to CDQN and SlateQ, where you need to recommend k items in a slate each time. The question is how to model the policy probability $latex \alpha(a|s)$, the probability that the observed clicked video $latex a$ appears in the slate given user state $latex s$. In Section 4.3, the authors assume that a slate is generated by sampling $latex K$ times from the softmax distribution over video values ($latex \pi(a|s)$ from Eqn. 5). Under this assumption, $latex \alpha(a|s)=1-(1-\pi(a|s))^K$. A policy-gradient method would in principle require the probability $latex \Pi(A|s)$ of the whole slate $latex A$, but since the probability of generating any slate containing the observed clicked video $latex a$ differs from $latex \Pi(A|s)$ only by a scaling constant, we can just use $latex \alpha(a|s)$ in the policy gradient formula, Eqn. (6). They use REINFORCE to learn the policy. Since REINFORCE is an on-policy algorithm but their training data is off-policy, they introduce importance sampling into the REINFORCE update. Importance sampling itself brings more technical considerations: how to estimate the behavior policy’s propensities (Section 4.2), and how to reduce the variance of importance sampling (Section 4.4). They also show that Boltzmann exploration helps make exploration wiser.
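Here is a minimal sketch of the per-example weight as I read it: the standard importance ratio $latex \pi/\beta$ multiplied by the top-K correction $latex \partial\alpha/\partial\pi = K(1-\pi)^{K-1}$. The weight capping and other variance-reduction tricks from Section 4.4 are omitted, and the function interface is my own.

```python
def top_k_reinforce_contribution(pi_a, beta_a, K, reward, grad_log_pi):
    """One-sample REINFORCE contribution with off-policy and top-K correction.

    pi_a:        pi(a|s), target policy's softmax probability of the clicked item
    beta_a:      beta(a|s), logged behavior policy's propensity
    K:           slate size
    grad_log_pi: gradient of log pi(a|s) w.r.t. the policy parameters

    With alpha(a|s) = 1 - (1 - pi(a|s))^K, the top-K correction multiplies the
    standard importance weight by d alpha / d pi = K * (1 - pi)^(K-1).
    """
    importance_weight = pi_a / beta_a
    top_k_multiplier = K * (1.0 - pi_a) ** (K - 1)
    return importance_weight * top_k_multiplier * reward * grad_log_pi
```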
Simulator
An appealing idea is to train a simulator on real data and then learn an RL policy purely by interacting with the simulator. If the simulator truly reflects the characteristics of the real data, RL models can be trained on virtual data alone. [12] shows one example of this idea. It models two MDPs in a customer-search environment: the search engine makes sequential decisions following one MDP, and the customer follows another. The paper first uses a Generative Adversarial Network and multi-agent adversarial imitation learning to learn a reliable simulator of customer features and the customers’ inherent policy. Then, it uses TRPO to learn the search engine’s policy on simulated data, and claims that the learned policy resulted in real-world improvement.
References
[1] Personalized Ad Recommendation Systems for Life-Time Value Optimization with Guarantees
[2] A Contextual-Bandit Approach to Personalized News Article Recommendation
[3] DJ-MC: A Reinforcement-Learning Agent for Music Playlist Recommendation
[4] Designing Better Playlists with Monte Carlo Tree Search
[5] http://info.ividence.com/deep-reinforcement-learning-advertising/
[6] https://www.youtube.com/watch?v=XiN9Hx3Y6TA
[7] Generative Adversarial User Model for Reinforcement Learning Based Recommendation System: https://arxiv.org/abs/1812.10613
[8] DRN: A Deep Reinforcement Learning Framework for News Recommendation
[9] SlateQ: A Tractable Decomposition for Reinforcement Learning with Recommendation Sets
[10] Top-K Off-Policy Correction for a REINFORCE Recommender System
[11] Model-Based Reinforcement Learning for Whole-Chain Recommendations
[12] Virtual-Taobao: Virtualizing Real-world Online Retail Environment for Reinforcement Learning
[13] Deep Reinforcement Learning for Sponsored Search Real-time Bidding
[14] Real-Time Bidding by Reinforcement Learning in Display Advertising