TRPO, PPO, Graph NN + RL

Writing this post to share my notes on Trust Region Policy Optimization [2], Proximal Policy Optimization [3], and some recent works leveraging graph neural networks on RL problems.  We start from the objective of TRPO. The expected return of a policy is . The return of another policy can be expressed as and a relative …

Notes on “Recommending What Video to Watch Next: A Multitask Ranking System”

Share some thoughts on this paper: Recommending What Video to Watch Next: A Multitask Ranking System [1] The main contribution of this work is two parts: (1) a network architecture that learns on multiple objectives; (2) handles position bias in the same model To first contribution is achieved by “soft” shared layers. So each objective …