Control Variate

In this post, let’s talk about another variance reduction method besides importance sampling [1]: the control variate. We start from a simple example: we want to estimate $\mathbb{E}_{x \sim p(x)}[f(x)]$, the expectation of a function $f$ applied on a random variable $x$ sampled from $p(x)$. Analytically, $\mathbb{E}_{x \sim p(x)}[f(x)] = \int f(x)\, p(x)\, dx$. You can also think of …
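As a quick illustration of the idea, here is a minimal Monte Carlo sketch (my own, not from the post): it estimates $\mathbb{E}[f(x)]$ both naively and with a control variate $g(x)$ whose expectation is known, assuming $f(x)=e^x$ and $g(x)=x$ with $x \sim \text{Uniform}(0,1)$ purely for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.uniform(0.0, 1.0, size=n)    # x ~ Uniform(0, 1), chosen only for illustration

f = np.exp(x)                        # target: E[f(x)] = e - 1 ≈ 1.7183
g = x                                # control variate with known mean E[g(x)] = 0.5

naive = f.mean()

# Near-optimal coefficient c = Cov(f, g) / Var(g), estimated from the same samples.
c = np.cov(f, g)[0, 1] / np.var(g, ddof=1)
cv = (f - c * (g - 0.5)).mean()      # subtract the zero-mean correction term

print(f"naive:           {naive:.5f} (std err {f.std(ddof=1) / np.sqrt(n):.5f})")
print(f"control variate: {cv:.5f}")
```

The control-variate estimate has the same expectation as the naive one but a smaller variance whenever $f$ and $g$ are correlated.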

Personalized Re-ranking

In the industry there is a trend to add a re-ranker at the final stage of a recommendation system. The re-ranker reorders the items that have already been filtered down from an enormous candidate set, aiming to provide the finest level of personalized ordering before the items are ultimately delivered to the user. In this …

Practical considerations of off-policy policy gradient

I’d like to talk more about policy gradient [1], which I touched upon in 2017. In common online tutorials, the policy gradient theorem takes a lot of space to prove that the gradient of the policy in the direction that improves accumulated returns is $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\big[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\big]$, where $G_t$ is the accumulated return beginning from step $t$ from real samples.  Note …
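As a rough sketch of the on-policy estimator above (my own illustration, not the post's code), a REINFORCE-style surrogate loss in PyTorch might look like the following, where `policy` is a hypothetical module mapping states to action logits:

```python
import torch

def reinforce_loss(policy, states, actions, returns):
    """Surrogate loss whose gradient is the policy-gradient estimate
    sum_t grad log pi(a_t|s_t) * G_t (sign flipped so we can minimize)."""
    logits = policy(states)                                        # [T, num_actions]
    log_probs = torch.log_softmax(logits, dim=-1)
    log_pi = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)  # log pi(a_t|s_t)
    return -(log_pi * returns).mean()
```

Off-policy corrections (e.g., importance weighting against a behavior policy) would modify this estimator, which is what the rest of the post discusses.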

Constrained RL / Multi-Objective RL

Learning a policy that can optimize multiple types of rewards or satisfy different constraints is a much desired feature in the industry. In real products, we often care about not just a single metric but several that interplay with each other. For example, we want to derive a policy to recommend news feeds which expects …

Hash table revisited

I came across how Facebook implements hash tables from this post: https://engineering.fb.com/developer-tools/f14/. It introduces several techniques that make modern hash tables more efficient. The first technique is called chunking, which reduces the time for resolving hash collisions. The idea is to map keys to a chunk (a block of slots) rather than a single slot, then …
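To make the chunking idea concrete, here is a toy sketch (my own simplification, not F14's actual implementation, which also uses SIMD filtering over hash fragments): keys hash to a chunk of slots, and a lookup scans within that cache-friendly chunk to resolve collisions.

```python
CHUNK_SIZE = 14   # F14 uses 14-slot chunks; the rest of this class is just for illustration

class ChunkedHashTable:
    def __init__(self, num_chunks=8):
        # Each chunk is a small block of (key, value) slots.
        self.chunks = [[] for _ in range(num_chunks)]

    def _chunk_index(self, key):
        return hash(key) % len(self.chunks)

    def put(self, key, value):
        chunk = self.chunks[self._chunk_index(key)]
        for i, (k, _) in enumerate(chunk):
            if k == key:
                chunk[i] = (key, value)
                return
        if len(chunk) >= CHUNK_SIZE:
            raise RuntimeError("chunk full; a real table would grow and rehash here")
        chunk.append((key, value))

    def get(self, key, default=None):
        # Collisions are resolved by scanning within a single chunk.
        for k, v in self.chunks[self._chunk_index(key)]:
            if k == key:
                return v
        return default
```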

TRPO, PPO, Graph NN + RL

Writing this post to share my notes on Trust Region Policy Optimization [2], Proximal Policy Optimization [3], and some recent works leveraging graph neural networks on RL problems.  We start from the objective of TRPO. The expected return of a policy $\pi$ is $\eta(\pi) = \mathbb{E}_{s_0, a_0, \ldots}\big[\sum_{t=0}^{\infty} \gamma^t r(s_t)\big]$. The return of another policy $\tilde{\pi}$ can be expressed as $\eta(\tilde{\pi}) = \eta(\pi) + \mathbb{E}_{\tilde{\pi}}\big[\sum_{t=0}^{\infty} \gamma^t A_\pi(s_t, a_t)\big]$ and a relative …
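For reference, the surrogate problems the two algorithms end up optimizing in practice, written in the standard forms from [2] and [3] (notation is mine), are:

```latex
% TRPO: maximize the importance-weighted advantage subject to a KL trust region [2]
\max_\theta \; \mathbb{E}_t\!\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)} \hat{A}_t\right]
\quad \text{s.t.} \quad
\mathbb{E}_t\!\left[ D_{\mathrm{KL}}\!\big(\pi_{\theta_{\mathrm{old}}}(\cdot \mid s_t)\,\|\,\pi_\theta(\cdot \mid s_t)\big)\right] \le \delta

% PPO: replace the constraint with a clipped probability ratio [3]
L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\big(r_t(\theta)\hat{A}_t,\;
\mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```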

Notes on “Recommending What Video to Watch Next: A Multitask Ranking System”

Share some thoughts on this paper: Recommending What Video to Watch Next: A Multitask Ranking System [1]. The main contribution of this work has two parts: (1) a network architecture that learns multiple objectives; (2) handling position bias in the same model. The first contribution is achieved by “soft” shared layers. So each objective …
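Here is a minimal sketch of what "soft" sharing can look like, using a mixture-of-experts-style gating in PyTorch; this is my own simplification of the idea, not the paper's exact architecture: each task mixes a pool of shared experts with its own learned gate instead of hard-sharing one bottom network.

```python
import torch
import torch.nn as nn

class SoftSharedMultiTask(nn.Module):
    """Each task mixes shared expert networks with its own gate,
    so layers are shared 'softly' rather than hard-shared."""
    def __init__(self, input_dim, expert_dim, num_experts, num_tasks):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(input_dim, expert_dim), nn.ReLU())
             for _ in range(num_experts)]
        )
        self.gates = nn.ModuleList(
            [nn.Linear(input_dim, num_experts) for _ in range(num_tasks)]
        )
        self.heads = nn.ModuleList(
            [nn.Linear(expert_dim, 1) for _ in range(num_tasks)]
        )

    def forward(self, x):
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)   # [B, E, D]
        outputs = []
        for gate, head in zip(self.gates, self.heads):
            w = torch.softmax(gate(x), dim=-1).unsqueeze(-1)            # [B, E, 1]
            mixed = (w * expert_out).sum(dim=1)                         # [B, D]
            outputs.append(head(mixed))                                 # one logit per task
        return outputs
```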

Convergence of Q-learning and SARSA

Here, I am listing some classic proofs regarding the convergence of Q-learning and SARSA in finite MDPs (by definition, in a finite Markov Decision Process the sets of states, actions, and rewards are finite [1]). The very first Q-learning convergence proof comes from [4]. The proof is based on a very useful theorem: Note that this theorem is general enough to be …
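I'm not reproducing the exact statement from the original post here, but the stochastic-approximation result such proofs usually rely on (due to Jaakkola, Jordan, and Singh) reads roughly as follows:

```latex
% A random iterative process
\Delta_{t+1}(x) = \big(1 - \alpha_t(x)\big)\,\Delta_t(x) + \alpha_t(x)\,F_t(x)
% converges to zero with probability 1 provided that:
% (1) 0 \le \alpha_t(x) \le 1,\; \sum_t \alpha_t(x) = \infty,\; \sum_t \alpha_t^2(x) < \infty;
% (2) \big\| \mathbb{E}[F_t(x) \mid \mathcal{F}_t] \big\|_W \le \gamma \,\|\Delta_t\|_W \text{ for some } \gamma < 1;
% (3) \mathrm{Var}[F_t(x) \mid \mathcal{F}_t] \le C\big(1 + \|\Delta_t\|_W^2\big) \text{ for some } C > 0.
```

Applying it to Q-learning amounts to setting $\Delta_t = Q_t - Q^*$ and showing that the Bellman optimality operator is a $\gamma$-contraction.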

Cross entropy with logits

I keep forgetting the exact formulation of `binary_cross_entropy_with_logits` in PyTorch, so I am writing it down for future reference. The function `binary_cross_entropy_with_logits` takes two kinds of inputs: (1) the value right before the probability transformation (sigmoid) layer, whose range is (-infinity, +infinity); (2) the target, whose values are binary. `binary_cross_entropy_with_logits` calculates the following loss (i.e., negative …
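As a quick sanity check of what the function computes, the loss for a logit $x$ and binary target $y$ is $-\big[y \log \sigma(x) + (1-y)\log(1-\sigma(x))\big]$, which the snippet below verifies against a manual computation (made-up tensors, just for demonstration):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([0.8, -1.2, 2.5])   # raw scores in (-inf, +inf)
targets = torch.tensor([1.0, 0.0, 1.0])   # binary labels

builtin = F.binary_cross_entropy_with_logits(logits, targets)

# Manual computation: sigmoid first, then the negative log-likelihood, averaged.
p = torch.sigmoid(logits)
manual = -(targets * torch.log(p) + (1 - targets) * torch.log(1 - p)).mean()

print(builtin.item(), manual.item())       # the two values should match
```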

Notes for “Defensive Quantization: When Efficiency Meets Robustness”

I have been reading “Defensive Quantization: When Efficiency Meets Robustness” recently. Neural network quantization is a brand-new topic to me, so I am writing some notes down for learning.  The first introduction I read is [1], from which I learned that the term “quantization” generally refers to reducing the memory usage of model weights by …
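To get a feel for the basic mechanism, here is a generic min-max uniform quantization sketch of a weight tensor; this is only an illustration of plain quantization, not the defensive-quantization method from the paper.

```python
import torch

def uniform_quantize(w, num_bits=8):
    """Min-max uniform quantization of a weight tensor (illustration only)."""
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (w.max() - w.min()) / (qmax - qmin)
    zero_point = qmin - w.min() / scale
    q = torch.clamp((w / scale + zero_point).round(), qmin, qmax)   # integer codes
    return (q - zero_point) * scale                                 # dequantized weights

w = torch.randn(4, 4)
print((w - uniform_quantize(w, num_bits=4)).abs().max())            # quantization error
```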