Counterfactual Policy Evaluation

Evaluating trained RL policies offline is extremely important in real-world production: a trained policy with unexpected behaviors or unsuccessful learning would cause the system regress online therefore what safe to do is to evaluate their performance on the offline training data, based on which we decide whether to deploy. Evaluating policies offline is an ongoing research …

Resources about Attention is all you need

There are several online posts [1][2] that illustrate the idea of Transformer, the model introduced in the paper “attention is all you need” [4]. Based on [1] and [2], I am sharing a short tutorial for implementing Transformer [3]. In this tutorial, the task is “copy-paste”, i.e., to let a Transformer learn to output the …

Implementation notes for world model

I’ve been recently implementing world model [1], which seems a promising algorithm to effectively learn controls after learning environments first. Here I share some implementation notes. Loss of Gaussian Mixture Model The memory model of world model is a Mixture-Density-Network Recurrent Neural Network (MDN-RNN). It takes current state and action as inputs, and outputs the …