Counterfactual Policy Evaluation

Evaluating trained RL policies offline is extremely important in real-world production: a trained policy with unexpected behaviors or unsuccessful learning would cause the system to regress online. The safe thing to do is therefore to evaluate its performance on the offline training data, and based on that decide whether to deploy. Evaluating policies offline is an ongoing research …
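As a minimal sketch of the kind of offline evaluation meant here, inverse propensity scoring (IPS) is one common counterfactual estimator for one-step (bandit-style) logged data; the function and variable names below are illustrative assumptions, not taken from the post.

```python
import numpy as np

def ips_estimate(logged_rewards, logged_action_probs, new_action_probs):
    """Inverse propensity scoring (IPS) estimate of a new policy's value.

    logged_rewards:      rewards observed under the logging (behavior) policy
    logged_action_probs: probability the logging policy assigned to each logged action
    new_action_probs:    probability the new (target) policy assigns to the same actions
    """
    weights = new_action_probs / logged_action_probs  # importance weights
    return float(np.mean(weights * logged_rewards))

# Toy usage: three logged interactions.
rewards = np.array([1.0, 0.0, 1.0])
behavior_probs = np.array([0.5, 0.25, 0.5])
target_probs = np.array([0.8, 0.1, 0.6])
print(ips_estimate(rewards, behavior_probs, target_probs))
```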

Implementation notes for world model

I’ve recently been implementing the world model [1], which seems a promising algorithm for effectively learning controls after first learning the environment. Here I share some implementation notes. Loss of the Gaussian Mixture Model: the memory model of the world model is a Mixture-Density-Network Recurrent Neural Network (MDN-RNN). It takes the current state and action as inputs, and outputs the …
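The excerpt stops before the loss itself, but as a rough sketch, an MDN head is typically trained with the negative log-likelihood of a Gaussian mixture; the shapes and function name below are illustrative assumptions, not the post's actual code.

```python
import numpy as np

def gmm_nll(y, log_pi, mu, log_sigma):
    """Negative log-likelihood of scalar targets y under a Gaussian mixture.

    y:         (batch,)    target values
    log_pi:    (batch, K)  log mixture weights (log-softmax output)
    mu:        (batch, K)  component means
    log_sigma: (batch, K)  component log standard deviations
    """
    sigma = np.exp(log_sigma)
    # log N(y | mu_k, sigma_k) for each component k
    log_prob = (-0.5 * ((y[:, None] - mu) / sigma) ** 2
                - log_sigma - 0.5 * np.log(2 * np.pi))
    # log sum_k pi_k N(y | mu_k, sigma_k), computed stably via log-sum-exp
    joint = log_pi + log_prob
    m = joint.max(axis=1, keepdims=True)
    log_mix = m.squeeze(1) + np.log(np.exp(joint - m).sum(axis=1))
    return float(-log_mix.mean())

# Toy usage: batch of 2 targets, mixture of 3 components.
y = np.array([0.1, -0.3])
log_pi = np.log(np.full((2, 3), 1.0 / 3))
mu = np.zeros((2, 3))
log_sigma = np.zeros((2, 3))
print(gmm_nll(y, log_pi, mu, log_sigma))
```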

DPG and DDPG

In this post, I am sharing my understanding of the Deterministic Policy Gradient algorithm (DPG) [1] and its deep-learning version (DDPG) [2]. We introduced the policy gradient theorem in [3, 4]; here we briefly recap it. The objective function of policy gradient methods is: $latex J(\theta)=\sum\limits_{s \in S} d^\pi(s) V^\pi(s)=\sum\limits_{s \in S} d^\pi(s) \sum\limits_{a \in A} \pi(a|s) Q^\pi(s,a),$ where …
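As a standard reference point (not copied from the post), the policy gradient theorem expresses the gradient of this objective, and DPG replaces the stochastic policy $latex \pi$ with a deterministic policy $latex \mu_\theta$, up to formulation-dependent constants in the state distribution:

$latex \nabla_\theta J(\theta) \propto \mathbb{E}_{s \sim d^\pi,\, a \sim \pi}\left[\nabla_\theta \log \pi(a|s)\, Q^\pi(s,a)\right]$

$latex \nabla_\theta J(\theta) = \mathbb{E}_{s \sim d^{\mu}}\left[\nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu}(s,a)\big|_{a=\mu_\theta(s)}\right]$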

LSTM + DQN

Sequential decision problems can usually be formulated as Markov Decision Processes (MDPs), where you define states, actions, rewards, and transitions. In some practical problems, states can simply be described by action histories. For example, we’d like to decide notification delivery sequences for a group of similar users to maximize their accumulated clicks. We define two …
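Since the excerpt ends before the model itself, here is a minimal PyTorch sketch of the general idea of combining an LSTM with a DQN: an LSTM encodes the action history into a state representation, and a Q-head maps that representation to Q-values per action. The class name, layer sizes, and interface are illustrative assumptions, not the post's actual code.

```python
import torch
import torch.nn as nn

class LSTMDQN(nn.Module):
    """Q-network whose state is an encoding of the action history."""

    def __init__(self, num_actions, embed_dim=16, hidden_dim=32):
        super().__init__()
        self.action_embed = nn.Embedding(num_actions, embed_dim)      # embed discrete actions
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)  # encode the history
        self.q_head = nn.Linear(hidden_dim, num_actions)              # Q-value per action

    def forward(self, action_history):
        # action_history: (batch, seq_len) tensor of past action ids
        embedded = self.action_embed(action_history)
        _, (h_n, _) = self.lstm(embedded)
        return self.q_head(h_n[-1])  # (batch, num_actions)

# Toy usage: 2 candidate notification types, batch of 3 histories of length 5.
net = LSTMDQN(num_actions=2)
histories = torch.randint(0, 2, (3, 5))
q_values = net(histories)
print(q_values.shape)  # torch.Size([3, 2])
```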

DQN + Double Q-Learning + OpenAI Gym

Here I am providing a script to quickly experiment with the OpenAI Gym environment: https://github.com/czxttkl/Tutorials/tree/master/experiments/lunarlander. The script has the features of both Deep Q-Learning and Double Q-Learning. I ran the script to benchmark one OpenAI Gym environment, LunarLander-v2. The most stable version of the algorithm has the following hyperparameters: no double q-learning (just use one q-network), gamma=0.99, batch size=64, learning …
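As a reference for the Double Q-Learning feature mentioned above, here is a minimal sketch of how the Double DQN target differs from the vanilla DQN target: the online network selects the next action and the target network evaluates it. The function and tensor names are illustrative, not taken from the linked script.

```python
import torch

def dqn_targets(rewards, next_q_target, dones, gamma=0.99):
    """Vanilla DQN target: max over the target network's Q-values."""
    return rewards + gamma * (1 - dones) * next_q_target.max(dim=1).values

def double_dqn_targets(rewards, next_q_online, next_q_target, dones, gamma=0.99):
    """Double DQN target: online net picks the action, target net evaluates it."""
    best_actions = next_q_online.argmax(dim=1, keepdim=True)
    evaluated = next_q_target.gather(1, best_actions).squeeze(1)
    return rewards + gamma * (1 - dones) * evaluated

# Toy usage: batch of 2 transitions, 4 actions.
rewards = torch.tensor([1.0, 0.0])
dones = torch.tensor([0.0, 1.0])
next_q_online = torch.randn(2, 4)
next_q_target = torch.randn(2, 4)
print(double_dqn_targets(rewards, next_q_online, next_q_target, dones))
```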