Counterfactual Policy Evaluation

Evaluating trained RL policies offline is extremely important in real-world production: a trained policy with unexpected behaviors or unsuccessful learning would cause the system to regress online. The safe thing to do is therefore to evaluate its performance on the offline training data, and based on that decide whether to deploy. Evaluating policies offline is an ongoing research …
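As a minimal sketch of the kind of offline evaluation meant here, inverse propensity scoring (IPS) is one common counterfactual estimator for one-step (bandit-style) logged data; the function and variable names below are illustrative assumptions, not taken from the post.

```python
import numpy as np

def ips_estimate(logged_rewards, logged_action_probs, new_action_probs):
    """Inverse propensity scoring (IPS) estimate of a new policy's value.

    logged_rewards:      rewards observed under the logging (behavior) policy
    logged_action_probs: probability the logging policy assigned to each logged action
    new_action_probs:    probability the new (target) policy assigns to the same actions
    """
    weights = new_action_probs / logged_action_probs  # importance weights
    return float(np.mean(weights * logged_rewards))

# Toy usage: three logged interactions.
rewards = np.array([1.0, 0.0, 1.0])
behavior_probs = np.array([0.5, 0.25, 0.5])
target_probs = np.array([0.8, 0.1, 0.6])
print(ips_estimate(rewards, behavior_probs, target_probs))
```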

Implementation notes for world model

I’ve recently been implementing the world model [1], which seems a promising algorithm for effectively learning controls after first learning the environment. Here I share some implementation notes. Loss of the Gaussian Mixture Model: the memory model of the world model is a Mixture-Density-Network Recurrent Neural Network (MDN-RNN). It takes the current state and action as inputs, and outputs the …
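The excerpt stops before the loss itself, but as a rough sketch, an MDN head is typically trained with the negative log-likelihood of a Gaussian mixture; the shapes and function name below are illustrative assumptions, not the post's actual code.

```python
import numpy as np

def gmm_nll(y, log_pi, mu, log_sigma):
    """Negative log-likelihood of scalar targets y under a Gaussian mixture.

    y:         (batch,)    target values
    log_pi:    (batch, K)  log mixture weights (log-softmax output)
    mu:        (batch, K)  component means
    log_sigma: (batch, K)  component log standard deviations
    """
    sigma = np.exp(log_sigma)
    # log N(y | mu_k, sigma_k) for each component k
    log_prob = (-0.5 * ((y[:, None] - mu) / sigma) ** 2
                - log_sigma - 0.5 * np.log(2 * np.pi))
    # log sum_k pi_k N(y | mu_k, sigma_k), computed stably via log-sum-exp
    joint = log_pi + log_prob
    m = joint.max(axis=1, keepdims=True)
    log_mix = m.squeeze(1) + np.log(np.exp(joint - m).sum(axis=1))
    return float(-log_mix.mean())

# Toy usage: batch of 2 targets, mixture of 3 components.
y = np.array([0.1, -0.3])
log_pi = np.log(np.full((2, 3), 1.0 / 3))
mu = np.zeros((2, 3))
log_sigma = np.zeros((2, 3))
print(gmm_nll(y, log_pi, mu, log_sigma))
```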

DPG and DDPG

In this post, I am sharing my understanding of the Deterministic Policy Gradient algorithm (DPG) [1] and its deep-learning version (DDPG) [2]. We introduced the policy gradient theorem in [3, 4]; here we briefly recap it. The objective function of policy gradient methods is: $latex J(\theta)=\sum\limits_{s \in S} d^\pi(s) V^\pi(s)=\sum\limits_{s \in S} d^\pi(s) \sum\limits_{a \in A} \pi(a|s) Q^\pi(s,a),$ where …
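As a standard reference point (not copied from the post), the policy gradient theorem expresses the gradient of this objective, and DPG replaces the stochastic policy $latex \pi$ with a deterministic policy $latex \mu_\theta$, up to formulation-dependent constants in the state distribution:

$latex \nabla_\theta J(\theta) \propto \mathbb{E}_{s \sim d^\pi,\, a \sim \pi}\left[\nabla_\theta \log \pi(a|s)\, Q^\pi(s,a)\right]$

$latex \nabla_\theta J(\theta) = \mathbb{E}_{s \sim d^{\mu}}\left[\nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu}(s,a)\big|_{a=\mu_\theta(s)}\right]$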

LSTM + DQN

Sequential decision problems can usually be formulated as Markov Decision Processes (MDPs), where you define states, actions, rewards, and transitions. In some practical problems, states can simply be described by action histories. For example, we’d like to decide notification delivery sequences for a group of similar users to maximize their accumulated clicks. We define two …
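Since the excerpt ends before the model itself, here is a minimal PyTorch sketch of the general idea of combining an LSTM with a DQN: an LSTM encodes the action history into a state representation, and a Q-head maps that representation to Q-values per action. The class name, layer sizes, and interface are illustrative assumptions, not the post's actual code.

```python
import torch
import torch.nn as nn

class LSTMDQN(nn.Module):
    """Q-network whose state is an encoding of the action history."""

    def __init__(self, num_actions, embed_dim=16, hidden_dim=32):
        super().__init__()
        self.action_embed = nn.Embedding(num_actions, embed_dim)      # embed discrete actions
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)  # encode the history
        self.q_head = nn.Linear(hidden_dim, num_actions)              # Q-value per action

    def forward(self, action_history):
        # action_history: (batch, seq_len) tensor of past action ids
        embedded = self.action_embed(action_history)
        _, (h_n, _) = self.lstm(embedded)
        return self.q_head(h_n[-1])  # (batch, num_actions)

# Toy usage: 2 candidate notification types, batch of 3 histories of length 5.
net = LSTMDQN(num_actions=2)
histories = torch.randint(0, 2, (3, 5))
q_values = net(histories)
print(q_values.shape)  # torch.Size([3, 2])
```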

DQN + Double Q-Learning + OpenAI Gym

Here I am providing a script to quickly experiment with the OpenAI Gym environment: https://github.com/czxttkl/Tutorials/tree/master/experiments/lunarlander. The script has the features of both Deep Q-Learning and Double Q-Learning. I ran the script to benchmark one OpenAI Gym environment, LunarLander-v2. The most stable version of the algorithm has the following hyperparameters: no double q-learning (just use one q-network), gamma=0.99, batch size=64, learning …
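As a reference for the Double Q-Learning feature mentioned above, here is a minimal sketch of how the Double DQN target differs from the vanilla DQN target: the online network selects the next action and the target network evaluates it. The function and tensor names are illustrative, not taken from the linked script.

```python
import torch

def dqn_targets(rewards, next_q_target, dones, gamma=0.99):
    """Vanilla DQN target: max over the target network's Q-values."""
    return rewards + gamma * (1 - dones) * next_q_target.max(dim=1).values

def double_dqn_targets(rewards, next_q_online, next_q_target, dones, gamma=0.99):
    """Double DQN target: online net picks the action, target net evaluates it."""
    best_actions = next_q_online.argmax(dim=1, keepdim=True)
    evaluated = next_q_target.gather(1, best_actions).squeeze(1)
    return rewards + gamma * (1 - dones) * evaluated

# Toy usage: batch of 2 transitions, 4 actions.
rewards = torch.tensor([1.0, 0.0])
dones = torch.tensor([0.0, 1.0])
next_q_online = torch.randn(2, 4)
next_q_target = torch.randn(2, 4)
print(double_dqn_targets(rewards, next_q_online, next_q_target, dones))
```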