Here is a script for quickly experimenting with OpenAI Gym environments: https://github.com/czxttkl/Tutorials/tree/master/experiments/lunarlander. The script implements both Deep Q-Learning and Double Q-Learning.
I ran the script to benchmark the OpenAI Gym environment LunarLander-v2. The most stable version of the algorithm has the following hyperparameters: no double Q-learning (just a single Q-network), gamma=0.99, batch size=64, and learning rate=0.001 (for the Adam optimizer). It solves the environment in around 400 episodes:
Environment solved in 406 episodes!  Average Score: 200.06
Environment solved in 406 episodes!  Average Score: 200.06
Environment solved in 545 episodes!  Average Score: 200.62
Environment solved in 413 episodes!  Average Score: 200.35
Environment solved in 406 episodes!  Average Score: 200.06
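For context, the update step looks roughly like the following. This is a simplified sketch using the hyperparameters above (gamma=0.99, batch size 64, Adam with learning rate 0.001), not the exact code in the repository; the network sizes and the use of a separate target network are assumptions on my part.

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

GAMMA = 0.99       # discount factor
BATCH_SIZE = 64    # minibatch size sampled from the replay buffer
LR = 1e-3          # Adam learning rate

class QNetwork(nn.Module):
    """Small fully connected Q-network; LunarLander-v2 has an 8-dim state and 4 discrete actions."""
    def __init__(self, state_dim=8, action_dim=4, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, x):
        return self.net(x)

q_net = QNetwork()                      # online network
target_net = QNetwork()                 # target network, periodically synced with q_net
target_net.load_state_dict(q_net.state_dict())
optimizer = optim.Adam(q_net.parameters(), lr=LR)

def dqn_update(states, actions, rewards, next_states, dones):
    """One gradient step on a sampled minibatch (actions as a [BATCH_SIZE, 1] long tensor, dones as floats)."""
    # Q(s, a) for the actions actually taken
    q_values = q_net(states).gather(1, actions)
    # Standard DQN target: r + gamma * max_a' Q_target(s', a'), zeroed at terminal states
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1, keepdim=True)[0]
        targets = rewards + GAMMA * next_q * (1 - dones)
    loss = F.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()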
Several insights:
1. The discount factor gamma can have a large influence on the results. For example, in LunarLander, if I set gamma to 1 instead of 0.99, the agent never learns successfully, even though LunarLander-v2 caps each episode at 1000 steps (a small illustration follows after this list).
2. Double Q-learning does not bring a clear benefit, at least on LunarLander-v2. The output with double Q-learning turned on is similar to that of the most stable version, which does not use it (the sketch after this list shows the exact change):
lunarlander + mse + double q + gamma 0.99
Environment solved in 435 episodes!  Average Score: 205.85
Environment solved in 435 episodes!  Average Score: 205.85
Environment solved in 491 episodes!  Average Score: 200.97
Environment solved in 406 episodes!  Average Score: 200.06
Environment solved in 413 episodes!  Average Score: 200.35
3. Huber loss even seems to hurt. When I changed the loss function used to fit the Q-network to Huber loss, the results were pretty bad: the agent either never succeeds or only succeeds after 1000+ episodes (the loss swap is shown in the last snippet after this list).
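Regarding point 1, here is a back-of-the-envelope way to see why gamma matters (my own reasoning, not something verified against the experiments): with gamma=0.99 the weight on a reward k steps ahead decays quickly, which keeps the bootstrap targets bounded even in a 1000-step episode, whereas gamma=1 weights all 1000 steps equally.

# Weight that a reward k steps in the future receives in the discounted return
gamma = 0.99
for k in (10, 100, 500, 1000):
    print(k, gamma ** k)
# 10    0.904...
# 100   0.366...
# 500   0.00657...
# 1000  0.0000432...
# With gamma = 1.0 all (up to) 1000 rewards are weighted equally, so target
# magnitudes are larger and vary more from episode to episode.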
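For point 2, the only change double Q-learning makes to the update is in how the bootstrap target is formed: the online network picks the greedy next action and the target network evaluates it. A sketch, reusing the q_net / target_net names (and imports) from the earlier snippet:

def double_dqn_target(rewards, next_states, dones, gamma=0.99):
    """Double-DQN target: action selected by the online net, value taken from the target net."""
    with torch.no_grad():
        best_actions = q_net(next_states).argmax(dim=1, keepdim=True)   # argmax from the online net
        next_q = target_net(next_states).gather(1, best_actions)        # evaluated by the target net
        return rewards + gamma * next_q * (1 - dones)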
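And for point 3, the Huber-loss variant is a one-line swap in the loss computation (in PyTorch, Huber loss with delta=1 is smooth_l1_loss); the tensors below are just placeholders to make the snippet runnable on its own:

import torch
import torch.nn.functional as F

q_values = torch.randn(64, 1)   # placeholder predicted Q(s, a) for a batch
targets = torch.randn(64, 1)    # placeholder TD targets for the same batch

mse_loss = F.mse_loss(q_values, targets)            # the loss used in the stable version above
huber_loss = F.smooth_l1_loss(q_values, targets)    # the Huber variant that performed worse here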