Policy Gradient

Reinforcement learning algorithms can be divided into several families. Model-free temporal-difference methods such as Q-learning and SARSA learn the action value $latex Q(s,a)$ for every state-action pair, either by recording ("memorizing") exact values in a table or by learning a function that approximates them. Under an $latex \epsilon$-greedy policy, the action to be selected at a state will therefore …