Difference between SARSA and Q-learning

State-Action-Reward-State-Action (SARSA) and Q-learning are two classic temporal-difference control algorithms in reinforcement learning. The differences between the two methods are discussed in:

  1. https://studywolf.wordpress.com/2013/07/01/reinforcement-learning-sarsa-vs-q-learning/
  2. http://stackoverflow.com/questions/6848828/reinforcement-learning-differences-between-qlearning-and-sarsatd
  3. http://stats.stackexchange.com/questions/184657/difference-between-off-policy-and-on-policy-learning

Let’s explain why Q-learning is called off-policy learning and SARSA is called on-policy learning. Suppose at state $latex s_t$, the agent takes action $latex a_t$, which lands it in a new state $latex s_{t+1}$ with observed reward $latex r_{t+1}$. At state $latex s_{t+1}$, the next action $latex a_{t+1}$ is taken, which leads to $latex s_{t+2}$, and so on.

For Q-learning, its update rule is:  $latex Q(s_t, a_t)=(1-\alpha) \cdot Q(s_t, a_t) + \alpha \cdot (r_{t+1} + \gamma \cdot \max_a Q(s_{t+1}, a))$

For SARSA, its update rule is: $latex Q(s_t, a_t)=(1-\alpha) \cdot Q(s_t, a_t) + \alpha \cdot (r_{t+1} + \gamma \cdot Q(s_{t+1}, a_{t+1})) $
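To make the two formulas concrete, here is a minimal Python sketch of both updates. The function names, the dict-based Q-table, and the default values of $latex \alpha$ and $latex \gamma$ are my own illustrative choices, not from any particular library:

```python
# Minimal sketch: tabular updates for both methods, with Q stored as a dict
# mapping (state, action) pairs to floats. Names and defaults are illustrative.

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Off-policy target: bootstrap from the greedy action at s_next."""
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy target: bootstrap from the action a_next actually taken at s_next."""
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * Q[(s_next, a_next)])
```

Note that `sarsa_update` needs one extra argument, `a_next`, while `q_learning_update` only needs the set of available actions. That extra argument is exactly the on-policy/off-policy distinction discussed next.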

I think the most intuitive way to differentiate on-policy from off-policy is to check whether the update formula depends on $latex a_{t+1}$. Q-learning is called off-policy because the $latex \max_a Q(s_{t+1}, a)$ term does not involve $latex a_{t+1}$ at all: $latex Q(s_t, a_t)$ gets updated regardless of which action the behavior policy actually takes at $latex s_{t+1}$. On the other hand, SARSA is called on-policy because the term $latex Q(s_{t+1}, a_{t+1})$ in its update of $latex Q(s_t, a_t)$ depends on the next action $latex a_{t+1}$ that was actually taken.

Another way to understand the difference between on-policy and off-policy is to look at their estimates of $latex Q(s,a)$. Even when both SARSA and Q-learning use an $latex \epsilon$-greedy policy to strike a balance between exploration and exploitation, they still arrive at different estimates of $latex Q(s,a)$: Q-learning usually produces more aggressive (optimistic) estimates, while SARSA usually produces more conservative ones, because SARSA's estimates account for the exploratory actions the behavior policy will actually keep taking.
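Both methods typically share the same $latex \epsilon$-greedy behavior policy; only the update target differs. Below is a minimal sketch of the action selection over the same dict-based Q-table as above (the function name and defaults are again illustrative assumptions):

```python
import random

# Minimal sketch of epsilon-greedy action selection.
def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """With probability epsilon pick a random action (explore);
    otherwise pick the highest-valued action at state s (exploit)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])  # max() keeps the first best on ties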

Below is a 3×3 grid, whose first row is a cliff except for the two ends (the entrance and the exit), showing the different paths learned by Q-learning and SARSA when (a minimal code sketch of this setup follows the list):

  1. both methods adopt an $latex \epsilon$-greedy policy.
  2. when a tie happens, the action of going right is preferred.
  3. a discount factor $latex \gamma < 1$ is used.
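Here is a minimal, self-contained sketch of this experiment. The grid layout, rewards (-100 for falling off the cliff, -1 per step), and hyperparameters are illustrative assumptions, not necessarily the exact values used to produce the figure below:

```python
import random

ROWS, COLS = 3, 3
START, GOAL = (0, 0), (0, COLS - 1)            # entrance and exit on the first row
CLIFF = {(0, c) for c in range(1, COLS - 1)}   # first row is a cliff except the two ends
ACTIONS = ["right", "left", "up", "down"]      # "right" listed first -> preferred on ties
MOVES = {"right": (0, 1), "left": (0, -1), "up": (1, 0), "down": (-1, 0)}

def step(s, a):
    r, c = s[0] + MOVES[a][0], s[1] + MOVES[a][1]
    r, c = max(0, min(ROWS - 1, r)), max(0, min(COLS - 1, c))  # stay inside the grid
    if (r, c) in CLIFF:
        return START, -100.0, False   # fall off the cliff: big penalty, go back to start
    if (r, c) == GOAL:
        return (r, c), 0.0, True      # reach the exit: episode ends
    return (r, c), -1.0, False        # small cost per step encourages short paths

def epsilon_greedy(Q, s, epsilon):
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])  # max() keeps the first best -> "right" wins ties

def train(method="q_learning", episodes=5000, alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = {((r, c), a): 0.0 for r in range(ROWS) for c in range(COLS) for a in ACTIONS}
    for _ in range(episodes):
        s, a, done = START, epsilon_greedy(Q, START, epsilon), False
        while not done:
            s_next, reward, done = step(s, a)
            a_next = epsilon_greedy(Q, s_next, epsilon)
            if method == "q_learning":
                target = reward + gamma * max(Q[(s_next, b)] for b in ACTIONS)
            else:  # SARSA bootstraps from the action actually taken next
                target = reward + gamma * Q[(s_next, a_next)]
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
            s, a = s_next, a_next
    return Q
```

The two methods share the same behavior policy and the same experience; the only difference is the single line that computes the target.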

As you can see, Q-learning learns the faster path to the exit without caring about the risk of falling off the cliff, i.e., the state-action pairs along the faster path have higher $latex Q(s,a)$ estimates than those along the longer, safer path. This is attributed to the max operator in Q-learning’s update formula: the max operator only tells you the best cumulative discounted reward you could get in the future, regardless of danger. SARSA is like a cautious person who avoids the “dangerous zones” close to the cliff. This is because during training, the $latex Q(s,a)$ values on the faster path are sometimes dragged down by the negative rewards incurred when an exploratory action makes the agent fall off the cliff. Therefore, the $latex Q(s,a)$ values on the longer, safer path, though discounted by more steps, end up higher and more appealing.

[Figure: the 3×3 grid, with the shorter path along the cliff learned by Q-learning and the longer path away from the cliff learned by SARSA]


Imagine that this environment has a larger number of rows and columns and we still make the first row a cliff, except for the two ends that serve as entrance and exit. The path learned by SARSA may then keep a different distance from the first row, depending on the value of $latex \epsilon$, which governs the trade-off between exploration and exploitation, and the value of $latex \gamma$, which controls the level of discounting. In other words, the learned policy is affected by the training configuration, i.e., by the behavior policy itself. This is another reason why SARSA is called on-policy and Q-learning is called off-policy.
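To see this dependence concretely, one could sweep $latex \epsilon$ and $latex \gamma$ and read off the greedy path from the learned Q-table. The snippet below is only an illustrative follow-up that reuses the `train()`, `step()`, `START`, and `ACTIONS` definitions from the grid sketch above; the particular parameter values are arbitrary:

```python
# Extract the greedy path from a learned Q-table and compare SARSA's routes
# under different epsilon/gamma settings (increase ROWS/COLS in the sketch
# above to see paths at intermediate distances from the cliff).

def greedy_path(Q, max_steps=50):
    path, s, done = [START], START, False
    while not done and len(path) <= max_steps:
        a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s, _, done = step(s, a)
        path.append(s)
    return path

for epsilon in (0.05, 0.1, 0.3):
    for gamma in (0.9, 0.99):
        Q = train(method="sarsa", epsilon=epsilon, gamma=gamma)
        print(f"epsilon={epsilon}, gamma={gamma}: {greedy_path(Q)}")
```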

