Convergence of Q-learning and SARSA

Here, I am listing some classic proofs regarding the convergence of Q-learning and SARSA in finite MDPs (by definition, in a finite Markov Decision Process the sets of states, actions, and rewards are all finite [1]).

The very first Q-learning convergence proof comes from [4]. The proof is based on a very useful theorem:
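Paraphrasing Theorem 1 of [4]: a random iterative process

\triangle_{n+1}(x)=(1-\alpha_n(x))\triangle_n(x)+\beta_n(x)F_n(x)

converges to zero with probability 1 if the following 4 assumptions hold (here P_n denotes the past of the process at step n, and \|\cdot\|_w is a weighted maximum norm):

1) the state space is finite;
2) \sum_n \alpha_n(x)=\infty, \sum_n \alpha^2_n(x)<\infty, \sum_n \beta_n(x)=\infty, \sum_n \beta^2_n(x)<\infty, and E\{\beta_n(x)|P_n\} \leq E\{\alpha_n(x)|P_n\} uniformly with probability 1;
3) \|E\{F_n(x)|P_n\}\|_w \leq \gamma\|\triangle_n\|_w, with \gamma \in (0,1);
4) \text{Var}\{F_n(x)|P_n\} \leq C(1+\|\triangle_n\|_w)^2 for some constant C.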


Note that this theorem is general enough to be applied to a multi-dimensional space (x can be multi-dimensional).

We can apply Theorem 1 to Q-learning, whose update rule is as follows:
Q_{t+1}(s_t, a_t) = (1-\alpha_t(s_t, a_t))Q_t(s_t, a_t) + \alpha_t(s_t, a_t)[r_t +\gamma \max_{b \in \mathcal{A}}Q_t(s_{t+1}, b)]

Subtracting Q^*(s_t, a_t) from both sides and letting \triangle_t(s,a)=Q_t(s,a)-Q^*(s,a), we have:
\triangle_{t+1}(s_t,a_t)\newline=(1-\alpha_t(s_t, a_t))\triangle_t(s_t, a_t) + \alpha_t(s_t, a_t)[r_t + \gamma\max_{b\in \mathcal{A}}Q_t(s_{t+1}, b) - Q^*(s_t, a_t)]\newline=(1-\alpha_t(s_t, a_t))\triangle_t(s_t, a_t) + \alpha_t(s_t, a_t)F_t(s_t,a_t)

where we define F_t(s_t,a_t)=r_t + \gamma\max_{b\in \mathcal{A}}Q_t(s_{t+1}, b) - Q^*(s_t, a_t).

We can also let \alpha_t(s,a), the learning rate, be zero for all (s,a) \neq (s_t, a_t). Up to this point, we have modeled the Q-learning process exactly as the random iterative process stated in Theorem 1. Hence, as long as we can prove that the 4 assumptions hold, we can prove that Q-learning converges to the optimal Q-values (i.e., \triangle_t(s,a) converges to 0).
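To make the mapping concrete, here is a minimal tabular sketch in Python. Everything in it (the sizes n_states and n_actions, the helper name q_learning_update, the example call) is an illustrative assumption rather than code from any of the referenced papers; it simply mirrors the update rule and the \triangle_t / F_t rewriting above.

import numpy as np

# Minimal tabular Q-learning update, written to mirror the formulas above.
n_states, n_actions, gamma = 5, 2, 0.9
Q = np.zeros((n_states, n_actions))

def q_learning_update(Q, s, a, r, s_next, alpha):
    # Q_{t+1}(s,a) = (1 - alpha) * Q_t(s,a) + alpha * [r + gamma * max_b Q_t(s',b)]
    target = r + gamma * Q[s_next].max()
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
    return Q

Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=3, alpha=0.5)  # example call

# If Q* were known, subtracting it from both sides would give exactly the
# process of Theorem 1, restricted to the visited pair (s_t, a_t):
#   delta[s, a] <- (1 - alpha) * delta[s, a] + alpha * F
# with delta = Q - Q_star and F = r + gamma * max_b Q[s_next, b] - Q_star[s, a];
# for every other (s, a), alpha is taken to be 0, so delta stays unchanged.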

Assumption 1: this holds trivially since we are focusing on finite MDPs.

Assumption 2: \beta_n(x)=\alpha_n(x)=\text{the Q-learning learning rate} based on our formulation. If we apply a GLIE learning policy (“greedy in the limit with infinite exploration”) [2] together with a suitably decaying learning rate, then assumption 2 holds:
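Concretely, with \beta_t(s,a)=\alpha_t(s,a), assumption 2 reduces (roughly) to the usual stochastic-approximation conditions on the learning rate:

0 \leq \alpha_t(s,a) \leq 1, \qquad \sum_t \alpha_t(s,a)=\infty, \qquad \sum_t \alpha^2_t(s,a) < \infty

which in particular requires every state-action pair to be updated infinitely often; this is where the infinite exploration of GLIE comes in. For example, a schedule such as \alpha_t(s,a)=1/N_t(s,a), with N_t(s,a) the number of visits to (s,a) so far, satisfies both sums.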

  

Assumption 3: \|E\{F_t(s,a) | P_n\}\|_w \leq \gamma \|\triangle_t\|_w. There are some notations here that may not be familiar to every reader. First, \| x \|_w := \|w \cdot x\|_\infty=\max\limits_i |w_i x_i| is the weighted maximum norm [5]. Therefore, \|\triangle_t\|_w=\max\limits_{s,a} w(s,a)\big|Q_t(s,a)-Q^*(s,a)\big|; in the proof below we take w \equiv 1, so it reduces to the ordinary maximum norm. Second, F_t(s,a) | P_n just means that F_t(s,a) is conditioned on P_n, i.e., on all the past interactions (and hence on Q_t), rather than on (s_t, a_t) alone.

To prove assumption 3 holds, we first look at how we define the optimal Q-function:
Q^*(s,a)=\sum\limits_{s'\in\mathcal{S}}P_a(s,s')[r(s,a,s')+\gamma \max\limits_{a'\in\mathcal{A}}Q^*(s',a')]

We can show that Q^*(s,a) is a fixed point of a contraction operator \textbf{H}, defined for a generic function q: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}:
(\textbf{H}q)(s,a)=\sum\limits_{s'\in\mathcal{S}}P_a(s, s')[r(s,a,s')+\gamma \max\limits_{a' \in \mathcal{A}}q(s',a')]

Indeed, \|\textbf{H}q_1 - \textbf{H}q_2\|_\infty \leq \gamma \|q_1-q_2\|_\infty, i.e., \textbf{H} is a \gamma-contraction in the maximum norm, because:
\|\textbf{H}q_1 - \textbf{H}q_2\|_\infty \newline=\max\limits_{s,a} \big|\sum\limits_{s'\in \mathcal{S}} P_a(s,s')[r(s,a,s')+\gamma\max\limits_{a' \in \mathcal{A}}q_1(s',a') - r(s,a, s')-\gamma \max\limits_{a'\in\mathcal{A}}q_2(s',a')]\big|\newline=\max\limits_{s,a}\gamma \big|\sum\limits_{s'\in\mathcal{S}}P_a(s,s') [\max\limits_{a' \in \mathcal{A}}q_1(s',a')-\max\limits_{a' \in \mathcal{A}}q_2(s',a')]\big|\newline\text{Since } P_a(s,s') \geq 0\text{, the triangle inequality lets us move the absolute value inside the sum}\newline\leq\max\limits_{s,a}\gamma \sum\limits_{s'\in \mathcal{S}}P_a(s,s')\big|\max\limits_{a' \in \mathcal{A}}q_1(s',a')-\max\limits_{a' \in \mathcal{A}}q_2(s',a')\big|\newline\text{Using the property } \big|\max\limits_{a'}q_1(s',a')-\max\limits_{a'}q_2(s',a')\big| \leq \max\limits_{a'}\big|q_1(s',a')-q_2(s',a')\big|\newline\leq \max\limits_{s,a}\gamma \sum\limits_{s'\in \mathcal{S}}P_a(s,s')\max\limits_{a' \in \mathcal{A}}\big|q_1(s',a')-q_2(s',a')\big| \newline \text{Enlarging the domain of the max} \newline\leq \max\limits_{s,a}\gamma \sum\limits_{s'\in \mathcal{S}}P_a(s,s')\max\limits_{s'' \in \mathcal{S},a'' \in \mathcal{A}}\big|q_1(s'',a'')-q_2(s'',a'')\big|\newline=\max\limits_{s,a}\gamma \sum\limits_{s'\in \mathcal{S}}P_a(s,s')\|q_1-q_2\|_\infty\newline=\gamma \|q_1-q_2\|_\infty
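As a quick numerical sanity check (not a proof), we can draw a random finite MDP and verify this inequality directly. The snippet below is only a sketch under assumed conventions; the array shapes and names such as n_s, n_a, and H are illustrative choices, not from any of the referenced papers.

import numpy as np

# Sanity check that H is a gamma-contraction in the maximum norm.
rng = np.random.default_rng(0)
n_s, n_a, gamma = 4, 3, 0.9
P = rng.random((n_a, n_s, n_s))          # P[a, s, s'] = P_a(s, s')
P /= P.sum(axis=2, keepdims=True)        # normalize each row into a distribution
r = rng.random((n_s, n_a, n_s))          # r[s, a, s'] = r(s, a, s')

def H(q):
    # (Hq)(s,a) = sum_{s'} P_a(s,s') * [ r(s,a,s') + gamma * max_{a'} q(s',a') ]
    target = r + gamma * q.max(axis=1)[None, None, :]
    return np.einsum('asx,sax->sa', P, target)

q1 = rng.random((n_s, n_a))
q2 = rng.random((n_s, n_a))
lhs = np.abs(H(q1) - H(q2)).max()
rhs = gamma * np.abs(q1 - q2).max()
print(lhs, "<=", rhs, ":", lhs <= rhs + 1e-12)   # expect True

If you decrease gamma, the gap between the two sides typically widens, matching the intuition that a smaller discount factor gives a stronger contraction.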

Note that in one of our previous notes [6], we talked about the intuitive meaning of a \gamma-contraction: repeated application of \textbf{H} brings the function q closer and closer to the fixed point q^* at rate \gamma. Another intuitive way to read \|\textbf{H}q_1 - \textbf{H}q_2\| \leq \gamma \|q_1-q_2\| is that the distance between q_1 and q_2 shrinks after applying \textbf{H}.

After proving \textbf{H} is a \gamma-contraction, we now prove assumption 3:
E\{F_t(s,a) | P_n\}\newline=\sum\limits_{s'\in\mathcal{S}}P_a(s,s')[r(s,a,s')+\gamma\max\limits_{a'\in\mathcal{A}}Q_t(s', a')-Q^*(s,a)]\newline=\sum\limits_{s'\in\mathcal{S}}P_a(s,s')[r(s,a,s')+\gamma\max\limits_{a'\in\mathcal{A}}Q_t(s', a')]-Q^*(s,a)\newline=(\textbf{H}Q_t)(s,a)-Q^*(s,a)\newline\text{Using the fact that }Q^*=\textbf{H}Q^*\newline=(\textbf{H}Q_t)(s,a)-(\textbf{H}Q^*)(s,a)\newline\text{Taking the maximum over }(s,a)\text{ on both sides:}\newline\|E\{F_t(s,a) | P_n\}\|_\infty \leq \|\textbf{H}Q_t-\textbf{H}Q^*\|_\infty\leq \gamma \|Q_t-Q^*\|_\infty=\gamma \|\triangle_t\|_\infty
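In the same spirit as the contraction check above, we can verify assumption 3 numerically: compute Q^* by iterating \textbf{H} until it numerically stops changing (which is exactly what the fixed-point property suggests), and then check that \textbf{H}Q_t-Q^*, i.e., E\{F_t(s,a)|P_n\}, is bounded in the maximum norm by \gamma\|Q_t-Q^*\|_\infty. Again, this is only an illustrative sketch on a randomly generated finite MDP, with the same assumed shapes as before.

import numpy as np

# Check assumption 3 on a random finite MDP:
# ||E{F_t | P_n}||_inf = ||H Q_t - Q*||_inf <= gamma * ||Q_t - Q*||_inf.
rng = np.random.default_rng(1)
n_s, n_a, gamma = 4, 3, 0.9
P = rng.random((n_a, n_s, n_s))          # P[a, s, s'] = P_a(s, s')
P /= P.sum(axis=2, keepdims=True)
r = rng.random((n_s, n_a, n_s))          # r[s, a, s'] = r(s, a, s')

def H(q):
    # (Hq)(s,a) = sum_{s'} P_a(s,s') * [ r(s,a,s') + gamma * max_{a'} q(s',a') ]
    target = r + gamma * q.max(axis=1)[None, None, :]
    return np.einsum('asx,sax->sa', P, target)

# Q* is the fixed point of H, so iterating H converges to it.
Q_star = np.zeros((n_s, n_a))
for _ in range(2000):
    Q_star = H(Q_star)

Q_t = rng.random((n_s, n_a))         # an arbitrary current Q estimate
expected_F = H(Q_t) - Q_star         # E{F_t(s,a) | P_n} for every (s,a)
lhs = np.abs(expected_F).max()
rhs = gamma * np.abs(Q_t - Q_star).max()
print(lhs, "<=", rhs, ":", lhs <= rhs + 1e-9)   # expect True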

It is less obvious to me how to prove assumption 4, even after reading [3] and Proposition 5.5 in [8]. But the takeaway of assumption 4 is that the variance of F_t(s,a) should be bounded; intuitively, this is plausible because rewards in a finite MDP are bounded, so F_t(s,a) itself is bounded by a constant plus a term proportional to \|\triangle_t\|_w.

Finally, one can prove the convergence of SARSA [2] in a similar fashion. I hope I’ll have time to cover it in the future.

I think the most important takeaway from this post is the pattern for proving the convergence of a learning algorithm. Researchers often propose variants of Q-learning (such as soft Q-values in maximum-entropy settings [6], or SlateQ for dealing with combinatorial actions [9]), and they usually need to prove convergence at least in the case of finite MDPs. One can start from the optimal Q-function of the learning algorithm and prove that it is the fixed point of a contraction operator. Then, look at the update rule and rewrite it as a random iterative process as stated in Theorem 1. Finally, prove that the 4 assumptions hold.

 

Reference

[1] https://medium.com/harder-choices/finite-markov-decision-process-8a34f5e571f9

[2] Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms https://link.springer.com/content/pdf/10.1023/A:1007678930559.pdf

[3] Convergence of Q-learning: a simple proof http://users.isr.ist.utl.pt/~mtjspaan/readingGroup/ProofQlearning.pdf

[4] Convergence of Stochastic Iterative Dynamic Programming Algorithms https://papers.nips.cc/paper/764-convergence-of-stochastic-iterative-dynamic-programming-algorithms.pdf

[5] https://math.stackexchange.com/questions/182054/is-this-a-norm-triangle-inequality-for-weighted-maximum-norm

[6] https://czxttkl.com/2018/10/30/notes-on-soft-actor-critic-off-policy-maximum-entropy-deep-reinforcement-learning-with-a-stochastic-actor/

[7] norm properties: https://en.wikipedia.org/wiki/Norm_(mathematics)#Properties

[8] Neuro Dynamic Programming https://www.dropbox.com/s/naenlfpy0f9vmvt/neuro-dynamic-programming-optimization-and-neural-.pdf?dl=0

[9] SlateQ: https://arxiv.org/abs/1905.12767
