I’ve been reading Prof. Sergey Levine’s paper on Guided Policy Search (GPS) [2]. I do not understand it yet, but I want to keep a record of my questions so that I can look back and answer them in the future.
Based on my understanding, traditional policy search (e.g., REINFORCE) maximizes the expected reward using the likelihood ratio gradient estimator. This is usually done by first collecting some samples, estimating the gradient of the expected reward w.r.t. the policy parameters via the likelihood ratio trick, updating the policy parameters, and then collecting new samples again. The shortcomings of this procedure are: (1) it is sample inefficient, because the collected samples become off-policy and are discarded every time the policy parameters are updated; (2) when the task is complex, it is hard to navigate to the globally optimal parameters in the parameter space. The search often ends up in a poor local optimum, and for robots there is the added danger of being driven along risky trajectories by poor parameters.
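To make the collect / differentiate / update / re-collect loop concrete for myself, here is a minimal sketch of REINFORCE on a toy problem. Everything in it is my own assumption for illustration and is not taken from [2]: a single-step task with three discrete actions, a softmax policy with one logit per action, made-up rewards, and a mean-reward baseline. Note that each update draws fresh samples and throws the old ones away, which is exactly shortcoming (1).

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over a 1-D array of logits.
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_step(theta, rewards_per_action, rng, n_samples=200, lr=0.1):
    """One on-policy REINFORCE update for a softmax policy (toy setting)."""
    probs = softmax(theta)
    # 1. Collect fresh samples from the current policy.
    actions = rng.choice(len(theta), size=n_samples, p=probs)
    rewards = rewards_per_action[actions]
    # Assumed variance-reduction baseline: the mean reward of this batch.
    baseline = rewards.mean()
    # 2. Likelihood ratio gradient: E[(r - b) * grad log pi(a)].
    grad = np.zeros_like(theta)
    for a, r in zip(actions, rewards):
        grad_log_pi = -probs.copy()
        grad_log_pi[a] += 1.0  # grad of log softmax(theta)[a] is onehot(a) - probs
        grad += (r - baseline) * grad_log_pi
    grad /= n_samples
    # 3. Gradient ascent; the samples above are not reused afterwards.
    return theta + lr * grad

def train(n_iters=300, seed=0):
    rng = np.random.default_rng(seed)
    rewards_per_action = np.array([0.0, 1.0, 0.2])  # hypothetical rewards
    theta = np.zeros(3)
    for _ in range(n_iters):
        theta = reinforce_step(theta, rewards_per_action, rng)
    return softmax(theta)
```

Calling `train()` returns the final action probabilities, and most of the probability mass should end up on the action with the highest reward. This toy problem has no bad local optima, so it only illustrates the procedure, not shortcoming (2).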
Therefore, it would be better if we could use demonstration trajectories to initialize the policy. Moreover, the trajectories we have already collected could be kept, instead of being discarded, to help improve the policy parameters. I think these are the fundamental motivations of GPS.
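The standard tool for reusing samples that were drawn from a different distribution is importance sampling (see the tutorial in [1]). Below is a minimal sketch of the basic idea only; the estimator actually used in [2] is more elaborate. The function name, the toy policies, and the rewards are all hypothetical and chosen by me for illustration.

```python
import numpy as np

def importance_sampled_return(rewards, logp_target, logp_behavior,
                              self_normalize=True):
    """Estimate the expected reward under a target policy from samples
    that were drawn from a different (behavior) policy."""
    rewards = np.asarray(rewards, dtype=float)
    # Importance weight of each sample: pi_target(sample) / pi_behavior(sample).
    weights = np.exp(np.asarray(logp_target) - np.asarray(logp_behavior))
    if self_normalize:
        # Self-normalized variant: slightly biased, usually lower variance.
        return float(np.sum(weights * rewards) / np.sum(weights))
    # Plain variant: unbiased, but can have high variance.
    return float(np.mean(weights * rewards))

def demo(n_samples=20000, seed=0):
    rng = np.random.default_rng(seed)
    rewards_per_action = np.array([0.0, 1.0, 0.2])   # hypothetical rewards
    behavior = np.array([1 / 3, 1 / 3, 1 / 3])       # policy that produced the data
    target = np.array([0.1, 0.8, 0.1])               # policy we want to evaluate
    actions = rng.choice(3, size=n_samples, p=behavior)
    estimate = importance_sampled_return(
        rewards_per_action[actions],
        np.log(target[actions]),
        np.log(behavior[actions]),
    )
    true_value = float(np.dot(target, rewards_per_action))
    return estimate, true_value
```

In `demo()`, the data come only from the uniform behavior policy, yet the estimate should land close to the true expected reward of the target policy. This is why old trajectories, or trajectories from a demonstration, do not have to be thrown away after a parameter update.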
What I am not sure about is how exactly iLQR works. Also, regarding Algorithm 1, line 1 of [2]: why are many DDP solutions ($latex \pi_{\mathcal{G}_1}, \cdots,\pi_{\mathcal{G}_n}$) generated? Does that mean iLQR gives different results when initialized differently? Is iLQR only used in the first line?
It seems that GPS only deals with known dynamics and a known reward function. When the dynamics are not known, we should look at [3] or Continuous Deep Q-Learning [4, 5] instead.
References
[1] http://statweb.stanford.edu/~owen/mc/Ch-var-is.pdf (Importance sampling tutorial)
[2] https://graphics.stanford.edu/projects/gpspaper/gps_full.pdf
[3] https://people.eecs.berkeley.edu/~svlevine/papers/mfcgps.pdf
[4] https://zhuanlan.zhihu.com/p/21609472
[5] https://arxiv.org/abs/1603.00748
[6] http://blog.csdn.net/sunbibei/article/details/51485661 (Notes on GPS written in Chinese)
[7] https://www.youtube.com/watch?v=eKaYnXQUb2g (Levine’s video. In the first half hour he talked about GPS)