Intro to RL Chapter 13: Policy Gradient Methods

The methods so far have all been action-value methods: the policy is derived from learned action values, and without values there is no policy. This chapter considers learning a parameterized policy directly, so that action selection does not require action values; a value function may still be learned to assist policy learning, but it is not consulted to choose actions. Let $\boldsymbol{\theta} \in \mathbb{R}^{d'}$ denote the policy parameter vector, with $\pi(a|s, \boldsymbol{\theta}) = \Pr\{A_t = a \mid S_t = s, \boldsymbol{\theta}_t = \boldsymbol{\theta}\}$, and let $\mathbf{w} \in \mathbb{R}^d$ denote the value-function weights. The policy parameters are updated by gradient ascent on a performance measure $J(\boldsymbol{\theta})$:

$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \alpha \widehat{\nabla J(\boldsymbol{\theta}_t)}, \qquad \widehat{\nabla J(\boldsymbol{\theta}_t)} \in \mathbb{R}^{d'},$$

where $\widehat{\nabla J(\boldsymbol{\theta}_t)}$ is a stochastic estimate whose expectation approximates the gradient of the performance measure. Methods of this form are called policy gradient methods; methods that learn both a policy and a value function are called actor-critic methods. The chapter starts with the episodic case, where performance is the value of the start state, then considers the continuing case, where the average reward rate is the performance measure, and finally connects the two.

13.1 Policy Approximation and its Advantages

The policy parameterization can be anything, as long as $\pi(a|s, \boldsymbol{\theta})$ is differentiable, i.e. $\nabla\pi(a|s, \boldsymbol{\theta})$ exists and is finite for all states and actions. In practice, to ensure exploration we generally require that the policy never become deterministic. This section considers discrete action spaces and explains the advantages of policy gradient methods; policy-based methods also offer natural ways of handling continuous action spaces, see Section 13.7.

When the action space is discrete and not too large, a common parameterization forms numerical preferences $h(s, a, \boldsymbol{\theta})$ for each state-action pair and converts them into probabilities with a softmax, so that actions with higher preferences are selected more often. How the preferences are computed is arbitrary: they can come from a neural network or from a linear function of features. One advantage of a parameterized policy with softmax-in-action-preferences is that it can approach a deterministic policy, whereas ε-greedy always keeps at least probability ε on the non-greedy actions. One could instead apply a softmax to action values, but that still cannot approach determinism: action values converge to specific estimates, which is not the same as probabilities going to 0 or 1. If the softmax uses a temperature, the temperature must be decreased over time to approach determinism, but in practice it is hard to choose a suitable decay schedule, and even the initial temperature is hard to set. Action preferences are different from action values in that they are not driven toward specific target values; the preference of the optimal action is simply pushed higher and higher, toward infinity if the parameterization allows it.

A second advantage of a parameterized policy with softmax is that it can select actions with arbitrary probabilities. Sometimes the optimal policy is stochastic, as in card games with imperfect information, where the best policy mixes several actions with specific probabilities. Example 13.1 illustrates this: four states, of which the left three are nonterminal, and only two actions, left and right. The first and third states behave normally, but in the second state the actions are reversed: left moves right and right moves left. The state-action features are identical in all three states, so a value-based method is forced to always go left or always go right, while a policy-based method can learn an appropriate probability for each action.

A third advantage is that sometimes the policy is simply the easier function to approximate: for some tasks the action-value function is the simpler one, so action-value methods fit well, while for others the policy is simple but the environment (and hence the value function) is complex, so policy methods fit better. Finally, a parameterized policy is a convenient way to inject prior knowledge about the desired form of the policy, which is often the main reason for choosing policy-based learning.
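To make softmax-in-action-preferences concrete, here is a minimal sketch (not from the book) using linear preferences $h(s, a, \boldsymbol{\theta}) = \boldsymbol{\theta}^\top \mathbf{x}(s, a)$ over made-up feature vectors; the names `action_probs` and `features` are illustrative.

```python
import numpy as np

def action_probs(theta, features):
    """Softmax over linear action preferences h(s, a, theta) = theta . x(s, a).

    features: array of shape (num_actions, d'), one feature vector x(s, a) per action.
    Returns a probability for every action. No probability is ever exactly 0 or 1,
    but the preference of a dominant action can grow without bound, driving its
    probability arbitrarily close to 1 (unlike epsilon-greedy).
    """
    h = features @ theta          # preferences, shape (num_actions,)
    h = h - h.max()               # stabilize the exponentials
    e = np.exp(h)
    return e / e.sum()

# Tiny usage example with made-up numbers.
theta = np.array([1.0, -0.5])
features = np.array([[1.0, 0.0],   # x(s, a=0)
                     [0.0, 1.0],   # x(s, a=1)
                     [1.0, 1.0]])  # x(s, a=2)
pi = action_probs(theta, features)
a = np.random.default_rng(0).choice(len(pi), p=pi)   # sample an action from pi(.|s, theta)
print(pi, a)
```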

13.2 The Policy Gradient Theorem

Beyond the practical advantages above, there is also a theoretical one. With a policy parameterization the action probabilities change smoothly as the parameters change, whereas with action-value-based methods a small change in the estimated values can cause a large, discontinuous change in the selected action. This smoothness is largely why policy gradient methods have stronger convergence guarantees. The episodic and continuing cases use different performance measures $J(\boldsymbol{\theta})$; even so, the book treats them together as far as possible, so that the main theoretical results can be written with a single set of equations. This section considers the episodic case, where the performance measure is the value of the start state. Assuming every episode starts from the same state $s_0$,

$$J(\boldsymbol{\theta}) = v_{\pi_{\boldsymbol{\theta}}}(s_0). \tag{13.4}$$

How can the policy parameters be improved? Performance depends not only on the policy but also on the environment, whose model is generally unknown. Fortunately, the policy gradient theorem gives an expression for the gradient of performance with respect to the policy parameters that does not involve the environment (the derivative of the state distribution). For the episodic case:

$$\nabla J(\boldsymbol{\theta}) \propto \sum_s \mu(s) \sum_a q_\pi(s, a)\, \nabla\pi(a|s, \boldsymbol{\theta}). \tag{13.5}$$

The constant of proportionality is the average length of an episode; in the continuing case it is 1. $\mu$ is the on-policy distribution under $\pi$ (see page 199).

Proof of the policy gradient from Levine's course. Let $\tau$ be a trajectory, $\tau = (s_0, a_0, \dots, s_T, a_T)$. The probability of the whole trajectory is

$$p_{\boldsymbol{\theta}}(\tau) = p(s_0) \prod_{t=0}^{T} \pi_{\boldsymbol{\theta}}(a_t|s_t)\, p(s_{t+1}|s_t, a_t).$$

$$\begin{align}
J(\boldsymbol{\theta}) &= v_{\pi_{\boldsymbol{\theta}}}(s_0) = \mathbb{E}_{\tau \sim p_{\boldsymbol{\theta}}(\tau)}[r(\tau)] = \int p_{\boldsymbol{\theta}}(\tau)\, r(\tau)\, d\tau, \quad \text{where } r(\tau) = \sum_{t=0}^{T} r(s_t, a_t), \\
\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) &= \int \nabla_{\boldsymbol{\theta}}\, p_{\boldsymbol{\theta}}(\tau)\, r(\tau)\, d\tau. \\
\text{Note that } & p_{\boldsymbol{\theta}}(\tau)\, \nabla_{\boldsymbol{\theta}} \log p_{\boldsymbol{\theta}}(\tau) = p_{\boldsymbol{\theta}}(\tau)\, \frac{\nabla_{\boldsymbol{\theta}}\, p_{\boldsymbol{\theta}}(\tau)}{p_{\boldsymbol{\theta}}(\tau)} = \nabla_{\boldsymbol{\theta}}\, p_{\boldsymbol{\theta}}(\tau), \text{ so} \\
\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) &= \int p_{\boldsymbol{\theta}}(\tau)\, \nabla_{\boldsymbol{\theta}} \log p_{\boldsymbol{\theta}}(\tau)\, r(\tau)\, d\tau = \mathbb{E}_{\tau \sim p_{\boldsymbol{\theta}}(\tau)}\!\left[\nabla_{\boldsymbol{\theta}} \log p_{\boldsymbol{\theta}}(\tau)\, r(\tau)\right]. \\
\log p_{\boldsymbol{\theta}}(\tau) &= \log p(s_0) + \sum_{t=0}^{T} \left(\log \pi_{\boldsymbol{\theta}}(a_t|s_t) + \log p(s_{t+1}|s_t, a_t)\right), \\
\nabla_{\boldsymbol{\theta}} \log p_{\boldsymbol{\theta}}(\tau) &= \nabla_{\boldsymbol{\theta}} \sum_{t=0}^{T} \log \pi_{\boldsymbol{\theta}}(a_t|s_t), \\
\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) &= \mathbb{E}_{\tau \sim p_{\boldsymbol{\theta}}(\tau)}\!\left[\left(\sum_{t=0}^{T} \nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}}(a_t|s_t)\right)\left(\sum_{t=0}^{T} r(s_t, a_t)\right)\right] \propto \sum_s \mu(s) \sum_a q_\pi(s, a)\, \nabla\pi(a|s, \boldsymbol{\theta}).
\end{align}$$
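To see the log-derivative trick at work numerically, here is a minimal sketch (not from the book or the course): for a one-step problem with a softmax policy, the score-function estimate $\mathbb{E}_{a\sim\pi}[\nabla_{\boldsymbol{\theta}} \log \pi(a|\boldsymbol{\theta})\, r(a)]$ is compared against the analytic gradient of $\sum_a \pi(a|\boldsymbol{\theta})\, r(a)$. All names (`rewards`, `softmax`, the sample size) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
rewards = np.array([1.0, 3.0, 0.5])   # r(a) for a one-step (bandit-like) problem
theta = np.array([0.2, -0.1, 0.4])    # one preference per action

def softmax(h):
    e = np.exp(h - h.max())
    return e / e.sum()

pi = softmax(theta)

# Analytic gradient of J(theta) = sum_a pi(a|theta) r(a),
# using d pi(a)/d theta_k = pi(a) * (1[a == k] - pi(k)).
analytic = np.array([sum(pi[a] * ((a == k) - pi[k]) * rewards[a]
                         for a in range(3)) for k in range(3)])

# Score-function (log-derivative trick) estimate:
# grad J = E_{a ~ pi}[ grad_theta log pi(a|theta) * r(a) ].
n = 200_000
samples = rng.choice(3, size=n, p=pi)
one_hot = np.eye(3)[samples]          # indicator of the sampled action
grad_log_pi = one_hot - pi            # grad_theta log softmax(theta)[a]
estimate = (grad_log_pi * rewards[samples, None]).mean(axis=0)

print(analytic)   # exact gradient
print(estimate)   # Monte Carlo estimate; should agree up to sampling noise
```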

13.3 REINFORCE: Monte Carlo Policy Gradient

An all-actions method follows directly from (13.5): states encountered while following $\pi$ are sampled according to $\mu$, so at each step all actions can be updated at once:

$$\begin{align}
\nabla J(\boldsymbol{\theta}) &\propto \sum_s \mu(s) \sum_a q_\pi(s, a)\, \nabla\pi(a|s, \boldsymbol{\theta}) \\
&= \mathbb{E}_\pi\!\left[\sum_a q_\pi(S_t, a)\, \nabla\pi(a|S_t, \boldsymbol{\theta})\right], \tag{13.6}
\end{align}$$

$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \alpha \sum_a \hat{q}(S_t, a, \mathbf{w})\, \nabla\pi(a|S_t, \boldsymbol{\theta}). \tag{13.7}$$

REINFORCE instead updates only for the action $A_t$ actually taken, replacing the sum over actions in (13.6) by a sample:

$$\begin{align}
\nabla J(\boldsymbol{\theta}) &\propto \mathbb{E}_\pi\!\left[\sum_a \pi(a|S_t, \boldsymbol{\theta})\, q_\pi(S_t, a)\, \frac{\nabla\pi(a|S_t, \boldsymbol{\theta})}{\pi(a|S_t, \boldsymbol{\theta})}\right] \\
&= \mathbb{E}_\pi\!\left[q_\pi(S_t, A_t)\, \frac{\nabla\pi(A_t|S_t, \boldsymbol{\theta})}{\pi(A_t|S_t, \boldsymbol{\theta})}\right] \qquad (\text{replacing } a \text{ by the sample } A_t \sim \pi) \\
&= \mathbb{E}_\pi\!\left[G_t\, \frac{\nabla\pi(A_t|S_t, \boldsymbol{\theta})}{\pi(A_t|S_t, \boldsymbol{\theta})}\right] \qquad (\text{because } \mathbb{E}_\pi[G_t \mid S_t, A_t] = q_\pi(S_t, A_t)).
\end{align}$$

This gives the REINFORCE policy parameter update:

$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \alpha\, G_t\, \frac{\nabla\pi(A_t|S_t, \boldsymbol{\theta}_t)}{\pi(A_t|S_t, \boldsymbol{\theta}_t)}. \tag{13.8}$$

REINFORCE updates only after the episode is complete, so it is a Monte Carlo method (in the pseudocode the gradient is written in the $\nabla \ln \pi$ form and discounting is included). The figure below shows REINFORCE on Example 13.1: as a standard stochastic gradient method, REINFORCE is guaranteed to converge, although its updates can have high variance, which makes learning slow.
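Below is a minimal sketch of episodic REINFORCE, with the update (13.8) written in the $\nabla \ln \pi$ form with discounting, on a reconstruction of the short-corridor environment of Example 13.1. It assumes a two-parameter softmax policy with the single, state-independent feature vector per action described in Section 13.1; this is not the book's pseudocode, and names such as `short_corridor_episode` and `ALPHA` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Features are the same in every state (the point of Example 13.1):
# x(s, right) = [1, 0], x(s, left) = [0, 1].
X = np.array([[1.0, 0.0],   # right
              [0.0, 1.0]])  # left

def policy(theta):
    """Softmax-in-action-preferences, h(s, a, theta) = theta . x(s, a)."""
    h = X @ theta
    e = np.exp(h - h.max())
    return e / e.sum()            # [P(right), P(left)], identical in every state

def short_corridor_episode(theta, max_steps=1000):
    """One episode of the short corridor with switched actions (Example 13.1).
    Returns actions taken and rewards received; states are not needed for the
    update because the features do not depend on the state."""
    s, actions, rewards = 0, [], []
    while s != 3 and len(actions) < max_steps:   # state 3 is terminal; cap for safety
        a = rng.choice(2, p=policy(theta))       # 0 = right, 1 = left
        move = +1 if a == 0 else -1
        if s == 1:
            move = -move                         # in the second state the actions are reversed
        s = max(0, s + move)
        actions.append(a)
        rewards.append(-1.0)                     # -1 per step until termination
    return actions, rewards

ALPHA, GAMMA = 2e-4, 1.0
theta = np.zeros(2)
for episode in range(2000):
    actions, rewards = short_corridor_episode(theta)
    G = 0.0
    # Work backwards through the episode, accumulating returns, and apply (13.8)
    # in the grad-ln form: grad ln pi(a|s,theta) = x(s,a) - sum_b pi(b|s,theta) x(s,b).
    for t in reversed(range(len(actions))):
        G = GAMMA * G + rewards[t]
        grad_ln_pi = X[actions[t]] - policy(theta) @ X
        theta = theta + ALPHA * (GAMMA ** t) * G * grad_ln_pi

print(policy(theta))  # P(right) should drift toward the optimal stochastic policy (~0.59 in Example 13.1)
```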

13.4 REINFORCE with Baseline

The policy gradient theorem (13.5) can be generalized to include a comparison with an arbitrary baseline $b(s)$:

$$\nabla J(\boldsymbol{\theta}) \propto \sum_s \mu(s) \sum_a \big(q_\pi(s, a) - b(s)\big)\, \nabla\pi(a|s, \boldsymbol{\theta}). \tag{13.10}$$

The baseline can be any function as long as it does not depend on $a$; subtracting it leaves the sum unchanged because $\sum_a b(s)\, \nabla\pi(a|s, \boldsymbol{\theta}) = b(s)\, \nabla \sum_a \pi(a|s, \boldsymbol{\theta}) = b(s)\, \nabla 1 = 0$.
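As a sketch of how a baseline slots into the update, the snippet below modifies the REINFORCE loop from the previous section (it reuses `policy`, `short_corridor_episode`, `X`, and `np` from that sketch). Because the short-corridor features cannot distinguish states, the baseline here is simply a single learned scalar estimate `w` of the state value; this is an illustrative choice, not the book's pseudocode, and `ALPHA_W` is an assumed name.

```python
# REINFORCE with a learned baseline (continuation of the previous sketch).
ALPHA, ALPHA_W, GAMMA = 2e-4, 1e-2, 1.0
theta, w = np.zeros(2), 0.0
for episode in range(2000):
    actions, rewards = short_corridor_episode(theta)
    G = 0.0
    for t in reversed(range(len(actions))):
        G = GAMMA * G + rewards[t]
        delta = G - w                          # return minus baseline
        w = w + ALPHA_W * delta                # move the scalar baseline toward observed returns
        grad_ln_pi = X[actions[t]] - policy(theta) @ X
        theta = theta + ALPHA * (GAMMA ** t) * delta * grad_ln_pi
```

Subtracting the baseline does not change the expected update, but it can substantially reduce its variance and therefore speed up learning.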