In an MDP, an action influences not only the immediate reward but also future rewards.
3.1 The Agent-Environment Interface
Dynamics of a finite MDP:
$p(s', r \mid s, a) \doteq \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\}$  (3.2)
From (3.2) we can compute the state-transition probabilities
$p(s' \mid s, a) = \sum_{r} p(s', r \mid s, a)$,
the expected rewards for state–action pairs
$r(s, a) = \mathbb{E}[R_t \mid S_{t-1} = s, A_{t-1} = a] = \sum_{r} r \sum_{s'} p(s', r \mid s, a)$,
and the expected rewards for state–action–next-state triples
$r(s, a, s') = \mathbb{E}[R_t \mid S_{t-1} = s, A_{t-1} = a, S_t = s'] = \sum_{r} r \, \frac{p(s', r \mid s, a)}{p(s' \mid s, a)}$.
In RL problems, choosing the state and action representations is more art than science.
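As a minimal sketch (not from the book) of the derived quantities above: the two-state MDP, its actions, and all probabilities and rewards below are made up for illustration.

```python
# Hypothetical dynamics p(s', r | s, a), stored as {(s, a): {(s_next, r): prob}}.
p = {
    ("A", "left"):  {("A", 0.0): 0.9, ("B", 1.0): 0.1},
    ("A", "right"): {("B", 1.0): 0.8, ("A", 0.0): 0.2},
    ("B", "left"):  {("A", 0.0): 1.0},
    ("B", "right"): {("B", 5.0): 1.0},
}

def transition_prob(s, a, s_next):
    """State-transition probability: p(s' | s, a) = sum_r p(s', r | s, a)."""
    return sum(prob for (sp, _r), prob in p[(s, a)].items() if sp == s_next)

def expected_reward(s, a):
    """Expected reward for a state-action pair: r(s, a) = sum_{s', r} r * p(s', r | s, a)."""
    return sum(r * prob for (_sp, r), prob in p[(s, a)].items())

def expected_reward_given_next(s, a, s_next):
    """Expected reward for a state-action-next-state triple:
    r(s, a, s') = sum_r r * p(s', r | s, a) / p(s' | s, a)."""
    joint = sum(r * prob for (sp, r), prob in p[(s, a)].items() if sp == s_next)
    return joint / transition_prob(s, a, s_next)

print(transition_prob("A", "right", "B"))             # 0.8
print(expected_reward("A", "right"))                  # 0.8
print(expected_reward_given_next("A", "right", "B"))  # 1.0
```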
3.2 Goals and Rewards
The reward signal tells the agent what the goal is, but it should not be used to give the agent prior knowledge about how to reach it. For example, in Go the agent receives +1 for winning a game, yet the reward does not tell it how to play: the reward signals what you want it to achieve, not how you want it achieved.
3.3 Returns and Episodes
Episodic tasks: there are terminal states, and the cumulative reward is simply the sum
$G_t \doteq R_{t+1} + R_{t+2} + \cdots + R_T$.
Continuing tasks: many RL problems have no terminal state and run on forever, so discounting is used and the return is the discounted return
$G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$,  (3.8)
where $0 \le \gamma \le 1$ is the discount rate. If $\gamma < 1$ and the reward sequence $\{R_k\}$ is bounded, then (3.8) has a finite value; if $\gamma = 0$, only the immediate reward is taken into account. Even for rewards that never stop (e.g., a constant $+1$ at every step), (3.8) stays finite as long as $\gamma < 1$: in that case $G_t = \sum_{k=0}^{\infty} \gamma^k = \frac{1}{1-\gamma}$.
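A quick sketch of computing the discounted return over a finite reward sequence; the rewards and discount rates below are arbitrary illustrative values, not from the book.

```python
def discounted_return(rewards, gamma):
    """G_t = sum_k gamma^k * R_{t+k+1}, truncated to a finite reward sequence."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# With a constant reward of +1 and gamma = 0.9, the return approaches 1 / (1 - 0.9) = 10.
print(discounted_return([1.0] * 1000, gamma=0.9))  # ~10.0
print(discounted_return([1.0] * 1000, gamma=0.0))  # 1.0, only the immediate reward counts
```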
3.4 Unified Notation for Episodic and Continuing Tasks
Episodic and continuing tasks are unified via an absorbing state that transitions only to itself and generates only rewards of zero. The return for an episodic task can then be written in the same form,
$G_t \doteq \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k$,
including the possibility that $T = \infty$ or $\gamma = 1$ (but not both).
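A tiny sketch of why this convention is harmless (the episode's rewards are made-up values): padding a finished episode with the zero rewards an absorbing state would generate leaves the return unchanged, so one summation formula covers both kinds of task.

```python
def discounted_return(rewards, gamma):
    """G_t = sum_k gamma^k * R_{t+k+1} over a finite reward sequence."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

episode_rewards = [1.0, 0.0, 2.0]          # a finished episode (illustrative values)
padded = episode_rewards + [0.0] * 100     # absorbing state: zero reward forever after

print(discounted_return(episode_rewards, 0.9))  # 2.62
print(discounted_return(padded, 0.9))           # 2.62, identical
```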
3.5 Policies and Value Functions
Value function: how good it is for the agent to be in a given state (or how good it is to perform a given action in a given state). Policy $\pi$: a mapping from states to probabilities of selecting each possible action. The value of a state $s$ under a policy $\pi$, denoted $v_\pi(s)$, is the expected return when starting in $s$ and following $\pi$ thereafter.
State-value function for policy $\pi$:
$v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t = s] = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right]$
Action-value function for policy $\pi$:
$q_\pi(s, a) \doteq \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a] = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s, A_t = a\right]$
Monte Carlo methods: sample every possibility and average the returns observed after each state. $v_\pi$ and $q_\pi$ can also be represented by parameterized functions whose parameters are adjusted over time; although the accuracy then depends on the approximator, it can still be quite precise. The value function satisfies the Bellman equation, which is the basis of many ways of computing value functions:
$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v_\pi(s')\right]$
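As a sketch of how the Bellman equation is used, the loop below sweeps it as an update rule until the values stop changing (this is iterative policy evaluation, treated in Chapter 4 of the book). The two-state MDP and the equiprobable policy are hypothetical.

```python
# Hypothetical dynamics p(s', r | s, a) as {(s, a): [(prob, s_next, r), ...]}.
dynamics = {
    ("A", "stay"): [(1.0, "A", 0.0)],
    ("A", "go"):   [(0.5, "B", 1.0), (0.5, "A", 0.0)],
    ("B", "stay"): [(1.0, "B", 2.0)],
    ("B", "go"):   [(1.0, "A", 0.0)],
}
states, actions, gamma = ["A", "B"], ["stay", "go"], 0.9
policy = {s: {a: 0.5 for a in actions} for s in states}  # equiprobable policy (assumed)

v = {s: 0.0 for s in states}
for _ in range(10_000):
    delta = 0.0
    for s in states:
        # Bellman equation used as an assignment:
        # v(s) <- sum_a pi(a|s) sum_{s', r} p(s', r | s, a) [r + gamma * v(s')]
        new_v = sum(
            policy[s][a] * sum(prob * (r + gamma * v[s_next])
                               for prob, s_next, r in dynamics[(s, a)])
            for a in actions
        )
        delta = max(delta, abs(new_v - v[s]))
        v[s] = new_v
    if delta < 1e-8:
        break

print(v)  # approximate v_pi for each state under the equiprobable policy
```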
3.6 Optimal Policies and Optimal Value Functions
There always exists a policy that is better than or equal to every other policy; it is called an optimal policy $\pi_*$. All optimal policies share the same state-value function, the optimal state-value function $v_*(s) \doteq \max_\pi v_\pi(s)$, and the same optimal action-value function $q_*(s, a) \doteq \max_\pi q_\pi(s, a)$.
The Bellman optimality equation does not reference any particular policy (it is independent of the policy); it simply takes the max over actions:
$v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v_*(s')\right]$
$q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma \max_{a'} q_*(s', a')\right]$
These correspond to the backup diagrams for $v_*$ and $q_*$. Once $v_*$ is known, we only need a one-step search: pick the action that leads to the best combination of reward and next-state value (a greedy search). If $q_*$ is known, it is even simpler: just pick the action with the largest $q_*(s, a)$, with no knowledge of the environment's dynamics required.
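A short sketch of the two ways of acting greedily described above; the dynamics and the value estimates below are made-up numbers, not solutions of any particular MDP.

```python
# Hypothetical dynamics p(s', r | s, a) and illustrative value estimates.
dynamics = {
    ("A", "stay"): [(1.0, "A", 0.0)],
    ("A", "go"):   [(0.5, "B", 1.0), (0.5, "A", 0.0)],
    ("B", "stay"): [(1.0, "B", 2.0)],
    ("B", "go"):   [(1.0, "A", 0.0)],
}
actions, gamma = ["stay", "go"], 0.9
v_star = {"A": 12.3, "B": 20.0}                      # illustrative v* values (assumed)
q_star = {("A", "stay"): 11.1, ("A", "go"): 15.0,    # illustrative q* values (assumed)
          ("B", "stay"): 20.0, ("B", "go"): 11.1}

def greedy_from_v(s):
    """One-step search: requires the dynamics p(s', r | s, a) in addition to v*."""
    return max(actions, key=lambda a: sum(prob * (r + gamma * v_star[s_next])
                                          for prob, s_next, r in dynamics[(s, a)]))

def greedy_from_q(s):
    """Model-free: just pick the action with the largest q*(s, a)."""
    return max(actions, key=lambda a: q_star[(s, a)])

print(greedy_from_v("A"), greedy_from_q("A"))  # go go
```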
3.7 Optimality and Approximation
Exact solutions require more computation, time, and memory than are available, so approximation is necessary. Moreover, many states occur only rarely, and the approximation does not need to make good choices in those states. This is one key way RL differs from other approaches to solving MDPs.
3.8 Summary