Reinforcement Learning Formula Notes

Tired of stoking the boiler, so here is a round of note organizing.

Posted by LCY on September 2, 2018

Monte Carlo Methods

Policy iteration proceeds as follows:

Policy evaluation: estimate the action values $Q^{\pi}(s, a)$ from samples generated by the current policy.

Policy improvement: update the policy greedily with respect to the estimated values, $\pi(s) \leftarrow \arg\max_{a} Q(s, a)$ (a sketch follows below).
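A minimal every-visit Monte Carlo control sketch of these two steps, operating on pre-collected episodes (the episode format and the function name are illustrative assumptions, not a fixed API):

```python
from collections import defaultdict

import numpy as np

def mc_policy_iteration(episodes, n_actions, gamma=0.9):
    """Episodes are lists of (state, action, reward) tuples sampled with the current policy."""
    Q = defaultdict(lambda: np.zeros(n_actions))        # action-value estimates
    counts = defaultdict(lambda: np.zeros(n_actions))   # visit counts per (s, a)
    # Policy evaluation: every-visit Monte Carlo estimate of Q(s, a).
    for episode in episodes:
        G = 0.0
        for state, action, reward in reversed(episode):  # backward pass to accumulate returns
            G = reward + gamma * G
            counts[state][action] += 1
            Q[state][action] += (G - Q[state][action]) / counts[state][action]
    # Policy improvement: act greedily with respect to the estimated values.
    policy = {s: int(np.argmax(q)) for s, q in Q.items()}
    return Q, policy

# Toy usage: one episode visiting two states with two possible actions.
episodes = [[(0, 1, 0.0), (1, 0, 1.0)]]
Q, policy = mc_policy_iteration(episodes, n_actions=2)
print(policy)   # greedy action for each visited state
```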


Exploration

  • On-policy: the value function is updated with samples generated by the current policy.
    • The policy is kept soft, e.g. $\epsilon$-soft: $\pi(a \mid s) \ge \frac{\epsilon}{|A(s)|}$ for every action $a$.
  • Off-policy: the value function is updated with samples that were not generated by the current policy.

    • Probability of each non-greedy (random) action under $\epsilon$-greedy: $\frac{\epsilon}{|A(s)|}$.
    • Probability of the greedy action: $1 - \epsilon + \frac{\epsilon}{|A(s)|}$ (see the sketch right after this list).
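A minimal $\epsilon$-greedy sketch matching the probabilities above (the Q-table shape and function name are illustrative):

```python
import numpy as np

def epsilon_greedy(Q, state, epsilon=0.1, rng=np.random.default_rng()):
    """Greedy action w.p. 1 - eps + eps/|A|, each other action w.p. eps/|A|."""
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # uniform random action
    return int(np.argmax(Q[state]))           # greedy action

# Example: a 5-state, 3-action Q-table.
Q = np.zeros((5, 3))
print(epsilon_greedy(Q, state=0, epsilon=0.1))
```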

Proof that the new $\epsilon$-greedy policy $\pi'$ defined with respect to $Q^{\pi}$ is at least as good as the original $\epsilon$-soft policy $\pi$; the standard argument is sketched below.

Equality holds if and only if both policies are optimal.
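A sketch of the usual argument (assuming standard notation: $q_\pi$ and $v_\pi$ are the action- and state-value functions of the old $\epsilon$-soft policy $\pi$, and $\pi'$ is the $\epsilon$-greedy policy derived from $q_\pi$):

$$\begin{aligned}
q_\pi(s, \pi'(s)) &= \sum_{a} \pi'(a \mid s)\, q_\pi(s, a) \\
&= \frac{\epsilon}{|A(s)|} \sum_{a} q_\pi(s, a) + (1-\epsilon) \max_{a} q_\pi(s, a) \\
&\ge \frac{\epsilon}{|A(s)|} \sum_{a} q_\pi(s, a) + (1-\epsilon) \sum_{a} \frac{\pi(a \mid s) - \frac{\epsilon}{|A(s)|}}{1-\epsilon}\, q_\pi(s, a) \\
&= \sum_{a} \pi(a \mid s)\, q_\pi(s, a) = v_\pi(s),
\end{aligned}$$

so $\pi' \ge \pi$ by the policy improvement theorem.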

DQN


Improvements

Target Q-network

Two networks are maintained, and the target network is updated with a delay.

The online network with parameters $\theta$ is updated by minimizing

$$L(\theta) = \mathbb{E}_{(s, a, r, s')}\Big[\big(r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta)\big)^{2}\Big]$$

where $\theta^{-}$ are the parameters of the target network, copied from $\theta$ every fixed number of steps.
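A minimal PyTorch sketch of this loss with a delayed target network (the network sizes and the random batch are placeholders, not the original DQN setup):

```python
import copy

import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 4, 2, 0.99
q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = copy.deepcopy(q_net)            # theta^- : delayed copy of theta
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def dqn_update(s, a, r, s_next, done):
    """One gradient step on (r + gamma * max_a' Q_target(s', a') - Q(s, a))^2."""
    with torch.no_grad():
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = ((target - q_sa) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def sync_target():
    """Called every fixed number of steps: copy theta into theta^-."""
    target_net.load_state_dict(q_net.state_dict())

# Example step on a random batch of 32 transitions.
s, s_next = torch.randn(32, obs_dim), torch.randn(32, obs_dim)
a, r, done = torch.randint(n_actions, (32,)), torch.randn(32), torch.zeros(32)
dqn_update(s, a, r, s_next, done)
```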

Policy Gradient

The parameters $\theta$ are updated by differentiating the objective, as derived below.

Think about it purely from a probabilistic perspective: we have a policy network that takes the state as input and outputs a probability distribution over actions. After an action is executed we observe a reward (or outcome). If an action yields a large reward, we increase the probability of taking it; if it yields a small reward, we decrease that probability.

Construct a good evaluation metric for actions, one that judges whether an action is good or bad, and optimize the policy by shifting the probabilities with which actions are taken.

Let this evaluation metric be $f(x)$, where $x$ is an outcome sampled from the policy's distribution $p_\theta(x)$. Then:

$$\nabla_{\theta}\, \mathbb{E}_{x \sim p_\theta}[f(x)] = \mathbb{E}_{x \sim p_\theta}\big[f(x)\, \nabla_{\theta} \log p_\theta(x)\big]$$
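A minimal REINFORCE-style sketch of this gradient in PyTorch (the network, the single sampled step, and the reward are placeholders):

```python
import torch
import torch.nn as nn

obs_dim, n_actions = 4, 2
policy_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-3)

state = torch.randn(1, obs_dim)                       # placeholder observation
dist = torch.distributions.Categorical(logits=policy_net(state))
action = dist.sample()
reward = 1.0                                          # placeholder return for this action

# Maximizing E[f(x) * log p(x)] is implemented by minimizing its negation.
loss = -(reward * dist.log_prob(action)).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```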

Why we use the log-likelihood instead of the likelihood for a Gaussian distribution

  1. It is extremely useful, for example, when you want to calculate the joint likelihood of a set of independent and identically distributed points $X = \{x_1, x_2, \ldots, x_N\}$.

    The total likelihood is the product of the likelihoods of the individual points, i.e.:

    $$p(X \mid \Theta) = \prod_{i=1}^{N} p(x_i \mid \Theta)$$

    where $\Theta = (\mu, \Sigma)$ are the model parameters: the vector of means $\mu$ and the covariance matrix $\Sigma$. If you use the log-likelihood you end up with a sum instead of a product:

    $$\ln p(X \mid \Theta) = \sum_{i=1}^{N} \ln p(x_i \mid \Theta)$$

  2. Also, in the case of a Gaussian, it lets you avoid computing the exponential:

    $$p(x \mid \Theta) = \frac{1}{\sqrt{(2\pi)^{d}\,|\Sigma|}} \exp\!\Big(-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)\Big)$$

    which becomes:

    $$\ln p(x \mid \Theta) = -\frac{d}{2}\ln(2\pi) - \frac{1}{2}\ln|\Sigma| - \frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)$$

  3. As you mentioned, $\ln x$ is a monotonically increasing function, so log-likelihoods preserve the ordering of the likelihoods:

    $$p(x_1 \mid \Theta) > p(x_2 \mid \Theta) \iff \ln p(x_1 \mid \Theta) > \ln p(x_2 \mid \Theta)$$

  4. From the standpoint of computational complexity, summing is less expensive than multiplying (although nowadays the two are almost equally fast). More importantly, the likelihoods become very small very quickly, so you run out of floating-point precision and get an underflow; that is why it is far more convenient to use the logarithm of the likelihood. Simply try to calculate the joint likelihood by hand with a pocket calculator: it is almost impossible (see the numerical check after this list).

    Additionally, in a classification setting you can simplify the calculation even further. The ordering remains valid if you drop the division by 2 and the $\frac{d}{2}\ln(2\pi)$ term, because these are class-independent. Also, if the covariance of both classes is the same ($\Sigma_1 = \Sigma_2$), the $\frac{1}{2}\ln|\Sigma|$ term can be removed as well.
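A quick numerical check of the underflow point, using SciPy (the sample size and parameters are arbitrary):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=2000)   # 2000 i.i.d. standard normal samples

# Product of per-point likelihoods underflows to 0.0 in double precision...
print(np.prod(norm.pdf(x)))        # -> 0.0

# ...while the sum of log-likelihoods stays perfectly representable.
print(np.sum(norm.logpdf(x)))      # a finite value around -2.8e3
```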

Policy gradient methods

Actor-Critic

A summary of Actor-Critic algorithms

Primal-Dual DDPG

MDP

A Markov Decision Process (MDP) can be viewed as a tuple $(S, A, R, P, \mu)$, where $A$ is the set of actions and:

$S$: the set of states.

$R$: the reward function, $R: S \times A \times S \to \mathbb{R}$.

$P$: the transition probability, $P: S \times A \times S \to [0, 1]$, where $P(s' \mid s, a)$ is the probability of transitioning from state $s$ to state $s'$ given action $a$.

$\mu$: the initial state distribution.

A stationary policy $\pi: S \to \Delta(A)$ maps each state to a probability distribution over the action set; $\pi(a \mid s)$ is the probability of choosing action $a$ in state $s$.

The long-term discounted reward under a policy $\pi$ is

$$R(\pi) = \mathbb{E}_{\tau \sim \pi}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, R(s_t, a_t, s_{t+1})\Big]$$

where $\gamma \in (0, 1)$ is the discount factor and $\tau = (s_0, a_0, s_1, a_1, \ldots)$ is a trajectory. $\tau \sim \pi$ means the distribution over trajectories is induced by the policy $\pi$, i.e. $s_0 \sim \mu$, $a_t \sim \pi(\cdot \mid s_t)$, and $s_{t+1} \sim P(\cdot \mid s_t, a_t)$.
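A minimal sketch of the empirical discounted return of one sampled trajectory (the reward list is assumed given):

```python
def discounted_return(rewards, gamma=0.99):
    """Compute sum_t gamma^t * r_t for one sampled trajectory."""
    g = 0.0
    # Iterate backwards so each step reuses the return of its successor.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 0.0, 5.0], gamma=0.9))  # 1 + 0.9**3 * 5 = 4.645
```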

CMDP

A constrained Markov Decision Process (CMDP) extends the MDP framework with long-term discounted costs.

On top of the original MDP, define a cost function $C: S \times A \times S \to \mathbb{R}$ that assigns a cost to every transition. The long-term discounted cost is

$$C(\pi) = \mathbb{E}_{\tau \sim \pi}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, C(s_t, a_t, s_{t+1})\Big]$$

and the corresponding constraint is $C(\pi) \le d$ for a given budget $d$.

Our goal is to select a policy that maximizes the long-term reward while satisfying the constraint:

$$\max_{\pi} R(\pi) \quad \text{s.t.} \quad C(\pi) \le d$$

Algorithm

Solve it with the Lagrangian method:

$$L(\pi, \lambda) = R(\pi) - \lambda\,\big(C(\pi) - d\big)$$

where $\lambda \ge 0$ is the Lagrange multiplier.

The constrained problem can then be viewed as the unconstrained max-min problem:

$$\max_{\pi}\, \min_{\lambda \ge 0}\; L(\pi, \lambda)$$

To solve this unconstrained max-min problem, the policy and the multiplier are updated in turn within each iteration.

In each iteration:

  1. Fix $\lambda$ and perform a policy gradient update: $\pi_{k+1} = \pi_{k} + \eta_{\pi}\, \nabla_{\pi} L(\pi_{k}, \lambda_{k})$.
  2. Fix $\pi$ and perform a dual update: $\lambda_{k+1} = \big[\lambda_{k} + \eta_{\lambda}\,\big(C(\pi_{k+1}) - d\big)\big]_{+}$ (see the numerical sketch after this list).
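A minimal numerical sketch of this alternating loop, with a toy scalar "policy parameter" standing in for a policy network (the functions R, C, the budget d, and the step sizes are illustrative assumptions, not the paper's setup):

```python
def R(theta):                       # toy reward to maximize
    return -(theta - 3.0) ** 2

def C(theta):                       # toy constraint cost, budget d below
    return theta

d = 2.0

def grad(f, x, eps=1e-5):           # numerical gradient, for brevity
    return (f(x + eps) - f(x - eps)) / (2 * eps)

theta, lam = 0.0, 0.0
eta_pi, eta_lam = 0.05, 0.05
for k in range(2000):
    # 1. Fix lambda, ascend the Lagrangian L = R - lam * (C - d) in theta.
    L = lambda th: R(th) - lam * (C(th) - d)
    theta += eta_pi * grad(L, theta)
    # 2. Fix theta, dual update with projection onto lambda >= 0.
    lam = max(0.0, lam + eta_lam * (C(theta) - d))

print(round(theta, 2), round(lam, 2))   # converges near theta = 2, lambda = 2 (constraint active)
```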

Paper

Accelerated Primal-Dual Policy Optimization for Safe Reinforcement Learning. Qingkai Liang, Fanyu Que, Eytan Modiano