# Reinforcement Learning Formula Notes

## Taking a break from stoking the boiler to organize some notes

Posted by LCY on September 2, 2018

## Monte Carlo Methods

$\mathit{E}$: policy evaluation, i.e., estimating action values from sampled episodes

$\mathit{I}$: policy improvement, updating the policy via $\pi_{k+1}(s) = {\arg\max}_a q_k(s,a)$
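The improvement step $\pi_{k+1}(s) = {\arg\max}_a q_k(s,a)$ is a row-wise argmax over a tabular action-value estimate. A minimal NumPy sketch (the `Q` table values are made up for illustration):

```python
import numpy as np

# Hypothetical tabular action-value estimate q_k(s, a): one row per state.
Q = np.array([
    [0.1, 0.5, 0.2],   # state 0
    [0.7, 0.3, 0.0],   # state 1
])

# Policy improvement: pi_{k+1}(s) = argmax_a q_k(s, a)
pi = Q.argmax(axis=1)
print(pi)  # -> [1 0]
```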

### Exploration

• On-policy: the value function is updated using samples generated by the current policy
• The policy is soft: $\pi(a|s)>0$ for all $s\in S, a \in A(s)$
• Off-policy: the value function is updated using samples not generated by the current policy (e.g., from a separate behavior policy)

For an $\epsilon$-greedy policy:

• Probability of each random action: $\frac{\epsilon}{|A(s)|}$
• Probability of the greedy action: $1-\epsilon+\frac{\epsilon}{|A(s)|}$
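These two probabilities define the $\epsilon$-soft policy directly; a minimal sketch (the Q-values and $\epsilon=0.3$ are arbitrary):

```python
import numpy as np

def epsilon_greedy_probs(q_values, epsilon):
    """Return action probabilities under an epsilon-greedy policy.

    Every action gets epsilon / |A(s)|; the greedy action additionally
    gets the remaining 1 - epsilon, i.e. 1 - epsilon + epsilon / |A(s)|.
    """
    n = len(q_values)
    probs = np.full(n, epsilon / n)
    probs[np.argmax(q_values)] += 1.0 - epsilon
    return probs

p = epsilon_greedy_probs([0.2, 0.8, 0.5], epsilon=0.3)
print(p)        # greedy action (index 1) gets 1 - 0.3 + 0.3/3 = 0.8
print(p.sum())  # the probabilities sum to 1
```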


## DQN

### Improve

#### Why we use the log-likelihood instead of the likelihood for a Gaussian distribution

1. It is extremely useful, for example, when you want to calculate the joint likelihood of a set of independent and identically distributed points. Assuming your points are $X = \{x_1, \dots, x_N\}$:

The total likelihood is the product of the likelihoods of the individual points, i.e.:

$$p(X|\Theta) = \prod_{i=1}^{N} p(x_i|\Theta)$$

where $\Theta$ are the model parameters: the mean vector $\mu$ and covariance matrix $\Sigma$. If you use the log-likelihood you end up with a sum instead of a product:

$$\ln p(X|\Theta) = \sum_{i=1}^{N} \ln p(x_i|\Theta)$$

2. Also, in the Gaussian case, it lets you avoid computing the exponential:

$$p(x|\Theta) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)\right)$$

Which becomes:

$$\ln p(x|\Theta) = -\frac{d}{2}\ln(2\pi) - \frac{1}{2}\ln|\Sigma| - \frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)$$

3. As you mentioned, $\ln x$ is a monotonically increasing function, so log-likelihoods preserve the same ordering as the likelihoods:

$$p(x|\Theta_1) > p(x|\Theta_2) \Leftrightarrow \ln p(x|\Theta_1) > \ln p(x|\Theta_2)$$

4. From the standpoint of computational complexity, summing is less expensive than multiplication (although nowadays the two are almost equal). But what is even more important: the likelihoods become very small, so you run out of floating-point precision very quickly, yielding an underflow. That is why it is far more convenient to use the logarithm of the likelihood. Simply try to calculate such a likelihood by hand with a pocket calculator: it is almost impossible.

Additionally, in a classification framework you can simplify the calculation even further. The relations of order remain valid if you drop the division by 2 and the $\frac{d}{2}\ln(2\pi)$ term, because these are class-independent. Also, as one might notice, if the covariance of both classes is the same ($\Sigma_1=\Sigma_2$), then you can remove the $\ln(\det\Sigma)$ term as well.
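Point 4 is easy to verify numerically: the product of many small densities underflows double precision, while the sum of log-densities stays finite. A sketch using a standard-normal density (the sample size of 1000 and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)  # i.i.d. samples from N(0, 1)

# Per-point density of the standard normal.
pdf = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

likelihood = np.prod(pdf)             # underflows to 0.0
log_likelihood = np.sum(np.log(pdf))  # stays finite

print(likelihood)       # 0.0
print(log_likelihood)   # a finite negative number
```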

## Summary of Actor-Critic Algorithms

## Primal-Dual DDPG

#### MDP

$\mathcal{S}$: state space

$\mathcal{A}$: action space

$\mathcal{R}$: reward function $\mathcal{R}:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\mapsto \mathbb{R}$

$\mathcal{P}$: transition probability $\mathcal{P}:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\mapsto [0,1]$, where $P(s'|s,a)$ is the probability of transitioning from state $s$ to state $s'$ under action $a$

$p_0$: initial state distribution $p_0:\mathcal{S}\mapsto [0,1]$

A stationary policy $\pi$ maps each state to a probability distribution over actions; $\pi(a|s)$ is the probability of selecting action $a$ in state $s$.

### Algorithm

1. Fix $\lambda=\lambda^{(k)}$ and perform a policy gradient update: $\theta_{k+1}=\theta_k + \alpha_k\nabla_{\theta}\mathcal{L}(\pi(\theta),\lambda^{(k)})|_{\theta=\theta_k}$
2. Fix $\pi=\pi_k$ and perform a dual update: $\lambda^{(k+1)}=f_k(\lambda^{(k)},\pi_k)$
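The two alternating steps can be sketched on a toy constrained problem: maximize $f(\theta)=-(\theta-2)^2$ subject to $\theta \le 1$, with Lagrangian $\mathcal{L}(\theta,\lambda)=f(\theta)-\lambda(\theta-1)$. Everything here (the objective, constraint, and step sizes) is made up for illustration, and the paper's dual update $f_k$ is replaced by plain projected gradient ascent on $\lambda$:

```python
# Toy constrained problem: maximize f(theta) = -(theta - 2)^2
# subject to c(theta) = theta <= 1.  (Both functions are illustrative.)
# Lagrangian: L(theta, lam) = f(theta) - lam * (theta - 1)
theta, lam = 0.0, 0.0
alpha, beta = 0.05, 0.05   # primal / dual step sizes (arbitrary)

for k in range(2000):
    # 1. Fix lambda, take a gradient step on theta:
    grad_theta = -2.0 * (theta - 2.0) - lam
    theta += alpha * grad_theta
    # 2. Fix theta, dual update: raise lambda while the constraint is
    #    violated, projected back onto lambda >= 0.
    lam = max(0.0, lam + beta * (theta - 1.0))

print(round(theta, 2), round(lam, 2))  # converges near theta = 1, lambda = 2
```

The dual variable grows until the constraint $\theta \le 1$ binds, pulling the primal iterate from the unconstrained optimum $\theta=2$ back to the feasible boundary.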

## Paper

Accelerated Primal-Dual Policy Optimization for Safe Reinforcement Learning. Qingkai Liang, Fanyu Que, Eytan Modiano