KLRL
KL
KL regularization in RL penalizes the new policy for deviating too far from the previous one (a minimal sketch of such a penalized objective follows below).
In the deep RL setting, it is used in:
Trust Region Policy Optimization (TRPO)
Maximum a Posteriori Policy Optimization (MPO)
Entropy-regularized MDP
Azar et al 2012: Dynamic policy programming, Journal of Machine Learning Research 13 (2012) 3207–3245
Haarnoja et al 2018: Soft Actor-Critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor
Kozuno et al 2019: Theoretical analysis of efficiency and robustness of softmax and gap-increasing operators in reinforcement learning
Ziebart et al 2008: Maximum entropy inverse reinforcement learning
The reward function is regularized when the action space is discrete
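A minimal sketch of a KL-penalized policy objective on a discrete action space. The names and signature here are illustrative (not from any specific paper): a policy-gradient term is traded off against a KL penalty that keeps the new policy close to the previous one, which is the common thread behind TRPO/MPO-style updates.

```python
import numpy as np

def kl_categorical(p, q, eps=1e-12):
    """KL(p || q) for two categorical action distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def kl_penalized_objective(logp_actions, advantages, pi_old, pi_new, beta=1.0):
    """Policy-gradient surrogate minus a KL penalty: maximizing this
    improves the policy while keeping pi_new close to pi_old."""
    pg_term = np.mean(advantages * logp_actions)   # standard PG surrogate
    kl_term = np.mean([kl_categorical(po, pn) for po, pn in zip(pi_old, pi_new)])
    return pg_term - beta * kl_term
```

Here beta trades off improvement against staying close to the previous policy; TRPO enforces the same idea as a hard constraint rather than a penalty.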
Drawbacks of model-free RL:
(-) sample complexity
(-) hyperparameter sensitivity or instability
KL used in
- VAE - deep generative model
- t-distributed stochastic neighbor embedding
KL divergence is a measure of how one probability distribution differs from another.
KL can be decomposed into the cross-entropy between P and Q, H(P, Q), minus the entropy of P, H(P): KL(P‖Q) = H(P, Q) − H(P)
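A quick numerical check of that decomposition on toy discrete distributions (numpy; the values of p and q are illustrative only):

```python
import numpy as np

def entropy(p, eps=1e-12):
    return -float(np.sum(p * np.log(p + eps)))

def cross_entropy(p, q, eps=1e-12):
    return -float(np.sum(p * np.log(q + eps)))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])

kl_direct = float(np.sum(p * (np.log(p) - np.log(q))))
kl_decomposed = cross_entropy(p, q) - entropy(p)   # KL(P||Q) = H(P, Q) - H(P)
assert np.isclose(kl_direct, kl_decomposed)
```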
Soft Policy Gradient Method for Maximum Entropy Deep Reinforcement Learning
Shi et al 2019
policy-based -> TRPO <- on-policy learning, expensive, suffers from poor sample complexity
value-based -> DQN <- off-policy, does not interact stably with function approximation, needs a complex approximate sampling procedure in continuous action spaces
actor-critic -> DDPG <- unsettled issues, including how to set the types of policy and exploration noise
max-entropy RL - augments the standard maximum-reward RL objective with an entropy regularization term (a sketch of the augmented return follows the list of advantages below)
Toussaint 2009: Robot trajectory optimization using approximate inference
Todorov 2008: General duality between optimal control and estimation
Advantages of max-entropy RL:
(+) entropy regularization encourages exploration and helps prevent early convergence to suboptimal policies
(+) the resulting policies can serve as a good initialization for finetuning to a more specific behavior
(+) the framework provides a better exploration mechanism for seeking out the best mode in a multimodal reward landscape
(+) the resulting policies are more robust in the face of adversarial perturbations
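A minimal sketch of the entropy-augmented (undiscounted) return for a categorical policy; alpha is the temperature weighting the entropy bonus, and all names are illustrative:

```python
import numpy as np

def entropy_bonus(pi_s, eps=1e-12):
    """H(pi(.|s)) for a categorical policy at one state."""
    pi_s = np.asarray(pi_s, float)
    return -float(np.sum(pi_s * np.log(pi_s + eps)))

def max_ent_return(rewards, policy_dists, alpha=0.1):
    """Entropy-augmented return: sum_t [ r_t + alpha * H(pi(.|s_t)) ].
    alpha -> 0 recovers the standard return; larger alpha rewards stochasticity."""
    return sum(r + alpha * entropy_bonus(pi) for r, pi in zip(rewards, policy_dists))
```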
Taming the Noise in RL via Soft Updates
In noisy domains, in early stages of the learning process, the min or max operator in Q-learning brings about a bias in the estimates. The problem is akin to the “winner’s curse” in auctions
G-learning regularizes the state-action value function with a term that penalizes deterministic policies which diverge from a simple stochastic prior policy [10]
[10] Rubin et al. Trading value and information in MDPs 2012
?? G-learning with an ε-greedy exploration policy is exploration-aware and chooses a less costly exploration policy, thus reducing the costs incurred during the learning process. Such awareness of the cost of exploration is usually attributed to on-policy algorithms, such as SARSA and expected SARSA.
G-learning measures the KL divergence from a fixed prior policy
Relative Entropy Policy Search measures the KL from the empirical distribution generated by the previous policy
G-learning uses soft-greedy policies (a minimal sketch of the soft backup follows below)
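A rough sketch of a KL-regularized ("soft") backup and the resulting soft-greedy policy in the spirit of G-learning; the notation is illustrative rather than the paper's, and the prior rho is assumed to be a simple stochastic policy (e.g., uniform).

```python
import numpy as np

def soft_backup(reward, q_next, prior_next, beta=1.0, gamma=0.99):
    """Replace the hard max over next-state values with a prior-weighted
    log-sum-exp, which penalizes divergence from the prior policy rho.
    beta -> inf recovers the hard max; beta -> 0 averages under the prior."""
    soft_value = (1.0 / beta) * np.log(np.sum(prior_next * np.exp(beta * q_next)))
    return reward + gamma * soft_value

def soft_greedy_policy(q_s, prior_s, beta=1.0):
    """pi(a|s) proportional to rho(a|s) * exp(beta * Q(s,a))."""
    logits = np.log(prior_s) + beta * q_s
    w = np.exp(logits - logits.max())
    return w / w.sum()
```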
Regularization
Regularization is a very common practice in machine learning. In supervised learning it is used to reduce model complexity and helps prevent the model from overfitting the training data. In RL, because data can be generated essentially without limit, regularization is not needed to reduce model complexity; instead it is used to limit the change between policy updates.
KL divergence has received a lot of attention recently because it provides a clear way to measure the statistical divergence between two probability distributions.
KL divergence is not a metric; however, it can still be used to measure how much a distribution has changed.
overfitting <- the model tries too hard to capture the noise in the training data
balancing bias and variance is helpful in understanding overfitting
to avoid overfitting:
- cross validation - helps in estimating the error on held-out data and in deciding what parameters work best for the model (a short sketch follows after this list)
- regularization
in regression, it constrains/regularizes or shrinks the coefficient estimates towards zero
in other words, it discourages learning a more complex or flexible model, so as to avoid the risk of overfitting
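Referring back to the cross-validation bullet above: a minimal scikit-learn sketch of that workflow on synthetic data, using cross-validated scores to pick the regularization strength (here for the ridge penalty discussed next; all data and values are illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression data, only to illustrate the workflow.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, -2.0]) + rng.normal(scale=0.1, size=100)

# Cross-validated score for several regularization strengths.
for alpha in [0.01, 0.1, 1.0, 10.0]:
    score = cross_val_score(Ridge(alpha=alpha), X, y, cv=5).mean()
    print(f"alpha={alpha:>5}: mean CV R^2 = {score:.3f}")
```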
Ridge Regression
Y ≈ β0 + β1x1 + β2x2 + ... + βpxp
min RSS = Σ_{i=1}^{n} (y_i − β0 − Σ_{j=1}^{p} βj x_ij)^2

min RSS + λ Σ_{j=1}^{p} βj^2   <- new objective function
the λ term penalizes the flexibility of the model:
since increased flexibility shows up as larger coefficients,
the new objective constrains the coefficients to be small
Regularization significantly reduces the variance of the model without a substantial increase in its bias, so λ controls the bias-variance trade-off
As λ rises, the coefficient values shrink, reducing the variance (and hence the overfitting)
Regularization helps improve the accuracy of regression models
Regularization is a technique for tuning the fitted function by adding a penalty term to the error function.
The additional term keeps the function from fluctuating excessively, so the coefficients do not take extreme values.
This technique of keeping the coefficient values in check is called shrinkage, or weight decay in the case of neural networks (a closed-form sketch follows below).
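A minimal numpy sketch of the ridge solution in closed form, β = (X'X + λI)^{-1} X'y, on synthetic data (intercept handling omitted; names and values are illustrative):

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge estimate: beta = (X'X + lam*I)^{-1} X'y
    (assumes centered data, so no intercept term)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Larger lam shrinks the coefficients toward zero: lower variance, slightly more bias.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)
for lam in [0.0, 1.0, 100.0]:
    print(lam, np.round(ridge_fit(X, y, lam), 3))
```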
Revisiting the Softmax Bellman Operator
Song et al 2019
the max function in the Bellman equation suggests that the optimal policy should be greedy w.r.t. the Q-values
On the other hand, the trade-off between exploration and exploitation motivates the use of exploratory and potentially sub-optimal actions during learning.
One commonly-used strategy is to add randomness by replacing the max function with the softmax function, as in Boltzmann exploration.
softmax function (numeric sketch after this list):
(+) is a differentiable approximation to the max function, and hence can facilitate analysis
(-) degrades the accuracy of the Q-values
(-) the softmax operator is not a contraction for certain temperature parameters [Littman 1996, page 205]
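A small numeric sketch of the Boltzmann softmax backup versus the hard max (values are illustrative; tau is the temperature):

```python
import numpy as np

def boltzmann_backup(q_values, tau=1.0):
    """Boltzmann 'softmax' backup: a softmax-weighted average of the Q-values.
    tau -> 0 approaches max(q_values); large tau approaches the plain mean."""
    z = np.asarray(q_values, float) / tau
    w = np.exp(z - z.max())           # numerically stable softmax weights
    w /= w.sum()
    return float(np.sum(w * q_values))

q = np.array([1.0, 0.9, -0.5])
print(np.max(q), boltzmann_backup(q, tau=0.1), boltzmann_backup(q, tau=1.0))
```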
Figure: the convenient properties of the softmax operator come at the expense of the accuracy of the resulting value or Q-functions, or the quality of the resulting policies
However, in the deep RL case this turns out not to hold:
the results show that the variants using the softmax operator can achieve higher test scores, and reduce the Q-value overestimation as well as the gradient noise on most of them
Entropy regularizers have been used to smooth policies
Motivations for entropy regularizers include computational convenience, exploration, or robustness
Fox et al 2016: Taming the noise in RL via soft updates
Haarnoja et al 2017: RL with deep energy-based policies
Schulman et al 2017: Equivalence between policy gradients and soft Q learning
Neu et al 2017: A unified view of entropy-regularized MDP
Nachum et al 2017: Bridging the gap between value and policy based RL
Lee et al 2018: Sparse MDPs with causal sparse Tsallis entropy regularization for RL
Asadi and Littman 2017: proposed mellowmax <- a contraction
The experimental results suggested that it can improve exploration, but the possibility that the sub-optimal Bellman operator could, independently of exploration, lead to superior policies was not considered
mellowmax <- proven as a contraction, and has convenient mathematical properties
log-sum-exp goes back to
Todorov 2007: Linearly-solvable Markov decision problems
where the control cost was regularized by a KL on the transition probabilities
Fox et al 2016: G-learning with soft updates
to reduce the bias in the Q-value estimates, by regularizing the cost with a KL on policies
Asadi and Littman 2017: applied log-sum-exp to the on-policy updates
showed that the state-dependent inverse temperature can be numerically computed
log-sum-exp is used in the Boltzmann backup operator for entropy-regularized RL (sketch below)
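A rough sketch contrasting the log-sum-exp soft value with Asadi and Littman's mellowmax (formulas as I recall them; treat this as illustrative rather than authoritative):

```python
import numpy as np

def log_sum_exp_value(q_values, beta=1.0):
    """Soft value used in entropy-regularized backups:
    (1/beta) * log sum_a exp(beta * Q(s,a)); approaches max as beta -> inf."""
    z = beta * np.asarray(q_values, float)
    return float((z.max() + np.log(np.sum(np.exp(z - z.max())))) / beta)

def mellowmax(q_values, omega=1.0):
    """mellowmax: (1/omega) * log( mean_a exp(omega * Q(s,a)) );
    unlike the Boltzmann softmax operator, it is a contraction."""
    n = len(q_values)
    return log_sum_exp_value(q_values, beta=omega) - np.log(n) / omega

q = np.array([1.0, 0.9, -0.5])
print(np.max(q), log_sum_exp_value(q, beta=5.0), mellowmax(q, omega=5.0))
```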
“soft” treatment tackles:
overestimation and noise
REPS
problem:
there is a disconnect between finding an optimal policy and staying close to the observed data
approaches that allow stepping further away from the data are problematic
in particular, off-policy approaches that directly optimize a policy will automatically result in a loss of data, as an improved policy needs to forget experience to avoid the mistakes of the past and to focus on the observed successes
optimization bias problem [Mannor et al 2007]:
however, choosing an improved policy purely based on its return, favors biased solutions that eliminate states in which only bad actions have been tried out
solution:
staying close to the previous policy
policy updates may often result in a loss of essential information due to the policy improvement step
i.e. a policy update that eliminates most exploration by taking the best observed action often yields fast but premature convergence to a suboptimal policy [Kakade 2002]
Constraining the update by δθ’δθ = ε (Euclidean) is problematic; instead the constraint δθ’F(θ)δθ = ε was proposed, where F(θ) is the Fisher information metric (natural policy gradient)
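A minimal numpy sketch of the resulting natural-gradient step under the constraint δθ’F(θ)δθ = ε (the damping term and the names are my own illustrative choices):

```python
import numpy as np

def natural_gradient_step(grad, fisher, epsilon=0.01, damping=1e-3):
    """Step in the direction F^{-1} g, scaled so that the step has
    'length' epsilon under the Fisher metric: d' F d = epsilon."""
    F = fisher + damping * np.eye(fisher.shape[0])   # damping keeps F invertible
    direction = np.linalg.solve(F, grad)             # F^{-1} g
    scale = np.sqrt(epsilon / float(direction @ F @ direction))
    return scale * direction
```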
Max-Ent RL
entropy-regularized MDP
Azar et al 2012: Dynamic policy programming
Haarnoja et al 2018: Soft AC
Kozuno et al 2019: Theoretical analysis of efficiency and robustness of softmax and gap-increasing operators in reinforcement learning
Ziebart et al 2008: Maximum entropy inverse reinforcement learning
Fox et al 2016: G-learning, Taming the noise in reinforcement learning via soft updates
Haarnoja et al 2017: SQL
Schulman et al 2017: Equivalence between policy gradients and soft Q-learning
Neu et al 2017: A unified view of entropy-regularized Markov decision processes
Asadi and Littman 2017: mellow max
Todorov 2007: Linearly-solvable Markov decision problems
where the control cost was regularized by a KL on the transition probabilities
Toussaint 2009: Robot trajectory optimization using approximate inference
Todorov 2008: General duality between optimal control and estimation
Todorov 2009: Efficient computation of optimal actions
Todorov 2006: Linearly-solvable Markov decision problems
Ψ-learning algorithm: same as REPS??
Ψ-learning uses an information-theoretic gap-increasing Bellman operator
Rawlik et al 2010: Approximate inference and stochastic optimal control
Azar et al 2012: Dynamic policy programming
Still and Precup 2012: An information-theoretic approach to curiosity-driven reinforcement learning
A unified view of entropy-regularized Markov decision processes
average-reward RL?
entropy regularization in dynamic programming:
Fox et al 2016: safe exploration
Howard and Matheson 1972, Marcus et al 1997, Ruszczynski 2010: risk-sensitive policies
Ziebart et al 2010, Ziebart 2010, Braun et al 2011: model observed behavior of imperfect decision-makers
entropy regularization in objectives in direct policy search:
Williams and Peng 1991: Function optimization using connectionist reinforcement learning algorithms
Peters et al 2010: REPS
Schulman et al 2015: Trust region policy optimization
Mnih et al 2016: Asynchronous methods for deep reinforcement learning
O’Donoghue et al 2017: PGQ: Combining policy gradient and Q-learning
with the main goal of driving a safe online exploration procedure in an unknown MDP
misc
gap-increasing operator?
Bellemare et al 2015: Increasing the Action Gap: New Operators for Reinforcement Learning
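For the open question above, a rough sketch of the gap-increasing (advantage-learning) backup as I understand it from Bellemare et al. 2015; the signature and names are illustrative, and the exact form should be checked against the paper.

```python
import numpy as np

def advantage_learning_backup(q, s, a, reward, q_next, alpha=0.5, gamma=0.99):
    """Standard Bellman target minus alpha * (V(s) - Q(s,a)): the greedy
    action's target is unchanged, the others are lowered, widening the action gap."""
    bellman_target = reward + gamma * np.max(q_next)        # T Q(s, a)
    gap_penalty = alpha * (np.max(q[s]) - q[s, a])          # alpha * (V(s) - Q(s,a))
    return bellman_target - gap_penalty
```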
References
[1] Leverage the Average: an Analysis of KL Regularization in Reinforcement Learning
Regularization in ML
KL Divergence for ML
Regularization: an important concept in ML
Maximum Entropy Policies in Reinforcement Learning & Everyday Life
Learning Diverse Skills via Maximum Entropy Deep Reinforcement Learning