KL

KL regularization in RL penalizes the new policy for moving too far from the previous one (see the sketch below)

In the deep RL setting, it is used in
Trust Region Policy Optimization (TRPO)
Maximum a Posteriori Policy Optimization (MPO)
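
A minimal sketch of the idea (the helper names and the penalty weight beta are my own, not taken from TRPO/MPO): the per-state KL between the new and old discrete action distributions is subtracted, as a penalty, from the average return.

    import numpy as np

    def kl_discrete(p, q, eps=1e-12):
        """KL(p || q) between two discrete action distributions."""
        p, q = np.asarray(p, float), np.asarray(q, float)
        return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

    def kl_penalized_objective(returns, new_probs, old_probs, beta=0.1):
        """Average return minus a KL penalty keeping pi_new close to pi_old."""
        kl = np.mean([kl_discrete(p_new, p_old)
                      for p_new, p_old in zip(new_probs, old_probs)])
        return float(np.mean(returns) - beta * kl)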

Entropy-regularized MDP

Azar et al 2012: Dynamic policy programming, Journal of Machine Learning Research 13 (2012) 3207–3245
Haarnoja et al 2018: Soft Actor-Critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor
Kozuno et al 2019: Theoretical analysis of efficiency and robustness of softmax and gap-increasing operators in reinforcement learning
Ziebart et al 2008: Maximum entropy inverse reinforcement learning

The reward function is regularized when the action space is discrete

Drawbacks of model-free RL:
(-) sample complexity
(-) hyperparameter sensitivity or instability

KL used in

  1. VAE - deep generative model
  2. t-distributed stochastic neighbor embedding

KL is a measure of how one probability distribution differs from another.

KL can be decomposed into the cross entropy between P and Q, denoted H(P, Q), minus the entropy of P, H(P): D_KL(P || Q) = H(P, Q) - H(P)
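
A quick numeric check of that decomposition (a minimal NumPy sketch):

    import numpy as np

    p = np.array([0.7, 0.2, 0.1])
    q = np.array([0.5, 0.3, 0.2])

    cross_entropy = -np.sum(p * np.log(q))    # H(P, Q)
    entropy       = -np.sum(p * np.log(p))    # H(P)
    kl            =  np.sum(p * np.log(p / q))

    assert np.isclose(kl, cross_entropy - entropy)   # D_KL(P||Q) = H(P,Q) - H(P)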

Soft Policy Gradient Method for Maximum Entropy Deep Reinforcement Learning

Shi et al 2019

policy-based -> TRPO <- on-policy learning, expensive, suffers from poor sample complexity
value-based -> DQN <- off-policy, but interacts unstably with function approximation and needs a complex approximate sampling procedure in continuous action spaces
actor-critic -> DDPG <- unsettled issues, including how to choose the type of policy and the exploration noise

max-entropy RL - augments the standard maximum-reward RL objective with an entropy regularization term (see the objective below)

Toussaint 2009: Robot trajectory optimization using approximate inference
Todorov 2008: General duality between optimal control and estimation
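
A common form of the entropy-augmented objective (the notation, including the temperature α weighting the entropy bonus, is mine):

    J(π) = E_π [ Σ_t ( r(s_t, a_t) + α · H(π(·|s_t)) ) ]

Setting α = 0 recovers the standard expected-return objective.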

Advantages of max-entropy RL:
(+) entropy regularization encourages exploration and helps prevent early convergence to suboptimal policies
(+) the resulting policies can serve as a good initialization for finetuning to a more specific behavior
(+) the framework provides a better exploration mechanism for seeking out the best mode in a multimodal reward landscape
(+) the resulting policies are more robust in the face of adversarial perturbations

Taming the Noise in RL via Soft Updates

In noisy domains, in the early stages of learning, the min or max operator in Q-learning introduces a bias in the estimates. The problem is akin to the “winner’s curse” in auctions.
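
A tiny simulation of that bias (my own illustration): even when every true action value is zero, the max over noisy estimates is positively biased.

    import numpy as np

    rng = np.random.default_rng(0)
    n_actions, n_trials, noise_std = 10, 10_000, 1.0

    # True Q-values are all zero; the estimates are corrupted by zero-mean noise.
    noisy_q = rng.normal(0.0, noise_std, size=(n_trials, n_actions))

    print(noisy_q.max(axis=1).mean())   # clearly > 0: the "winner's curse" bias
    print(noisy_q.mean())               # ~0: individual estimates are unbiased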

G-learning regularizes the state-action value function with a term that penalizes deterministic policies which diverge from a simple stochastic prior policy [10]

[10] Rubin et al. Trading value and information in MDPs 2012

?? G-learning with an ε-greedy exploration policy is exploration-aware and chooses a less costly exploration policy, thus reducing the costs incurred during learning. Such awareness of the cost of exploration is usually attributed to on-policy algorithms, such as SARSA and Expected SARSA.

G-learning measures the KL divergence from a fixed prior policy
Relative Entropy Policy Search measures the KL from the empirical distribution generated by the previous policy

G-learning uses soft-greedy policies
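
A minimal sketch of a soft-greedy policy in this spirit (my own formulation, not the exact G-learning update): the prior ρ(a|s) is tilted by exponentiated Q-values, and the greedy policy is recovered as the temperature goes to zero.

    import numpy as np

    def soft_greedy(q_values, prior, temperature=1.0):
        """pi(a|s) proportional to prior(a|s) * exp(Q(s,a) / temperature)."""
        logits = np.log(np.asarray(prior, float)) + np.asarray(q_values, float) / temperature
        logits -= logits.max()              # for numerical stability
        weights = np.exp(logits)
        return weights / weights.sum()

    print(soft_greedy([1.0, 0.5, 0.0], prior=[1/3, 1/3, 1/3], temperature=0.1))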

Regularization

Regularization is a very common practice in machine learning. In supervised learning it is used to reduce model complexity and helps prevent the model from overfitting the training data. In RL, because data can be generated essentially without limit, regularization is not needed to reduce model complexity; instead, it is used to limit the change between policy updates.

KL divergence has received a lot of attention recently because it gives a principled way to measure the statistical divergence between two probability distributions.

KL divergence is not a metric; however, it can still be used to measure how much a statistical distribution has changed.

overfitting <- the model tries too hard to capture the noise in the training data

balancing bias and variance is helpful in understanding overfitting

to avoid overfitting:

  1. cross validation - helps in estimating the error over test set and in deciding what parameters work best for the model
  2. regularization

in regression, it constrains/regularizes or shrinks the coefficient estimates towards zero
in other words, it discourages learning a more complex or flexible model, so as to avoid the risk of overfitting

Ridge Regression

Y ≈ β0 + β1 x1 + β2 x2 + ... + βp xp

RSS = Σ_{i=1}^{n} ( y_i - β0 - Σ_{j=1}^{p} βj x_ij )^2

new obj func:  minimize  RSS + λ Σ_{j=1}^{p} βj^2
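
A minimal NumPy sketch of that objective via the ridge closed-form solution (my own code; the intercept is left unpenalized by centering the data first):

    import numpy as np

    def ridge_fit(X, y, lam=1.0):
        """Closed-form ridge: beta = (X'X + lam*I)^-1 X'y on centered data."""
        X_mean, y_mean = X.mean(axis=0), y.mean()
        Xc, yc = X - X_mean, y - y_mean
        beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)
        beta0 = y_mean - X_mean @ beta
        return beta0, beta

    # Larger lam shrinks the coefficients towards zero.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)
    print(ridge_fit(X, y, lam=10.0))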

The λ-term controls the flexibility of the model: since increased flexibility shows up as larger coefficients, the new objective constrains the coefficients to stay small.

Regularization significantly reduces the variance of the model without a substantial increase in its bias, so λ controls the trade-off between bias and variance

As λ rises, the coefficients shrink, which reduces the variance (avoiding overfitting)

Regularization helps improve the accuracy of regression models

Regularization is a technique for tuning the function by adding a penalty term to the error function.
The additional term controls excessive fluctuation of the function so that the coefficients don’t take extreme values.
Techniques that keep a check on, or reduce, the magnitude of the coefficients are called shrinkage methods, or weight decay in the case of neural networks.

Revisiting the Softmax Bellman Operator

Song et al 2019

The max function in the Bellman equation suggests that the optimal policy should be greedy with respect to the Q-values. On the other hand, the trade-off between exploration and exploitation motivates the use of exploratory and potentially sub-optimal actions during learning.
One commonly used strategy is to add randomness by replacing the max function with the softmax function, as in Boltzmann exploration (sketched after the list below).

softmax function:

(+) is a differentiable approximation to the max function, and hence can facilitate analysis
(-) weakens the accuracy of the Q-values
(-) the softmax operator is not a contraction for certain temperature parameters [Littman 1996, page 205]
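
A minimal sketch of the two backups on a vector of Q-values (the temperature tau is my notation): the Boltzmann softmax backup is a weighted average of the Q-values and approaches the max backup as tau shrinks.

    import numpy as np

    def max_backup(q):
        return float(np.max(q))

    def softmax_backup(q, tau=1.0):
        """Boltzmann-weighted average of Q-values: sum_a softmax(q/tau)_a * q_a."""
        q = np.asarray(q, float)
        w = np.exp((q - q.max()) / tau)
        w /= w.sum()
        return float(w @ q)

    q = np.array([1.0, 0.9, -2.0])
    print(max_backup(q), softmax_backup(q, tau=1.0), softmax_backup(q, tau=0.01))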

Image: the convenient properties of the softmax operator would come at the expense of the accuracy of the resulting value or Q-functions, or the quality of the resulting policies

However, in the deep RL case this turns out to be incorrect:
the results show that the variants using the softmax operator can achieve higher test scores, and reduce both the Q-value overestimation and the gradient noise, on most of them

Entropy regularizers have been used to smooth policies
The motivations for entropy regularizers are computational convenience, exploration, or robustness

Fox et al 2016: Taming the noise in RL via soft updates
Haarnoja et al 2017: RL with deep energy-based policies
Schulman et al 2017: Equivalence between policy gradients and soft Q learning
Neu et al 2017: A unified view of entropy-regularized MDP
Nachum et al 2017: Bridging the gap between value and policy based RL
Lee et al 2018: Sparse MDPs with causal sparse Tsallis entropy regularization for RL

Asadi and Littman 2017: proposed mellowmax <- a contraction
The experimental results suggested that it can improve exploration, but the possibility that the sub-optimal Bellman operator could, independently of exploration, lead to superior policies was not considered

mellowmax <- proven to be a contraction, with convenient mathematical properties
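
Mellowmax has the form mm_ω(x) = log((1/n) Σ_i exp(ω x_i)) / ω; a minimal sketch:

    import numpy as np

    def mellowmax(x, omega=5.0):
        """mm_omega(x) = log(mean(exp(omega * x))) / omega."""
        x = np.asarray(x, float)
        m = x.max()                 # shift for numerical stability; cancels analytically
        return float(m + np.log(np.mean(np.exp(omega * (x - m)))) / omega)

    q = np.array([1.0, 0.9, -2.0])
    print(mellowmax(q, omega=1.0), mellowmax(q, omega=50.0))   # -> max(q) as omega grows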

log-sum-exp goes back to

Todorov 2007: Linearly-solvable Markov decision problems
where the control cost was regularized by a KL on the transition probabilities

Fox et al 2016: G-learning with soft updates
reduces the bias in the Q-value estimates by regularizing the cost with a KL on policies

Asadi and Littman 2017: applied log-sum-exp to the on-policy updates
and showed that the state-dependent inverse temperature can be computed numerically

log-sum-exp is used in the Boltzmann backup operator for entropy-regularized RL
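
A minimal sketch of that soft value (my notation, with temperature tau): V(s) = tau * log Σ_a exp(Q(s,a)/tau), which upper-bounds max_a Q(s,a) and converges to it as tau -> 0.

    import numpy as np

    def soft_value(q, tau=1.0):
        """Entropy-regularized state value: tau * log sum_a exp(Q(s,a) / tau)."""
        q = np.asarray(q, float)
        m = q.max()                 # shift for numerical stability; cancels analytically
        return float(m + tau * np.log(np.sum(np.exp((q - m) / tau))))

    q = np.array([1.0, 0.9, -2.0])
    print(soft_value(q, tau=1.0), soft_value(q, tau=0.01))   # second value ~ max(q)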

“soft” treatment tackles:
overestimation and noise

REPS

problem:
there is a disconnect between finding an optimal policy and staying close to the observed data

approaches that allow stepping further away from the data are problematic
in particular, off-policy approaches that directly optimize a policy automatically result in a loss of data, as an improved policy needs to forget experience to avoid the mistakes of the past and to focus on the observed successes

optimization bias problem [Mannor et al 2007]:
choosing an improved policy purely based on its return favors biased solutions that eliminate states in which only bad actions have been tried out

solution:
staying close to the previous policy

policy updates may often result in a loss of essential information due to the policy improvement step

i.e. a policy update that eliminates most exploration by taking the best observed action often yields fast but premature convergence to a suboptimal policy [Kakade 2002]

The constraint δθ’δθ = ε is problematic, so δθ’F(θ)δθ = ε is proposed instead, where F(θ) is the Fisher information metric
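
A minimal sketch of a step under that constraint (my own code, for a small parameter vector): the step direction is F(θ)^-1 ∇J, rescaled so that δθ' F(θ) δθ = ε.

    import numpy as np

    def natural_gradient_step(grad, fisher, eps=0.01):
        """Step delta in the direction F^-1 grad, scaled so delta' F delta = eps."""
        direction = np.linalg.solve(fisher, grad)         # F^-1 * grad
        scale = np.sqrt(eps / (direction @ fisher @ direction))
        return scale * direction

    grad = np.array([0.5, -0.2])
    fisher = np.array([[2.0, 0.3], [0.3, 1.0]])           # Fisher information estimate
    print(natural_gradient_step(grad, fisher, eps=0.01))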

Max-Ent RL

entropy-regularized MDP

Azar et al 2012: Dynamic policy programming
Haarnoja et al 2018: Soft AC
Kozuno et al 2019: Theoretical analysis of efficiency and robustness of softmax and gap-increasing operators in reinforcement learning
Ziebart et al 2008: Maximum entropy inverse reinforcement learning

Fox et al 2016: G-learning, Taming the noise in reinforcement learning via soft updates
Haarnoja et al 2017: SQL
Schulman et al 2017: Equivalence between policy gradients and soft Q-learning
Neu et al 2017: A unified view of entropy-regularized Markov decision processes
Asadi and Littman 2017: mellow max

Todorov 2007: Linearly-solvable Markov decision problems
where the control cost was regularized by a KL on the transition probabilities

Toussaint 2009: Robot trajectory optimization using approximate inference
Todorov 2008: General duality between optimal control and estimation

Todorov 2009: Efficient computation of optimal actions

Todorov 2006: Linearly-solvable Markov decision problems

Ψ-learning algorithm: same as REPS??
Ψ-learning is an information-theoretic gap-increasing Bellman operator
Rawlik et al 2010: Approximate inference and stochastic optimal control
Azar et al 2012: Dynamic policy programming

Still and Precup 2012: An information-theoretic approach to curiosity-driven reinforcement learning

A unified view of entropy-regularized Markov decision processes

average-reward RL?

entropy regularization in dynamic programming:
Fox et al 2016: safe exploration
Howard and Matheson 1972, Marcus et al 1997, Ruszczyński 2010: risk-sensitive policies
Ziebart et al 2010, Ziebart 2010, Braun et al 2011: model observed behavior of imperfect decision-makers

entropy regularization of objectives in direct policy search:
Williams and Peng 1991: Function optimization using connectionist reinforcement learning algorithms
Peters et al 2010: REPS
Schulman et al 2015: Trust region policy optimization
Mnih et al 2016: Asynchronous methods for deep reinforcement learning
O’Donoghue et al 2017: PGQ: Combining policy gradient and Q-learning

with the main goal of driving a safe online exploration procedure in an unknown MDP

misc

gap-increasing operator?

Bellemare et al 2015: Increasing the Action Gap: New Operators for Reinforcement Learning
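
A minimal sketch of one operator in that spirit, an advantage-learning style backup that subtracts a fraction of the action gap V(s) − Q(s,a) from the standard Bellman target (α and the variable names are my own):

    import numpy as np

    def advantage_learning_backup(q, reward, s, a, s_next, gamma=0.99, alpha=0.5):
        """Bellman target minus alpha * (V(s) - Q(s,a)), which widens the action gap."""
        bellman_target = reward + gamma * np.max(q[s_next])
        gap = np.max(q[s]) - q[s, a]        # action gap at the current state
        return bellman_target - alpha * gap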

Reference

[1] Leverage the Average: an Analysis of KL Regularization in Reinforcement Learning

Regularization in ML
KL Divergence for ML Regularization: an important concept in ML

Maximum Entropy Policies in Reinforcement Learning & Everyday Life
Learning Diverse Skills via Maximum Entropy Deep Reinforcement Learning