Operators in RL
Background
There is a fundamental tension in decision making between choosing the action with the highest expected utility and avoiding ‘starving’ the other actions.
Related context:
- exploration-exploitation dilemma [Thrun 1992]
- non-stationary decision problems [Sutton 1990]
- when interpreting observed decisions [Baker et al 2007]
The softmax operator can be used for
- value-function optimization
- action selection policies
Common operators (a small NumPy sketch follows this list):
max(X) = max_{i=1..n} x_i
mean(X) = (1/n) Σ_{i=1..n} x_i
eps_ε (X) = ε mean(X) + (1-ε) max(X)
boltz_β(X) = (Σ_{i=1..n} x_i exp(βx_i)) / (Σ_{i=1..n} exp(βx_i))
β -> ∞, boltz_β (X) -> max (X)
β -> 0, boltz_β (X) -> mean (X)
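A minimal NumPy sketch of the four operators above (the function names and the stability shift in op_boltz are my own, not from the paper):

import numpy as np

def op_max(x):
    # max(X): value of the largest element
    return float(np.max(x))

def op_mean(x):
    # mean(X): uniform average
    return float(np.mean(x))

def op_eps(x, eps):
    # eps_ε(X) = ε mean(X) + (1-ε) max(X)
    return eps * float(np.mean(x)) + (1 - eps) * float(np.max(x))

def op_boltz(x, beta):
    # boltz_β(X): Boltzmann-weighted average of X
    x = np.asarray(x, dtype=float)
    w = np.exp(beta * (x - np.max(x)))   # shift exponent to avoid overflow
    return float(np.sum(x * w) / np.sum(w))

For large β, op_boltz approaches op_max; for β near 0 it approaches op_mean, matching the limits above.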
SARSA is known to converge
- in the tabular setting using ε-greedy [Littman & Szepesvari 1996]
- under decreasing exploration [Singh et al 2000]
- to a region in the function approximation setting [Gordon 2001]
Bellman optimality equation for Q*:
Q*(s,a) = r(s,a) + γ Σ_{s'∈S} p(s'|s,a) max_{a'} Q*(s',a')
Regardless of the initial value of Q^, iterating the following update converges to Q*:
Q^(s,a) <- r(s,a) + γ Σ_{s'∈S} p(s'|s,a) max_{a'} Q^(s',a')
Generalized Value Iteration (GVI) [Littman & Szepesvari 1996]: replace the max operator with a generalized operator ⨂:
Q^(s,a) <- r(s,a) + γ Σ_{s'∈S} p(s'|s,a) ⨂_{a'} Q^(s',a')
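A minimal sketch of GVI on a tabular MDP with a pluggable operator ⨂ (the array layout, names, and tolerance below are my own assumptions, not from the paper):

import numpy as np

def gvi(r, p, gamma, op, tol=1e-8, max_iters=100_000):
    # r: rewards, shape (S, A); p: transitions, shape (S, A, S) with p[s, a, s'] = p(s'|s, a)
    # op: operator applied to the action values of each next state,
    #     e.g. np.max for standard value iteration
    S, A = r.shape
    Q = np.zeros((S, A))
    for _ in range(max_iters):
        v = np.array([op(Q[s2]) for s2 in range(S)])   # ⨂ over a' at each s'
        Q_new = r + gamma * (p @ v)                    # generalized Bellman backup
        if np.max(np.abs(Q_new - Q)) < tol:
            break
        Q = Q_new
    return Q

Passing op = np.max recovers standard value iteration; op_mean, op_eps, or a mellowmax operator can be plugged in instead, but a unique fixed point is only guaranteed when ⨂ is a non-expansion (next point).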
Convergence of GVI to a unique fixed point follows if the operator ⨂ is a non-expansion w.r.t. the infinity norm:
|⨂_a Q^(s,a) - ⨂_a Q^'(s,a)| ≤ max_a |Q^(s,a) - Q^'(s,a)|
The max, mean, and eps operators are all non-expansions, so each of them can play the role of ⨂ in GVI, yielding convergence to the corresponding unique fixed point.
The boltz operator is not a non-expansion, so GVI under boltz_β can have multiple fixed points, which ultimately leads to misbehavior in learning and planning (a quick numeric check follows below).
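A quick numeric check of the non-expansion bound for boltz_β (the vectors and β below are an illustrative choice of mine, not the counterexample from the paper):

import numpy as np

def boltz(x, beta):
    # boltz_β(X) = Σ x_i exp(βx_i) / Σ exp(βx_i)
    x = np.asarray(x, dtype=float)
    w = np.exp(beta * (x - np.max(x)))
    return float(np.sum(x * w) / np.sum(w))

x = np.array([0.0, 1.0])
y = np.array([-0.1, 1.1])
lhs = abs(boltz(x, 5.0) - boltz(y, 5.0))   # ≈ 0.104
rhs = float(np.max(np.abs(x - y)))         # = 0.1
print(lhs <= rhs)                          # False: the non-expansion bound fails

Because the bound can fail, the GVI fixed-point argument does not apply to boltz_β.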
Mellowmax (MM), an alternative softmax operator:
MM_w(X) = log((1/n) Σ_{i=1..n} exp(w x_i)) / w
w -> ∞, MM_w (X) -> max (X)
w -> -∞, MM_w (X) -> min (X)
w -> 0, MM_w (X) -> mean (X)
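A numerically stable sketch of MM_w (naming is mine; the log-sum-exp shift is a standard trick, not something the paper spells out; MM_w is undefined at w = 0 itself):

import numpy as np

def mellowmax(x, w):
    # MM_w(X) = log((1/n) Σ exp(w x_i)) / w, for w != 0
    x = np.asarray(x, dtype=float)
    c = w * np.max(x) if w > 0 else w * np.min(x)   # shift exponent to avoid overflow
    return float((c + np.log(np.mean(np.exp(w * x - c)))) / w)

q = np.array([0.1, 0.5, 0.4])
print(mellowmax(q, 1e2))    # ≈ max(q)  (w -> ∞)
print(mellowmax(q, -1e2))   # ≈ min(q)  (w -> -∞)
print(mellowmax(q, 1e-6))   # ≈ mean(q) (w -> 0)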
MM can be derived from information-theoretic principles as a way of regularizing policies with a cost function defined by KL divergence [Todorov 2006] [Rubin et al 2012] [Fox et al 2016]
MM is a non-expansion w.r.t. the infinity norm, therefore GVI with MM converges to a unique fixed point (proof sketch for w > 0 below)
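Proof sketch for w > 0 (my reconstruction of the standard log-sum-exp argument; see the paper for the full proof): assume MM_w(X) ≥ MM_w(Y) and let Δ = max_i |x_i - y_i|. Then
MM_w(X) - MM_w(Y) = (1/w) log( Σ_i exp(wx_i) / Σ_i exp(wy_i) )
                  ≤ (1/w) log( Σ_i exp(w(y_i + Δ)) / Σ_i exp(wy_i) )    (since x_i ≤ y_i + Δ and w > 0)
                  = (1/w) log exp(wΔ)
                  = Δ = max_i |x_i - y_i|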
Revisiting the Softmax Bellman Operator
References
[1] Asadi, Kavosh, and Michael L. Littman. “An alternative softmax operator for reinforcement learning.” Proceedings of the 34th International Conference on Machine Learning, Volume 70. JMLR.org, 2017.
Related context:
[Thrun 1992]
[Sutton 1990]
[Baker et al 2007]
SARSA convergence:
[Littman & Szepesvari 1996]
[Singh et al 2000]
[Gordon 2001]
SARSA variant convergence:
[Perkins & Precup 2002]
[Baird & Moore 1999]
[Van Seijen et al 2009]
[2] Song, Zhao, Ron Parr, and Lawrence Carin. “Revisiting the softmax bellman operator: New benefits and new perspective.” International Conference on Machine Learning. 2019.