Background

There is a fundamental tension in decision making between choosing the action with the highest expected utility and avoiding 'starving' the other actions.

Related context:

  • exploration-exploitation dilemma [Thrun 1992]
  • non-stationary decision problems [Sutton 1990]
  • interpreting observed decisions [Baker et al 2007]

A softmax operator can be used for:

  • value-function optimization
  • action selection policies

Common operators:

max (X) = max_{i=1..n} x_i

mean (X) = (1/n) Σ_{i=1}^{n} x_i

eps_ε (X) = ε mean(X) + (1-ε) max(X)

boltz_β (X) = ( Σ_{i=1}^{n} x_i exp(βx_i) ) / ( Σ_{i=1}^{n} exp(βx_i) )

β -> ∞, boltz_β (X) -> max (X)
β -> 0, boltz_β (X) -> mean (X)
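
As a quick sanity check on these definitions and limits, here is a minimal numpy sketch (the function names `op_eps` and `op_boltz` are mine, not from the paper):

```python
import numpy as np

def op_eps(x, eps):
    # eps_ε(X) = ε·mean(X) + (1-ε)·max(X)
    return eps * np.mean(x) + (1 - eps) * np.max(x)

def op_boltz(x, beta):
    # boltz_β(X): Boltzmann-weighted average of the values themselves
    x = np.asarray(x, dtype=float)
    w = np.exp(beta * (x - np.max(x)))   # shift by max for numerical stability
    return np.dot(x, w) / np.sum(w)

x = [0.1, 0.2, 0.5]
print(op_boltz(x, 1e3))    # ≈ 0.5    (large β behaves like max)
print(op_boltz(x, 1e-3))   # ≈ 0.2667 (small β behaves like mean)
print(op_eps(x, 0.1))      # 0.1·mean + 0.9·max ≈ 0.4767
```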

SARSA is known to converge

  • in the tabular setting using ε-greedy [Littman & Szepesvari 1996]
  • under decreasing exploration [Singh et al 2000]
  • to a region in the function approximation setting [Gordon 2001]

Optimal Bellman Q:

Q*(s,a) = r(s,a) + γ Σ_{s'∈S} p(s'|s,a) max_{a'} Q*(s',a')

Regardless of the initial value of Q^, the update below converges to Q*:

Q^(s,a) <- r(s,a) + γ Σ_{s'∈S} p(s'|s,a) max_{a'} Q^(s',a')
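
A minimal tabular sketch of this update (names and array shapes are my own assumptions; `r` is an S×A reward table and `p` an S×A×S transition tensor):

```python
import numpy as np

def q_iteration(r, p, gamma, iters=1000):
    # Repeatedly apply Q(s,a) <- r(s,a) + γ Σ_{s'} p(s'|s,a) max_{a'} Q(s',a')
    num_s, num_a = r.shape
    q = np.zeros((num_s, num_a))                     # any initialization converges to Q*
    for _ in range(iters):
        v = q.max(axis=1)                            # max over a' for each next state s'
        q = r + gamma * np.einsum('ijk,k->ij', p, v)
    return q
```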

Generalized Value Iteration (GVI) [Littman & Szepesvari 1996]: replace the max operator with a generic operator ⨂:

Q^(s,a) <- r(s,a) + γ Σ_{s'∈S} p(s'|s,a) ⨂_{a'} Q^(s',a')
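
The same sketch with the max swapped for a pluggable operator ⨂ (again, names and shapes are assumptions; `op` maps a vector of action values to a scalar):

```python
import numpy as np

def gvi(r, p, gamma, op, iters=1000):
    # Generalized value iteration: Q(s,a) <- r(s,a) + γ Σ_{s'} p(s'|s,a) ⨂_{a'} Q(s',a')
    num_s, num_a = r.shape
    q = np.zeros((num_s, num_a))
    for _ in range(iters):
        v = np.array([op(q[s]) for s in range(num_s)])   # ⨂ over a' for each state
        q = r + gamma * np.einsum('ijk,k->ij', p, v)
    return q

# op=np.max recovers standard Q-iteration; op=np.mean (or the eps_ε blend) also works,
# since those operators are non-expansions (see below).
```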

Convergence of GVI to a unique fixed point follows if the operator ⨂ is a non-expansion w.r.t. the infinity norm:

|⨂_{a} Q^(s,a) - ⨂_{a} Q^'(s,a)| ≤ max_{a} |Q^(s,a) - Q^'(s,a)|

The max, mean, and eps_ε operators are non-expansions, so each of them can play the role of ⨂ in GVI, resulting in convergence to the corresponding unique fixed point.

The boltz operator is not a non-expansion: [1] exhibits a simple MDP in which GVI under boltz_β has multiple fixed points, which ultimately leads to misbehavior in learning and planning.
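
A quick numerical illustration of the failure (this particular counterexample is mine, not the MDP from [1]): with β = 10 the gap between the outputs exceeds the gap between the inputs, violating the non-expansion inequality above.

```python
import numpy as np

def boltz(x, beta):
    x = np.asarray(x, dtype=float)
    w = np.exp(beta * (x - np.max(x)))
    return np.dot(x, w) / np.sum(w)

x, y, beta = np.array([0.0, 0.1]), np.array([0.0, 0.2]), 10.0
print(abs(boltz(x, beta) - boltz(y, beta)))   # ≈ 0.103
print(np.max(np.abs(x - y)))                  # 0.1  -> output gap exceeds input gap
```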

The mellowmax (MM) operator, an alternative softmax operator:

MM_w (X) = log( (1/n) Σ_{i=1}^{n} exp(w x_i) ) / w

w -> ∞, MM_w (X) -> max (X)
w -> -∞, MM_w (X) -> min (X)
w -> 0, MM_w (X) -> mean (X)
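
A numerically stable sketch of MM_w via the usual log-sum-exp shift (the function name is mine; the w = 0 case is handled separately as the mean(X) limit):

```python
import numpy as np

def mellowmax(x, w):
    # MM_w(X) = log( (1/n) Σ_i exp(w·x_i) ) / w, computed with a max shift
    x = np.asarray(x, dtype=float)
    if w == 0:
        return np.mean(x)                    # the w -> 0 limit
    c = np.max(w * x)
    return (c + np.log(np.mean(np.exp(w * x - c)))) / w

x = [0.1, 0.2, 0.5]
print(mellowmax(x, 1e3))     # ≈ 0.5    (w -> ∞ gives max)
print(mellowmax(x, -1e3))    # ≈ 0.1    (w -> -∞ gives min)
print(mellowmax(x, 1e-6))    # ≈ 0.2667 (w -> 0 gives mean)
```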

MM can be derived from information-theoretic principles as a way of regularizing policies with a cost function defined by KL divergence [Todorov 2006] [Rubin et al 2012] [Fox et al 2016].

MM satisfies the non-expansion property under the infinity norm, and therefore GVI with MM converges to a unique fixed point.
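
A sketch of the non-expansion argument for w > 0 (the w < 0 case follows from the same steps with the inequalities flipped):

```latex
\[
\text{Let } \Delta = \max_i |x_i - y_i|, \text{ so } x_i \le y_i + \Delta \text{ for all } i.
\text{ Then}
\]
\[
\mathrm{MM}_w(X)
  = \frac{1}{w}\log\Big(\frac{1}{n}\sum_{i=1}^{n} e^{w x_i}\Big)
  \le \frac{1}{w}\log\Big(\frac{1}{n}\sum_{i=1}^{n} e^{w (y_i + \Delta)}\Big)
  = \mathrm{MM}_w(Y) + \Delta,
\]
\[
\text{and by symmetry } \mathrm{MM}_w(Y) \le \mathrm{MM}_w(X) + \Delta,
\text{ hence } |\mathrm{MM}_w(X) - \mathrm{MM}_w(Y)| \le \max_i |x_i - y_i|.
\]
```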

Revisiting the Softmax Operator [2]

References

[1] Asadi, Kavosh, and Michael L. Littman. “An alternative softmax operator for reinforcement learning.” Proceedings of the 34th International Conference on Machine Learning, Volume 70. JMLR.org, 2017.

[Thrun 1992]
[Sutton 1990]
[Baker et al 2007]

SARSA convergence:
[Littman & Szepesvari 1996]
[Singh et al 2000]
[Gordon 2001]

SARSA variant convergence:
[Perkins & Precup 2002]
[Baird & Moore 1999]
[Van Seijen et al 2009]

[2] Song, Zhao, Ron Parr, and Lawrence Carin. “Revisiting the softmax Bellman operator: New benefits and new perspective.” International Conference on Machine Learning, 2019.