Dynamic Policy Programming
Intro
A control problem can be solved by finding the optimal policy via dynamic programming (DP)
DP estimates a measure of state-action value, e.g. the optimal action-value function Q*(x,a), through the Bellman equation (see the value-iteration sketch after this list)
(-) for high-dimensional discrete systems, or for continuous systems, computing the value function by DP is intractable
A common workaround is to use function approximation or Monte-Carlo sampling
DPP [1]: forces an incremental change between two consecutive policies, so performance depends on the accumulated (averaged) approximation errors of all previous iterations rather than only on the error of the current iteration (a sketch of such an update follows this list)
The proposed setting differs from standard RL: we rely on a generative model from which samples can be drawn, i.e. the agent has full control over which state-action pairs it queries for samples
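To make the DP baseline above concrete, here is a minimal sketch of tabular Q-value iteration on a tiny synthetic MDP. Everything in it (the sizes, the random P and r, the tolerance) is invented purely for illustration; it just applies the Bellman optimality backup until convergence, which is exactly the computation that becomes intractable for huge or continuous X.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny synthetic MDP, invented purely for illustration.
n_states, n_actions, gamma = 5, 3, 0.95
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)                   # P[x, a, y] = P(y | x, a)
r = rng.uniform(-1.0, 1.0, (n_states, n_actions))   # bounded reward, |r(x,a)| <= Rmax = 1

def bellman_optimality_backup(Q):
    """(T Q)(x,a) = r(x,a) + gamma * E_{y ~ P(.|x,a)}[max_b Q(y,b)]."""
    return r + gamma * P @ Q.max(axis=1)

# Q-value iteration: T is a gamma-contraction, so iterating it converges to Q*.
Q = np.zeros((n_states, n_actions))
for _ in range(10_000):
    Q_next = bellman_optimality_backup(Q)
    if np.max(np.abs(Q_next - Q)) < 1e-10:
        Q = Q_next
        break
    Q = Q_next

print("greedy (deterministic) optimal policy:", Q.argmax(axis=1))
print("V*(x) = max_a Q*(x,a):", np.round(Q.max(axis=1), 3))
```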
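The second sketch tries to convey the incremental flavor of DPP: action preferences Psi are updated by small increments and the policy is a Boltzmann (softmax) distribution over them, so consecutive policies change only gradually. The specific update and the inverse temperature eta below reflect my reading of [1] and are an illustrative approximation, not a faithful reproduction of the paper's algorithm; the toy MDP is the same seeded one as in the previous sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same seeded toy MDP as in the value-iteration sketch above.
n_states, n_actions, gamma, eta = 5, 3, 0.95, 10.0   # eta: inverse temperature (illustrative choice)
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)                    # P[x, a, y] = P(y | x, a)
r = rng.uniform(-1.0, 1.0, (n_states, n_actions))

def boltzmann_policy(Psi):
    """pi(a|x) proportional to exp(eta * Psi(x,a)); row-wise max subtracted for numerical stability."""
    z = np.exp(eta * (Psi - Psi.max(axis=1, keepdims=True)))
    return z / z.sum(axis=1, keepdims=True)

def dpp_style_update(Psi):
    """One incremental preference update (DPP-style, per my reading of [1]):
    Psi(x,a) <- Psi(x,a) + r(x,a) + gamma * E_{y ~ P(.|x,a)}[M Psi(y)] - M Psi(x),
    where M Psi(x) = sum_a pi(a|x) * Psi(x,a) is the Boltzmann-weighted average preference."""
    m = (boltzmann_policy(Psi) * Psi).sum(axis=1)    # M Psi(x), one scalar per state
    return Psi + r + gamma * (P @ m) - m[:, None]

Psi = np.zeros((n_states, n_actions))
for _ in range(2000):
    Psi = dpp_style_update(Psi)

# For reference, the exact DP solution from the previous sketch.
Q = np.zeros((n_states, n_actions))
for _ in range(10_000):
    Q = r + gamma * P @ Q.max(axis=1)

print("greedy policy from Q*:           ", Q.argmax(axis=1))
print("policy induced by the DPP sketch:", boltzmann_policy(Psi).argmax(axis=1))  # typically identical here
```

On this toy example the preference of the softmax-optimal action stabilizes while suboptimal preferences drift downward, so the induced policy sharpens gradually toward the greedy one rather than jumping to it in one step.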
MDP Formulation
An MDP is given by the tuple {X, A, P, R, γ}: X - state space, A - action space, P - transition kernel P(·|x,a), R - reward with immediate reward r(x,a), γ ∈ (0,1) - discount factor
π(·|x) - Markovian stationary policy
Vπ: X → ℝ - given π, the corresponding value function
Vπ(x) - value of state x under π
Qπ: Z → ℝ - given π, the corresponding action-value function
Qπ(x,a) - action-value of choosing action a in state x and following π thereafter
Tπ - the Bellman operator associated with policy π (see the policy-evaluation sketch at the end of this section)
Assumption 1 (MDP Regularity): we assume that X and A are finite sets with the cardinalities |X| and |A|, respectively. Also, the absolute value of the immediate reward r(x,a) is bounded from above by Rmax>0 ∀(x,a)∈Z.
Denote: Z ≜ X × A (the joint state-action space); Vmax ≜ Rmax/(1−γ) (since |r| ≤ Rmax, every discounted return is bounded in absolute value by Σ_{t≥0} γ^t Rmax = Rmax/(1−γ))
Stationary Policy: the distribution of the control action is independent of time
Deterministic Policy: ∀x∈X, ∃a s.t. π(a|x)=1
Value Function: the expected total discounted reward in each state x, when the action is chosen by policy π
Action Value Function: the expected total discounted reward when choosing action a in state x and then following π
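To ground the definitions in this section, here is a minimal policy-evaluation sketch on a tiny synthetic MDP (the sizes and the random P, r, and π are all invented for illustration): Vπ is the unique fixed point of Tπ, so for a finite MDP it can be obtained by solving the linear system (I − γPπ)V = rπ, and Qπ then follows from Vπ.

```python
import numpy as np

rng = np.random.default_rng(1)

# Tiny illustrative MDP and a fixed stochastic (Markovian, stationary) policy.
n_states, n_actions, gamma = 4, 2, 0.9
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)                   # P[x, a, y] = P(y | x, a)
r = rng.uniform(-1.0, 1.0, (n_states, n_actions))   # |r(x,a)| <= Rmax = 1
pi = rng.random((n_states, n_actions))
pi /= pi.sum(axis=1, keepdims=True)                 # pi[x, a] = pi(a | x)

# Policy-induced reward and transitions: r_pi(x) = sum_a pi(a|x) r(x,a), P_pi(y|x) = sum_a pi(a|x) P(y|x,a).
r_pi = (pi * r).sum(axis=1)
P_pi = np.einsum("xa,xay->xy", pi, P)

def T_pi(V):
    """Bellman operator for policy pi: (Tπ V)(x) = r_pi(x) + gamma * sum_y P_pi(y|x) V(y)."""
    return r_pi + gamma * P_pi @ V

# Vπ is the unique fixed point of Tπ; for a finite MDP it solves (I - gamma * P_pi) V = r_pi exactly.
V_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
print("fixed-point check Vπ = Tπ Vπ:", np.allclose(T_pi(V_pi), V_pi))
print("|Vπ| <= Vmax = Rmax/(1-γ):   ", np.all(np.abs(V_pi) <= 1.0 / (1 - gamma)))

# Qπ(x,a) = r(x,a) + gamma * E_{y ~ P(.|x,a)}[Vπ(y)]  (take a in state x, then follow pi).
Q_pi = r + gamma * P @ V_pi
print("Vπ recovered from Qπ:        ", np.allclose((pi * Q_pi).sum(axis=1), V_pi))
```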
Reference
[1] M. Gheshlaghi Azar, V. Gómez, and H. J. Kappen. Dynamic Policy Programming. Journal of Machine Learning Research, 13, 2012.