RL Dynamic Programming
Dynamic Programming in RL
DP here refers to algorithms that can be used to compute optimal policies given a perfect model of the environment as an MDP.
The key idea is the use of value functions to organize and structure the search for good policies
Assumption: MDP Dynamics is known
Goal: find optimal policy
(-) a perfect model needed
(-) they involve operations over the entire state set of the MDP
(-) computationally expensive
(-) hard to apply to problems with continuous or very large state and action spaces
(+) DP provides an essential foundation for understanding other RL methods
Notes:
A common way of obtaining approximate solutions for tasks with continuous states and actions is to quantize the state and action spaces, then apply finite-state DP methods.
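For instance, a minimal sketch of that quantization step (the state bounds and bin counts here are made up purely for illustration): a continuous state is mapped to a pair of bin indices, and the resulting finite grid of cells is what the tabular DP methods below operate on.
import numpy as np
# hypothetical continuous state: position in [-2.4, 2.4], velocity in [-3.0, 3.0]
pos_bins = np.linspace(-2.4, 2.4, 11)      # edges defining 10 position intervals
vel_bins = np.linspace(-3.0, 3.0, 11)      # edges defining 10 velocity intervals
def discretize(pos, vel):
    # np.digitize returns the index of the interval each value falls into;
    # the pair (i, j) serves as one finite state for tabular DP
    i = np.digitize(pos, pos_bins)
    j = np.digitize(vel, vel_bins)
    return i, j
print(discretize(0.1, -0.5))               # one discrete state index pair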
Recall the Bellman Equations:
We can easily obtain optimal policies once we have found the optimal value function V* or Q*, which satisfy the Bellman optimality equations
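For reference, the Bellman optimality equations (introduced in Chapter 3 of the book) are
    V*(s)   = max_a ∑_{s',r} p(s',r|s,a) [ r + γ V*(s') ]
    Q*(s,a) = ∑_{s',r} p(s',r|s,a) [ r + γ max_{a'} Q*(s',a') ]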
Policy Evaluation
To compute Vπ for a given policy π
Is concerned with making the state or action-values consistent with the current policy being followed
Iterative Policy Evaluation
Expected Updates
based on an expectation over all possible next states rather than on a sample next state
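Written out, the expected update applied to every state s in each sweep is
    V_{k+1}(s) = ∑_a π(a|s) ∑_{s',r} p(s',r|s,a) [ r + γ V_k(s') ]
and V_k converges to Vπ as k → ∞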
Policy Improvement
To find a new policy π' via π'(s) = argmax_a Qπ(s,a)
Specifically, the process of making a new policy that improves on an original policy, by making it greedy w.r.t the value function of the original policy
Is concerned with adjusting the current policy to one that when followed leads to a higher return
Recall
Vπ(s) - how good it is to follow the current policy from s
The reason to compute Vπ is to find better policies
How to determine if it is better to change to the new policy?
Use Qπ(s,a) - the value of selecting a in s and thereafter following policy π
Given a policy π and its value functions Vπ and Qπ, we can evaluate a change in the policy at a single state to a particular action
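Concretely, the improved (greedy) policy and the policy improvement theorem:
    π'(s) = argmax_a Qπ(s,a) = argmax_a ∑_{s',r} p(s',r|s,a) [ r + γ Vπ(s') ]
    if Qπ(s, π'(s)) >= Vπ(s) for all s, then Vπ'(s) >= Vπ(s) for all s (so π' is at least as good as π)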
Example 4.1 GridWorld - python, implementation of Policy Evaluation
deterministic: p(s'|s,a)=1 Episodic Task
----|---|---|---- terminal states: ///
|///| 1 | 2 | 3 | actions={up,down,left,right}
|---|---|---|---| if out of grid, leave its location unchanged
| 4 | 5 | 6 | 7 | reward:
|---|---|---|---| r=-1 on all transitions until the terminal state is reached
| 8 | 9 | 10| 11|
|---|---|---|---| gamma=1 <- undiscounted
| 12| 13| 14|///|
----|---|---|---- compute Vπ(s) for the uniform random policy π(a|s)=0.25
Vπ(s)=sum_a π(a|s)*[r+gamma*Vπ(s')]   (expectation over actions; transitions are deterministic)
4x4 grid; also record π(s)=argmax_a of the per-action terms
import numpy as np

nA, nX, nY = 4, 4, 4                         # number of actions, grid width, grid height
actions = [[0,-1],[-1,0],[0,1],[1,0]]        # left, up, right, down
pi = 0.25                                    # uniform random policy: pi(a|s)=0.25
gamma = 1                                    # undiscounted

def is_terminal(s):
    return (s[0]==0 and s[1]==0) or (s[0]==nX-1 and s[1]==nY-1)

def step(s, a):
    # terminal states are absorbing and give zero reward
    if is_terminal(s):
        return s, 0
    s_ = [s[0]+a[0], s[1]+a[1]]
    if s_[0]<0 or s_[0]>=nX or s_[1]<0 or s_[1]>=nY:
        s_ = s                               # moves off the grid leave the state unchanged
    r = -1
    return s_, r

# iterative policy evaluation for the uniform random policy
v = np.zeros((nX, nY))
pi_op = np.zeros((nX, nY))
iter = 0
while True:
    v_new = np.zeros_like(v)
    for x in range(nX):
        for y in range(nY):
            vs = []
            for i, a in enumerate(actions):
                (x_, y_), r = step([x, y], a)
                vs.append(pi*(r + gamma*v[x_, y_]))
            v_new[x, y] = np.sum(vs)         # V(s) = sum_a pi(a|s) [r + gamma V(s')]
            pi_op[x, y] = np.argmax(vs)      # greedy action w.r.t. the current V (pi is uniform, so it does not change the argmax)
    if np.sum(np.abs(v - v_new)) < 1e-4:
        print(np.round(v, decimals=3))
        print(pi_op)
        print(iter)
        break
    v = v_new
    iter += 1
>>V
[[ 0. -14. -20. -22.]
[-14. -18. -20. -20.]
[-20. -20. -18. -14.]
[-22. -20. -14. 0.]]
>>pi_op #argmax returns only the first occurrence among tied maxima
[[0. 0. 0. 0.]
[1. 0. 3. 3.]
[1. 2. 2. 3.]
[1. 2. 2. 0.]]
'''
[[l. l. l. l.]
[u. l. d. d.]
[u. r. r. d.]
[u. r. r. l.]]
'''
>>iter:217
Policy Iteration
To find the optimal policy π* (together with its value function Vπ*)
π0 -PE-> Vπ0 -PI-> π1 -PE-> Vπ1 -PI-> π2 -PE-> ... -PI-> π* -PE-> Vπ*
PE - Policy Evaluation
PI - Policy Improvement
Each policy evaluation inside Policy Iteration is started from the value function of the previous policy; this typically speeds up its convergence considerably, presumably because the value function changes little from one policy to the next
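A minimal sketch of the full loop on the same GridWorld (it reuses step, actions, nX, nY from the Example 4.1 code above; the discount gamma_pi=0.9 and the helper names evaluate/improve are choices made here for illustration - a discount below 1 keeps evaluation bounded even for an initial deterministic policy that never reaches a terminal state):
import numpy as np

gamma_pi = 0.9

def evaluate(policy, v, theta=1e-6):
    # iterative policy evaluation for a deterministic policy: policy[x, y] = action index
    while True:
        v_new = np.zeros_like(v)
        for x in range(nX):
            for y in range(nY):
                (x_, y_), r = step([x, y], actions[int(policy[x, y])])
                v_new[x, y] = r + gamma_pi*v[x_, y_]
        if np.sum(np.abs(v - v_new)) < theta:
            return v_new
        v = v_new

def improve(v):
    # greedy policy w.r.t. v
    policy = np.zeros((nX, nY), dtype=int)
    for x in range(nX):
        for y in range(nY):
            qs = []
            for a in actions:
                (x_, y_), r = step([x, y], a)
                qs.append(r + gamma_pi*v[x_, y_])
            policy[x, y] = np.argmax(qs)
    return policy

policy = np.zeros((nX, nY), dtype=int)       # arbitrary initial policy pi_0 (all "left")
v = np.zeros((nX, nY))
for k in range(100):                         # safety cap on the number of PE/PI rounds
    v = evaluate(policy, v)                  # PE: make v consistent with the current policy
    new_policy = improve(v)                  # PI: make the policy greedy w.r.t. v
    if np.array_equal(new_policy, policy):   # policy stable -> optimal for the discounted task
        break
    policy = new_policy
print(policy)
print(np.round(v, decimals=2))
Note how each evaluation above is warm-started from the previous value function - the speed-up mentioned in the note above.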
Value Iteration
Update V(s) with V(s) ← max_a ∑_{s',r} p(s',r|s,a)[r + γV(s')]
Drawback of Policy Iteration:
(-) each of its iterations involves policy evaluation, which may itself be a protracted iterative computation requiring multiple sweeps through the state set
Solution:
It may be possible to truncate policy evaluation without losing the convergence guarantees of policy iteration
Ways of viewing Value Iteration
- each of VI's sweeps effectively combines one sweep of PE and one sweep of PI
- Value Iteration is obtained simply by turning the Bellman Optimality Equation into an update rule
- compare backup diagrams of Vπ and V*
- the max operation is added to some sweeps of policy evaluation
All of these algorithms converge to an optimal policy for discounted finite MDPs
Value Iteration VS Policy Iteration
Recall:
In Policy Iteration, each iteration involves policy evaluation, which may itself be a protracted iterative computation requiring multiple sweeps through the state set
However,
In Value Iteration, the max operator means that all actions must be considered at every state in each sweep
Policy Iteration computes the value for some fixed policy π(s)
It can start with an arbitrary policy π0, and repeat until the policy converges
Policy Iteration Complexity (per iteration, with exact policy evaluation):
Time: O(|S|^3 + |A||S|^2)
Space: O(|S|)
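For comparison, one sweep of Value Iteration costs about O(|A||S|^2) time (each state, each action, summing over possible next states) and O(|S|) space - cheaper per sweep, but Value Iteration usually needs more sweeps than Policy Iteration needs evaluation/improvement rounds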
Q-Value Iteration
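Analogous to Value Iteration but over action values, the update is
    Q_{k+1}(s,a) = ∑_{s',r} p(s',r|s,a) [ r + γ max_{a'} Q_k(s',a') ]
which for the deterministic GridWorld below reduces to Q(s,a) = r + γ max_{a'} Q(s',a')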
Example 4.1 GridWorld - python, implementation of Value Iteration and Q-Value Iteration
deterministic: p(s'|s,a)=1 Episodic Task
----|---|---|---- terminal states: ///
|///| 1 | 2 | 3 | actions={up,down,left,right}
|---|---|---|---| if out of grid, leave its location unchanged
| 4 | 5 | 6 | 7 | reward:
|---|---|---|---| r=-1 on all transitions until the terminal state is reached
| 8 | 9 | 10| 11|
|---|---|---|---| gamma=1 <- undiscounted
| 12| 13| 14|///|
----|---|---|---- compute the V*(s) and Q*(s,a)
V(s)=max_a [r+gamma*V(s')]
4x4 grid; Q(s,a)=r+gamma*max_a' Q(s',a')
record π(s) from argmax_a [r+gamma*V(s')] and from argmax_a Q(s,a)
# Value Iteration
v = np.zeros((nX, nY))
pi_op = np.zeros_like(v)
iter = 0
while True:
    v_new = np.zeros_like(v)
    for x in range(nX):
        for y in range(nY):
            vs = []
            for i, a in enumerate(actions):
                (x_, y_), r = step([x, y], a)
                vs.append(r + gamma*v[x_, y_])
            v_new[x, y] = np.max(vs)         # V(s) = max_a [r + gamma V(s')]
            pi_op[x, y] = np.argmax(vs)      # greedy action
    if np.sum(np.abs(v - v_new)) < 1e-4:
        print(np.round(v, decimals=3))
        print(pi_op)
        print(iter)
        break
    v = v_new
    iter += 1
# Q-Value Iteration
def sub2num(x, y):                           # (row, col) -> flat state index
    return x*nY + y
def num2sub(n):                              # flat state index -> (row, col)
    return [n//nY, n%nY]

N = nX*nY
q = np.zeros((N, nA))
iter = 0
while True:
    q_new = np.zeros_like(q)
    for x in range(nX):
        for y in range(nY):
            n = sub2num(x, y)
            for i, a in enumerate(actions):
                (x_, y_), r = step([x, y], a)
                n_ = sub2num(x_, y_)
                # Q(s,a) = r + gamma * max_a' Q(s',a')
                q_new[n, i] = r + gamma*np.max(q[n_])
    if np.sum(np.abs(q - q_new)) < 1e-4:
        q_max = np.max(q, axis=1)            # V(s) = max_a Q(s,a)
        a_max = np.argmax(q, axis=1)         # greedy action per state
        v_out = np.zeros((nX, nY))
        a_out = np.zeros((nX, nY))
        for j in range(N):
            a, b = num2sub(j)
            v_out[a, b] = q_max[j]
            a_out[a, b] = a_max[j]
        print(np.round(v_out, decimals=3))
        print(np.round(a_out, decimals=3))
        print(iter)
        break
    q = q_new
    iter += 1
>> #V of Value Iteration
[[ 0. -1. -2. -3.]
[-1. -2. -3. -2.]
[-2. -3. -2. -1.]
[-3. -2. -1. 0.]]
>> #pi_op
[[0. 0. 0. 0.]
[1. 0. 0. 3.]
[1. 0. 2. 3.]
[1. 2. 2. 0.]]
>>iter:3
>> #V=maxQ(s,a) of Q-Value Iteration (identical to V of Value Iteration)
[[ 0. -1. -2. -3.]
 [-1. -2. -3. -2.]
 [-2. -3. -2. -1.]
 [-3. -2. -1.  0.]]
>> #a=argmaxQ(s,a) (same greedy actions as pi_op above; ties broken by the first occurrence)
[[0. 0. 0. 0.]
 [1. 0. 0. 3.]
 [1. 0. 2. 3.]
 [1. 2. 2. 0.]]
>>iter:4
Asynchronous DP
One Drawback of DP:
(-) they involve operations over the entire state set of the MDP
Asynchronous DP algorithms are in-place iterative DP methods that are not organized in terms of systematic sweeps of the state set
They update the values of states in any order whatsoever, using whatever values of other states happen to be available
(+) flexibility
(+) help the agent not get locked into any hopeless long sweep before it can make progress
(+) make it easier to intermix computation with real-time interaction
(-) still has to keep updating the values of all the states (avoiding systematic sweeps does not necessarily mean less computation)
To solve a given MDP, we can run an iterative DP at the same time that an agent is actually experiencing the MDP. The agent’s experience can be used to determine the states to which the DP applies its updates.
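A minimal sketch of that in-place, arbitrary-order idea on the GridWorld above (it reuses step, actions, nX, nY, gamma from Example 4.1; the random visiting order is just one possible choice):
import numpy as np

# asynchronous (in-place) value iteration: update one state at a time, in an
# arbitrary order, always using the latest available values of the other states
v = np.zeros((nX, nY))
rng = np.random.default_rng(0)

for sweep in range(1000):
    delta = 0.0
    for n in rng.permutation(nX*nY):         # visit states in a random order, not a systematic sweep
        x, y = n//nY, n%nY
        old = v[x, y]
        qs = []
        for a in actions:
            (x_, y_), r = step([x, y], a)
            qs.append(r + gamma*v[x_, y_])
        v[x, y] = max(qs)                    # in-place update: later states in this pass already see the new value
        delta = max(delta, abs(v[x, y] - old))
    if delta < 1e-4:
        break
print(np.round(v, decimals=3))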
Generalized Policy Iteration
To find optimal policy π* and optimal value function V*
Refers to the general idea of letting the Policy Evaluation and Policy Improvement processes interact, independent of the granularity and other details of the two processes
The value function only stabilizes when it is consistent with the current policy, and the policy stabilizes only when it is greedy with respect to the current value function
Efficiency of DP
Let |S| and |A| denote the number of states and actions
DP methods are guaranteed to find an optimal policy in time polynomial in |S| and |A|, even though the total number of deterministic policies is |A|^|S|
Comparable methods which can solve MDPs:
- Dynamic Programming
- Linear Programming
References
Sutton & Barto, Reinforcement Learning: An Introduction, 2nd edition, Chapter 4: Dynamic Programming
Shangtong Zhang's GitHub: Python implementations of the book's examples