Temporal-Difference Learning
Note: the notations and formalizations follow the previous post Basic Reinforcement Learning
TD Prediction
constant-\(\alpha\) MC:
\[V(s_t) \leftarrow V(s_t)+\alpha\left[R_t - V(s_t) \right]\]Target: \(R_t\) - the actual return following time \(t\)
\(\alpha\) - a constant step-size parameter
Whereas MC methods must wait until the end of the episode to determine the increment to \(V(s_t)\), TD methods need to wait only until the next step
TD(0) or one-step TD:
\[V(s_t) \leftarrow V(s_t)+\alpha \left[r_t + \gamma V(s_{t+1}) - V(s_t) \right]\]Target: \(r_t + \gamma V(s_{t+1})\)
TD error:
\[\delta_t \triangleq r_t + \gamma V(s_{t+1}) - V(s_t)\]The difference between the estimated value of \(s_t\) (\(V(s_t)\)) and the better estimate \(r_t + \gamma V(s_{t+1})\)
Recall in DP:
\[\begin{align*} V_{\pi}(s) &\triangleq \mathbb{E}_{\pi} \left[R_t \mid s_t=s \right] &\text{:Target for MC}\\ &=\mathbb{E}_{\pi} \left[r_t + \gamma V_{\pi}(s_{t+1}) \mid s_t=s \right] &\text{:Target for DP} \end{align*}\]The TD target both samples the expected value (like MC) and uses the current estimate \(V\) in place of the true \(V_{\pi}\) (like DP). Thus TD combines the sampling of MC with the bootstrapping of DP
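A minimal sketch of one episode of tabular TD(0) prediction (not the implementation used below), assuming a `step(s,a)` function that returns `(next_state, reward, done)` as in the Random Walk code later in this post, and a hypothetical `policy(s)` that samples an action:
import numpy as np
def td0_episode(step, policy, V, start, alpha=0.1, gamma=1.0):
    """One episode of tabular TD(0) prediction; V is updated in place."""
    s = start
    done = False
    while not done:
        a = policy(s)                              # action from the policy being evaluated
        s_, r, done = step(s, a)                   # sample one transition
        target = r if done else r + gamma * V[s_]  # bootstrap unless s_ is terminal
        V[s] += alpha * (target - V[s])            # move V(s) toward the TD target
        s = s_
    return V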
SARSA
SARSA is an on-policy TD control method. As in all on-policy methods, we must estimate \(Q_{\pi}(s,a)\) for the current behavior policy \(\pi\) and for all states \(s\) and actions \(a\)
Consider transitions from state-action pair to state-action pair
\[(s_t,a_t),r_t \rightarrow (s_{t+1},a_{t+1}),r_{t+1} \rightarrow (s_{t+2},a_{t+2}),r_{t+2} ...\]The update rule of SARSA:
For nonterminal state \(s_t\):
\[Q(s_t,a_t) \leftarrow Q(s_t,a_t)+\alpha \left[r_t + \gamma Q(s_{t+1},a_{t+1}) - Q(s_t,a_t) \right]\]For terminal state \(s_{t+1}\):
\[Q(s_{t+1},a_{t+1})=0\]TD error:
\[\delta_t \triangleq r_t + \gamma Q(s_{t+1},a_{t+1}) - Q(s_t,a_t)\]The control algorithm (as in all on-policy methods):
- estimate \(Q_{\pi}\) for the current behavior policy \(\pi\)
- update \(\pi\) toward greediness w.r.t. \(Q_{\pi}\)
SARSA converges with probability 1 to an optimal policy and action-value function as long as all state-action pairs are visited an infinite number of times and the policy converges in the limit to the greedy policy (which can be arranged, e.g., with \(\epsilon\)-greedy or \(\epsilon\)-soft policies whose \(\epsilon\) decays to 0)
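A minimal sketch of one SARSA episode under an \(\epsilon\)-greedy policy, assuming the same hypothetical `step(s,a)` convention and a tabular `Q` indexed as `Q[s][a]` (a full implementation on Windy Gridworld appears later):
import numpy as np
def e_greedy(Q_s, eps):
    """Pick a random action with probability eps, otherwise a greedy one."""
    if np.random.random() < eps:
        return np.random.randint(len(Q_s))
    return int(np.argmax(Q_s))
def sarsa_episode(step, Q, start, eps=0.1, alpha=0.5, gamma=1.0):
    s = start
    a = e_greedy(Q[s], eps)
    done = False
    while not done:
        s_, r, done = step(s, a)
        a_ = e_greedy(Q[s_], eps)                  # the action actually taken next
        target = r if done else r + gamma * Q[s_][a_]
        Q[s][a] += alpha * (target - Q[s][a])      # on-policy update
        s, a = s_, a_
    return Q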
Q-learning
Q-learning is an off-policy TD control method
The update rule:
\[Q(s_t,a_t) \leftarrow Q(s_t,a_t)+\alpha \left[r_t + \gamma \max_b Q(s_{t+1},b) - Q(s_t,a_t) \right]\]where the learned \(Q\) directly approximates the optimal \(Q_*\) independent of the policy being followed
Some insights:
- this independence from the policy being followed dramatically simplifies the analysis of the algorithm and enabled early convergence proofs
- though the policy is never explicitly evaluated, it still determines which \((s,a)\) pairs are visited and updated
- the requirement for convergence is that all pairs continue to be updated
On/Off Policy Revisit
In the TD setup,
- On-policy: evaluate and improve an \(\epsilon\)-greedy policy, and this same policy is used to generate samples (the update policy and the behavior policy are the same)
- Off-policy: evaluate and improve a greedy policy, while an \(\epsilon\)-greedy policy is used to generate samples (the update policy and the behavior policy are different)
In SARSA, the policy evaluation takes place in \(r+\gamma Q(s',a')\), where \(a' \sim \pi(a' \mid s')\) is sampled from the \(\epsilon\)-greedy policy
This means the agent looks ahead to the action it will actually take at the next step under the current policy. In other words, the \(\epsilon\)-greedy policy, exploration included, is what gets evaluated and updated, so the values reflect whether the next state and action are safe or dangerous
In Q-learning, the policy evaluation takes place in \(r+\gamma \max_{a'} Q(s',a')\): a purely greedy policy is evaluated and updated at all times
Since the Q-function is always updated with the greedy target, regardless of the policy actually being followed, it does not take the effects of exploration into account
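The distinction shows up directly in the targets. A minimal sketch, assuming a tabular `Q[s][a]` stored as a 2-D numpy array (all names here are illustrative):
import numpy as np
def behavior(Q_s, eps):
    """Epsilon-greedy action selection over a 1-D array of action values."""
    if np.random.random() < eps:
        return np.random.randint(len(Q_s))
    return int(np.argmax(Q_s))
def sarsa_target(Q, s_next, r, eps, gamma=1.0):
    # On-policy: bootstrap from the action the agent will actually take next
    a_next = behavior(Q[s_next], eps)
    return r + gamma * Q[s_next][a_next]
def q_learning_target(Q, s_next, r, gamma=1.0):
    # Off-policy: bootstrap from the greedy action, regardless of what the agent will do
    return r + gamma * np.max(Q[s_next])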
Some insights:
- when the policy is purely greedy, Q-learning and SARSA produce the same updates
- SARSA usually performs better than Q-learning, especially when there is a good chance that the agent will take a random suboptimal action at the next step
- Q-learning's assumption that the agent follows the optimal policy may be far enough from the truth that SARSA converges faster and with fewer errors
- Q-learning is more likely to learn an optimal policy when the agent doesn't explore too much, whereas SARSA is less likely to learn such an optimal policy but learns a safer one
See Cliff Walking for more info
Expected SARSA
The update rule:
\[Q(s_t,a_t) \leftarrow Q(s_t,a_t)+\alpha \left[r_t + \gamma \mathbb{E}_{\pi} \left[Q(s_{t+1},a_{t+1}) \mid s_{t+1} \right] - Q(s_t,a_t) \right]\] \[Q(s_t,a_t) \leftarrow Q(s_t,a_t)+\alpha \left[r_t + \gamma \sum_a \pi(a \mid s_{t+1}) Q(s_{t+1},a) - Q(s_t,a_t) \right]\]Expected SARSA is more complex computationally than SARSA but, in return, it eliminates the variance due to the random selection of \(a_{t+1}\)
Expected SARSA can also be used off-policy: it can use a behavior policy different from the target policy \(\pi\) to generate samples
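A minimal sketch of the expected target under an \(\epsilon\)-greedy policy (the Cliff Walking implementation below does the same computation inline); `Q_next` is assumed to be a 1-D numpy array of action values at \(s_{t+1}\):
import numpy as np
def expected_sarsa_target(Q_next, r, eps, gamma=1.0):
    """r + gamma * E_pi[Q(s',a')] for an epsilon-greedy pi, ties split among maxima."""
    n = len(Q_next)
    probs = np.full(n, eps / n)                    # exploration mass spread over all actions
    best = np.flatnonzero(Q_next == Q_next.max())
    probs[best] += (1.0 - eps) / len(best)         # greedy mass split among the best actions
    return r + gamma * np.dot(probs, Q_next)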
Double Q-learning
A maximum over estimated values can lead to a significant positive bias
Suppose we learn two independent estimates \(Q_1(a), Q_2(a), \forall a \in \mathcal{A}\), of the true value \(Q(a)\). We could use \(Q_1\) to determine the maximizing action \(a^*=\arg \max_a Q_1(a)\)
and use \(Q_2\) to provide the estimate of its value, \(Q_2(a^*)=Q_2(\arg \max_a Q_1(a))\)
This estimate is then unbiased in the sense that \(\mathbb{E}[Q_2(a^*)]=Q(a^*)\). Repeating the process with the roles of the two estimates reversed yields a second unbiased estimate, \(Q_1(\arg \max_a Q_2(a))\)
The update rule:
\[Q_1(s_t,a_t) \leftarrow Q_1(s_t,a_t)+\alpha \left[r_t + \gamma Q_2(s_{t+1}, \arg \max_a Q_1(s_{t+1},a)) - Q_1(s_t,a_t) \right]\] \[Q_2(s_t,a_t) \leftarrow Q_2(s_t,a_t)+\alpha \left[r_t + \gamma Q_1(s_{t+1}, \arg \max_a Q_2(s_{t+1},a)) - Q_2(s_t,a_t) \right]\]where, on each time step, one of the two updates is chosen at random (e.g., by a coin flip). There are also double versions of SARSA and Expected SARSA
See Maximization Bias Example for more info
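A minimal sketch of a single double Q-learning update, with a coin flip deciding which estimate is updated (tabular `Q1`/`Q2` indexed as `Q[s][a]`; a full version appears in the Maximization Bias Example below):
import numpy as np
def double_q_update(Q1, Q2, s, a, r, s_next, done, alpha=0.1, gamma=1.0):
    """One tabular double Q-learning update on Q1 or Q2, chosen by a coin flip."""
    # Pick which estimate to update; the other one evaluates the selected action.
    QA, QB = (Q1, Q2) if np.random.random() < 0.5 else (Q2, Q1)
    if done:
        target = r
    else:
        a_star = int(np.argmax(QA[s_next]))        # QA selects the maximizing action ...
        target = r + gamma * QB[s_next][a_star]    # ... QB provides its value estimate
    QA[s][a] += alpha * (target - QA[s][a])        # update only the selecting estimate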
Important Concepts
Sample updates: involve looking ahead to a sample successor state (or state-action pair), using the value of the successor and the reward along the way to compute a backed-up value and then updating the value of the original state (or state-action pair)
Sample updates differ from the expected updates of DP in that they are based on a single sample successor rather than on a complete distribution of all possible successors
Batch updating: updates are made only after processing each complete batch of training data
Batch MC vs Batch TD: batch MC always finds the estimates that minimize the mean-squared error on the training set, whereas batch TD(0) always finds the estimates that would be exactly correct for the maximum-likelihood model of the Markov process
Maximum-likelihood estimate: the parameter value whose probability of generating the data is greatest
Batch TD(0) converges to the certainty-equivalence estimate: the estimate that would be exactly correct if the maximum-likelihood model of the underlying process were known with certainty rather than being approximated
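To make the certainty-equivalence idea concrete, here is a sketch (not part of the original batch code) that builds the maximum-likelihood model from a batch of observed `(s, r, s_next)` transitions and solves the resulting Bellman equations exactly; terminal states are assumed to have value 0:
import numpy as np
def certainty_equivalence_V(transitions, n_states, terminal, gamma=1.0):
    """Value function of the maximum-likelihood model built from (s, r, s_next) samples."""
    counts = np.zeros((n_states, n_states))
    reward_sum = np.zeros(n_states)
    visits = np.zeros(n_states)
    for s, r, s_next in transitions:
        counts[s, s_next] += 1
        reward_sum[s] += r
        visits[s] += 1
    P = counts / np.maximum(visits[:, None], 1)    # ML transition probabilities
    R = reward_sum / np.maximum(visits, 1)         # ML expected one-step reward
    idx = np.array([s for s in range(n_states)
                    if s not in terminal and visits[s] > 0])
    # Solve V = R + gamma * P V restricted to the visited non-terminal states
    A = np.eye(len(idx)) - gamma * P[np.ix_(idx, idx)]
    V = np.zeros(n_states)
    V[idx] = np.linalg.solve(A, R[idx])
    return V
For the Random Walk below, feeding in the batched transitions with terminal states {0, 6} would give the values that batch TD(0) approaches as the learning rate shrinks.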
Maximization Bias: even when the maximum of the true values is zero, the maximum of the (noisy) estimates can be positive, i.e., a positive bias
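A tiny numerical illustration (not from the book's example): every action's true value is 0, yet the expected maximum of the noisy estimates is clearly positive:
import numpy as np
rng = np.random.default_rng(0)
n_actions, n_samples, n_trials = 10, 5, 10_000
# Each action's true value is 0; we estimate it from a few noisy samples.
estimates = rng.normal(0.0, 1.0, size=(n_trials, n_actions, n_samples)).mean(axis=2)
print(estimates.max(axis=1).mean())   # E[max_a Q_hat(a)] > 0, roughly 0.7 here
print(estimates.mean())               # the individual estimates are unbiased (about 0)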
Random Walk
- deterministic dynamics: \(p(s' \mid s,a)=1\)
- states = {0, 1(A), 2(B), 3(C), 4(D), 5(E), 6}
- actions = {left, right}, sampled uniformly
- termination: states 0 and 6
- r = +1 at state 6
- r = 0 otherwise
MC vs TD implementation:
import numpy as np
import matplotlib.pyplot as plt
states=[0,1,2,3,4,5,6]
actions=[-1,1]
V_true=np.arange(1,6)/6.0
def step(s,a):
s_=s+a
if s_==0:
return s_,0,True
elif s_==6:
return s_,1,True
else:
return s_,0,False
def run_td(n_eps,lr=0.1):
V=np.ones(len(states))*0.5
rms=[]
for ep in range(n_eps):
s=np.random.choice(states[1:6])
while True:
s_old=s
a=np.random.choice(actions)
s,r,done=step(s_old,a)
if done:
V[s_old]+=lr*(r-V[s_old])
else:
V[s_old]+=lr*(r+V[s]-V[s_old])
if done:
break
rms.append(np.sqrt(np.sum(np.power(V_true-V[1:6],2))/5.0))
return V,rms
v1,rms1=run_td(1)
v10,rms10=run_td(10)
v100,rms100=run_td(100)
plt.figure(figsize=(8,6))
plt.plot(range(1,6),V_true,'-o',label='true')
plt.plot(range(1,6),v1[1:6],'-o',label='1')
plt.plot(range(1,6),v10[1:6],'-o',label='10')
plt.plot(range(1,6),v100[1:6],'-o',label='100')
plt.grid()
plt.xticks(range(1,6),('A','B','C','D','E'),fontsize=15)
plt.yticks(fontsize=15)
plt.legend(fontsize=15)
plt.savefig('td_randomwalk.png',dpi=350)
def run_mc(n_eps,lr):
V=np.ones(len(states))*0.5
rms=[]
for ep in range(n_eps):
s=np.random.choice(states[1:6])
traj=[s]
while True:
s_old=s
a=np.random.choice(actions)
s,r,done=step(s_old,a)
traj.append(s)
if done:
break
for st in traj:
V[st]+=lr*(r-V[st])
rms.append(np.sqrt(np.sum(np.power(V_true-V[1:6],2))/5.0))
return V,rms
lr_td=[0.15,0.1,0.05]
lr_mc=[0.01,0.02,0.03,0.04]
n_runs=100
ls=['-','--',':','-.']
plt.figure(figsize=(8,6))
for i,lr in enumerate(lr_td):
rms_all=[]
for n in range(n_runs):
_,rms=run_td(n_eps=100,lr=lr)
rms_all.append(rms)
plt.plot(np.array(rms_all).mean(axis=0),color='r',label='td'+str(lr),linestyle=ls[i])
for i,lr in enumerate(lr_mc):
rms_all=[]
for n in range(n_runs):
_,rms=run_mc(n_eps=100,lr=lr)
rms_all.append(rms)
plt.plot(np.array(rms_all).mean(axis=0),color='b',label='mc'+str(lr),linestyle=ls[i])
plt.legend(loc='upper right',fontsize=13)
plt.grid()
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.xlabel('episodes',fontsize=15)
plt.ylabel('RMS',fontsize=15)
plt.savefig('td_mc_randomwalk.png',dpi=350)
The above figures correspond to the figures in Example 6.2
Batch updating:
Under batch updating, TD(0) converges deterministically to a single answer independent of the learning rate, as long as the learning rate is chosen to be sufficiently small, while constant-\(\alpha\) MC also converges deterministically but to a different answer.
def get_traj(method='td'):
s=np.random.choice(states[1:6])
traj=[s]
rew=[0]
while True:
s_old=s
a=np.random.choice(actions)
s,r,done=step(s_old,a)
traj.append(s)
rew.append(r)
if done:
break
if method=='td':
return traj, rew
else:
return traj, [r]*(len(traj)-1)
def get_batch(method='td',lr=0.001,n_runs=100,n_eps=100):
rms_all=[]
for n in range(n_runs):
trajs=[]
rews=[]
rms=[]
V=np.array([0.,-1.,-1.,-1.,-1.,-1.,1.])
for ep in range(n_eps):
traj,rew=get_traj(method=method)
trajs.append(traj)
rews.append(rew)
while True:
V_batch=np.zeros(len(states))
for tj,r in zip(trajs,rews):
for i in range(0,len(tj)-1):
if method=='td':
V_batch[tj[i]]+=r[i]+V[tj[i+1]]-V[tj[i]]
else:
V_batch[tj[i]]+=r[i]-V[tj[i]]
V_batch*=lr
if np.sum(np.abs(V_batch))<1e-3:
break
V+=V_batch
rms.append(np.sqrt(np.sum(np.power(V_true-V[1:6],2))/5.0))
rms_all.append(rms)
return np.array(rms_all).mean(axis=0)
rms_td=get_batch(method='td')
rms_mc=get_batch(method='mc')
plt.figure(figsize=(8,6))
plt.plot(rms_td,label='TD',linewidth=3)
plt.plot(rms_mc,label='MC',linewidth=3)
plt.ylim([0,0.25])
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.xlabel('episodes',fontsize=15)
plt.ylabel('RMS',fontsize=15)
plt.legend(fontsize=15)
plt.grid()
plt.savefig('batch_randomwalk.png',dpi=350)
The above figure corresponds to Figure 6.2
Windy Gridworld
- a standard gridworld, except there is a crosswind running upward through the middle of the grid
- an undiscounted episodic task
- states = 7 x 10
- actions = {up, down, left, right}
- START = [3,0]
- GOAL = [3,7]
- dynamics: columns 3-5 and 8 have wind strength 1 (the agent is shifted one state upward; columns indexed from 0), and columns 6-7 have wind strength 2 (shifted two states upward)
- termination: reach the goal
- r = -1 until the goal is reached
- epsilon: 0.1
- lr: 0.5
Implementation of SARSA:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
n_cols=10
n_rows=7
actions=[0,1,2,3] #up,down,left,right
n_a=len(actions)
START=[3,0]
GOAL=[3,7]
WIND=[0,0,0,1,1,1,2,2,1,0]
def step(s,a):
row,col=s
if a==0:
s_=[max(row-1-WIND[col],0),col]
elif a==1:
s_=[max(min(row+1-WIND[col],n_rows-1),0),col]
elif a==2:
s_=[max(row-WIND[col],0), max(col-1,0)]
else:
s_=[max(row-WIND[col],0), min(col+1, n_cols-1)]
if s_==GOAL:
return s_,0,True
else:
return s_,-1,False
def e_greedy(eps,q):
if (np.random.random()<=eps):
return np.random.choice(actions)
else:
return np.argmax(q)
def run_sarsa(n_eps=500,n_stps=500,eps=0.1,lr=0.5,gm=1.):
Q=np.zeros((n_rows,n_cols,n_a))
r_all,stp_all,cnt_all=[],[],[]
stpCnt=0
for ep in range(n_eps):
r_sum,done=0,False
s=START
a=e_greedy(eps,Q[s[0],s[1]])
for stp in range(n_stps):
s_,r,done=step(s,a)
a_=e_greedy(eps,Q[s_[0],s_[1]])
delta=r+gm*Q[s_[0],s_[1],a_]-Q[s[0],s[1],a]
Q[s[0],s[1],a]+=lr*delta
s=s_
a=a_
r_sum+=r
stpCnt+=1
if done:
break
r_all.append(r_sum)
stp_all.append(stp)
cnt_all.append(stpCnt)
if ep%10==0:
print(f'ep:{ep}, stps:{stp}, ret:{r_sum}')
return Q,r_all,stp_all,cnt_all
Q,r,stp,cnt=run_sarsa()
Draw figures:
q_heat=np.zeros((n_rows,n_cols))
for i in range(n_rows):
for j in range(n_cols):
q_heat[i,j]=np.max(Q[i,j])
plt.figure(figsize=(6,8))
plt.subplot(211)
sns.heatmap(q_heat,cmap='jet',annot=True)
plt.annotate('S',(0.3,3.7),fontsize=20,color="w")
plt.annotate('G',(7.3,3.7),fontsize=20,color="w")
plt.savefig('heatmap_windygrid.png',dpi=350)
op_act=np.zeros((n_rows,n_cols))
for i in range(n_rows):
for j in range(n_cols):
op_act[i,j]=np.argmax(Q[i,j])
nx=10
ny=7
scale=0.3
edge=ny-0.5
fig=plt.figure(figsize=(6,6))
ax=fig.add_subplot(1,1,1)
ax.set_aspect('equal', adjustable='box')
ax.set_xticks(np.arange(0,nx+1,1))
ax.set_yticks(np.arange(0,ny+1,1))
plt.grid()
plt.ylim((0,ny))
plt.xlim((0,nx))
for i in range(n_rows):
for j in range(n_cols):
if op_act[i,j]==0:
plt.arrow(j+0.5,edge-i,0,scale,width=0.1, head_width=0.2, head_length=0.1,fc='g', ec='g')
elif op_act[i,j]==1:
plt.arrow(j+0.5,edge-i,0,-scale,width=0.1, head_width=0.2, head_length=0.1,fc='c', ec='c')
elif op_act[i,j]==2:
plt.arrow(j+0.5,edge-i,-scale,0,width=0.1, head_width=0.2, head_length=0.1,fc='b', ec='b')
elif op_act[i,j]==3:
plt.arrow(j+0.5,edge-i,scale,0,width=0.1, head_width=0.2, head_length=0.1,fc='r', ec='r')
plt.annotate('S',(0.3,3.3),fontsize=20)
plt.annotate('G',(7.3,3.3),fontsize=20)
plt.savefig('oppi_windygrid.png',dpi=350)
fig=plt.figure(figsize=(12,10))
plt.rcParams['font.size']='10'
plt.subplot(221)
plt.plot(r,linewidth=3,color='orange')
plt.grid()
plt.xlabel('Episodes')
plt.ylabel('Return')
plt.subplot(222)
plt.plot(stp,linewidth=3,color='orange')
plt.grid()
plt.xlabel('Episodes')
plt.ylabel('Steps')
plt.subplot(223)
plt.plot(cnt[:170],range(170),linewidth=3,color='orange')
plt.grid()
plt.xlabel('Steps')
plt.ylabel('Episodes')
plt.savefig('res_windygrid.png',dpi=350)
The above corresponds to the figures in Example 6.5
Cliff Walking
- a standard gridworld, except there is a cliff along the bottom edge
- an undiscounted episodic task
- states = 4 x 12
- actions = {up, down, left, right}
- START = [3,0]
- GOAL = [3,11]
- termination: enter the cliff region or reach the goal
- r = -1 for all transitions other than into the cliff region
- r = -100 for stepping into the cliff region
- epsilon: 0.1
- lr: 0.5
Implementation of SARSA and expected SARSA:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
n_cols=12
n_rows=4
actions=[0,1,2,3] #up,down,left,right
n_a=len(actions)
START=[3,0]
GOAL=[3,11]
CLIFF=list(range(1,11))
def step(s,a):
row,col=s
if a==0: #up
s_=[max(row-1,0),col]
elif a==1: #down
s_=[min(row+1,n_rows-1),col]
elif a==2: #left
s_=[row, max(col-1,0)]
else: #right
s_=[row, min(col+1, n_cols-1)]
if s_==GOAL:
return s_,0,True
elif s_[0]==n_rows-1 and s_[1] in CLIFF:
return s_,-100,True
else:
return s_,-1,False
def e_greedy(eps,q):
if (np.random.random()<=eps):
return np.random.choice(actions)
else:
return np.random.choice([a for a, qs in enumerate(q) if qs==np.max(q)])
#return np.argmax(q)
def run_sarsa(expected=False,n_eps=500,n_stps=500,eps=0.1,lr=0.5,gm=1.):
Q=np.zeros((n_rows,n_cols,n_a))
r_all,stp_all,cnt_all=[],[],[]
stpCnt=0
for ep in range(n_eps):
r_sum,done=0,False
s=START
a=e_greedy(eps,Q[s[0],s[1]])
for stp in range(n_stps):
s_,r,done=step(s,a)
a_=e_greedy(eps,Q[s_[0],s_[1]])
if not expected:
delta=r+gm*Q[s_[0],s_[1],a_]-Q[s[0],s[1],a]
else:
Q_exp=0.0
Q_=Q[s_[0],s_[1],:]
a_bests=np.argwhere(Q_==np.max(Q_))
for act in actions:
if act in a_bests:
Q_exp+=((1.0-eps)/len(a_bests)+eps/len(actions))*Q[s_[0],s_[1],act]
else:
Q_exp+=eps/len(actions)*Q[s_[0],s_[1],act]
delta=r+gm*Q_exp-Q[s[0],s[1],a]
Q[s[0],s[1],a]+=lr*delta
s=s_
a=a_
r_sum+=r
stpCnt+=1
if done:
break
r_all.append(r_sum)
stp_all.append(stp)
cnt_all.append(stpCnt)
#if ep%100==0:
# print(f'ep:{ep}, stps:{stp}, ret:{r_sum}')
return Q,r_all,stp_all,cnt_all
#Q_sarsa,r_sarsa,stp_sarsa,cnt_sarsa=run_sarsa()
n_runs=50
r_sarsa_all=[]
for n in range(n_runs):
Q_sarsa,r_sarsa,stp_sarsa,cnt_sarsa=run_sarsa()
r_sarsa_all.append(r_sarsa)
Implementation of Q-learning:
def run_q(n_eps=500,n_stps=500,eps=0.1,lr=0.5,gm=1.):
Q=np.zeros((n_rows,n_cols,n_a))
r_all,stp_all,cnt_all=[],[],[]
stpCnt=0
for ep in range(n_eps):
r_sum,done=0,False
s=START
for stp in range(n_stps):
a=e_greedy(eps,Q[s[0],s[1]])
s_,r,done=step(s,a)
delta=r+gm*np.max(Q[s_[0],s_[1]])-Q[s[0],s[1],a]
Q[s[0],s[1],a]+=lr*delta
s=s_
r_sum+=r
stpCnt+=1
if done:
break
r_all.append(r_sum)
stp_all.append(stp)
cnt_all.append(stpCnt)
#if ep%100==0:
# print(f'ep:{ep}, stps:{stp}, ret:{r_sum}')
return Q,r_all,stp_all,cnt_all
n_runs=50
r_q_all=[]
for n in range(n_runs):
Q_q,r_q,stp_q,cnt_q=run_q()
r_q_all.append(r_q)
Draw figures:
r_q_all=np.array(r_q_all)
r_sarsa_all=np.array(r_sarsa_all)
plt.rcParams['font.size']='14'
plt.figure(figsize=(8,6))
plt.plot(r_q_all.mean(axis=0),'r',label='q-learning')
plt.plot(r_sarsa_all.mean(axis=0),'b',label='sarsa')
plt.ylim([-100,-10])
plt.legend()
plt.grid()
plt.xlabel('Episodes')
plt.ylabel('Returns during episode')
plt.savefig('sarsa_ql_cliffwalk.png',dpi=350)
plt.close()
def plot_heat(q,alg='sarsa'):
q_heat=np.zeros((n_rows,n_cols))
for i in range(n_rows):
for j in range(n_cols):
q_heat[i,j]=np.max(q[i,j])
plt.figure(figsize=(8,4))
sns.heatmap(q_heat,cmap='jet',annot=True)
plt.annotate('S', (0.3,3.7), fontsize=20, color="w")
plt.annotate('G', (11.3,3.7), fontsize=20, color="w")
plt.title(alg)
plt.savefig(alg+'_heatmap_cliffwalk.png',dpi=350)
plt.close()
def plot_oppi(q,alg='sarsa'):
op_act=np.zeros((n_rows,n_cols))
for i in range(n_rows):
for j in range(n_cols):
op_act[i,j]=np.argmax(q[i,j])
nx=12
ny=4
scale=0.3
edge=ny-0.5
fig=plt.figure(figsize=(8,4))
ax=fig.add_subplot(1,1,1)
ax.set_aspect('equal', adjustable='box')
ax.set_xticks(np.arange(0,nx+1,1))
ax.set_yticks(np.arange(0,ny+1,1))
plt.grid()
plt.ylim((0,ny))
plt.xlim((0,nx))
for i in range(n_rows):
for j in range(n_cols):
if op_act[i,j]==0:
plt.arrow(j+0.5,edge-i,0,scale,width=0.1, head_width=0.2, head_length=0.1,fc='g', ec='g')
elif op_act[i,j]==1:
plt.arrow(j+0.5,edge-i,0,-scale,width=0.1, head_width=0.2, head_length=0.1,fc='c', ec='c')
elif op_act[i,j]==2:
plt.arrow(j+0.5,edge-i,-scale,0,width=0.1, head_width=0.2, head_length=0.1,fc='b', ec='b')
elif op_act[i,j]==3:
plt.arrow(j+0.5,edge-i,scale,0,width=0.1, head_width=0.2, head_length=0.1,fc='r', ec='r')
plt.annotate('S',(0.3,0.3),fontsize=20)
plt.annotate('G',(11.3,0.3),fontsize=20)
plt.title('$\pi_*$ of '+alg)
plt.savefig(alg+'_oppi_cliffwalk.png',dpi=350)
plt.close()
plot_heat(Q_sarsa,'sarsa')
plot_heat(Q_q,'ql')
plot_oppi(Q_sarsa,'sarsa')
plot_oppi(Q_q,'ql')
The above corresponds to the figures in Example 6.6
If we anneal \(\epsilon\) from 0.5 with a decay rate of 0.99 per episode, we obtain the value functions and optimal policies as follows,
where SARSA converges to a roundabout policy, and Q-learning converges to an optimal one travelling right along the edge of the cliff
If we replace the \(\epsilon\)-greedy policy with a purely greedy policy, SARSA and Q-learning become identical
Asymptotic and interim performances:
def get_performance(n_runs=10,n_eps=1000):
lrs=np.arange(0.1,1.1,0.1)
performance=np.zeros((len(lrs),3))
for n in range(n_runs):
for i,lr in enumerate(lrs):
_,r_sarsa,_,_=run_sarsa(n_eps=n_eps,eps=.1,lr=lr,gm=1.)
_,r_exp_sarsa,_,_=run_sarsa(expected=True,n_eps=n_eps,eps=.1,lr=lr,gm=1.)
_,r_q,_,_=run_q(n_eps=n_eps,eps=.1,lr=lr,gm=1.)
performance[i][0]+=sum(r_sarsa)
performance[i][1]+=sum(r_exp_sarsa)
performance[i][2]+=sum(r_q)
return performance
asy_performance=get_performance(n_runs=1,n_eps=10000)
int_performance=get_performance(n_runs=50,n_eps=100)
labels=['asy_sarsa','asy_expected_sarsa','asy_ql',
'int_sarsa','int_expected_sarsa','int_ql']
plt.figure(figsize=(8,6))
for i in range(3):
plt.plot((asy_performance/(10*1000))[:,i],'-o',label=labels[i],linewidth=2)
for i in range(3):
plt.plot((int_performance/(100*50))[:,i],'-o',label=labels[i+3],linewidth=2)
plt.legend(bbox_to_anchor=(1.04,1), borderaxespad=0)
plt.grid()
plt.xticks(np.arange(10),(np.round(np.arange(0.1,1.1,0.1),1)))
plt.xlabel('Learning rate')
plt.ylabel('Sum of rewards per episode')
plt.savefig('exp_sarsa_cliffwalk.png',dpi=350)
The above corresponds to Figure 6.3: because the state transitions are all deterministic and all randomness comes from the policy, Expected SARSA can safely set the learning rate to 1 without suffering any degradation of asymptotic performance, whereas SARSA performs well in the long run only at a small learning rate, at which short-term performance is poor
Maximization Bias Example
- states: two non-terminal states A and B
- actions in state A: {left, right}
- actions in state B: 10 actions, range(0,10)
- episodes always start in A
- taking 'right' in A -> terminal, r = 0
- taking 'left' in A -> B with r = 0, then any action in B -> terminal with r ~ N(-0.1, 1)
The expected return for any trajectory starting with 'left' is -0.1,
thus taking 'left' in A is always a mistake
However, Q-learning may favor 'left' because of maximization bias
Implementation of Q-learning and Double Q-learning:
import numpy as np
import matplotlib.pyplot as plt
A_ACTIONS=[0,1] #right,left
B_ACTIONS=range(0,10)
ACTIONS=[A_ACTIONS, B_ACTIONS]
#state A,B,terminal
A,B,T=0,1,2
STATES=[A,B]
START=A
def step(s,a):
if s==A:
if a==0: #right
return T,0,True
else: #left
s_=B
return s_,0,False
else: #s==B whatever action may lead to Terminal and random Reward
return T,np.random.normal(-0.1,1),True
def e_greedy(eps,q,s):
if (np.random.random()<=eps):
if s==A:
return np.random.choice(A_ACTIONS)
else:
return np.random.choice(B_ACTIONS)
else:
return np.random.choice([a for a, qs in enumerate(q[s]) if qs==np.max(q[s])])
def run_q(double=False,n_eps=300,eps=.1,lr=0.1,gm=1.):
if not double:
Q=[np.zeros(len(A_ACTIONS)),np.zeros(len(B_ACTIONS)),np.zeros(1)]
else:
Q1=[np.zeros(len(A_ACTIONS)),np.zeros(len(B_ACTIONS)),np.zeros(1)]
Q2=[np.zeros(len(A_ACTIONS)),np.zeros(len(B_ACTIONS)),np.zeros(1)]
r_all,stp_all,left_all=[],[],[]
for ep in range(n_eps):
r_sum,done=0,False
s=START
stp_cnt=0
left_cnt=0
while not done:
if not double:
a=e_greedy(eps,Q,s)
else:
a=e_greedy(eps,[q1+q2 for q1,q2 in zip(Q1,Q2)],s)
if s==A:
stp_cnt+=1
if a==1:
left_cnt+=1
s_,r,done=step(s,a)
if not double:
delta=r+gm*np.max(Q[s_])-Q[s][a]
Q[s][a]+=lr*delta
else:
if np.random.choice([0,1])==1:
a_max=np.random.choice([a for a, q in enumerate(Q1[s_]) if q==np.max(Q1[s_])])
delta=r+gm*Q2[s_][a_max]-Q1[s][a]
Q1[s][a]+=lr*delta
else:
a_max=np.random.choice([a for a, q in enumerate(Q2[s_]) if q==np.max(Q2[s_])])
delta=r+gm*Q1[s_][a_max]-Q2[s][a]
Q2[s][a]+=lr*delta
s=s_
r_sum+=r
r_all.append(r_sum)
stp_all.append(stp_cnt)
left_all.append(left_cnt)
#if ep%10==0:
# print(f'ep:{ep}, stps:{stp}, ret:{r_sum}')
if not double:
return Q,r_all,left_all,stp_all
else:
return Q1,Q2,r_all,left_all,stp_all
n_runs=1000
left_all=[]
stp_all=[]
for n in range(n_runs):
Q,r_q,left,stp=run_q()
left_all.append(left)
stp_all.append(stp)
plt.figure(figsize=(8,6))
plt.rcParams['font.size']='14'
plt.plot((np.array(left_all)/np.array(stp_all)).mean(axis=0),label='Q-learning',linewidth=2)
left_all=[]
stp_all=[]
for n in range(n_runs):
Q1,Q2,r_dq,left,stp=run_q(double=True)
left_all.append(left)
stp_all.append(stp)
plt.plot((np.array(left_all)/np.array(stp_all)).mean(axis=0),label='Double Q-learning',linewidth=2)
plt.axhline(0.05,color='g',linestyle='--',label='optimal',linewidth=2)
plt.legend()
plt.grid()
plt.xlabel('Episodes')
plt.ylabel('% left actions from state A')
plt.savefig('doubleq_maxbias.png',dpi=350)
The above corresponds to Figure 6.5
References
Reinforcement Learning: An Introduction, 2nd edition, Chapter 6, by Sutton and Barto
ShangtongZhang/reinforcement-learning-an-introduction
RL 2nd Edition Exercise Solutions
Deep Reinforcement Learning - Julien Vitay