CartPole is one of the oldest and most classical RL-for-control benchmarks, appearing as early as [Barto et al. 1983]. Many RL algorithms, from value-based to policy-based and from model-free to model-based, have been validated on it or its variants, spanning simulations and real robots, e.g. [PILCO 2011].

This post collects insights derived from the performance of classical RL algorithms on the CartPole environment. The algorithm descriptions are in Basic Reinforcement Learning, Temporal-Difference Learning, and Policy Gradient and Actor-Critic. IPython notebooks are available at [Cartpole Q, SARSA] and [Cartpole REINFORCE].

(Temporary) Related algorithms:

  • Q-learning (box, bins): discrete states, discrete actions
  • SARSA(λ): linear function approximator for continuous states, discrete actions
  • DQN: non-linear function approximator for continuous states, discrete actions
  • REINFORCE with and without baseline: linear Gaussian policy for continuous states, continuous actions
  • and more

Temporary results (10 runs each) on CartPole-v0:

Q: [162]-Box, RBF: [6,6,9,9], DQN: 3x[200 hidden]-MLP

Temporary Summary:

  • SARSA(\(\lambda\)) achieved the best sample efficiency; DQN reached relatively higher returns; REINFORCE suffers from high variance, which can be partially relieved by adding a baseline

  • Q-learning with box discretization and with binning requires more episodes (3000 or more) to converge, since its Q-values have not yet reached 100 as with RBF and DQN

  • a reasonable number of equal-width bins per state variable is around 15, but the overall performance of binning is worse than box discretization; this suggests that state partitions have to be designed carefully in the discrete-state case

  • in the heatmaps of the learned values at x=0 and x_dot=0, the highest values learned by RBF and DQN lie on the diagonals away from the center; this might be caused by the binary discrete actions (checking the continuous-action case is left for future work)

  • finer RBF networks may not approximate the value function better, but they can achieve better sample efficiency

  • DQN requires a smaller learning rate for stable learning, say 0.01 or 0.001, compared to 0.1 for Q-box and SARSA(\(\lambda\))

  • a 3-layer network captures a reasonably good value approximation compared to a 2-layer or a wider 1-layer network

  • keeping the discount factor at 0.99 works well

Future work:

  • analyze and visualize weights and gradients of DQN and RBF

  • tune NN capacities, check different activation functions, e.g. SiLU or try NRBF

  • increase the difficulty of the task by raising the angle threshold to 180 deg or by using longer episodes

  • test different reward functions

  • test different algorithms with continuous controllers

Problem Setup

[OpenAI CartPole-v0]: an episode ends after at most 200 steps
[OpenAI CartPole-v1]: an episode ends after at most 500 steps
[Env source code-v1] [Env Continuous actions]

| Num | Observation | Min | Max |
|-----|-------------|-----|-----|
| 0 | Cart Position | -4.8 | 4.8 |
| 1 | Cart Velocity | -Inf | Inf |
| 2 | Pole Angle | ~ -0.418 rad (-24°) | ~ 0.418 rad (24°) |
| 3 | Pole Angular Velocity | -Inf | Inf |

| Num | Action |
|-----|--------|
| 0 | Push cart to the left with 10 N |
| 1 | Push cart to the right with 10 N |

State variables: (4,)

  • \(x\) - cart position
  • \(\dot{x}\) - cart velocity
  • \(\theta\) - pole angle
  • \(\dot{\theta}\) - pole angular velocity

Action variables: (2,)

Reward: +1 per step until termination

Termination:

  • \(\theta < -15^\circ\) or \(\theta > 15^\circ\) (15° = 0.2618 rad)
  • \(x<-2.4\) or \(x>2.4\)
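
For reference, a minimal random-policy rollout under the (old) gym step API used throughout this post; a sketch, assuming gym and CartPole-v0 are installed:

import gym

env=gym.make("CartPole-v0")
s=env.reset()                     # s = [x, x_dot, theta, theta_dot]
done,ret=False,0
while not done:
    a=env.action_space.sample()   # random action in {0, 1}
    s,r,done,_=env.step(a)        # +1 reward per step until termination
    ret+=r
print(ret)                        # episode return, at most 200 in CartPole-v0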

Q-box

The Box system [Michie et al. 1968] divides the state variables into 3x3x6x3 = 162 boxes as follows:

| variable | range → box index offset |
|----------|--------------------------|
| \(x\) | <-0.8 → +0, (-0.8, 0.8) → +1, >0.8 → +2 |
| \(\dot{x}\) | <-0.5 → +0, (-0.5, 0.5) → +3, >0.5 → +6 |
| \(\theta\) (deg) | <-12 → +0, (-12, -1.5) → +9, (-1.5, 0) → +18, (0, 1.5) → +27, (1.5, 12) → +36, >12 → +45 |
| \(\dot{\theta}\) (deg/s) | <-50 → +0, (-50, 50) → +54, >50 → +108 |

Experimental setting:

  • State space: (4,) continuous -> (162,) discrete
  • Action space: (2,)
  • number of episodes: 2000
  • steps per episode: 200 in CartPole-v0
  • learning rate (lr): 0.1
  • discount factor (gm): 0.99
  • exploration factor (epsilon): starts at 1, epsilon *= epsilon_decay_rate after each episode
  • exploration decay rate (epsilon decay rate): 0.995

Learned optimal Values:

Q-learning implementation with box system:

import numpy as np
import gym
import random

def get_box(s):

    # map the continuous state to one of the 162 boxes (3 x 3 x 6 x 3) by summing per-variable offsets
    x,x_dot,theta,theta_dot=s
    if x < -.8:
        box_idx = 0
    elif x < .8:
        box_idx = 1
    else:
        box_idx = 2

    if x_dot < -.5:
        pass
    elif x_dot < .5:
        box_idx += 3
    else:
        box_idx += 6

    if theta < np.radians(-12):
        pass
    elif theta < np.radians(-1.5):
        box_idx += 9
    elif theta < np.radians(0):  
        box_idx += 18
    elif theta < np.radians(1.5):
        box_idx += 27
    elif theta < np.radians(12):
        box_idx += 36
    else:
        box_idx += 45

    if theta_dot < np.radians(-50):
        pass
    elif theta_dot < np.radians(50):
        box_idx += 54
    else:
        box_idx += 108

    return box_idx

def e_greedy(q,epsilon,n_a):

    if random.random()<epsilon:
        a=np.random.randint(0,n_a)
    else:
        a=np.argmax(q)

    return a

def run_qbox(n_eps=2000,n_stps=200,gm=0.99,lr=0.1,epsilon=1,epsilon_rate=0.995):

    env=gym.make("CartPole-v0")
    n_s=env.observation_space.shape[0]
    n_a=env.action_space.n

    Q=np.random.rand(162,n_a)
    r_all,stp_all=[],[]
    q_all=[]

    for ep in range(n_eps):

        s=env.reset()
        s_int=get_box(s)
        r_sum=0

        for stp in range(n_stps):

            q_all.append(Q[s_int])
            a=e_greedy(Q[s_int],epsilon,n_a)
            s_,r,done,_=env.step(a)
            s_int_=get_box(s_)

            delta=r+gm*np.max(Q[s_int_])-Q[s_int,a]
            Q[s_int,a]+=lr*delta

            s=s_
            s_int=s_int_
            r_sum+=r

            if done:
                break

        r_all.append(r_sum)
        stp_all.append(stp)
        epsilon*=epsilon_rate
        #if ep%(n_eps//5)==0 or ep==n_eps-1:
        #    print(f"ep:{ep}, stp:{stp}, r:{np.round(r_sum,2)},eps:{epsilon}")

    return r_all,stp_all,q_all

r,s,q=run_qbox()
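
As a quick sanity check (a minimal sketch, not part of the original notebook), the all-zero state should land in box 1 + 3 + 27 + 54 = 85:

s0=np.zeros(4)        # x = x_dot = theta = theta_dot = 0
print(get_box(s0))    # 1 (x) + 3 (x_dot) + 27 (theta) + 54 (theta_dot) = 85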

Q-bins

We can also divide each state variable into equal-width bins.

Experimental setting:

  • n_bins: number of bins for each state variable
  • State space: (4,) continuous -> (n_bins^4,)
  • exploration decaying rate (epsilon decay rate): 0.999
  • others remain the same as in Q-box
def discretize(s,bins):
    # map each state variable to its bin index; -1 turns the digitize bin into a 0-based index
    idx=[]
    for i in range(len(bins)):
        idx.append(np.digitize(s[i],bins[i])-1)
    return tuple(idx)

def run_qbin(n_bins=15,n_eps=2000,n_stps=200,gm=0.99,lr=0.1,eps=1,eps_decay=0.999):

    env=gym.make('CartPole-v0')
    n_a=env.action_space.n
    n_s=env.observation_space.shape[0]
    bins=[np.linspace(-2.4,2.4,n_bins),
          np.linspace(-2,2,n_bins),
          np.linspace(-.2618,.2618,n_bins),
          np.linspace(-2,2,n_bins)]

    #shape (n_bins,n_bins,n_bins,n_bins,n_a)
    #i.e. (15,15,15,15,2)
    Q=np.zeros(([len(bins[0])]*n_s+[n_a]))

    r_all,stp_all,q_all=[],[],[]

    for ep in range(n_eps):

        r_sum,done=0,False
        s_int=discretize(env.reset(),bins)

        for stp in range(n_stps):

            q_all.append(Q[s_int])
            a=e_greedy(Q[s_int],eps,n_a)
            s_,r,done,_=env.step(a)
            s_int_=discretize(s_,bins)

            delta=r+gm*np.max(Q[s_int_])-Q[s_int][a]
            Q[s_int][a]+=lr*delta

            s=s_
            s_int=s_int_
            r_sum+=r

            if done:
                break

        eps*=eps_decay
        if ep%(n_eps//10)==0 or ep==n_eps-1:
            print(f"ep:{ep}, stp:{stp}, r:{np.round(r_sum,2)}, eps:{eps}")

        r_all.append(r_sum)
        stp_all.append(stp)

    return r_all,stp_all,q_all

r,s,q=run_qbin(n_bins=15)
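
A similarly hedged sanity check for the binning, reusing the bin edges defined in run_qbin:

bins=[np.linspace(-2.4,2.4,15),
      np.linspace(-2,2,15),
      np.linspace(-.2618,.2618,15),
      np.linspace(-2,2,15)]
print(discretize(np.zeros(4),bins))   # the zero state falls into (roughly) the middle bins, e.g. (7, 7, 7, 7)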

SARSA(λ)-rbf

We use RBF networks to approximate the Q function

RBF networks:

\[\begin{align*} \Phi_s(i)=\exp \left(-\frac{\|s-c_i\|^2}{2\sigma_i^2} \right) \end{align*}\]

where \(c_i\) is the \(i\)th center point and \(\sigma_i\) is the \(i\)th standard deviation

Normalized RBF networks:

\[\begin{align*} \Phi_s(i)= \frac{\exp \left(-\frac{\|s-c_i\|^2}{2\sigma_i^2} \right)}{\sum_i \exp \left(-\frac{\|s-c_i\|^2}{2\sigma_i^2} \right)} \end{align*}\]
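
The implementation below uses the unnormalized features; a minimal sketch of the normalized (NRBF) variant, assuming hypothetical joint centers of shape (n_feature, 4) and per-center widths sigmas of shape (n_feature,) rather than the per-dimension outer-product construction in get_feature below, could look like:

import numpy as np

def get_nrbf_feature(s,centers,sigmas):
    # Gaussian activation of each joint center, then normalize so the features sum to 1
    phi=np.exp(-np.sum((s-centers)**2,axis=1)/(2*sigmas**2))
    return phi/np.sum(phi)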

RBF parameters:

  • n_rbf: [3,3,5,5]
  • n_feature: 3x3x5x5=225
  • s_range_low: [-2.4,-4,-np.radians(30),-np.radians(180)]
  • s_range_high: [2.4,4,np.radians(30),np.radians(180)]

RBF basis functions plotted w.r.t. the above parameters:

Experimental setting:

  • State space to Feature space: (4,) continuous -> (225,) continuous
  • eligibility trace factor (lmd): 0.5
  • others remain the same as in Q-box
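
For reference, the linear SARSA(\(\lambda\)) update implemented in run_sarsalmd below, with replacing traces on the taken action:

\[\begin{align*} \delta_t &= r_{t+1} + \gamma\, \mathbf{w}_{a_{t+1}}^T \Phi_{s_{t+1}} - \mathbf{w}_{a_t}^T \Phi_{s_t}, \\ \mathbf{e}_{a_t} &\leftarrow \Phi_{s_t}, \qquad \mathbf{w} \leftarrow \mathbf{w} + \alpha\, \delta_t\, \mathbf{e}, \qquad \mathbf{e} \leftarrow \gamma \lambda\, \mathbf{e} \end{align*}\]

At terminal steps the bootstrap term is dropped, matching delta=r-Q_old[a] in the code.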

Customized Gym Env:

  • In the original gym env, the state is initialized from U(-0.05, 0.05) for all 4 state variables; this small range does not encourage thorough state visits in terms of exploration, so we changed it to U(-0.26, 0.26) for all 4 states (it could further be widened to the full range of each state)
  • also, in self.theta_threshold_radians = 15 * 2 * math.pi / 360, the 15 can be changed to 180 to test the properties of a continuous controller (a further topic)

Learned optimal Values:

SARSA(λ) implementation:

import numpy as np
import pandas as pd
import gym
import math
import random
import matplotlib.pyplot as plt

def plot_rbf(s_range,center,sigma):
    plt.figure(figsize=(10,10))
    title=['x','x_dot','theta','theta_dot']
    for i in range(1,5):
        plt.subplot(2,2,i)
        x=np.linspace(s_range[0,i-1],s_range[1,i-1],num=100)
        y={}
        for j in range(len(center[i-1])):
            y[j]=np.exp(-(x-center[i-1][j])**2/(2*sigma[i-1]**2))
            plt.plot(x,y[j])
        plt.title(title[i-1])

    plt.savefig('rbf.png',dpi=350)

def get_feature(s,center,sigma):

    # per-dimension RBF activations combined via outer products into a 225-d feature vector
    rbf={}
    f=1 #feature
    for i in range(4):
        rbf[i]=np.exp(-(s[i]-center[i])**2/(2*sigma[i]**2))
        f=np.outer(f,rbf[i])
        f=f.ravel()

    return f

def get_Q(F,w):
    Q=np.dot(w.T,F)
    return Q

def get_Q_a(F,a,w):
    Q=np.dot(w[:,a],F)
    return Q

def e_greedy(e,Q,env):
    rand=np.random.random()
    if rand<1.-e:
        a=Q.argmax()
    else:
        a=env.action_space.sample()
    return int(a)

def get_v(x,x_dot,w,s_range,center,sigma):
    # evaluate max_a Q on a 50x50 grid over (theta, theta_dot) at fixed (x, x_dot)
    n_grid=50
    v=np.zeros((n_grid,n_grid))
    the=np.linspace(s_range[0,2]/2,s_range[1,2]/2,num=n_grid)
    the_dot=np.linspace(s_range[1,3],s_range[0,3],num=n_grid)

    for i in range(n_grid):
        for j in range(n_grid):
            s=np.array([x,x_dot,the[j],the_dot[i]])
            F=get_feature(s,center,sigma)
            v[i,j]=np.max(np.dot(w.T,F))

    return v

def plot_v(w,s_range,center,sigma,angle=15,angular=229):
    i=1
    x_range=[-2.4,0,2.4]
    x_dot_range=[-2,0,2]
    plt.figure(figsize=(10,10))
    for x_dot in reversed(x_dot_range):
        for x in x_range:
            v=get_v(x,x_dot,w,s_range,center,sigma)

            plt.subplot(3,3,i)
            plt.imshow(v,cmap='jet')
            plt.colorbar()
            plt.title('x='+str(x)+' x_dot='+str(x_dot))
            plt.xticks(np.arange(50,step=25),('-'+str(angle),'0',))
            plt.yticks(np.arange(50,step=25),(str(angular),'0',))
            plt.xlabel('theta')
            plt.ylabel('theta_dot')
            i+=1

    plt.savefig('q_op_rbf.png',dpi=350)

def initialize(env,n_s,n_a):

    s_range=np.zeros((2,n_s))
    s_range[0,:]=np.array([-2.4,-4,-np.radians(30),-np.radians(180)])
    s_range[1,:]=np.array([2.4,4,np.radians(30),np.radians(180)])

    n_rbf=np.array([3,3,5,5]).astype(int)
    n_feature=np.prod(n_rbf)
    w=np.zeros((n_feature,n_a))

    interval={}
    center={}
    sigma={}

    for i in range(n_s):
        interval[i]=(s_range[1,i]-s_range[0,i])/(n_rbf[i]-1)
        sigma[i]=interval[i]/2
        center[i]=[np.around(s_range[0,i]+j*interval[i],2) for j in range(n_rbf[i])]

    return s_range,center,sigma,n_feature,w

def run_sarsalmd(n_eps=2000,n_stps=200,gm=0.99,lr=0.1,lmd=0.5,epsilon=1,epsilon_rate=0.995):

    env=gym.make("CartPole-v0")
    n_s=env.observation_space.shape[0]
    n_a=env.action_space.n

    s_range,center,sigma,n_feature,w=initialize(env,n_s,n_a)
    plot_rbf(s_range,center,sigma)

    r_all,stp_all,q_all=[],[],[]

    for ep in range(n_eps):

        r_sum,done=0,False
        #eligibility traces
        e=np.zeros((n_feature,n_a))
        F=get_feature(env.reset(),center,sigma)
        Q_old=get_Q(F,w)
        a=e_greedy(epsilon,Q_old,env)

        for stp in range(n_stps):

            #show animation of last 5 episodes
            #if ep>n_eps-5:
            #    env.render()

            q_all.append(Q_old)
            s_,r,done,_=env.step(a)
            F_=get_feature(s_,center,sigma)
            Q=get_Q(F_,w)
            a_=e_greedy(epsilon,Q,env)

            if done:
                delta=r-Q_old[a]
            else:
                delta=r+gm*Q[a_]-Q_old[a]

            e[:,a]=F

            # vectorized weight update (equivalent to looping over all features and actions)
            w+=lr*delta*e

            e*=gm*lmd

            s=s_
            F=F_
            a=a_
            Q_old=Q
            r_sum+=r

            if done:
                break

        r_all.append(r_sum)
        stp_all.append(stp)
        epsilon*=epsilon_rate
        #print(f"ep:{ep}, stp:{stp}, r:{np.round(r_sum,2)}, eps:{epsilon}")
    plot_v(w,s_range,center,sigma)
    return r_all,stp_all,q_all

r,s,q=run_sarsalmd()

DQN

We use a deep neural network to approximate the Q values.

NN parameters:

  • MLP hidden layers: 3
  • hidden units: 200 per layer
  • activation: ReLU
  • state input: (4,)
  • action output: (2,)

DQN Experimental setting:

  • buffer size: 40000
  • batch size: 512
  • update target: every 10 episodes
  • number of episodes: 3000
  • steps per episode: 1000
  • lr: 0.001
  • gm: 0.99
  • eps: from 1.
  • eps decay: 0.995
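
For reference, NN.update() below regresses the online network toward the standard DQN target computed with the periodically synced target network:

\[\begin{align*} y = \begin{cases} r & \text{if terminal}\\ r + \gamma \max_{a'} Q_{\text{target}}(s',a') & \text{otherwise} \end{cases}, \qquad \mathcal{L} = \left(Q(s,a) - y\right)^2 \end{align*}\]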

Customized Gym Env:

  • the initial pole angle is sampled from U(-15 deg, 15 deg)
  • the episode ends when the pole angle leaves [-30 deg, 30 deg]

Learned optimal Values:

Implementation:

import random
import copy
import math
import numpy as np
import pandas as pd
from collections import deque

import gym
import torch
import torch.nn as nn
import torch.nn.functional as F

class DQN(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1=nn.Linear(4, 200)
        self.fc2=nn.Linear(200, 200)
        self.fc3=nn.Linear(200, 200)
        self.fc4=nn.Linear(200, 2)

    def forward(self, x):
        x=self.fc1(x)
        x=F.relu(x) #SiLU
        x=self.fc2(x)
        x=F.relu(x)
        x=self.fc3(x)
        x=F.relu(x)
        x=self.fc4(x)
        return x

class NN:

    def __init__(self,batch_size=32,buffer_size=2000,**kw):

        self.batch_size=batch_size
        self.buffer=deque(maxlen=buffer_size)
        self.lr=kw['lr']
        self.gm=kw['gm']

    def initialize(self):

        self.model,self.model_tar=self.create()
        self.criterion=torch.nn.MSELoss()
        self.opt=torch.optim.Adam(self.model.parameters(),lr=self.lr)

    def create(self):

        model=DQN()
        model_tar=copy.deepcopy(model)

        return model,model_tar

    def update_target(self):

        self.model_tar.load_state_dict(self.model.state_dict())

    def update(self):

        q_batch,q_tar_batch=[],[]
        batch=random.sample(self.buffer,min(len(self.buffer),self.batch_size))
        for s,a,r,s_,done in batch:
            q=self.model(s)
            # DQN target: r if terminal, else r + gm * max_a' Q_target(s', a')
            q_tar=q.clone().detach()
            with torch.no_grad():
                q_tar[0][a]=r if done else r+self.gm*torch.max(self.model_tar(s_)[0])
            q_batch.append(q[0])
            q_tar_batch.append(q_tar[0])

        q_batch=torch.cat(q_batch)
        q_tar_batch=torch.cat(q_tar_batch)

        self.opt.zero_grad()
        loss=self.criterion(q_batch, q_tar_batch)
        loss.backward()
        self.opt.step()

        return loss.item()

    def add_memory(self,s,a,r,s_,done):

        r=torch.tensor(r)
        self.buffer.append((s,a,r,s_,done))

    def get_q(self,s):

        return self.model(s).detach().numpy()

    def get_tar(self,s):

        return self.model_tar(s).detach().numpy()

    def save_model(self,pre):

        torch.save(self.model.state_dict(), 'model_weights.pth')

class Agent:

    def __init__(self,env,buffer_size,batch_size,update_tar,n_eps,n_stps,eps,eps_decay,**kw):

        self.env=env
        self.nn=NN(batch_size=batch_size,
                   buffer_size=buffer_size,
                   lr=kw['lr'],
                   gm=kw['gm'])

        self.update_tar=update_tar
        self.n_eps=n_eps
        self.n_stps=n_stps

        self.epsilon=eps
        self.epsilon_decay=eps_decay

    def train(self,n=5,save_result=False,save_model=False,record_q=False):

        for i in range(n):
            self.train_once(save_result=save_result,
                            save_model=save_model,
                            record_q=record_q,
                            prefix='n'+str(i))

    def train_once(self,save_result=True,save_model=False,record_q=False,**kw):

        r_all,stp_all,stpCnt=[],[],0
        v_all,q_all=[],[]

        self.nn.initialize()

        for ep in range(self.n_eps):
            r_sum,done=0,False

            s=self.preprocess(self.env.reset())
            ent=self.epsilon

            for stp in range(self.n_stps):

                if ep>self.n_eps-5:
                    self.env.render()

                q=self.nn.get_q(s)
                a=self.e_greedy(s,ent)

                if record_q:
                    q_all.append(q[0])

                s_,r,done,_=self.env.step(a)
                s_=self.preprocess(s_)

                self.nn.add_memory(s,a,r,s_,done)

                s=s_
                r_sum+=r
                stpCnt+=1

                if done:
                    break

            if ep%self.update_tar==0:
                self.nn.update_target()
                print('Target Updated!')

            loss=self.nn.update()
            self.epsilon*=self.epsilon_decay

            print("Ep:"+str(ep)+" Stps:"+str(stp)+" R:"+str(r_sum)+" loss:"+str(loss)+" ent:"+str(ent))

            r_all.append(np.round(r_sum,2))
            stp_all.append(stp)

        self.env.close()

        if save_result:
            pre=kw['prefix']
            result=pd.DataFrame(zip(stp_all,r_all),columns=['step','return'])
            result.to_csv(pre+'_result.csv',index=False)

        if record_q:
            pre=kw['prefix']
            np.save(pre+'_q.npy',q_all)

        if save_model:
            pre=kw['prefix']
            self.nn.save_model(pre)

    def preprocess(self,s):
        return torch.tensor(np.reshape(s, [1, 4]), dtype=torch.float32)

    def e_greedy(self, s, epsilon):
        if np.random.random() <= epsilon:
            return self.env.action_space.sample()
        else:
            with torch.no_grad():
                return int(torch.argmax(self.nn.model(s)).item())

def main():
    agt=Agent(env=gym.make('CartPole-v0'),
              buffer_size=40000,
              batch_size=512,
              update_tar=10,
              n_eps=2000,
              n_stps=200,
              eps=1.0,
              eps_decay=0.995,
              lr=0.001,
              gm=0.99,)

    agt.train(n=1,
              save_result=True,
              save_model=True,
              record_q=True)

if __name__ == '__main__':

    main()
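
A minimal sketch (assuming save_model wrote 'model_weights.pth' as above) for reloading the trained network and running one greedy episode:

model=DQN()
model.load_state_dict(torch.load('model_weights.pth'))
model.eval()

env=gym.make('CartPole-v0')
s=torch.tensor(env.reset().reshape(1,4),dtype=torch.float32)
done,ret=False,0
while not done:
    with torch.no_grad():
        a=int(torch.argmax(model(s)).item())   # greedy action, no exploration
    s_,r,done,_=env.step(a)
    s=torch.tensor(s_.reshape(1,4),dtype=torch.float32)
    ret+=r
print(ret)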

REINFORCE and baseline

We directly learn a linearly parameterized Gaussian policy, in contrast to the epsilon-greedy policies derived from a learned value function in the methods above:

\[\pi(a \mid s, \boldsymbol{\theta})=\mathcal{N}\left(\mathbf{w}_\mu^T X, \left[\exp(\mathbf{w}_\sigma^T X)\right]^2\right)\]

where \(X=[x,\dot{x},\theta,\dot{\theta}]^T\)

Note:

  • since the Gaussian policy is stochastic, exploration comes from sampling actions with the Gaussian variance \(\sigma^2\) rather than from an epsilon-greedy scheme

  • REINFORCE requires the policy to be differentiable, and we have to manually derive the log-derivative of the policy for the gradient update (sketched below)

  • we simply use an average-return baseline
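
The log-derivatives used in get_dlog below follow from the Gaussian density with \(\mu = \mathbf{w}_\mu^T X\) and \(\sigma = \exp(\mathbf{w}_\sigma^T X)\):

\[\begin{align*} \nabla_{\mathbf{w}_\mu} \log \pi(a \mid s) &= \frac{a-\mu}{\sigma^2}\, X \\ \nabla_{\mathbf{w}_\sigma} \log \pi(a \mid s) &= \left( \frac{(a-\mu)^2}{\sigma^2} - 1 \right) X \end{align*}\]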

See this post for more details Policy Gradient and Actor Critic

Customized Gym Env:

The continuous action env is from this link

Experimental setting:

  • State space: (4,) continuous
  • Action space: (1,) continuous
  • learning rate (lr): 0.0001
  • others remain the same as in Q-box
def Gaussian_policy(w_mu,w_sig,s):

    # sample a continuous action a ~ N(mu, sig^2), with mu linear in the state
    mu=w_mu.T.dot(s)[0]
    upper=w_sig.T.dot(s)
    sig=np.exp(upper-np.max(upper))[0] # upper is a single value here, so sig stays at exp(0)=1

    return np.random.normal(mu,sig)[0],mu[0],sig[0]

def get_dlog(mu,sig,s,a):

    dlog_mu=((a-mu)/(sig**2))*s
    dlog_sig=(((a-mu)**2/sig**2)-1)*s

    return dlog_mu,dlog_sig

def get_return(rewards,gm):

    R=np.zeros(len(rewards))
    R[-1]=rewards[-1]
    for i in range(2,len(R)+1):
        R[-i]=gm*R[-i+1]+rewards[-i]

    return R

def run_reinforce(n_eps=2000,n_stps=200,gm=0.99,lr=0.0001,baseline=False):

    from env_continuous import ContinuousCartPoleEnv
    env=ContinuousCartPoleEnv()

    w_mu=np.zeros((4,1))
    w_sig=np.zeros((4,1))

    r_all,s_all=[],[]
    for ep in range(n_eps):
        stp,r_sum,done=0,0,False
        states,actions,rewards,mus,sigs=[],[],[],[],[]
        s=env.reset().reshape((4,1))

        for stp in range(n_stps):
            a,mu,sig=Gaussian_policy(w_mu,w_sig,s)
            s_,r,done,_=env.step(a)
            s_=s_.reshape((4,1))

            states.append(s)
            actions.append(a)
            rewards.append(r)
            mus.append(mu)
            sigs.append(sig)

            s=s_
            stp+=1

            if done:
                break

        R=get_return(rewards,gm)

        #update policy parameters
        gmt=1
        for i in range(len(rewards)):
            dlog_mu,dlog_sig=get_dlog(mus[i],sigs[i],states[i],actions[i])
            if baseline:
                w_mu=w_mu+lr*gmt*(R[i]-sum(R)/len(R))*dlog_mu
                w_sig=w_sig+lr*gmt*(R[i]-sum(R)/len(R))*dlog_sig
            else:
                w_mu=w_mu+lr*gmt*(R[i])*dlog_mu
                w_sig=w_sig+lr*gmt*(R[i])*dlog_sig
            gmt*=gm

        #if ep%(n_eps//10)==0:
        #    print(f'ep:{ep}, R:{sum(rewards)}, stp:{stp}')

        r_all.append(sum(rewards))
        s_all.append(stp)

    return r_all,s_all

r_rf,s_rf=run_reinforce(baseline=False)
r_rfb,s_rfb=run_reinforce(baseline=True)
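
A small plotting sketch (matplotlib assumed) to compare the two runs with a moving average, which smooths the high-variance returns noted in the summary:

import numpy as np
import matplotlib.pyplot as plt

def smooth(x,k=50):
    # simple moving average over a window of k episodes
    return np.convolve(x,np.ones(k)/k,mode='valid')

plt.plot(smooth(r_rf),label='REINFORCE')
plt.plot(smooth(r_rfb),label='REINFORCE + baseline')
plt.xlabel('episode')
plt.ylabel('return (moving average)')
plt.legend()
plt.show()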

References

[Barto et al. 1983] Barto, Andrew G., Richard S. Sutton, and Charles W. Anderson. “Neuronlike adaptive elements that can solve difficult learning control problems.” IEEE transactions on systems, man, and cybernetics 5 (1983): 834-846.

[PILCO 2011] Deisenroth, Marc, and Carl E. Rasmussen. “PILCO: A model-based and data-efficient approach to policy search.” Proceedings of the 28th International Conference on Machine Learning (ICML-11). 2011.

[Michie et al. 1968] Michie, Donald, and Roger A. Chambers. “BOXES: An experiment in adaptive control.” Machine intelligence 2.2 (1968): 137-152.

[Doya 2000] Doya, Kenji. “Reinforcement learning in continuous time and space.” Neural computation 12.1 (2000): 219-245.

[Sutton’s C code]

[Java application of continuous RL in Cartpole]