CartPole is one of the oldest and most classical RL-for-control benchmarks, appearing as early as [Barto et al. 1983]. Many RL algorithms, from value-based to policy-based and from model-free to model-based, have been validated on it or its variants, spanning simulations and real robots, e.g. [PILCO 2011].

This post collects insights derived from the performance of classical RL algorithms on the CartPole environment. The algorithm descriptions are in Basic Reinforcement Learning, Temporal-Difference Learning, and Policy Gradient and Actor-Critic. IPython notebooks are available at [Cartpole Q, SARSA] and [Cartpole REINFORCE].

(Temporary) Related algorithms:

  • Q-learning (box, bins): discrete states, discrete actions
  • SARSA(λ): linear function approximator for continuous states, discrete actions
  • DQN: non-linear function approximator for continuous states, discrete actions
  • REINFORCE with and without baseline: linear Gaussian policy for continuous states, continuous actions
  • and more

Temporary results (10 runs each) on CartPole-v0:

Q: [162]-Box, RBF: [6,6,9,9], DQN: 3x[200 hidden]-MLP

Temporary Summary:

  • SARSA(\(\lambda\)) achieved the best sample efficiency; DQN reached relatively higher returns; REINFORCE suffers from high variance, which can be partially relieved by adding a baseline

  • Q-learning with box discretization and with binning requires more episodes (3000 or more) to converge, since its Q-values have not yet reached 100 as with RBF and DQN

  • a reasonable number of equal-width bins per state variable is around 15, but the overall performance of binning is worse than box discretization; this suggests that state partitions have to be designed carefully in the discrete-state case

  • in the heatmaps of the learned values at x=0 and x_dot=0, the highest values learned by RBF and DQN lie on the diagonals away from the center; this might be caused by the binary discrete actions (checking the continuous-action case is left for future work)

  • finer RBF networks may not approximate the value function better, but they can achieve better sample efficiency

  • DQN requires a smaller learning rate for stable learning, say 0.01 or 0.001, compared to 0.1 for Q-box and SARSA(\(\lambda\))

  • a 3-layer network captures a reasonably good value approximation compared to a 2-layer or a wider 1-layer network

  • keeping the discount factor at 0.99 works well

Future work:

  • analyze and visualize weights and gradients of DQN and RBF

  • tune NN capacities, check different activation functions, e.g. SiLU or try NRBF

  • increase the difficulty of the task by raising the angle threshold to 180 deg or by using longer episodes

  • test different reward functions

  • test different algorithms with continuous controllers

Problem Setup

[OpenAI CartPole-v0]: an episode ends after at most 200 steps
[OpenAI CartPole-v1]: an episode ends after at most 500 steps
[Env source code-v1] [Env Continuous actions]

| Num | Observation | Min | Max |
|-----|-------------|-----|-----|
| 0 | Cart Position | -4.8 | 4.8 |
| 1 | Cart Velocity | -Inf | Inf |
| 2 | Pole Angle | ~ -0.418 rad (-24°) | ~ 0.418 rad (24°) |
| 3 | Pole Angular Velocity | -Inf | Inf |

| Num | Action |
|-----|--------|
| 0 | Push cart to the left with 10 N |
| 1 | Push cart to the right with 10 N |

State variables: (4,)

  • \(x\) - cart position
  • \(\dot{x}\) - cart velocity
  • \(\theta\) - pole angle
  • \(\dot{\theta}\) - pole angular velocity

Action variables: (2,)

Reward: +1 per step until termination

Termination:

  • \(\theta < -15^\circ\) or \(\theta > 15^\circ\) (15° = 0.2618 rad)
  • \(x<-2.4\) or \(x>2.4\)
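
For reference, a minimal random-policy rollout under the (old) gym step API used throughout this post; a sketch, assuming gym and CartPole-v0 are installed:

import gym

env=gym.make("CartPole-v0")
s=env.reset()                     # s = [x, x_dot, theta, theta_dot]
done,ret=False,0
while not done:
    a=env.action_space.sample()   # random action in {0, 1}
    s,r,done,_=env.step(a)        # +1 reward per step until termination
    ret+=r
print(ret)                        # episode return, at most 200 in CartPole-v0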

Q-box

The Box system [Michie et al. 1968] divides the state variables into 3x3x6x3 = 162 boxes as follows:

| variable | range → box index offset |
|----------|--------------------------|
| \(x\) | <-0.8 → +0, (-0.8, 0.8) → +1, >0.8 → +2 |
| \(\dot{x}\) | <-0.5 → +0, (-0.5, 0.5) → +3, >0.5 → +6 |
| \(\theta\) (deg) | <-12 → +0, (-12, -1.5) → +9, (-1.5, 0) → +18, (0, 1.5) → +27, (1.5, 12) → +36, >12 → +45 |
| \(\dot{\theta}\) (deg/s) | <-50 → +0, (-50, 50) → +54, >50 → +108 |

Experimental setting:

  • State space: (4,) continuous -> (162,) discrete
  • Action space: (2,)
  • number of episodes: 2000
  • steps per episode: 200 in CartPole-v0
  • learning rate (lr): 0.1
  • discount factor (gm): 0.99
  • exploration factor (epsilon): starts at 1, epsilon *= epsilon_decay_rate after each episode
  • exploration decay rate (epsilon decay rate): 0.995

Learned optimal Values:

Q-learning implementation with box system:

import numpy as np
import gym
import random

def get_box(s):

    # map the continuous state to one of the 162 boxes (3 x 3 x 6 x 3) by summing per-variable offsets
    x,x_dot,theta,theta_dot=s
    if x < -.8:
        box_idx = 0
    elif x < .8:
        box_idx = 1
    else:
        box_idx = 2

    if x_dot < -.5:
        pass
    elif x_dot < .5:
        box_idx += 3
    else:
        box_idx += 6

    if theta < np.radians(-12):
        pass
    elif theta < np.radians(-1.5):
        box_idx += 9
    elif theta < np.radians(0):  
        box_idx += 18
    elif theta < np.radians(1.5):
        box_idx += 27
    elif theta < np.radians(12):
        box_idx += 36
    else:
        box_idx += 45

    if theta_dot < np.radians(-50):
        pass
    elif theta_dot < np.radians(50):
        box_idx += 54
    else:
        box_idx += 108

    return box_idx

def e_greedy(q,epsilon,n_a):

    if random.random()<epsilon:
        a=np.random.randint(0,n_a)
    else:
        a=np.argmax(q)

    return a

def run_qbox(n_eps=2000,n_stps=200,gm=0.99,lr=0.1,epsilon=1,epsilon_rate=0.995):

    env=gym.make("CartPole-v0")
    n_s=env.observation_space.shape[0]
    n_a=env.action_space.n

    Q=np.random.rand(162,n_a)
    r_all,stp_all=[],[]
    q_all=[]

    for ep in range(n_eps):

        s=env.reset()
        s_int=get_box(s)
        r_sum=0

        for stp in range(n_stps):

            q_all.append(Q[s_int])
            a=e_greedy(Q[s_int],epsilon,n_a)
            s_,r,done,_=env.step(a)
            s_int_=get_box(s_)

            delta=r+gm*np.max(Q[s_int_])-Q[s_int,a]
            Q[s_int,a]+=lr*delta

            s=s_
            s_int=s_int_
            r_sum+=r

            if done:
                break

        r_all.append(r_sum)
        stp_all.append(stp)
        epsilon*=epsilon_rate
        #if ep%(n_eps//5)==0 or ep==n_eps-1:
        #    print(f"ep:{ep}, stp:{stp}, r:{np.round(r_sum,2)},eps:{epsilon}")

    return r_all,stp_all,q_all

r,s,q=run_qbox()
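
As a quick sanity check (a minimal sketch, not part of the original notebook), the all-zero state should land in box 1 + 3 + 27 + 54 = 85:

s0=np.zeros(4)        # x = x_dot = theta = theta_dot = 0
print(get_box(s0))    # 1 (x) + 3 (x_dot) + 27 (theta) + 54 (theta_dot) = 85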

Q-bins

We can also divide each state variable into equal-width bins.

Experimental setting:

  • n_bins: number of bins for each state variable
  • State space: (4,) continuous -> (n_bins^4,)
  • exploration decaying rate (epsilon decay rate): 0.999
  • others remain the same as in Q-box
def discretize(s,bins):
    # map each state variable to its bin index; -1 turns the digitize bin into a 0-based index
    idx=[]
    for i in range(len(bins)):
        idx.append(np.digitize(s[i],bins[i])-1)
    return tuple(idx)

def run_qbin(n_bins=15,n_eps=2000,n_stps=200,gm=0.99,lr=0.1,eps=1,eps_decay=0.999):

    env=gym.make('CartPole-v0')
    n_a=env.action_space.n
    n_s=env.observation_space.shape[0]
    bins=[np.linspace(-2.4,2.4,n_bins),
          np.linspace(-2,2,n_bins),
          np.linspace(-.2618,.2618,n_bins),
          np.linspace(-2,2,n_bins)]

    #shape (n_bins,n_bins,n_bins,n_bins,n_a)
    #i.e. (15,15,15,15,2)
    Q=np.zeros(([len(bins[0])]*n_s+[n_a]))

    r_all,stp_all,q_all=[],[],[]

    for ep in range(n_eps):

        r_sum,done=0,False
        s_int=discretize(env.reset(),bins)

        for stp in range(n_stps):

            q_all.append(Q[s_int])
            a=e_greedy(Q[s_int],eps,n_a)
            s_,r,done,_=env.step(a)
            s_int_=discretize(s_,bins)

            delta=r+gm*np.max(Q[s_int_])-Q[s_int][a]
            Q[s_int][a]+=lr*delta

            s=s_
            s_int=s_int_
            r_sum+=r

            if done:
                break

        eps*=eps_decay
        if ep%(n_eps//10)==0 or ep==n_eps-1:
            print(f"ep:{ep}, stp:{stp}, r:{np.round(r_sum,2)}, eps:{eps}")

        r_all.append(r_sum)
        stp_all.append(stp)

    return r_all,stp_all,q_all

r,s,q=run_qbin(n_bins=15)
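
A similarly hedged sanity check for the binning, reusing the bin edges defined in run_qbin:

bins=[np.linspace(-2.4,2.4,15),
      np.linspace(-2,2,15),
      np.linspace(-.2618,.2618,15),
      np.linspace(-2,2,15)]
print(discretize(np.zeros(4),bins))   # the zero state falls into (roughly) the middle bins, e.g. (7, 7, 7, 7)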

SARSA(λ)-rbf

We use RBF networks to approximate the Q function

RBF networks:

\[\begin{align*} \Phi_s(i)=\exp \left(-\frac{\|s-c_i\|^2}{2\sigma_i^2} \right) \end{align*}\]

where \(c_i\) is the \(i\)th center point and \(\sigma_i\) is the \(i\)th standard deviation

Normalized RBF networks:

\[\begin{align*} \Phi_s(i)= \frac{\exp \left(-\frac{\|s-c_i\|^2}{2\sigma_i^2} \right)}{\sum_i \exp \left(-\frac{\|s-c_i\|^2}{2\sigma_i^2} \right)} \end{align*}\]
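
The implementation below uses the unnormalized features; a minimal sketch of the normalized (NRBF) variant, assuming hypothetical joint centers of shape (n_feature, 4) and per-center widths sigmas of shape (n_feature,) rather than the per-dimension outer-product construction in get_feature below, could look like:

import numpy as np

def get_nrbf_feature(s,centers,sigmas):
    # Gaussian activation of each joint center, then normalize so the features sum to 1
    phi=np.exp(-np.sum((s-centers)**2,axis=1)/(2*sigmas**2))
    return phi/np.sum(phi)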

RBF parameters:

  • n_rbf: [3,3,5,5]
  • n_feature: 3x3x5x5=225
  • s_range_low: [-2.4,-4,-np.radians(30),-np.radians(180)]
  • s_range_high: [2.4,4,np.radians(30),np.radians(180)]

RBF basis functions plotted w.r.t. the above parameters:

Experimental setting:

  • State space to Feature space: (4,) continuous -> (225,) continuous
  • eligibility trace factor (lmd): 0.5
  • others remain the same as in Q-box
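
For reference, the linear SARSA(\(\lambda\)) update implemented in run_sarsalmd below, with replacing traces on the taken action:

\[\begin{align*} \delta_t &= r_{t+1} + \gamma\, \mathbf{w}_{a_{t+1}}^T \Phi_{s_{t+1}} - \mathbf{w}_{a_t}^T \Phi_{s_t}, \\ \mathbf{e}_{a_t} &\leftarrow \Phi_{s_t}, \qquad \mathbf{w} \leftarrow \mathbf{w} + \alpha\, \delta_t\, \mathbf{e}, \qquad \mathbf{e} \leftarrow \gamma \lambda\, \mathbf{e} \end{align*}\]

At terminal steps the bootstrap term is dropped, matching delta=r-Q_old[a] in the code.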

Customized Gym Env:

  • In the original gym env, the state is initialized from U(-0.05, 0.05) for all 4 state variables; this small range does not encourage thorough state visits in terms of exploration, so we changed it to U(-0.26, 0.26) for all 4 states (it could further be widened to the full range of each state)
  • also, in self.theta_threshold_radians = 15 * 2 * math.pi / 360, the 15 can be changed to 180 to test the properties of a continuous controller (a further topic)

Learned optimal Values:

SARSA(λ) implementation:

import numpy as np
import pandas as pd
import gym
import math
import random
import matplotlib.pyplot as plt

def plot_rbf(s_range,center,sigma):
    plt.figure(figsize=(10,10))
    title=['x','x_dot','theta','theta_dot']
    for i in range(1,5):
        plt.subplot(2,2,i)
        x=np.linspace(s_range[0,i-1],s_range[1,i-1],num=100)
        y={}
        for j in range(len(center[i-1])):
            y[j]=np.exp(-(x-center[i-1][j])**2/(2*sigma[i-1]**2))
            plt.plot(x,y[j])
        plt.title(title[i-1])

    plt.savefig('rbf.png',dpi=350)

def get_feature(s,center,sigma):

    # per-dimension RBF activations combined via outer products into a 225-d feature vector
    rbf={}
    f=1 #feature
    for i in range(4):
        rbf[i]=np.exp(-(s[i]-center[i])**2/(2*sigma[i]**2))
        f=np.outer(f,rbf[i])
        f=f.ravel()

    return f

def get_Q(F,w):
    Q=np.dot(w.T,F)
    return Q

def get_Q_a(F,a,w):
    Q=np.dot(w[:,a],F)
    return Q

def e_greedy(e,Q,env):
    rand=np.random.random()
    if rand<1.-e:
        a=Q.argmax()
    else:
        a=env.action_space.sample()
    return int(a)

def get_v(x,x_dot,w,s_range,center,sigma):
    # evaluate max_a Q on a 50x50 grid over (theta, theta_dot) at fixed (x, x_dot)
    n_grid=50
    v=np.zeros((n_grid,n_grid))
    the=np.linspace(s_range[0,2]/2,s_range[1,2]/2,num=n_grid)
    the_dot=np.linspace(s_range[1,3],s_range[0,3],num=n_grid)

    for i in range(n_grid):
        for j in range(n_grid):
            s=np.array([x,x_dot,the[j],the_dot[i]])
            F=get_feature(s,center,sigma)
            v[i,j]=np.max(np.dot(w.T,F))

    return v

def plot_v(w,s_range,center,sigma,angle=15,angular=229):
    i=1
    x_range=[-2.4,0,2.4]
    x_dot_range=[-2,0,2]
    plt.figure(figsize=(10,10))
    for x_dot in reversed(x_dot_range):
        for x in x_range:
            v=get_v(x,x_dot,w,s_range,center,sigma)

            plt.subplot(3,3,i)
            plt.imshow(v,cmap='jet')
            plt.colorbar()
            plt.title('x='+str(x)+' x_dot='+str(x_dot))
            plt.xticks(np.arange(50,step=25),('-'+str(angle),'0',))
            plt.yticks(np.arange(50,step=25),(str(angular),'0',))
            plt.xlabel('theta')
            plt.ylabel('theta_dot')
            i+=1

    plt.savefig('q_op_rbf.png',dpi=350)

def initialize(env,n_s,n_a):

    s_range=np.zeros((2,n_s))
    s_range[0,:]=np.array([-2.4,-4,-np.radians(30),-np.radians(180)])
    s_range[1,:]=np.array([2.4,4,np.radians(30),np.radians(180)])

    n_rbf=np.array([3,3,5,5]).astype(int)
    n_feature=np.prod(n_rbf)
    w=np.zeros((n_feature,n_a))

    interval={}
    center={}
    sigma={}

    for i in range(n_s):
        interval[i]=(s_range[1,i]-s_range[0,i])/(n_rbf[i]-1)
        sigma[i]=interval[i]/2
        center[i]=[np.around(s_range[0,i]+j*interval[i],2) for j in range(n_rbf[i])]

    return s_range,center,sigma,n_feature,w

def run_sarsalmd(n_eps=2000,n_stps=200,gm=0.99,lr=0.1,lmd=0.5,epsilon=1,epsilon_rate=0.995):

    env=gym.make("CartPole-v0")
    n_s=env.observation_space.shape[0]
    n_a=env.action_space.n

    s_range,center,sigma,n_feature,w=initialize(env,n_s,n_a)
    plot_rbf(s_range,center,sigma)

    r_all,stp_all,q_all=[],[],[]

    for ep in range(n_eps):

        r_sum,done=0,False
        #eligibility traces
        e=np.zeros((n_feature,n_a))
        F=get_feature(env.reset(),center,sigma)
        Q_old=get_Q(F,w)
        a=e_greedy(epsilon,Q_old,env)

        for stp in range(n_stps):

            #show animation of last 5 episodes
            #if ep>n_eps-5:
            #    env.render()

            q_all.append(Q_old)
            s_,r,done,_=env.step(a)
            F_=get_feature(s_,center,sigma)
            Q=get_Q(F_,w)
            a_=e_greedy(epsilon,Q,env)

            if done:
                delta=r-Q_old[a]
            else:
                delta=r+gm*Q[a_]-Q_old[a]

            e[:,a]=F

            # vectorized weight update (equivalent to looping over all features and actions)
            w+=lr*delta*e

            e*=gm*lmd

            s=s_
            F=F_
            a=a_
            Q_old=Q
            r_sum+=r

            if done:
                break

        r_all.append(r_sum)
        stp_all.append(stp)
        epsilon*=epsilon_rate
        #print(f"ep:{ep}, stp:{stp}, r:{np.round(r_sum,2)}, eps:{epsilon}")
    plot_v(w,s_range,center,sigma)
    return r_all,stp_all,q_all

r,s,q=run_sarsalmd()

DQN

We use a deep neural network to approximate the Q values.

NN parameters:

  • MLP hidden layers: 3
  • hidden units: 200 per layer
  • activation: ReLU
  • state input: (4,)
  • action output: (2,)

DQN Experimental setting:

  • buffer size: 40000
  • batch size: 512
  • update target: every 10 episodes
  • number of episodes: 3000
  • steps per episode: 1000
  • lr: 0.001
  • gm: 0.99
  • eps: from 1.
  • eps decay: 0.995
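
For reference, NN.update() below regresses the online network toward the standard DQN target computed with the periodically synced target network:

\[\begin{align*} y = \begin{cases} r & \text{if terminal}\\ r + \gamma \max_{a'} Q_{\text{target}}(s',a') & \text{otherwise} \end{cases}, \qquad \mathcal{L} = \left(Q(s,a) - y\right)^2 \end{align*}\]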

Customized Gym Env:

  • the initial pole angle is sampled from U(-15 deg, 15 deg)
  • the episode ends when the pole angle leaves [-30 deg, 30 deg]

Learned optimal Values:

Implementation:

import random
import copy
import math
import numpy as np
import pandas as pd
from collections import deque

import gym
import torch
import torch.nn as nn
import torch.nn.functional as F

class DQN(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1=nn.Linear(4, 200)
        self.fc2=nn.Linear(200, 200)
        self.fc3=nn.Linear(200, 200)
        self.fc4=nn.Linear(200, 2)

    def forward(self, x):
        x=self.fc1(x)
        x=F.relu(x) #SiLU
        x=self.fc2(x)
        x=F.relu(x)
        x=self.fc3(x)
        x=F.relu(x)
        x=self.fc4(x)
        return x

class NN:

    def __init__(self,batch_size=32,buffer_size=2000,**kw):

        self.batch_size=batch_size
        self.buffer=deque(maxlen=buffer_size)
        self.lr=kw['lr']
        self.gm=kw['gm']

    def initialize(self):

        self.model,self.model_tar=self.create()
        self.criterion=torch.nn.MSELoss()
        self.opt=torch.optim.Adam(self.model.parameters(),lr=self.lr)

    def create(self):

        model=DQN()
        model_tar=copy.deepcopy(model)

        return model,model_tar

    def update_target(self):

        self.model_tar.load_state_dict(self.model.state_dict())

    def update(self):

        q_batch,q_tar_batch=[],[]
        batch=random.sample(self.buffer,min(len(self.buffer),self.batch_size))
        for s,a,r,s_,done in batch:
            q=self.model(s)
            # DQN target: r if terminal, else r + gm * max_a' Q_target(s', a')
            q_tar=q.clone().detach()
            with torch.no_grad():
                q_tar[0][a]=r if done else r+self.gm*torch.max(self.model_tar(s_)[0])
            q_batch.append(q[0])
            q_tar_batch.append(q_tar[0])

        q_batch=torch.cat(q_batch)
        q_tar_batch=torch.cat(q_tar_batch)

        self.opt.zero_grad()
        loss=self.criterion(q_batch, q_tar_batch)
        loss.backward()
        self.opt.step()

        return loss.item()

    def add_memory(self,s,a,r,s_,done):

        r=torch.tensor(r)
        self.buffer.append((s,a,r,s_,done))

    def get_q(self,s):

        return self.model(s).detach().numpy()

    def get_tar(self,s):

        return self.model_tar(s).detach().numpy()

    def save_model(self,pre):

        torch.save(self.model.state_dict(), 'model_weights.pth')

class Agent:

    def __init__(self,env,buffer_size,batch_size,update_tar,n_eps,n_stps,eps,eps_decay,**kw):

        self.env=env
        self.nn=NN(batch_size=batch_size,
                   buffer_size=buffer_size,
                   lr=kw['lr'],
                   gm=kw['gm'])

        self.update_tar=update_tar
        self.n_eps=n_eps
        self.n_stps=n_stps

        self.epsilon=eps
        self.epsilon_decay=eps_decay

    def train(self,n=5,save_result=False,save_model=False,record_q=False):

        for i in range(n):
            self.train_once(save_result=save_result,
                            save_model=save_model,
                            record_q=record_q,
                            prefix='n'+str(i))

    def train_once(self,save_result=True,save_model=False,record_q=False,**kw):

        r_all,stp_all,stpCnt=[],[],0
        v_all,q_all=[],[]

        self.nn.initialize()

        for ep in range(self.n_eps):
            r_sum,done=0,False

            s=self.preprocess(self.env.reset())
            ent=self.epsilon

            for stp in range(self.n_stps):

                if ep>self.n_eps-5:
                    self.env.render()

                q=self.nn.get_q(s)
                a=self.e_greedy(s,ent)

                if record_q:
                    q_all.append(q[0])

                s_,r,done,_=self.env.step(a)
                s_=self.preprocess(s_)

                self.nn.add_memory(s,a,r,s_,done)

                s=s_
                r_sum+=r
                stpCnt+=1

                if done:
                    break

            if ep%self.update_tar==0:
                self.nn.update_target()
                print('Target Updated!')

            loss=self.nn.update()
            self.epsilon*=self.epsilon_decay

            print("Ep:"+str(ep)+" Stps:"+str(stp)+" R:"+str(r_sum)+" loss:"+str(loss)+" ent:"+str(ent))

            r_all.append(np.round(r_sum,2))
            stp_all.append(stp)

        self.env.close()

        if save_result:
            pre=kw['prefix']
            result=pd.DataFrame(zip(stp_all,r_all),columns=['step','return'])
            result.to_csv(pre+'_result.csv',index=False)

        if record_q:
            pre=kw['prefix']
            np.save(pre+'_q.npy',q_all)

        if save_model:
            pre=kw['prefix']
            self.nn.save_model(pre)

    def preprocess(self,s):
        return torch.tensor(np.reshape(s, [1, 4]), dtype=torch.float32)

    def e_greedy(self, s, epsilon):
        if np.random.random() <= epsilon:
            return self.env.action_space.sample()
        else:
            with torch.no_grad():
                return int(torch.argmax(self.nn.model(s)).item())

def main():
    agt=Agent(env=gym.make('CartPole-v0'),
              buffer_size=40000,
              batch_size=512,
              update_tar=10,
              n_eps=2000,
              n_stps=200,
              eps=1.0,
              eps_decay=0.995,
              lr=0.001,
              gm=0.99,)

    agt.train(n=1,
              save_result=True,
              save_model=True,
              record_q=True)

if __name__ == '__main__':

    main()
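
A minimal sketch (assuming save_model wrote 'model_weights.pth' as above) for reloading the trained network and running one greedy episode:

model=DQN()
model.load_state_dict(torch.load('model_weights.pth'))
model.eval()

env=gym.make('CartPole-v0')
s=torch.tensor(env.reset().reshape(1,4),dtype=torch.float32)
done,ret=False,0
while not done:
    with torch.no_grad():
        a=int(torch.argmax(model(s)).item())   # greedy action, no exploration
    s_,r,done,_=env.step(a)
    s=torch.tensor(s_.reshape(1,4),dtype=torch.float32)
    ret+=r
print(ret)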

REINFORCE and baseline

We directly learn a linearly parameterized Gaussian policy, in contrast to the epsilon-greedy policies derived from a learned value function in the methods above:

\[\pi(a \mid s, \boldsymbol{\theta})=\mathcal{N}\left(\mathbf{w}_\mu^T X, \left[\exp(\mathbf{w}_\sigma^T X)\right]^2\right)\]

where \(X=[x,\dot{x},\theta,\dot{\theta}]^T\)

Note:

  • since the Gaussian policy is stochastic, exploration comes from sampling actions with the Gaussian variance \(\sigma^2\) rather than from an epsilon-greedy scheme

  • REINFORCE requires the policy to be differentiable, and we have to manually derive the log-derivative of the policy for the gradient update (sketched below)

  • we simply use an average-return baseline
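
The log-derivatives used in get_dlog below follow from the Gaussian density with \(\mu = \mathbf{w}_\mu^T X\) and \(\sigma = \exp(\mathbf{w}_\sigma^T X)\):

\[\begin{align*} \nabla_{\mathbf{w}_\mu} \log \pi(a \mid s) &= \frac{a-\mu}{\sigma^2}\, X \\ \nabla_{\mathbf{w}_\sigma} \log \pi(a \mid s) &= \left( \frac{(a-\mu)^2}{\sigma^2} - 1 \right) X \end{align*}\]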

See this post for more details Policy Gradient and Actor Critic

Customized Gym Env:

The continuous action env is from this link

Experimental setting:

  • State space: (4,) continuous
  • Action space: (1,) continuous
  • learning rate (lr): 0.0001
  • others remain the same as in Q-box
def Gaussian_policy(w_mu,w_sig,s):

    # sample a continuous action a ~ N(mu, sig^2), with mu linear in the state
    mu=w_mu.T.dot(s)[0]
    upper=w_sig.T.dot(s)
    sig=np.exp(upper-np.max(upper))[0] # upper is a single value here, so sig stays at exp(0)=1

    return np.random.normal(mu,sig)[0],mu[0],sig[0]

def get_dlog(mu,sig,s,a):

    dlog_mu=((a-mu)/(sig**2))*s
    dlog_sig=(((a-mu)**2/sig**2)-1)*s

    return dlog_mu,dlog_sig

def get_return(rewards,gm):

    R=np.zeros(len(rewards))
    R[-1]=rewards[-1]
    for i in range(2,len(R)+1):
        R[-i]=gm*R[-i+1]+rewards[-i]

    return R

def run_reinforce(n_eps=2000,n_stps=200,gm=0.99,lr=0.0001,baseline=False):

    from env_continuous import ContinuousCartPoleEnv
    env=ContinuousCartPoleEnv()

    w_mu=np.zeros((4,1))
    w_sig=np.zeros((4,1))

    r_all,s_all=[],[]
    for ep in range(n_eps):
        stp,r_sum,done=0,0,False
        states,actions,rewards,mus,sigs=[],[],[],[],[]
        s=env.reset().reshape((4,1))

        for stp in range(n_stps):
            a,mu,sig=Gaussian_policy(w_mu,w_sig,s)
            s_,r,done,_=env.step(a)
            s_=s_.reshape((4,1))

            states.append(s)
            actions.append(a)
            rewards.append(r)
            mus.append(mu)
            sigs.append(sig)

            s=s_
            stp+=1

            if done:
                break

        R=get_return(rewards,gm)

        #update policy parameters
        gmt=1
        for i in range(len(rewards)):
            dlog_mu,dlog_sig=get_dlog(mus[i],sigs[i],states[i],actions[i])
            if baseline:
                w_mu=w_mu+lr*gmt*(R[i]-sum(R)/len(R))*dlog_mu
                w_sig=w_sig+lr*gmt*(R[i]-sum(R)/len(R))*dlog_sig
            else:
                w_mu=w_mu+lr*gmt*(R[i])*dlog_mu
                w_sig=w_sig+lr*gmt*(R[i])*dlog_sig
            gmt*=gm

        #if ep%(n_eps//10)==0:
        #    print(f'ep:{ep}, R:{sum(rewards)}, stp:{stp}')

        r_all.append(sum(rewards))
        s_all.append(stp)

    return r_all,s_all

r_rf,s_rf=run_reinforce(baseline=False)
r_rfb,s_rfb=run_reinforce(baseline=True)
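
A small plotting sketch (matplotlib assumed) to compare the two runs with a moving average, which smooths the high-variance returns noted in the summary:

import numpy as np
import matplotlib.pyplot as plt

def smooth(x,k=50):
    # simple moving average over a window of k episodes
    return np.convolve(x,np.ones(k)/k,mode='valid')

plt.plot(smooth(r_rf),label='REINFORCE')
plt.plot(smooth(r_rfb),label='REINFORCE + baseline')
plt.xlabel('episode')
plt.ylabel('return (moving average)')
plt.legend()
plt.show()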

References

[Barto et al. 1983] Barto, Andrew G., Richard S. Sutton, and Charles W. Anderson. “Neuronlike adaptive elements that can solve difficult learning control problems.” IEEE transactions on systems, man, and cybernetics 5 (1983): 834-846.

[PILCO 2011] Deisenroth, Marc, and Carl E. Rasmussen. “PILCO: A model-based and data-efficient approach to policy search.” Proceedings of the 28th International Conference on Machine Learning (ICML-11). 2011.

[Michie et al. 1968] Michie, Donald, and Roger A. Chambers. “BOXES: An experiment in adaptive control.” Machine intelligence 2.2 (1968): 137-152.

[Doya 2000] Doya, Kenji. “Reinforcement learning in continuous time and space.” Neural computation 12.1 (2000): 219-245.

[Sutton’s C code]

[Java application of continuous RL in Cartpole]