Action Selection in RL

a drawback of ε-greedy action selection: when it explores, it chooses equally among all actions

this means that it is as likely to choose the worst-appearing action as it is to choose the next-to-best action
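The uniform exploration described above can be sketched as follows, assuming the method in question is ε-greedy action selection. The function name `epsilon_greedy`, the value estimates in `q`, and the choice `epsilon=1.0` (always explore, to make the effect visible) are illustrative assumptions, not taken from the notes.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Pick an action index from estimated values q_values."""
    q_values = np.asarray(q_values, dtype=float)
    if rng.random() < epsilon:
        # explore: every action is equally likely, regardless of its estimate
        return int(rng.integers(len(q_values)))
    # exploit: take the action with the highest estimate
    return int(np.argmax(q_values))

rng = np.random.default_rng(0)
q = [1.0, 3.0, 2.0]  # illustrative value estimates
# epsilon=1.0 forces exploration on every step
picks = [epsilon_greedy(q, epsilon=1.0, rng=rng) for _ in range(3000)]
counts = np.bincount(picks, minlength=3)
print(counts)
```

While exploring, the worst action (index 0) is picked about as often as the best one (index 1), which is the weakness that value-weighted selection addresses.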

softmax action selection: choose action \(a\) on step \(t\) with probability given by a Gibbs, or Boltzmann, distribution

\[\frac{\exp(Q_t(a)/\tau)}{\sum_{b=1}^n \exp(Q_t(b)/\tau)}\]

\(\tau\) - temperature

high \(\tau\) causes the actions to be all (nearly) equiprobable

low \(\tau\) causes a greater difference in selection probability for actions that differ in their value estimates

\(\tau \rightarrow 0\) - same as greedy action selection
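A minimal sketch of the effect of the temperature, under the assumption that the value estimates are held in a plain list. The helper name `boltzmann_probs` and the values in `q` are hypothetical and chosen only for illustration.

```python
import numpy as np

def boltzmann_probs(q_values, tau):
    """Selection probabilities exp(Q(a)/tau) / sum_b exp(Q(b)/tau)."""
    z = np.asarray(q_values, dtype=float) / tau
    z = z - z.max()  # shift for numerical stability; does not change the result
    e = np.exp(z)
    return e / e.sum()

q = [1.0, 3.0, 2.0]  # illustrative value estimates
for tau in (100.0, 1.0, 0.01):
    # high tau: close to uniform; low tau: close to greedy
    print(tau, np.round(boltzmann_probs(q, tau), 3))

# sampling one action from the distribution at tau = 1
rng = np.random.default_rng(0)
action = rng.choice(len(q), p=boltzmann_probs(q, tau=1.0))
```

At `tau=100` the three probabilities are almost equal, while at `tau=0.01` essentially all of the probability mass sits on the highest-valued action.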

note: when the number of actions is 2, softmax reduces to the sigmoid (logistic) function applied to the difference of the two values
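The two-action case can be checked numerically. With scores \(a\) and \(b\), the first softmax output \(e^{a}/(e^{a}+e^{b})\) equals \(1/(1+e^{-(a-b)})\), which is the sigmoid of \(a-b\). The scores below are arbitrary illustrative values.

```python
import numpy as np

def softmax(x):
    # shift by the max for numerical stability
    e = np.exp(np.asarray(x, dtype=float) - np.max(x))
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

a, b = 0.7, -1.2  # two arbitrary scores
p_softmax = softmax([a, b])[0]  # probability of the first option
p_sigmoid = sigmoid(a - b)      # sigmoid of the score difference
print(np.isclose(p_softmax, p_sigmoid))
```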

“Any time we wish to represent a probability distribution over a discrete variable with n possible values, we may use the softmax function”

softmax is typically used as the output-layer activation when the neural network is configured to output N class labels (multi-class classification)

it can also be used as an activation function for a hidden layer, when the model internally needs to choose or weight multiple different inputs at a bottleneck or concatenation layer

softmax is a softened version of the argmax function, which returns the index of the largest value in a list

given [1,3,2]

hard max returns [0, 1, 0]

softmax returns [0.09, 0.67, 0.24]

import numpy as np

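def softmax(vector):
    # shift by the max before exponentiating to avoid overflow; the result is unchanged
    e = np.exp(np.asarray(vector) - np.max(vector))
    # normalize so the outputs sum to 1
    return e / e.sum()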

softmax([1,3,2])

>>array([0.09003057, 0.66524096, 0.24472847])

from scipy.special import softmax

softmax([1,3,2])

>>array([0.09003057, 0.66524096, 0.24472847])

encoded class labels (one-hot encoding):

class 0: [1,0,0]
class 1: [0,1,0]
class 2: [0,0,1]

in the softmax case, the predicted vector holds a probability for each class instead of a hard 0/1 label:

class 1: [0.09, 0.67, 0.24]

the error between the expected and predicted multinomial probability distributions is often calculated using cross-entropy; this error is then used to update the model
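As a rough sketch of that error calculation, the cross-entropy between the one-hot label for class 1 and the example prediction can be computed directly. The helper `cross_entropy` and its `eps` clipping parameter are assumptions added here for illustration (the clipping only guards against taking the log of zero).

```python
import numpy as np

def cross_entropy(expected, predicted, eps=1e-12):
    """Cross-entropy -sum(p * log(q)) between two discrete distributions."""
    expected = np.asarray(expected, dtype=float)
    # eps is a hypothetical safeguard: clip so log(0) never occurs
    predicted = np.clip(np.asarray(predicted, dtype=float), eps, 1.0)
    return float(-np.sum(expected * np.log(predicted)))

expected = [0, 1, 0]            # one-hot label for class 1
predicted = [0.09, 0.67, 0.24]  # softmax output from the example
loss = cross_entropy(expected, predicted)
print(loss)
```

Because the label is one-hot, only the probability assigned to the true class contributes, so the loss here is simply `-log(0.67)`; it shrinks toward zero as that probability approaches 1.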

we may want to convert the probabilities back into an integer-encoded class label, which can be done by taking the argmax of the probability vector
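One way to do the conversion, assuming the probabilities are held in a NumPy array, is `np.argmax`. The second row of `batch` is made up for illustration.

```python
import numpy as np

probs = np.array([0.09, 0.67, 0.24])  # softmax output from the example
label = int(np.argmax(probs))         # index of the most probable class
print(label)  # 1

# for a batch of predictions, take the argmax along the class axis
batch = np.array([[0.09, 0.67, 0.24],
                  [0.80, 0.15, 0.05]])
labels = np.argmax(batch, axis=1)
```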
