Curiosity
Background
Deal with sparse reward:
- curiosity-driven / intrinsic motivation
  - intrinsic curiosity model
  - curiosity in model-based RL
- curriculum learning
  - automatic generation of easy goals
  - learning to select easy tasks
- auxiliary tasks
- reward shaping
- dense reward
Curiosity [1]
curiosity can serve as an intrinsic reward signal to enable the agent to explore its env
curiosity helps an agent explore its env in the quest for new knowledge
curiosity is a mechanism for an agent to learn skills that might be helpful in future scenarios
formulate curiosity as the error in an agent’s ability to predict the consequence of its own actions in a visual feature space learned by a self-supervised inverse dynamics model
Intrinsic Reward:
- encourage the agent to explore “novel” states
- encourage the agent to perform actions that reduce the error/uncertainty in the agent’s ability to predict the consequence of its own actions
Difficulty:
- measuring “novelty” requires a statistical model of the distribution of the env states
- measuring prediction error/uncertainty requires building a model of env dynamics that predicts the next state given the current state and the action
Self-supervision:
Training an NN on a proxy inverse dynamics task of predicting the agent’s action given its current and next states
Two subsystems:
- a curiosity-driven intrinsic reward signal
- a policy that outputs a sequence of actions to maximize that reward signal
divide all sources that can modify the agent’s observations into 3 cases:
- things that can be controlled by the agent
- things that the agent cannot control but that can affect the agent (e.g., a vehicle driven by another agent)
- things out of the agent’s control that do not affect the agent (e.g., moving leaves)
A good feature space for curiosity should model cases 1 and 2 and be unaffected by case 3
Two sub-modules:
- an encoder that maps the raw state into a feature vector: st -> φ(st)
- an inverse model that takes φ(st) and φ(st+1) as input and outputs a_hat_t = g(st, st+1; θ_I), the predicted estimate of the action at
Loss function of inverse model:
min_{θ_I} L_I(a_hat_t, a_t)
the learned model g() is also known as the inverse dynamics model
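A minimal PyTorch sketch of the encoder φ and inverse model g. The flat observations, discrete actions, and layer sizes are assumptions for illustration, not the paper’s architecture (the paper encodes pixels with a convolutional network):

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps a raw state s to the feature vector phi(s)."""
    def __init__(self, obs_dim, feat_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ELU(),
                                 nn.Linear(64, feat_dim))

    def forward(self, s):
        return self.net(s)

class InverseModel(nn.Module):
    """g(.; theta_I): predicts logits for a_hat_t from phi(s_t), phi(s_{t+1})."""
    def __init__(self, feat_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * feat_dim, 64), nn.ELU(),
                                 nn.Linear(64, n_actions))

    def forward(self, phi_t, phi_t1):
        return self.net(torch.cat([phi_t, phi_t1], dim=-1))

def inverse_loss(action_logits, a_t):
    # L_I(a_hat_t, a_t): cross-entropy between predicted and taken actions;
    # its gradient trains both the inverse model and the encoder features
    return nn.functional.cross_entropy(action_logits, a_t)
```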
Loss function of feature prediction:
min_{θ_F} L_F(φ(st), φ_hat(st+1)) = 1/2 ||φ_hat(st+1) - φ(st+1)||^2_2
φ_hat(st+1) = f(φ(st), at; θ_F)
the learned model f() is known as the forward dynamics model
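A matching sketch of the forward model f and L_F, assuming the Encoder and feat_dim from the snippet above and one-hot encoded discrete actions:

```python
import torch
import torch.nn as nn

class ForwardModel(nn.Module):
    """f(phi(s_t), a_t; theta_F): predicts phi_hat(s_{t+1})."""
    def __init__(self, feat_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim + n_actions, 64), nn.ELU(),
                                 nn.Linear(64, feat_dim))

    def forward(self, phi_t, a_onehot):
        return self.net(torch.cat([phi_t, a_onehot], dim=-1))

def forward_loss(phi_hat_t1, phi_t1):
    # L_F = 1/2 ||phi_hat(s_{t+1}) - phi(s_{t+1})||^2_2, averaged over the batch
    return 0.5 * (phi_hat_t1 - phi_t1).pow(2).sum(dim=-1).mean()
```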
The intrinsic reward signal r^i_t is
r^i_t = η/2 ||φ_hat(st+1) - φ(st+1)||^2_2
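A sketch of computing r^i_t from the forward-model error; the default scale η and the no-grad detachment (so the error is used as a reward rather than trained on here) are implementation choices, not fixed by the notes above:

```python
import torch

def intrinsic_reward(phi_hat_t1, phi_t1, eta=0.01):
    # r^i_t = eta/2 * ||phi_hat(s_{t+1}) - phi(s_{t+1})||^2_2 per transition
    with torch.no_grad():
        return 0.5 * eta * (phi_hat_t1 - phi_t1).pow(2).sum(dim=-1)
```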
The overall optimization problem:
min_{θ_P, θ_I, θ_F} [ -λ E_{π(st;θ_P)}[Σ_t r_t] + (1 − β) L_I + β L_F ]
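A sketch of the joint objective: policy_loss is a stand-in for whatever policy-gradient loss approximates -E[Σ r_t] (the paper uses A3C), and the default λ, β values are illustrative placeholders:

```python
def total_loss(policy_loss, L_I, L_F, lam=0.1, beta=0.2):
    # min over theta_P, theta_I, theta_F of
    #   lam * (-E[sum_t r_t]) + (1 - beta) * L_I + beta * L_F
    # policy_loss is assumed to already be the negated expected-return term
    return lam * policy_loss + (1 - beta) * L_I + beta * L_F
```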
Refs
Reinforcement Learning: Dealing with Sparse Reward Environments
[1] Pathak et al., “Curiosity-driven Exploration by Self-supervised Prediction.”
[2] Burda et al., “Large-Scale Study of Curiosity-Driven Learning.”