symjax.rl
Implementation of basic agents, environment utilities and learning policies.
Buffer(maxlen[, priority_sampling, gamma, lam]) | Buffer holding different values of experience
run(env, agent, buffer[, rewarder, noise, …]) |
Actor(states[, actions_distribution, name]) | actor (state to action mapping) for RL
Critic(states[, actions]) |
REINFORCE(state_shape, actions_shape, …[, …]) | policy gradient REINFORCE, also called reward-to-go policy gradient
ActorCritic(state_shape, actions_shape, …) | Q actor-critic or V actor-critic, depending on the given critic
PPO(state_shape, actions_shape, batch_size, …) | instead of using target networks one can record the old log probs
DDPG(state_shape, actions_shape, batch_size, …) |
Detailed Descriptions
-
class symjax.rl.utils.Buffer(maxlen, priority_sampling=False, gamma=0.99, lam=0.95)
Buffer holding different values of experience.
By default this contains:
"reward", "reward-to-go", "V" or "Q", "action", "state", "episode", "priorities", "TD-error", "terminal", "next-state"
The action-value and state-value functions are defined as

$$Q^{\pi}(s,a) = E_{\pi}\{R_t \mid s_t = s, a_t = a\} = E_{\pi}\Big\{\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\Big|\, s_t = s, a_t = a\Big\}$$

$$V^{\pi}(s) = E_{\pi}\{R_t \mid s_t = s\} = E_{\pi}\Big\{\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\Big|\, s_t = s\Big\}$$

where $\gamma \in [0, 1]$ is called the discount factor and determines whether one focuses on immediate rewards ($\gamma = 0$), the total reward ($\gamma = 1$), or some trade-off in between.
- lam: float
- lambda for GAE-Lambda (always between 0 and 1, close to 1)
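As a concrete illustration of how gamma and lam are typically used, the following self-contained NumPy sketch computes the discounted reward-to-go and GAE-Lambda advantages for one recorded episode. The helper names are hypothetical and this is not the library's internal implementation:

    import numpy as np

    def discounted_rewards_to_go(rewards, gamma=0.99):
        # R_t = sum_k gamma^k r_{t+k+1}, accumulated backwards over one episode
        out = np.zeros(len(rewards))
        running = 0.0
        for t in reversed(range(len(rewards))):
            running = rewards[t] + gamma * running
            out[t] = running
        return out

    def gae_lambda_advantages(rewards, values, gamma=0.99, lam=0.95):
        # A_t = sum_k (gamma * lam)^k delta_{t+k},
        # with delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        values = np.append(values, 0.0)  # bootstrap value of 0 after the terminal state
        deltas = rewards + gamma * values[1:] - values[:-1]
        return discounted_rewards_to_go(deltas, gamma * lam)

    rewards = np.array([1.0, 0.0, 0.0, 1.0])
    values = np.array([0.5, 0.4, 0.3, 0.2])
    print(discounted_rewards_to_go(rewards))        # ≈ [1.9703, 0.9801, 0.99, 1.0]
    print(gae_lambda_advantages(rewards, values))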
-
symjax.rl.utils.run(env, agent, buffer, rewarder=None, noise=None, action_processor=None, max_episode_steps=10000, max_episodes=1000, update_every=1, update_after=1, skip_frames=1, reset_each_episode=False, wait_end_path=False, eval_every=10, eval_max_episode_steps=10000, eval_max_episodes=10)
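The role of the main loop parameters can be read from the following conceptual sketch of an agent-environment interaction loop. It only illustrates the control flow these parameters govern; the agent and buffer method names (act, update, push) are hypothetical stand-ins, not the symjax API:

    def training_loop(env, agent, buffer, max_episodes=1000, max_episode_steps=10000,
                      update_every=1, update_after=1):
        total_steps = 0
        for episode in range(max_episodes):
            state = env.reset()
            for step in range(max_episode_steps):
                action = agent.act(state)                  # hypothetical agent interface
                next_state, reward, terminal = env.step(action)
                buffer.push(state, action, reward,         # hypothetical buffer interface
                            next_state, terminal)
                state = next_state
                total_steps += 1
                # learning starts after `update_after` environment steps and is
                # then triggered every `update_every` steps
                if total_steps >= update_after and total_steps % update_every == 0:
                    agent.update(buffer)
                if terminal:
                    break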
-
class symjax.rl.agents.Actor(states, actions_distribution=None, name='actor')
Actor (state-to-action mapping) for RL.
This class implements an actor. The user must first define their own class inheriting from Actor and implementing only the create_network method; this method is then used internally to instantiate the actor network. If the distribution used is symjax.probabilities.Normal, the output of create_network should be first the mean and then the covariance. In general the user should not instantiate this class directly; instead, pass the inherited class (uninstantiated) to a policy-learning method (see the sketch after the parameter list below).
- states: Tensor-like
- the states of the environment (batch size in first axis)
- batch_size: int
- the batch size
- actions_distribution: None or symjax.probabilities.Distribution object
- the distribution of the actions; if the policy is deterministic, set this to None. Note that this is different from the noise parameter used for exploration: it is simply the random-variable model of the actions, used to compute probabilities of sampled actions and the like
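A minimal sketch of a user-defined actor follows. The exact signature of create_network and the layer/activation helpers (symjax.nn.layers.Dense, symjax.nn.relu, symjax.nn.softplus) are assumptions based on the rest of the symjax documentation and may differ from the actual API:

    from symjax import nn
    from symjax.rl.agents import Actor

    class GaussianActor(Actor):
        # assumed signature: receives the states placeholder and the action shape
        def create_network(self, states, actions_shape):
            hidden = nn.relu(nn.layers.Dense(states, 64))
            # with symjax.probabilities.Normal as actions_distribution, the
            # method should return first the mean and then the covariance
            mean = nn.layers.Dense(hidden, actions_shape[-1])
            covariance = nn.softplus(
                nn.layers.Dense(hidden, actions_shape[-1])
            )  # softplus keeps the covariance positive
            return mean, covariance

    # the uninstantiated class (GaussianActor, not GaussianActor(...)) is then
    # passed as the `actor` argument of a policy-learning method such as
    # REINFORCE or ActorCritic below.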
-
class symjax.rl.REINFORCE(state_shape, actions_shape, n_episodes, episode_length, actor, lr=0.001, gamma=0.99)
Policy-gradient REINFORCE, also called reward-to-go policy gradient.
The vanilla policy gradient uses the total reward of each episode as a weight. In this implementation it is the discounted rewards-to-go that are used; setting gamma to 1 recovers the reward-to-go policy gradient. See https://medium.com/@thechrisyoon/deriving-policy-gradients-and-implementing-reinforce-f887949bd63
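The difference between the vanilla weighting and the reward-to-go weighting can be seen on a toy episode; the snippet below is plain NumPy and independent of the library:

    import numpy as np

    rewards = np.array([0.0, 1.0, 0.0, 2.0])

    # vanilla policy gradient: every step is weighted by the episode's total reward
    total_reward_weights = np.full_like(rewards, rewards.sum())   # [3. 3. 3. 3.]

    # reward-to-go (gamma = 1): each step is weighted only by the rewards
    # obtained from that step onwards
    reward_to_go = np.cumsum(rewards[::-1])[::-1]                 # [3. 3. 2. 2.]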
-
class symjax.rl.ActorCritic(state_shape, actions_shape, n_episodes, episode_length, actor, critic, lr=0.001, gamma=0.99, train_v_iters=10)
This corresponds to a Q actor-critic or a V actor-critic depending on the given critic (with GAE-Lambda for advantage estimation).
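A hedged instantiation sketch follows, using only the signature documented above. The import path of Critic, the create_network signature, and the assumption that the critic class (like the actor) is passed uninstantiated are all illustrative; a critic built on states only gives the V variant, while one built on states and actions would give the Q variant:

    from symjax import nn, rl
    from symjax.rl.agents import Critic   # assumed location, mirroring Actor

    class VCritic(Critic):
        # a state-value critic V(s); using (states, actions) instead would
        # correspond to the Q actor-critic variant
        def create_network(self, states):
            hidden = nn.relu(nn.layers.Dense(states, 64))
            return nn.layers.Dense(hidden, 1)

    agent = rl.ActorCritic(
        state_shape=(4,),        # illustrative shapes only
        actions_shape=(2,),
        n_episodes=32,
        episode_length=200,
        actor=GaussianActor,     # Actor subclass from the sketch above
        critic=VCritic,
        lr=1e-3,
        gamma=0.99,
        train_v_iters=10,
    )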