symjax.rl

Implementation of basic agents, environment utilities and learning policies

Buffer(maxlen[, priority_sampling, gamma, lam]) Buffer holding different values of experience
run(env, agent, buffer[, rewarder, noise, …])
Actor(states[, actions_distribution, name]) Actor (state-to-action mapping) for RL
Critic(states[, actions])
REINFORCE(state_shape, actions_shape, …[, …]) Policy-gradient REINFORCE, also called the reward-to-go policy gradient
ActorCritic(state_shape, actions_shape, …) This corresponds to the Q actor-critic or the V actor-critic, depending on the given critic
PPO(state_shape, actions_shape, batch_size, …) Instead of using target networks, one can record the old log-probs
DDPG(state_shape, actions_shape, batch_size, …)

Detailed Descriptions

class symjax.rl.utils.Buffer(maxlen, priority_sampling=False, gamma=0.99, lam=0.95)[source]

Buffer holding different values of experience

By default this contains "reward", "reward-to-go", "V" or "Q", "action", "state", "episode", "priorities", "TD-error", "terminal", "next-state".

$Q^{\pi}(s, a) = E_{\pi}\{R_t \mid s_t = s, a_t = a\} = E_{\pi}\Big\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s, a_t = a\Big\}$

$V^{\pi}(s) = E_{\pi}\{R_t \mid s_t = s\} = E_{\pi}\Big\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s\Big\}$

$\gamma \in [0, 1]$ is called the discount factor and determines whether one focuses on immediate rewards ($\gamma = 0$), the total reward ($\gamma = 1$), or some trade-off.

lam (float): Lambda for GAE-Lambda. (Always between 0 and 1, close to 1.)
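
The role of gamma and lam can be illustrated with a small, self-contained computation (plain NumPy, independent of the Buffer internals); the rewards and values below are made-up per-step quantities from a single episode:

    import numpy as np

    def discounted_rewards_to_go(rewards, gamma=0.99):
        # R_t = sum_k gamma^k r_{t+k+1}, accumulated backwards from the episode end
        out = np.zeros(len(rewards))
        running = 0.0
        for t in reversed(range(len(rewards))):
            running = rewards[t] + gamma * running
            out[t] = running
        return out

    def gae_lambda_advantages(rewards, values, gamma=0.99, lam=0.95):
        # GAE-Lambda: discounted (by gamma * lam) sum of TD errors
        values = np.append(values, 0.0)  # bootstrap 0 after a terminal step
        deltas = np.asarray(rewards) + gamma * values[1:] - values[:-1]
        return discounted_rewards_to_go(deltas, gamma * lam)

    rewards = [1.0, 0.0, 0.0, 1.0]
    values = [0.5, 0.4, 0.6, 0.9]
    print(discounted_rewards_to_go(rewards))       # per-step discounted returns
    print(gae_lambda_advantages(rewards, values))  # per-step GAE-Lambda advantages
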
symjax.rl.utils.run(env, agent, buffer, rewarder=None, noise=None, action_processor=None, max_episode_steps=10000, max_episodes=1000, update_every=1, update_after=1, skip_frames=1, reset_each_episode=False, wait_end_path=False, eval_every=10, eval_max_episode_steps=10000, eval_max_episodes=10)[source]
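
A hypothetical end-to-end wiring of these utilities; the environment name, the hyper-parameter values and the agent construction are illustrative assumptions rather than requirements of the API:

    import gym
    from symjax.rl import utils
    import symjax

    env = gym.make("Pendulum-v0")  # any environment with the usual reset/step interface
    buffer = utils.Buffer(maxlen=100000, gamma=0.99, lam=0.95)

    # `agent` is one of the policy-learning objects documented below (REINFORCE,
    # ActorCritic, PPO, DDPG, ...), built from user-defined Actor/Critic subclasses
    # (MyActor and MyCritic are placeholders here):
    # agent = symjax.rl.PPO(state_shape, actions_shape, batch_size, MyActor, MyCritic)

    # utils.run(
    #     env, agent, buffer,
    #     max_episode_steps=1000,
    #     max_episodes=500,
    #     update_every=1,
    #     update_after=1,
    #     eval_every=10,
    # )
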
class symjax.rl.agents.Actor(states, actions_distribution=None, name='actor')[source]

Actor (state-to-action mapping) for RL

This class implements an actor. The user must first define their own class inheriting from Actor and implementing only the create_network method. This method is then used internally to instantiate the actor network.

If the distribution used is symjax.probabilities.Normal, then the output of the create_network method should be first the mean and then the covariance.

In general the user should not instantiate this class directly; instead, pass the user's inherited class (uninstantiated) to a policy-learning method.

states: Tensor-like
the states of the environment (batch size in first axis)
batch_size: int
the batch size
actions_distribution: None or symjax.probabilities.Distribution object
the distribution of the actions; if the policy is deterministic, set this to None. Note that this is different from the noise parameter employed for exploration: this is simply the random-variable model of the actions, used to compute probabilities of sampled actions and the like
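
A minimal sketch of the subclassing pattern described above. The exact create_network signature, the build_mlp helper, and the symjax.tensor calls are assumptions made for the example; what is taken from the documentation is only that, with symjax.probabilities.Normal, the network must return the mean first and the covariance second, and that the uninstantiated class is what gets passed to a policy-learning method:

    import numpy as np
    import symjax.tensor as T
    from symjax.rl.agents import Actor

    def build_mlp(states, output_dim):
        # hypothetical helper: any symjax graph mapping `states` (batch in the
        # first axis) to a tensor with `output_dim` units
        raise NotImplementedError

    class GaussianActor(Actor):
        def create_network(self, states, action_dim=2):  # signature assumed
            mean = build_mlp(states, action_dim)
            # diagonal covariance through a learned log standard deviation
            log_std = T.Variable(np.zeros(action_dim))
            covariance = T.exp(2.0 * log_std)
            return mean, covariance  # mean first, covariance second

    # passed uninstantiated, e.g. symjax.rl.REINFORCE(..., actor=GaussianActor, ...)
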
class symjax.rl.agents.Critic(states, actions=None)[source]
class symjax.rl.REINFORCE(state_shape, actions_shape, n_episodes, episode_length, actor, lr=0.001, gamma=0.99)[source]

Policy-gradient REINFORCE, also called the reward-to-go policy gradient

The vanilla policy gradient uses the total reward of each episode as the weight. In this implementation, the discounted rewards-to-go are used instead. Setting gamma to 1 recovers the plain reward-to-go policy gradient.

https://medium.com/@thechrisyoon/deriving-policy-gradients-and-implementing-reinforce-f887949bd63
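
A hypothetical instantiation following the signature above; the shapes and hyper-parameter values are placeholders, and GaussianActor is the uninstantiated Actor subclass sketched earlier:

    import symjax

    reinforce = symjax.rl.REINFORCE(
        state_shape=(3,),
        actions_shape=(1,),
        n_episodes=32,
        episode_length=200,
        actor=GaussianActor,
        lr=1e-3,
        gamma=0.99,  # gamma=1.0 would give the plain reward-to-go policy gradient
    )
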

class symjax.rl.ActorCritic(state_shape, actions_shape, n_episodes, episode_length, actor, critic, lr=0.001, gamma=0.99, train_v_iters=10)[source]

This corresponds to the Q actor-critic or the V actor-critic, depending on the given critic

(with GAE-Lambda for advantage estimation)

https://www.freecodecamp.org/news/an-intro-to-advantage-actor-critic-methods-lets-play-sonic-the-hedgehog-86d6240171d/
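
The Q/V distinction follows the Critic signature above: a critic built from states alone estimates V(s), while one that also receives actions estimates Q(s, a). The sketch below assumes that Critic follows the same create_network pattern as Actor (an assumption, not stated in the documentation) and reuses the hypothetical build_mlp helper from the Actor sketch:

    import symjax.tensor as T
    from symjax.rl.agents import Critic

    class VCritic(Critic):
        # states only -> V(s); ActorCritic then acts as a V (advantage) actor-critic
        def create_network(self, states):  # signature assumed
            return build_mlp(states, 1)

    class QCritic(Critic):
        # states and actions -> Q(s, a); ActorCritic then acts as a Q actor-critic
        def create_network(self, states, actions):  # signature assumed
            return build_mlp(T.concatenate([states, actions], axis=-1), 1)
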

class symjax.rl.PPO(state_shape, actions_shape, batch_size, actor, critic, lr=0.001, K_epochs=80, eps_clip=0.2, gamma=0.99, entropy_beta=0.01)[source]

Instead of using target networks, one can record the old log-probabilities at sampling time and thereby have better advantage estimates.
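
The role of the recorded old log-probabilities and of eps_clip can be seen in the standard clipped surrogate objective, sketched here in plain NumPy (this is the textbook PPO objective, not the symjax implementation itself):

    import numpy as np

    def ppo_clipped_objective(new_log_probs, old_log_probs, advantages, eps_clip=0.2):
        # probability ratio against the log-probs recorded at sampling time,
        # which plays the role usually given to a target/old network
        ratio = np.exp(new_log_probs - old_log_probs)
        unclipped = ratio * advantages
        clipped = np.clip(ratio, 1.0 - eps_clip, 1.0 + eps_clip) * advantages
        return np.minimum(unclipped, clipped).mean()  # maximized (negate for a loss)
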

class symjax.rl.DDPG(state_shape, actions_shape, batch_size, actor, critic, lr=0.001, gamma=0.99, tau=0.01)[source]
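
DDPG maintains target networks for the actor and the critic; in the standard formulation, the tau argument above is the rate of the soft (Polyak) target update, sketched generically below (not the symjax internals):

    def soft_update(online_params, target_params, tau=0.01):
        # Polyak averaging: target <- tau * online + (1 - tau) * target
        return [tau * w + (1.0 - tau) * w_t
                for w, w_t in zip(online_params, target_params)]
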