symjax.nn.optimizers


class symjax.nn.optimizers.Adam(*args, name=None, **kwargs)[source]

Adaptive Gradient Based Optimization with renormalization.

The update rule for a variable with gradient g uses the optimization described at the end of Section 2 of the Kingma and Ba paper, with learning rate α.

If amsgrad is False:

initialization:

  • \(m_0 = 0\) (Initialize 1st moment vector)
  • \(v_0 = 0\) (Initialize 2nd moment vector)
  • \(t = 0\) (Initialize timestep)

update:

  • \(t = t + 1\)
  • \(α_t = α × \sqrt{1 - β_2^t}/(1 - β_1^t)\)
  • \(m_t = β_1 × m_{t-1} + (1 - β_1) × g\)
  • \(v_t = β_2 × v_{t-1} + (1 - β_2) × g \odot g\)
  • \(variable = variable - α_t × m_t / (\sqrt{v_t} + ε)\)

If amsgrad is True:

initialization:

  • \(m_0 = 0\) (Initialize 1st moment vector)
  • \(v_0 = 0\) (Initialize 2nd moment vector)
  • \(v'_0 = 0\) (Initialize maximum of the 2nd moment vector)
  • \(t = 0\) (Initialize timestep)

update:

  • \(t = t + 1\)
  • \(α_t = α × \sqrt{1 - β_2^t}/(1 - β_1^t)\)
  • \(m_t = β_1 × m_{t-1} + (1 - β_1) × g\)
  • \(v_t = β_2 × v_{t-1} + (1 - β_2) × g \odot g\)
  • \(v'_t := \max(v'_{t-1}, v_t)\)
  • \(variable = variable - α_t × m_t / (\sqrt{v'_t} + ε)\)

The default value of \(ε = 10^{-7}\) might not be a good default in general. For example, when training an Inception network on ImageNet, a good choice is 1.0 or 0.1. Note that since this optimizer uses the formulation just before Section 2.1 of the Kingma and Ba paper rather than the formulation in Algorithm 1, the “epsilon” referred to here is “epsilon hat” in the paper.
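
To make the two variants concrete, here is a plain NumPy sketch of one Adam step following the equations above. This is an illustration of the documented math, not the library's actual (symbolic) implementation, and the α = 0.001, β_1 = 0.9, β_2 = 0.999 defaults below are common Adam choices assumed for the example rather than values taken from this page.

    import numpy as np

    def adam_step(variable, g, m, v, vhat, t, alpha=0.001,
                  beta1=0.9, beta2=0.999, eps=1e-7, amsgrad=False):
        """One Adam update following the equations above (NumPy sketch)."""
        t = t + 1
        alpha_t = alpha * np.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
        m = beta1 * m + (1 - beta1) * g          # 1st moment estimate
        v = beta2 * v + (1 - beta2) * g * g      # 2nd moment estimate
        if amsgrad:
            vhat = np.maximum(vhat, v)           # running maximum of v_t
            denom = np.sqrt(vhat) + eps
        else:
            denom = np.sqrt(v) + eps
        variable = variable - alpha_t * m / denom
        return variable, m, v, vhat, t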

Parameters:
  • grads_or_loss (scalar tensor or list of gradients) – either the loss (a scalar Tensor) to be differentiated, or the list of gradients already computed and possibly altered manually (e.g. by clipping)
  • learning_rate (α) – the learning rate used to update the parameters
  • amsgrad (bool) – whether or not to use the amsgrad updates
  • β_1 (constant or Tensor) – the decay rate of the exponential moving average of the gradients (1st moment) across updates
  • β_2 (constant or Tensor) – the decay rate of the exponential moving average of the squared gradients (2nd moment) across updates
  • ε (constant or Tensor) – the value added to the denominator \(\sqrt{v_t}\) for numerical stability
  • params (list (optional)) – if grads_or_loss is a list, it should be ordered w.r.t. the given parameters; if not given, the optimizer will find all trainable variables involved in the given loss
updates
Type: list of updates
variables
Type: list of variables
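
A minimal usage sketch follows. Only the constructor arguments (grads_or_loss, learning_rate, amsgrad) and the updates attribute come from this page; the symbolic graph construction (symjax.tensor.Placeholder, Variable, symjax.function) is assumed from the rest of the symjax API, and the toy least-squares model is purely hypothetical.

    import symjax
    import symjax.tensor as T

    # hypothetical least-squares model with a single trainable weight vector
    x = T.Placeholder((32, 10), "float32")
    y = T.Placeholder((32,), "float32")
    w = T.Variable(T.zeros(10), trainable=True)
    loss = T.mean((T.dot(x, w) - y) ** 2)

    # build the Adam updates from the scalar loss (see the parameters above)
    adam = symjax.nn.optimizers.Adam(loss, learning_rate=0.001, amsgrad=False)

    # compile a training step that applies adam.updates at every call
    train = symjax.function(x, y, outputs=loss, updates=adam.updates)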
class symjax.nn.optimizers.NesterovMomentum(*args, name=None, **kwargs)[source]

Nesterov momentum optimization.

Parameters:
  • grads_or_loss (scalar tensor or list of gradients) – either the loss (a scalar Tensor) to be differentiated, or the list of gradients already computed and possibly altered manually (e.g. by clipping)
  • learning_rate (constant or Tensor) – the learning rate used to update the parameters
  • momentum (constant or Tensor) – the amount of momentum to be applied
  • params (list (optional)) – if grads_or_loss is a list, it should be ordered w.r.t. the given parameters
updates
Type: list of updates
variables
Type: list of variables
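
The update equations are not spelled out on this page; as a reference, here is a NumPy sketch of a common textbook formulation of Nesterov momentum. It is an assumption, not a statement from this page, that symjax uses this exact parameterization.

    def nesterov_step(theta, g, velocity, learning_rate=0.01, momentum=0.9):
        """One Nesterov-momentum update (standard formulation, NumPy-compatible sketch)."""
        velocity = momentum * velocity - learning_rate * g        # update velocity
        theta = theta + momentum * velocity - learning_rate * g   # look-ahead step
        return theta, velocity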
class symjax.nn.optimizers.SGD(*args, name=None, **kwargs)[source]

Stochastic gradient descent optimization.

Note that SGD is also the acronym used by tf.keras.optimizers.SGD and torch.optim.SGD, but it can be misleading. Those implementations, like this one, actually perform plain gradient descent: the term SGD strictly applies only when the gradients are computed from a single (random) sample, mini-batch GD when multiple samples are used, and GD when the entire dataset is used.

The produced update for parameter θ and a given learning rate α is:

\[θ = θ - α ∇_{θ} L\]
Parameters:
  • grads_or_loss (scalar tensor or list of gradients) – either the loss (a scalar Tensor) to be differentiated, or the list of gradients already computed and possibly altered manually (e.g. by clipping)
  • learning_rate (constant or Tensor) – the learning rate used to update the parameters
  • params (list (optional)) – if grads_or_loss is a list, it should be ordered w.r.t. the given parameters
updates
Type: list of updates
variables
Type: list of variables
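
The update rule above translates directly into code; here is a one-line sketch of \(θ = θ - α ∇_{θ} L\), given a precomputed gradient.

    def sgd_step(theta, grad, learning_rate=0.01):
        """Plain (mini-batch) gradient descent: theta = theta - alpha * grad."""
        return theta - learning_rate * grad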
symjax.nn.optimizers.conjugate_gradients(Ax, b)[source]

Conjugate gradient algorithm (see https://en.wikipedia.org/wiki/Conjugate_gradient_method)
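
Judging from the signature, Ax is presumably a callable returning the matrix-vector product A·x (an assumption; it is not stated on this page). Below is a NumPy sketch of the standard conjugate gradient iteration for solving A x = b with a symmetric positive-definite A, not necessarily the library's exact implementation.

    import numpy as np

    def conjugate_gradients_np(Ax, b, iters=10, tol=1e-10):
        """Solve A x = b given a callable Ax(v) = A @ v (A symmetric positive-definite)."""
        x = np.zeros_like(b)
        r = b - Ax(x)            # residual
        p = r.copy()             # search direction
        rs_old = r @ r
        for _ in range(iters):
            Ap = Ax(p)
            alpha = rs_old / (p @ Ap)
            x = x + alpha * p
            r = r - alpha * Ap
            rs_new = r @ r
            if np.sqrt(rs_new) < tol:
                break
            p = r + (rs_new / rs_old) * p
            rs_old = rs_new
        return x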