symjax.nn.optimizers
class symjax.nn.optimizers.Adam(*args, name=None, **kwargs)[source]

Adaptive gradient-based optimization with renormalization.
The update rule for a variable with gradient g follows the optimization described at the end of Section 2 of the Kingma and Ba paper, with learning rate α.
If amsgrad is False:

initialization:
- \(m_0 = 0\) (initialize 1st moment vector)
- \(v_0 = 0\) (initialize 2nd moment vector)
- \(t = 0\) (initialize timestep)
update:
- \(t = t + 1\)
- \(α_t = α × \sqrt{1 - β_2^t}/(1 - β_1^t)\)
- \(m_t = β_1 × m_{t-1} + (1 - β_1) × g\)
- \(v_t = β_2 × v_{t-1} + (1 - β_2) × g \odot g\)
- \(variable = variable - α_t × m_t / (\sqrt{v_t} + ε)\)
If amsgrad is True:

initialization:
- \(m_0 = 0\) (initialize 1st moment vector)
- \(v_0 = 0\) (initialize 2nd moment vector)
- \(v'_0 = 0\) (initialize maximum of the 2nd moment vector)
- \(t = 0\) (initialize timestep)
update:
- \(t = t + 1\)
- \(α_t = α × \sqrt{1 - β_2^t}/(1 - β_1^t)\)
- \(m_t = β_1 × m_{t-1} + (1 - β_1) × g\)
- \(v_t = β_2 × v_{t-1} + (1 - β_2) × g \odot g\)
- \(v'_t := \max(v'_{t-1}, v_t)\)
- \(variable = variable - α_t × m_t / (\sqrt{v'_t} + ε)\)
The default value \(\epsilon = 10^{-7}\) might not be a good choice in general. For example, when training an Inception network on ImageNet a good choice is 1.0 or 0.1. Note that since this implementation uses the formulation just before Section 2.1 of the Kingma and Ba paper rather than the formulation in Algorithm 1, the "epsilon" referred to here is "epsilon hat" in the paper.
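The update rules above can be stated directly in code. The following NumPy function is a sketch for illustration only, not symjax's actual implementation; the names `adam_update`, `vhat`, and `g` are hypothetical. It performs one step and takes the AMSGrad branch when a \(v'\) accumulator is supplied:

```python
import numpy as np

def adam_update(variable, g, m, v, t, vhat=None,
                alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam step following the equations above.

    When vhat is not None, the AMSGrad variant (amsgrad=True) is used.
    Returns the updated (variable, m, v, t, vhat).
    """
    t = t + 1
    alpha_t = alpha * np.sqrt(1 - beta2**t) / (1 - beta1**t)
    m = beta1 * m + (1 - beta1) * g            # 1st moment EMA
    v = beta2 * v + (1 - beta2) * g * g        # 2nd moment EMA
    if vhat is not None:                       # amsgrad=True
        vhat = np.maximum(vhat, v)             # running max of v_t
        variable = variable - alpha_t * m / (np.sqrt(vhat) + eps)
    else:                                      # amsgrad=False
        variable = variable - alpha_t * m / (np.sqrt(v) + eps)
    return variable, m, v, t, vhat
```

For example, iterating this update with the gradient of \(f(x) = x^2\) drives the variable toward zero.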
Parameters:
- grads_or_loss (scalar tensor or list of gradients) – either the loss (a scalar Tensor) to be differentiated, or the list of gradients already computed and possibly altered manually (e.g. by clipping)
- learning_rate (α) (constant or Tensor) – the learning rate used to update the parameters
- amsgrad (bool) – whether to use the AMSGrad updates or not
- β_1 (constant or Tensor) – the exponential moving-average coefficient of the mean of the gradients through time (updates)
- β_2 (constant or Tensor) – the exponential moving-average coefficient of the variance of the gradients through time
- ε (constant or Tensor) – the value added to the second-order moment
- params (list (optional)) – if grads_or_loss is a list then it should be ordered w.r.t. the given parameters; if not given, the optimizer will find all trainable variables involved with the given loss
updates
    Type: list of updates

variables
    Type: list of variables
class symjax.nn.optimizers.NesterovMomentum(*args, name=None, **kwargs)[source]

Nesterov momentum optimization.
Parameters:
- grads_or_loss (scalar tensor or list of gradients) – either the loss (a scalar Tensor) to be differentiated, or the list of gradients already computed and possibly altered manually (e.g. by clipping)
- learning_rate (constant or Tensor) – the learning rate used to update the parameters
- momentum (constant or Tensor) – the amount of momentum to be applied
- params (list (optional)) – if grads_or_loss is a list then it should be ordered w.r.t. the given parameters
updates
    Type: list of updates

variables
    Type: list of variables
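The docstring above does not spell out the update rule. The following NumPy sketch shows one common look-ahead formulation of Nesterov momentum, which may differ from symjax's exact implementation; the names `nesterov_update` and `grad_fn` are illustrative:

```python
import numpy as np

def nesterov_update(theta, velocity, grad_fn,
                    learning_rate=0.01, momentum=0.9):
    """One Nesterov momentum step (classic look-ahead formulation).

    The gradient is evaluated at the look-ahead point
    theta + momentum * velocity, rather than at theta itself.
    Returns the updated (theta, velocity).
    """
    lookahead = theta + momentum * velocity
    velocity = momentum * velocity - learning_rate * grad_fn(lookahead)
    return theta + velocity, velocity
```

Evaluating the gradient at the look-ahead point is what distinguishes Nesterov momentum from plain (heavy-ball) momentum.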
class symjax.nn.optimizers.SGD(*args, name=None, **kwargs)[source]

Stochastic gradient descent optimization.
Notice that SGD is also the acronym employed in tf.keras.optimizers.SGD and in torch.optim.SGD, but it can be misleading. In fact, those implementations and this one implement GD; the term SGD strictly applies only when GD is performed using a single (random) sample to compute the gradients. If multiple samples are used it is commonly referred to as mini-batch GD, and when the entire dataset is used the optimizer is referred to as GD.

The produced update for parameter θ and a given learning rate α is:
\[θ = θ - α ∇_{θ} L\]

Parameters:
- grads_or_loss (scalar tensor or list of gradients) – either the loss (a scalar Tensor) to be differentiated, or the list of gradients already computed and possibly altered manually (e.g. by clipping)
- learning_rate (constant or Tensor) – the learning rate used to update the parameters
- params (list (optional)) – if grads_or_loss is a list then it should be ordered w.r.t. the given parameters
updates
    Type: list of updates

variables
    Type: list of variables
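The update formula above is simple enough to state directly in code. This one-line sketch (the name `sgd_update` is hypothetical, not part of the symjax API) applies a single gradient-descent step:

```python
def sgd_update(theta, grad, learning_rate=0.01):
    """Plain gradient-descent step: theta <- theta - alpha * grad."""
    return theta - learning_rate * grad
```

Whether the step is SGD, mini-batch GD, or full GD depends only on how many samples contributed to `grad`, as discussed above.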
symjax.nn.optimizers.conjugate_gradients(Ax, b)[source]

Conjugate gradient algorithm (see https://en.wikipedia.org/wiki/Conjugate_gradient_method).
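Given the signature, `Ax` is presumably a callable computing the matrix-vector product A @ x for a symmetric positive-definite A. Under that assumption, a standard CG iteration can be sketched as follows (this is a sketch following the Wikipedia reference above, not symjax's actual code; `max_iters` and `tol` are illustrative parameters):

```python
import numpy as np

def conjugate_gradients(Ax, b, max_iters=100, tol=1e-10):
    """Solve A x = b for symmetric positive-definite A.

    Ax is a callable returning the matrix-vector product A @ x,
    so A never needs to be materialized explicitly.
    """
    x = np.zeros_like(b)
    r = b - Ax(x)          # residual
    p = r.copy()           # search direction
    rs = r @ r
    for _ in range(max_iters):
        Ap = Ax(p)
        alpha = rs / (p @ Ap)      # step length along p
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:  # converged
            break
        p = r + (rs_new / rs) * p  # new conjugate direction
        rs = rs_new
    return x
```

In exact arithmetic CG terminates in at most n iterations for an n-dimensional system, which is why it is attractive when A is only available as a matrix-vector product.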