Intro to Adaptive Moment Estimation (ADAM)

This post introduces the Adaptive Moment Estimation (ADAM) method and some key concepts behind the algorithm.


Adaptive Moment Estimation

Adaptive Moment Estimation (ADAM) is a variant of vanilla stochastic gradient descent (SGD) that works particularly well for training deep neural networks. The default hyper-parameters, which work well in practice, are as follows:

| Hyper-parameter | Setting | Interpretation |
| --- | --- | --- |
| $\alpha$ | 0.001 | The step size, which bounds the effective step taken in parameter space |
| $\beta_1$ | 0.9 | Controls the exponential decay rate of the first moment estimates |
| $\beta_2$ | 0.999 | Controls the exponential decay rate of the second moment estimates |
| $\epsilon$ | $10^{-8}$ | A small scalar used to prevent division by zero |

For a stochastic objective function $f(\theta)$ that is differentiable with respect to the parameters $\theta$, the ADAM algorithm proceeds as follows:

Initialization (Step 0): We set the initial parameter vector $\theta_0$, along with the first moment vector $m_0=0$, the second moment vector $v_0=0$, and the time step $t=0$.

Update Loop: Until convergence, we repeat the following steps:
Step 1 (Update time): $t=t+1$
Step 2 (Computing the gradient): $g_t = \nabla_\theta f_t (\theta_{t-1})$
Step 3 (Updating the moment estimates): $m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$ and $v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$, where $g_t^2$ denotes the element-wise square of $g_t$.
Step 4 (Updating the parameters): $\theta_t = \theta_{t-1} -\frac{\alpha m_t \sqrt{1-\beta_2^t}}{(1-\beta_1^t)(\sqrt{v_t}+\epsilon)}$
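
To make the four steps concrete, here is a minimal NumPy sketch of the update loop; the toy quadratic objective $f(\theta)=\tfrac{1}{2}\lVert\theta\rVert^2$ and the fixed iteration count are assumptions for illustration only, not part of the original algorithm description:

```python
import numpy as np

# Minimal sketch of the ADAM update loop above, applied to an assumed toy
# quadratic objective f(theta) = 0.5 * ||theta||^2 (gradient = theta).
alpha, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8

theta = np.array([1.0, -2.0])          # Step 0: initial parameters
m = np.zeros_like(theta)               # first moment vector  m_0 = 0
v = np.zeros_like(theta)               # second moment vector v_0 = 0
t = 0                                  # time step            t   = 0

for _ in range(10_000):
    t += 1                                           # Step 1: update time
    g = theta                                        # Step 2: gradient of 0.5*||theta||^2
    m = beta1 * m + (1 - beta1) * g                  # Step 3: first moment estimate
    v = beta2 * v + (1 - beta2) * g**2               #         second moment (element-wise square)
    # Step 4: bias-corrected update, written exactly as in the formula above
    theta = theta - alpha * m * np.sqrt(1 - beta2**t) / ((1 - beta1**t) * (np.sqrt(v) + eps))

print(theta)  # approaches the optimum at the origin
```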

Differences from vanilla SGD

As we can see above, the main differences from vanilla SGD are the following:

  • ADAM uses both first and second moment estimates to achieve faster convergence and reduced oscillation, whereas vanilla SGD only uses the raw gradients.

  • In addition, the effective magnitude of the steps taken in parameter space at each time step is approximately bounded by the step size $\alpha$, so if we have prior knowledge about the problem we can choose the right scale of $\alpha$ in advance (see the sketch after this list).
  • As the parameters approach the optimum, the effective step size naturally shrinks, a desirable property that acts as a form of automatic annealing.
  • Because the algorithm maintains exponential moving averages of the gradient ($m_t$) and the squared gradient ($v_t$), ADAM is well suited to non-stationary objectives and to problems with very noisy and/or sparse gradients.
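
The following small sketch (an assumed illustration, not part of the original text) shows the step-size bound mentioned above: after the first update, ADAM's step stays on the order of $\alpha$ regardless of the gradient scale, while the vanilla SGD step grows directly with the gradient.

```python
import numpy as np

# Compare the step magnitudes of vanilla SGD and ADAM (first iteration,
# m_0 = v_0 = 0) on gradients of very different scales.
alpha, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8

for g in (0.01, 1.0, 100.0):                 # gradients of increasing magnitude
    sgd_step = alpha * g                     # vanilla SGD: step scales with g
    m = (1 - beta1) * g                      # ADAM first moment after one update
    v = (1 - beta2) * g**2                   # ADAM second moment after one update
    adam_step = alpha * m * np.sqrt(1 - beta2) / ((1 - beta1) * (np.sqrt(v) + eps))
    print(f"g={g:>7}: SGD step={sgd_step:.6f}  ADAM step={adam_step:.6f}")
```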

Initialization Bias Correction

In the ADAM update loop, we initialize the moment estimates to zero, which makes them biased towards zero, especially during the initial time steps and when the decay rates are small (i.e. when $\beta_1$ and $\beta_2$ are close to 1). However, we can remove this bias by using bias-corrected moment estimates. The correction matters most for sparse gradients: to obtain a reliable estimate of the second moment we have to average over many gradients by choosing a small decay $1-\beta_2$ (i.e. $\beta_2$ close to 1), and it is exactly in this regime that an uncorrected $v_t$ stays close to zero for many steps. The update rule in Step 4 above already folds this correction into the step through the factors $\sqrt{1-\beta_2^t}$ and $(1-\beta_1^t)$.

Recall that $v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$, which can be unrolled as $v_t = (1-\beta_2)\sum_{i=1}^t \beta_2^{t-i}g_i^2$. Taking the expectation, we have:
\(E(v_t)=E(g_t^2)(1-\beta_2^t) + \xi\) where $\xi=0$ if the true second moment $E(g_t^2)$ is stationary; otherwise $\xi$ can be kept small because the exponential decay assigns tiny weights to gradients from the far past. Dividing $v_t$ by the factor $(1-\beta_2^t)$ (and, analogously, $m_t$ by $(1-\beta_1^t)$) therefore corrects the initialization bias.
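
A quick numerical check of this identity: assuming a constant gradient (so that $E(g_t^2)$ is exactly stationary and $\xi = 0$), the raw estimate $v_t$ is biased towards zero while the corrected estimate recovers the true second moment.

```python
# With a constant gradient g, v_t = (1 - beta2**t) * g**2, so the raw
# estimate starts near zero while v_t / (1 - beta2**t) equals g**2 exactly.
beta2 = 0.999
g = 2.0                      # constant gradient, true second moment = 4.0
v = 0.0

for t in range(1, 11):
    v = beta2 * v + (1 - beta2) * g**2
    print(f"t={t:2d}  raw v_t={v:.6f}  corrected v_t={v / (1 - beta2**t):.6f}")
```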

AdaMax

The ADAM update rule scales the gradient inversely proportionally to an (exponentially weighted) $L_2$ norm of the past and current gradients, and this update can be generalized to an $L_p$ norm. Letting $p \rightarrow \infty$ yields a surprisingly simple and stable rule:
\(u_t = \lim_{p \rightarrow \infty} v_t^{\frac{1}{p}} = \max (\beta_2 \cdot u_{t-1},|g_t|)\) This gives the AdaMax algorithm. The initialization is the same as in ADAM except that $\alpha = 0.002$; only two steps of the update loop change:
Step 3 (Updating the moment estimates): $m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$ and $u_t = \max (\beta_2 \cdot u_{t-1},|g_t|)$
Step 4 (Updating the parameters): $\theta_t = \theta_{t-1} -\frac{\alpha m_t}{u_t(1-\beta_1^t)}$
As we can see from the AdaMax update rule, the decay term is parameterised as $\beta_2^p$, so there is no need to correct $u_t$ for initialization bias. In addition, the magnitude of the parameter updates has a simpler bound with AdaMax than with ADAM, namely $|\Delta_t| \leq \alpha$.
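
As with ADAM above, here is a minimal NumPy sketch of the AdaMax loop on the same assumed toy quadratic objective, purely for illustration:

```python
import numpy as np

# Minimal sketch of the AdaMax update loop on the assumed toy objective
# f(theta) = 0.5 * ||theta||^2 (gradient = theta).
alpha, beta1, beta2 = 0.002, 0.9, 0.999

theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)     # first moment vector     m_0 = 0
u = np.zeros_like(theta)     # infinity-norm estimate  u_0 = 0
t = 0

for _ in range(5_000):
    t += 1
    g = theta                                          # gradient of 0.5*||theta||^2
    m = beta1 * m + (1 - beta1) * g                    # Step 3: first moment estimate
    u = np.maximum(beta2 * u, np.abs(g))               #         exponentially weighted infinity norm
    theta = theta - alpha * m / (u * (1 - beta1**t))   # Step 4: no bias correction needed for u_t

print(theta)  # approaches the optimum at the origin
```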