Intro to Batch Normalization

2 minute read


This is the introduction of Batch Normalization method in deep learning with some key concepts and rationale.

Internal Covariate Shift

In deep learning algorithms, the unstable change in the distribution of layers’ inputs is problematic since each deeper layer needs to adapt to the new distribution. In this case, when the input distribution to a learning system changes, it is called that the system experiences covariate shift. In general, the internal covariate shift is defined as the change in the distribution of network activations due to the change in network parameters in the course of training.
The internal covariate shift can really affect the training of deep neural networks. The changing distribution of inputs slows down the training by requiring lower learning rates and careful parameter initialization since small changes to the network parameters amplify as the network becomes deeper. In addition, it even makes it hard to train models with saturating nonlinearities.


Batch Normalization is a beneficial trick used for efficiently training deep neural networks by reducing the internal covariate shift. To be more specific, BN transform is introduced to normalize the activations into each batch of the networks. The key of this method consists of two part:

  • Firstly, we will normalize each scalar feature independently, with the result of the mean of 0 and the variance of 1;

  • Secondly, since we use mini-batches in stochastic gradient training, each mini-batch produces estimates of the mean and variance of each activation.

In this case, to Batch-Normalize a network, we need to specify a subset of activations and then apply the BN transform for each of them. For each layer, we transform the input from $X$ to $BN(X)$. Since the BN transform is differentiable, the training layers can continue learning on input distributions with less internal covariate shift, which accelerates the training speed. To be more specific, it help alleviate the internal covariate shift by the following methods:

  • It can reduce the dependence of gradients on the scale of the parameters or of their initial values, which means we are allowed to use higher learning rates without the risk of divergence.

  • In this case, backpropagation through a layer is unaffected by the scale of its parameters, and then it stabilize the parameter growth.

  • It can regularize the model and then reduce the need for Dropout layers in batch-normalized networks.

  • It’s possible to use saturating nonlinearities by preventing the network from getting stuck in the saturated modes with batch normalization.