Rather than improving the optimisation algorithms, **normalisation layers** improve the structure of the network itself. They are additional layers placed between existing layers, with the goal of improving both optimisation and generalisation performance.
In neural networks, we typically alternate linear operations with non-linear operations, which are also known as activation functions (such as ReLU). Normalisation layers could be placed before the linear layers or after the activation functions, but the most common practice is to put them between the linear layers and the activation functions, as in the figure below.
![[Screenshot 2023-12-04 alle 01.05.47.png|700]]
Note that the normalisation layers affect the data that flows through, but they do not change the power of the network: with proper configuration of the weights, an unnormalised network can still give the same output as a normalised network.
- **Normalisation Operations**
The generic notation for a normalisation operation is: $y = \frac{a}{\sigma}(x - \mu)+b$
where $x$ is the input vector, $y$ is the output vector, $\mu$ is the estimate of the mean of $x$, $\sigma$ is the estimate of the standard deviation of $x$, $a$ is the learnable scaling factor, and $b$ is the learnable bias term.
Without the learnable parameters $a$ and $b$, the output vector $y$ would always have mean $0$ and std $1$. The scaling factor $a$ and bias term $b$ maintain the representation power of the network, i.e., the output values can still take any particular range. Note that $a$ and $b$ do not simply reverse the normalisation, because they are learnable parameters and are much more stable than $\mu$ and $\sigma$.
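As a minimal sketch of this operation in PyTorch (the small `eps` added to the denominator for numerical stability is an assumption here, though real implementations do the same):

```python
import torch

def normalise(x, a, b, eps=1e-5):
    # Estimate the mean and std of the input vector.
    mu = x.mean()
    sigma = x.std()
    # Normalise to zero mean / unit std, then rescale and shift
    # with the learnable parameters a and b.
    return a / (sigma + eps) * (x - mu) + b

x = torch.randn(10) * 3 + 7              # input with mean ~7, std ~3
a, b = torch.tensor(2.0), torch.tensor(0.5)
y = normalise(x, a, b)
print(y.mean(), y.std())                 # roughly b and a, as expected
```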
![[Pasted image 20231204011057.png|700]]
There are several ways to normalise the input vector, based on how to select samples for normalisation. The figure above lists 4 different normalisation approaches, for a mini-batch of $N$ images of height $H$ and width $W$, with $C$ channels:
- *Batch Norm*: the normalisation is applied over the whole mini-batch, but only within one channel of the input. This was the first proposed, and is the most well-known, approach.
- *Layer Norm*: the normalisation is applied within one image across all channels.
- *Instance Norm*: the normalisation is applied only over one image and one channel.
- *Group Norm*: the normalisation is applied over one image but across a group of channels. For example, channels 0 to 9 form one group, channels 10 to 19 form the next, and so on. In practice, the group size is almost always 32. This is the approach recommended by Aaron Defazio, since it has good performance in practice and it does not conflict with SGD (all four schemes are sketched in code below).
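Below is a hedged sketch of how the four schemes differ, expressed as which dimensions of an $(N, C, H, W)$ tensor the mean is computed over (the tensor sizes and the group size of 2 are toy values chosen so the shapes work out; the std estimates follow the same pattern):

```python
import torch

N, C, H, W = 8, 4, 16, 16
x = torch.randn(N, C, H, W)

# Batch norm: one estimate per channel, across the batch and spatial dims.
mu_bn = x.mean(dim=(0, 2, 3), keepdim=True)    # shape (1, C, 1, 1)

# Layer norm: one estimate per image, across all channels and spatial dims.
mu_ln = x.mean(dim=(1, 2, 3), keepdim=True)    # shape (N, 1, 1, 1)

# Instance norm: one estimate per image and per channel.
mu_in = x.mean(dim=(2, 3), keepdim=True)       # shape (N, C, 1, 1)

# Group norm: one estimate per image and per group of channels.
channels_per_group = 2       # toy value for C=4; 32 is the usual group size
G = C // channels_per_group
mu_gn = x.view(N, G, channels_per_group, H, W).mean(dim=(2, 3, 4), keepdim=True)
print(mu_gn.shape)           # (N, G, 1, 1, 1): one estimate per image per group
```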
In practice, batch norm and group norm work well for computer vision problems, while layer norm and instance norm are heavily used for language problems.
- **Why Does Normalisation Help?**
Although normalisation works well in practice, the reasons behind its effectiveness are still disputed. Normalisation was originally proposed to reduce "internal covariate shift", but later experiments have cast doubt on that explanation. Nevertheless, normalisation clearly helps through a combination of the following factors:
- Networks with normalisation layers are easier to optimise, allowing for the use of larger learning rates. **Normalisation has an optimisation effect** that speeds up the training of NNs.
- The mean/std estimates are noisy due to the randomness of the samples in the batch. This extra "noise" results in better generalisation in some cases. **Normalisation has a regularisation effect**.
- Normalisation reduces sensitivity to weight initialisation.
As a result, normalisation lets you be more "careless": you can combine almost any neural network building blocks together and have a good chance of training them without having to consider how poorly conditioned the network might be.
- **Practical Considerations**
It is important that back-propagation is done through the calculation of the mean and std, as well as the application of the normalisation: the network training will diverge otherwise. The back-propagation calculation is fairly difficult and error-prone, but PyTorch can compute it for us automatically, which is very helpful. Two normalisation layer classes in PyTorch are listed below: ```torch.nn.BatchNorm2d(num_features, ...)``` and ```torch.nn.GroupNorm(num_groups, num_channels, ...)```
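As a hedged illustration of the placement convention from the figure above, a small network might interleave the layers like this (the architecture and channel counts are made up for the example):

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),    # linear operation
    nn.BatchNorm2d(num_features=64),               # normalisation layer
    nn.ReLU(),                                     # activation function
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
    nn.GroupNorm(num_groups=2, num_channels=64),   # 64 channels / group size 32 = 2 groups
    nn.ReLU(),
)
```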
Batch norm was the first method developed and is the most widely known. However, Aaron Defazio recommends using group norm instead. It's more stable, theoretically simpler, and usually works better. Group size 32 is a good default.
Note that for batch norm and instance norm, the mean/std used at evaluation time are running estimates fixed after training, rather than re-computed every time the network is evaluated, because multiple training samples are needed to compute the statistics. This is not necessary for group norm and layer norm, since their normalisation is over only one training sample.
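A short sketch of this behaviour for batch norm: switching the module between training and evaluation mode changes whether the per-batch statistics or the stored running estimates are used:

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(num_features=8)
x = torch.randn(16, 8, 4, 4)

bn.train()                     # training mode: uses the batch statistics
y_train = bn(x)                # and updates the running mean/std estimates

bn.eval()                      # evaluation mode: uses the stored running
y_eval = bn(x)                 # mean/std, which are fixed after training

print(bn.running_mean.shape)   # torch.Size([8]): one estimate per channel
```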