In this post, I am going to share my understanding of some tricks used in training deep neural networks.
ResNet [1]
Why does ResNet network work? https://www.quora.com/How-does-deep-residual-learning-work
Here is my answer:
It is hard to know the desired depth of a deep network. If the network is too deep, errors are hard to propagate back correctly; if it is too shallow, it may not have enough representational power.
However, with a deep residual network, it is safe to train very deep layers to get enough learning power without worrying too much about the degradation problem, because in the worst case, blocks in those "unnecessary" layers can learn to be an identity mapping and do no harm to performance. This happens when the solver drives the weights of the stacked layers in the residual branch close to zero, so that only the shortcut connection is active and acts as an identity mapping. Though not proved theoretically, driving the weights toward zero may be an easier task for the solver than learning an effective representation all at once.
The authors observe empirically (Fig. 7 of [1]) that ResNet has smaller layer responses on average than plain networks, suggesting that many blocks learn only small, incremental adjustments.
To conclude, the core idea of ResNet is the shortcut connection between layers, which makes it safe to train very deep networks to gain maximal representation power without worrying about the degradation problem, i.e., the learning difficulties introduced by deep layers.
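To make the shortcut idea concrete, here is a minimal sketch of a residual block in PyTorch (my own simplification, not the exact architecture from [1]): if the convolutions in the residual branch learn weights close to zero, the block reduces to an identity mapping of its input.

```python
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Simplified residual block: output = ReLU(F(x) + x)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        # Residual branch F(x): if its weights are driven toward zero,
        # F(x) ≈ 0 and the block degenerates to an identity mapping.
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # Shortcut connection: add the input back before the final ReLU.
        return F.relu(out + x)
```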
My answer is based entirely on empirical observation and intuition; I'd like to learn more about the theory behind ResNet.
Now, according to [2], all of the following techniques act as regularization methods.
Data Augmentation
Augment the training set via domain-specific transformations. For image data, commonly used transformations include random cropping and random perturbation of brightness, saturation, hue, and contrast.
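For example, using torchvision (my choice of library here; the parameter values are only illustrative):

```python
from torchvision import transforms

# A typical image augmentation pipeline for training (values are illustrative).
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),       # random cropping
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.4,      # random perturbation of brightness,
                           contrast=0.4,        # contrast, saturation and hue
                           saturation=0.4,
                           hue=0.1),
    transforms.ToTensor(),
])
```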
Early Stopping
Early stopping was shown to implicitly regularize some convex learning problems (Yao et al., 2007; Lin et al., 2016).
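A minimal sketch of how I would implement it by monitoring validation loss; `train_one_epoch` and `evaluate` are hypothetical helpers, and the patience value is arbitrary:

```python
max_epochs, patience = 100, 10
best_val_loss, epochs_without_improvement = float("inf"), 0
best_state = None

for epoch in range(max_epochs):
    train_one_epoch(model, train_loader)      # hypothetical helper
    val_loss = evaluate(model, val_loader)    # hypothetical helper

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        epochs_without_improvement = 0
        best_state = {k: v.clone() for k, v in model.state_dict().items()}
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break  # stop once validation loss has not improved for `patience` epochs

model.load_state_dict(best_state)  # roll back to the best checkpoint
```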
Dropout
Mask out each element of a layer's output randomly with a given dropout probability. Note that [1] says "We adopt batch normalization (BN) [16] right after each convolution and before activation, following [16]," but "we do not use dropout [14], following the practice in [16]."
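A sketch of what the dropout mask does at training time, using plain NumPy with the common "inverted dropout" scaling (my own illustration; as quoted above, [1] does not use dropout at all):

```python
import numpy as np

def dropout(activations, p=0.5, training=True):
    """Inverted dropout: zero out each element with probability p and
    rescale the survivors so the expected activation stays the same."""
    if not training or p == 0.0:
        return activations
    mask = (np.random.rand(*activations.shape) >= p) / (1.0 - p)
    return activations * mask
```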
Weight Decay
Why is weight decay equivalent to an $latex \ell_2$ norm regularizer on the weights? It is also equivalent to a hard constraint that keeps the weights inside a Euclidean ball, with the radius determined by the amount of weight decay.
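My current understanding of the first equivalence, for vanilla SGD with learning rate $latex \eta$ and decay coefficient $latex \lambda$ (it does not hold exactly for adaptive optimizers such as Adam): the weight decay update is $latex w_{t+1} = (1 - \eta\lambda) w_t - \eta \nabla L(w_t)$, while gradient descent on the penalized loss $latex L(w) + \frac{\lambda}{2}\|w\|_2^2$ gives $latex w_{t+1} = w_t - \eta(\nabla L(w_t) + \lambda w_t) = (1 - \eta\lambda) w_t - \eta \nabla L(w_t)$, so the two updates coincide. The hard-constraint view follows because the penalized objective can be read as the Lagrangian form of minimizing $latex L(w)$ subject to $latex \|w\|_2 \le r$, where the radius $latex r$ shrinks as $latex \lambda$ grows.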
Batch Norm
https://theneuralperspective.com/2016/10/27/gradient-topics/
http://leix.me/2017/03/02/normalization-in-deep-learning/
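The core training-time computation, as I understand it (a NumPy sketch for a fully connected layer; a real implementation also tracks running statistics for inference):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then rescale and shift.

    x:     (batch_size, num_features) activations
    gamma: (num_features,) learnable scale
    beta:  (num_features,) learnable shift
    """
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, unit variance per feature
    return gamma * x_hat + beta               # restore representational freedom
```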
ReLU, Leaky ReLU & MaxOut
Comparison between different activation functions
https://en.wikipedia.org/wiki/Activation_function#Comparison_of_activation_functions
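As a quick reference, the three activations written as NumPy functions (the leaky slope and the maxout parameterization below are my own illustrative choices):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, negative_slope=0.01):
    # A small non-zero slope for x < 0 keeps gradients flowing and avoids "dead" units.
    return np.where(x > 0, x, negative_slope * x)

def maxout(x, w, bias):
    """Maxout over k affine pieces.

    x: (batch, d_in), w: (d_in, d_out, k), bias: (d_out, k)
    Returns the element-wise maximum of the k affine maps, shape (batch, d_out).
    """
    z = np.einsum("bi,iok->bok", x, w) + bias
    return z.max(axis=-1)
```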
Reference
[1] Deep Residual Learning for Image Recognition
[2] Understanding Deep Learning Requires Rethinking Generalization
[3] Note for [2]: https://danieltakeshi.github.io/2017/05/19/understanding-deep-learning-requires-rethinking-generalization-my-thoughts-and-notes