SELU — Make FNNs Great Again (SNN)

Last month I came across a recent article (published June 22nd, 2017) presenting a new concept called Self Normalizing Networks (SNN).
In this post I will review what’s different about them and show some comparisons.
Link to the article: Klambauer et al.
Code for this post is taken from bioinf-jku's GitHub repository.

The Idea

Before we get into what SNNs are, let's talk about the motivation for creating them.
Right in the abstract, the authors make a good point: while neural networks are succeeding in many domains, the main stage belongs to convolutional networks and recurrent networks (LSTM, GRU), while feed-forward neural networks (FNNs) are left behind in the beginner-tutorial sections.
They also note that the FNNs that did manage to get winning results on Kaggle were at most 4 layers deep.

When using very deep architectures, networks become prone to gradient problems, which is exactly why batch normalization became standard. This is where the authors locate the FNN's weak link: its sensitivity to normalization during training.
SNNs do away with external normalization techniques (like batch norm); instead, the normalization happens inside the activation function.
To make it clear: instead of normalizing the output of the activation function, the proposed activation function (SELU, scaled exponential linear units) outputs normalized values by construction.
For SNNs to work, they need two things: a custom weight initialization method and the SELU activation function.

Meet SELU

Before we explain it, let's take a look at what it's all about.

SELU is a kind of ELU with a little twist.
α and λ are two fixed parameters: we don't backpropagate through them, and they are not hyperparameters to tune.
α and λ are derived so that the activation preserves the mean and variance of its inputs across layers. I won't go into the derivation here, but you can see the math for yourself in the article (which has a 93-page appendix of proofs).
For standard scaled inputs (mean 0, stddev 1), the values are α ≈ 1.6733 and λ ≈ 1.0507.
Let's plot it and see what it looks like for these values.
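For reference, SELU itself is simple to write down. Here is a minimal NumPy sketch using the constants above (this is my own illustration, not code from the repo):

```python
import numpy as np

# Fixed constants from the paper (for inputs with mean 0, stddev 1)
ALPHA = 1.6732632423543772
LAMBDA = 1.0507009873554805

def selu(x):
    # lambda * x for positive inputs, lambda * alpha * (exp(x) - 1) otherwise
    return LAMBDA * np.where(x > 0.0, x, ALPHA * (np.exp(x) - 1.0))

# Feeding it standard-normal inputs, the outputs stay close to
# mean 0 and stddev 1 -- the self-normalizing property in action.
x = np.random.default_rng(0).standard_normal(1_000_000)
y = selu(x)
print(round(y.mean(), 3), round(y.std(), 3))
```

This is also a quick way to convince yourself that the fixed point really is (mean 0, stddev 1) for these particular α and λ.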

It looks pretty similar to leaky ReLU, but wait until you see its magic.

Weight Initialization

SELU can't do it alone, so a custom weight initialization technique is used.
SNNs initialize weights with zero mean and a standard deviation of sqrt(1/n), where n is the size of the layer's input (its fan-in).
In code this looks as follows (taken from the GitHub repo mentioned in the opening):

# Standard layer
tf.Variable(tf.random_normal([n_input, n_hidden_1],
                             stddev=np.sqrt(1 / n_input)))

# Convolution layer (5x5 kernel, 1 input channel, so fan-in = 5 * 5 * 1 = 25)
tf.Variable(tf.random_normal([5, 5, 1, 32], stddev=np.sqrt(1 / 25)))
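To see why stddev = sqrt(1/fan_in) is the right scale, here is a quick NumPy check. The layer sizes are assumptions on my part, picked to match the MNIST example:

```python
import numpy as np

rng = np.random.default_rng(42)
n_input, n_hidden = 784, 784  # assumed MNIST-sized layer

# Zero-mean weights with stddev sqrt(1 / fan_in), as in the snippet above
W = rng.normal(0.0, np.sqrt(1.0 / n_input), size=(n_input, n_hidden))

# A batch of standard-scaled inputs (mean 0, stddev 1)
x = rng.standard_normal((512, n_input))

# Each pre-activation sums n_input terms of variance 1/n_input,
# so the pre-activations keep mean ~0 and stddev ~1
z = x @ W
print(round(z.mean(), 3), round(z.std(), 3))
```

With this scale, the pre-activations handed to SELU already sit near the (0, 1) regime that α and λ were solved for.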

So now that we understand the initialization and activation methods, let's put them to work.

Performance

Let's examine how SNNs, using the specified initialization and the SELU activation function, do on the MNIST and CIFAR-10 datasets.
First, let's check with TensorBoard whether it really keeps the outputs normalized, on a 2-layer SNN for MNIST (both hidden layers have 784 nodes).
We plot the activation outputs of layer 1 and the weights of layer 2.
(The layer1_act summary is not present in the GitHub code; I added it for the sake of this histogram.)

It lives up to expectations: both the activations of the first layer and the resulting weights of the second layer have almost perfectly zero mean (I got 0.000201 on my run).
Trust me that the histogram looks pretty much the same for the first layer's weights.
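The same claim can be checked without TensorBoard: stacking several SELU layers with this initialization, the activations should stay near mean 0 and stddev 1 at every depth. A small NumPy sketch (the width and depth are arbitrary assumptions, not the repo's exact network):

```python
import numpy as np

# Fixed SELU constants from the paper
ALPHA, LAMBDA = 1.6732632423543772, 1.0507009873554805

def selu(x):
    return LAMBDA * np.where(x > 0.0, x, ALPHA * (np.exp(x) - 1.0))

rng = np.random.default_rng(7)
width, depth = 784, 8                  # assumed network shape
h = rng.standard_normal((512, width))  # standard-scaled input batch

for _ in range(depth):
    # Zero-mean weights, stddev sqrt(1 / fan_in), as described above
    W = rng.normal(0.0, np.sqrt(1.0 / width), size=(width, width))
    h = selu(h @ W)

# After 8 layers the activations are still roughly normalized
print(round(h.mean(), 2), round(h.std(), 2))
```

No batch norm anywhere, yet the statistics don't drift; that is the whole point of the SELU + initialization pairing.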

More importantly, SNNs seem to perform better. The plots below (taken from the mentioned GitHub repo) compare three convolutional networks with identical architectures that differ only in their activation function and initialization:
SELU vs. ELU vs. ReLU.

It seems that SELU converges faster and reaches better accuracy on the test set.

Notice that SELU plus the mentioned initialization gave us improved accuracy and faster convergence on a CNN, so don't hesitate to try it on architectures that are not pure FNNs; it seems able to boost performance there as well.