Activation Functions
Why Do We Need Activation Functions?
Let's take the previous equations in the earlier simple neural network, remove the activation functions and see what happens.
We had the following equations:
$Z^{[1]} = W^{[1]}A^{[0]} + b^{[1]}$
$A^{[1]} = \sigma (Z^{[1]})$
$Z^{[2]} = W^{[2]}A^{[1]} + b^{[2]}$
$A^{[2]} = \sigma (Z^{[2]})$
Let's remove the activation function now.
Actually, a cleaner way to do this is to replace the sigmoid activation function with a linear one, such that $g(z) = z$.
Applying this linear activation function, we now have
$Z^{[1]} = W^{[1]}A^{[0]} + b^{[1]}$
$A^{[1]} = g(Z^{[1]}) = Z^{[1]} $
$Z^{[2]} = W^{[2]}A^{[1]} + b^{[2]}$
$A^{[2]} = g(Z^{[2]}) = Z^{[2]}$
All of these are linear functions.
Now we have
$A^{[2]} = Z^{[2]} = W^{[2]}A^{[1]} + b^{[2]}$
Substituting $A^{[1]}$ with $W^{[1]}A^{[0]} + b^{[1]}$, we have
$A^{[2]} = W^{[2]}(W^{[1]}A^{[0]} + b^{[1]}) + b^{[2]}$
$\implies A^{[2]} = (W^{[2]}W^{[1]})A^{[0]} + (W^{[2]}b^{[1]}+b^{[2]}) $ which is just another linear function.
The point is that a neural network using only linear activation functions (basically no activation function at all) cannot compute anything more complex than a linear function of its input.
So stacking multiple such layers is basically equivalent to having no hidden layers at all.
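The collapse derived above can also be checked numerically. The following is a minimal sketch using NumPy; the layer sizes, the random weights and the names `W_combined` and `b_combined` are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 3 inputs, 4 hidden units, 2 outputs, 5 examples.
n_x, n_h, n_y, m = 3, 4, 2, 5

A0 = rng.standard_normal((n_x, m))
W1 = rng.standard_normal((n_h, n_x))
b1 = rng.standard_normal((n_h, 1))
W2 = rng.standard_normal((n_y, n_h))
b2 = rng.standard_normal((n_y, 1))

# Two layers with the linear activation g(z) = z.
A1 = W1 @ A0 + b1
A2 = W2 @ A1 + b2

# One linear layer with the combined weights and bias from the derivation.
W_combined = W2 @ W1
b_combined = W2 @ b1 + b2
A2_single = W_combined @ A0 + b_combined

# The two computations should agree up to floating-point rounding.
print(np.allclose(A2, A2_single))
```

Whatever random weights are drawn, the two-layer output and the single-layer output should match, which is exactly what the algebra says.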
If you use linear activation functions only in the hidden layers and put a sigmoid activation function in the output layer, the network is equivalent to plain logistic regression.
The only place where a linear activation function makes sense is the output layer, when the output is a continuous range of numbers (as in regression problems).
But even in regression problems, the linear activation function belongs only in the final output layer; all the hidden layers must still use some non-linear activation function.
Now that we know WHY we use activation functions, let us look at the different types of activation functions.
Different Types Of Activation Functions
In general, we use four different types of activation functions:
- Sigmoid Activation Function
- Hyperbolic Tangent Function (a.k.a. tanh function)
- ReLU function (Rectified Linear Unit)
- Leaky ReLU
Let's discuss all of them in detail :
1. Sigmoid Function.
You probably already have a fair idea of the sigmoid function. What you might not know is that it is usually not the best choice for hidden layers. The most appropriate use of a sigmoid function is in the output layer of a binary classifier, where y is either 0 or 1. Having $\hat y$ lie between 0 and 1 is desirable in these cases, so using the sigmoid activation function makes perfect sense.
Friendly reminder : The sigmoid function is expressed as $$\sigma(z) = \frac{1}{1+e^{(-z)}}$$
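As a minimal sketch, the sigmoid formula above can be written in NumPy as follows. The sample inputs and the 0.5 decision threshold are assumptions chosen for illustration, not something fixed by the formula itself.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation 1 / (1 + e^(-z)), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical pre-activation values of an output layer.
z = np.array([-5.0, 0.0, 5.0])

# Every output lies strictly between 0 and 1, so it can be read as a probability.
y_hat = sigmoid(z)

# Turn probabilities into 0/1 labels (the 0.5 threshold is an assumption).
predictions = (y_hat >= 0.5).astype(int)

print(y_hat)
print(predictions)
```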
2. Tanh Function.
The tanh function is essentially a sigmoid function that has been rescaled and shifted down, so that its output lies between -1 and 1. In fact, $\tanh(z) = 2\sigma(2z) - 1$.
And tanh is expressed as $$g(z) = \frac {e^{z}-e^{-z}}{e^{z}+e^{-z}}$$
When used in hidden layers, tanh usually performs better than sigmoid.
Just as we centre the input data so that it has mean 0, tanh keeps the activations of a layer roughly centred around 0, which makes learning in the next layer simpler and hence faster.
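To make the link between tanh and sigmoid concrete, here is a small sketch that checks the identity $\tanh(z) = 2\sigma(2z) - 1$ and compares the mean output of both functions. The symmetric sample inputs are an assumption for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical inputs, symmetric around 0.
z = np.linspace(-3.0, 3.0, 7)

# tanh written out from its definition.
tanh_direct = (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))

# tanh written as a rescaled, shifted sigmoid.
tanh_from_sigmoid = 2.0 * sigmoid(2.0 * z) - 1.0

print(np.allclose(tanh_direct, np.tanh(z)))
print(np.allclose(tanh_direct, tanh_from_sigmoid))

# For symmetric inputs, sigmoid outputs average about 0.5, tanh outputs about 0.
print(sigmoid(z).mean(), np.tanh(z).mean())
```

For inputs that are symmetric around zero, the sigmoid outputs average to about 0.5 while the tanh outputs average to about 0, which is the centring effect described above.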
3. ReLU Function.
One downside of both the sigmoid and tanh functions is that when z is very large or very small (very negative), the slope of the function tends to zero, which slows down learning during backpropagation.
Therefore, ReLU is used.
The Rectified Linear Unit (ReLU) is expressed by $g(z) = \max(0, z)$.
The expression $g(z) = \max(0, z)$ means that when z is less than zero, the output is 0, and when z is greater than or equal to zero, the output is z.
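The sketch below implements ReLU and compares its slope with the slopes of sigmoid and tanh for large values of z. The helper names (`relu_grad`, `sigmoid_grad`, `tanh_grad`) and the sample inputs are assumptions for illustration; the slope of ReLU at exactly z = 0 is undefined, and picking 0 there is just a convention.

```python
import numpy as np

def relu(z):
    """ReLU activation max(0, z), applied element-wise."""
    return np.maximum(0.0, z)

def relu_grad(z):
    # Slope is 1 for z > 0 and 0 for z < 0; at z = 0 we pick 0 (a convention).
    return (np.asarray(z) > 0).astype(float)

def sigmoid_grad(z):
    # Derivative of sigmoid: s * (1 - s), where s = sigmoid(z).
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def tanh_grad(z):
    # Derivative of tanh: 1 - tanh(z)^2.
    return 1.0 - np.tanh(z) ** 2

# Hypothetical pre-activation values, including large magnitudes.
z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])

print(relu(z))
print(relu_grad(z))
print(sigmoid_grad(z))  # close to 0 at both ends
print(tanh_grad(z))     # close to 0 at both ends
```

At z = 10 the sigmoid and tanh slopes are nearly zero, while the ReLU slope is still 1, so the gradient does not shrink for large positive z.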
4. Leaky ReLU
One disadvantage of ReLU is that when z is negative, the slope is again 0.
While in practice this doesn't make much of a difference to learning, we can improve on it by giving the function a slight slope for negative z.
The expression of Leaky ReLU is $g(z) = \max(0.01z, z)$
The 0.01 can be replaced by another small positive constant, but 0.01 is the usual choice.
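A minimal sketch of Leaky ReLU follows. The parameter name `alpha` for the small constant and the helper `leaky_relu_grad` are assumptions for illustration; the formula $\max(\alpha z, z)$ only behaves as intended when $\alpha$ lies between 0 and 1.

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    """Leaky ReLU max(alpha * z, z); assumes 0 < alpha < 1."""
    return np.maximum(alpha * z, z)

def leaky_relu_grad(z, alpha=0.01):
    # Slope is 1 for z > 0 and alpha otherwise, so it is never exactly 0.
    return np.where(np.asarray(z) > 0, 1.0, alpha)

# Hypothetical pre-activation values.
z = np.array([-10.0, 0.0, 10.0])

print(leaky_relu(z))
print(leaky_relu_grad(z))
```

Unlike plain ReLU, negative inputs still produce a small non-zero slope, so the corresponding weights keep receiving a (small) gradient.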