Backpropagation (Part 1)

A computational graph is a directed graph where each node is associated with some mathematical operation.
We'll use these computational graphs to understand how we find the partial derivatives of the cost function with respect to the different weights and biases.

To understand how to calculate the partial derivatives through a computational graph, let's assume for a second that J is a function of 3 parameters, a, b and c.
And that J is expressed as $$J = 3(a+bc)$$
When thinking about how to compute J step by step, you'll realize that you first have to calculate b*c. Let us call that product u.
We then have to add this u to a.
Let us now call this sum v.
And finally, we'll have to multiply this v with 3, in order to (finally) get J.
Let's visualize this with our computational graph


 
Let's assign some random values to a, b, and c: say a = 5, b = 3 and c = 2, so that u = bc = 6, v = a + u = 11, and J = 3v = 33.
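As a tiny aside, here's that forward pass written out as a minimal Python sketch using those same values (the variable names just mirror the nodes of the graph):

```python
# Forward pass through the computational graph of J = 3(a + b*c)
a, b, c = 5, 3, 2   # the example values used below

u = b * c           # first node:  u = b*c -> 6
v = a + u           # second node: v = a+u -> 11
J = 3 * v           # output node: J = 3v  -> 33

print(u, v, J)      # 6 11 33
```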


Now, we know that, by its very definition, the derivative of a function x with respect to y tells us the rate of change of x with respect to y.
Keeping that in mind, notice that when you change v by 1 (i.e., make v = 12), J changes by 3 (because it increases to 36, and 36 - 33 = 3, therefore $\Delta J = 3$).
Therefore, we can say that $\frac {dJ}{dv} = 3$

And very similarly, we can find out the change in v with respect to a (i.e., $\frac {dv}{da}$).
Because when you increase a by 1, v also increases by 1, therefore $\frac {dv}{da} = 1$.

And in exactly the same way, $\frac {dv}{du} = 1$

And following the pattern, we can find out $\frac{du}{db}$ and $\frac {du}{dc}$
(if you increase b by 1, u will increase by 2, which means $\frac{du}{db} = 2$ and if you increase c by 1, u increases by 3, which implies $\frac{du}{dc} = 3$)

And all these observations are consistent with the general formulas of differentiation that we learnt in high school. If u = bc, then $\frac{du}{db} = c$ (which is equal to 2, hence the increase of 2) and $\frac{du}{dc} = b$ (which is 3).
Likewise, since v = a + u, $\frac {dv}{da} = 1$, and since J = 3v, $\frac {dJ}{dv} = 3$.

Now!
Let's move on to the chain rule.
We know that $$\frac{dJ}{da} = \frac{dJ}{dv}*\frac{dv}{da}$$

Therefore, calculating the rate of change of J with respect to a, b, and c, we have
$$\frac{dJ}{da} = \frac{dJ}{dv}*\frac{dv}{da} = 3*1 = 3$$
$$\frac{dJ}{db} = \frac{dJ}{dv}*\frac{dv}{du}*\frac{du}{db} = 3*1*2 = 6$$
$$\frac{dJ}{dc} = \frac{dJ}{dv}*\frac{dv}{du}*\frac{du}{dc} = 3*1*3 = 9$$

PS. You can absolutely check each of them manually by changing one term by 1, keeping the rest constant and observing how J changes.
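In fact, here's a small Python sketch that runs exactly that check numerically (I use a tiny step `eps` instead of a full +1 nudge, so the approximation is tighter):

```python
# Numerically check dJ/da = 3, dJ/db = 6, dJ/dc = 9 at a=5, b=3, c=2
def J(a, b, c):
    u = b * c
    v = a + u
    return 3 * v

a, b, c = 5, 3, 2
eps = 1e-6

dJ_da = (J(a + eps, b, c) - J(a, b, c)) / eps   # ~3.0
dJ_db = (J(a, b + eps, c) - J(a, b, c)) / eps   # ~6.0
dJ_dc = (J(a, b, c + eps) - J(a, b, c)) / eps   # ~9.0

print(dJ_da, dJ_db, dJ_dc)
```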

And now!

Let's apply these concepts to find the derivatives that were used in the Gradient Descent of logistic regression.
Now, let me just clarify one thing here before you scroll down and lose hope.
The calculation is NOT difficult, it's just lengthy.
And I've intentionally made it lengthy so that you never lose track of the progression at any step.

Now that we have that cleared, let's begin!

We have
$w_1, x_1, w_2, x_2$ and $b$ as inputs (assuming there are only 2 features, so x and w each have 2 components)
$$z = w_1x_1 + w_2x_2 + b$$
$$a = \sigma (z)$$
(Loss) $$L(a) = -ylog(a) - (1-y)log(1-a)$$
(the loss is a function of a only, since y is constant for a given example)

and this can be drawn as :


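Before we start differentiating, here's that forward pass as a minimal Python sketch for a single training example (the numbers for w, x and y are made up purely for illustration):

```python
import numpy as np

# Forward pass for one training example (made-up values, just a sketch)
w1, w2, b = 0.5, -0.3, 0.1
x1, x2, y = 2.0, 1.0, 1

z = w1 * x1 + w2 * x2 + b                       # linear part
a = 1 / (1 + np.exp(-z))                        # a = sigma(z)
L = -y * np.log(a) - (1 - y) * np.log(1 - a)    # loss for this example

print(z, a, L)
```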

Let us first find out $\frac {dL}{da}$ :
$$L = -ylog(a) - (1-y)log(1-a)$$
$$\implies \frac {dL}{da} = \frac {-y}{a} - \frac {1-y}{1-a}*(-1)$$
$$\implies \frac {dL}{da} = \frac {-y}{a} + \frac {1-y}{1-a} $$

Now calculating $\frac {da}{dz}$:
$$a = \sigma(z)$$
$$\implies a  =  \frac {1}{1+e^{-z}} $$
$$\implies a = (1+e^{-z})^{-1}$$
$$\implies \frac {da}{dz} = (-1)*(1+e^{-z})^{-1-1}*(0+e^{-z}*(-1))$$
$$\implies \frac {da}{dz} = \frac{e^{-z}}{(1+e^{-z})^2}$$
$$\implies \frac {da}{dz} = (\frac {1}{1+e^{-z}})*(\frac{e^{-z}}{1+e^{-z}})$$
$$\implies \frac {da}{dz} = a * (\frac{1+e^{-z}-1}{1+e^{-z}})$$
$$\implies \frac {da}{dz} = a * (1 - \frac{1}{1+e^{-z}})$$
$$\implies \frac {da}{dz} = a * (1 - a)$$
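Here's a quick numerical sanity check of this identity (a sketch; the point z = 0.7 is an arbitrary choice):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z, eps = 0.7, 1e-6
a = sigmoid(z)

numeric  = (sigmoid(z + eps) - sigmoid(z)) / eps   # finite-difference slope
analytic = a * (1 - a)                             # the formula we just derived

print(numeric, analytic)   # the two should agree to several decimal places
```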

Now $\frac{\partial z}{\partial w_1}$:
$$z = w_1x_1 + w_2 x_2 + b$$
$$\implies \frac{\partial z}{\partial w_1} = x_1 + 0 + 0$$
$$\implies \frac{\partial z}{\partial w_1} = x_1$$

Similarly for $\frac{\partial z}{\partial w_2}$:
$$z = w_1x_1 + w_2 x_2 + b$$
$$\implies \frac{\partial z}{\partial w_2} = x_2 + 0 + 0$$
$$\implies \frac{\partial z}{\partial w_2} = x_2$$

And for $\frac{\partial z}{\partial b}$:
$$z = w_1x_1 + w_2 x_2 + b$$
$$\implies \frac{\partial z}{\partial b} = 0 + 0 + 1$$
$$\implies \frac{\partial z}{\partial b} = 1$$

And now the main part (applying the chain rule)... 

Therefore $$\frac {dL}{dz} = \frac {dL}{da} * \frac {da}{dz}$$
$$\implies \frac {dL}{dz} = (\frac {-y}{a} + \frac {1-y}{1-a})*( a * (1 - a))$$
$$\implies \frac {dL}{dz} = -y(1-a)+(1-y)a$$
$$\implies \frac {dL}{dz} = -y + ya +a - ya$$
$$\implies \frac {dL}{dz} = a - y  $$
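This result is neat enough to deserve its own quick sanity check (again a sketch, with z = 0.7 and y = 1 chosen arbitrarily):

```python
import numpy as np

def loss_from_z(z, y):
    a = 1 / (1 + np.exp(-z))
    return -y * np.log(a) - (1 - y) * np.log(1 - a)

z, y, eps = 0.7, 1, 1e-6
a = 1 / (1 + np.exp(-z))

numeric  = (loss_from_z(z + eps, y) - loss_from_z(z, y)) / eps
analytic = a - y

print(numeric, analytic)   # both should come out to roughly -0.33 here
```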

So, our final desired derivatives:

$$\frac {dL}{dw_1} = \frac {dL}{da} * \frac {da}{dz} * \frac{dz}{dw_1}$$
$$\frac {dL}{dw_1} = \frac {dL}{dz} * \frac {dz}{dw_1}$$
$$\frac {dL}{dw_1} = (a-y)*x_1$$

Similarly 
$$\frac {dL}{dw_2} = (a-y)*x_2$$

And $$\frac {dL}{db} = a-y$$
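Putting the single-example picture together in code (a sketch reusing the same made-up numbers as before), we can compare the chain-rule gradients against finite differences:

```python
import numpy as np

def loss(w1, w2, b, x1, x2, y):
    z = w1 * x1 + w2 * x2 + b
    a = 1 / (1 + np.exp(-z))
    return -y * np.log(a) - (1 - y) * np.log(1 - a)

w1, w2, b = 0.5, -0.3, 0.1
x1, x2, y = 2.0, 1.0, 1

a = 1 / (1 + np.exp(-(w1 * x1 + w2 * x2 + b)))

# Analytic gradients from the chain rule above
dL_dw1, dL_dw2, dL_db = (a - y) * x1, (a - y) * x2, (a - y)

# Finite-difference versions for comparison
eps = 1e-6
num_dw1 = (loss(w1 + eps, w2, b, x1, x2, y) - loss(w1, w2, b, x1, x2, y)) / eps
num_dw2 = (loss(w1, w2 + eps, b, x1, x2, y) - loss(w1, w2, b, x1, x2, y)) / eps
num_db  = (loss(w1, w2, b + eps, x1, x2, y) - loss(w1, w2, b, x1, x2, y)) / eps

print(dL_dw1, num_dw1)   # should match closely
print(dL_dw2, num_dw2)
print(dL_db,  num_db)
```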

But now, if you look carefully at the Gradient Descent for logistic regression, it's not $\frac{dL}{dw}$ and $\frac{dL}{db}$ that we want.

It is $\frac{dJ}{dw}$ and $\frac{dJ}{db}$ 

Turns out, it's actually not hard to find.
Remember how we got J(w, b) from L($\hat y$, y)?
It was simply by averaging the losses of all the individual training examples.
Similarly,
in order to find $\frac{dJ}{dw}$ and $\frac{dJ}{db}$, we do the same.
Therefore the final result comes out to be
$$\frac {\partial J}{\partial w_1} = \frac{1}{m}*\sum_{i=1}^{m} \frac{\partial L(\hat y^{(i)}, y^{(i)})}{\partial w_1} =  \frac{1}{m}*(\sum_{i=1}^{m}(a^{(i)}-y^{(i)})*x_1^{(i)})  $$
$$\frac {\partial J}{\partial w_2} = \frac{1}{m}*\sum_{i=1}^{m} \frac{\partial L(\hat y^{(i)}, y^{(i)})}{\partial w_2} =  \frac{1}{m}*(\sum_{i=1}^{m}(a^{(i)}-y^{(i)})*x_2^{(i)})$$
$$\frac {\partial J}{\partial b} = \frac{1}{m}*\sum_{i=1}^{m} \frac{\partial L(\hat y^{(i)}, y^{(i)})}{\partial b} =  \frac{1}{m}*\sum_{i=1}^{m}(a^{(i)}-y^{(i)})$$
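In vectorized NumPy, those three averages boil down to just a few lines. Here's a minimal sketch (the names `gradients`, `X`, `y`, `w`, `b` and the toy dataset are my own; `X` holds the m examples as rows):

```python
import numpy as np

def gradients(w, b, X, y):
    """Batch gradients dJ/dw and dJ/db for logistic regression."""
    m = X.shape[0]
    z = X @ w + b                 # z^(i) for every example at once
    a = 1 / (1 + np.exp(-z))      # a^(i) = sigma(z^(i))
    dz = a - y                    # the (a^(i) - y^(i)) terms
    dw = (X.T @ dz) / m           # (1/m) * sum over i of (a^(i)-y^(i)) * x^(i)
    db = dz.mean()                # (1/m) * sum over i of (a^(i)-y^(i))
    return dw, db

# Tiny made-up dataset, just to show the shapes
X = np.array([[2.0, 1.0], [0.5, -1.0], [1.5, 0.0]])   # m=3 examples, 2 features
y = np.array([1.0, 0.0, 1.0])
w = np.array([0.5, -0.3])
b = 0.1

dw, db = gradients(w, b, X, y)
print(dw, db)   # dw has one entry per feature (w1, w2); db is a scalar
```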


Previous Post : Gradient Descent
