Backpropagation (Part 1)

A computational graph is a directed graph where each node is associated with some mathematical operation.
We'll use these computational graphs to understand how to find the partial derivatives of the cost function with respect to the different weights and biases.

To understand how to calculate the partial derivatives through a computational graph, let's assume for a second that J is a function of 3 parameters, a, b and c.
And that J is expressed as J = 3(a + bc).
When thinking about how to compute J step by step, you'll realize that you first have to calculate b*c; let us say that product is u.
We then have to add this u to a.
Let us now call this sum v.
And finally, we'll have to multiply this v by 3, in order to (finally) get J.
Let's visualize this with our computational graph


 
Let's assign some random values to a, b, and c: say a = 5, b = 3, and c = 2, which give u = 6, v = 11, and J = 33.
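These three steps translate directly into code. Here is a minimal sketch of the forward pass, using a = 5, b = 3, c = 2 (the values consistent with J = 33 in the walkthrough below):

```python
# Forward pass through the computational graph for J = 3(a + bc).
a, b, c = 5, 3, 2

u = b * c    # first node:  u = bc     -> 6
v = a + u    # second node: v = a + u  -> 11
J = 3 * v    # output node: J = 3v     -> 33

print(u, v, J)  # 6 11 33
```

Each intermediate node of the graph becomes one variable, which is exactly what makes the later derivative bookkeeping easy.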


Now, we know that, by its very definition, the derivative of a function x with respect to y tells us the rate of change of x with respect to y.
Keeping that in mind, notice that when you change v by 1 (i.e., make v = 12), J changes by 3 (because it increases from 33 to 36, and 36 - 33 = 3, therefore ΔJ = 3).
Therefore, we can say that dJ/dv = 3

And very similarly, we can find out the change in v with respect to a (i.e., dv/da).
Because when you increase a by 1, v also increases by 1, therefore dv/da = 1

And in exactly the same way, dv/du = 1

And following the pattern, we can find out du/db and du/dc.
(if you increase b by 1, u will increase by 2, which means du/db = 2; and if you increase c by 1, u increases by 3, which implies du/dc = 3)

And all these observations are consistent with the general formulas of differentiation that we've learnt in high school. If u = bc, then du/db = c (which is equal to 2, and hence the increment of 2) and du/dc = b (= 3),
and dv/da = 1 and dJ/dv = 3.

Now!
Let's move on to the chain rule.
We know that dJ/da = (dJ/dv)(dv/da)

Therefore, calculating the change of J with respect to a, b, and c, we have
dJ/da = (dJ/dv)(dv/da) = 3 × 1 = 3
dJ/db = (dJ/dv)(dv/du)(du/db) = 3 × 1 × 2 = 6
dJ/dc = (dJ/dv)(dv/du)(du/dc) = 3 × 1 × 3 = 9

P.S. You can absolutely check each of them manually by changing one input by 1, keeping the rest constant, and observing how J changes.
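That manual check can also be automated with finite differences. A small sketch (the function name J and the step size h are my own choices):

```python
# Finite-difference check of dJ/da = 3, dJ/db = 6, dJ/dc = 9 for J = 3(a + bc).
def J(a, b, c):
    return 3 * (a + b * c)

a, b, c = 5, 3, 2   # values consistent with the example (J = 33)
h = 1e-6            # a tiny step instead of a full +1

dJ_da = (J(a + h, b, c) - J(a, b, c)) / h
dJ_db = (J(a, b + h, c) - J(a, b, c)) / h
dJ_dc = (J(a, b, c + h) - J(a, b, c)) / h

print(dJ_da, dJ_db, dJ_dc)  # ≈ 3, 6, 9
```

Because J is linear in each input here, the finite-difference slope matches the exact derivative (up to floating-point noise).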

And now!

Let's apply these concepts to find out those derivatives that were used in the Gradient Descent of logistic regression.
Now, lemme just clarify it here before you scroll down and lose hope.
The calculation is NOT difficult, it's just lengthy.
And I've intentionally made it lengthy so that you never lose track of the progression at any step.

Now that we have that cleared, let's begin!

We have
w1, x1, w2, x2 and b as inputs (assuming x has only 2 features, so w also has 2 components)
z = w1x1 + w2x2 + b
a = σ(z)
(Loss) L(a) = -y log(a) - (1 - y) log(1 - a)            (the loss is only a function of a, since y is always constant for a given example)

and this can be drawn as :


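Before differentiating anything, the forward pass itself can be traced in code. The parameter and input values below are hypothetical, just to make the computation concrete:

```python
import math

# Forward pass for one example: z -> a = sigmoid(z) -> L(a).
w1, w2, b = 0.2, -0.4, 0.1   # hypothetical parameters
x1, x2, y = 1.5, 0.5, 1      # hypothetical example

z = w1 * x1 + w2 * x2 + b                           # z = w1x1 + w2x2 + b
a = 1 / (1 + math.exp(-z))                          # a = sigma(z)
L = -y * math.log(a) - (1 - y) * math.log(1 - a)    # the loss

print(z, a, L)
```

Backpropagation will now simply walk this chain in reverse, node by node.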

Let us first find out dL/da:
L = -y log(a) - (1 - y) log(1 - a)
dL/da = -y/a - (1 - y) · (1/(1 - a)) · (-1)
dL/da = -y/a + (1 - y)/(1 - a)

Now calculating da/dz:
a = σ(z)
a = 1/(1 + e^(-z))
a = (1 + e^(-z))^(-1)
da/dz = (-1)(1 + e^(-z))^(-2) · (0 + e^(-z) · (-1))
da/dz = e^(-z)/(1 + e^(-z))^2
da/dz = (1/(1 + e^(-z))) · (e^(-z)/(1 + e^(-z)))
da/dz = a · ((1 + e^(-z) - 1)/(1 + e^(-z)))
da/dz = a · (1 - 1/(1 + e^(-z)))
da/dz = a(1 - a)
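We can sanity-check da/dz = a(1 - a) numerically as well. A minimal sketch, with an arbitrary test point z:

```python
import math

# Numerical check that da/dz = a(1 - a) for the sigmoid.
def sigmoid(z):
    return 1 / (1 + math.exp(-z))

z = 0.7      # arbitrary test point
h = 1e-6     # small step for the finite difference

a = sigmoid(z)
numeric = (sigmoid(z + h) - sigmoid(z)) / h   # finite-difference slope
analytic = a * (1 - a)                        # the formula we just derived

print(numeric, analytic)
```

The two printed numbers agree to several decimal places, which is exactly what the derivation promises.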

Now ∂z/∂w1:
z = w1x1 + w2x2 + b
∂z/∂w1 = x1 + 0 + 0
∂z/∂w1 = x1

Similarly for ∂z/∂w2:
z = w1x1 + w2x2 + b
∂z/∂w2 = 0 + x2 + 0
∂z/∂w2 = x2

And for ∂z/∂b:
z = w1x1 + w2x2 + b
∂z/∂b = 0 + 0 + 1
∂z/∂b = 1

And now the main part (applying the chain rule)... 

Therefore dL/dz = (dL/da)(da/dz)
dL/dz = (-y/a + (1 - y)/(1 - a)) · (a(1 - a))
dL/dz = -y(1 - a) + (1 - y)a
dL/dz = -y + ya + a - ya
dL/dz = a - y
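This neat result, dL/dz = a - y, can also be verified numerically. A small sketch (the test values z and y are arbitrary):

```python
import math

# Numerical check that dL/dz = a - y, where a = sigmoid(z).
def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def loss(z, y):
    a = sigmoid(z)
    return -y * math.log(a) - (1 - y) * math.log(1 - a)

z, y = 0.5, 1   # arbitrary test point (y is 0 or 1)
h = 1e-6

numeric = (loss(z + h, y) - loss(z, y)) / h   # finite-difference slope
analytic = sigmoid(z) - y                     # the derived formula

print(numeric, analytic)
```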

So, our final desired parts:

∂L/∂w1 = (dL/da)(da/dz)(∂z/∂w1)
∂L/∂w1 = (dL/dz)(∂z/∂w1)
∂L/∂w1 = (a - y)x1

Similarly 
∂L/∂w2 = (a - y)x2

And ∂L/∂b = a - y
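Putting the three results together for one training example. All parameter and input values below are hypothetical, chosen just to trace the computation:

```python
import math

# Gradients of the loss for ONE example, using the derived formulas:
# dL/dw1 = (a - y)x1, dL/dw2 = (a - y)x2, dL/db = a - y.
w1, w2, b = 0.1, -0.2, 0.05   # hypothetical parameters
x1, x2, y = 2.0, 3.0, 1       # hypothetical example

z = w1 * x1 + w2 * x2 + b     # forward pass
a = 1 / (1 + math.exp(-z))

dL_dw1 = (a - y) * x1
dL_dw2 = (a - y) * x2
dL_db = a - y

print(dL_dw1, dL_dw2, dL_db)
```

Note that the expensive part (the forward pass) is computed once, and all three gradients reuse the same (a - y) term.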

But now, if you observe carefully at the Gradient Descent for Logistic Regression, it's not ∂L/∂w and ∂L/∂b that we want.

It is ∂J/∂w and ∂J/∂b

Turns out, it's actually not hard to find.
Remember how we got J(w, b) from L(ŷ, y)?
It was simply by averaging the losses of all the individual training examples.
Similarly,
in order to find ∂J/∂w and ∂J/∂b, we do the same.
Therefore the final result comes out to be
∂J/∂w1 = (1/m) Σ(i=1 to m) ∂L(ŷ^(i), y^(i))/∂w1 = (1/m) Σ(i=1 to m) (a^(i) - y^(i)) x1^(i)
∂J/∂w2 = (1/m) Σ(i=1 to m) ∂L(ŷ^(i), y^(i))/∂w2 = (1/m) Σ(i=1 to m) (a^(i) - y^(i)) x2^(i)
∂J/∂b = (1/m) Σ(i=1 to m) ∂L(ŷ^(i), y^(i))/∂b = (1/m) Σ(i=1 to m) (a^(i) - y^(i))
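These averaged formulas can be sketched directly in code. The tiny dataset below (m = 3 examples) and the parameter values are hypothetical:

```python
import math

# Average the per-example gradients over m examples, as in the formulas above.
X = [(1.0, 2.0), (0.5, -1.0), (2.0, 0.0)]   # (x1, x2) for each example
Y = [1, 0, 1]                                # labels
w1, w2, b = 0.3, -0.1, 0.0                   # hypothetical parameters
m = len(X)

dJ_dw1 = dJ_dw2 = dJ_db = 0.0
for (x1, x2), y in zip(X, Y):
    z = w1 * x1 + w2 * x2 + b        # forward pass for example i
    a = 1 / (1 + math.exp(-z))
    dJ_dw1 += (a - y) * x1           # accumulate (a - y) * x1
    dJ_dw2 += (a - y) * x2
    dJ_db += (a - y)

dJ_dw1 /= m                          # divide by m to get the average
dJ_dw2 /= m
dJ_db /= m

print(dJ_dw1, dJ_dw2, dJ_db)
```

These are exactly the quantities that one step of gradient descent for logistic regression subtracts (scaled by the learning rate) from w1, w2, and b.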


Previous Post : Gradient Descent
