Logistic Regression

INTRO

Logistic Regression is a learning algorithm used for classification, i.e., given an input feature vector x, the output y will be some discrete value: either 0 or 1 (representing no or yes in binary classification), or 0, 1, 2, and so on (if we're classifying between multiple categories, say cat, dog, tree, etc.)

Given an input feature vector x corresponding to an image that you want to classify as either cat (y = 1) or not cat (y = 0), we want the algorithm to output a prediction ($\hat y$) that is an estimate of y, i.e., the probability that y = 1 (the probability of the image being a cat)

Suppose x is an $n_x$-dimensional vector.
Given that information, the parameters of logistic regression are:
  • w, which is also an $n_x$-dimensional vector
  • b, which is just a real number
Given the input x and the parameters w and b, the prediction is generated as

 $$ \hat y= \sigma (w^Tx+b) $$  

Where $ \sigma (w^Tx+b) $ is

$$  \sigma(w^T x + b) = \frac{1}{1 + e^{-(w^T x + b)}} $$

If we let $z = w^Tx+b$, then $\hat y = \sigma(z)$ and...

$$ \hat y = \sigma(z) = \frac{1}{1 + e^{(-z)}}  $$

This sigmoid function looks like:

[Figure: the S-shaped curve of the sigmoid function $\sigma(z)$, rising from 0 to 1]

When the value of z is very low (tending to $-\infty$), $e^{-z}$ tends to $\infty$, and therefore the prediction

$$ \hat y = \sigma(z) = \frac{1}{1 + \infty} = 0 $$

Conversely, when the value of z is very large (tending to $+\infty$), $e^{-z}$ tends to 0, and therefore the prediction

$$ \hat y = \sigma(z) = \frac{1}{1 + 0} = 1 $$

Notice how the value of the prediction $\hat y$ always lies between 0 and 1.
And this is good, because the probability of a picture containing a cat will always lie between 0 and 1.
Also, the prediction $\hat y$ = 0.5 when z = 0.
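
To make this concrete, here is a minimal sketch of the prediction step in NumPy (the function names sigmoid and predict are just illustrative, not from any particular library):

```python
import numpy as np

def sigmoid(z):
    # Maps any real value z into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def predict(w, b, x):
    # y_hat = sigma(w^T x + b), where w and x are arrays of shape (n_x,) and b is a scalar
    z = np.dot(w, x) + b
    return sigmoid(z)

# Limiting behaviour discussed above
print(sigmoid(-100))  # ~0.0
print(sigmoid(0))     # 0.5
print(sigmoid(100))   # ~1.0
```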

When implementing logistic regression, our job is to learn the parameters w and b so that $\hat y$ becomes a good estimate of the chance of y being equal to 1.
In other words, we want the value of $\hat y$ to be very small when y = 0 and very high when y = 1,
because a small value of $\hat y$ means there is a very low chance of the picture containing a cat, while a high value of $\hat y$ implies there is a high chance of a cat being present in the picture.


In order to optimize the values of the parameters w and b, we first have to check how close the prediction ($\hat y$) is to the actual value (y).
The function that measures this difference between the prediction and the actual value is called the Loss Function.
And we want this loss to be as low as possible.

COST FUNCTION

The Cost Function (J) is a function of our parameters w (weights) and b (bias), and it tells us how badly we are doing.
A Loss function (L) tells us how badly we're doing on a single training example.

Consider the training set $\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(m)}, y^{(m)})\}$
For any training example $(x^{(i)}, y^{(i)})$, the prediction $\hat y^{(i)}$ will be given as: $$ \hat y^{(i)}= \sigma (w^Tx^{(i)}+b) $$  
Where $ \sigma (w^Tx^{(i)}+b) $ is
$$  \sigma(w^T x^{(i)} + b) = \frac{1}{1 + e^{-(w^T x^{(i)} + b)}} $$
and therefore the loss function, associated with $\hat y$ and y is given by
$$L(\hat y, y) = -[y\log(\hat y) + (1-y)\log(1-\hat y)]$$
We want this Loss to be as low as possible
To understand why we're using this loss function, let's look at 2 cases:
  1. When y = 1
    $L(\hat y, y) = -y\log(\hat y) = -\log(\hat y)$       ...(since y = 1)

    In order for $L(\hat y, y)$ to be low, we want $-\log(\hat y)$ to be as low as possible.
    This implies $\hat y$ should be as high as possible, and since $\hat y$ can go only as high as 1, with an optimized set of parameters $\hat y$ will be close to y, which is what we want after all (the prediction being as close as possible to the real value).

  2. When y = 0
    $L(\hat y, y) = -(1-y)\log(1-\hat y) = -\log(1-\hat y)$       ...(since y = 0)

    In order for $L(\hat y, y)$ to be low, we want $-\log(1-\hat y)$ to be as low as possible.
    This implies $(1-\hat y)$ should be as large as possible,
    and that implies $\hat y$ should be as small as possible, and since $\hat y$ can go only as low as 0, with an optimized set of parameters $\hat y$ will be close to y.
Finally, note that this loss function was defined with respect to a single training example; it measures how well you're doing on that particular example.
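
As a quick sanity check, here is a small sketch of this loss in NumPy (the function name loss and the example values are just illustrative):

```python
import numpy as np

def loss(y_hat, y):
    # L(y_hat, y) = -[y*log(y_hat) + (1 - y)*log(1 - y_hat)]
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# When y = 1, a prediction close to 1 gives a small loss
print(loss(0.99, 1))   # ~0.01
print(loss(0.01, 1))   # ~4.61 (a confident wrong prediction is penalized heavily)

# When y = 0, a prediction close to 0 gives a small loss
print(loss(0.01, 0))   # ~0.01
print(loss(0.99, 0))   # ~4.61
```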

Deriving the Cost Function (J) from the above explanation...
The Cost Function (J) measures how well you're doing on the entire training set.
So the Cost Function (J), which is applied to the parameters w and b, is going to be the average of the loss function applied to each of the training examples, like so:

$$J(w, b) = \frac {1}{m} \sum_{i=1}^{m} L(\hat y^{(i)}, y^{(i)}) = -\frac {1}{m} \sum_{i=1}^{m} \left[y^{(i)}\log(\hat y^{(i)}) + (1-y^{(i)})\log(1-\hat y^{(i)})\right]$$

So basically, the loss function is applied to just a single training example, whereas the cost function measures the cost of a particular choice of parameters over the whole training set. In training our logistic regression model, we're going to find the parameters w and b that minimize the overall Cost Function (J).
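
Here is a minimal vectorized sketch of this cost, assuming the training examples are stacked column-wise into a matrix X (the function name cost and the shape conventions are assumptions for illustration):

```python
import numpy as np

def cost(w, b, X, Y):
    # X: feature matrix of shape (n_x, m), one training example per column
    # Y: labels of shape (1, m); w: weights of shape (n_x, 1); b: scalar bias
    m = X.shape[1]
    Y_hat = 1.0 / (1.0 + np.exp(-(np.dot(w.T, X) + b)))   # predictions for all m examples
    losses = -(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat))
    return np.sum(losses) / m                              # average loss = J(w, b)
```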

Graphically, the Cost Function is a curve, looking something like this:

[Figure: a cost function curve with its minimum at the bottom]

Image Source : https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781786466587/3/ch03lvl1sec21/minimizing-the-cost-function

When we say we want to minimize the cost function, what we're saying is that we want to reach the bottom of that curve, i.e., its minimum.

Previous Post : Notations
Next Post : Gradient Descent
