Understanding Logistic Regression
February 20, 2019
Logistic Regression is one of the most widely used classification algorithms in industry today.
In spite of its name, Logistic Regression, just like the previously discussed Adaline model, is a linear model for binary classification, NOT regression.
It can just as well be extended to work as a multi-class classifier using techniques like One-versus-Rest (OvR).
The reasons for Logistic Regression’s popularity?
- It is an easy model to implement and understand
- It performs very well on linearly separable classes

Logistic Regression is very similar to the Adaline model, but with a different activation function and cost function, as we will see.
New Terms and functions
The odds in favor of a particular event can be written as $\frac{p}{1-p}$, where $p$ is the probability of the positive event
(positive = the event we are looking for, not necessarily in a good context, e.g. the presence of a disease).
The logit function is simply the logarithm of the odds, where again $p$ is the probability of the positive event:

$$\text{logit}(p) = \log\frac{p}{1-p}$$

The logit function takes input values in the range 0 to 1 and transforms them to values over the entire real-number range, which we can use to express a linear relationship between feature values and log-odds:

$$\text{logit}\big(P(y=1 \mid x)\big) = w^T x$$

Here, $P(y=1 \mid x)$ is the conditional probability that a particular sample belongs to class 1 given its features $x$.
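As a quick sketch of the logit function described above (the function name and values here are illustrative, not from the original post):

```python
import numpy as np

def logit(p):
    """Log-odds: maps probabilities in (0, 1) onto the entire real line."""
    return np.log(p / (1 - p))

print(logit(0.5))  # 0.0 -- even odds
print(logit(0.9))  # ~2.2 -- log(9), heavily in favor of the event
```

Note how a probability of 0.5 maps to a log-odds of exactly 0, which is why the decision threshold later falls at $z = 0$.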
We are actually interested in predicting the probability that a certain sample belongs to a class, which is the inverse of the logit function. It is called the logistic sigmoid, often abbreviated to simply the sigmoid function because of its characteristic S-shape:

$$\phi(z) = \frac{1}{1 + e^{-z}}$$

where $z$ is the net input, i.e. the linear combination of the weights and the sample's features: $z = w^T x$.
We can see that $\phi(z)$ approaches 1 as $z$ goes to infinity, since $e^{-z}$ becomes vanishingly small, and it approaches 0 as $z$ goes to negative infinity.
There's an intercept at $\phi(0) = 0.5$.
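The sigmoid and its limiting behavior can be sketched in a few lines of NumPy (a minimal illustration, not part of the original post):

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: squashes any real net input z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))    # 0.5 -- the intercept
print(sigmoid(10.0))   # ~1, since e^{-10} is vanishingly small
print(sigmoid(-10.0))  # ~0, since e^{10} dominates the denominator
```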
The following illustrates the difference between the Adaline and Logistic Regression learning rules.
The activation function in Logistic Regression is the sigmoid function, instead of the linear identity function used in Adaline.
The output of Logistic Regression's activation function is the probability that the given sample belongs to class 1: $\phi(z) = P(y=1 \mid x)$.
If we want the probability that the sample belongs to class 0, we simply subtract the output from 1: $P(y=0 \mid x) = 1 - \phi(z)$.
The cost function is different
Logistic Regression has a final threshold function (not seen in the schematic) such that:
If the probability of the sample is $\geq 0.5$, it is classified as class 1.
If the probability of the sample is $< 0.5$, it is classified as class 0.

$$\hat{y} = \begin{cases} 1 & \text{if } \phi(z) \geq 0.5 \\ 0 & \text{otherwise} \end{cases}$$

Looking at the sigmoid function, this is equivalent to:

$$\hat{y} = \begin{cases} 1 & \text{if } z \geq 0 \\ 0 & \text{otherwise} \end{cases}$$
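The equivalence between thresholding $\phi(z)$ at 0.5 and thresholding $z$ at 0 means prediction never needs to evaluate the sigmoid at all. A minimal sketch (the function names and sample values are my own, for illustration):

```python
import numpy as np

def net_input(X, w):
    # z = w^T x for every sample (row) in X
    return X @ w

def predict(X, w):
    # thresholding z at 0 is equivalent to thresholding sigmoid(z) at 0.5
    return np.where(net_input(X, w) >= 0.0, 1, 0)

X = np.array([[1.0, 2.0],
              [-1.0, -2.0]])
w = np.array([0.5, 0.5])
print(predict(X, w))  # [1 0]
```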
The cost function
Long story short, the cost function for Logistic Regression is:

$$J(w) = \sum_{i} \left[ -y^{(i)} \log\big(\phi(z^{(i)})\big) - \big(1 - y^{(i)}\big) \log\big(1 - \phi(z^{(i)})\big) \right]$$

- $y^{(i)}$ is the class label in the target vector (a 1 or a 0)
- $\phi(z^{(i)})$ is the probability output by the activation function.

For a better understanding, let's look at the cost calculation for a single training sample:

$$J\big(\phi(z), y; w\big) = -y \log\big(\phi(z)\big) - (1 - y) \log\big(1 - \phi(z)\big)$$
So, this means:
When $y = 1$, the $(1-y)\log(1-\phi(z))$ term in the cost function becomes 0, leaving $-\log(\phi(z))$.
Now, as we want to minimize our cost function, we want $\log(\phi(z))$ to be as big as possible, so that when multiplied by $-1$ it becomes as small as possible. Thus we want $\phi(z)$ to be as large as possible.
Since $\phi(z)$ is a sigmoid function, the maximum value it can have is 1, which is as close to class 1 as possible. See how it all fits together?
When $y = 0$, the $y\log(\phi(z))$ term in the cost function becomes 0, leaving $-\log(1-\phi(z))$.
Now, as we want to minimize our cost function, we want $\log(1-\phi(z))$ to be as big as possible, so that when multiplied by $-1$ it becomes as small as possible. This only happens when $\phi(z)$ is as small as possible, i.e. 0, since it is, after all, a sigmoid function.
Read through all that? Now you know quite a lot about the basis and theory of the Logistic Regression binary classification model. Peace out.