Understanding Logistic Regression
February 20, 2019
Logistic Regression is one of the most widely used algorithms for classification in the industry today.
Inspite of the name, Logistic Regression, just like the previously discussed Adaline model is a linear model for binary classification and NOT regression.
It just aswell can be extended to work as a multiclass classifier via using techniques like the OneVersesRest(OvR) technique.
The reasons for Logistic Regression’s popularity?
 It is an easy model to implement and understand
 Performs very well on linearly seprable data classes
Logistic Regression is a lot similar to Adaline model but, with a different activation and cost function, as we will see.
New Terms and functions

Odds ratio:
The odds in favor of a particular event, the odds ratio can be written as $\frac{p}{(1p)}$ where $p$ is the probability of the positive event
(positive = the event we are looking for, not necessarily in a good context, eg: presence of a disease) 
logit function:
$logit(p) = log \frac{p}{(1p)}$
where again $p$ is the probability of the positive event
The logit function takes input values in the range 0 to 1 and transforms them to values over the entire realnumber range, which we can use to express a linear realtionship between feature values and logodds:
here, $p(y=1  x)$ is the conditional probability that a particular sample belongs to class 1 given features x.

sigmoid function:
we are actually interested in predicting the probability that a certain sample belongs to a class, which is the inverse form of logit function. It is called the logistic sigmoid, often abbreviated to simple sigmoid function because of its characteristic Sshape:
$\phi(z) = \frac{1}{1+e^{z}}$
where $z$ is the net input, ie the linear combination of the weights and sample features, $z = w^Tx$
we can see that $\phi(z)$ approaches 1 if z goes to infinity $(z > \infty)$ as $e^{z}$ becomes dismally small, and it reaches  0 when z goes to inifinity.
There’s an intercept at $\phi(0) = 0.5$
The Schematic
The following illustrates the difference between Adaline and Logistic Regression rules

key differences:

The activation function in Logistic Regression is a sigmoid function, instead of the linear identity function in Adaline.

The output from the Logistic Regression’s Activation function is the probability that the given sample belongs to class 1.

If we want the probability that the sample belongs to class 0, subtract the output from 1

The cost function is different

Logistic Regression has a final threshold function (not seen in the schematic) such that:
If the probability of the sample is $\geq$ 0.5, it is classified as class 1
If the probability of the sample is $<$ 0.5, it is classifed as class 0
$\hat y$ = 1 if $\phi(z) \geq 0.5$, $0$ otherwise.
Looking at the sigmoid function, this is equivalent to
$\hat y = 1$ if $z\geq 0.0$, else $0$ otherwise.
The cost function
Long story short, the cost function for logistic regression is:
$J(w)\ =\ \sum_{i=1}^{n} \big[ y^{(i)}log(\hat y)  (1y^{(i)})log(1\hat y)\big]$
where:
 $y^{(i)}$ is the $i_{th}$ class label in the target vector (a 1 or a 0)
 $\hat y = \phi(z)$ i.e. the probability output by the activation function. $\in [0, 1]$
For better understanding, lets look at the cost calculation for a single sample training instance
$J(\phi(z), y;w) = ylog(\phi(z))  (1y)(log(1  \phi(z)))$so, this means: 

when y = 1, the term $(1  y)$ in cost function becomes 0,
thus leaving $J(\phi(z), y;w) = log(\phi(z))$Now, as we want to Minimize our cost function, we want $log(\phi(z))$ as big as possible such that when multiplied by $1$, it is as small as possible, Thus we want $\phi(z)$ to be as large as possible.
Since, $\phi(z)$ is a sigmoid function, thus the max value it can have is $1$, which is as close to class 1 as possible. See how it all fits together ? 
when y = 0, the term $ylog(\phi(z))$ in the cost function becomes 0,
thus leaving $J(\phi(z), y;w) = (log(1  \phi(z)))$Now, as we want to minimize our cost function, we want $log(1  \phi(z)$ to be as big as possible, such that when multiplieg by 1, it becomes as small as possible. This only happens when $\phi(z)$ is as small as possible, which is 0, since it is afterall a sigmoid function.
Read through all that? Now you know quite a lot about the basis and theory of the Linear Regression Binary classification model. peace out.