In this post we will cover:
- The logistic (or sigmoid) function
- Logistic regression for binary classification
- Cost function
- Optimization technique
Logistic regression is a variant of linear regression in which the dependent (output) variable is categorical. Don't be confused by the name: despite the word "regression", it is a classification algorithm. In particular, we can use it for binary classification.
The term logistic in logistic regression comes from the logistic function (also known as the sigmoid function), which can be written as:

y(z) = 1 / (1 + e^(-z))
The following is a plot of y(z):
[Source: Wikipedia]
The sigmoid function squashes any real input into the interval (0, 1), so we can interpret the output as a probability. It also has the desirable property of being differentiable everywhere, which means we can use gradient descent as the optimization technique.
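The two properties above are easy to verify numerically. A minimal NumPy sketch of the sigmoid function:

```python
import numpy as np

def sigmoid(z):
    """Map any real input into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# The output is interpretable as a probability: large negative inputs map
# close to 0, large positive inputs map close to 1, and sigmoid(0) = 0.5.
print(sigmoid(0.0))                          # 0.5
print(sigmoid(np.array([-6.0, 0.0, 6.0])))
```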
In linear regression our hypothesis is a linear equation, y = w0 + w1x1. Since y can take arbitrarily large negative or positive values, linear regression is not a good choice when the output is a binary categorical variable, i.e. yi = 0 or 1.
In logistic regression, the output hw(x) is squashed by the sigmoid function, i.e.

hw(x) = 1 / (1 + e^(-(w0 + w1x1)))
and is interpreted as the probability that y = 1 given the input x, i.e. P(y=1|x). The probability that y = 0 is simply 1 - P(y=1|x).
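This hypothesis can be sketched directly in code. The weights below are hypothetical, chosen only to illustrate the probability interpretation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, w0, w1):
    """h_w(x) = sigmoid(w0 + w1*x), read as P(y=1 | x)."""
    return sigmoid(w0 + w1 * x)

# Hypothetical weights, for illustration only.
p1 = predict_proba(2.0, w0=-1.0, w1=0.5)  # P(y=1 | x=2)
p0 = 1.0 - p1                             # P(y=0 | x=2)
print(p1, p0)
```

Note that the two probabilities always sum to 1, since the model only assigns P(y=1|x) and the complement covers y = 0.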
The cost function (or loss function) we use for logistic regression is the negative log-likelihood,

L(w) = - Σi [ yi log hw(xi) + (1 - yi) log(1 - hw(xi)) ]

i.e. the sum over the training data of the negative log probability assigned to the target class. As the figure below shows, this cost is high when the target class is assigned a low probability, and is 0 when the assigned probability is 1.
[Plot of -log(p) for p in the range 0 to 1]
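A short sketch of this cost, with a clipping guard (an implementation detail not discussed above) to avoid taking log of exactly 0:

```python
import numpy as np

def neg_log_likelihood(y, p):
    """Sum over examples of -log(probability assigned to the target class).
    y holds the 0/1 labels; p holds the model's P(y=1|x) for each example."""
    p = np.clip(p, 1e-12, 1 - 1e-12)  # guard against log(0)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1.0, 0.0])
confident_right = np.array([0.99, 0.01])  # target class gets high probability -> low cost
confident_wrong = np.array([0.01, 0.99])  # target class gets low probability -> high cost
print(neg_log_likelihood(y, confident_right))
print(neg_log_likelihood(y, confident_wrong))
```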
We use gradient descent to optimize the model. In fact, the cost function above is chosen so that the gradients dL/dw come out in a simple form. Roughly speaking, the log in the cost function 'undoes' the exponential in the sigmoid (looking at the whole thing from the perspective of the gradient descent optimizer), and the gradient works out to (hw(x) - y)x: the prediction error times the input.
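Putting the pieces together, a minimal batch gradient descent loop might look like the following. The data set is a made-up toy example (an intercept column plus one feature), not from the post:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.5, steps=2000):
    """Batch gradient descent on the negative log-likelihood.
    Because the log 'undoes' the sigmoid, the gradient is simply
    X^T (h_w(X) - y): prediction error times the inputs."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        error = sigmoid(X @ w) - y       # h_w(x) - y for every example
        w -= lr * (X.T @ error) / len(y)
    return w

# Toy data (hypothetical): first column is the intercept term,
# class 1 corresponds to larger feature values.
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w = fit_logistic(X, y)
probs = sigmoid(X @ w)
```

Dividing the gradient by len(y) averages over the batch, which keeps the learning rate insensitive to the data set size.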
After training, we can predict the class by calculating the probability of each class. The prediction is the class with the highest probability, i.e. if hw(x) > 0.5 then the class is 1, otherwise 0.
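The decision rule above can be sketched as follows; the weights are hypothetical stand-ins for a trained model:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(x, w0, w1):
    """Class 1 when h_w(x) > 0.5, i.e. when w0 + w1*x > 0; otherwise class 0."""
    return 1 if sigmoid(w0 + w1 * x) > 0.5 else 0

# Hypothetical trained weights, for illustration.
print(predict(3.0, w0=-1.0, w1=1.0))   # h_w(3) = sigmoid(2) -> class 1
print(predict(-3.0, w0=-1.0, w1=1.0))  # h_w(-3) = sigmoid(-4) -> class 0
```

Since sigmoid(z) > 0.5 exactly when z > 0, the rule is equivalent to checking the sign of the linear part w0 + w1x, so the decision boundary is linear.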