Logistic Regression

September 29, 2017

Logistic Regression is a variant of linear regression where the target variable is categorical, i.e. it takes one of a few possible discrete values. Each of these values is called a label. Don’t be confused by the name: logistic regression is a classification algorithm.

In particular, we can use logistic regression for binary classification (two labels). For example, we might want to predict whether or not a person has diabetes, or whether or not an email is spam.

The logistic (or sigmoid) function

The term logistic in logistic regression comes from the logistic function (also known as sigmoid function), which can be written as:

$$ f(z)=\frac{1}{1 + e^{-z}} = \frac{e^z}{e^z + 1} $$

The following is what we get if we plot f(z):

Plot of logistic (sigmoid) function. Source : Wikipedia

As we can see, the sigmoid function squashes any input value into the interval (0, 1). For large negative values, the output is very close to 0, and for large positive values, the output is very close to 1. Since the output always lies between 0 and 1, we can interpret it as a probability.

The logistic function also has the desirable property that it is a differentiable function. Hence, we can train the machine learning model using gradient descent.
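To make the squashing behaviour concrete, here is a minimal NumPy sketch that evaluates the sigmoid at a few inputs. The sample inputs are chosen purely for illustration, and the variable names `values` and `slope` are not part of the implementation later in this post. The sketch also uses the standard identity f'(z) = f(z)(1 - f(z)) to show that the derivative is easy to compute.

```python
import numpy as np

def sigmoid(z):
    # Logistic function: maps any real number into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Sample inputs chosen purely for illustration
z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])

values = sigmoid(z)            # close to 0 on the left, 0.5 at z = 0, close to 1 on the right
slope = values * (1 - values)  # derivative of the sigmoid at each input

print(values)
print(slope)
```

The slope is largest at z = 0 and shrinks towards 0 at both extremes, which is where the curve flattens out.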

Logistic Regression Model (for binary classification)

In linear regression we have a linear equation $( y = w_1x_1+b )$ as our hypothesis. Since y can take any arbitrarily large negative or positive value, linear regression is not a good choice when the output is a binary categorical variable. For example, y = spam or not spam.

In logistic regression, the output y_w,b(x) is squashed by a sigmoid function, i.e.

$$ y_{w,b}(x) = \sigma(w^\top x + b) = \frac{\exp(w^\top x+b)}{\exp(w^\top x+b) + 1} $$

where w^Tx is w₁x₁ + w₂x₂ + … + w_kx_k and b is the bias.

Let’s take an example. Suppose the input x has two dimensions, and for the current data point, (x₁, x₂) = (2.0, 1.0). If our weights are (w₁, w₂) = (2.0, -1.0) and b = 1.0, then we have w^Tx + b = 2.0 × 2.0 + (-1.0) × 1.0 + 1.0 = 4.0, and hence y_w,b(x) = e^4.0/(e^4.0+1) = 0.982.
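The arithmetic in this example can be checked with a few lines of NumPy. This is only a sketch of the calculation above; the variable names are chosen for this snippet.

```python
import numpy as np

x = np.array([2.0, 1.0])    # input (x1, x2)
w = np.array([2.0, -1.0])   # weights (w1, w2)
b = 1.0                     # bias

z = w.dot(x) + b                    # 2.0*2.0 + (-1.0)*1.0 + 1.0 = 4.0
y = np.exp(z) / (np.exp(z) + 1.0)   # sigmoid of z, approximately 0.982

print(z, round(float(y), 3))
```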

Since the output is always between 0 and 1, it can be interpreted as the probability that y = spam given input x, i.e.

$$ \begin{aligned} P(y=spam|x) &= y_{w,b}(x) \\ P(y=not\ spam|x) &= 1 - P(y=spam|x) = 1 - y_{w,b}(x) \end{aligned} $$

The probability that y = not spam is simply 1 - P(y=spam|x).

For example, if the model outputs $y_w(x) = 0.8$, then the model’s prediction is that there is an 80% chance that the email is spam, or mathematically, $P(y = spam|x) = 0.8$.

Cost function

The cost function (or loss function) we use for logistic regression is the average negative log-likelihood function.

$$ L(w) = - \frac{1}{n} \sum_i \Big( y^{(i)}_{true} \log\big( y_w(x^{(i)}) \big) + (1 - y^{(i)}_{true}) \log\big( 1 - y_w(x^{(i)}) \big) \Big) $$

It is the average over the training data (i-th data point is x⁽ⁱ⁾, y⁽ⁱ⁾), of the negative log probability assigned to the target class. Note that in the loss function, y_true is either 0 (not spam) or 1 (spam). Hence, for each data point, only one of the above terms is non-zero.

Example

Let us take an example. If $y_{true} = 1$ (email is spam) and the prediction $y_w(x^{(i)}) = 0.8$, then the loss for this data point is given by:

$$ L = - \big( 1 \times \log(0.8) + 0 \times \log(0.2) \big) = -\log(0.8) = 0.22 $$

On the other hand, if $y_{true} = 0$ (email is not spam) and the prediction $y_w(x^{(i)}) = 0.8$, then the loss for the data point is given by:

$$ L = - \big( 0 \times \log(0.8) + 1 \times \log(0.2) \big) = -\log(0.2) = 1.61 $$

As you can see, the cost is much higher when the target value is assigned a low probability by the model. In general, the log-likelihood function can be extended to multiple categories, and the loss is given by -log(prob(target class)), i.e. the negative log of the probability assigned to the target class.
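The two per-point losses above can be reproduced with a short sketch. The helper name `point_loss` is hypothetical and used only in this snippet; it assumes the natural logarithm, which is what the numbers in the examples correspond to.

```python
import numpy as np

def point_loss(y_true, y_pred):
    # Negative log-likelihood of a single data point (natural logarithm)
    return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

loss_spam = point_loss(1, 0.8)       # -log(0.8), about 0.22
loss_not_spam = point_loss(0, 0.8)   # -log(0.2), about 1.61

# The cost L(w) is the average of the per-point losses
average_loss = (loss_spam + loss_not_spam) / 2

print(loss_spam, loss_not_spam, average_loss)
```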

Plot

The figure below is a plot of negative log probability. As we can see, this cost is high when the target class is assigned a low probability, and is 0 if the assigned probability is 1.

plot of -log(p) for p in range 0 to 1

Training the model / Optimization

We use gradient descent to optimize the model. The cost function above is chosen so that the gradients dL/dw are well behaved. If you do the algebra and derive the formula for the gradients (we’ll skip this since the algebra is quite messy and doesn’t add much to the conceptual knowledge), you’ll find that they have the same form as in linear regression with a squared-error loss, up to a constant factor: the average of (prediction - target) times the input. This holds even though the predictions are not the same, and the cost is not the same. Roughly speaking, the log in the cost function ‘undoes’ the exponential in the sigmoid function.
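Although the derivation is skipped here, the resulting formula, dL/dw = (1/n) Σᵢ (y_w(x⁽ⁱ⁾) - y⁽ⁱ⁾_true) x⁽ⁱ⁾, can be sanity-checked numerically. The following is a minimal sketch that compares this expression against central finite differences of the cost. The data set is synthetic and made up for the check, and the helper names `cost` and `analytic_gradients` are assumptions of this snippet.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(W, b, X, y):
    # Average negative log-likelihood
    h = sigmoid(X.dot(W) + b)
    return (-y * np.log(h) - (1 - y) * np.log(1 - h)).mean()

def analytic_gradients(W, b, X, y):
    # Average of (prediction - target) * input, and of (prediction - target)
    error = sigmoid(X.dot(W) + b) - y
    return error.dot(X) / len(X), error.mean()

# Small synthetic data set, made up for this check
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = (rng.uniform(size=20) < 0.5).astype(float)
W = rng.normal(scale=0.1, size=3)
b = 0.0

W_grad, b_grad = analytic_gradients(W, b, X, y)

# Central finite differences as an independent estimate of dL/dw and dL/db
eps = 1e-6
W_numeric = np.zeros_like(W)
for j in range(len(W)):
    step = np.zeros_like(W)
    step[j] = eps
    W_numeric[j] = (cost(W + step, b, X, y) - cost(W - step, b, X, y)) / (2 * eps)
b_numeric = (cost(W, b + eps, X, y) - cost(W, b - eps, X, y)) / (2 * eps)

# Both differences should be tiny if the formula is right
print(np.max(np.abs(W_grad - W_numeric)), abs(b_grad - b_numeric))
```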

Making Predictions

After training, we can predict the class by calculating probability of each class. The prediction will be the class with the highest probability, i.e. if y_w(x) > 0.5 then the class is 1 otherwise 0.
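As a small illustration of this decision rule, the sketch below thresholds a handful of hypothetical model outputs at 0.5. The probabilities are invented for the example and do not come from a trained model.

```python
import numpy as np

# Hypothetical model outputs P(y = 1 | x) for five data points
probabilities = np.array([0.10, 0.49, 0.50, 0.51, 0.98])

# Class 1 if the probability exceeds 0.5, otherwise class 0
predicted_class = (probabilities > 0.5).astype(int)

print(predicted_class)
```

With a strict `>` comparison, a probability of exactly 0.5 is assigned to class 0. The `predict` helper in the implementation below uses `>=` instead, so it assigns such a tie to class 1; in practice exact ties are rare.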

Python Implementation using NumPy

Let’s start by loading the iris dataset, which contains measurements of iris flowers.

import numpy as np
from sklearn import datasets
# load iris dataset 
iris = datasets.load_iris()
X = iris.data[:, :] 
y = (iris.target != 0) * 1   # original dataset has 3 classes (0, 1, 2). We make y binary: 0 for class 0, 1 otherwise.
print(X[:10, :])
print(y[:10])

We have 4 input features in the dataset. The target variable here is which type of iris flower it is (there are 3 types in the dataset). To make it binary, we will only classify whether the flower is Type 0 or not.

Next, let’s split the dataset into training and test sets:

# random shuffle
a = np.arange(150)
np.random.seed(42)
np.random.shuffle(a)
test = list(a[0:30])
train = list(a[30:150])
# train/test split
X_train = X[train]
X_test = X[test]
y_train = y[train]
y_test = y[test]

Next we will write all the helper functions we need for logistic regression:

## general functions
def sigmoid(z):
  return 1 / (1 + np.exp(-z))
def zed(W, b, x):
  # Computes the weighted sum of inputs
  return x.dot(W) + b
def prob_prediction(W, b, x):
  # Returns the probability after passing through sigmoid
  return sigmoid(zed(W, b, x))
def cost_func(h, y):
  return (-y * np.log(h) - (1 - y) * np.log(1 - h)).mean()
## evaluation functions 
# label predictions
def predict(W, b, x):
  prob = prob_prediction(W, b, x)
  return prob >= 0.5
# accuracy evaluation
def evaluate(h_test, y):
  return (h_test == y).mean() * 100
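One practical caveat with `cost_func`: if a predicted probability ever becomes exactly 0 or 1, `np.log` returns `-inf` and the cost can turn into `nan`. A common workaround is to clip the probabilities slightly away from the endpoints first. The sketch below shows one way to do that; the name `cost_func_clipped` and the default `eps=1e-12` are assumptions of this snippet, not part of the original implementation.

```python
import numpy as np

def cost_func_clipped(h, y, eps=1e-12):
    # Keep probabilities away from exactly 0 and 1 so np.log stays finite
    h = np.clip(h, eps, 1 - eps)
    return (-y * np.log(h) - (1 - y) * np.log(1 - h)).mean()

# Two saturated predictions that are correct, plus one ordinary prediction
h = np.array([1.0, 0.0, 0.8])
y = np.array([1, 0, 1])

cost = cost_func_clipped(h, y)
print(cost)
```

The clipping only changes the result for saturated predictions, so for ordinary probabilities it gives the same value as `cost_func`.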

Finally, we can perform the training and testing:

## initialization
W = np.random.uniform(low=-0.1, high=0.1, size=X_train.shape[1])
b = 0.0
learning_rate = 0.1
epochs = 20000
## training: fit the model to training set
print('Training:')
for i in range(epochs):
  # calculate predictions
  y_predict = prob_prediction(W, b, X_train) 
  # calculate cost (average negative log-likelihood)
  cost = cost_func(y_predict, y_train)
  # calculate gradients
  W_gradient = (1.0/len(X_train)) * (y_predict - y_train).dot(X_train)
  b_gradient = (1.0/len(X_train)) * np.sum(y_predict - y_train)
  # diagnostic output
  if i % 1000 == 0: print("Epoch %d: Cost = %f" % (i, cost))
  # update parameters
  W = W - (learning_rate * W_gradient)
  b = b - (learning_rate * b_gradient)
## evaluation
print('Testing:')
test_set_predictions = predict(W, b, X_test)
print('Accuracy percentage:', evaluate(test_set_predictions, y_test))

Summary

Logistic Regression is a classification algorithm.
It is very similar to the linear regression model, but in addition, we apply the logistic function $f(z) = \frac{1}{1+e^{-z}}$ (also known as the sigmoid function) to the output.
The logistic function always outputs a value between 0 and 1. Thus the output can be interpreted as a probability.
In a binary classification problem, the output is the probability of the data belonging to one of the class labels.
The cost function of this model is the average negative log-likelihood function.
The gradient of the negative log-likelihood function in Logistic Regression has the same form, up to a constant factor, as the gradient of the Mean Squared Error loss function in Linear Regression.
After the model has been trained, we can predict the output class based on the output probability.

Functions

logistic or sigmoid function: $f(z) = \frac{1}{1+e^{-z}}$
probability of the $i^{th}$ data point with features $x_1, x_2, \dots, x_k$ belonging to a class label: $f(w_1x^{(i)}_1+w_2x^{(i)}_2+ \dots +w_kx^{(i)}_k+b) = \frac{1}{1+e^{-(w_1x^{(i)}_1+w_2x^{(i)}_2+ \dots +w_kx^{(i)}_k+b)}}$
Loss function: $L(w) = - \frac{1}{n} \sum_i \Big( y^{(i)}_{true} \log\big( y_w(x^{(i)}) \big) + (1 - y^{(i)}_{true}) \log\big( 1 - y_w(x^{(i)}) \big) \Big)$