Logistic regression is a variant of linear regression in which the dependent (output) variable is *categorical*, i.e. it takes one of a few possible discrete values. Don't be confused by the name logistic *regression*: it's a *classification* algorithm.

In particular, we can use it for **binary classification** (two categories). For example, we might want to predict whether or not a person has diabetes, or whether or not an email is spam.

# The logistic (or sigmoid) function

The term *logistic* in logistic regression comes from the **logistic function** (also known as the **sigmoid function**), which can be written as:

$$f(z) = \frac{1}{1 + e^{-z}} = \frac{e^{z}}{e^{z} + 1}$$

The following is what we get if we plot *f*(*z*):

As we can see, the sigmoid function squashes the input value into the interval (0, 1). For large negative values, the output is very close to 0, and for large positive values, the output is very close to 1. Since the output always lies between 0 and 1, we can interpret it as a probability.
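To make the squashing behaviour concrete, here is a minimal sketch of the logistic function using only Python's standard library. The function name `sigmoid` and the sample inputs are illustrative choices, not part of the original article. The two branches are algebraically the same function; switching forms for negative inputs simply avoids overflow in `math.exp`.

```python
import math

def sigmoid(z):
    """Logistic function f(z) = 1 / (1 + e^(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z use the equivalent form e^z / (e^z + 1),
    # which keeps the argument of exp() from growing large.
    ez = math.exp(z)
    return ez / (ez + 1.0)

# Large negative inputs map close to 0, large positive inputs close to 1.
for z in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(f"f({z:+.1f}) = {sigmoid(z):.5f}")
```

Note that f(0) = 0.5 exactly, and that f(z) + f(-z) = 1 for every z.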

The logistic function also has the desirable property that it is a differentiable function. Hence, we can train the machine learning model using gradient descent.

# Logistic Regression Model (for binary classification)

In linear regression we have a linear equation ( y = *w*_{0} + *w*_{1}*x*_{1} ) as our hypothesis. Since y can take any arbitrarily large negative or positive value, linear regression is not a good choice when the output is a binary categorical variable, for example when y is either *spam* or *not spam*.

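In logistic regression, the output *y*_{w}(*x*) is the linear combination w^{T}x squashed by a sigmoid function, i.e.

$$y_w(x) = \frac{1}{1 + e^{-w^{T}x}} = \frac{e^{w^{T}x}}{e^{w^{T}x} + 1}$$

where w^{T}x is w_{0} + w_{1}x_{1} + ... + w_{d}x_{d} (we assume x_{0} = 1 to simplify notation).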

Let's take an example. Suppose the input *x* has two dimensions, and for the current data point, (x_{1}, x_{2}) = (2.0, 1.0). If our weights are (w_{0}, w_{1}, w_{2}) = (1.0, 2.0, -1.0) then we have w^{T}x = 1.0 + 2.0 × 2.0 + (-1.0) × 1.0 = 4.0, and hence y_{w}(x) = e^{4.0}/(e^{4.0}+1) ≈ 0.982.
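The arithmetic in this example can be checked with a few lines of Python. This is a minimal sketch: the helper name `model_output` is hypothetical, and the bias weight w_{0} is stored as the first element of the weight list, so the input list holds only (x_{1}, x_{2}).

```python
import math

def model_output(w, x):
    """Return sigmoid(w^T x), where w[0] is the bias weight (x_0 = 1)."""
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
    return 1.0 / (1.0 + math.exp(-z))

w = [1.0, 2.0, -1.0]  # (w_0, w_1, w_2)
x = [2.0, 1.0]        # (x_1, x_2)

p = model_output(w, x)
print(round(p, 3))  # 0.982
```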

Since the output is always between 0 and 1, it can be interpreted as the *probability* that *y* = spam given input *x*, i.e.

$$P(y = \text{spam} \mid x) = y_w(x)$$

The probability that y = not spam is simply 1 - P(y=spam|x).

For example, if the model outputs y_w(x) = 0.8, then the model's prediction is that there is an 80% chance that the email is spam, or mathematically, P(y = spam|x) = 0.8.

# Cost function

The cost function (or loss function) we use for logistic regression is the average negative log-likelihood function:

$$L(w) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_{true}^{(i)} \log y_w(x^{(i)}) + \left(1 - y_{true}^{(i)}\right) \log\left(1 - y_w(x^{(i)})\right) \right]$$

where *n* is the number of training data points. It is the average over the training data (the *i*-th data point is *x*^{(i)}, *y*^{(i)}) of the negative log probability assigned to the target class. Note that in the loss function, *y*_{true} is either 0 (not spam) or 1 (spam). Hence, for each data point, only one of the above terms is non-zero.

## Example

Let us take an example. If y_{true} = 1 (email is spam) and the prediction y_w(x^{(i)}) = 0.8, then the loss for this data point is given by (using the natural logarithm):

$$-\log(0.8) \approx 0.223$$

On the other hand, if y_{true} = 0 (email is not spam) and the prediction y_w(x^{(i)}) = 0.8, then the loss for the data point is given by:

$$-\log(1 - 0.8) = -\log(0.2) \approx 1.609$$

As you can see, the cost is much higher when the target value is assigned a low probability by the model. In general, the negative log-likelihood can be extended to multiple categories, where the loss is -log(prob(target class)), i.e. the negative log of the probability assigned to the target class.
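The two cases in this example can be reproduced with a short sketch of the average negative log-likelihood. The function name `average_nll` is hypothetical, the natural logarithm is assumed, and predictions are assumed to lie strictly between 0 and 1 so that the logarithm is defined.

```python
import math

def average_nll(y_true, y_pred):
    """Average negative log-likelihood for binary labels (0 or 1)."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        # Probability the model assigned to the class that actually occurred.
        p_target = p if t == 1 else 1.0 - p
        total += -math.log(p_target)
    return total / len(y_true)

print(round(average_nll([1], [0.8]), 3))  # 0.223 (spam, predicted 0.8)
print(round(average_nll([0], [0.8]), 3))  # 1.609 (not spam, predicted 0.8)
```

Picking the probability of the target class and taking its negative log is the same as evaluating the two-term formula, since one of the two terms always vanishes.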

## Plot

The figure below is a plot of negative log probability. As we can see, this cost is high when the target class is assigned a low probability, and is 0 if the assigned probability is 1.

# Training the model / Optimization

We use gradient descent to optimize the model. The cost function above is chosen so that the gradients dL/dw are well behaved. If you do the algebra and derive the formula for the gradients (we'll skip this since the algebra is quite messy and doesn't add much to the conceptual knowledge), you'll find that the *gradients* have exactly the same form as in linear regression with squared-error loss, namely the prediction error times the input, even though the predictions are not the same and the cost is not the same. Roughly speaking, the log in the cost function 'undoes' the exponential in the sigmoid function.
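As an illustration of this training procedure, here is a minimal sketch of batch gradient descent written with only the standard library, so each step is visible. It relies on the standard result that the gradient of the average negative log-likelihood with respect to w_{j} is the average of (y_w(x) - y_true) × x_{j}. The function name `train`, the learning rate, the number of steps, and the tiny dataset are all made up for illustration; this is not the optimizer scikit-learn uses.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(X, y, lr=0.5, steps=5000):
    """Batch gradient descent on the average negative log-likelihood."""
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)  # w[0] is the bias weight (x_0 = 1)
    for _ in range(steps):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            p = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))
            err = p - yi  # prediction minus target, as in linear regression
            grad[0] += err / n
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj / n
        # Step against the gradient.
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

# Made-up 1-D dataset: larger x means class 1.
X = [[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]]
y = [0, 0, 0, 1, 1, 1]

w = train(X, y)
for xi, yi in zip(X, y):
    print(xi[0], yi, round(sigmoid(w[0] + w[1] * xi[0]), 3))
```

After training, the printed probabilities should be below 0.5 for the class-0 points and above 0.5 for the class-1 points.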

# Making Predictions

After training, we can predict the class by calculating the probability of each class. The prediction will be the class with the highest probability, i.e. if *y*_{w}(*x*) > 0.5 then the class is 1, otherwise 0.
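The decision rule is a one-line threshold on the predicted probability. In the sketch below, the function name `predict_class` and the adjustable `threshold` parameter are hypothetical; the article itself only uses the fixed value 0.5.

```python
def predict_class(probability, threshold=0.5):
    """Return class 1 if the predicted probability exceeds the threshold, else 0."""
    return 1 if probability > threshold else 0

print(predict_class(0.982))  # 1
print(predict_class(0.2))    # 0
```

Because the sigmoid equals 0.5 exactly when its input is 0, the rule y_w(x) > 0.5 is equivalent to checking whether w^{T}x > 0.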

# Example with Scikit-learn on predicting Diabetes

In this section, we'll see an example of using logistic regression. We'll use the Pima Indians Diabetes Database, in which all patients are of Pima Indian heritage (a subgroup of Native Americans) and are females aged 21 and above.

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

## load the data
diabetesDF = pd.read_csv('diabetes.csv')
print(diabetesDF.head())
```

Outcome is whether or not the patient is diabetic. 0 denotes non-diabetic, and 1 is diabetic.

```python
## split and normalize the data
# split into train and test
dfTrain = diabetesDF[:650]
dfTest = diabetesDF[650:750]
dfCheck = diabetesDF[750:]

# split features from target variable
# (drop(columns=...) replaces the positional axis argument,
#  which recent pandas versions no longer accept)
trainLabel = np.asarray(dfTrain['Outcome'])
trainData = np.asarray(dfTrain.drop(columns='Outcome'))
testLabel = np.asarray(dfTest['Outcome'])
testData = np.asarray(dfTest.drop(columns='Outcome'))

# normalize the data
# makes it easier to interpret the model by looking at its weights
means = np.mean(trainData, axis=0)
stds = np.std(trainData, axis=0)
trainData = (trainData - means)/stds
testData = (testData - means)/stds

## train and evaluate the model
# models target t as sigmoid(w0 + w1*x1 + w2*x2 + ... + wd*xd)
diabetesCheck = LogisticRegression()
diabetesCheck.fit(trainData, trainLabel)
accuracy = diabetesCheck.score(testData, testLabel)
print("accuracy = ", accuracy * 100, "%")
# prints "accuracy = 78.0%"

## interpreting the model
coeff = list(diabetesCheck.coef_[0])
labels = list(dfTrain.drop(columns='Outcome').columns)
features = pd.DataFrame()
features['Features'] = labels
features['importance'] = coeff
features.sort_values(by=['importance'], ascending=True, inplace=True)
features['positive'] = features['importance'] > 0
features.set_index('Features', inplace=True)
features.importance.plot(kind='barh', figsize=(11, 6),
                         color=features.positive.map({True: 'blue', False: 'red'}))
plt.xlabel('Importance')
```

Notice how the model assigns the largest weights to the features glucose and BMI. It is good to see the machine learning model match what we have been hearing from doctors our entire lives!

```python
## making predictions
sampleData = dfCheck[:1]

# prepare sample
sampleDataFeatures = np.asarray(sampleData.drop(columns='Outcome'))
sampleDataFeatures = (sampleDataFeatures - means)/stds

# predict
predictionProbability = diabetesCheck.predict_proba(sampleDataFeatures)
prediction = diabetesCheck.predict(sampleDataFeatures)
print('Probability:', predictionProbability)
print('prediction:', prediction)
```

The output produced by the above code is

```
Probability: [[ 0.4385153, 0.5614847]]
prediction: [1]
```

That is, the model thinks there is about a 56.1% chance that the person is diabetic. To see this example in more detail, check out this tutorial: End-to-End Example: Using Logistic Regression for predicting Diabetes.