Logistic Regression is a variant of linear regression where the dependent (output) variable is categorical, i.e. it takes one of a few possible discrete values. Don't be confused by the name: despite the word "regression", logistic regression is a classification algorithm.
In particular, we can use it for binary classification (two categories). For example, we might want to predict whether or not a person has diabetes, or whether or not an email is spam.
The logistic (or sigmoid) function
The term logistic in logistic regression comes from the logistic function (also known as the sigmoid function), which can be written as:

f(z) = 1 / (1 + e^{-z})
The following is what we get if we plot f(z):
As we can see, the sigmoid function squashes the input value into the range [0, 1]. For large negative values the output is very close to 0, and for large positive values the output is very close to 1. Since the output always lies between 0 and 1, we can interpret it as a probability.
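If you want to reproduce the plot yourself, a small sketch using numpy and matplotlib (not part of the tutorial's own code) is enough:

import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    # logistic function f(z) = 1 / (1 + e^-z)
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-10, 10, 200)
plt.plot(z, sigmoid(z))
plt.xlabel('z')
plt.ylabel('f(z)')
plt.title('The logistic (sigmoid) function')
plt.show()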
The logistic function also has the desirable property that it is a differentiable function. Hence, we can train the machine learning model using gradient descent.
Logistic Regression Model (for binary classification)
In linear regression we have a linear equation (y = w0 + w1x1) as our hypothesis. Since y can take arbitrarily large negative or positive values, linear regression is not a good choice when the output is a binary categorical variable, for example y = spam or not spam.
In logistic regression, the output y_w(x) is squashed by a sigmoid function, i.e.

y_w(x) = f(wTx) = 1 / (1 + e^{-wTx})

where wTx is w0 + w1x1 + ... + wdxd (we assume x0 = 1 to simplify notation).
Let's take an example. Suppose the input x has two dimensions, and for the current data point, (x1, x2) = (2.0, 1.0). If our weights are (w0, w1, w2) = (1.0, 2.0, -1.0), then wTx = 1.0 + 2.0*2.0 + (-1.0)*1.0 = 4.0, and hence y_w(x) = e^{4.0} / (e^{4.0} + 1) = 0.982.
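As a quick sanity check, here is the same calculation in Python (a throwaway sketch, not part of the tutorial code that comes later):

import numpy as np

w = np.array([1.0, 2.0, -1.0])    # (w0, w1, w2)
x = np.array([1.0, 2.0, 1.0])     # (x0, x1, x2), with x0 = 1
z = np.dot(w, x)                  # wTx = 4.0
y = 1.0 / (1.0 + np.exp(-z))      # sigmoid(4.0) = 0.982
print(z, y)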
Since the output is always between 0 and 1, it can be interpreted as the probability that y = spam given input x, i.e.

P(y = spam | x) = y_w(x)
The probability that y = not spam is simply 1 - P(y=spam|x).
For example, if the model outputs y_w(x) = 0.8, it means the model predicts an 80% chance that the email is spam, or mathematically, P(y = spam | x) = 0.8.
Cost function
The cost function (or loss function) we use for logistic regression is the average negative log-likelihood:

L(w) = -(1/N) Σ_i [ y_{true}^{(i)} log(y_w(x^{(i)})) + (1 - y_{true}^{(i)}) log(1 - y_w(x^{(i)})) ]

It is the average, over the training data (the i-th data point is (x^{(i)}, y^{(i)})), of the negative log probability assigned to the target class. Note that in the loss function y_{true} is either 0 (not spam) or 1 (spam), so for each data point only one of the two terms inside the brackets is non-zero.
Example
Let us take an example. If y_{true} = 1 (the email is spam) and the prediction is y_w(x^{(i)}) = 0.8, then the loss for this data point is given by:

loss = -log(0.8) ≈ 0.22
On the other hand, if y_{true} = 0 (the email is not spam) and the prediction is y_w(x^{(i)}) = 0.8, then the loss for this data point is given by:

loss = -log(1 - 0.8) = -log(0.2) ≈ 1.61
As you can see, the cost is much higher when the target class is assigned a low probability by the model. In general, the negative log-likelihood extends naturally to multiple categories: the loss for a data point is -log(prob(target class)), i.e. the negative log of the probability assigned to the target class.
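To verify the two numbers above, here is a minimal sketch of the per-example binary negative log-likelihood (the helper name nll is ours, not from any library):

import numpy as np

def nll(y_true, y_pred):
    # negative log probability assigned to the target class
    return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

print(nll(1, 0.8))   # ~0.22: target class gets probability 0.8, low cost
print(nll(0, 0.8))   # ~1.61: target class gets probability 0.2, high cost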
Plot
The figure below is a plot of negative log probability. As we can see, this cost is high when the target class is assigned a low probability, and is 0 if the assigned probability is 1.
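If you want to reproduce a plot like this, a quick sketch (assuming matplotlib) would be:

import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.01, 1.0, 200)   # probability assigned to the target class
plt.plot(p, -np.log(p))
plt.xlabel('probability assigned to target class')
plt.ylabel('negative log probability (cost)')
plt.show()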
Training the model / Optimization
We use gradient descent to optimize the model. The cost function above is chosen partly because it yields clean gradients dL/dw. If you work through the algebra and derive the formula for the gradients (we'll skip this, since the algebra is quite messy and doesn't add much conceptually), you'll find that the gradients have exactly the same form as in linear regression (even though the predictions are not the same, and the cost is not the same). Roughly speaking, the log in the cost function 'undoes' the sigmoid function.
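If you'd like to see the update rule spelled out, here is a bare-bones batch gradient descent sketch in numpy (illustrative only; the function name, learning rate and step count are our own assumptions, not from this tutorial):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, lr=0.1, steps=1000):
    # X: (n, d) features with a leading column of ones (so w[0] is the bias w0)
    # y: (n,) labels in {0, 1}
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        preds = sigmoid(X @ w)               # y_w(x) for every training point
        grad = X.T @ (preds - y) / len(y)    # same form as the linear regression gradient
        w -= lr * grad
    return w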
Making Predictions
After training, we can predict the class by calculating the probability of each class. The prediction is the class with the highest probability: if y_w(x) > 0.5 the predicted class is 1, otherwise it is 0.
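In code, this decision rule is just a threshold at 0.5. A minimal sketch (numpy only; the function name is ours, not from any library):

import numpy as np

def predict_class(X, w):
    # P(y = 1 | x) = sigmoid(wTx); predict class 1 when this exceeds 0.5
    probs = 1.0 / (1.0 + np.exp(-(X @ w)))
    return (probs > 0.5).astype(int)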
Example with Scikit-learn on predicting Diabetes
In this section, we'll walk through an example of using logistic regression. We'll use the Pima Indians Diabetes Database, in which all patients are females of Pima Indian heritage (a Native American group) aged 21 and above.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

## load the data
diabetesDF = pd.read_csv('diabetes.csv')
print(diabetesDF.head())
Outcome is whether or not the patient is diabetic. 0 denotes non-diabetic, and 1 is diabetic.
## split and normalize the data
# split into train and test
dfTrain = diabetesDF[:650]
dfTest = diabetesDF[650:750]
dfCheck = diabetesDF[750:]

# split features from target variable
trainLabel = np.asarray(dfTrain['Outcome'])
trainData = np.asarray(dfTrain.drop('Outcome', axis=1))
testLabel = np.asarray(dfTest['Outcome'])
testData = np.asarray(dfTest.drop('Outcome', axis=1))

# normalize the data
# makes it easier to interpret the model by looking at its weights
means = np.mean(trainData, axis=0)
stds = np.std(trainData, axis=0)
trainData = (trainData - means) / stds
testData = (testData - means) / stds

## train and evaluate the model
# models target t as sigmoid(w0 + w1*x1 + w2*x2 + ... + wd*xd)
diabetesCheck = LogisticRegression()
diabetesCheck.fit(trainData, trainLabel)
accuracy = diabetesCheck.score(testData, testLabel)
print("accuracy = ", accuracy * 100, "%")
# prints "accuracy = 78.0%"

## interpreting the model
coeff = list(diabetesCheck.coef_[0])
labels = list(dfTrain.drop('Outcome', axis=1).columns)
features = pd.DataFrame()
features['Features'] = labels
features['importance'] = coeff
features.sort_values(by=['importance'], ascending=True, inplace=True)
features['positive'] = features['importance'] > 0
features.set_index('Features', inplace=True)
features.importance.plot(kind='barh', figsize=(11, 6),
                         color=features.positive.map({True: 'blue', False: 'red'}))
plt.xlabel('Importance')
Notice how the model assigns the largest weights to the features Glucose and BMI. It is good to see the machine learning model match what we have been hearing from doctors our entire lives!
## making predictions
sampleData = dfCheck[:1]

# prepare sample
sampleDataFeatures = np.asarray(sampleData.drop('Outcome', axis=1))
sampleDataFeatures = (sampleDataFeatures - means) / stds

# predict
predictionProbability = diabetesCheck.predict_proba(sampleDataFeatures)
prediction = diabetesCheck.predict(sampleDataFeatures)
print('Probability:', predictionProbability)
print('prediction:', prediction)
The output produced by the above code is
Probability: [[ 0.4385153, 0.5614847]]
prediction: [1]
That is, the model thinks there is a 56.1% chance that the person is diabetic. To see this example in more detail, check out this tutorial: End-to-End Example: Using Logistic Regression for predicting Diabetes.