End-to-End Example: Using Logistic Regression for predicting Diabetes

March 23, 2018

In this tutorial, we will see how to predict whether a person has diabetes or not, based on information like blood pressure, body mass index (BMI), age, etc.

The data was collected and made available by the “National Institute of Diabetes and Digestive and Kidney Diseases” as part of the Pima Indians Diabetes Database. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are of Pima Indian heritage (a subgroup of Native Americans) and are females aged 21 and above.

Project Template on Google Colaboratory

Notebook Link

Work on this project directly in-browser via Google Colaboratory. The link above is a starter template that you can save to your own Google Drive and work on. Google Colab is a free tool that lets you run small Machine Learning projects through your web browser. You should read this 1 min tutorial if you’re unfamiliar with Google Colaboratory. Note that, for this project, you’ll have to upload the dataset linked below to Google Colab after saving the notebook to your own system.

Overview

We will be using Python as our programming language, along with some popular Python packages for machine learning and data science. First, we will import pandas to read our data from a CSV file and manipulate it for further use. We will also use numpy to convert our data into a format suitable for feeding our classification model. We’ll use seaborn and matplotlib for visualizations. We will then import the Logistic Regression algorithm from sklearn, which will help us build our classification model. Lastly, we will use joblib to save our model for future use. If you’re missing any packages, you can install them with pip3. Note that the sklearn module is distributed under the package name scikit-learn, which also installs joblib: pip3 install pandas numpy seaborn matplotlib scikit-learn

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
import joblib  # standalone package; recent scikit-learn releases no longer provide sklearn.externals.joblib

Data Description

We have our data saved in a CSV file called diabetes.csv, which you can download here. We first read the dataset into a pandas DataFrame called diabetesDF, and then use the head() function to show the first five records.

diabetesDF = pd.read_csv('diabetes.csv')
print(diabetesDF.head())

First 5 records in the Pima Indians Diabetes Database

The following features have been provided to help us predict whether a person is diabetic or not:

Pregnancies: Number of times pregnant
Glucose: Plasma glucose concentration over 2 hours in an oral glucose tolerance test
BloodPressure: Diastolic blood pressure (mm Hg)
SkinThickness: Triceps skin fold thickness (mm)
Insulin: 2-Hour serum insulin (mu U/ml)
BMI: Body mass index (weight in kg/(height in m)²)
DiabetesPedigreeFunction: Diabetes pedigree function (a function which scores likelihood of diabetes based on family history)
Age: Age (years)
Outcome: Class variable (0 if non-diabetic, 1 if diabetic)

Let’s also make sure that our data is clean (has no null values, etc).

diabetesDF.info() # output shown below
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
Pregnancies                 768 non-null int64
Glucose                     768 non-null int64
BloodPressure               768 non-null int64
SkinThickness               768 non-null int64
Insulin                     768 non-null int64
BMI                         768 non-null float64
DiabetesPedigreeFunction    768 non-null float64
Age                         768 non-null int64
Outcome                     768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB

Note that although info() reports no null values, the data does have missing values: they are encoded as 0 (see Insulin = 0 in the first five records shown earlier). Logistic regression does not ignore these entries. It treats 0 as a real measurement, and once we standardize the features later, a 0 becomes a well-below-average value for that feature, which can distort the learned weights. Ideally we would replace these 0 values with the mean value for that feature, but we’ll skip that for now.
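The replacement mentioned above could look roughly like the sketch below. It is not part of the original notebook: the helper name replace_zeros_with_mean, the choice of columns in ZERO_AS_MISSING, and the tiny toy frame are all assumptions made for illustration (Pregnancies is left out because 0 is a valid count there).

```python
import numpy as np
import pandas as pd

# Columns where a 0 is assumed to mean "not recorded" (an assumption, not from the dataset docs).
ZERO_AS_MISSING = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']

def replace_zeros_with_mean(df, columns=ZERO_AS_MISSING):
    """Return a copy of df where zeros in `columns` become the mean of the non-zero values."""
    cleaned = df.copy()
    for col in columns:
        # Turn zeros into NaN first so they do not drag the mean down.
        values = cleaned[col].replace(0, np.nan)
        cleaned[col] = values.fillna(values.mean())
    return cleaned

# Tiny made-up frame, only to show the behaviour (not real patient data).
toy = pd.DataFrame({
    'Glucose': [100, 0, 140],
    'BloodPressure': [70, 80, 0],
    'SkinThickness': [20, 0, 30],
    'Insulin': [0, 90, 110],
    'BMI': [25.0, 30.0, 0.0],
})
print((toy == 0).sum())              # number of zeros per column before cleaning
print(replace_zeros_with_mean(toy))  # zeros replaced by the per-column mean
```

With the real data you would call replace_zeros_with_mean(diabetesDF). To avoid leaking information from the test records, it is usually safer to compute the means on the training rows only and reuse them for the other rows.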

Data Exploration

Let us now explore our data set to get a feel of what it looks like and get some insights about it.

Let’s start by finding correlation of every pair of features (and the outcome variable), and visualize the correlations using a heatmap.

corr = diabetesDF.corr()
print(corr)
sns.heatmap(corr, 
        xticklabels=corr.columns,
        yticklabels=corr.columns)

Output of feature (and outcome) correlations

Heatmap of feature (and outcome) correlations

In the above heatmap, brighter colors indicate stronger correlation. As we can see from the table and the heatmap, glucose levels, age, BMI and number of pregnancies all have a noticeable correlation with the outcome variable. Also notice the correlation between pairs of features, like age and pregnancies, or insulin and skin thickness.

Let’s also look at how many people in the dataset are diabetic and how many are not. Below is the barplot of the same:

Barplot visualization of number of non-diabetic (0) and diabetic (1) people in the dataset

It is also helpful to visualize relations between a single variable and the outcome. Below, we’ll see the relation between age and outcome. You can similarly visualize other features. The figure is a plot of the mean age for each of the output classes. We can see that the mean age of people having diabetes is higher.

Average age of non-diabetic and diabetic people in the dataset
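The code behind the two bar plots above is not shown in this write-up, so here is one way they might be produced. This is a sketch under assumptions: the helper names class_counts, mean_by_class and plot_bar are made up for this example, and the toy frame stands in for diabetesDF.

```python
import pandas as pd

def class_counts(df, label='Outcome'):
    """Number of rows in each class, ordered by class value."""
    return df[label].value_counts().sort_index()

def mean_by_class(df, feature, label='Outcome'):
    """Mean of one feature within each class."""
    return df.groupby(label)[feature].mean()

def plot_bar(series, ylabel):
    """Draw a simple bar chart of a pandas Series (needs matplotlib)."""
    import matplotlib.pyplot as plt  # imported lazily so the aggregations work without it
    ax = series.plot(kind='bar', rot=0)
    ax.set_ylabel(ylabel)
    plt.show()

# Made-up rows standing in for diabetesDF (not real patient data).
toy = pd.DataFrame({'Age': [22, 50, 31, 25, 44, 28],
                    'Outcome': [0, 1, 0, 0, 1, 0]})
counts = class_counts(toy)            # class 0 -> 4 rows, class 1 -> 2 rows
mean_age = mean_by_class(toy, 'Age')  # class 0 -> 26.5, class 1 -> 47.0
print(counts)
print(mean_age)

# Uncomment to draw the charts:
# plot_bar(counts, 'Number of people')
# plot_bar(mean_age, 'Mean age')
```

Swapping 'Age' for any other column name gives the same kind of per-class comparison for that feature.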

Dataset preparation (splitting and normalization)

When using machine learning algorithms, we should always split our data into a training set and a test set. (If we are running a large number of experiments, we can divide our data into three parts: a training set, a development set and a test set.) In our case, we will also separate out some data for manual cross checking.

The data set consists of 768 records in total. To train our model we will be using the first 650 records. We will be using the next 100 records for testing, and the last 18 records to cross check our model.

dfTrain = diabetesDF[:650]
dfTest = diabetesDF[650:750]
dfCheck = diabetesDF[750:]
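The slices above take rows in file order, which is fine as long as the file is not sorted in some meaningful way. As a hedged alternative (not what this tutorial uses, so the accuracy reported later would differ), scikit-learn's train_test_split can shuffle the rows and keep the class ratio similar in both parts. The helper name shuffled_split, the seed and the toy frame below are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def shuffled_split(df, label='Outcome', test_size=100, seed=0):
    """Shuffle the rows and split them, keeping the class ratio similar in both parts."""
    return train_test_split(df, test_size=test_size,
                            stratify=df[label], random_state=seed)

# Made-up frame: 200 rows, 30% of them labelled 1 (not real patient data).
rng = np.random.default_rng(0)
toy = pd.DataFrame({'Glucose': rng.normal(120, 30, size=200),
                    'Outcome': [1] * 60 + [0] * 140})

train_df, test_df = shuffled_split(toy, test_size=50)
print(len(train_df), len(test_df))
print(train_df['Outcome'].mean(), test_df['Outcome'].mean())  # both close to 0.3
```

Fixing random_state makes the split reproducible, so repeated runs evaluate the model on the same rows.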

Next, we separate the labels and features (for both the training and test datasets). We also convert them into NumPy arrays, since our machine learning algorithm processes data in NumPy array format.

trainLabel = np.asarray(dfTrain['Outcome'])
trainData = np.asarray(dfTrain.drop(columns='Outcome'))  # keyword form; the positional axis argument was removed in pandas 2.0
testLabel = np.asarray(dfTest['Outcome'])
testData = np.asarray(dfTest.drop(columns='Outcome'))

As the final step before using machine learning, we will normalize our inputs. Machine Learning models often benefit substantially from input normalization. It also makes it easier for us to understand the importance of each feature later, when we’ll be looking at the model weights. We’ll normalize the data such that each variable has a mean of 0 and a standard deviation of 1. Note that the means and standard deviations are computed on the training data only and then reused for the test data.

means = np.mean(trainData, axis=0)
stds = np.std(trainData, axis=0)
trainData = (trainData - means)/stds
testData = (testData - means)/stds
# np.mean(trainData, axis=0) => check that new means equal 0
# np.std(trainData, axis=0) => check that new stds equal 1
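The same normalization can also be done with scikit-learn's StandardScaler, which stores the training means and standard deviations for later reuse. The sketch below uses randomly generated numbers as a stand-in for trainData and testData (an assumption for illustration) and checks that the result matches the manual formula.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Random stand-ins for trainData / testData: two features with different scales.
rng = np.random.default_rng(0)
train = rng.normal(loc=[120.0, 30.0], scale=[30.0, 6.0], size=(50, 2))
test = rng.normal(loc=[120.0, 30.0], scale=[30.0, 6.0], size=(10, 2))

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from the training rows only
test_scaled = scaler.transform(test)        # reuses the training statistics

# Same computation as the manual (x - mean) / std used in this tutorial.
manual = (train - train.mean(axis=0)) / train.std(axis=0)
print(np.allclose(train_scaled, manual))
print(scaler.mean_, scaler.scale_)  # the stored means and standard deviations
```

The test rows will generally not have exactly zero mean and unit standard deviation after scaling, and that is expected: they are scaled with the training statistics.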

Training and Evaluating Machine Learning Model

We can now train our classification model. We’ll be using a simple machine learning model called logistic regression. Since the model is readily available in sklearn, the training process is quite easy and takes only a few lines of code. First, we create an instance called diabetesCheck and then use the fit function to train the model.

diabetesCheck = LogisticRegression()
diabetesCheck.fit(trainData, trainLabel)

Next, we will use our test data to find the accuracy of the model.

accuracy = diabetesCheck.score(testData, testLabel)
print("accuracy = ", accuracy * 100, "%")

The print statement will print accuracy = 78.0 %.
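Accuracy on its own can be hard to judge when the classes are unbalanced, so it can help to compare it with a majority-class baseline and to look at the confusion matrix. The sketch below is an addition to the tutorial and runs on synthetic data from make_classification (an assumption, so its numbers say nothing about the diabetes model). With the tutorial's variables you would pass testData and testLabel instead.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score

# Synthetic stand-in for the diabetes data: 8 features, roughly 35% positives.
X, y = make_classification(n_samples=400, n_features=8,
                           weights=[0.65, 0.35], random_state=0)
X_train, y_train = X[:300], y[:300]
X_test, y_test = X[300:], y[300:]

model = LogisticRegression().fit(X_train, y_train)
predicted = model.predict(X_test)

accuracy = model.score(X_test, y_test)
# Accuracy of always predicting the more common class.
baseline = max(np.mean(y_test), 1 - np.mean(y_test))
# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(y_test, predicted)
# Fraction of actual positives that the model found.
recall = recall_score(y_test, predicted)

print("accuracy =", accuracy, "baseline =", baseline)
print(matrix)
print("recall for class 1 =", recall)
```

For a screening task like this one, recall for the diabetic class is often of more interest than raw accuracy, because a missed case is usually costlier than a false alarm.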

Interpreting the ML model

To get a better sense of what is going on inside the logistic regression model, we can visualize how our model uses the different features and which features have greater effect.

coeff = list(diabetesCheck.coef_[0])
labels = list(dfTrain.drop(columns='Outcome').columns)
features = pd.DataFrame()
features['Features'] = labels
features['importance'] = coeff
features.sort_values(by=['importance'], ascending=True, inplace=True)
features['positive'] = features['importance'] > 0
features.set_index('Features', inplace=True)
features.importance.plot(kind='barh', figsize=(11, 6),color = features.positive.map({True: 'blue', False: 'red'}))
plt.xlabel('Importance')

Visualization of the weights in the Logistic Regression model corresponding to each of the feature variables

From the figure above, we can draw the following conclusions:

Glucose level, BMI, pregnancies and diabetes pedigree function have significant influence on the model, especially glucose level and BMI. It is good to see our machine learning model match what we have been hearing from doctors our entire lives!
Blood pressure has a negative influence on the prediction, i.e. higher blood pressure is correlated with a person not being diabetic. Also note that blood pressure is more important as a feature than age, because the magnitude of its weight is higher.
Although age was more correlated than BMI with the outcome variable (as we saw during data exploration), the model relies more on BMI. This can happen for several reasons. One is that the information carried by age may also be carried by some other variable (for example, pregnancies, which we saw is correlated with age), whereas the information carried by BMI is not captured by other variables.

Note that the interpretations above require that our input data be normalized. Without that, we can’t claim that importance is proportional to weights.

Save Model

Now we will save our trained model for future use using joblib.

joblib.dump([diabetesCheck, means, stds], 'diabeteseModel.pkl')

To check whether we have saved the model properly or not, we will use our test data to check the accuracy of our saved model (we should observe no change in accuracy if we have saved it properly).

diabetesLoadedModel, means, stds = joblib.load('diabeteseModel.pkl')
accuracyModel = diabetesLoadedModel.score(testData, testLabel)
print("accuracy = ",accuracyModel * 100,"%")
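Saving the model together with means and stds, as above, works as long as we remember to apply the same normalization at prediction time. One alternative, shown here only as a sketch and not as what this tutorial does, is to bundle the scaling step and the classifier into a single scikit-learn Pipeline and save that one object. The synthetic data and the file name diabetes_pipeline.pkl are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the raw (unnormalized) training features and labels.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# The pipeline standardizes its input and then applies logistic regression.
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
pipeline.fit(X, y)

# Save to a temporary directory and load it back.
path = os.path.join(tempfile.mkdtemp(), 'diabetes_pipeline.pkl')
joblib.dump(pipeline, path)
loaded = joblib.load(path)

# The loaded pipeline accepts raw features and gives identical predictions.
same = np.array_equal(pipeline.predict(X), loaded.predict(X))
print(same)
```

With this layout there is only one object to load, and predict can be called directly on unnormalized records.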

Predicting using the model

We will now use our unused data to see how predictions can be made. We have our unused data in dfCheck.

print(dfCheck.head())

We will now use the first record to make our prediction.

sampleData = dfCheck[:1]
# prepare sample  
sampleDataFeatures = np.asarray(sampleData.drop(columns='Outcome'))
sampleDataFeatures = (sampleDataFeatures - means)/stds
# predict 
predictionProbability = diabetesLoadedModel.predict_proba(sampleDataFeatures)
prediction = diabetesLoadedModel.predict(sampleDataFeatures)
print('Probability:', predictionProbability)
print('prediction:', prediction)

The code above prints:

Probability: [[ 0.4385153,  0.5614847]]
prediction: [1]

The first element of the predictionProbability array, 0.438, is the probability of the class being 0, and the second element, 0.561, is the probability of the class being 1. The two probabilities sum to 1. Since 1 is the more probable class, we get [1] as our prediction, which means that the model predicts that the person has diabetes.
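To see where these probabilities come from, we can recompute them by hand: for a two-class logistic regression, the probability of class 1 is the sigmoid of the weighted sum of the (normalized) features plus the intercept. The sketch below checks this against predict_proba on synthetic data (an assumption for illustration). With the tutorial's variables you would use diabetesLoadedModel and sampleDataFeatures.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the normalized training data.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
model = LogisticRegression().fit(X, y)

sample = X[:1]  # one record, kept 2-D like dfCheck[:1]

# Weighted sum of the features plus the intercept (the "logit").
score = sample @ model.coef_[0] + model.intercept_[0]
# The sigmoid turns the score into the probability of class 1.
p_class1 = 1.0 / (1.0 + np.exp(-score))
manual = np.column_stack([1.0 - p_class1, p_class1])

print(manual)
print(model.predict_proba(sample))  # should match the manual computation
```

The predicted class is 1 exactly when the score is positive, which is the same as the class 1 probability being above 0.5.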

Next steps

There are lots of ways to improve the above model. Here are some ideas:

Input feature bucketing should help, i.e. create new variables for blood pressure in a particular range, glucose levels in a particular range, and so on.
You could also improve the data cleaning, by replacing 0 values with the mean value.
Read a bit about what metrics doctors rely on the most to diagnose a diabetic patient, and create new features accordingly.
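As a starting point for the bucketing idea above, the sketch below turns one numeric column into a set of 0/1 range indicators using pd.cut and pd.get_dummies. The helper name add_buckets, the toy values and the cut points are assumptions chosen purely for illustration. They are not clinical thresholds.

```python
import pandas as pd

def add_buckets(df, column, bins, labels):
    """Append one 0/1 indicator column per range of `column`."""
    # right=False makes each range include its lower edge: [low, high).
    buckets = pd.cut(df[column], bins=bins, labels=labels, right=False)
    indicators = pd.get_dummies(buckets, prefix=column, dtype=int)
    return pd.concat([df, indicators], axis=1)

# Made-up glucose readings (not real patient data).
toy = pd.DataFrame({'Glucose': [85, 110, 150, 99, 126]})
bucketed = add_buckets(toy, 'Glucose',
                       bins=[0, 100, 126, float('inf')],
                       labels=['low', 'mid', 'high'])
print(bucketed)
```

The new indicator columns can be fed to the model alongside (or instead of) the original column, which lets logistic regression assign a separate weight to each range.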

See if you can get to 85-90% accuracy. You can get started with the Jupyter notebook for this tutorial: pima_indians.ipynb. You can also view the entire code in one place here: Code for End-to-End Example: Using Logistic Regression for predicting Diabetes

Solution on Google Colaboratory

Notebook Link

The complete notebook with all the cells executed is available via Google Colaboratory using the link above. Google Colab is a free tool that lets you run small Machine Learning experiments through your browser. You should read this 1 min tutorial if you’re unfamiliar with Google Colaboratory. Note that, for this project, you’ll have to upload the dataset diabetes.csv (which you can download here) to Google Colab after saving the notebook to your own system.