In this tutorial, we will see how to use Data Science to predict whether a person has diabetes or not, based on information like blood pressure, body mass index (BMI), age, etc.

The data was collected and made available by the National Institute of Diabetes and Digestive and Kidney Diseases as part of the Pima Indians Diabetes Database. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are of Pima Indian heritage (a subgroup of Native Americans) and are females aged 21 and above.

# Overview

We will use Python as our programming language, along with some popular Python data science packages. First, we will import pandas to read our data from a CSV file and manipulate it for further use. We will also use numpy to convert our data into a format suitable for feeding our classification model. We'll use seaborn and matplotlib for visualizations. We will then import the Logistic Regression algorithm from sklearn, which will help us build our classification model. Lastly, we will use joblib to save our model for future use.

import pandas as pd

import numpy as np

import seaborn as sns

import matplotlib.pyplot as plt

%matplotlib inline

from sklearn.linear_model import LogisticRegression

import joblib  # sklearn.externals.joblib was removed from recent scikit-learn releases

# Data Description

We have our data saved in a CSV file called diabetes.csv. We first read the dataset into a pandas dataframe called diabetesDF, and then use the head() function to show the first five records.

diabetesDF = pd.read_csv('diabetes.csv')

print(diabetesDF.head())

First 5 records in the Pima Indians Diabetes Database

The following features have been provided to help us predict whether a person is diabetic or not:

- **Pregnancies:** Number of times pregnant
- **Glucose:** Plasma glucose concentration over 2 hours in an oral glucose tolerance test
- **BloodPressure:** Diastolic blood pressure (mm Hg)
- **SkinThickness:** Triceps skin fold thickness (mm)
- **Insulin:** 2-Hour serum insulin (mu U/ml)
- **BMI:** Body mass index (weight in kg / (height in m)^2)
- **DiabetesPedigreeFunction:** Diabetes pedigree function (a function which scores likelihood of diabetes based on family history)
- **Age:** Age (years)
- **Outcome:** Class variable (0 if non-diabetic, 1 if diabetic)

Let's also make sure that our data is clean (has no null values, etc).

diabetesDF.info() # output shown below

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 768 entries, 0 to 767

Data columns (total 9 columns):

Pregnancies 768 non-null int64

Glucose 768 non-null int64

BloodPressure 768 non-null int64

SkinThickness 768 non-null int64

Insulin 768 non-null int64

BMI 768 non-null float64

DiabetesPedigreeFunction 768 non-null float64

Age 768 non-null int64

Outcome 768 non-null int64

dtypes: float64(2), int64(7)

memory usage: 54.1 KB

Note that the data does have some missing values, even though info() reports no nulls: they are recorded as 0 (see Insulin = 0 in the first records shown earlier). Logistic regression does not skip these zeros; it treats them as real measurements, which can distort the fitted weights. Ideally we would replace these 0 values with the mean value for that feature, but we'll skip that for now.
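The mean replacement mentioned above could look like the following sketch. It runs on a small made-up frame rather than diabetes.csv, and the list of columns in which 0 is treated as "not measured" (`zero_as_missing`) is our assumption, not something the dataset documents.

```python
import numpy as np
import pandas as pd

# Toy stand-in for diabetesDF (values are made up for illustration).
toy = pd.DataFrame({
    "Glucose": [148, 85, 0, 89],
    "Insulin": [0, 0, 94, 168],
    "Pregnancies": [6, 1, 0, 1],   # 0 is a legitimate value here
})

# Assumption: a 0 in these columns means "not measured".
zero_as_missing = ["Glucose", "Insulin"]

cleaned = toy.copy()
for col in zero_as_missing:
    # Turn placeholder zeros into NaN so they are excluded from the mean...
    observed = cleaned[col].replace(0, np.nan)
    # ...then fill each gap with the mean of the observed values.
    cleaned[col] = observed.fillna(observed.mean())

print(cleaned)
```

In a real workflow it would be safer to compute these means from the training rows only and reuse them for the test rows, just as we do for normalization.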

# Data Exploration

Let us now explore our data set to get a feel of what it looks like and get some insights about it.

Let's start by finding the correlation of every pair of features (and the outcome variable), and visualizing the correlations using a heatmap.

corr = diabetesDF.corr()

print(corr)

sns.heatmap(corr,

xticklabels=corr.columns,

yticklabels=corr.columns)

Output of feature (and outcome) correlations

Heatmap of feature (and outcome) correlations

In the above heatmap, brighter colors indicate more correlation. As we can see from the table and the heatmap, glucose levels, age, BMI and number of pregnancies all have significant correlation with the outcome variable. Also notice the correlation between pairs of features, like age and pregnancies, or insulin and skin thickness.

Let's also look at how many people in the dataset are diabetic and how many are not. Below is the barplot of the same:

Barplot visualization of number of non-diabetic (0) and diabetic (1) people in the dataset
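The barplot is shown without its code. Here is a minimal sketch of one way to produce such a plot, assuming `Outcome` is the label column as described earlier. The toy frame stands in for diabetesDF.

```python
import pandas as pd

# Toy stand-in for diabetesDF; only the label column matters here.
toy = pd.DataFrame({"Outcome": [0, 1, 0, 0, 1, 0]})

# Count how many rows fall into each class (0 = non-diabetic, 1 = diabetic).
counts = toy["Outcome"].value_counts().sort_index()
print(counts)

# Draw the counts as a bar chart (pandas delegates the drawing to matplotlib).
ax = counts.plot(kind="bar")
ax.set_xlabel("Outcome")
ax.set_ylabel("Number of people")
```

With the real data, `sns.countplot(x='Outcome', data=diabetesDF)` should give a similar figure in one call.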

It is also helpful to visualize the relation between a single variable and the outcome. Below, we'll see the relation between age and outcome. You can similarly visualize other features. The figure is a plot of the mean age for each of the output classes. We can see that the mean age of people having diabetes is higher.

Average age of non-diabetic and diabetic people in the dataset
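The per-class mean behind a figure like this can be computed with a pandas `groupby`. The sketch below uses made-up ages, so the numbers are not those of the real dataset.

```python
import pandas as pd

# Toy stand-in for diabetesDF with just the two columns we need.
toy = pd.DataFrame({
    "Age": [25, 50, 31, 22, 45, 30],
    "Outcome": [0, 1, 0, 0, 1, 0],
})

# Mean age within each outcome class (0 = non-diabetic, 1 = diabetic).
mean_age = toy.groupby("Outcome")["Age"].mean()
print(mean_age)
```

Calling `mean_age.plot(kind="bar")` would draw the result, and swapping `"Age"` for another column name gives the same view for a different feature.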

# Dataset preparation (splitting and normalization)

When using machine learning algorithms we should always split our data into a training set and a test set. (If we are running a large number of experiments, we should divide the data into three parts: a training set, a development set and a test set.) In our case, we will also set aside some data for manual cross-checking.

The data set consists of the records of 768 patients in total. To train our model we will use 650 records. We will use 100 records for testing, and the last 18 records to cross-check our model.

dfTrain = diabetesDF[:650]

dfTest = diabetesDF[650:750]

dfCheck = diabetesDF[750:]
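The slices above take rows in file order, which is reasonable only if the file is not sorted in any meaningful way. As an alternative (not what this tutorial uses), scikit-learn's `train_test_split` shuffles the rows before splitting and can keep the class ratio similar in both parts. The sketch below runs on synthetic data, and the `test_size` and `random_state` values are arbitrary choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))        # 100 synthetic rows, 8 features
y = np.array([0] * 65 + [1] * 35)    # synthetic labels, 35% positive

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,      # hold out 20% of the rows for testing
    shuffle=True,       # shuffle before splitting
    stratify=y,         # keep the class ratio similar in both parts
    random_state=42,    # fixed seed so the split is reproducible
)

print(X_train.shape, X_test.shape)
print("positives in test set:", y_test.sum())
```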

Next, we separate the label and features (for both the training and test datasets). We will also convert them into NumPy arrays, since our machine learning algorithm processes data in NumPy array format.

trainLabel = np.asarray(dfTrain['Outcome'])

trainData = np.asarray(dfTrain.drop(columns='Outcome'))

testLabel = np.asarray(dfTest['Outcome'])

testData = np.asarray(dfTest.drop(columns='Outcome'))

As the final step before using machine learning, we will normalize our inputs. Machine Learning models often benefit substantially from input normalization. It also makes it easier for us to understand the importance of each feature later, when we'll be looking at the model weights. We'll normalize the data such that each variable has 0 mean and standard deviation of 1.

means = np.mean(trainData, axis=0)

stds = np.std(trainData, axis=0)

trainData = (trainData - means)/stds

testData = (testData - means)/stds

# np.mean(trainData, axis=0) => check that new means equal 0

# np.std(trainData, axis=0) => check that new stds equal 1
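The same normalization is available as scikit-learn's `StandardScaler`, which stores the training means and standard deviations for later reuse. The sketch below checks on synthetic data that the manual formula and the scaler agree. It is an alternative to the code above, not a replacement the tutorial relies on.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train = rng.normal(loc=50.0, scale=10.0, size=(20, 3))  # synthetic training rows
test = rng.normal(loc=50.0, scale=10.0, size=(5, 3))    # synthetic test rows

# Manual version, as in the tutorial: statistics come from the training data only.
means = train.mean(axis=0)
stds = train.std(axis=0)
train_manual = (train - means) / stds
test_manual = (test - means) / stds

# StandardScaler: fit on the training data, then reuse it for the test data.
scaler = StandardScaler().fit(train)
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)

print(np.allclose(train_manual, train_scaled))  # both approaches agree on train
print(np.allclose(test_manual, test_scaled))    # and on test
```

Note that the test data is scaled with the training statistics in both versions, so its own mean and standard deviation will generally not be exactly 0 and 1.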

# Training and Evaluating Machine Learning Model

We can now train our classification model. We'll be using a simple machine learning model called *logistic regression*. Since the model is readily available in sklearn, the training process is quite easy and we can do it in a few lines of code. First, we create an instance called diabetesCheck and then use the fit function to train the model.

diabetesCheck = LogisticRegression()

diabetesCheck.fit(trainData, trainLabel)

Next, we will use our test data to find out the accuracy of the model.

accuracy = diabetesCheck.score(testData, testLabel)

print("accuracy = ", accuracy * 100, "%")

The print statement will print accuracy = 78.0 %.
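Accuracy here is the fraction of test records whose predicted label matches the true label. The sketch below computes it by hand on made-up labels, together with the four counts (true/false positives and negatives) that show which kinds of mistakes a model makes.

```python
import numpy as np

# Made-up true and predicted labels for eight records.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([0, 1, 1, 0, 0, 1, 0, 0])

# Accuracy: share of records where prediction and truth agree.
accuracy = np.mean(y_true == y_pred)

# Break the results down by type of outcome.
tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # diabetic, predicted diabetic
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # non-diabetic, predicted non-diabetic
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # non-diabetic, predicted diabetic
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # diabetic, predicted non-diabetic

print("accuracy =", accuracy * 100, "%")
print("TP, TN, FP, FN =", tp, tn, fp, fn)
```

For a medical screening task, the false negatives (diabetic patients the model misses) are often worth tracking separately from overall accuracy.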

# Interpreting the ML model

To get a better sense of what is going on inside the logistic regression model, we can visualize how our model uses the different features and which features have greater effect.

coeff = list(diabetesCheck.coef_[0])

labels = list(dfTrain.drop(columns='Outcome').columns)  # trainData is a NumPy array and has no column names

features = pd.DataFrame()

features['Features'] = labels

features['importance'] = coeff

features.sort_values(by=['importance'], ascending=True, inplace=True)

features['positive'] = features['importance'] > 0

features.set_index('Features', inplace=True)

features.importance.plot(kind='barh', figsize=(11, 6),color = features.positive.map({True: 'blue', False: 'red'}))

plt.xlabel('Importance')

Visualization of the weights in the Logistic Regression model corresponding to each of the feature variables

From the above figure, we can draw the following conclusions.

- Glucose level, BMI, pregnancies and diabetes pedigree function have significant influence on the model, especially glucose level and BMI. It is good to see our machine learning model match what we have been hearing from doctors our entire lives!
- Blood pressure has a negative influence on the prediction, i.e. higher blood pressure is correlated with a person not being diabetic. (Also note that blood pressure is more important as a feature than age, because the *magnitude* of its weight is higher.)
- Although age was more correlated than BMI with the outcome variable (as we saw during data exploration), the model relies more on BMI. This can happen for several reasons, including the fact that the correlation captured by age is also captured by some other variable, whereas the information captured by BMI is not captured by other variables.

Note that the above interpretations require that our input data be normalized. Without that, we can't claim that *importance* is proportional to the *weights*.
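To see why, consider a single feature $x$ with training mean $\mu$ and standard deviation $\sigma$ (the notation here is ours). Logistic regression models the log-odds of diabetes as a linear function of the input, and the same function can be rewritten in terms of the normalized feature:

```latex
\log\frac{p}{1-p} \;=\; w\,x + b \;=\; (w\sigma)\,\frac{x-\mu}{\sigma} \;+\; (b + w\mu)
```

Setting aside the effect of regularization, a weight $w$ on the raw feature corresponds to a weight $w\sigma$ on the normalized feature. The raw weight therefore depends on the units and spread of the feature, while the normalized weight measures the change in log-odds per standard deviation, which is what makes weights comparable across features.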

# Save Model

Now we will save our trained model for future use using joblib.

joblib.dump([diabetesCheck, means, stds], 'diabeteseModel.pkl')

To check whether we have saved the model properly or not, we will use our test data to check the accuracy of our saved model (we should observe no change in accuracy if we have saved it properly).

diabetesLoadedModel, means, stds = joblib.load('diabeteseModel.pkl')

accuracyModel = diabetesLoadedModel.score(testData, testLabel)

print("accuracy = ",accuracyModel * 100,"%")

# Predicting using the model

We will now use our unused data to see how predictions can be made. We have our unused data in dfCheck.

print(dfCheck.head())

We will now use the first record to make our prediction.

sampleData = dfCheck[:1]

# prepare sample

sampleDataFeatures = np.asarray(sampleData.drop(columns='Outcome'))

sampleDataFeatures = (sampleDataFeatures - means)/stds

# predict

predictionProbability = diabetesLoadedModel.predict_proba(sampleDataFeatures)

prediction = diabetesLoadedModel.predict(sampleDataFeatures)

print('Probability:', predictionProbability)

print('prediction:', prediction)

The above code prints:

Probability: [[ 0.4385153, 0.5614847]]

prediction: [1]

The first element of the array **predictionProbability**, 0.439, is the probability of the class being **0**, and the second element, 0.561, is the probability of the class being **1**. The probabilities sum to 1. Since **1** is the more probable class, we get **[1]** as our prediction, which means that the model predicts that the person has diabetes.
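For a two-class logistic regression, the class-1 probability is the logistic (sigmoid) function applied to a weighted sum of the features, and the predicted label is 1 when that probability exceeds 0.5. The sketch below illustrates this in plain Python using the probability printed above. `sigmoid` and `predict_label` are illustrative helper names, not scikit-learn functions.

```python
import math

def sigmoid(z):
    """Map a log-odds score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(p1, threshold=0.5):
    """Return 1 if the class-1 probability exceeds the threshold, else 0."""
    return 1 if p1 > threshold else 0

p1 = 0.5614847                 # class-1 probability reported above
p0 = 1.0 - p1                  # class-0 probability; the two sum to 1
log_odds = math.log(p1 / p0)   # the score the model computed internally

print("class 0 probability:", p0)
print("predicted label:", predict_label(p1))
print("sigmoid(log_odds) recovers p1:", sigmoid(log_odds))
```

Raising the threshold above 0.5 would make the classifier more conservative about predicting diabetes; a borderline case like this one would then flip to 0.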

# Next steps

There are lots of ways to improve the above model. Here are some ideas.

- Input feature bucketing should help, i.e. create new variables for blood pressure in a particular range, glucose levels in a particular range, and so on.
- You could also improve the data cleaning, by replacing 0 values with the mean value.
- Read a bit about which metrics doctors rely on the most to diagnose a diabetic patient, and create new features accordingly.
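As a starting point for the bucketing idea, here is a minimal sketch using `pd.cut` and `pd.get_dummies` on made-up BMI values. The cut points and bucket names are illustrative assumptions; choose ranges that make sense for each feature.

```python
import pandas as pd

# Toy BMI values (made up for illustration).
toy = pd.DataFrame({"BMI": [17.5, 22.0, 27.3, 33.6, 41.2]})

# Assumption: illustrative cut points and names, not taken from the tutorial.
bins = [0, 18.5, 25, 30, float("inf")]
labels = ["underweight", "normal", "overweight", "obese"]

# Assign each BMI value to a range; intervals are closed on the right by default.
toy["BMI_bucket"] = pd.cut(toy["BMI"], bins=bins, labels=labels)

# One indicator column per bucket, usable as extra model inputs.
bucket_features = pd.get_dummies(toy["BMI_bucket"], prefix="BMI")

print(toy)
print(bucket_features)
```

The indicator columns can be appended to the original features before training, which lets a linear model assign a separate weight to each range.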

See if you can get to 85-90% accuracy. You can get started with the Jupyter notebook for this tutorial: pima_indians.ipynb.