April 17, 2018

# Overview

In this hands-on assignment, we’ll apply linear regression with gradient descent to predict the progression of diabetes in patients. The tutorial will guide you through the process of implementing linear regression with gradient descent in Python, from the ground up.

We’ll first load the dataset, and train a linear regression model using scikit-learn, a Python machine learning library. We will use the results from scikit-learn as a test for whether or not our implementation is correct. Then, we’ll implement linear regression using gradient descent ourselves. Let’s get started!

# Dataset

Below is the code for loading and splitting the dataset. A description of the dataset is available at the following link (Diabetes Data). The dataset is already included with scikit-learn (read more here).

###############################################################################
## Import stuff
import numpy as np
from sklearn import datasets, linear_model, metrics
###############################################################################
diabetes_X = diabetes.data # matrix of dimensions 442x10
# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]
# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]
###############################################################################

Data description: The dataset includes information about 442 diabetes patients. For each patient, there are 10 features or input variables - these include age, sex, body mass index, average blood pressure, and six blood serum measurements. The blood serum measurements are the following: Total Cholesterol (TC), Low Density Lipoprotein (LDL), High Density Lipoprotein (HDL), TC/HDL, Low Tension Glaucoma (LTG) and Glucose. The target is a quantitative measure of disease progression one year after baseline.

# How to check for mistakes

As mentioned previously, we’ll use results from scikit-learn as a test for whether or not our implementation is correct. Since scikit-learn is a machine learning library, linear regression is available as a model and can be trained by just calling function fit on the model. You can see the documentation here: sklearn.linear_model.LinearRegression — documentation

###############################################################################
## Scikit learn
# Create linear regression object
regr = linear_model.LinearRegression()
# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)
# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)
# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
mean_squared_error = metrics.mean_squared_error(diabetes_y_test, diabetes_y_pred)
print("Mean squared error: %.2f" % mean_squared_error)
print("="*120)
###############################################################################

This produces the following output:

('Coefficients: \n', array([  3.03499549e-01,  -2.37639315e+02,   5.10530605e+02,
3.27736980e+02,  -8.14131709e+02,   4.92814588e+02,
1.02848452e+02,   1.84606489e+02,   7.43519617e+02,
7.60951722e+01]))
Mean squared error: 2004.57

We can use the above values of the mean squared error (2004.57) as the target value for the mean squared error for our implementation. We can also compare the coefficients after training our model to check if we get the same results. Note that the numbers might not match exactly, but as long as they match reasonable well (say within 1%), we should be fine.

# Implementing Linear Regression with Gradient Descent

Finally, it’s time to implement linear regression ourselves. Here’s a template for you to get started.

###############################################################################
## Our own implementation
# train
X = diabetes_X_train
y = diabetes_y_train
# train: initialize
W = ...
b = ...
learning_rate = ...
epochs = ...
for i in range(epochs):
# calculate predictions
...
# calculate error and cost (mean squared error)
...
...
# update parameters
...
# diagnostic output
if i % 5000 == 0: print("Epoch %d: %f" % (i, mean_squared_error))
# test
X = diabetes_X_test
y = diabetes_y_test
# calculate predictions + calculate error and cost (same code as above)
...
print('Coefficients: \n', W)
print("Mean squared error: %.2f" % mean_squared_error)
print("="*120)
###############################################################################

Here are some hints to and guidance to help you along the way.

• W is of dimension 10 (since we have 10 features), and b is a scalar.
• Use the formulas in the Linear Regression tutorial for predictions, errors and costs.
• Use the formulas in Gradient Descent tutorial for gradients and update rule.
• For gradient descent, the gradient $G = \nabla_w J(w)$ is given by the following formulas
\begin{align} G_{w_k} &= \frac{1}{n}\sum_{i}{x^{(i)}_k*(y^{(i)}_{pred} - y^{(i)}_{true})} \\ G_{b} &= \frac{1}{n}\sum_{i}{(y^{(i)}_{pred} - y^{(i)}_{true})} \end{align}
• Keep playing around with different values of learning rate and number of epochs till you see that the training mean squared error is decreasing too slowly (say decreasing by < 0.0001 per epoch).

It’s okay if you get stuck. The assignment has a number of moving pieces, and it’s okay to peek at the solution (included) after struggling for a bit. Just make sure you don’t look at the entire solution in one go. Only check the parts that you could not solve, and then go back to trying to work it out on your own.

The complete notebook with all the cells executed is available via Google Colaboratory using the link above. Google Colab is a free tool that lets you run small Machine Learning experiments through your browser. You should read this 1 min tutorial if you’re unfamiliar with Google Colaboratory.

The solution is also available here: Solution to Hands-on Assignment: Implementing Linear Regression with Gradient Descent. The solution uses the numpy package, which provides functions for vector and matrix operations like dot products, etc. You can learn more about numpy in this tutorial: Introduction to NumPy.

The output produced by the solution is as follows:

Epoch 0: 29468.227258
Epoch 5000: 3048.187264
Epoch 10000: 2941.413239
Epoch 15000: 2927.457390
Epoch 20000: 2924.752271
Epoch 25000: 2923.794789
Epoch 30000: 2923.195048
Epoch 35000: 2922.693787
Epoch 40000: 2922.230650
Epoch 45000: 2921.788828
Epoch 50000: 2921.362712
Epoch 55000: 2920.949917
Epoch 60000: 2920.549099
Epoch 65000: 2920.159313
Epoch 70000: 2919.779804
Epoch 75000: 2919.409942
Epoch 80000: 2919.049181
Epoch 85000: 2918.697046
Epoch 90000: 2918.353121
Epoch 95000: 2918.017037
('Coefficients: \n', array([   3.66165332, -234.66440146,  519.39514482,  325.58168427,
-176.11598458,  -16.43251228, -180.05636948,  108.06092022,
502.78742918,   78.96999166]))
Mean squared error: 1993.53

# Discussion

Notice that the mean squared error matches within 0.5%. Out of the 10 coefficients, coefficients 2, 3, 4 and 10 match but the others don’t. This is because the other features are not linearly independent, hence there are multiple ways to get to the exact same prediction. For example, features 5, 7 and 8 are linearly related in the following way: feature 8 = feature 5 / feature 7.