Linear Regression

September 29, 2017

Linear Regression is a simple machine learning model for regression problems, i.e., when the target variable is a real value.

Example problem

Let’s start with an example — suppose we have a dataset with information about the area of a house (in square feet) and its price (in thousands of dollars) and our task is to build a machine learning model which can predict the price given the area.

Here is what our dataset looks like

If we plot our data, we might get something similar to the following:

In linear regression, we “fit” a straight line to the above data points. Something like this:

Now, let’s say someone constructs a new house and wants us to suggest a selling price. Say the house has an area of 1550 square feet. We look at our line and see that the y value corresponding to x = 1550 is about 200.

Hence, we would predict that the house will sell at an approximate price of $200,000.

Machine Learning setup

Every machine learning algorithm consists of three parts.

Representation — the representation of a machine learning model decides what types of things it can learn. In linear regression, the model representation is a straight line.
Evaluation — scores the goodness of various possible models using a cost function. For example, in linear regression there are infinitely many possible straight lines each of which represents a model. The cost function “scores” the goodness (or badness) of each possible straight line. The “best line” is the line which achieves the smallest cost.
Optimization — is the process of finding the best model given the set of possible models (model representation) and the cost function.

In this tutorial, we will talk about representation and evaluation. In the next tutorial, we will talk about optimization (gradient descent).

Linear regression

In simple linear regression, we establish a relationship between the target variable and a single input variable by fitting a line, known as the regression line.

In general, a line can be represented by the linear equation $y = m x + b$, where y is the dependent variable, x is the independent variable, m is the slope, and b is the intercept.

In machine learning, we rewrite our equation as $y_{w,b}(x) = w_1 x_1 + b$ where b and w₁ are the parameters of the model (b is known as the bias, and the w’s are known as weights), x₁ is the input, and y is the target variable.

In our house pricing example, the area of the house would be the input x₁, and the price would be the target y.

Different values of b and w₁ will give us different lines, as shown below:

Depending on the values of the parameters, the model will make different predictions.

For example, let’s consider (b, w₁) = (0.0, 0.2), and the first data point, where x = 3456 and y_true = 600. The prediction made by the model is y_w,b(x) = 0.2 × 3456 + 0.0 = 691.2. If instead the parameters were (b, w₁) = (80.0, 0.15), then the prediction would be y_w,b(x) = 0.15 × 3456 + 80.0 = 598.4, which is much closer to y_true = 600.
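The arithmetic above can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `predict` and its argument order are chosen here for illustration, and the values are the ones quoted in the example.

```python
def predict(x1, w1, b):
    """Prediction of a simple linear regression model: w1 * x1 + b."""
    return w1 * x1 + b


x1 = 3456  # area of the first house, in square feet

# the two parameter settings discussed above, as (b, w1) pairs
for b, w1 in [(0.0, 0.2), (80.0, 0.15)]:
    y_pred = predict(x1, w1, b)
    # format to one decimal place to hide floating-point rounding noise
    print(f"b = {b}, w1 = {w1}: prediction = {y_pred:.1f}")
```

Changing `b` and `w1` in the list is a quick way to see how each choice of parameters moves the prediction toward or away from the target of 600.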

Multiple linear regression

The earlier equation you saw for linear regression, $y_{w,b}(x) = w_1x_1 + b$, can be used when we have one input variable (also called feature). However, in general, we usually deal with datasets which have multiple input variables. For example, in our house pricing dataset, we could have additional feature variables such as number of rooms in the house, the year the home was constructed, and so on.

The case when we have more than one feature is known as multiple linear regression. We can generalize our previous equation for simple linear regression to multiple linear regression:

$$ y_{w,b}(x) = w_1 x_1 + w_2 x_2 + ... + w_k x_k + b $$

The x’s are the various input dimensions. And we also have a model weight w corresponding to each dimension.
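As a rough sketch, the multiple linear regression equation is a weighted sum of the features plus the bias. The function name `predict_multi`, the two features and all the numbers below are made up for illustration; they do not come from the house pricing dataset.

```python
def predict_multi(x, w, b):
    """Multiple linear regression prediction: w_1*x_1 + ... + w_k*x_k + b."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same number of dimensions")
    return sum(w_j * x_j for w_j, x_j in zip(w, x)) + b


# hypothetical house with k = 2 features: area (sq ft) and number of rooms
x = [1000.0, 3.0]
w = [0.5, 10.0]  # made-up weights, one per feature
b = 20.0         # made-up bias

# 0.5 * 1000 + 10 * 3 + 20
print(predict_multi(x, w, b))
```

With a single feature, `predict_multi([x1], [w1], b)` reduces to the simple linear regression equation.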

In the case of multiple linear regression, instead of our prediction being a line in 2-dimensional space, it is a hyperplane in (k + 1)-dimensional space (k feature dimensions plus one for the target). For example, with two features, our plot in 3D would look as follows:

Notation

Let’s summarize all the mathematical notation we have introduced so far.

$n$ = number of data points
$x$ = input variables / features. In general, $x$ may be multi-dimensional, in which case its various dimensions are $x_1, x_2, \dots, x_k$.
$y$ = output / target variable. We will write this as $y_{true}$ sometimes to explicitly distinguish the target $y$ from the predicted $y$.
$(w, b)$ = the model weights and bias, respectively. For linear regression, $w$ has the same number of dimensions as the input $x$, and $b$ is a scalar.
$y_{w,b}(x)$ = the prediction function. We write $y_{w,b}(\cdot)$ to make it clear that $y$ is parameterized by $w$ and $b$.

In addition, we will also use the following notation when we want to talk about specific data-points.

$(x, y)$ = some data point
$(x^{(i)}, y^{(i)})$ = i-th data point.
So, (x⁽¹⁾, y⁽¹⁾) = (3456, 600) means that the first house is 3456 square feet and has a price of $600k.
And, $y_{w, b}(x^{(i)})$ is the prediction for the i-th data point.

Residuals

The cost function defines a cost based on the difference between target value and predicted value, also known as the residual. In the graph below, we have visualized the residuals:

Cost functions

If a particular line is far from all the points, the residuals will be higher, and so should the cost function. If a line is close to the points, the residuals will be small, and so should the cost function.

The cost function measures, for a particular value of the parameters b and w₁, how close the predictions y(x) are to the corresponding targets y_true. That is, how well does a particular set of parameters predict the target values?

The cost function we use for linear regression is mean squared error. We go over all the data points, and take the mean of the squared error between the predicted value y(x) and the target value y_true.

$$ J(w)=\frac{1}{n}\sum\limits_{i=1}^n (y_{w,b}(x^{(i)}) - y_{true}^{(i)})^2 $$

Example

Let’s continue with the same example as earlier. The first data point has x = 3456 and y_true = 600. And the model parameters (b, w₁) = (0.0, 0.2). Given this, the prediction y_w,b(x) we had calculated was 691.2. Hence, the squared error is (y_w,b(x) - y_true)² = (691.2 - 600.0)² = 91.2² = 8,317.44.

Similarly, we calculate the squared error for each data point, and then average them. The squared errors for the other two data points are 519.84 and 2621.44, which makes the mean squared error J(w) = (8,317.44 + 519.84 + 2621.44) / 3 = 3819.57.

Similarly, if we calculate the mean squared error for weights (b, w₁) = (80.0, 0.15) we get mean squared error = J(w) = (2.56 + 2.72 + 3648.16) / 3 = 1217.81. Since the cost J(w) is lower for (b, w₁) = (80.0, 0.15), we say those parameters are better. Calculations for predictions, residual, squared error and cost function can be found on this Google Spreadsheet.
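The numbers in this example can be checked with a short script. It is only a sketch: the helper names are illustrative, and since only the first data point is given in the text, the squared errors for the other two data points are copied from the paragraphs above rather than recomputed.

```python
def squared_error(x1, y_true, w1, b):
    """Squared residual of a single data point."""
    y_pred = w1 * x1 + b
    return (y_pred - y_true) ** 2


def mean_squared_error(squared_errors):
    """Average of a list of squared errors."""
    return sum(squared_errors) / len(squared_errors)


# first data point, under both parameter settings
x1, y_true = 3456, 600.0
se_a = squared_error(x1, y_true, w1=0.2, b=0.0)
se_b = squared_error(x1, y_true, w1=0.15, b=80.0)
print(round(se_a, 2), round(se_b, 2))

# squared errors of all three data points, as quoted in the text
mse_a = mean_squared_error([8317.44, 519.84, 2621.44])  # (b, w1) = (0.0, 0.2)
mse_b = mean_squared_error([2.56, 2.72, 3648.16])       # (b, w1) = (80.0, 0.15)
print(round(mse_a, 2), round(mse_b, 2))
```

The second setting has the lower mean squared error, which matches the conclusion above.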

The minimum error is achieved by (b, w₁) = (15.0, 0.17), and the corresponding error is J(w) = 395.83. In the next tutorial on gradient descent, we’ll see how to find the weights which achieve the minimum error.

Why mean squared error?

One question you might have is, why do we not use the sum of the residuals as our error function? Why squared? Why mean?

Squaring penalizes large residuals more heavily than a linear (not squared) penalty would, so the fitted line tends to avoid a few very large errors in favor of more uniform residuals. It also ensures that every residual increases the cost, whether $y_{true} > y(x)$ or $y_{true} < y(x)$. With a plain sum, positive and negative residuals would cancel each other out.
Taking the mean makes the result independent of the number of data points. The sum grows in proportion to the number of data points, while the mean does not. This makes it easier to compare costs across datasets of different sizes.
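A tiny numerical illustration of both points, using made-up residuals (these are not taken from the house pricing data):

```python
def mean_squared(values):
    """Mean of the squared values."""
    return sum(v ** 2 for v in values) / len(values)


# hypothetical residuals (prediction minus target) for four data points
residuals = [5.0, -5.0, 10.0, -10.0]

# plain sum: positive and negative residuals cancel, hiding the errors
plain_sum = sum(residuals)

# squaring: every residual adds to the cost, and a residual that is
# twice as large (10 vs 5) contributes four times as much (100 vs 25)
sum_of_squares = sum(r ** 2 for r in residuals)
mse = mean_squared(residuals)
print(plain_sum, sum_of_squares, mse)

# the same residuals repeated, i.e. twice as many data points:
# the sum of squares doubles, but the mean stays the same
doubled = residuals * 2
doubled_sum_of_squares = sum(r ** 2 for r in doubled)
doubled_mse = mean_squared(doubled)
print(doubled_sum_of_squares, doubled_mse)
```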

Python Implementation

In this section, we will see the code for calculating the prediction and the cost function.

Let’s start by creating a small toy dataset:

# our toy dataset (list of x, y pairs)
dataset = [
  (6.65, 30.7),
  (8.19, 38.1),
  (8.92, 44.7),
  (6.21, 34.9),
  (7.16, 41.0),
  (5.79, 33.1),
  (9.17, 41.4),
  (8.75, 43.9),
  (6.77, 31.5),
  (5.65, 34.3),
  (7.22, 37.5),
  (7.74, 39.9),
  (6.58, 39.2),
  (8.54, 45.0),
  (5.65, 29.5),
  (6.49, 37.5),
  (5.08, 34.2),
  (8.62, 42.7),
  (8.47, 39.2),
  (5.16, 33.0),
]

Next, let’s write the code for calculating the prediction and the cost function:

def predict(x1, w1, b):
    """Predict y given the input and the model parameters."""
    return x1 * w1 + b


def cost_function(w1, b):
    """Calculate the cost (mean squared error) given the model parameters."""
    # make a prediction for each data point (x, ytrue) and calculate the squared error
    squared_errors = list()
    for x, ytrue in dataset:
        ypred = predict(x, w1, b)
        squared_error = (ypred - ytrue) ** 2
        squared_errors.append(squared_error)
    # return the average of squared_errors
    return sum(squared_errors) / len(squared_errors)
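The cost_function() above reads the global dataset. One possible variation, sketched below, passes the data in as an argument, which makes it easy to sanity-check the code on a tiny dataset where the answer is known in advance. The name `cost_for` and the three points are hypothetical: the points were chosen to lie exactly on the line y = 2x + 1.

```python
def cost_for(points, w1, b):
    """Mean squared error of the line w1 * x + b on a list of (x, y) pairs."""
    squared_errors = [((w1 * x + b) - y) ** 2 for x, y in points]
    return sum(squared_errors) / len(squared_errors)


# made-up points that lie exactly on the line y = 2x + 1
points = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

# the true parameters fit perfectly, so every residual is zero
perfect_cost = cost_for(points, w1=2.0, b=1.0)

# shifting the bias down by 1 makes every residual equal to -1
shifted_cost = cost_for(points, w1=2.0, b=0.0)

print(perfect_cost, shifted_cost)
```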

Finally, here’s some code for you to see a plot of the dataset (in red) and the model line (in blue). It will also display the value of the cost function.

Note: It’s okay if you don’t understand the plotting code, but you should understand the code above for predict() and cost_function().

import matplotlib.pyplot as plt

# bias and weight (try out various values!)
b = 15.0   # bias
w1 = 3.0   # weight
# separate xs and ys
xs = [x for x, _ in dataset]
ys = [y for _, y in dataset]
# calculate cost and predictions
cost = cost_function(w1, b)
predictions = [predict(x, w1, b) for x in xs]
print("Value of cost function =", cost)
# plot data points (red) and regression line (blue)
plt.scatter(xs, ys, c='red')
plt.plot(xs, predictions, c='blue')
plt.xlim(4.0, 10.0)
plt.ylim(min(0.0, min(predictions)), max(50.0, max(predictions)))
plt.show()

Run the code above, then play around with it. Try out various values for the weight and bias and see how the model line and the cost change.

Hope you had fun with that!

Optimization using Gradient Descent

Each value of the weight vector w gives us a corresponding cost J(w). We want to find the value of weights for which cost is minimum. We can visualize this as follows:

Note: The minimum of the cost function is a “global” minimum because the cost function for linear regression is convex (i.e., shaped like a bowl). It has a single minimum, and it smoothly increases in all directions around it.
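The bowl shape can be seen numerically by evaluating the cost for a range of weights. The sketch below is a brute-force scan, not gradient descent: it keeps the bias fixed, tries a handful of candidate values for w₁, and picks the one with the lowest cost. The data points and the helper name `cost` are made up for illustration; the points lie exactly on y = 2x + 1.

```python
def cost(points, w1, b):
    """Mean squared error of the line w1 * x + b on a list of (x, y) pairs."""
    return sum((w1 * x + b - y) ** 2 for x, y in points) / len(points)


# made-up points that lie exactly on the line y = 2x + 1
points = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
b = 1.0  # keep the bias fixed and vary only the weight

# candidate weights 0.0, 0.5, ..., 4.0
candidates = [i * 0.5 for i in range(9)]
costs = [cost(points, w1, b) for w1 in candidates]

# the cost falls as w1 approaches 2.0, then rises again (a bowl)
for w1, c in zip(candidates, costs):
    print(f"w1 = {w1:.1f}  cost = {c:.2f}")

# pick the candidate with the lowest cost
best_cost, best_w1 = min(zip(costs, candidates))
print("lowest cost at w1 =", best_w1)
```

A scan like this only works for one or two parameters and a coarse grid, which is why a proper optimization method is needed in general.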

Given the linear regression model and the cost function, we can use Gradient Descent (covered in the next article) to find a good set of values for the weight vector. The process of finding the best model out of the many possible models is called optimization.

Summary

In simple linear regression, we establish a relationship between the target variable and a single input variable by fitting a straight line, known as the regression line.
In machine learning, we generally write the equation for the linear regression line as $y(x) = b + w_1 x_1$, where b and w₁ are the parameters of the model, x₁ is the input, and y is the target variable.
$y(x) = w_1 x_1 + w_2 x_2 + \dots + w_k x_k + b$ is the equation for the multiple linear regression model with features $x_1, x_2, \dots, x_k$.
We use the mean squared error cost function to evaluate how well a model fits the data. The lower the cost, the better the model.