Linear Regression is a simple machine learning model for regression problems, i.e., when the target variable is a real value.
Let's start with an example. Suppose we have a dataset with the area of each house (in square feet) and its price (in thousands of dollars), and our task is to build a machine learning model that predicts the price given the area. Here is what our dataset looks like:
If we plot our data, we might get something similar to the following:
In simple linear regression, we establish a relationship between the target variable and the input variable by fitting a line, known as the regression line.
In general, a line can be represented by the linear equation y = m * x + b, where y is the dependent variable, x is the independent variable, m is the slope, and b is the intercept.
In machine learning, we rewrite this equation as y(x) = w0 + w1 * x, where the w's are the parameters of the model, x is the input, and y is the target variable. This is the standard notation in machine learning, and it makes it easier to add more dimensions: we can simply add parameters w2, w3, ... and inputs x2, x3, ... as needed. (See the footnote.)
Different values of w0 and w1 give us different lines, as shown below:
The values of the parameters determine which predictions the model makes. For example, consider (w0, w1) = (0.0, 0.2) and the first data point, where x = 3456 and ytrue = 600. The prediction made by the model is y(x) = 0.0 + 0.2 * 3456 = 691.2. If instead the weights were (w0, w1) = (80.0, 0.15), the prediction would be y(x) = 80.0 + 0.15 * 3456 = 598.4, which is much closer to ytrue = 600.
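The two predictions above can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `predict` is our own choice for illustration, and the numbers are the ones quoted in the text.

```python
def predict(w0, w1, x):
    """Prediction of simple linear regression: y(x) = w0 + w1 * x."""
    return w0 + w1 * x


area = 3456    # first data point from the text (square feet)
y_true = 600   # its true price (thousands of dollars)

# Try both sets of weights discussed above.
for w0, w1 in [(0.0, 0.2), (80.0, 0.15)]:
    y_pred = predict(w0, w1, area)
    print(f"w0={w0}, w1={w1}: prediction={y_pred:.1f}, "
          f"off by {abs(y_pred - y_true):.1f}")
```

The second set of weights misses the true price by about 1.6, compared with about 91.2 for the first.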
The above equation can be used when we have one input variable (also called a feature). In practice, however, we usually deal with datasets that have multiple input variables. For example, our dataset could have additional features such as the number of rooms in the house, the year the home was constructed, and so on.
The case with more than one feature is known as multiple linear regression, or simply linear regression. We can generalize our previous equation for simple linear regression to multiple linear regression with n features x1, x2, ..., xn:

y(x) = w0 + w1 * x1 + w2 * x2 + ... + wn * xn
In the case of multiple linear regression, instead of our prediction being a line in 2-dimensional space, it is a hyperplane: with n features, the hyperplane lives in (n+1)-dimensional space. For example, with two features, the 3D plot would look as follows:
Different values of the weights (w0, w1, w2, ..., wn) give us different lines (or hyperplanes), and our task is to find the weights that give the best fit.
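As a rough sketch of how a prediction with several features could be computed, the snippet below sums one weight per feature plus the intercept w0. The function name `predict_multi`, the weights, and the feature values are all hypothetical and chosen only for illustration; they do not come from the dataset in this article.

```python
def predict_multi(weights, features):
    """y(x) = w0 + w1*x1 + ... + wn*xn, where weights[0] is the intercept w0."""
    w0, rest = weights[0], weights[1:]
    if len(rest) != len(features):
        raise ValueError("expected one weight per feature plus an intercept")
    return w0 + sum(w * x for w, x in zip(rest, features))


# Hypothetical values, for illustration only (not from the article's dataset).
weights = [50.0, 0.1, 20.0]   # w0, w1 (per square foot), w2 (per room)
house = [2000, 3]             # x1 = area in square feet, x2 = number of rooms

print(predict_multi(weights, house))   # 50 + 0.1*2000 + 20*3, i.e. about 310
```

Changing any entry of `weights` tilts or shifts the plane, which is why different weights give different fits.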
One question you may have is: how can we determine how well a particular line fits our data? Or, given two lines, how do we determine which one is better? For this, we introduce a cost function which measures, for a particular value of the w's, how close the predicted y's are to the corresponding ytrue's. That is, it measures how well a particular set of weights predicts the target values.
For linear regression, the most commonly used cost function is the mean squared error. It is the average, over all data points (xi, yi), of the squared error between the predicted value y(xi) and the target value yi: J(w) = (1/N) * sum over i of (y(xi) - yi)^2, where N is the number of data points and yi is the true target (ytrue) for xi.
Continuing the same example, the squared error for the first data point (x = 3456, ytrue = 600) when (w0, w1) = (0.0, 0.2) is (y(x) - ytrue)^2 = (691.2 - 600.0)^2 = 91.2^2 = 8,317.44. We calculate the squared error for each data point in the same way and then average them. The squared errors for the other two data points are 519.84 and 2,621.44, which gives a mean squared error of J(w) = (8,317.44 + 519.84 + 2,621.44)/3 = 3,819.57.
Similarly, if we calculate the mean squared error for weights (w0, w1) = (80.0, 0.15) we get mean squared error = J(w) = (2.56 + 2.72 + 3648.16)/3 = 1217.81. Since the error J(w) is lower for (w0, w1) = (80.0, 0.15), we say those weights are better. 
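The arithmetic above can be checked with a short script. This is a sketch under one limitation: only the first data point is given explicitly in the text, so the squared errors of the other two points are copied from the text rather than recomputed. The helper names are our own.

```python
def squared_error(y_pred, y_true):
    """Squared difference between a prediction and its true target."""
    return (y_pred - y_true) ** 2


def mean_of(values):
    """Plain arithmetic mean."""
    return sum(values) / len(values)


x, y_true = 3456, 600.0   # first data point from the text

# Weights (w0, w1) = (0.0, 0.2): first error computed, other two quoted from the text.
first_a = squared_error(0.0 + 0.2 * x, y_true)      # about 8317.44
j_a = mean_of([first_a, 519.84, 2621.44])

# Weights (w0, w1) = (80.0, 0.15): same approach.
first_b = squared_error(80.0 + 0.15 * x, y_true)    # about 2.56
j_b = mean_of([first_b, 2.72, 3648.16])

print(f"J for (0.0, 0.2):   {j_a:.2f}")
print(f"J for (80.0, 0.15): {j_b:.2f}")
print("better weights:", "(80.0, 0.15)" if j_b < j_a else "(0.0, 0.2)")
```

The lower value of J for (80.0, 0.15) matches the conclusion in the text.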
The minimum error is achieved by (w0, w1) = (15.0, 0.17), and the corresponding error is J(w) = 395.83. In the next tutorial on gradient descent, we'll see how to find the weights which achieve the minimum error.
The cost function defines a cost based on the distance between the true target and the predicted target, also known as the residual. In the graph, residuals are shown as lines between the sample points and the regression line:
If a particular line is far from all the points, the residuals will be large, and so will the cost function. If a line is close to the points, the residuals will be small, and so will the cost function.
One question you might have is, why do we not use the sum of the residuals as our error function? Why squared? Why mean?
- Squaring makes any large residual increase the cost more than it would under a linear (unsquared) penalty. The result is a regression with more uniform residuals and fewer very large ones.
- Taking the mean makes the result independent of the number of data points. A sum grows in proportion to the number of data points, while a mean does not. This makes comparisons between datasets easier and the results more meaningful when performing regressions in different problem spaces.
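Both points can be illustrated with a small sketch. The residual values below are hypothetical, picked so that the two sets have the same total absolute error; the helper names are our own.

```python
def sum_abs(residuals):
    """Sum of absolute residuals (a linear, unsquared penalty)."""
    return sum(abs(r) for r in residuals)


def sum_squared(residuals):
    """Sum of squared residuals."""
    return sum(r ** 2 for r in residuals)


def mean_squared(residuals):
    """Mean of squared residuals."""
    return sum_squared(residuals) / len(residuals)


# Hypothetical residuals with the same total absolute error (12).
uniform = [4.0, 4.0, 4.0]
one_large = [1.0, 1.0, 10.0]

# Squaring: the set containing one large residual is penalized more.
print(sum_abs(uniform), sum_abs(one_large))            # equal totals
print(mean_squared(uniform), mean_squared(one_large))  # second one is larger

# Mean: repeating the same data doubles the sum but leaves the mean unchanged.
doubled = uniform * 2
print(sum_squared(uniform), sum_squared(doubled))
print(mean_squared(uniform), mean_squared(doubled))
```

Under the unsquared penalty the two sets look equally good, while the squared penalty clearly prefers the uniform residuals.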
Each value of the weight vector w gives us a corresponding cost J(w). We want to find the weights for which the cost is at its minimum. We can visualize this as follows:
Note: Above we have used the word "global" because the cost function for linear regression is convex (i.e., shaped like a bowl). It has a single minimum, and the cost increases smoothly in every direction away from it.
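To get a feel for the bowl shape, the sketch below evaluates the cost along a one-dimensional slice: w0 is held fixed and only w1 varies. The data is hypothetical (generated from y = 20 + 0.18 * x, not the article's dataset), and the grid of w1 values is an arbitrary choice for illustration.

```python
def mse(w0, w1, xs, ys):
    """Mean squared error J(w) of the line y(x) = w0 + w1 * x on the data."""
    return sum((w0 + w1 * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data, generated from y = 20 + 0.18 * x (illustration only).
xs = [1000, 2000, 3000]
ys = [200.0, 380.0, 560.0]

w0 = 20.0                                          # intercept held fixed
w1_values = [0.10 + 0.02 * k for k in range(9)]    # 0.10, 0.12, ..., 0.26
costs = [mse(w0, w1, xs, ys) for w1 in w1_values]

# Index of the grid point with the lowest cost.
best = min(range(len(costs)), key=costs.__getitem__)

for w1, cost in zip(w1_values, costs):
    print(f"w1={w1:.2f}  J={cost:10.2f}")
print(f"lowest cost at w1={w1_values[best]:.2f}")
```

The printed costs fall steadily until w1 reaches about 0.18 and rise steadily afterwards, so this slice has exactly one minimum. A grid like this is only a way to look at the shape; it is not how the weights are found in practice.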
Given the linear regression model and the cost function, we can use Gradient Descent (covered in the next article) to find a good set of values for the weight vector. The process of finding the best model out of the many possible models is called optimization.
- In addition, the variables b and w0 are used interchangeably. Often when the w0 notation is used, we add an imaginary dimension x0 to the input, which is always equal to 1. This makes the dot product w0*x0 + w1*x1 + w2*x2 + ... + wd*xd the final prediction of the model, instead of having a special case for dimension 0. However, there are cases when w0 needs to be treated differently from the other weights wi, such as during regularization (penalizing large weights). In those cases, the w0 notation can be confusing.
- Calculations for predictions, residual, squared error and cost function can be found on this Google Spreadsheet.
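The imaginary dimension x0 = 1 described in the first note can be sketched as follows. The weights and feature values are hypothetical, and the helper `dot` is our own; the point is only that both ways of writing the prediction give the same number.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical values, for illustration only.
weights = [50.0, 0.1, 20.0]   # w0, w1, w2
features = [2000, 3]          # x1, x2

# Special case for the intercept: w0 + w1*x1 + w2*x2
explicit = weights[0] + dot(weights[1:], features)

# Imaginary dimension x0 = 1: w0*x0 + w1*x1 + w2*x2 as a single dot product
augmented = dot(weights, [1] + features)

print(explicit, augmented)
```

With the augmented input, the prediction is one dot product, and no separate code path is needed for the intercept.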