Part of course:
Linear Regression Tutorial with Example
- Linear regression
- Simple linear regression
- Multiple linear regression
- Cost functions
- Why mean squared error?
- Optimization using Gradient Descent
Linear Regression is a simple machine learning model for regression problems, i.e., problems where the target variable is a continuous real value.
Let's start with an example. Suppose we have a dataset with information about the area of a house (in square feet) and its price (in thousands of dollars), and our task is to build a machine learning model that can predict the price given the area. Here is what our dataset looks like:
If we plot our data, we might get something similar to the following:
In simple linear regression, we establish a relationship between the target variable and the input variable by fitting a line, known as the regression line.
In general, a line can be represented by the linear equation y = m * x + b, where y is the dependent variable, x is the independent variable, m is the slope, and b is the intercept.
In machine learning, we rewrite our equation as y(x) = w0 + w1 * x, where the w's are the parameters of the model, x is the input, and y is the target variable. This is the standard notation in machine learning, and it makes it easier to add more dimensions: we can simply add parameters w2, w3, ... and inputs x2, x3, ... as we add more dimensions.
Different values of w0 and w1 will give us different lines, as shown below
The values of the parameters determine what predictions the model makes. For example, let's consider (w0, w1) = (0.0, 0.2), and the first data point, where x = 3456 and ytrue = 600. The prediction made by the model is y(x) = 0.0 + 0.2*3456 = 691.2. If instead the weights were (w0, w1) = (80.0, 0.15), then the prediction would be y(x) = 80.0 + 0.15*3456 = 598.4, which is much closer to ytrue = 600.
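In code, a prediction is just this linear formula evaluated at the chosen weights. The sketch below reproduces the two predictions worked out above, using the weights and the data point from the text:

```python
def predict(w0, w1, x):
    """Prediction of the simple linear regression model y(x) = w0 + w1 * x."""
    return w0 + w1 * x

# First data point from the example: area x = 3456, true price ytrue = 600.
x, y_true = 3456, 600.0

print(predict(0.0, 0.2, x))    # 691.2 -> off by 91.2
print(predict(80.0, 0.15, x))  # 598.4 -> off by only 1.6
```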
The above equation can be used when we have one input variable (also called a feature). In general, however, we deal with datasets that have multiple input variables. For example, in our dataset, we could have additional features such as the number of rooms in the house, the year the house was constructed, and so on.
The case where we have more than one feature is known as multiple linear regression, or simply, linear regression. We can generalize our previous equation for simple linear regression to multiple linear regression: y(x) = w0 + w1 * x1 + w2 * x2 + ... + wn * xn.
In the case of multiple linear regression, instead of our prediction being a line in 2-dimensional space, it is a hyperplane in higher-dimensional space. For example, with two features the prediction is a plane, and our plot would look as follows:
Different values of the weights (w0, w1, w2, ..., wn) give us different lines (or hyperplanes), and our task is to find the weights that give us the best fit.
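The generalized prediction is the intercept w0 plus a weighted sum of the features. A minimal sketch is below; the feature values and weights are made up for illustration, not taken from the course's dataset:

```python
def predict(w, x):
    """Multiple linear regression: y(x) = w[0] + w[1]*x1 + ... + w[n]*xn.

    w is the weight vector (intercept first); x is the list of feature values.
    """
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))

# Hypothetical house: area 3456 sq ft, 4 rooms, built in 1990.
features = [3456, 4, 1990]
weights = [80.0, 0.15, 5.0, 0.01]  # made-up weights, for illustration only
print(predict(weights, features))  # 80 + 518.4 + 20 + 19.9 = 638.3
```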
One question you may have is: how can we determine how well a particular line fits our data? Or, given two lines, how do we determine which one is better? For this, we introduce a cost function, which measures, given a particular value for the w's, how close the predictions y(x) are to the corresponding ytrue's. That is, how well a particular set of weights predicts the target values.
For linear regression, the most commonly used cost function is the mean squared error cost function. It is the average, over the N data points (xi, ytrue,i), of the squared error between the predicted value y(xi) and the target value ytrue,i: J(w) = (1/N) * sum over i of (y(xi) - ytrue,i)².
Continuing the same example as above, the squared error for the first data point x = 3456 and ytrue = 600 when (w0, w1) = (0.0, 0.2) is given by (y(x) - ytrue)² = (691.2 - 600.0)² = 91.2² = 8,317.44. Similarly, we calculate the squared error for each data point, and then average them. The squared errors for the other two data points are 519.84 and 2,621.44, which makes the mean squared error J(w) = (8,317.44 + 519.84 + 2,621.44)/3 = 3,819.57.
Similarly, if we calculate the mean squared error for weights (w0, w1) = (80.0, 0.15) we get mean squared error = J(w) = (2.56 + 2.72 + 3648.16)/3 = 1217.81. Since the error J(w) is lower for (w0, w1) = (80.0, 0.15), we say those weights are better. 
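The averaging step can be sketched directly in code. Only the first data point's (x, ytrue) pair appears in this excerpt, so the sketch below averages the per-point squared errors quoted in the text rather than recomputing them from the raw dataset:

```python
def mean_squared_error(squared_errors):
    """Average a list of per-point squared errors (y(x) - ytrue)**2."""
    return sum(squared_errors) / len(squared_errors)

# Squared errors quoted above for (w0, w1) = (0.0, 0.2) ...
j_first = mean_squared_error([8317.44, 519.84, 2621.44])
# ... and for (w0, w1) = (80.0, 0.15).
j_second = mean_squared_error([2.56, 2.72, 3648.16])

print(round(j_first, 2), round(j_second, 2))  # 3819.57 1217.81
```

Since 1217.81 < 3819.57, the second set of weights is the better fit, matching the comparison in the text.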
The minimum error is achieved by (w0, w1) = (15.0, 0.17), and the corresponding error is J(w) = 395.83. In the next tutorial on gradient descent, we'll see how to find the weights which achieve the minimum error.
The cost function assigns a cost based on the distance between the true target and the predicted target (shown in the graph as vertical lines between the sample points and the regression line), also known as the residual. The residuals are visualized below:
If a particular line is far from all the points, the residuals will be large, and so will the cost. If a line is close to the points, the residuals will be small, and hence so will the cost.
One question you might have is: why do we not use the sum of the residuals as our error function? Why squared? Why the mean? These questions are addressed in the "Why mean squared error?" part of this course.
Each value of the weight vector w gives us a corresponding cost J(w). We want to find the weights for which the cost is at its global minimum. We can visualize this as follows:
Note: we say "global" minimum because the shape of the cost function for linear regression is convex (i.e., like a bowl). It has a single minimum, and it smoothly increases in all directions around it.
Given the linear regression model and the cost function, we can use Gradient Descent (covered in the next article) to find a good set of values for the weight vector. The process of finding the best model out of the many possible models is called optimization.
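As a preview of the next article, here is a minimal gradient-descent sketch for simple linear regression. The tiny dataset is made up (generated from y = 1 + 2x), and the learning rate and iteration count are arbitrary illustrative choices, not values from the course:

```python
# Hypothetical dataset generated from y = 1 + 2x, so the best
# weights are exactly (w0, w1) = (1.0, 2.0).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

w0, w1 = 0.0, 0.0   # start from arbitrary weights
lr = 0.05           # learning rate (step size)

for _ in range(5000):
    preds = [w0 + w1 * x for x in xs]
    # Gradients of J(w) = mean((pred - ytrue)^2) with respect to w0 and w1.
    g0 = 2 * sum(p - y for p, y in zip(preds, ys)) / len(xs)
    g1 = 2 * sum((p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
    # Step downhill, against the gradient.
    w0 -= lr * g0
    w1 -= lr * g1

print(round(w0, 3), round(w1, 3))  # converges to approximately 1.0 2.0
```

Each iteration nudges the weights in the direction that decreases J(w); because the cost surface is convex, this procedure homes in on the single global minimum.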