Statistics is a broad branch of mathematics that deals with everything related to data, from its collection and organization to its analysis, interpretation and presentation. With the ever-increasing amount of data, statistics has become an indispensable tool in every field that works with data.
When the amount of data is fairly small, it may be possible to discuss every data item individually. However, with large quantities of data, which is almost always the case in real-world situations, we need a few characteristic values that can represent the data.
In this tutorial, we'll first introduce such measures for a single variable, for example the weight of students in a particular school. These include measures of central tendency and measures of dispersion. Then we'll look at measures for understanding the relation between two different variables. For example, we'll try to answer how the height and weight of the students are related. For this, we'll introduce the concept of correlation between two variables.
Measures of central tendency, such as the mean, median and mode, give us some idea about the center of the data. Let’s look at each of them briefly.
The mean, or average, is the value obtained by dividing the sum of all the data values by the total number of data points. Mathematically,
Mean = (x_1 + x_2 + ... + x_n) / n
where x_1, ..., x_n are the data values and n is the number of data points.
Let’s look at a very simple set of data representing the weight of 10 males,
55, 56, 56, 58, 60, 61, 63, 64, 70, 78.
The mean weight is calculated as,
Mean weight = (55 + 56 + 56 + 58 + 60 + 61 + 63 + 64 + 70 + 78) / 10 = 62.1
The mean depends on every data point in the list, so extreme data points (or outliers) can sometimes distort it. For example, if the largest weight in the above data, 78, is replaced by 150, the mean changes drastically, from 62.1 to 69.3.
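The calculation above, including the effect of the outlier, can be checked with a short Python sketch. It assumes Python's standard `statistics` module; the variable names are illustrative only.

```python
from statistics import mean

# Weights of the 10 males from the example above.
weights = [55, 56, 56, 58, 60, 61, 63, 64, 70, 78]

# Sum of the values divided by the number of values.
mean_weight = mean(weights)

# Replace the largest value (78) with the outlier 150, as in the text.
weights_with_outlier = weights[:-1] + [150]
mean_with_outlier = mean(weights_with_outlier)

print(mean_weight)        # 62.1
print(mean_with_outlier)  # 69.3
```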
Like the mean, the median is a measure of central tendency. Qualitatively speaking, it is the data value situated at the middle position of the sorted data set.
In a set with an odd number of data points, the median is the middlemost value; if the number of data points is even, it is the average of the two middle values.
In the previous set the number of data points is 10 (even), so the 5th and 6th items are the two middle values. The median is therefore the average of the 5th and 6th items, which is,
Median = (60 + 61)/2 = 60.5
Unlike the mean, the median gives us the positional center of the data, so it is far less sensitive to outliers. For example, if the largest weight in the above data, 78, is replaced by 150, the mean changes drastically but the median remains unchanged at 60.5.
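As a quick check of the median and its behavior under the same outlier, here is a minimal sketch, again assuming Python's standard `statistics` module:

```python
from statistics import median

# Weights of the 10 males from the example above (already sorted).
weights = [55, 56, 56, 58, 60, 61, 63, 64, 70, 78]

# With an even number of points, median() averages the two middle values.
median_weight = median(weights)

# Replacing the largest value with 150 leaves the middle values untouched.
median_with_outlier = median(weights[:-1] + [150])

print(median_weight)        # 60.5
print(median_with_outlier)  # 60.5
```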
The mode is the data item that occurs most frequently in a given data set. In the above data set the weight 56 occurs twice while all of the others occur only once, so the mode is 56. Sometimes two or more items are tied for the highest number of occurrences. In such cases all of those items are modes of the data, and the data set is called multimodal.
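The mode can be sketched in Python as follows, assuming Python 3.8 or later, where `statistics.multimode` is available. The second list, `bimodal_weights`, is a hypothetical data set made up only to illustrate a tie; it does not come from the tutorial's data.

```python
from statistics import mode, multimode

# Weights of the 10 males from the example above.
weights = [55, 56, 56, 58, 60, 61, 63, 64, 70, 78]

# 56 occurs twice, every other value once.
mode_weight = mode(weights)

# Hypothetical data set in which 56 and 58 both occur twice.
bimodal_weights = [55, 56, 56, 58, 58, 60]

# multimode() returns every value tied for the highest count.
modes = multimode(bimodal_weights)

print(mode_weight)  # 56
print(modes)        # [56, 58]
```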
Measures of dispersion quantify the spread of the data. They measure how much variation there is among the data points.
One simple such measure is the range, which is the difference between the largest and the smallest data item. For our previous dataset,
Range = 78 - 55 = 23
Qualitatively, a large range indicates a large spread in the data and a small range indicates a small spread. Because the range depends only on the two extreme values, a single outlier can change it considerably.
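The range needs nothing beyond Python's built-in `max` and `min`; a minimal sketch:

```python
# Weights of the 10 males from the example above.
weights = [55, 56, 56, 58, 60, 61, 63, 64, 70, 78]

# Range = largest value minus smallest value.
weight_range = max(weights) - min(weights)

print(weight_range)  # 23
```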
A more informative measure of dispersion is the variance, which is calculated as,
σ^2 = [(x_1 - x̄)^2 + (x_2 - x̄)^2 + ... + (x_n - x̄)^2] / (n - 1)
where x̄ is the mean of the data and n is the number of data points.
The variance indicates how much the data items deviate from the mean. A larger variance means the data items lie farther from the mean, while a smaller variance means they lie closer to it.
Calculating the variance for the previous dataset,
Variance = [(55 - 62.1)^2 + (56 - 62.1)^2 + (56 - 62.1)^2 + (58 - 62.1)^2 + (60 - 62.1)^2 + (61 - 62.1)^2 + (63 - 62.1)^2 + (64 - 62.1)^2 + (70 - 62.1)^2 + (78 - 62.1)^2] / 9 = 466.9 / 9 = 51.88
Note that the sum of squared deviations is divided by n - 1 = 9 rather than by n = 10; this is the sample variance.
The standard deviation is simply the square root of the variance. If the variance is written as σ^2, the standard deviation is σ. Hence, in the example, the standard deviation is
Std dev = sqrt(51.88) = 7.20
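The same numbers can be reproduced in Python. The sketch below computes the variance by hand, following the worked example (dividing by n - 1), and then compares it with `statistics.variance` and `statistics.stdev`, which use the same n - 1 convention. `statistics.pvariance`, which divides by n instead, is included only for contrast and is not used in the text above.

```python
from math import sqrt
from statistics import pvariance, stdev, variance

# Weights of the 10 males from the example above.
weights = [55, 56, 56, 58, 60, 61, 63, 64, 70, 78]

# Manual calculation, mirroring the worked example.
n = len(weights)
mean_weight = sum(weights) / n
squared_deviations = [(w - mean_weight) ** 2 for w in weights]
manual_variance = sum(squared_deviations) / (n - 1)  # divide by n - 1 = 9
manual_std_dev = sqrt(manual_variance)

# Library equivalents (sample variance and sample standard deviation).
sample_variance = variance(weights)
sample_std_dev = stdev(weights)

# For contrast only: the population variance divides by n = 10.
population_variance = pvariance(weights)

print(round(sample_variance, 2))      # 51.88
print(round(sample_std_dev, 2))       # 7.2
print(round(population_variance, 2))  # 46.69
```

The n - 1 divisor is the usual choice when the data are a sample from a larger population; dividing by n is appropriate when the data are the whole population.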
Correlation quantifies how strongly two different variables are related. Let’s say we have a dataset of the height and weight of ten males. Normally we expect the weight and height of a person to be correlated, i.e. a taller person is more likely to weigh more than a shorter person. Correlation measures the relationship between variables of this kind.
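One such measure is the covariance, which measures how two variables vary with respect to each other. Using the same n - 1 divisor as in the variance calculation above, it is defined as,
Cov(x, y) = [(x_1 - x̄)(y_1 - ȳ) + (x_2 - x̄)(y_2 - ȳ) + ... + (x_n - x̄)(y_n - ȳ)] / (n - 1)
where x̄ and ȳ are the means of the two variables and n is the number of paired data points.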
A positive covariance signifies that higher values of one variable tend to go with higher values of the other, and lower values with lower values. A negative covariance, on the other hand, signifies that higher values of one variable tend to go with lower values of the other. The sign of the covariance therefore shows the direction of the linear relationship between the two variables. A covariance close to zero signifies the lack of a linear relationship; however, because its magnitude depends on the units of the variables, the covariance alone does not tell us how strong the relationship is.
The correlation coefficient is obtained by dividing the covariance by the product of the standard deviations of the two variables,
r = Cov(x, y) / (σ_x σ_y)
where σ_x and σ_y are the standard deviations of x and y.
Its value always lies between -1 and +1, with +1 signifying a perfect increasing linear relationship (correlation) and -1 signifying a perfect decreasing linear relationship (anti-correlation). Because it does not depend on the units of the variables, it is better suited than the covariance for analyzing correlation.
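The following Python sketch illustrates the covariance and the correlation coefficient. Several things here are assumptions rather than part of the tutorial: the `heights` list is hypothetical data invented for illustration, the helper names `sample_covariance` and `correlation` are made up, and the n - 1 divisor is chosen to match the variance calculation earlier.

```python
from math import sqrt


def sample_covariance(xs, ys):
    """Covariance of two equally long sequences, using the n - 1 divisor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    products = [(x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)


def correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    std_x = sqrt(sample_covariance(xs, xs))  # covariance of x with itself is its variance
    std_y = sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (std_x * std_y)


# Weights from the earlier example, paired with HYPOTHETICAL heights in cm.
weights = [55, 56, 56, 58, 60, 61, 63, 64, 70, 78]
heights = [160, 162, 165, 163, 168, 170, 172, 171, 178, 185]

cov_hw = sample_covariance(heights, weights)
corr_hw = correlation(heights, weights)

print(cov_hw)   # positive: taller people in this made-up data weigh more
print(corr_hw)  # between -1 and +1, close to +1 for this made-up data
```

Rescaling one variable, for example converting the heights from centimeters to meters, changes the covariance by the same factor but leaves the correlation coefficient unchanged, which is why the latter is easier to interpret.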
Let's calculate the correlation for a data set that records the average race completion time (y) against the average temperature (x) during the race.
For this data set the covariance is 82.16 and the correlation coefficient is 0.855. Since the correlation is 0.855, fairly close to +1, the two variables are strongly positively correlated: higher average temperatures tend to go with longer race completion times.
There is a famous saying that “correlation doesn’t imply causation.” This doesn’t mean that there cannot be a causal relation between two correlated variables. However, the presence of correlation alone is not sufficient to establish a causal relation. A correlation can arise in several ways: for example, either variable may cause the other, or both may be caused by a third factor.
For instance, there may be a correlation between two different diseases such as diabetes and heart disease. This may suggest that one causes the other. But a more likely scenario is that a person's poor diet and the resulting obesity are the underlying cause of both diseases.