A car manufacturing company releases a new car and claims that it has a mileage of 20 km per liter. Should we believe the claim or not, and why? Hypothesis testing is a method that can be used to make decisions in such situations.
Hypothesis testing generally involves four steps.
- First, we state two competing claims: the null hypothesis (H0) and the alternative hypothesis (Ha). In our case, ‘the car runs 20 km per liter’ is the null hypothesis and ‘the car does not run 20 km per liter’ is the alternative hypothesis.
- Second, we collect a sample (here, a sample of newly manufactured cars), gather the relevant data from it, and summarize the data using statistics.
- Third, we calculate how probable our observed result would be if the null hypothesis were true.
- Finally, we draw a conclusion from that probability. If the probability of the observed data is very low, we reject the null hypothesis in favor of the alternative hypothesis. If it is not low, we fail to reject the null hypothesis and continue to believe it.
Thus, by using hypothesis testing we can evaluate mutually exclusive claims and choose the one which is best supported by our data.
Let’s consider hypothesis testing where the population parameter we are interested in is categorical in nature (for example, a weight category: underweight, normal weight, or overweight). For such data, we use a z-test on the population proportion.
A z-score (aka, a standard score) indicates how many standard deviations an element is from the mean. A z-score can be calculated from the following formula.
z = (X - μ) / σ
where z is the z-score, X is the value of the element, μ is the population mean, and σ is the standard deviation.
A z-score less than 0 represents an element less than the mean.
A z-score equal to 0 represents an element equal to the mean.
A z-score equal to 1 represents an element that is 1 standard deviation greater than the mean; a z-score equal to 2 is 2 standard deviations greater than the mean; etc.
If the data are approximately normally distributed and the number of elements in the set is large, about 68% of the elements have a z-score between -1 and 1; about 95% have a z-score between -2 and 2; and about 99.7% have a z-score between -3 and 3.
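The z-score formula and the percentages above can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the mean of 100 and standard deviation of 15 are made-up values chosen for illustration, and `z_score` is a hypothetical helper name.

```python
from statistics import NormalDist


def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean mu."""
    return (x - mu) / sigma


# Hypothetical population: mean 100, standard deviation 15.
print(z_score(130, 100, 15))  # 2.0  (two standard deviations above the mean)
print(z_score(85, 100, 15))   # -1.0 (one standard deviation below the mean)

# Share of a normal distribution that lies within k standard deviations.
std_normal = NormalDist()  # mean 0, standard deviation 1
for k in (1, 2, 3):
    share = std_normal.cdf(k) - std_normal.cdf(-k)
    print(f"within {k} sd: {share:.4f}")
# within 1 sd: 0.6827
# within 2 sd: 0.9545
# within 3 sd: 0.9973
```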
The z-test gets its name from the z-score on which it depends. Hypothesis testing with the z-test follows the same four general steps we saw earlier:
- In the first step, we declare our null and alternative hypotheses (H0 and Ha). For example, let's say a study claims that 30% of students in a particular university own an iPhone. So our null hypothesis will be H0: p0 = 0.3. For the alternative, we take Ha: p > 0.3 (a one-tailed test), which matches the p-value computed later in this example.
- We then collect a random sample of n = 100 students from the university. The sample size should meet two criteria: (i) np0 >= 10 and (ii) n(1-p0) >= 10. Both criteria are met here (for (i) we get 30 and for (ii) we get 70), so our choice of sample size is fine.
- Now suppose we find that 35 of the 100 students own an iPhone. The sample proportion is then p = 35/100 = 0.35. The z-score for a sample proportion is given by:
z = (p - p0) / sqrt(p0 (1 - p0) / n)
Using the values above, we find our z-score as:
z = (0.35 - 0.3) / sqrt(0.3 × 0.7 / 100) = 0.05 / 0.0458 ≈ 1.091
What does a z-score of 1.091 mean? We can interpret it as our sample proportion lying 1.091 standard deviations above the null value (0.3).
- Finally, we find the probability (p-value) that corresponds to the calculated z-value. Under the null hypothesis, the z-value follows the standard normal distribution, as shown in the figure above (check the resources to learn about the standard normal distribution). When z = 0, the sample proportion equals the null value and the p-value is at its highest. As z moves away from 0 (to the right if it is positive, to the left if it is negative), the sample proportion lies further from the null value and the p-value decreases: the greater the magnitude of z, the smaller the p-value. For example, among the z-values 1, -0.5, -2.5, and 2, the p-value for z = -2.5 is the smallest because its magnitude (2.5) is the largest. We can use either a z-table or software to find the p-value for a given z-value.

The p-value is the probability of getting data at least as extreme as what we observed if the null hypothesis were true. So, if we get a very small p-value (usually smaller than 0.05), we reject the null hypothesis. If the p-value is large, we can continue to believe the study's claim that 30% of the students in the university own an iPhone. For the z-value we calculated (1.091), the corresponding one-tailed p-value is about 0.138. This is greater than the significance level of 0.05, so we fail to reject the null hypothesis.
The P value, or calculated probability, is the probability of finding the observed, or more extreme, results when the null hypothesis (H0) of a study question is true.
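The iPhone example can be reproduced in code. The sketch below assumes a one-tailed (upper-tail) test, since that is what yields the p-value of roughly 0.137 quoted above; `proportion_z_test` is a hypothetical helper name, and only the Python standard library is used.

```python
from math import sqrt
from statistics import NormalDist


def proportion_z_test(successes, n, p0):
    """One-sample z-test for a proportion against the null value p0.

    Returns the z-score and the one-tailed (upper-tail) p-value.
    """
    # Sample-size criteria from the text: n*p0 >= 10 and n*(1 - p0) >= 10.
    if n * p0 < 10 or n * (1 - p0) < 10:
        raise ValueError("sample too small for the normal approximation")
    p_hat = successes / n               # sample proportion
    se = sqrt(p0 * (1 - p0) / n)        # standard error under H0
    z = (p_hat - p0) / se
    p_value = 1 - NormalDist().cdf(z)   # area to the right of z
    return z, p_value


z, p_value = proportion_z_test(35, 100, 0.3)
print(f"z = {z:.3f}, p-value = {p_value:.3f}")  # z = 1.091, p-value = 0.138
print("reject H0" if p_value < 0.05 else "fail to reject H0")  # fail to reject H0
```

For a two-tailed alternative (Ha: p ≠ 0.3), the p-value would instead be twice the tail area beyond |z|.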
When the population parameter we are interested in is quantitative in nature, we test hypotheses about the population mean. We can use either the z-test or the t-test for that purpose, depending on whether the population standard deviation is available.
We use the z-test when the standard deviation of the population we are interested in (σ) is known. Usually the population standard deviation is unknown, but in practice, if our sample size is large, we can estimate it from the collected sample. We'll illustrate this method using the same steps as before.
- We first define our null hypothesis (H0: μ = μ0) and alternative hypothesis (Ha). Let’s say the average GRE score of students willing to pursue computer science is μ0 = 310, with standard deviation σ = 10. Let's say the head of the computer science department at university A claims that students enrolled in his department have a higher average GRE score. We take this claim as the alternative hypothesis (Ha: μ > 310).
- We then collect a random sample of n = 25 students and find that their mean GRE score x' is 314.
- From these values we can then calculate the z-value as:
z = (x' - μ0) / (σ / sqrt(n)) = (314 - 310) / (10 / sqrt(25)) = 4 / 2 = 2
- Finally, we find the p-value corresponding to the calculated z-value. The process is the same as discussed for the z-test for a population proportion. If the p-value is large, the observed data are plausible under the null hypothesis; if it is small, we reject the null hypothesis. For a z-value of 2, the corresponding one-tailed p-value is 0.02275, which is smaller than 0.05, so we reject the null hypothesis in favor of the alternative hypothesis.
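The GRE example can be sketched the same way. The code below assumes a sample size of n = 25, which is the size that produces the z-value of 2 quoted above with σ = 10 and a sample mean of 314, and it assumes an upper-tail test because the claim is that the department's average is higher. `mean_z_test` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist


def mean_z_test(sample_mean, mu0, sigma, n):
    """One-sample z-test for a mean when the population sd (sigma) is known.

    Returns the z-score and the upper-tail p-value (Ha: mean > mu0).
    """
    se = sigma / sqrt(n)                # standard error of the sample mean
    z = (sample_mean - mu0) / se
    p_value = 1 - NormalDist().cdf(z)   # area to the right of z
    return z, p_value


z, p_value = mean_z_test(sample_mean=314, mu0=310, sigma=10, n=25)
print(f"z = {z:.2f}, p-value = {p_value:.5f}")  # z = 2.00, p-value = 0.02275
```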
Normally, the population standard deviation is not available. For example, when a noodle company produces standard 75 gram packets of noodles, it is not feasible to compute the standard deviation from all the packets it has ever produced. In such a case we can’t use the z-test; instead we use the t-test. The t-test is quite similar to the z-test. The differences are that we calculate a t-score using the sample standard deviation (s), and that we find the p-value from the t-distribution with n - 1 degrees of freedom rather than from the standard normal distribution. We use the population mean μ0 as our null value. We then take a sample of size n and find the sample mean x’ and the sample standard deviation s from our data. We then calculate the t-value as given below and reach a conclusion by following a process very similar to the one described in the tests above.
t = (x’ - μ0) / (s / sqrt(n))
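A rough sketch of the t-test for the noodle example follows. The eight packet weights are made-up numbers for illustration only, and the helper names are hypothetical. The Python standard library has no t-distribution, so the tail probability is approximated here by integrating the t density numerically with Simpson's rule; in practice a statistics package would normally do this step.

```python
from math import gamma, pi, sqrt
from statistics import mean, stdev


def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def t_upper_tail(t, df, steps=2000):
    """P(T > t) for t >= 0, using Simpson's rule on [0, t] (steps is even)."""
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3  # half the area lies to the right of 0


def one_sample_t_test(sample, mu0):
    """Return the t-score, degrees of freedom, and two-tailed p-value."""
    n = len(sample)
    s = stdev(sample)  # sample standard deviation (n - 1 in the denominator)
    t = (mean(sample) - mu0) / (s / sqrt(n))
    df = n - 1
    return t, df, 2 * t_upper_tail(abs(t), df)


# Hypothetical weights (grams) of eight sampled packets; not real data.
weights = [74.1, 75.3, 74.8, 73.9, 75.0, 74.4, 74.6, 74.2]
t, df, p_value = one_sample_t_test(weights, mu0=75)
print(f"t = {t:.3f}, df = {df}, two-tailed p-value = {p_value:.4f}")
```

A two-tailed p-value is used here because packets could be either too light or too heavy. Note that `gamma` overflows for very large degrees of freedom, so this sketch is only suitable for small samples.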
- The size of the sample we collect in the second step of hypothesis testing plays a significant role in the process. The larger the sample, the better it reflects the population. However, it is not always economical to collect a large sample. Hence, we have a trade-off between statistical significance and practical feasibility.
- Confidence intervals: Instead of testing a null hypothesis of the form p0 = x, we can compute from the sample a range in which the population parameter is likely to lie. We can then check whether the null population proportion or mean lies in that interval. If it does, we fail to reject the null hypothesis.
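The confidence-interval approach can be illustrated with the earlier iPhone numbers (35 owners out of 100 students, null value 0.3). This is a minimal sketch of the standard normal-approximation interval for a proportion, assuming a 95% confidence level; `proportion_confidence_interval` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist


def proportion_confidence_interval(successes, n, confidence=0.95):
    """Normal-approximation confidence interval for a population proportion."""
    p_hat = successes / n
    # Critical z-value, about 1.96 for a 95% interval.
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_crit * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


low, high = proportion_confidence_interval(35, 100)
print(f"95% CI: ({low:.3f}, {high:.3f})")  # 95% CI: (0.257, 0.443)
print(low <= 0.3 <= high)                  # True: 0.3 is a plausible value
```

Because the null value 0.3 falls inside the interval, the data do not give us grounds to reject it.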
While drawing a conclusion, two types of errors are possible: Type I and Type II errors.
- A Type I error arises when we reject the null hypothesis even though it is actually true. This error type is also known as a false positive.
- A Type II error occurs when we fail to reject the null hypothesis even though it is actually false. This error type is also known as a false negative.
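A small simulation can make the Type I error concrete. The sketch below reuses the setup of the iPhone example (p0 = 0.3, n = 100) and draws many samples from a population in which the null hypothesis really is true, then counts how often a one-tailed z-test at the 0.05 level rejects it anyway. The number of trials, the seed, and the function name are arbitrary choices made for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist


def type_i_error_rate(p0=0.3, n=100, alpha=0.05, trials=5000, seed=0):
    """Fraction of simulated samples that wrongly reject a true H0."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha)  # one-tailed critical value
    se = sqrt(p0 * (1 - p0) / n)
    rejections = 0
    for _ in range(trials):
        # Draw n students from a population where the true proportion is p0.
        successes = sum(rng.random() < p0 for _ in range(n))
        z = (successes / n - p0) / se
        if z > z_crit:
            rejections += 1  # H0 is true, so every rejection is a Type I error
    return rejections / trials


print(type_i_error_rate())  # should be close to alpha = 0.05
```

The rejection rate will not equal 0.05 exactly, both because of random variation and because the number of iPhone owners in a sample is a whole number.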
Hypothesis testing has a variety of applications. Most prominently, it is used to verify the claims of an organization, or to choose between two options by evaluating which one is better. Some specific examples are described below:
- Testing performance: We deploy multiple statistical models and need to find out the best one. For example, we might be trying to improve our 'related content' recommendation engine. Hypothesis testing is used to decide which model is best.
- A/B testing: We deploy different versions of web products and need to find out the best one. Does the red button lead to a higher conversion rate in sign-ups? Does the blue button increase the total number of purchases? Hypothesis testing is used to decide which version is better.
- In social science, it can be used to verify claims such as whether domestic violence against women is more prevalent in rural areas than in urban areas.
- In the healthcare industry, it is used to evaluate whether a proposed drug is effective, or whether it is more effective than existing drugs. It can even be used to decide whether hospital carpeting results in more infections.
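As an illustration of the A/B testing case, the sketch below compares the conversion rates of two button colors. It uses a two-proportion z-test with a pooled standard error, which is an extension of the one-sample proportion test described earlier and is not covered in the text above. The visitor and sign-up counts are made-up numbers, and `two_proportion_z_test` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-tailed z-test of H0: both versions have the same conversion rate."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)  # common rate assumed under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical data: red button 150 sign-ups out of 1000 visitors,
# blue button 120 sign-ups out of 1000 visitors.
z, p_value = two_proportion_z_test(150, 1000, 120, 1000)
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
print("difference is significant" if p_value < 0.05 else "no significant difference")
```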
This wide range of applications makes hypothesis testing an important skill for any data science professional.