We have learned how to calculate the probability of events of interest. Now we can use these probabilities to make inferences (calculated guesses) about important characteristics of the entire dataset. The entire collection of elements under investigation is known as the population. Because it is difficult to investigate every element in such a large set, we randomly select a subset of elements, known as a sample. Techniques of statistical inference let us estimate properties of the entire population by investigating only the smaller sample. Let us solve some questions to get a rough idea of estimation.
Q) Calculate the probability of heads in the given coin toss sequence: HHTH
Ans. By using the classical probability rule, the probability of getting heads is P(H) = (number of heads) / (total number of tosses) = 3/4 = 0.75
Q) Calculate the probability of heads in the given coin toss sequence: HHHHT
Ans. Again, there are 4 heads out of 5 tosses, so the same rule gives P(H) = 4/5 = 0.8
Q) Calculate the probability of heads in the given coin toss sequence: TT
Ans. There are 0 heads out of 2 tosses, so we get P(H) = 0/2 = 0
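As a rough sketch, the estimate for each of the three sequences above is simply the number of heads divided by the number of tosses. The function name `mle_heads` is hypothetical and chosen only for this illustration.

```python
def mle_heads(tosses: str) -> float:
    """Return the fraction of 'H' characters in a toss sequence such as 'HHTH'."""
    if not tosses:
        raise ValueError("need at least one toss")
    return tosses.count("H") / len(tosses)


# The three sequences from the questions above.
for sequence in ["HHTH", "HHHHT", "TT"]:
    print(sequence, mle_heads(sequence))
```

For these inputs the function returns 0.75, 0.8 and 0.0 respectively.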
After doing these questions, we can make two important observations:
- As the number of tosses increases, the estimated probability tends to move closer to the actual probability of getting heads with a fair coin, i.e. 0.5.
- To get an estimate close to the correct value of 0.5, we need to repeat the toss many, many times.
This naive estimator, which uses the observed fraction of heads (the sample mean) as the probability, is known as the Maximum Likelihood Estimator (MLE).
As we can see from above, this estimator can give results that are far from the true probability when the sample size is small.
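To see how the estimate behaves as the sample grows, here is a minimal simulation sketch. It assumes a fair coin (true probability 0.5) and uses Python's standard `random` module with a fixed seed; the function name `simulate_mle` and the sample sizes are hypothetical choices for illustration.

```python
import random


def simulate_mle(num_tosses: int, seed: int = 0) -> float:
    """Toss a simulated fair coin num_tosses times and return the fraction of heads."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    heads = sum(rng.random() < 0.5 for _ in range(num_tosses))
    return heads / num_tosses


# Small samples can land far from 0.5; large samples should land close to it.
for n in [5, 50, 500, 5000, 50000]:
    print(n, simulate_mle(n))
```

The exact numbers depend on the seed, but the estimates for the larger sample sizes should sit much closer to 0.5 than those for the smaller ones.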
Laplacian estimators
Based on our previous questions, another important question arises: how can we make the Maximum Likelihood Estimator more accurate for smaller sample sizes? One solution is to add fake data, that is, data points that were not actually observed in the experiment. Let us do some questions to cement our understanding of this idea.
Q) Calculate the probability of heads in the given coin toss sequence: HHTH
Ans. Before applying our usual probability formula, we will add some fake data points to the given data. Let us add the fake data HHT to our original sequence, which gives HHTHHHT. So, our new probability will be: P(H) = (3 + 2) / (4 + 3) = 5/7 ≈ 0.71
Q) Calculate the probability of heads in the given coin toss sequence: HHHHT
Ans. Again, let us add our fake data HHT to the data points as above: P(H) = (4 + 2) / (5 + 3) = 6/8 = 0.75
Q) Calculate the probability of heads in the given coin toss sequence: TT
Ans. Using the formula after adding the fake data HHT, we get: P(H) = (0 + 2) / (2 + 3) = 2/5 = 0.4
Observing these new probabilities, we can say that although they are still not correct, they are closer to the real probability of 0.5. This new estimator is known as the Laplacian estimator, and for small samples it gives more accurate results than the Maximum Likelihood Estimator.
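The fake-data idea can be sketched in a few lines: append the fake tosses to the observed ones and then take the fraction of heads as before. The function name `laplace_heads` is hypothetical, and the default fake data HHT is taken from the questions above.

```python
def laplace_heads(tosses: str, fake: str = "HHT") -> float:
    """Estimate P(heads) after appending fake tosses to the observed sequence."""
    combined = tosses + fake  # observed data followed by the fake data
    return combined.count("H") / len(combined)


# The three sequences from the questions above, with HHT added to each.
for sequence in ["HHTH", "HHHHT", "TT"]:
    print(sequence, round(laplace_heads(sequence), 3))
```

With HHT as the fake data, the three estimates are 5/7, 6/8 and 2/5. A more common choice is one fake head and one fake tail (`fake="HT"`), which pulls every estimate toward 0.5 rather than toward 2/3.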
Key Takeaways
1. The population is the entire dataset that we want to study.
2. A sample is a smaller, randomly chosen subset of the population.
3. The Maximum Likelihood Estimator (MLE) uses the formula below to calculate probabilities:
P(x) = n_x / N
where the numerator n_x is the number of events of interest and the denominator N is the total number of events.
4. MLE can give very inaccurate results when the number of outcomes is small.
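5. The Laplacian estimator uses the formula below to calculate probabilities:
P(x) = (n_x + f_x) / (N + k)
where n_x = number of observed events of interest, f_x = number of fake data events of interest, k = number of fake data events, N = number of total outcomes.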
6. The Laplacian estimator gives more accurate results when the number of outcomes is small.
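The two estimators from the takeaways can be compared side by side using counts instead of sequences. This is a sketch consistent with the worked examples in this post: `k` and `N` follow the definitions above, while `fake_count` (the number of fake events that are events of interest) and the function names are hypothetical names introduced here.

```python
def mle(count: int, total: int) -> float:
    """Maximum likelihood estimate: events of interest / total events (N)."""
    return count / total


def laplace(count: int, total: int, fake_count: int, k: int) -> float:
    """Laplacian estimate: (count + fake_count) / (N + k).

    fake_count is how many of the k fake events are events of interest.
    """
    return (count + fake_count) / (total + k)


# Sequence TT with fake data HHT: 0 real heads, N = 2, 2 fake heads, k = 3.
print(mle(0, 2))
print(laplace(0, 2, fake_count=2, k=3))
```

For the coin example, adding one fake head and one fake tail corresponds to `fake_count=1, k=2`, which is the usual add-one form of this estimator.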
I would love to receive feedback in the comment section below.