# Introduction: What is Probability?

Probability mathematically describes the chance that an event occurs. It quantifies randomness and uncertainty. For example, probability tells us the chance of rain on a particular day, or of someone winning a lottery. The probability of an event is always between 0 and 1, where 1 represents absolute certainty and 0 represents complete impossibility.

Probability can be determined in two basic ways: theoretically and empirically. The theoretical (also called classical) method is typically used to determine probabilities in games of chance such as lotteries, roulette wheels, and coin flips, while the empirical (also called observational) method is used to determine probabilities of events whose outcomes can't be predetermined.

To describe probability mathematically, we need a few basic elements, which are discussed below:

**Sample Space**

It is the collection of all possible outcomes of a random experiment. For example, in a six-faced die throwing experiment, the sample space **S** = {1, 2, 3, 4, 5, 6}.

**Event Space**

It is a set whose elements (events) are subsets of a sample space. For example, **A** = {1, 2} is an event from the sample space S. Here **A** denotes the event that the die throw results in a 1 or a 2.

**Probability measure**

The probability measure **P** is a function that gives the measure of how likely a certain event is. For example, the event A happens when the die throw results in a 1 or a 2, i.e. 2 out of 6 possibilities. Hence, the probability measure of A is P(A) = 2/6 = 1/3.
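The theoretical probability above can also be checked empirically, connecting the two methods from the introduction. Below is a minimal Python sketch; the trial count and seed are arbitrary choices for illustration:

```python
import random

random.seed(0)

# Sample space of a six-faced die throw.
S = {1, 2, 3, 4, 5, 6}
# Event A: the throw results in a 1 or a 2.
A = {1, 2}

# Theoretical (classical) probability: favourable outcomes / total outcomes.
p_theoretical = len(A) / len(S)

# Empirical probability: simulate many throws and count how often A occurs.
n_trials = 100_000
hits = sum(1 for _ in range(n_trials) if random.randint(1, 6) in A)
p_empirical = hits / n_trials

print(p_theoretical)           # 0.3333...
print(round(p_empirical, 2))   # close to 1/3
```

With enough trials the empirical estimate converges to the theoretical value of 1/3.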

# Probability of multiple variables

## Joint Probability

Joint probability denotes the probability of multiple variables taking particular values at the same time. For example, let's say we care about two things about a person: their gender, male or female, and the length of their hair, long or short. P(male, short) is the probability that a person is male and has short hair.

## Conditional Probability

Conditional probability denotes the probability of an event given that some other event has already occurred. For example, the probability of a person being male given that they have long hair is denoted as P(male | long hair). Note that this is different from P(long hair | male), the probability of having long hair given that a person is male.

## Marginal Probability

When we find the probability of a certain event occurring irrespective of any other event, it is called a marginal probability. For example, the probability of a person being male is given by P(male). It doesn't matter whether the person has short hair or long hair.

## Relation between joint, conditional and marginal probabilities

Joint, conditional and marginal probabilities satisfy the following relation:

P(A,B) = P(B|A)\,P(A)

That is, the probability of A and B is the same as the probability of A times the probability of B given A. For example,

P(\text{male}, \text{long hair}) = P(\text{long hair}|\text{male})\,P(\text{male})

That is, the probability of a person being male and having long hair equals the probability of a person being male times the probability of having long hair given that the person is male.
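This relation between joint, conditional and marginal probabilities can be verified numerically on a small sample of people. The data below is entirely hypothetical, made up purely for illustration:

```python
from collections import Counter

# Hypothetical sample of (gender, hair length) observations.
people = [
    ("male", "short"), ("male", "short"), ("male", "long"),
    ("female", "long"), ("female", "long"), ("female", "short"),
    ("male", "short"), ("female", "long"),
]
n = len(people)
counts = Counter(people)

# Marginal probability P(male): ignore hair length entirely.
p_male = sum(1 for g, _ in people if g == "male") / n

# Joint probability P(male, long hair).
p_male_long = counts[("male", "long")] / n

# Conditional probability P(long hair | male) = P(male, long hair) / P(male).
p_long_given_male = p_male_long / p_male

# Check the relation: P(male, long hair) == P(long hair | male) * P(male).
assert abs(p_male_long - p_long_given_male * p_male) < 1e-12
print(p_male, p_male_long, p_long_given_male)   # 0.5 0.125 0.25
```

The assertion holds by construction: estimating the conditional probability from counts is exactly dividing the joint count by the marginal count.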

# Bayes' Theorem

Bayes' Theorem is a relation between the conditional and marginal probabilities of two variables. It is given as follows:

P(A|B)=\frac{P(B|A)P(A)}{P(B)}

Bayes' theorem can be derived from the relation between joint, marginal and conditional probabilities given above, since

P(A|B) \space P(B) = P(A,B) = P(B|A) \space P(A)

## Example of Bayes Theorem

To illustrate Bayes' theorem, let's take an example.

Suppose a man is going to throw a die and tell us the result. However, the man only tells the truth 2 out of 3 times, and he really likes the number six. When he lies: if the die result is a 6, he picks another number at random; if the result is not a 6, he always says it was a 6. He throws the die and reports that the number he obtained is a six. We don't get to see the die, but we want to know the probability that the result was actually a six.

To solve the problem, let **A** be the event that the die throw resulted in a 6, and let **B** be the event that the man reports a six. We would like to calculate P(A|B).

We know that,

- P(A) = 1/6, i.e. the probability that a six occurs is 1/6, since it is 1 out of 6 possible outcomes.
- P(B|A) = 2/3, i.e. the probability that the man reports a six when it is a six is 2/3, since he tells the truth 2 out of 3 times.
- P(B), the probability that the man reports a six, is slightly more complicated. We need to take into account what the man would do in each case, depending on whether the die throw actually resulted in a six or not, and then add the two up.
- The probability that the man reports a six when it is actually a six is P(B|A) = 2/3, and P(A) = 1/6.
- The probability that the man reports a six when it is not a six is P(B|A') = 1/3, and P(A') = 5/6.
- Hence, P(B) = (2/3)(1/6) + (1/3)(5/6) = 7/18.

Now, we can use Bayes theorem:

P(A|B)=\frac{P(B|A)P(A)}{P(B)}

P(A|B)=\frac{\frac{1}{6}\times\frac{2}{3}}{\frac{1}{6}\times\frac{2}{3}+\frac{5}{6}\times\frac{1}{3}}=\frac{2}{7}
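The 2/7 answer can be sanity-checked by simulating the lying reporter many times and looking only at the trials where he reports a six. A minimal sketch; the seed and trial count are arbitrary:

```python
import random

random.seed(1)

def report(die):
    """The man tells the truth with probability 2/3. When he lies, a
    non-six is reported as a six, and a six becomes a random other number."""
    if random.random() < 2/3:
        return die                              # tells the truth
    if die == 6:
        return random.choice([1, 2, 3, 4, 5])   # lies away from six
    return 6                                    # lies towards six

n_trials = 200_000
reported_six = 0
actually_six = 0
for _ in range(n_trials):
    die = random.randint(1, 6)
    if report(die) == 6:
        reported_six += 1
        if die == 6:
            actually_six += 1

# Conditional probability P(A|B): among trials where he said "six",
# the fraction in which the die really showed a six.
p_estimate = actually_six / reported_six
print(round(p_estimate, 3))   # close to 2/7 ≈ 0.286
```

The estimate agrees with the exact Bayes' theorem calculation above.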

# Probability Distributions and Random variables

Variables that can take different values randomly are called **random variables**. If the values a variable can take are discrete in nature, it is called a **discrete random variable**. For instance, the number of heads that might occur in a series of coin tosses (say, 15 tosses) is a discrete random variable: it can take any whole number value in the range 0 to 15. Similarly, if the values are continuous in nature, it is called a **continuous random variable**. For example, the time taken by a radioactive particle to decay is a continuous random variable, as it can take an infinite number of possible values which can't be predetermined.

As discussed above, a random variable can randomly take on different values. But are all the values it can take equally likely, or does the random variable take particular values more often than others? To answer such questions we use a **probability distribution**, which describes how likely a random variable is to take on each of its possible values.

## Probability Mass Function

If the random variable is discrete in nature, we use a Probability Mass Function to describe its probability distribution. For example, consider an experiment of throwing two six-faced dice simultaneously. Plotting the sum of the numbers appearing on the two dice on the **x** axis against the probability of each sum occurring on the **y** axis gives the probability mass function of the sum. For example, the probability of getting 2 as the sum is 1/36.
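This probability mass function can be computed exactly by enumerating all 36 equally likely outcomes of the two dice:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Enumerate all 36 equally likely outcomes of throwing two six-faced dice.
sums = Counter(a + b for a, b in product(range(1, 7), repeat=2))

# Probability mass function of the sum: P(sum = s) = count / 36.
pmf = {s: Fraction(c, 36) for s, c in sorted(sums.items())}

for s, p in pmf.items():
    print(s, p)
# P(sum = 2) = 1/36, P(sum = 7) = 1/6, and the probabilities sum to 1.
```

Using `Fraction` keeps the probabilities exact rather than as floating-point approximations.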

## Probability Density Function

If the random variable is continuous in nature, we use a Probability Density Function to describe its probability distribution. Consider, for example, the distribution of vehicle speeds on a highway at a particular location: the x axis represents the speed of a vehicle and the y axis represents the probability density of vehicles moving at that speed. If the density peaks at 65 km/hr, then vehicles are most likely to be moving at speeds near 65 km/hr. The total area under the curve gives the total probability of a vehicle moving at some speed at that location, which is 1.

# Expectation or Expected Value

The expectation E of a random variable x is the average value of the random variable when it is sampled from its probability distribution P(x). Mathematically, it is given as:

E[x] = \sum_x{xP(x)}

for a discrete random variable, and

E[x] = \int{xP(x)dx}

for a continuous random variable.
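As a concrete instance of the discrete formula, the expectation of a single fair six-faced die throw works out to 3.5:

```python
from fractions import Fraction

# Discrete expectation: E[x] = sum over x of x * P(x).
# For a fair six-faced die, every face has probability 1/6.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}
expectation = sum(x * p for x, p in pmf.items())
print(expectation)   # 7/2, i.e. 3.5
```

Note that the expected value need not be a value the random variable can actually take; no die face shows 3.5.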

# Common Probability Distributions and Random Variables

## Discrete Probability Distributions / Random Variables

**Bernoulli**

A Bernoulli random variable is used when an experiment results in either success or failure, where success is represented as 1 and failure as 0. If p is the probability of success, then the probability mass function of a Bernoulli variable **x** is given as:

P(x) = \begin{cases} p & \quad \text{if } x=1\\ 1-p & \quad \text{if } x=0 \end{cases}

**Binomial**

This distribution is used when we want to count how many successes we have when we repeat an experiment **n** times independently. The probability mass function of a binomial variable **x**, where p is the probability of success, is given as:

P(x) = {n \choose x} p^x (1-p)^{ n-x}

**Poisson**

A Poisson variable **x** is used when we are counting the number of occurrences of an event in a unit of time, such that the occurrences are independent and rarely simultaneous. For an average number of occurrences **λ**, the probability mass function is given as:

P(x) = e^{-\lambda}\frac{\lambda^x}{x!}
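The three probability mass functions above translate directly into code. A sketch using only the standard library; the parameter values in the examples are arbitrary:

```python
from math import comb, exp, factorial

def bernoulli_pmf(x, p):
    # P(x) = p if x == 1, and 1 - p if x == 0.
    return p if x == 1 else 1 - p

def binomial_pmf(x, n, p):
    # P(x) = C(n, x) * p^x * (1 - p)^(n - x)
    return comb(n, x) * p**x * (1 - p)**(n - x)

def poisson_pmf(x, lam):
    # P(x) = e^(-λ) * λ^x / x!
    return exp(-lam) * lam**x / factorial(x)

print(bernoulli_pmf(1, 0.3))       # 0.3
print(binomial_pmf(2, 4, 0.5))     # C(4,2) * 0.5^4 = 0.375
# A PMF must sum to 1 over its support; truncating the (infinite)
# Poisson support at 50 terms gives a sum of almost exactly 1.
print(round(sum(poisson_pmf(k, 2.0) for k in range(50)), 6))
```

For instance, `binomial_pmf(2, 4, 0.5)` is the probability of exactly 2 heads in 4 fair coin tosses.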

## Continuous Probability Distributions / Random Variables

**Uniform**

We use a uniform random variable **x** when the probability density for every value in an interval (starting at **a** and ending at **b**) is equal. The probability density function of x is given as:

P(x) = \begin{cases} \frac{1}{b-a} & \quad \text{if } a\leq x\leq b\\ 0 & \quad \text{otherwise} \end{cases}

**Exponential**

An exponential variable **x** is used when we are measuring the time until the first occurrence of an event, such that the occurrences in disjoint time intervals are independent and rarely simultaneous. For an average number of occurrences per unit time **a**, the probability density function is given as:

P(x) = \begin{cases} ae^{-ax} & \quad \text{if } x\geq 0\\ 0 & \quad \text{otherwise} \end{cases}

**Normal**

The normal random variable x comes in handy in many different situations. We can use it to model physical measurements like weight, height, etc. We can also use it to model the error made by measuring instruments. In general, it can be used when the mean and variance of the quantity being measured are known. The probability density function is given as:

P(x) = \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2\sigma^2}(x-\mu)^2}, \quad \text{where } \mu = \text{mean}, \; \sigma^2 = \text{variance}
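The three density functions above can be implemented directly, and simple numerical integration confirms that the total area under each curve is (approximately) 1. A sketch; the parameter values and integration ranges are arbitrary choices for illustration:

```python
from math import exp, pi, sqrt

def uniform_pdf(x, a, b):
    # 1/(b - a) inside [a, b], 0 outside.
    return 1.0 / (b - a) if a <= x <= b else 0.0

def exponential_pdf(x, rate):
    # rate * e^(-rate * x) for x >= 0, 0 otherwise.
    return rate * exp(-rate * x) if x >= 0 else 0.0

def normal_pdf(x, mu, sigma):
    return exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sqrt(2 * pi) * sigma)

def integrate(f, lo, hi, n=100_000):
    # Midpoint-rule numerical integration of f over [lo, hi].
    h = (hi - lo) / n
    return sum(f(lo + (i + 0.5) * h) for i in range(n)) * h

# Each area should come out very close to 1.
print(round(integrate(lambda x: uniform_pdf(x, 2, 5), 0, 10), 4))
print(round(integrate(lambda x: exponential_pdf(x, 1.5), 0, 30), 4))
print(round(integrate(lambda x: normal_pdf(x, 65, 10), 0, 130), 4))
```

The exponential and normal integration ranges are finite truncations of infinite supports, so the computed areas fall just short of 1 by a negligible amount.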

# References

- Probability and Information Theory, Deep Learning, Ian Goodfellow: http://www.deeplearningbook.org/contents/prob.html
- Probability Theory Review for Machine Learning: https://see.stanford.edu/materials/aimlcs229/cs229-prob.pdf