Sampling Distribution and Central Limit Theorem

Dhrubjun
Published in Nerd For Tech
6 min read · Nov 22, 2021


Photo by Carson Arias on Unsplash

In the real world, we often want to know a parameter of a specific population, such as its mean or standard deviation. It is usually difficult to measure these parameters on the whole population. Instead, we can draw random samples from the population and estimate the parameters from them. We fix a sample size, draw a random sample many times, and compute the statistic for each sample. In repeated sampling, the value of the sample statistic varies from sample to sample. But for a statistic such as the sample mean, the mean of the sampling distribution closely resembles the actual parameter of the population.

Let's take a simple example. In a class there are 50 students with varying weights. The weights of the students are as follows: 43, 40, 45, 32, 42, 48, 36, 33, 34, 33, 37, 40, 36, 40, 43, 33, 38, 30, 33, 30, 41, 35, 47, 49, 36, 36, 37, 39, 41, 32, 37, 46, 46, 41, 35, 38, 42, 31, 46, 39, 38, 38, 48, 42, 40, 43, 44, 31, 38, and 48. The actual mean weight of the students is 39, with a standard deviation of 5.19.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
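# Weights of the 50 students (the full population), as a NumPy array
population = np.array([43, 40, 45, 32, 42, 48, 36, 33, 34, 33, 37, 40, 36, 40, 43, 33, 38, 30, 33, 30, 41, 35, 47, 49, 36, 36, 37, 39, 41, 32, 37, 46, 46, 41, 35, 38, 42, 31, 46, 39, 38, 38, 48, 42, 40, 43, 44, 31, 38, 48])
population.mean()  # 39.0
population.std()   # about 5.19 (population standard deviation)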

However, this data is unknown to the class teacher, who wants to estimate the mean weight of the students. Since he cannot approach every student, he selects 3 students at random each time and calculates their mean weight.

random_sample = [np.random.choice(population,3) for i in range(100000)]

In the above case, 100000 random samples are taken, each with a sample size of 3. Note that np.random.choice samples with replacement by default. The samples are stored in a list named random_sample. For each sample, the mean is calculated and stored in an array named sample_mean.

# Convert the list of samples to a 2-D array, then average each row
sample_mean = np.array(random_sample).mean(axis=1)
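The same two steps can also be written without a Python loop. The following is only a sketch of one possible variation, assuming NumPy's Generator API (np.random.default_rng); the seed value 42 is an arbitrary choice of mine so that the run is reproducible, and it is not part of the original example.

```python
import numpy as np

# Weights of the 50 students from the example above
population = np.array([43, 40, 45, 32, 42, 48, 36, 33, 34, 33,
                       37, 40, 36, 40, 43, 33, 38, 30, 33, 30,
                       41, 35, 47, 49, 36, 36, 37, 39, 41, 32,
                       37, 46, 46, 41, 35, 38, 42, 31, 46, 39,
                       38, 38, 48, 42, 40, 43, 44, 31, 38, 48])

# Seeded generator (the seed is an arbitrary assumption, used for reproducibility)
rng = np.random.default_rng(42)

# One call draws all 100000 samples of size 3, with replacement,
# as a 2-D array with one sample per row
random_sample = rng.choice(population, size=(100_000, 3))

# Mean of each row, i.e. one sample mean per trial
sample_mean = random_sample.mean(axis=1)

print(random_sample.shape)  # (100000, 3)
print(sample_mean.mean())   # should land close to the population mean of 39
```

Drawing everything in one call is mainly a speed and convenience choice; the result has the same meaning as the list built in the loop.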

These sample means can be plotted in a histogram to get a clear view of the sampling distribution.

plt.hist(sample_mean, bins=100)
plt.axvline(np.mean(population), color="red")  # red line at the actual population mean
Fig 1 : Sampling distribution of the sample mean with sample size 3. (Image by author)

In the above picture, the red line marks the actual mean weight of the students, i.e. 39. The mean of the sampling distribution closely matches the actual mean of the population.

Suppose we draw a random sample of n observations from a specific population with mean µ and standard deviation σ. Let X₁, X₂, X₃, …, Xₙ be the n independent observations. If X_sample represents the mean of these n observations, then

X_sample = (X₁ + X₂ + X₃ + … + Xₙ) / n

X_sample is the sample statistic representing the mean of the sample. It is a random variable with its own probability distribution, which we call the sampling distribution of the sample mean.

Two important characteristics of this sampling distribution are:

1. The mean of the sampling distribution of X_sample is equal to the mean of the population from which we are sampling:

E(X_sample) = µ

2. The standard deviation of the sampling distribution of X_sample is equal to the population standard deviation divided by the square root of the sample size:

SD(X_sample) = σ / √n
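Both characteristics follow in a few lines from the definition of the sample mean. The derivation below is the standard textbook argument, written with \bar{X} standing for X_sample (my notation, not the article's); the variance step relies on the observations being independent.

```latex
\begin{align*}
\bar{X} &= \frac{1}{n}\sum_{i=1}^{n} X_i \\
E[\bar{X}] &= \frac{1}{n}\sum_{i=1}^{n} E[X_i] = \frac{1}{n}\cdot n\mu = \mu \\
\operatorname{Var}(\bar{X}) &= \frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}(X_i)
  = \frac{1}{n^2}\cdot n\sigma^2 = \frac{\sigma^2}{n} \\
\operatorname{SD}(\bar{X}) &= \sqrt{\operatorname{Var}(\bar{X})} = \frac{\sigma}{\sqrt{n}}
\end{align*}
```

For the class example, with σ = 5.19 and n = 3, this gives a standard deviation of about 3.0 for the sample means.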

From the above example of the weights of the students, we have seen that the mean of the sampling distribution is almost the same as the actual mean of the population. The standard deviation of the sampling distribution, however, gets smaller as the sample size increases. Below are three sampling distributions for the same example, with sample sizes 2, 4, and 8 respectively.

Fig 2 : Sampling distribution with sample size of 2, 4 and 8 respectively (Image by author)

The above pictures show that as we increase the sample size, the mean of the sampling distribution stays the same, while its standard deviation becomes smaller.
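The same comparison can be made with numbers instead of pictures. Here is a rough sketch that repeats the experiment for sample sizes 2, 4 and 8 and prints the mean and standard deviation of the sample means next to σ/√n. The helper name sampling_summary, the seed and the use of np.random.default_rng are my own assumptions, not taken from the original code.

```python
import numpy as np

# Weights of the 50 students from the example above
population = np.array([43, 40, 45, 32, 42, 48, 36, 33, 34, 33,
                       37, 40, 36, 40, 43, 33, 38, 30, 33, 30,
                       41, 35, 47, 49, 36, 36, 37, 39, 41, 32,
                       37, 46, 46, 41, 35, 38, 42, 31, 46, 39,
                       38, 38, 48, 42, 40, 43, 44, 31, 38, 48])

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
n_trials = 100_000


def sampling_summary(sample_size):
    """Hypothetical helper: return (mean, std) of the sample means."""
    samples = rng.choice(population, size=(n_trials, sample_size))
    sample_means = samples.mean(axis=1)
    return sample_means.mean(), sample_means.std()


sigma = population.std()  # population standard deviation

results = {}
for n in (2, 4, 8):
    mean_n, std_n = sampling_summary(n)
    results[n] = (mean_n, std_n)
    print(f"n={n}: mean={mean_n:.2f}, std={std_n:.2f}, "
          f"sigma/sqrt(n)={sigma / np.sqrt(n):.2f}")
```

The mean column should stay near 39 for every sample size, while the standard deviation column should shrink and track σ/√n closely.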

A smaller standard deviation of the sampling distribution indicates that the sample means cluster more tightly around the population mean.

So a larger sample tends to reflect the population more accurately: its sample mean is more likely to be close to the population mean, with less variation from sample to sample.

In short, the sample mean is likely to be close to the actual population mean if the sample size is large.

Central Limit Theorem :

If we sample from a normally distributed population, the sample mean is itself normally distributed (as in the above example, if we take the weights to be roughly normally distributed). But what if the population is not normal? This is where the central limit theorem comes in, according to which :

If we are sampling from a distribution that is not normal, the sample mean will be approximately normally distributed, provided the sample size is large. In other words, the sample mean will be approximately normally distributed for large sample sizes, regardless of the distribution from which we are sampling (as long as that distribution has a finite variance).

Let us take an example to get a clear view of this theorem. Let f(x) = aˣ be an exponential function, where a=0.93.

x = list(range(1,100))
y = [(0.93)**i for i in x]
plt.plot(x,y)
Fig 3 : f(x)=0.93ˣ (Image by author)

The distribution of the values of f(x) is definitely not normal: most of the values are close to 0 and only a few are large. Now we will take 100000 samples from these values, each with sample size 3, and calculate the mean of each sample.

y = np.array(y)
random_sample = [np.random.choice(y,3) for i in range(100000)]
random_sample = np.array(random_sample)
sample_mean=random_sample.mean(axis=1)

Now if we plot the sampling distribution of the sample mean, we will get the following:

Fig 4 : Sampling distribution with sample size 3 (Image by author)

Next, we increase the sample size and see the effect on the distribution.

Fig 5 : Sampling distribution with sample size 7, 15 and 30 respectively. (Image by author)

As we can see from the above pictures, the distribution of the sample mean tends towards the normal distribution as we increase the sample size, even though we are sampling from a population that is not normally distributed.
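One way to put a number on "tends towards the normal distribution" is skewness, which is 0 for a symmetric distribution such as the normal. The sketch below is my own addition, not part of the original experiment: it computes the skewness of the population values and of the sample means for sample sizes 3, 7, 15 and 30. The skewness helper, the seed and the use of np.random.default_rng are assumptions for illustration.

```python
import numpy as np

# Population from the example: f(x) = 0.93**x for x = 1, ..., 99
y = 0.93 ** np.arange(1, 100)

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only


def skewness(values):
    """Hypothetical helper: mean cubed deviation divided by std cubed."""
    centered = values - values.mean()
    return (centered ** 3).mean() / values.std() ** 3


pop_skew = skewness(y)
print(f"population skewness = {pop_skew:.3f}")

skew_by_size = {}
for n in (3, 7, 15, 30):
    # 100000 samples of size n (with replacement), one sample mean per row
    sample_mean = rng.choice(y, size=(100_000, n)).mean(axis=1)
    skew_by_size[n] = skewness(sample_mean)
    print(f"n={n}: skewness of sample means = {skew_by_size[n]:.3f}")
```

For independent draws the skewness of the sample mean shrinks in proportion to 1/√n, so the printed values should fall steadily towards 0 as n grows, although some skew is still visible at n = 30.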

The central limit theorem means that we can use well-developed statistical inference procedures that are based on the normal distribution, even if we are sampling from a distribution that is not normal, provided we have a large sample size.

That’s all for today. I hope you like this article. You can also check out my previously published articles on statistics.

  1. Measures of variability and spread : Range, quartile, variance and standard deviation
  2. Discrete probability distribution : Part 1
  3. Discrete probability distribution : Part 2
  4. Continuous probability distribution

Keep smiling. 😃
