The building blocks of statistical inference
One of the things that statisticians do is statistical inference.
We are usually blessed with only one dataset. We think of it as being drawn from some data generating process.
We want to infer either the entire data generating process or features of the data generating process using only one dataset.
We are actually reversing the process you have been seeing in our discussions of probability and random variables.
There are three major types of inference:
Slightly more specific:
But these are extremely limited and there are unanswered questions.
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 4.490 4.497 4.494 4.4999 4.5012 4.4986 4.49988 4.5001 4.50031 4.49988
[2,] 0.538 0.264 0.134 0.0638 0.0329 0.0167 0.00811 0.0041 0.00199 0.00105
[3,] 0.733 0.514 0.365 0.2526 0.1812 0.1294 0.09007 0.0640 0.04464 0.03244
[1] 0.52500 0.26250 0.13125 0.06563 0.03281 0.01641 0.00820 0.00410 0.00205
[10] 0.00103
[1] 0.7425 0.5250 0.3712 0.2625 0.1856 0.1313 0.0928 0.0656 0.0464 0.0328
Theorem 1 (The Weak Law of Large Numbers) If \(X_1, \ldots, X_n\) are IID random variables with mean \(\mathbb{E}(X_i)=\mu\) and \(\mathsf{Var}\left(X_i\right)=\sigma^2\), then for all \(\varepsilon>0\), \[\lim_{n\to\infty} \mathbb{P}\left(|\overline{X}_n-\mu|\geq \varepsilon\right)= 0.\]
The statement \(\overline{X}_n\overset{p}{\to} \mu\) can be expressed in many ways. As \(n\to \infty\),
Take note that \(\mu=\mathbb{E}\left(X_i\right)\) is the common mean of the marginal distributions of \(X_1, X_2, \ldots\).
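The weak law can be illustrated by simulation. The following is a minimal sketch, assuming the same three-point distribution used elsewhere in these notes (values 2, 3, 5 with probabilities 0.1, 0.1, 0.8, so that \(\mu=4.5\)). The tolerance `eps`, the sample sizes in `n.values`, the number of replications, and the name `prob.far` are illustrative choices, not part of the original material.

```r
set.seed(1)                     # for reproducibility (arbitrary seed)
eps <- 0.1                      # illustrative tolerance
n.values <- c(10, 100, 1000)    # illustrative sample sizes

# For each n, estimate P(|sample mean - mu| >= eps) from 2000 simulated samples
prob.far <- sapply(n.values, function(n) {
  a <- replicate(2000, sample(c(2, 3, 5), n, prob = c(0.1, 0.1, 0.8), replace = TRUE))
  # round() guards against floating-point error when a mean sits exactly eps away from 4.5
  mean(abs(round(colMeans(a) - 4.5, 6)) >= eps)
})
prob.far
```

The estimated probabilities should shrink toward zero as \(n\) grows, which is what the theorem predicts for any fixed \(\varepsilon>0\).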
Standardize first in order to “counter” two things:
Here we know that \(\mathbb{E}\left(\overline{X}_n\right)=\mu\) and \(\mathsf{Var}\left(\overline{X}_n\right)=\sigma^2/n\).
Plot the distribution of the standardized sample mean \(\dfrac{\overline{X}_n-\mu}{\sigma/\sqrt{n}}\) instead.
We know \(\mu\) and \(\sigma^2\) for the purposes of the simulation. In practice, we do not know them (otherwise, what is the point of a statistics course?).
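For the simulations in these notes, the values of \(\mu\) and \(\sigma^2\) come from the three-point distribution with values 2, 3, 5 and probabilities 0.1, 0.1, 0.8. A short sketch of that calculation follows; the variable names `x`, `p`, `mu`, and `sigma2` are chosen here for illustration only.

```r
x <- c(2, 3, 5)          # support of the distribution
p <- c(0.1, 0.1, 0.8)    # corresponding probabilities

mu <- sum(x * p)                 # E(X)
sigma2 <- sum((x - mu)^2 * p)    # Var(X) = E[(X - mu)^2]
c(mu = mu, sigma2 = sigma2)
```

This gives \(\mu=4.5\) and \(\sigma^2=1.05\), the constants that appear in the standardization.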
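plot.dist.mean <- function(n) {
  # 10^4 simulated samples of size n, one sample per column
  a <- replicate(10^4, sample(c(2, 3, 5), n, prob = c(0.1, 0.1, 0.8), replace = TRUE))
  # Standardize each sample mean using mu = 4.5 and sigma^2 = 1.05
  st.sample.means <- (colMeans(a) - 4.5) / sqrt(1.05 / n)
  par(mfrow = c(1, 2))
  hist(st.sample.means, cex.axis = 1.5, cex.lab = 1.5, main = "", xlim = c(-4, 4))
  # Second panel: empirical cdf, so that the standard normal cdf can be overlaid on it
  plot(ecdf(st.sample.means), cex.axis = 1.5, cex.lab = 1.5, main = "", xlim = c(-4, 4))
  curve(pnorm, add = TRUE, col = "red")
}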
The red curve is actually the cdf of a random variable \(Z\) that has a standard normal distribution (another special distribution). The standard normal cdf is usually written as \(\Phi(z)\) and has the following form: \[\Phi(z)=\mathbb{P}\left(Z\leq z\right)=\int_{-\infty}^z \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}x^2\right)\, dx.\]
This is probably one of the more famous continuous distributions.
Compared to the cdfs of discrete random variables you have seen so far, the red curve looks very different. In particular, there are no jumps.
What matters is that the cdf of a standard normal random variable is a good approximation of the cdf of the sampling distribution of the standardized sample mean.
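One rough way to put a number on "good approximation" is to measure the largest gap between the empirical cdf of the standardized sample means and \(\Phi\) over a grid of points. The sketch below assumes the same three-point distribution as before; the helper `max.cdf.gap`, the grid `z.grid`, and the two sample sizes are hypothetical choices made only for this illustration.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Largest absolute gap between the empirical cdf of the standardized
# sample mean and the standard normal cdf, evaluated on a grid
max.cdf.gap <- function(n, reps = 10^4) {
  a <- replicate(reps, sample(c(2, 3, 5), n, prob = c(0.1, 0.1, 0.8), replace = TRUE))
  st.sample.means <- (colMeans(a) - 4.5) / sqrt(1.05 / n)
  z.grid <- seq(-3, 3, by = 0.01)
  max(abs(ecdf(st.sample.means)(z.grid) - pnorm(z.grid)))
}

gap.small.n <- max.cdf.gap(5)
gap.large.n <- max.cdf.gap(200)
c(n5 = gap.small.n, n200 = gap.large.n)
```

The gap for the larger sample size should be noticeably smaller, although it never reaches zero because the sample mean here is still discrete and the simulation itself has some noise.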
We will just take a small dip into continuous random variables.
In the slides, pay attention to whether the word “discrete” was specified or not. Differences compared to discrete random variables:
Probability density functions (pdf) versus probability mass functions (pmf)
Cdf is continuous
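The link between a pdf and a cdf can be checked numerically for the standard normal distribution. This is only a sketch: it uses base R's `dnorm` (pdf), `pnorm` (cdf), and `integrate`, and the points `z`, `a`, and `b` are arbitrary values picked for illustration.

```r
z <- -0.5   # arbitrary evaluation point

# The cdf at z is the area under the pdf from -Inf to z
area.to.z <- integrate(dnorm, lower = -Inf, upper = z)$value
c(by.integration = area.to.z, by.pnorm = pnorm(z))

# An interval probability is a difference of cdf values,
# which equals the area under the pdf over that interval
a <- -1
b <- 1
area.a.to.b <- integrate(dnorm, lower = a, upper = b)$value
c(by.integration = area.a.to.b, by.pnorm = pnorm(b) - pnorm(a))
```

Both pairs should agree up to numerical integration error, which is the sense in which areas under a pdf play the role that sums of a pmf play in the discrete case.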
In terms of calculations, we have for a continuous random variable \(X\):
The standard normal distribution has some nice properties:
We can now calculate probabilities involving the sample mean by directly using a large-sample approximation.
To be more specific, suppose we want to know \(\mathbb{P}\left(\overline{X}_{20} \leq 4.3\right)\) when we have IID random variables \(X_1,\ldots, X_{20}\) from the distribution \(\mathbb{P}\left(X=2\right)=1/10\), \(\mathbb{P}\left(X=3\right)=1/10\), \(\mathbb{P}\left(X=5\right)=8/10\).
This is not impossible to compute, but it is extremely tedious to construct the distribution of \(\overline{X}_{20}\) by hand.
a <- replicate(10^4, sample(c(2, 3, 5), 20, prob = c(0.1, 0.1, 0.8), replace = TRUE))
sample.means <- colMeans(a)
mean(sample.means <= 4.3)
[1] 0.222
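For comparison with the simulated value, the exact distribution of the sum \(X_1+\cdots+X_{20}\) can be built by repeated convolution of the single-draw pmf, which is one way a computer can do the tedious construction. The following is a sketch under that approach; the helper `convolve.pmf` and the names `pmf1`, `pmf.sum`, and `exact.prob` are introduced here for illustration and are not from the original notes.

```r
# pmf of a single draw on the values 0, 1, ..., 5
# (only 2, 3, and 5 have positive probability)
pmf1 <- c(0, 0, 0.1, 0.1, 0, 0.8)

# Convolve two pmfs that are each indexed from the value 0
convolve.pmf <- function(p, q) {
  out <- numeric(length(p) + length(q) - 1)
  for (i in seq_along(p)) {
    idx <- i:(i + length(q) - 1)
    out[idx] <- out[idx] + p[i] * q
  }
  out
}

# pmf of the sum of 20 IID draws, on the values 0, 1, ..., 100
pmf.sum <- pmf1
for (k in 2:20) pmf.sum <- convolve.pmf(pmf.sum, pmf1)

# pmf.sum[s + 1] is P(sum = s); the sample mean is <= 4.3 exactly when the sum is <= 86
exact.prob <- sum(pmf.sum[1:87])
exact.prob
```

The result should be close to the simulated proportion, with the difference attributable to simulation error.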
Theorem 2 (The Central Limit Theorem (Wasserman (2004) p.77, Theorem 5.8)) Suppose \(X_1, \ldots, X_n\) are IID random variables with mean \(\mathbb{E}(X_i)=\mu\) and \(\mathsf{Var}\left(X_i\right)=\sigma^2\). Then, \[Z_n=\frac{\overline{X}_n-\mathbb{E}\left(\overline{X}_n\right)}{\sqrt{\mathsf{Var}\left(\overline{X}_n\right)}}=\frac{\sqrt{n}\left(\overline{X}_n-\mu\right)}{\sigma}\overset{d}{\to} Z\] where \(Z\sim N(0,1)\). In other words, \[\lim_{n\to\infty}\mathbb{P}\left(Z_n\leq z\right)=\Phi(z).\]
How do we use the theorem so that we can avoid simulation?
Observe that \[\begin{aligned}\mathbb{P}\left(\overline{X}_{20} \leq 4.3\right) &= \mathbb{P}\left(Z_{20} \leq \frac{4.3-4.5}{\sqrt{1.05/20}}\right) \\ &\approx \mathbb{P}\left(Z_{20}\leq -0.87\right)\end{aligned}\]
By the central limit theorem, we can approximate \(\mathbb{P}\left(Z_{20}\leq -0.87\right)\) using \(\mathbb{P}\left(Z\leq -0.87\right)\).
We can look the latter probability up from a standard normal table or use R.
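In R, the standard normal cdf is available as `pnorm`. A minimal sketch of the lookup, using the values \(\mu=4.5\), \(\sigma^2=1.05\), and \(n=20\) from the example (the variable names `z` and `clt.prob` are illustrative):

```r
# Standardized cutoff for the event {sample mean <= 4.3} with n = 20
z <- (4.3 - 4.5) / sqrt(1.05 / 20)
z

# Central limit theorem approximation: P(Z <= z) for standard normal Z
clt.prob <- pnorm(z)
clt.prob
```

The approximation comes out at about 0.19, somewhat below the simulated value of 0.222 for \(n=20\).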
a <- replicate(10^4, sample(c(2, 3, 5), 30, prob = c(0.1, 0.1, 0.8), replace = TRUE))
sample.means <- colMeans(a)
mean(sample.means <= 4.3)
[1] 0.161
a <- replicate(10^4, sample(c(2, 3, 5), 40, prob = c(0.1, 0.1, 0.8), replace = TRUE))
sample.means <- colMeans(a)
mean(sample.means <= 4.3)
[1] 0.128
a <- replicate(10^4, sample(c(2, 3, 5), 50, prob = c(0.1, 0.1, 0.8), replace = TRUE))
sample.means <- colMeans(a)
mean(sample.means <= 4.3)
[1] 0.0947
The central limit theorem approximation is getting a bit better.
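To see the simulated probabilities and the normal approximations side by side, the calculation can be repeated over several sample sizes. This is a sketch under the same assumptions as the simulations above (cutoff 4.3, \(\mu=4.5\), \(\sigma^2=1.05\)); the names `n.values`, `simulated`, and `clt.approx` are chosen here for illustration, and the simulated column will vary slightly with the seed.

```r
set.seed(1)  # arbitrary seed for reproducibility
n.values <- c(20, 30, 40, 50)

# Simulated P(sample mean <= 4.3) for each sample size
simulated <- sapply(n.values, function(n) {
  a <- replicate(10^4, sample(c(2, 3, 5), n, prob = c(0.1, 0.1, 0.8), replace = TRUE))
  mean(colMeans(a) <= 4.3)
})

# Central limit theorem approximation for each sample size
clt.approx <- pnorm((4.3 - 4.5) / sqrt(1.05 / n.values))

data.frame(n = n.values, simulated = simulated, clt.approx = clt.approx)
```

Reading across each row shows how far the approximation is from the simulated value at that sample size.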
Just as with Chebyshev’s inequality, we do not need to know the full common distribution of the \(X_i\)’s. We only need to know the common mean, the common variance, and the sample size.