Lecture 5c

Summaries of distributions: Quantiles and expected values

Andrew Pua


Quantile functions

Definition 1 (Wasserman (2004, p. 25) Definition 2.16) Let \(X\) be a random variable with cdf \(F\). The inverse cdf or quantile function is defined by \[F^{-1}\left(q\right)=\inf\{x: F(x)>q\}\] for \(q\in [0,1]\).

  1. \(F^{-1}(q)\) is called the \((100q)\)th percentile or the \(q\)-quantile.
  2. You may see a slightly different definition in Evans and Rosenthal Definition 2.10.1. They have \[F^{-1}\left(q\right)=\min\{x: F(x)\geq q\}\] for \(q\in (0,1)\).

  3. Quantile functions are used in everyday life:

    1. producing rankings
    2. reporting risk measures
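
The two definitions above differ only at values of \(q\) where the cdf is flat. The following is a minimal sketch, assuming a fair coin with \(X\in\{0,1\}\) as the illustrative distribution; the function names `q_wasserman` and `q_er` are hypothetical and were chosen only for this example.

```r
# Illustrative distribution (an assumption): X = 0 or 1, each with probability 0.5.
support <- c(0, 1)
cdf <- cumsum(c(0.5, 0.5))   # F(0) = 0.5, F(1) = 1

# Definition 1: inf{x : F(x) > q}.
# For a discrete X this is the smallest support point whose cdf value exceeds q.
q_wasserman <- function(q) {
  idx <- which(cdf > q)
  if (length(idx) == 0) Inf else support[idx[1]]   # the set is empty at q = 1
}

# Evans and Rosenthal: min{x : F(x) >= q}, for q strictly between 0 and 1.
q_er <- function(q) {
  support[which(cdf >= q)[1]]
}

# The definitions disagree at q = 0.5, where the cdf has a flat stretch, and agree elsewhere.
sapply(c(0.3, 0.5, 0.7), q_wasserman)
sapply(c(0.3, 0.5, 0.7), q_er)
```

At \(q=0.5\) the first definition returns the right endpoint of the flat stretch and the second returns the left endpoint.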

Should a firm introduce a new product?

A company wants to decide whether it should enter a market that other potential competitors may also enter. It has the following internal projections:

  1. Fixed costs of entering the market: 26 million
  2. Net present value of revenue minus variable costs in the whole market: 100 million
  3. Profits will depend on how many competitors enter the market. The 100 million will be shared equally among those in the market.
  4. Let \(X\) be the number of other entrants in the market. This number is uncertain: \[\mathbb{P}\left(X=x\right)=\begin{cases}0.1 & \mathsf{if}\ x=1\\ 0.25 & \mathsf{if}\ x=2\\ 0.3 & \mathsf{if}\ x=3\\ 0.25 & \mathsf{if}\ x=4\\ 0.1 & \mathsf{if}\ x=5 \end{cases}\]

  5. What would be the quantile function for the profit of the company? Derive it. Produce a simulation.

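entrants <- 1:5
# Profit: an equal share of the 100 million among (1 + entrants) firms, minus fixed costs of 26 million
profits <- 100/(1+entrants)-26
# Simulate 10^4 profit draws using the probabilities of 1 to 5 other entrants
a <- sample(profits, 10^4, prob = c(0.1, 0.25, 0.3, 0.25, 0.1), replace = TRUE)
probs <- seq(0,1,0.005)
# Empirical quantile function of the simulated profits
plot(probs, quantile(a, probs), xlab = "q", ylab = "q-quantile of profit")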
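
The quantile function of profit can also be computed exactly from the projections, without simulation. The following is a sketch under the Evans and Rosenthal form \(\min\{x: F(x)\geq q\}\); the names `support`, `cdf`, and `profit_quantile` and the tolerance argument `tol` are assumptions introduced for this example.

```r
# Exact distribution of profit, taken from the projections above.
entrants <- 1:5
pmf <- c(0.1, 0.25, 0.3, 0.25, 0.1)
profits <- 100 / (1 + entrants) - 26

# Sort the profit values so the cdf can be accumulated from the smallest profit upward.
ord <- order(profits)
support <- profits[ord]
cdf <- cumsum(pmf[ord])

# Quantile function min{x : F(x) >= q}; tol guards against rounding error in cumsum().
profit_quantile <- function(q, tol = 1e-12) {
  sapply(q, function(qq) support[which(cdf >= qq - tol)[1]])
}

# Exact cdf of profit, then a few quantiles.
cbind(profit = support, cdf = cdf)
profit_quantile(c(0.05, 0.5, 0.95))
```

The exact quantile function is a step function with jumps at the cumulative probabilities. One possible check is to compare it against `quantile(a, probs, type = 1)` from the simulated draws, since `type = 1` uses the inverse of the empirical cdf rather than interpolation.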

Expected value

Definition 2 (Wasserman (2004, p. 47) Definition 3.1) The expected value of \(X\) is defined to be \[\mathbb{E}\left(X\right)=\sum_x xf_X(x)\] assuming that the sum is well-defined.

  1. Sometimes \(\mathbb{E}\left(X\right)\) is called the mathematical expectation, the population mean, or the first moment of \(X\).
  2. There are also multiple acceptable notations: \(\mathbb{E}\left(X\right)\), \(\mathbb{E}X\), \(\mu\), \(\mu_X\).

  3. The requirement of a well-defined expected value is not an issue if \(X\) takes on a finite number of values.

  4. But it can be an issue if \(X\) takes on a countably infinite number of values.

  5. Illustrate using some exercises.

  1. The expected value may not exist. Let \(k=\pm 1, \pm 2, \ldots\). Consider \[\mathbb{P}\left(X=k\right)=\frac{3}{\pi^2 k^2}.\]

  2. Refer to other toy examples from Examples 3.1.10 and 3.1.11 of Evans and Rosenthal.

  3. Another less artificial example is from the St. Petersburg paradox. Refer to Examples 3.1.12 and 3.1.13 of Evans and Rosenthal.
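
For the distribution \(\mathbb{P}\left(X=k\right)=\frac{3}{\pi^2 k^2}\), the positive part of the sum defining \(\mathbb{E}\left(X\right)\) is \(\sum_{k\geq 1} k\cdot\frac{3}{\pi^2 k^2}=\frac{3}{\pi^2}\sum_{k\geq 1}\frac{1}{k}\), a multiple of the harmonic series, and the negative part diverges in the same way. A rough numerical sketch of this behavior follows; the helper names `positive_part` and `total_prob` are hypothetical.

```r
# Partial sum of the positive part of E(X): sum over k = 1..n of k * 3 / (pi^2 * k^2).
positive_part <- function(n) {
  k <- 1:n
  sum(k * 3 / (pi^2 * k^2))
}

# Partial sum of the probabilities; the factor 2 accounts for both k and -k.
total_prob <- function(n) {
  k <- 1:n
  2 * sum(3 / (pi^2 * k^2))
}

# The probabilities settle near 1, while the positive part keeps growing with n.
n_values <- 10^(1:6)
cbind(n = n_values,
      total_prob = sapply(n_values, total_prob),
      positive_part = sapply(n_values, positive_part))
```

The positive part grows roughly like a multiple of \(\log n\), so no finite truncation point stabilizes it, even though the probabilities themselves sum to 1.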

Three big reasons to study the expected value

  1. One criterion to decide which action to take when making decisions under uncertainty is to choose the action which produces the largest expected payoff.
  2. \(\mathbb{E}\left(X\right)\) is our “best” guess of what value \(X\) could take under a very specific criterion.
  3. The law of large numbers

Illustration of the law of large numbers

  1. You already saw this in the context of tossing a coin. Let \(X=1\) if the toss produces heads and \(X=0\) if the toss produces tails. Then \[\mathbb{E}\left(X\right)=\mathbb{P}\left(X=1\right).\]

  2. Set \(\mathbb{P}\left(X=1\right)=0.4\). Toss such a coin independently many times. Look at the relative frequency.

x <- rbinom(10^4, 1, 0.4)
z <- cumsum(x)/(1:length(x))
plot(1:length(z), z, 
     type = "l", xlab = "number of coin flips", ylab = "relative frequency", ylim = c(0, 0.6),
     cex.lab=1.5, cex.axis=1.5)
lines(1:length(z), rep(0.4, length(z)), 
      lty = 3, col = 2, lwd = 3) 
  1. Now, roll a fair die. Let \(X\) be the outcome of one roll. Here, \(\mathbb{E}\left(X\right)=3.5\).

  2. Instead of looking at a relative frequency, consider the sample mean of the outcomes from independently rolling a fair die as you have more and more rolls.

  3. Notice the long-run behavior of the sample mean on the next slide.

  4. Pay attention to the weighting scheme.

  5. Notice in both cases, the expected value does NOT have to be one of the possible outcomes of \(X\).

x <- sample(1:6, 10^4, replace = TRUE)
z <- cumsum(x)/(1:length(x))
plot(1:length(z), z, 
     type = "l", xlab = "number of die rolls", ylab = "sample mean", ylim = c(1, 6),
     cex.lab=1.5, cex.axis=1.5)
lines(1:length(z), rep(3.5, length(z)), 
      lty = 3, col = 2, lwd = 3) 
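
The weighting scheme mentioned above can be made explicit: the sample mean of the rolls equals the sum of each face value weighted by its relative frequency, which mirrors \(\sum_x xf_X(x)\) with relative frequencies in place of probabilities. A small sketch, where the seed and the names `rel_freq` and `weighted_sum` are assumptions made for this example:

```r
set.seed(1)   # arbitrary seed, only for reproducibility
x <- sample(1:6, 10^4, replace = TRUE)

# Relative frequency of each face among the rolls.
rel_freq <- table(factor(x, levels = 1:6)) / length(x)

# Sample mean rewritten as a weighted sum: face value times relative frequency.
weighted_sum <- sum((1:6) * as.numeric(rel_freq))

c(sample_mean = mean(x),
  weighted_sum = weighted_sum,
  expected_value = sum((1:6) * (1/6)))
```

As the number of rolls grows, each relative frequency approaches \(1/6\), so the weighted sum approaches the expected value.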

The Rule of the Lazy Statistician

  1. Refer to the distribution of \(X\) in Wasserman Chapter 2 Exercise 2.

    • Find \(\mathbb{E}\left(X\right)\).
    • Let \(Y=X^2\). Find \(\mathbb{E}\left(Y\right)\).
    • Is it true that \(\mathbb{E}\left(X^2\right)\) and \(\left(\mathbb{E}\left(X\right)\right)^2\) are equal?
  2. Wasserman (2004, p. 48) Theorem 3.6, the rule of the lazy statistician, saves us time when computing expected values of functions of a random variable \(X\): \[\mathbb{E}\left(g(X)\right)=\sum_x g(x)f_X(x).\]

  3. Some very useful properties of expected values:

    1. Let \(X\) be a random variable. Let \(a\) and \(b\) be constants. \[\mathbb{E}\left(aX+b\right)=a\mathbb{E}\left(X\right)+b.\]
    2. Wasserman (2004, p. 63) Theorem 4.1 on Markov’s inequality: Existence of the expected value creates constraints on the tails of the distribution.
    3. Wasserman (2004, p. 66) Theorem 4.9 on Jensen’s inequality: If \(g\) is convex, then \[\mathbb{E}\left(g(X)\right) \geq g\left(\mathbb{E}\left(X\right)\right).\]
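
These results can be checked numerically on a small example. The sketch below uses a hypothetical pmf chosen only for illustration (it is not the distribution from the Wasserman exercise), with arbitrary constants \(a=3\) and \(b=2\) and the convex function \(g(x)=x^2\).

```r
# Hypothetical pmf, chosen only for illustration.
x <- c(-1, 0, 1, 2)
p <- c(0.2, 0.3, 0.2, 0.3)

e_x  <- sum(x * p)      # E(X)
e_x2 <- sum(x^2 * p)    # E(X^2) via the rule of the lazy statistician

# The long way: first build the pmf of Y = X^2, then take its expected value.
pmf_y <- tapply(p, x^2, sum)
e_y_long <- sum(as.numeric(names(pmf_y)) * pmf_y)

# Linearity with a = 3 and b = 2.
e_lin <- sum((3 * x + 2) * p)

c(e_x = e_x, e_x2 = e_x2, e_y_long = e_y_long,
  e_x_squared = e_x^2, e_lin = e_lin, lin_rhs = 3 * e_x + 2)
```

The two routes to \(\mathbb{E}\left(X^2\right)\) agree, and \(\mathbb{E}\left(X^2\right)\) is larger than \(\left(\mathbb{E}\left(X\right)\right)^2\) here, consistent with Jensen's inequality.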


Definition 3 (Wasserman (2004, p. 47)) The \(k\)th moment of \(X\) is defined to be \(\mathbb{E}\left(X^k\right)\) assuming that \(\mathbb{E}\left(|X|^k\right)<\infty\).

  1. Wasserman (2004, p. 49) Theorem 3.10: If \(\mathbb{E}\left(X^k\right)\) exists and if \(j<k\), then \(\mathbb{E}\left(X^j\right)\) exists.
  2. This result may not appear to have practical relevance, but it has consequences in risk management settings.


Definition 4 (Wasserman (2004, p. 51)) Let \(X\) be a random variable with mean \(\mu\). The variance of \(X\), denoted by \(\sigma^2\), \(\sigma^2_X\), \(\mathbb{V}\left(X\right)\), \(\mathbb{V}X\), or \(\mathsf{Var}\left(X\right)\), is defined by \[\sigma^2=\mathbb{E}\left[\left(X-\mu\right)^2\right],\] assuming that this expectation exists. The standard deviation is \(\mathsf{sd}(X)=\sqrt{\mathbb{V}\left(X\right)}\) and is also denoted by \(\sigma\) or \(\sigma_X\).

  1. Illustrate using examples.

  2. Wasserman (2004, p. 51) Theorem 3.15 points to two important properties of the variance:

    1. A computational shortcut: \[\mathsf{Var}\left(X\right)=\mathbb{E}\left(X^2\right)-\mu^2\]
    2. Let \(X\) be a random variable. Let \(a\) and \(b\) be constants. Then, \[\mathsf{Var}\left(aX+b\right)=a^2\mathsf{Var}\left(X\right).\]
  3. Where did the name standard deviation come from? The answer lies in Wasserman (2004, p. 64) Theorem 4.2 on Chebyshev’s inequality.

Theorem 1 (Wasserman (2004, p. 64) Chebyshev’s inequality) Let \(\mu=\mathbb{E}\left(X\right)\) and \(\sigma^2=\mathsf{Var}\left(X\right)\). Then, \[\mathbb{P}\left(|X-\mu|\geq t\right) \leq \frac{\sigma^2}{t^2}.\]

  1. If we let \(Z=\dfrac{X-\mu}{\sigma}\), then setting \(t=k\sigma\) in the inequality gives \[\mathbb{P}\left(|Z|\geq k\right)\leq \frac{1}{k^2}.\]

  2. Transforming \(X\) into \(Z=\dfrac{X-\mu}{\sigma}\) is called standardization.

  3. Therefore, we know a lot about the tail behavior of a standardized random variable! In particular, \[\mathbb{P}\left(|Z|\geq 2\right)\leq \frac{1}{4}, \ \ \mathbb{P}\left(|Z|\geq 3\right)\leq \frac{1}{9}.\]
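
As a rough check of the variance shortcut and of Chebyshev's inequality, the sketch below reuses the fair die from the law of large numbers illustration. The variable names and the particular values of \(k\) are assumptions made for this example.

```r
# Fair die: values and probabilities.
x <- 1:6
p <- rep(1/6, 6)

mu <- sum(x * p)
var_def      <- sum((x - mu)^2 * p)    # definition: E[(X - mu)^2]
var_shortcut <- sum(x^2 * p) - mu^2    # shortcut: E(X^2) - mu^2
sigma <- sqrt(var_def)

# Exact tail probability P(|Z| >= k) of the standardized die next to the bound 1 / k^2.
z <- (x - mu) / sigma
k <- c(1, 1.2, 1.4, 2, 3)
tail_prob <- sapply(k, function(kk) sum(p[abs(z) >= kk]))
cbind(k = k, tail_prob = tail_prob, bound = 1 / k^2)
```

For the die, the exact tail probabilities sit well below the bound. This is expected, because Chebyshev's inequality has to hold for every distribution with a finite variance.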

How can we make predictions?

  • One of the four words in the sentence “I SEE THE MOUSE” will be selected at random.

  • The task is to predict the number of letters in the selected word.

  • Your error would be the actual number of letters in the selected word minus your predicted number of letters.

  • The square of your error will be your loss.

  • Because you do not know for sure which word will be chosen, you have to allow for contingencies.

  • Therefore, losses depend on which word was chosen.

  • What would be your prediction rule in order to make your expected loss as small as possible?

  • How much would be the smallest expected loss?
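
One way to explore these questions numerically is to evaluate the expected squared loss over a grid of candidate predictions. This is only a sketch: it assumes that "selected at random" means each of the four words is equally likely, and the grid and the names `expected_loss`, `grid`, and `best` are choices made for this example.

```r
# Number of letters in each word; each word is assumed to be equally likely.
words <- c("I", "SEE", "THE", "MOUSE")
y <- nchar(words)
p <- rep(1 / length(y), length(y))

# Expected squared loss from always predicting b.
expected_loss <- function(b) sum((y - b)^2 * p)

# Evaluate the expected loss on a grid of candidate predictions and pick the smallest.
grid <- seq(0, 6, by = 0.01)
losses <- sapply(grid, expected_loss)
best <- grid[which.min(losses)]

c(best_prediction = best, smallest_expected_loss = min(losses))
```

The grid search result can be compared with `sum(y * p)` and `sum((y - sum(y * p))^2 * p)`, which are the mean and the variance of the number of letters.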

Optimal prediction

  • There is some random variable \(Y\) whose value you want to predict.

  • You make a prediction (a real number \(\beta_0\)).

  • You make an error \(Y-\beta_0\), but this error is a random variable having a distribution.

  • Next, you need some criterion to assess whether your prediction is “good”.

  • You need a loss function. There are many choices, including but not limited to: \[L\left(z\right)=z^{2},\quad L\left(z\right)=\left|z\right|,\quad L\left(z\right)=\begin{cases} \left(1-\alpha\right)\left|z\right| & \mathsf{if}\ z\leq0\\ \alpha\left|z\right| & \mathsf{if}\ z>0 \end{cases}\]

  • But losses will have a distribution too.

  • We can look at expected losses. This may seem arbitrary but many prediction and forecasting contexts use expected losses.

  • Consider \(L\left(z\right)=z^{2}\) and form an expected squared loss \(\mathbb{E}\left[\left(Y-\beta_0\right)^{2}\right]\).

  • This expected loss is sometimes called mean squared error (MSE).

  • You can show that \(\mathbb{E}\left(Y\right)\) is the unique solution to the following optimization problem: \[\min_{\beta_0}\mathbb{E}\left[\left(Y-\beta_0\right)^{2}\right]\]

  • \(\mathbb{E}\left(Y\right)\) is given another interpretation as the optimal predictor.

  • We also have a neat connection with the definition of the variance:

    • \(\mathsf{Var}\left(Y\right)\) becomes the smallest expected loss from using \(\mathbb{E}\left(Y\right)\) as your prediction for \(Y\).
  • What happens if we have additional information which we may use for prediction?
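
To see why \(\mathbb{E}\left(Y\right)\) solves the minimization problem for the mean squared error, add and subtract \(\mathbb{E}\left(Y\right)\) inside the square. A short derivation, assuming that \(\mathsf{Var}\left(Y\right)\) exists: \[\begin{aligned}\mathbb{E}\left[\left(Y-\beta_0\right)^{2}\right] &= \mathbb{E}\left[\left(Y-\mathbb{E}\left(Y\right)+\mathbb{E}\left(Y\right)-\beta_0\right)^{2}\right]\\ &= \mathbb{E}\left[\left(Y-\mathbb{E}\left(Y\right)\right)^{2}\right]+2\left(\mathbb{E}\left(Y\right)-\beta_0\right)\mathbb{E}\left[Y-\mathbb{E}\left(Y\right)\right]+\left(\mathbb{E}\left(Y\right)-\beta_0\right)^{2}\\ &= \mathsf{Var}\left(Y\right)+\left(\mathbb{E}\left(Y\right)-\beta_0\right)^{2}.\end{aligned}\]

The middle term vanishes because \(\mathbb{E}\left[Y-\mathbb{E}\left(Y\right)\right]=0\). The first term does not depend on \(\beta_0\), and the last term is nonnegative and equals zero only when \(\beta_0=\mathbb{E}\left(Y\right)\). Hence the minimizer is unique and the smallest expected loss is \(\mathsf{Var}\left(Y\right)\).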