Data 145: Evidence and Uncertainty

Comprehensive Study Guide - Lectures 21 through 22 - Spring 2026
Instructors: Ani Adhikari, William Fithian

Table of Contents

  1. Bridge: From Error Rates to Tail Probabilities
  2. Lecture 21: Concentration Inequalities
  3. Bridge: From Hoeffding to Jacobians
  4. Lecture 22: Multivariable Methods
  5. Big Comparison Map
  6. Master Summary and Formula Sheet
  7. Common Mistakes

Bridge: From Error Rates to Tail Probabilities

Lecture 20 focused on controlling errors when many decisions are made at once. Lecture 21 shifts from decision procedures to random fluctuations themselves: how likely is a statistic to wander far from its target?

The new question is not "how many false discoveries might we make?" but "how concentrated is a random quantity around the value we care about?" In machine learning language: how far can an empirical quantity drift from its population target?

This makes concentration inequalities a natural bridge to later material. Regression, empirical risk, and generalization bounds all depend on controlling the tails of sums and averages, even when exact tail probabilities are hard or impossible to compute.


Lecture 21: Concentration Inequalities

21.1 Why tail bounds matter

Suppose a statistic based on data converges to a target as sample size grows: an empirical risk to its expected loss, a sample mean to its population mean, or a regression quantity to its limiting parameter. Usually we do not need exact tail probabilities. We need guarantees that the bad tail is smaller than some tolerance.

A concentration inequality is an upper bound on a tail probability. If the bound is small, we know most of the probability mass lies close to the target value. Tail bounds answer practical questions such as: how large must a sample be before the average lands within a given tolerance of its mean, with a given probability?

21.2 Markov and Chebyshev: the basic polynomial bounds

Markov's inequality. If $X \geq 0$ and $E[X]=\mu$, then for $t>0$, $$P(X \geq t) \leq \frac{\mu}{t}.$$
The proof idea is the same one from earlier probability: $$\mathbf{1}\{X \geq t\} \leq \frac{X}{t}.$$ Taking expectations gives the result immediately.
Chebyshev's inequality. If $X$ has mean $\mu$ and variance $\sigma^2$, then $$P(|X-\mu| \geq t) \leq \frac{\sigma^2}{t^2}.$$ This follows by applying Markov to $(X-\mu)^2$.
Markov uses only a first moment and gives a $1/t$ tail. Chebyshev uses a second moment and improves that to $1/t^2$. These are broad, robust, and often crude.
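To see just how crude these bounds can be, here is a quick numerical sketch. The Exponential(1) example is our own choice (not from lecture); it is convenient because the exact tail $P(X \geq t) = e^{-t}$ is known, so we can compare it with both bounds directly.

```python
import math

# Exponential(1): mean = 1, variance = 1, exact tail P(X >= t) = e^{-t}
mu, sigma2 = 1.0, 1.0

for t in [2.0, 4.0, 8.0]:
    exact = math.exp(-t)                 # true tail probability
    markov = mu / t                      # Markov: uses only the mean
    chebyshev = sigma2 / (t - mu) ** 2   # Chebyshev applied to P(X - mu >= t - mu)
    print(f"t={t}: exact={exact:.5f}  Markov={markov:.3f}  Chebyshev={chebyshev:.3f}")
```

Both bounds are valid at every $t$, but the gap between bound and truth widens rapidly: the exact tail decays exponentially while the bounds decay only polynomially.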

For a single right tail, Chebyshev still gives $$P(X-\mu \geq t) \leq \frac{\sigma^2}{t^2}.$$ If the distribution is symmetric about $\mu$, the one-sided bound improves to $$P(X-\mu \geq t) \leq \frac{\sigma^2}{2t^2}.$$

For a standard normal $Z$, symmetry plus $\Var(Z)=1$ gives $$P(Z > z) \leq \frac{1}{2z^2}, \qquad z>0.$$ This is valid but not especially sharp for far-out tails.

21.3 Chernoff bound: exponential tails from MGFs

The moment generating function of $X$ is $$M_X(s)=E(e^{sX}).$$ Since $$e^{sX}=1+\frac{sX}{1!}+\frac{s^2X^2}{2!}+\frac{s^3X^3}{3!}+\cdots,$$ taking expectations gives $$M_X(s)=1+sE[X]+\frac{s^2}{2!}E[X^2]+\frac{s^3}{3!}E[X^3]+\cdots.$$ So the mgf packages all the moments $E[X^k]$ into one function.
Here $s$ is just a fixed tuning parameter that we choose. The random input is still $X$. Some books use $t$ instead of $s$; it plays the same role. Chernoff's method works by picking a convenient value of this parameter and then optimizing over it.
If the moment generating function $M_X(s)=E(e^{sX})$ is finite for some $s>0$, then for any $t$ and any $s>0$, $$P(X \geq t) = P(e^{sX} \geq e^{st}) \leq M_X(s)e^{-st}.$$
Optimizing over $s>0$ gives the Chernoff bound: $$P(X \geq t) \leq \inf_{s>0} M_X(s)e^{-st}.$$ This often yields exponentially decaying tails.
Chernoff is still just Markov's inequality, but applied to the exponential transform $e^{sX}$. The entire trick is to choose a transform that turns a hard tail event into something whose expectation is easy to manage.
For a standard normal $Z$, $$M_Z(s)=E(e^{sZ})=\int_{-\infty}^{\infty} e^{sz}\frac{1}{\sqrt{2\pi}}e^{-z^2/2}\,dz = e^{s^2/2}.$$ The usual derivation completes the square in the exponent: $$sz-\frac{z^2}{2} = -\frac{1}{2}(z-s)^2+\frac{s^2}{2}.$$ So the integral becomes $e^{s^2/2}$ times the integral of a normal density centered at $s$, and that integral is 1.
For a standard normal $Z$, optimizing the Chernoff bound gives $$P(Z > z) \leq e^{-z^2/2}, \qquad z>0.$$ This is dramatically smaller than the Chebyshev bound in the tails.
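The optimization over $s$ can be checked numerically. The sketch below (our own illustration, with $z = 3$ and a grid of $s$ values) minimizes $M_Z(s)e^{-sz} = e^{s^2/2 - sz}$ over the grid and compares the result with the closed form $e^{-z^2/2}$ and with the one-sided symmetric Chebyshev bound $1/(2z^2)$.

```python
import math

z = 3.0
s_grid = [k / 100 for k in range(1, 1001)]                   # s in (0, 10]
numeric = min(math.exp(s * s / 2 - s * z) for s in s_grid)   # min of M_Z(s) e^{-sz}
closed_form = math.exp(-z * z / 2)                           # optimum at s* = z
chebyshev = 1 / (2 * z * z)                                  # one-sided symmetric bound

print(numeric, closed_form, chebyshev)
```

The grid minimum lands exactly at $s = z$, confirming the calculus, and the Chernoff value is several times smaller than the Chebyshev bound already at $z = 3$.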

21.4 Sample means, CLT, and why exponential bounds are useful

Let $X_1,\ldots,X_n$ be i.i.d. with mean $\mu$ and variance $\sigma^2$, and let $\bar X$ be the sample mean. The CLT says $$\frac{\bar X-\mu}{\sigma/\sqrt{n}} \approx N(0,1)$$ for large $n$, so $$P(|\bar X-\mu| > \epsilon) \approx 2P\!\left(Z > \frac{\sqrt{n}\epsilon}{\sigma}\right).$$

Bounding the normal tail turns this approximation into concrete rates.
If $X_i \sim \text{Bernoulli}(1/2)$ i.i.d., then $$P(|\bar X-1/2| > \epsilon)$$ is bounded by $$\frac{1}{4n\epsilon^2} \quad \text{(Chebyshev)} \qquad \text{and} \qquad 2e^{-2n\epsilon^2} \quad \text{(Chernoff/Hoeffding form).}$$ The exponential bound is far more useful for sample-size planning.
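To make "far more useful" concrete, here is a small sketch (the specific numbers, tolerance $\epsilon = 0.05$ and failure probability $\delta = 0.05$, are our own) of the sample size each bound demands for the Bernoulli(1/2) mean.

```python
import math

eps, delta = 0.05, 0.05   # our example: tolerance 0.05, failure probability 0.05

# Chebyshev for Bernoulli(1/2) means: require 1/(4 n eps^2) <= delta
n_chebyshev = math.ceil(1 / (4 * delta * eps ** 2))

# Hoeffding/Chernoff form: require 2 exp(-2 n eps^2) <= delta
n_hoeffding = math.ceil(math.log(2 / delta) / (2 * eps ** 2))

print(n_chebyshev, n_hoeffding)   # Hoeffding needs far fewer samples
```

Solving each inequality for $n$ is exactly the sample-size planning described above; the exponential bound cuts the required $n$ by more than half at these settings.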

21.5 Mills inequality: a sharper normal-tail bound

For standard normal $Z$ with density $$\phi(z)=\frac{1}{\sqrt{2\pi}}e^{-z^2/2},$$ the one-sided normal-density bound is $$P(Z > z) \leq \frac{1}{z}\phi(z), \qquad z>0.$$
Graphically, this is another indicator comparison. For fixed $z>0$, the tail indicator is $$\mathbf{1}\{x \geq z\}= \begin{cases} 0, & x<z \\ 1, & x\geq z \end{cases}$$ and on $[z,\infty)$ we have $\mathbf{1}\{x\ge z\}\le x/z$. So the function $x/z$ sits above the indicator on the right tail.
The lecture proof compares the indicator of the tail to $x/z$ on $[z,\infty)$: $$P(Z>z)=\int_z^\infty \phi(x)\,dx \leq \int_z^\infty \frac{x}{z}\phi(x)\,dx = \frac{1}{z}\phi(z).$$ So again the pattern is: upper-bound an indicator by a friendlier function and integrate.
By symmetry, the two-sided form is $$P(|Z| > z) \leq 2\frac{1}{z}\phi(z)=\sqrt{\frac{2}{\pi}}\frac{1}{z}e^{-z^2/2}.$$ This is often called Mills inequality in the course notes.
Compared with Chernoff's $e^{-z^2/2}$, Mills adds the helpful factor $1/(z\sqrt{2\pi})$. So for normal tails it is a sharper analytic bound with almost no extra work.
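A quick numerical check of that comparison (our own sketch, using the exact normal tail computed from the complementary error function):

```python
import math

def phi(z):
    # Standard normal density
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

def exact_tail(z):
    # P(Z > z), computed from the complementary error function
    return 0.5 * math.erfc(z / math.sqrt(2))

for z in [1.0, 2.0, 3.0]:
    mills = phi(z) / z                  # Mills bound
    chernoff = math.exp(-z * z / 2)     # Chernoff bound
    print(f"z={z}: exact={exact_tail(z):.6f}  Mills={mills:.6f}  Chernoff={chernoff:.6f}")
```

At every $z \geq 1$ the ordering exact $\leq$ Mills $\leq$ Chernoff holds, and the Mills bound tracks the exact tail much more closely as $z$ grows.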

21.6 Hoeffding's inequality: independent, bounded, not necessarily i.i.d.

If $X_i \in [a_i,b_i]$ are independent and $S_n=\sum_{i=1}^n X_i$, then for $t>0$, $$P(S_n-E[S_n]\geq t) \leq \exp\!\left(\frac{-2t^2}{\sum_{i=1}^n (b_i-a_i)^2}\right).$$ The same bound holds for the left tail, hence $$P(|S_n-E[S_n]| \geq t) \leq 2\exp\!\left(\frac{-2t^2}{\sum_{i=1}^n (b_i-a_i)^2}\right).$$
Hoeffding assumes independence and bounded support, but it does not assume the variables are identically distributed.

For the sample average, $$P(|\bar X-E\bar X| \geq t) \leq 2\exp\!\left(\frac{-2n^2 t^2}{\sum_{i=1}^n (b_i-a_i)^2}\right).$$ If every $X_i\in[0,1]$, this simplifies to $$P(|\bar X-E\bar X| \geq t) \leq 2e^{-2nt^2}.$$

This is worth stating explicitly: when $a_i=0$ and $b_i=1$ for all $i$, each range length is $b_i-a_i=1$, so Hoeffding reduces to $$P(|\bar X-E\bar X| \geq t) \leq 2e^{-2nt^2}.$$ That is the same formula as the Chernoff-style bound for Bernoulli sample means on $[0,1]$.
This is exactly the same concentration rate we saw for i.i.d. Bernoulli$(1/2)$ sample means. The lecture's puzzle explains why that should not feel accidental: among all distributions on $[0,1]$, the unique one with largest variance is Bernoulli$(1/2)$.

21.7 Application: empirical risk in binary classification

For a classifier $f$, define the $0$-$1$ loss $$\ell(f(X_i),Y_i)=\mathbf{1}\{f(X_i)\neq Y_i\},$$ the empirical risk $$\hat R(f)=\frac{1}{n}\sum_{i=1}^n \ell(f(X_i),Y_i),$$ and its expectation $R(f)=E[\hat R(f)]$.

Because each loss lies in $[0,1]$, Hoeffding gives $$P(|\hat R(f)-R(f)| \geq \epsilon) \leq 2e^{-2n\epsilon^2}.$$
Sample-size consequence:
To guarantee $$P(|\hat R(f)-R(f)| \geq \epsilon) \leq \delta,$$ it is enough to choose $$n \geq \frac{1}{2\epsilon^2}\log\!\left(\frac{2}{\delta}\right).$$
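That sample-size formula is easy to package and verify. A minimal sketch (the helper name `hoeffding_sample_size` and the $(\epsilon, \delta)$ pairs are ours):

```python
import math

def hoeffding_sample_size(eps, delta):
    # Smallest n with 2 * exp(-2 * n * eps^2) <= delta
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

for eps, delta in [(0.1, 0.05), (0.05, 0.01), (0.01, 0.01)]:
    n = hoeffding_sample_size(eps, delta)
    # The resulting n meets the guarantee...
    assert 2 * math.exp(-2 * n * eps ** 2) <= delta
    # ...while n - 1 would not, so the ceiling is not wasteful.
    assert 2 * math.exp(-2 * (n - 1) * eps ** 2) > delta
    print(eps, delta, n)
```

Note the scaling: halving $\epsilon$ quadruples the required $n$, while halving $\delta$ only adds a logarithmic amount.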

Bridge: From Hoeffding to Jacobians

Lecture 22 starts by proving the concentration result that Lecture 21 stated, then abruptly pivots to multivariable change of variable. At first glance those topics look unrelated.

The real connection is methodological. Both parts of Lecture 22 introduce reusable tools: an exponential-moment machine for tail bounds and a Jacobian machine for transformed densities. Each tool becomes a standard move later in statistics and machine learning.

The first tool finishes the concentration story. The second tool prepares the ground for deriving the $t$ and $F$ distributions that appear in multiple regression.


Lecture 22: Multivariable Methods

22.1 Proof of Hoeffding's inequality

Let $S_n=\sum_{i=1}^n X_i$ with independent $X_i\in[a_i,b_i]$. For any $s>0$, $$P(S_n-E[S_n]\geq t) = P\!\left(e^{s(S_n-E[S_n])}\geq e^{st}\right) \leq e^{-st}E\!\left[e^{s(S_n-E[S_n])}\right].$$ Independence factors the expectation: $$e^{-st}\prod_{i=1}^n E[e^{s(X_i-E[X_i])}].$$ Apply Hoeffding's lemma to each factor: $$E[e^{s(X_i-E[X_i])}] \leq e^{s^2(b_i-a_i)^2/8}.$$ If $c=\sum_{i=1}^n (b_i-a_i)^2$, the bound becomes $$e^{-st+s^2c/8}.$$ Minimize the exponent by differentiating: $$-t+\frac{sc}{4}=0 \quad \Longrightarrow \quad s^*=\frac{4t}{c}.$$ Plugging in gives $$P(S_n-E[S_n]\geq t) \leq \exp\!\left(\frac{-2t^2}{c}\right) = \exp\!\left(\frac{-2t^2}{\sum_{i=1}^n (b_i-a_i)^2}\right).$$
Hoeffding is really Chernoff plus one special lemma for bounded random variables. Once the mgf of the centered variable is controlled, the rest is just calculus.

22.2 Hoeffding's lemma: what makes the proof work

Hoeffding's lemma. If $X\in[a,b]$ and $E[X]=0$, then for $s>0$, $$E[e^{sX}] \leq e^{s^2(b-a)^2/8}.$$
The optional proof in lecture has two main ingredients:
  1. Use convexity of $x\mapsto e^{sx}$ to bound $e^{sX}$ by the chord joining the endpoints $a$ and $b$.
  2. Take logs and use a second-order Taylor expansion, with a uniform bound on the second derivative, to show the log of the bound is at most $w^2/8$ where $w=s(b-a)$.
The conclusion is the sub-Gaussian-style mgf bound above.
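The lemma can be sanity-checked on a concrete variable. Our test case (not from lecture): $X = \pm 1/2$ with probability $1/2$ each, so $E[X] = 0$, $[a,b] = [-1/2, 1/2]$, and $E[e^{sX}] = \cosh(s/2)$, which the lemma says is at most $e^{s^2/8}$.

```python
import math

# Mean-zero X on [a, b] = [-1/2, 1/2] with P(X = +1/2) = P(X = -1/2) = 1/2.
# Then E[e^{sX}] = cosh(s/2); the lemma promises e^{s^2 (b-a)^2 / 8} = e^{s^2/8}.
a, b = -0.5, 0.5
for s in [0.5, 1.0, 2.0, 4.0]:
    mgf = math.cosh(s / 2)
    bound = math.exp(s ** 2 * (b - a) ** 2 / 8)
    print(f"s={s}: E[e^(sX)]={mgf:.5f}  lemma bound={bound:.5f}")
```

This extreme two-point variable is the worst case for the lemma, which is why the bound is nearly tight here for small $s$.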

22.3 Change of variable for joint densities

Now let $(X,Y)$ have joint density $f_{X,Y}$ and define $$(U,V)=g(X,Y)=(g_1(X,Y),g_2(X,Y)).$$ Assume $g$ is smooth and invertible, with inverse $$(X,Y)=h(U,V)=(h_1(U,V),h_2(U,V)).$$

Let $$J=\begin{bmatrix} \partial g_1/\partial x & \partial g_1/\partial y \\ \partial g_2/\partial x & \partial g_2/\partial y \end{bmatrix}, \qquad K=\begin{bmatrix} \partial h_1/\partial u & \partial h_1/\partial v \\ \partial h_2/\partial u & \partial h_2/\partial v \end{bmatrix}.$$ Then $K=J^{-1}$ and $$|\det J|=\frac{1}{|\det K|}.$$
The joint density of $(U,V)$ is $$f_{U,V}(u,v)=\frac{f_{X,Y}(x,y)}{|\det J|}\quad \text{at }(x,y)=h(u,v),$$ or equivalently $$f_{U,V}(u,v)=f_{X,Y}(h_1(u,v),h_2(u,v))\,|\det K|.$$
This is exactly the multivariable version of the one-variable change-of-variable formula. If $U=g(X)$ and $X=h(U)=g^{-1}(U)$, then $$\frac{du}{dx}=g'(x), \qquad \frac{dx}{du}=h'(u)=\frac{1}{du/dx}.$$ So in one dimension we can write $$f_U(u)=\frac{f_X(x)}{|du/dx|}\quad \text{at }x=h(u),$$ or equivalently $$f_U(u)=f_X(h(u))\,|dx/du|.$$ One version has the derivative in the denominator, the other has the inverse derivative in the numerator. The Jacobian formulas above are the same two viewpoints, just in matrix form.
The absolute value on the determinant matters. A Jacobian can be negative, but densities cannot.
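The one-variable version is easy to verify numerically. Our example (not from lecture): $U = e^X$ with $X \sim N(0,1)$, so $h(u) = \log u$ and $|dx/du| = 1/u$, giving the lognormal density $f_U(u) = \phi(\log u)/u$. We cross-check it against a numerical derivative of the CDF $P(U \leq u) = \Phi(\log u)$.

```python
import math

def phi(x):
    # Standard normal density
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def Phi(x):
    # Standard normal CDF via the error function
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def f_U(u):
    # Change of variable for U = e^X: f_U(u) = f_X(h(u)) |dx/du| = phi(log u) / u
    return phi(math.log(u)) / u

h = 1e-6
for u in [0.5, 1.0, 2.0]:
    numeric = (Phi(math.log(u + h)) - Phi(math.log(u - h))) / (2 * h)
    print(u, round(f_U(u), 6), round(numeric, 6))
```

The two columns agree to several decimal places, confirming that the Jacobian factor $1/u$ is exactly what converts the normal density into the density of $e^X$.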

22.4 Example: density of a sum via Jacobian machinery

Set $$U=X, \qquad V=X+Y.$$ Then the inverse map is $$x=u, \qquad y=v-u.$$ The Jacobian determinant is 1, so $$f_{U,V}(u,v)=f_{X,Y}(u,v-u).$$

Marginalizing over $U$ recovers the standard convolution formula: $$f_V(v)=\int_{-\infty}^{\infty} f_{X,Y}(u,v-u)\,du.$$ Since $V=X+Y$, this is exactly the density formula for a sum.
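The convolution formula can be checked numerically. Our example: $X, Y$ i.i.d. Uniform$(0,1)$, whose sum has the triangular density $\min(v, 2-v)$ on $[0,2]$; we approximate the integral with a midpoint Riemann sum.

```python
# X, Y i.i.d. Uniform(0, 1): the sum V = X + Y has density min(v, 2 - v) on [0, 2].
def f_joint(x, y):
    # Joint density of two independent Uniform(0, 1) variables
    return 1.0 if (0 <= x <= 1 and 0 <= y <= 1) else 0.0

def f_sum(v, steps=10_000):
    # Midpoint Riemann sum for the integral of f_{X,Y}(u, v - u) over u in [0, 1]
    du = 1.0 / steps
    return sum(f_joint((k + 0.5) * du, v - (k + 0.5) * du) for k in range(steps)) * du

for v in [0.5, 1.0, 1.5]:
    print(v, min(v, 2 - v), round(f_sum(v), 4))
```

The numerical convolution reproduces the triangular density, which is the same answer the Jacobian route delivers without guessing the convolution formula in advance.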

22.5 Example: sum and difference of i.i.d. standard normals

If $X,Y$ are i.i.d. $N(0,1)$ and $$U=X+Y, \qquad V=X-Y,$$ then the inverse transformation is $$x=\frac{u+v}{2}, \qquad y=\frac{u-v}{2},$$ and $|\det J|=2$.

Plugging into the joint normal density gives $$f_{U,V}(u,v)=\frac{1}{4\pi}\exp\!\left(-\frac{u^2}{4}\right)\exp\!\left(-\frac{v^2}{4}\right).$$ The density factors, so $U$ and $V$ are independent. Each has mean 0 and variance 2, hence $$U,V \overset{\text{ind}}{\sim} N(0,2).$$
This is a clean illustration of what factorization buys us: once the transformed joint density splits into a product of a function of $u$ and a function of $v$, independence is immediate.
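The factorization claim is also visible in simulation. A sketch (our own seed and sample size) that checks the variances of $U$ and $V$ and their covariance:

```python
import random

random.seed(0)
n = 100_000
pairs = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(n)]
u = [x + y for x, y in pairs]
v = [x - y for x, y in pairs]

mean_u = sum(u) / n
mean_v = sum(v) / n
var_u = sum((a - mean_u) ** 2 for a in u) / n                        # theory: 2
var_v = sum((a - mean_v) ** 2 for a in v) / n                        # theory: 2
cov_uv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v)) / n  # theory: 0
print(round(var_u, 2), round(var_v, 2), round(cov_uv, 3))
```

Zero covariance alone would only show $U$ and $V$ are uncorrelated; it is because $(U, V)$ is jointly normal that uncorrelated upgrades to independent, matching the factored density.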

22.6 Example: beta-gamma algebra

Let $X\sim \Gamma(r,\lambda)$ and $Y\sim \Gamma(s,\lambda)$ independently. Define $$U=X+Y, \qquad V=\frac{X}{X+Y}.$$ The inverse map is $$x=uv, \qquad y=u(1-v).$$

Using the inverse Jacobian matrix, $$K=\begin{bmatrix} v & u \\ 1-v & -u \end{bmatrix}, \qquad |\det K|=u.$$
After substitution, the transformed density factors into a gamma density in $u$ and a beta density in $v$: $$f_{U,V}(u,v)=f_U(u)f_V(v).$$ Therefore $$U\sim \Gamma(r+s,\lambda), \qquad V\sim \text{Beta}(r,s),$$ and $U$ and $V$ are independent.

This is the beta-gamma algebra result. It extends the familiar fact that sums of independent gamma variables with common rate stay gamma, and it adds the beautiful independence of the sum and the ratio.
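A simulation sketch makes the marginals visible (the shapes $r=2$, $s=3$, rate $\lambda=1.5$, and the seed are our own choices; note that `random.gammavariate` takes a shape and a SCALE, so the scale argument is $1/\lambda$):

```python
import random

random.seed(1)
r, s, lam = 2.0, 3.0, 1.5   # our example shapes and common rate
n = 50_000

# random.gammavariate(alpha, beta) takes shape alpha and SCALE beta = 1/rate
xs = [random.gammavariate(r, 1 / lam) for _ in range(n)]
ys = [random.gammavariate(s, 1 / lam) for _ in range(n)]
u = [x + y for x, y in zip(xs, ys)]
v = [x / (x + y) for x, y in zip(xs, ys)]

mean_u = sum(u) / n   # theory: Gamma(r+s, lam) mean = (r + s)/lam
mean_v = sum(v) / n   # theory: Beta(r, s) mean = r/(r + s)
print(round(mean_u, 2), round(mean_v, 2))
```

The simulated means match the Gamma$(r+s, \lambda)$ and Beta$(r, s)$ theory, which is a quick way to catch the common shape-versus-rate parameterization mistake before relying on the algebra.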

22.7 Why this matters for next week

Forward connection:
Beta-gamma algebra and one-dimensional change of variable lead to the $t$ and $F$ distributions. Those distributions are the calibration engine for inference in multiple regression.

Big Comparison Map

| Tool | Main assumptions | Typical output | What it is good for |
| --- | --- | --- | --- |
| Markov | $X\geq 0$, mean known | $P(X\geq t)\leq E[X]/t$ | Very general first-pass tail bound |
| Chebyshev | Finite variance | $P(\lvert X-\mu\rvert\geq t)\leq \sigma^2/t^2$ | Distribution-free concentration from second moments |
| Chernoff | MGF finite near the origin | $P(X\geq t)\leq \inf_{s>0} M_X(s)e^{-st}$ | Exponential tail bounds |
| Mills inequality | Normal tails | $P(Z>z)\leq \phi(z)/z$ | Sharper analytic bound for normal tails |
| Hoeffding | Independent bounded variables | Exponential bound using ranges $(b_i-a_i)$ | Sample sums, averages, empirical risk |
| Jacobian change of variable | Smooth invertible transformation | New density after transformation | Derive densities of sums, ratios, and reparameterizations |
| Beta-gamma algebra | Independent gammas with common rate | Independent gamma sum and beta ratio | Build $t$ and $F$ distributions for regression inference |
Two-lecture big picture:
Lecture 21 tells you how to control random fluctuation. Lecture 22 tells you how to rewrite complicated random objects into forms whose distributions you can recognize.

Master Summary and Formula Sheet

Lecture 21 core formulas

| Concept | Formula | Comment |
| --- | --- | --- |
| Markov | $P(X\geq t)\leq E[X]/t$ | Requires $X\geq 0$ |
| Chebyshev | $P(\lvert X-\mu\rvert\geq t)\leq \sigma^2/t^2$ | Distribution-free second-moment bound |
| Chernoff | $P(X\geq t)\leq \inf_{s>0} M_X(s)e^{-st}$ | Usually exponential tails |
| Standard normal mgf | $M_Z(s)=e^{s^2/2}$ | Comes from completing the square in the Gaussian integral |
| Normal density / Mills | $P(Z>z)\leq \phi(z)/z$, $P(\lvert Z\rvert>z)\leq 2\phi(z)/z$ | Normal-specific improvement |
| Hoeffding for sums | $P(\lvert S_n-E[S_n]\rvert\geq t)\leq 2\exp\!\left(-2t^2/\sum_i(b_i-a_i)^2\right)$ | Independent bounded variables |
| Hoeffding for averages in $[0,1]$ | $P(\lvert \bar X-E\bar X\rvert\geq t)\leq 2e^{-2nt^2}$ | Especially useful for empirical risk |

Lecture 22 core formulas

| Concept | Formula | Comment |
| --- | --- | --- |
| Hoeffding's lemma | $E[e^{sX}] \leq e^{s^2(b-a)^2/8}$ | For mean-zero $X\in[a,b]$ |
| Joint density transform | $f_{U,V}(u,v)=f_{X,Y}(h(u,v))\,\lvert\det K\rvert$ | Equivalent to dividing by $\lvert\det J\rvert$ |
| Density of a sum | $f_{X+Y}(v)=\int f_{X,Y}(u,v-u)\,du$ | Convolution from Jacobian method |
| Normal sum/difference | $X+Y$ and $X-Y$ are independent $N(0,2)$ | When $X,Y$ are i.i.d. $N(0,1)$ |
| Beta-gamma algebra | $X+Y\sim \Gamma(r+s,\lambda)$, $X/(X+Y)\sim \text{Beta}(r,s)$ | And they are independent |
Fast memory aid:
Bounded independent variables lead to Hoeffding. Smooth invertible transformations lead to Jacobians. Independent gammas lead to beta-gamma algebra.

Common Mistakes

1. Forgetting to scale the variance of the sample mean.
If $\Var(X_i)=\sigma^2$, then $\Var(\bar X)=\sigma^2/n$, not $\sigma^2$.
2. Thinking Hoeffding requires i.i.d. data.
It only needs independence and bounded ranges.
3. Treating Chernoff or Hoeffding bounds as exact tail probabilities.
They are upper bounds, often useful precisely because exact probabilities are hard.
4. Dropping the absolute value on the Jacobian determinant.
Orientation can flip, but densities must stay nonnegative.
5. Forgetting to track support after a transformation.
For example, in beta-gamma algebra we need $u>0$ and $0<v<1$.

Data 145 Study Guide - Lectures 21-22 - Standalone Review Version