Data 145: Evidence and Uncertainty
Comprehensive Study Guide - Lectures 21 through 22 - Spring 2026
Instructors: Ani Adhikari, William Fithian
Bridge: From Error Rates to Tail Probabilities
Lecture 20 focused on controlling errors when many decisions are made at once. Lecture 21 shifts from decision procedures to random fluctuations
themselves: how likely is a statistic to wander far from its target?
The new question is not "how many false discoveries might we make?" but
"how concentrated is a random quantity around the value we care about?"
In machine learning language: how far can an empirical quantity drift from its population target?
This makes concentration inequalities a natural bridge to later material. Regression, empirical risk, and generalization bounds all depend on
controlling the tails of sums and averages, even when exact tail probabilities are hard or impossible to compute.
Lecture 21: Concentration Inequalities
21.1 Why tail bounds matter
Suppose a statistic based on data converges to a target as sample size grows: an empirical risk to its expected loss, a sample mean to its
population mean, or a regression quantity to its limiting parameter. Usually we do not need exact tail probabilities. We need guarantees that
the bad tail is smaller than some tolerance.
A concentration inequality is an upper bound on tail probability. If the bound is small, we know most of the probability mass
lies close to the target value.
Tail bounds answer practical questions such as:
- Is the error probability below $10^{-6}$?
- How large must $n$ be to make the error probability smaller than a target $\delta$?
- How fast does the statistic concentrate as $n$ grows?
21.2 Markov and Chebyshev: the basic polynomial bounds
Markov's inequality. If $X \geq 0$ and $E[X]=\mu$, then for $t>0$, $$P(X \geq t) \leq \frac{\mu}{t}.$$
The proof idea is the same one from earlier probability: $$\mathbf{1}\{X \geq t\} \leq \frac{X}{t}.$$ Taking expectations gives the result
immediately.
Chebyshev's inequality. If $X$ has mean $\mu$ and variance $\sigma^2$, then $$P(|X-\mu| \geq t) \leq \frac{\sigma^2}{t^2}.$$
This follows by applying Markov to $(X-\mu)^2$.
Markov uses only a first moment and gives a $1/t$ tail. Chebyshev uses a second moment and improves that to $1/t^2$. These are broad, robust,
and often crude.
For a single right tail, Chebyshev still gives $$P(X-\mu \geq t) \leq \frac{\sigma^2}{t^2}.$$ If the distribution is symmetric about $\mu$, the
one-sided bound improves to $$P(X-\mu \geq t) \leq \frac{\sigma^2}{2t^2}.$$
For a standard normal $Z$, symmetry plus $\operatorname{Var}(Z)=1$ gives $$P(Z > z) \leq \frac{1}{2z^2}, \qquad z>0.$$ This is valid but not especially
sharp for far-out tails.
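As a quick numerical sanity check (not part of the lecture; the test points are arbitrary choices), we can confirm that the one-sided Chebyshev bound $1/(2z^2)$ really does sit above the exact normal tail, computed here via the complementary error function:

```python
import math

def normal_tail(z):
    # Exact P(Z > z) for standard normal Z, via the complementary error function:
    # P(Z > z) = 0.5 * erfc(z / sqrt(2)).
    return 0.5 * math.erfc(z / math.sqrt(2))

def chebyshev_one_sided(z):
    # One-sided Chebyshev bound for a symmetric, unit-variance variable: 1/(2 z^2).
    return 1.0 / (2.0 * z * z)

# The bound should dominate the exact tail at every test point.
for z in (1.0, 2.0, 3.0):
    assert normal_tail(z) <= chebyshev_one_sided(z)
```

At $z=3$ the exact tail is about $0.00135$ while the bound is $1/18 \approx 0.056$: valid, but crude, just as the text says.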
21.3 Chernoff bound: exponential tails from MGFs
The moment generating function of $X$ is $$M_X(s)=E(e^{sX}).$$ Since
$$e^{sX}=1+\frac{sX}{1!}+\frac{s^2X^2}{2!}+\frac{s^3X^3}{3!}+\cdots,$$ taking expectations gives
$$M_X(s)=1+sE[X]+\frac{s^2}{2!}E[X^2]+\frac{s^3}{3!}E[X^3]+\cdots.$$ So the mgf packages all the moments $E[X^k]$ into one function.
Here $s$ is just a fixed tuning parameter that we choose. The random input is still $X$. Some books use $t$ instead of $s$; it plays
the same role. Chernoff's method works by picking a convenient value of this parameter and then optimizing over it.
If the moment generating function $M_X(s)=E(e^{sX})$ is finite for some $s>0$, then for any $t$ and any $s>0$, $$P(X \geq t) = P(e^{sX}
\geq e^{st}) \leq M_X(s)e^{-st}.$$
Optimizing over $s>0$ gives the Chernoff bound: $$P(X \geq t) \leq \inf_{s>0} M_X(s)e^{-st}.$$ This often yields
exponentially decaying tails.
Chernoff is still just Markov's inequality, but applied to the exponential transform $e^{sX}$. The entire trick is to choose a transform that
turns a hard tail event into something whose expectation is easy to manage.
For a standard normal $Z$, $$M_Z(s)=E(e^{sZ})=\int_{-\infty}^{\infty} e^{sz}\frac{1}{\sqrt{2\pi}}e^{-z^2/2}\,dz = e^{s^2/2}.$$ The usual
derivation completes the square in the exponent: $$sz-\frac{z^2}{2} = -\frac{1}{2}(z-s)^2+\frac{s^2}{2}.$$ So the integral becomes $e^{s^2/2}$
times the integral of a normal density centered at $s$, and that integral is 1.
For a standard normal $Z$, optimizing the Chernoff bound gives $$P(Z > z) \leq e^{-z^2/2}, \qquad z>0.$$ This is dramatically smaller than
the Chebyshev bound in the tails.
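The optimization step can be checked numerically. The sketch below (grid size and tolerance are arbitrary choices, not from lecture) scans $s>0$ and confirms that the minimum of $M_Z(s)e^{-sz}=e^{s^2/2 - sz}$ lands at $s=z$ with value $e^{-z^2/2}$:

```python
import math

def chernoff_objective(s, z):
    # M_Z(s) * e^{-s z} for a standard normal: exp(s^2/2 - s z).
    return math.exp(0.5 * s * s - s * z)

z = 2.5
# Scan a fine grid of s values; the minimizer should be s = z,
# giving the optimized Chernoff bound exp(-z^2/2).
grid = [i / 1000.0 for i in range(1, 6001)]
best = min(chernoff_objective(s, z) for s in grid)
assert abs(best - math.exp(-0.5 * z * z)) < 1e-6
```

This mirrors the calculus: differentiating the exponent $s^2/2 - sz$ and setting it to zero gives $s^*=z$.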
21.4 Sample means, CLT, and why exponential bounds are useful
Let $X_1,\ldots,X_n$ be i.i.d. with mean $\mu$ and variance $\sigma^2$, and let $\bar X$ be the sample mean. The CLT says $$\frac{\bar
X-\mu}{\sigma/\sqrt{n}} \approx N(0,1)$$ for large $n$, so $$P(|\bar X-\mu| > \epsilon) \approx 2P\!\left(Z >
\frac{\sqrt{n}\epsilon}{\sigma}\right).$$
Bounding the normal tail gives:
- Chebyshev: $$P(|\bar X-\mu| > \epsilon) \leq \frac{\sigma^2}{n\epsilon^2}$$
- Chernoff-style normal tail bound: $$P(|\bar X-\mu| > \epsilon) \lesssim 2\exp\!\left(-\frac{n\epsilon^2}{2\sigma^2}\right)$$
If $X_i \sim \text{Bernoulli}(1/2)$ i.i.d., then $$P(|\bar X-1/2| > \epsilon)$$ is bounded by $$\frac{1}{4n\epsilon^2} \quad
\text{(Chebyshev)} \qquad \text{and} \qquad 2e^{-2n\epsilon^2} \quad \text{(Chernoff/Hoeffding form).}$$ The exponential bound is far more
useful for sample-size planning.
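To make the planning point concrete, here is a small sketch (the target values of $\epsilon$ and $\delta$ are illustrative) that inverts each bound for the Bernoulli$(1/2)$ case:

```python
import math

def n_chebyshev(eps, delta):
    # Smallest n with 1/(4 n eps^2) <= delta, from the Chebyshev bound
    # for Bernoulli(1/2) sample means.
    return math.ceil(1.0 / (4.0 * delta * eps ** 2))

def n_hoeffding(eps, delta):
    # Smallest n with 2 * exp(-2 n eps^2) <= delta, from the Hoeffding bound.
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

eps, delta = 0.01, 1e-6
# Chebyshev demands about 2.5 billion samples; Hoeffding only about 72,500.
assert n_hoeffding(eps, delta) < n_chebyshev(eps, delta)
```

The exponential bound turns a hopeless sample-size requirement into a routine one, because $n$ scales with $\log(2/\delta)$ rather than $1/\delta$.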
21.5 Mills inequality: a sharper normal-tail bound
For standard normal $Z$ with density $$\phi(z)=\frac{1}{\sqrt{2\pi}}e^{-z^2/2},$$ the one-sided normal-density bound is $$P(Z > z) \leq
\frac{1}{z}\phi(z), \qquad z>0.$$
Graphically, this is another indicator comparison. For fixed $z>0$, the tail indicator is $$\mathbf{1}\{x \geq z\}= \begin{cases} 0, &
x<z \\ 1, & x\geq z \end{cases}$$ and on $[z,\infty)$ we have $\mathbf{1}\{x\ge z\}\le x/z$. So the function $x/z$ sits above the
indicator on the right tail.
The lecture proof compares the indicator of the tail to $x/z$ on $[z,\infty)$: $$P(Z>z)=\int_z^\infty \phi(x)\,dx \leq \int_z^\infty
\frac{x}{z}\phi(x)\,dx = \frac{1}{z}\phi(z).$$ So again the pattern is: upper-bound an indicator by a friendlier function and integrate.
By symmetry, the two-sided form is $$P(|Z| > z) \leq 2\frac{1}{z}\phi(z)=\sqrt{\frac{2}{\pi}}\frac{1}{z}e^{-z^2/2}.$$ This is often called
Mills inequality in the course notes.
Compared with Chernoff's $e^{-z^2/2}$, Mills adds the helpful factor $1/(z\sqrt{2\pi})$. So for normal tails it is a sharper analytic bound with
almost no extra work.
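The ordering exact tail $\leq$ Mills $\leq$ Chernoff can be verified directly; this is a quick check (test points are arbitrary) rather than anything proved in lecture:

```python
import math

def normal_tail(z):
    # Exact P(Z > z) via the complementary error function.
    return 0.5 * math.erfc(z / math.sqrt(2))

def chernoff(z):
    # Optimized Chernoff bound for the standard normal tail.
    return math.exp(-0.5 * z * z)

def mills(z):
    # Mills bound: phi(z) / z.
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return phi / z

# For moderate-to-large z, each bound dominates the previous quantity.
for z in (2.0, 3.0, 4.0):
    assert normal_tail(z) <= mills(z) <= chernoff(z)
```

At $z=4$, for instance, Mills is roughly a factor of $10$ below Chernoff, reflecting the extra $1/(z\sqrt{2\pi})$.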
21.6 Hoeffding's inequality: independent, bounded, not necessarily i.i.d.
If $X_i \in [a_i,b_i]$ are independent and $S_n=\sum_{i=1}^n X_i$, then for $t>0$, $$P(S_n-E[S_n]\geq t) \leq
\exp\!\left(\frac{-2t^2}{\sum_{i=1}^n (b_i-a_i)^2}\right).$$ The same bound holds for the left tail, hence $$P(|S_n-E[S_n]| \geq t) \leq
2\exp\!\left(\frac{-2t^2}{\sum_{i=1}^n (b_i-a_i)^2}\right).$$
Hoeffding assumes independence and bounded support, but it does not assume the variables are identically
distributed.
For the sample average, $$P(|\bar X-E\bar X| \geq t) \leq 2\exp\!\left(\frac{-2n^2 t^2}{\sum_{i=1}^n (b_i-a_i)^2}\right).$$ If every
$X_i\in[0,1]$, this simplifies to $$P(|\bar X-E\bar X| \geq t) \leq 2e^{-2nt^2}.$$
This is worth stating explicitly: when $a_i=0$ and $b_i=1$ for all $i$, each range length is $b_i-a_i=1$, so Hoeffding reduces to $$P(|\bar
X-E\bar X| \geq t) \leq 2e^{-2nt^2},$$ exactly the Chernoff-style rate we saw for i.i.d. Bernoulli$(1/2)$ sample means. The lecture's puzzle explains
why that should not feel accidental: among all distributions on $[0,1]$, the unique one with largest variance is Bernoulli$(1/2)$.
21.7 Application: empirical risk in binary classification
For a classifier $f$, define the $0$-$1$ loss $$\ell(f(X_i),Y_i)=\mathbf{1}\{f(X_i)\neq Y_i\},$$ the empirical risk $$\hat
R(f)=\frac{1}{n}\sum_{i=1}^n \ell(f(X_i),Y_i),$$ and its expectation $R(f)=E[\hat R(f)]$.
Because each loss lies in $[0,1]$, Hoeffding gives $$P(|\hat R(f)-R(f)| \geq \epsilon) \leq 2e^{-2n\epsilon^2}.$$
Sample-size consequence:
To guarantee $$P(|\hat R(f)-R(f)| \geq \epsilon) \leq \delta,$$ it is enough to choose $$n \geq
\frac{1}{2\epsilon^2}\log\!\left(\frac{2}{\delta}\right).$$
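The sample-size formula is easy to implement; a minimal sketch (the particular $\epsilon$ and $\delta$ below are illustrative choices):

```python
import math

def hoeffding_sample_size(eps, delta):
    # n >= log(2/delta) / (2 eps^2) guarantees
    # P(|R_hat(f) - R(f)| >= eps) <= delta for 0-1 loss.
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

# Example: estimate the risk of a fixed classifier to within 0.05,
# with failure probability at most 0.05.
n = hoeffding_sample_size(0.05, 0.05)

# Sanity check: plugging n back into Hoeffding recovers the guarantee.
assert 2.0 * math.exp(-2.0 * n * 0.05 ** 2) <= 0.05
```

Note the guarantee is for a single fixed classifier $f$; comparing many classifiers at once brings back the multiple-decisions issues from Lecture 20.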
Bridge: From Hoeffding to Jacobians
Lecture 22 starts by proving the concentration result that Lecture 21 stated, then abruptly pivots to multivariable change of variable. At first
glance those topics look unrelated.
The real connection is methodological. Both parts of Lecture 22 introduce reusable tools:
an exponential-moment machine for tail bounds and
a Jacobian machine for transformed densities.
Each tool becomes a standard move later in statistics and machine learning.
The first tool finishes the concentration story. The second tool prepares the ground for deriving the $t$ and $F$ distributions that appear in
multiple regression.
Lecture 22: Multivariable Methods
22.1 Proof of Hoeffding's inequality
Let $S_n=\sum_{i=1}^n X_i$ with independent $X_i\in[a_i,b_i]$. For any $s>0$, $$P(S_n-E[S_n]\geq t) = P\!\left(e^{s(S_n-E[S_n])}\geq
e^{st}\right) \leq e^{-st}E\!\left[e^{s(S_n-E[S_n])}\right].$$ Independence factors the expectation: $$e^{-st}\prod_{i=1}^n
E[e^{s(X_i-E[X_i])}].$$ Apply Hoeffding's lemma to each factor: $$E[e^{s(X_i-E[X_i])}] \leq e^{s^2(b_i-a_i)^2/8}.$$ If $c=\sum_{i=1}^n
(b_i-a_i)^2$, the bound becomes $$e^{-st+s^2c/8}.$$ Minimize the exponent by differentiating: $$-t+\frac{sc}{4}=0 \quad \Longrightarrow \quad
s^*=\frac{4t}{c}.$$ Plugging in gives $$P(S_n-E[S_n]\geq t) \leq \exp\!\left(\frac{-2t^2}{c}\right) = \exp\!\left(\frac{-2t^2}{\sum_{i=1}^n
(b_i-a_i)^2}\right).$$
Hoeffding is really Chernoff plus one special lemma for bounded random variables. Once the mgf of the centered variable is controlled, the rest
is just calculus.
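As an illustration that the bound really covers independent but non-identically distributed variables, here is a small Monte Carlo sketch (the ranges $b_i$, the tail level $t$, and the trial count are all arbitrary choices, not from lecture):

```python
import math
import random

random.seed(0)

# Independent but NOT identically distributed: X_i uniform on [0, b_i].
b = [1.0, 2.0, 0.5] * 10                 # ranges b_i - a_i, with a_i = 0 (n = 30)
mean_sum = sum(bi / 2.0 for bi in b)     # E[S_n]
c = sum(bi * bi for bi in b)             # sum of squared ranges

t = 3.0
bound = math.exp(-2.0 * t * t / c)       # one-sided Hoeffding bound

# Estimate P(S_n - E[S_n] >= t) by simulation; it should not exceed the bound.
trials = 20000
hits = sum(
    1 for _ in range(trials)
    if sum(random.uniform(0.0, bi) for bi in b) - mean_sum >= t
)
assert hits / trials <= bound
```

The empirical tail frequency is far below the bound here, which is typical: Hoeffding trades sharpness for complete generality over bounded distributions.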
22.2 Hoeffding's lemma: what makes the proof work
Hoeffding's lemma. If $X\in[a,b]$ and $E[X]=0$, then for $s>0$, $$E[e^{sX}] \leq e^{s^2(b-a)^2/8}.$$
The optional proof in lecture has two main ingredients:
- Use convexity of $x\mapsto e^{sx}$ to bound $e^{sX}$ by the chord joining the endpoints $a$ and $b$.
- Take logs and use a second-order Taylor expansion, with a uniform bound on the second derivative, to show the log of the bound is at most
$w^2/8$, where $w=s(b-a)$.
The conclusion is the sub-Gaussian-style mgf bound above.
22.3 Change of variable for joint densities
Now let $(X,Y)$ have joint density $f_{X,Y}$ and define $$(U,V)=g(X,Y)=(g_1(X,Y),g_2(X,Y)).$$ Assume $g$ is smooth and invertible, with inverse
$$(X,Y)=h(U,V)=(h_1(U,V),h_2(U,V)).$$
Let $$J=\begin{bmatrix} \partial g_1/\partial x & \partial g_1/\partial y \\ \partial g_2/\partial x & \partial g_2/\partial y
\end{bmatrix}, \qquad K=\begin{bmatrix} \partial h_1/\partial u & \partial h_1/\partial v \\ \partial h_2/\partial u & \partial
h_2/\partial v \end{bmatrix}.$$ Then $K=J^{-1}$ and $$|\det J|=\frac{1}{|\det K|}.$$
The joint density of $(U,V)$ is $$f_{U,V}(u,v)=\frac{f_{X,Y}(x,y)}{|\det J|}\quad \text{at }(x,y)=h(u,v),$$ or equivalently
$$f_{U,V}(u,v)=f_{X,Y}(h_1(u,v),h_2(u,v))\,|\det K|.$$
This is exactly the multivariable version of the one-variable change-of-variable formula. If $U=g(X)$ and $X=h(U)=g^{-1}(U)$, then
$$\frac{du}{dx}=g'(x), \qquad \frac{dx}{du}=h'(u)=\frac{1}{du/dx}.$$ So in one dimension we can write $$f_U(u)=\frac{f_X(x)}{|du/dx|}\quad
\text{at }x=h(u),$$ or equivalently $$f_U(u)=f_X(h(u))\,|dx/du|.$$ One version has the derivative in the denominator, the other has the inverse
derivative in the numerator. The Jacobian formulas above are the same two viewpoints, just in matrix form.
The absolute value on the determinant matters. A Jacobian can be negative, but densities cannot.
22.4 Example: density of a sum via Jacobian machinery
Set $$U=X, \qquad V=X+Y.$$ Then the inverse map is $$x=u, \qquad y=v-u.$$ The Jacobian determinant is 1, so $$f_{U,V}(u,v)=f_{X,Y}(u,v-u).$$
Marginalizing over $U$ recovers the standard convolution formula: $$f_V(v)=\int_{-\infty}^{\infty} f_{X,Y}(u,v-u)\,du.$$ Since $V=X+Y$, this is
exactly the density formula for a sum.
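A quick numerical check of the convolution formula (the exponential example and grid resolution are illustrative choices): if $X,Y$ are i.i.d. Exp$(1)$, then $f_{X,Y}(u,v-u)=e^{-u}e^{-(v-u)}=e^{-v}$ on $0<u<v$, and the integral gives $f_V(v)=ve^{-v}$, the Gamma$(2,1)$ density.

```python
import math

def f_V_numeric(v, steps=10000):
    # Midpoint-rule approximation of the convolution integral
    # f_V(v) = int_0^v f_X(u) f_Y(v-u) du for i.i.d. Exp(1) variables.
    du = v / steps
    total = 0.0
    for k in range(steps):
        u = du * (k + 0.5)
        total += math.exp(-u) * math.exp(-(v - u)) * du
    return total

v = 1.7
# Compare with the closed form v * e^{-v} (Gamma(2,1) density).
assert abs(f_V_numeric(v) - v * math.exp(-v)) < 1e-4
```

Note the support bookkeeping: the integrand vanishes outside $0<u<v$, which is why the integral runs only over that interval.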
22.5 Example: sum and difference of i.i.d. standard normals
If $X,Y$ are i.i.d. $N(0,1)$ and $$U=X+Y, \qquad V=X-Y,$$ then the inverse transformation is $$x=\frac{u+v}{2}, \qquad y=\frac{u-v}{2},$$ and
$|\det J|=2$.
Plugging into the joint normal density gives $$f_{U,V}(u,v)=\frac{1}{4\pi}\exp\!\left(-\frac{u^2}{4}\right)\exp\!\left(-\frac{v^2}{4}\right).$$
The density factors, so $U$ and $V$ are independent. Each has mean 0 and variance 2, hence $$U,V \overset{\text{ind}}{\sim} N(0,2).$$
This is a clean illustration of what factorization buys us: once the transformed joint density splits into a product of a function of $u$ and a
function of $v$, independence is immediate.
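A simulation makes the conclusion tangible (sample size and tolerances below are arbitrary choices): the sum and difference of i.i.d. standard normals should each have variance near 2 and be uncorrelated.

```python
import random
import statistics

random.seed(1)
n = 200000
u_vals, v_vals = [], []
for _ in range(n):
    x, y = random.gauss(0, 1), random.gauss(0, 1)
    u_vals.append(x + y)   # U = X + Y
    v_vals.append(x - y)   # V = X - Y

# Each of U and V should have variance close to 2 ...
assert abs(statistics.pvariance(u_vals) - 2.0) < 0.05
assert abs(statistics.pvariance(v_vals) - 2.0) < 0.05

# ... and their sample covariance should be near 0, consistent with independence.
cov = sum(u * v for u, v in zip(u_vals, v_vals)) / n
assert abs(cov) < 0.05
```

Zero covariance alone does not prove independence in general, but here the factorized joint density does; the simulation is only a consistency check.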
22.6 Example: beta-gamma algebra
Let $X\sim \Gamma(r,\lambda)$ and $Y\sim \Gamma(s,\lambda)$ independently. Define $$U=X+Y, \qquad V=\frac{X}{X+Y}.$$ The inverse map is $$x=uv,
\qquad y=u(1-v).$$
Using the inverse Jacobian matrix, $$K=\begin{bmatrix} v & u \\ 1-v & -u \end{bmatrix}, \qquad |\det K|=u.$$
After substitution, the transformed density factors into a gamma density in $u$ and a beta density in $v$: $$f_{U,V}(u,v)=f_U(u)f_V(v).$$
Therefore
- $U \sim \Gamma(r+s,\lambda)$,
- $V \sim \text{Beta}(r,s)$,
- $U$ and $V$ are independent.
This is the beta-gamma algebra result. It extends the familiar fact that sums of independent gamma variables with common rate stay gamma, and it
adds the beautiful independence of the sum and the ratio.
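A simulation sketch of the ratio part of the result (the shape and rate parameters below are illustrative choices): with $X\sim\Gamma(r,\lambda)$ and $Y\sim\Gamma(s,\lambda)$ independent, $V=X/(X+Y)$ should behave like a Beta$(r,s)$ variable, whose mean is $r/(r+s)$.

```python
import random

random.seed(2)
r, s, lam = 2.0, 3.0, 1.5
n = 100000

v_vals = []
for _ in range(n):
    # random.gammavariate takes (shape, scale); rate lam means scale 1/lam.
    x = random.gammavariate(r, 1.0 / lam)
    y = random.gammavariate(s, 1.0 / lam)
    v_vals.append(x / (x + y))

# Beta(r, s) has mean r/(r+s) = 0.4 for these parameters.
mean_v = sum(v_vals) / n
assert abs(mean_v - r / (r + s)) < 0.01
```

Notice that $\lambda$ drops out of the distribution of $V$ entirely, which is one reason the common-rate assumption is essential.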
22.7 Why this matters for next week
Forward connection:
Beta-gamma algebra and one-dimensional change of variable lead to the $t$ and $F$ distributions. Those distributions are the calibration engine
for inference in multiple regression.
Big Comparison Map
| Tool | Main assumptions | Typical output | What it is good for |
|---|---|---|---|
| Markov | $X\geq 0$, mean known | $P(X\geq t)\leq E[X]/t$ | Very general first-pass tail bound |
| Chebyshev | Finite variance | $P(\lvert X-\mu\rvert\geq t)\leq \sigma^2/t^2$ | Distribution-free concentration from second moments |
| Chernoff | MGF finite near the origin | $P(X\geq t)\leq \inf_{s>0} M_X(s)e^{-st}$ | Exponential tail bounds |
| Mills inequality | Normal tails | $P(Z>z)\leq \phi(z)/z$ | Sharper analytic bound for normal tails |
| Hoeffding | Independent bounded variables | Exponential bound using ranges $(b_i-a_i)$ | Sample sums, averages, empirical risk |
| Jacobian change of variable | Smooth invertible transformation | New density after transformation | Derive densities of sums, ratios, and reparameterizations |
| Beta-gamma algebra | Independent gammas with common rate | Independent gamma sum and beta ratio | Build $t$ and $F$ distributions for regression inference |
Two-lecture big picture:
Lecture 21 tells you how to control random fluctuation. Lecture 22 tells you how to rewrite complicated random objects into forms whose
distributions you can recognize.
Master Summary and Formula Sheet
Lecture 21 core formulas
| Concept | Formula | Comment |
|---|---|---|
| Markov | $P(X\geq t)\leq E[X]/t$ | Requires $X\geq 0$ |
| Chebyshev | $P(\lvert X-\mu\rvert\geq t)\leq \sigma^2/t^2$ | Distribution-free second-moment bound |
| Chernoff | $P(X\geq t)\leq \inf_{s>0} M_X(s)e^{-st}$ | Usually exponential tails |
| Standard normal mgf | $M_Z(s)=e^{s^2/2}$ | Comes from completing the square in the Gaussian integral |
| Normal density / Mills | $P(Z>z)\leq \phi(z)/z$, $P(\lvert Z\rvert>z)\leq 2\phi(z)/z$ | Normal-specific improvement |
| Hoeffding for sums | $P(\lvert S_n-E[S_n]\rvert\geq t)\leq 2\exp\!\left(-2t^2/\sum_i(b_i-a_i)^2\right)$ | Independent bounded variables |
| Hoeffding for averages in $[0,1]$ | $P(\lvert \bar X-E\bar X\rvert\geq t)\leq 2e^{-2nt^2}$ | Especially useful for empirical risk |
Lecture 22 core formulas
| Concept | Formula | Comment |
|---|---|---|
| Hoeffding's lemma | $E[e^{sX}] \leq e^{s^2(b-a)^2/8}$ | For mean-zero $X\in[a,b]$ |
| Joint density transform | $f_{U,V}(u,v)=f_{X,Y}(h(u,v))\,\lvert\det K\rvert$ | Equivalent to dividing by $\lvert\det J\rvert$ |
| Density of a sum | $f_{X+Y}(v)=\int f_{X,Y}(u,v-u)\,du$ | Convolution from Jacobian method |
| Normal sum/difference | $X+Y$ and $X-Y$ are independent $N(0,2)$ | When $X,Y$ are i.i.d. $N(0,1)$ |
| Beta-gamma algebra | $X+Y\sim \Gamma(r+s,\lambda)$, $X/(X+Y)\sim \text{Beta}(r,s)$ | And they are independent |
Fast memory aid:
Bounded independent variables lead to Hoeffding. Smooth invertible transformations lead to Jacobians. Independent gammas lead to beta-gamma
algebra.
Common Mistakes
1. Forgetting to scale the variance of the sample mean.
If $\Var(X_i)=\sigma^2$, then $\Var(\bar X)=\sigma^2/n$, not $\sigma^2$.
2. Thinking Hoeffding requires i.i.d. data.
It only needs independence and bounded ranges.
3. Treating Chernoff or Hoeffding bounds as exact tail probabilities.
They are upper bounds, often useful precisely because exact probabilities are hard.
4. Dropping the absolute value on the Jacobian determinant.
Orientation can flip, but densities must stay nonnegative.
5. Forgetting to track support after a transformation.
For example, in beta-gamma algebra we need $u>0$ and $0<v<1$.