Data 145: Evidence and Uncertainty
Comprehensive Study Guide - Lectures 19 through 20 - Spring 2026
Instructors: Ani Adhikari, William Fithian
Bridge: From Single Tests to Many Tests
By Lecture 18, the course had built a full toolbox for one question at a time: choose the right test, compute one $p$-value, and compare it to
a single level $\alpha$. That framework works well when the scientific problem itself is singular.
Lecture 19 changes the unit of analysis. The issue is no longer "what is the right test for one hypothesis?" It is
"what should error control mean when we test dozens, thousands, or even infinitely many related hypotheses at once?"
This is the natural sequel to Lecture 18. Once the class has many testing procedures available, the next real-world obstacle is that modern data
analysis rarely stops after one test. Variable selection, genomics, A/B testing platforms, and simultaneous confidence statements all force us
to think at the level of a family of inferences.
The conceptual move is:
single-test Type I error -> familywise error -> simultaneous inference -> false discovery rate.
Lecture 19: Multiple Testing and Simultaneous Inference
19.1 Motivating example: prostate cancer gene expression
Singh et al. measured expression for $6{,}033$ genes in healthy controls and prostate cancer patients. Testing each gene separately at
$\alpha=0.05$ produced 478 significant genes. But if all $6{,}033$ nulls were true, we would still expect about $6{,}033 \cdot 0.05 \approx 302$
false positives just from noise.
So the existence of many small $p$-values does not automatically mean we have many trustworthy discoveries. The main danger is that
false rejections accumulate when the number of tests is large.
19.2 Setup and notation
We test $m$ null hypotheses $H_{01},\ldots,H_{0m}$ with $p$-values $p_1,\ldots,p_m$.
- $\mathcal{R}(X)=\{i: H_{0i} \text{ is rejected}\}$ is the rejection set, with $R=|\mathcal{R}|$.
- $\mathcal{H}_0=\{i: H_{0i} \text{ is true}\}$ is the set of true nulls, with $m_0=|\mathcal{H}_0|$.
- At least one false rejection occurs when $\mathcal{R}\cap\mathcal{H}_0 \neq \emptyset$.
If all $m$ nulls are true and each test is run at level $\alpha$, then under independence: $$P(\text{at least one rejection}) =
1-(1-\alpha)^m.$$ For $m=6{,}033$ and $\alpha=0.05$, this is essentially 1.
A level-$0.05$ test sounds conservative in isolation. It is not conservative when repeated thousands of times. Multiple testing is mostly about
translating the words "Type I error" from the one-test world into the many-test world.
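The arithmetic above is easy to check directly. A minimal sketch, using the numbers from the prostate example (the second line assumes independent tests):

```python
# How per-test errors accumulate over m tests at level alpha.
m = 6033
alpha = 0.05

expected_false = m * alpha           # expected false positives if all m nulls are true
p_any = 1 - (1 - alpha) ** m         # P(at least one rejection), assuming independence

print(round(expected_false))         # about 302
print(p_any)                         # essentially 1
```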
19.3 Familywise error rate (FWER)
The familywise error rate is the probability of making any false rejection: $$\FWER_\theta =
P_\theta(\mathcal{R}\cap\mathcal{H}_0 \neq \emptyset).$$ We want procedures with $\sup_\theta \FWER_\theta \leq \alpha$.
FWER is strict. It treats one false rejection among a thousand true rejections as a failure. That makes sense in high-stakes confirmatory
settings, but it can become very conservative in screening problems.
19.4 Bonferroni and where $\alpha/m$ comes from
Bonferroni correction. Reject $H_{0i}$ only if $p_i \leq \alpha/m$. Then $$\sup_\theta \FWER_\theta \leq \alpha.$$
This is exactly the union-bound argument from discussion: $$P(\text{any false rejection}) = P\!\left(\bigcup_{i\in\mathcal{H}_0}\{\text{reject
}H_{0i}\}\right) \leq \sum_{i\in\mathcal{H}_0} P(\text{reject }H_{0i}).$$ Under Bonferroni, each true null is tested at level $\alpha/m$, so
$$\sum_{i\in\mathcal{H}_0} P(\text{reject }H_{0i}) \leq m_0 \frac{\alpha}{m} \leq \alpha.$$ That is the whole reason the denominator is $m$: it
is the price paid to make the sum of all possible false-rejection probabilities stay below $\alpha$.
We divide by the total number of tests $m$, not the number of true nulls $m_0$, because $m_0$ is unknown. If we knew $m_0$, we
could use $\alpha/m_0$, but in practice we do not.
In the prostate study, Bonferroni uses the threshold $$\frac{0.05}{6033} \approx 8.3\times 10^{-6}.$$ The 478 marginal discoveries collapse to
just 3 Bonferroni discoveries. So Bonferroni is valid and simple, but often extremely harsh.
19.5 Sidak: what independence buys you
If the $p$-values are independent, the Sidak correction uses $$\tilde\alpha_m = 1-(1-\alpha)^{1/m}$$ and rejects when $p_i \leq
\tilde\alpha_m$.
Sidak comes from requiring the probability of no false rejection to be at least $1-\alpha$: under independence, when all nulls are true,
$(1-\tilde\alpha_m)^m = 1-\alpha$, and solving for $\tilde\alpha_m$ gives the formula. But for small per-test thresholds, Sidak and Bonferroni
are numerically very close. Independence helps a little, not dramatically.
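Numerically, the two thresholds are nearly identical at this scale; a quick sketch using the prostate study's $m$:

```python
# Bonferroni vs Sidak per-test thresholds for m tests at overall level alpha.
m, alpha = 6033, 0.05

bonferroni = alpha / m               # union-bound threshold
sidak = 1 - (1 - alpha) ** (1 / m)   # exact under independence

print(bonferroni)                    # about 8.3e-06
print(sidak)                         # slightly larger, about 8.5e-06
```

Sidak is slightly more permissive, but by under 3% here, which is why independence is said to help only a little.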
19.6 Scheffe's method and infinitely many related tests
Bonferroni works best when we can literally list the hypotheses. But some simultaneous inference problems involve infinitely many comparisons,
all built from the same data vector.
Suppose $X \sim N_d(\mu, I_d)$ and we want to test $$H_{0,\lambda}: \lambda^T \mu = 0$$ simultaneously for every unit vector $\lambda$.
Why unit vectors? Because scaling $\lambda$ does not change the underlying null statement: if $c \neq 0$, then $(c\lambda)^T\mu = 0$ is the same
hypothesis as $\lambda^T\mu=0$. Normalizing $\lambda$ just removes a meaningless scale choice.
Scheffe's method. Reject $H_{0,\lambda}$ when $$|\lambda^T X|^2 \geq \chi^2_{d,\alpha}.$$ This controls FWER at level $\alpha$
over all such contrasts.
Write $Z=X-\mu \sim N_d(0,I_d)$. If $H_{0,\lambda}$ is true, then $\lambda^T X = \lambda^T Z$. By Cauchy-Schwarz, the worst contrast over all
unit vectors satisfies $\sup_{\|\lambda\|=1} (\lambda^T Z)^2 = \|Z\|^2$, with equality at $\lambda = Z/\|Z\|$, and $\|Z\|^2 \sim \chi^2_d$. One
random confidence ball in $\mu$-space therefore controls an infinite family of contrast tests at once.
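The sup identity makes the FWER claim easy to check by simulation. A sketch under the global null $\mu=0$, with the dimension $d=5$ an arbitrary illustrative choice:

```python
# Monte Carlo check of the Scheffe argument: under the global null,
# sup over unit vectors of (lambda^T Z)^2 equals ||Z||^2 ~ chi^2_d,
# so cutting at the chi^2_{d,alpha} quantile controls FWER at alpha.
import numpy as np

rng = np.random.default_rng(0)
d, alpha, n_sim = 5, 0.05, 20000
cutoff = 11.0705                      # standard chi^2 upper-0.05 quantile, d = 5 df

Z = rng.standard_normal((n_sim, d))   # draws of X - mu under the global null
worst = np.sum(Z ** 2, axis=1)        # worst contrast equals the squared norm
fwer_hat = np.mean(worst >= cutoff)
print(round(fwer_hat, 3))             # close to 0.05
```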
19.7 Deduction principle and simultaneous confidence intervals
If $C(X)$ is any joint $(1-\alpha)$ confidence region for $\theta$, then any conclusion deduced under the assumption $\theta \in C(X)$ is
automatically FWER-controlled at level $\alpha$.
Intervals $C_1(X),\ldots,C_m(X)$ are simultaneous $(1-\alpha)$ confidence intervals for $g_1(\theta),\ldots,g_m(\theta)$ if
$$P_\theta\bigl(g_i(\theta)\in C_i(X)\text{ for all }i\bigr)\geq 1-\alpha.$$
Bonferroni also gives simultaneous intervals. If each individual interval has marginal coverage $1-\alpha/m$, then $$P(\text{any interval
misses}) \leq \sum_{i=1}^m \alpha/m = \alpha.$$ So marginal $1-\alpha/m$ intervals become simultaneous $1-\alpha$ intervals.
In correlated Gaussian settings, a deduced confidence region can beat Bonferroni. For example, if $X \sim N_d(\theta,\Sigma)$ and $t_\alpha$ is
the upper-$\alpha$ quantile of $\|X-\theta\|_\infty$, then $$[X_i-t_\alpha,\; X_i+t_\alpha]$$ gives exact simultaneous intervals for all
coordinates. Strong positive correlation can make $t_\alpha$ noticeably smaller than the Bonferroni width.
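A simulation makes the comparison concrete. The equicorrelated covariance with $\rho = 0.9$ below is an illustrative assumption, not a number from the lecture:

```python
# Exact max-norm quantile vs Bonferroni half-width under strong positive
# correlation: the joint quantile t_alpha can be noticeably smaller.
import numpy as np

rng = np.random.default_rng(1)
d, alpha, rho, n_sim = 20, 0.05, 0.9, 50000

Sigma = rho * np.ones((d, d)) + (1 - rho) * np.eye(d)   # equicorrelated, unit variances
L = np.linalg.cholesky(Sigma)
X = rng.standard_normal((n_sim, d)) @ L.T               # samples of X - theta

t_alpha = np.quantile(np.max(np.abs(X), axis=1), 1 - alpha)
bonf = 3.023            # standard normal quantile at 1 - alpha/(2d) = 0.99875

print(t_alpha < bonf)   # True: the joint quantile gives shorter intervals here
```

Because the coordinates move together, the max-norm quantile sits well below the Bonferroni half-width, so the deduced intervals are strictly shorter at the same simultaneous coverage.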
Bridge: Why FWER Can Be Too Strict
Lecture 19 solves the multiplicity problem by preventing even one false rejection. That is the cleanest extension of Type I error, but it may
over-correct in exploratory settings.
In the prostate example, Bonferroni reduced 478 marginal hits to 3. That is wonderful if every false positive is unacceptable. It is less
attractive if the scientific goal is to generate a manageable shortlist for follow-up experiments.
Lecture 20 keeps the multiple-testing mindset but changes the target. Instead of asking for the probability of zero false discoveries, it asks
for control of the fraction of discoveries that are false.
Lecture 20: False Discovery Rate and Benjamini-Hochberg
20.1 Motivation: false fraction, not false existence
For the same $6{,}033$ prostate-gene tests:
| Marginal threshold | Total rejections | Expected false rejections (roughly $m\alpha$) | Estimated false fraction |
| --- | --- | --- | --- |
| $0.05$ | 478 | 302 | About 63% |
| $0.01$ | 172 | 60 | About 35% |
| $0.001$ | 60 | 6 | About 10% |
This suggests a different goal: keep the proportion of bad discoveries small, rather than insisting on no false discovery at all.
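The estimated false fractions above come from dividing the expected false count $m \cdot t$ by the observed rejection count; a quick check:

```python
# Estimated false fraction at each marginal threshold t: (m * t) / R_t,
# with rejection counts R_t taken from the prostate example in the text.
m = 6033
counts = {0.05: 478, 0.01: 172, 0.001: 60}

fractions = {t: m * t / R for t, R in counts.items()}
for t, f in fractions.items():
    print(t, round(f, 2))      # 0.63, 0.35, 0.1
```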
20.2 FDP and FDR
Let
- $R = |\mathcal{R}|$: total number of discoveries,
- $V = |\mathcal{R}\cap\mathcal{H}_0|$: number of false discoveries.
Then the false discovery proportion is $$\FDP = \frac{V}{R \vee 1},$$ where $R\vee 1 = \max(R,1)$ so that $\FDP=0$ when $R=0$.
The false discovery rate is $$\FDR = E\!\left[\frac{V}{R\vee 1}\right] = E[\FDP].$$
FWER asks: "did we make at least one false discovery?" FDR asks: "what fraction of our discoveries are false, on average?" They answer different
scientific questions.
20.3 Why FDR is weaker than FWER
Your discussion note gives the clean comparison. Consider two cases:
- If $V=0$, then $\FDP=0$.
- If $V>0$, then $0 < \FDP = V/(R\vee 1) \leq 1 = \mathbf{1}\{V>0\}$.
So in every outcome, $$\FDP \leq \mathbf{1}\{V>0\}.$$ Taking expectations gives $$\FDR = E[\FDP] \leq E[\mathbf{1}\{V>0\}] = P(V>0) =
\FWER.$$ Therefore every FWER-controlling procedure automatically controls FDR, but the reverse is not true.
This is why Bonferroni is usually much more conservative than BH. Bonferroni is solving a strictly harder problem.
20.4 The Benjamini-Hochberg (BH) procedure
Given $p$-values $p_1,\ldots,p_m$ and target FDR level $\alpha$:
- Sort them: $p_{(1)} \leq p_{(2)} \leq \cdots \leq p_{(m)}$.
- Find the largest $k$ such that $$p_{(k)} \leq \frac{\alpha k}{m}.$$
- Reject the hypotheses corresponding to $p_{(1)},\ldots,p_{(k)}$.
BH is a step-up rule. Bonferroni uses one horizontal cutoff $\alpha/m$. BH uses the rising line $$y=\frac{\alpha}{m}k,$$ so it
becomes more permissive when many small $p$-values are present.
In the prostate data, BH at FDR $0.05$ rejects 21 genes, while Bonferroni rejects only 3. At FDR $0.10$, BH rejects 60 genes. So BH preserves
much more power in large-scale screening.
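A minimal implementation of the step-up rule, with made-up toy $p$-values chosen to show its step-up character (one $p$-value above its own line is still rejected because a later ordered $p$-value falls below its line):

```python
# Benjamini-Hochberg step-up rule: reject the k smallest p-values, where k is
# the largest index with p_(k) <= alpha * k / m.
import numpy as np

def bh_reject(pvals, alpha):
    """Boolean mask of hypotheses rejected at FDR level alpha."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    mask = np.zeros(m, dtype=bool)
    mask[order[:k]] = True
    return mask

# Step-up in action: 0.025 exceeds its own line alpha * 2 / m = 0.02, yet it is
# rejected because p_(4) = 0.03 falls below its line alpha * 4 / m = 0.04.
pvals = [0.009, 0.025, 0.028, 0.030, 0.900]
print(bh_reject(pvals, alpha=0.05))        # first four rejected
```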
20.5 BH as estimated-FDP control
Suppose we reject all hypotheses with $p_i \leq t$. Let $R_t$ be the number of such rejections and $V_t$ the number of false ones. If null
$p$-values are uniform, then $$E[V_t] = m_0 t \leq mt.$$ This suggests the conservative estimator $$\widehat{\FDP}_t = \frac{mt}{R_t \vee 1}.$$
BH chooses the largest threshold $t$ for which the estimated false-discovery proportion stays below $\alpha$. At a candidate threshold
$t=p_{(k)}$, we have $R_t=k$, so the condition $$\widehat{\FDP}_t \leq \alpha$$ becomes $$\frac{mp_{(k)}}{k} \leq \alpha
\quad\Longleftrightarrow\quad p_{(k)} \leq \frac{\alpha k}{m}.$$ So the BH rule is exactly the "control estimated FDP" idea written in
ordered-$p$-value form.
Why do we only check thresholds at observed $p$-values? Between two consecutive ordered $p$-values, the rejection count $R_t$ does not change,
so $\widehat{\FDP}_t = mt/R_t$ only gets larger as $t$ increases. The best threshold in each interval is therefore its left endpoint, namely an
observed ordered $p$-value.
20.6 Proof sketch for FDR control
Under independence and null uniformity, BH controls $$\FDR \leq \alpha \frac{m_0}{m} \leq \alpha.$$
The course proof has four moves:
- Write $V=\sum_{i\in\mathcal{H}_0} V_i$, where $V_i=\mathbf{1}\{H_{0i}\text{ rejected}\}$.
- Show $\FDR = \sum_{i\in\mathcal{H}_0} E[V_i/(R\vee 1)]$.
- For a fixed true null $i$, replace $p_i$ by $0$ and call the resulting BH rejection count $R^0$. On the event that $i$ is rejected, the
total number of BH rejections does not change, so $R=R^0$ there.
- Condition on the other $p$-values. Since $R^0$ depends only on $p_{-i}$ and $p_i\sim \text{Unif}(0,1)$ independently,
$$E\!\left[\frac{V_i}{R\vee 1}\right] \leq \frac{\alpha}{m}.$$ Summing over the $m_0$ true nulls gives the result.
The miracle step is the conditioning trick: once the other $p$-values are fixed, the threshold seen by one null $p$-value behaves like a fixed
number, and uniformity gives exactly the cancellation needed.
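The guarantee itself can be checked by simulation. This sketch draws $m_0$ uniform null $p$-values plus strongly non-null alternatives (the signal strength and the split $m_0 = 80$ of $m = 100$ are arbitrary illustrative choices) and estimates the FDR of BH:

```python
# Monte Carlo check of FDR <= alpha * m0 / m for BH under independence.
import numpy as np

rng = np.random.default_rng(2)
m, m0, alpha, n_sim = 100, 80, 0.1, 5000

fdp = np.empty(n_sim)
for s in range(n_sim):
    p = np.concatenate([rng.uniform(size=m0),              # true nulls: Unif(0,1)
                        rng.uniform(size=m - m0) * 1e-4])  # non-nulls: tiny p-values
    order = np.argsort(p)
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    V = np.sum(order[:k] < m0)                             # rejected true nulls
    fdp[s] = V / max(k, 1)

print(round(fdp.mean(), 3))        # close to alpha * m0 / m = 0.08, below alpha
```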
20.7 Assumptions and dependence remarks
| Assumption | Role in the proof | Consequence |
| --- | --- | --- |
| Null uniformity | Gives $P(p_i \leq t)=t$ for true nulls | Can be weakened to conservative null $p$-values |
| Independence | Makes $p_i$ independent of $p_{-i}$ and hence of $R^0$ | Harder to relax; BH remains valid under certain positive dependence conditions |
| Arbitrary dependence | Breaks the basic proof | A conservative fix is BH at level $\alpha/L_m$, where $L_m=\sum_{k=1}^m 1/k$ |
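The harmonic-sum penalty $L_m$ in the arbitrary-dependence fix (the Benjamini-Yekutieli correction) is easy to compute; for the prostate study's $m$ it costs nearly a factor of ten:

```python
# Benjamini-Yekutieli adjustment for arbitrary dependence: run BH at level
# alpha / L_m, where L_m = sum_{k=1}^m 1/k is the m-th harmonic number.
m, alpha = 6033, 0.05

L_m = sum(1 / k for k in range(1, m + 1))
adjusted = alpha / L_m

print(round(L_m, 2))       # about 9.28
print(adjusted)            # much smaller effective level than alpha
```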
Big Comparison Map
| Procedure / concept | What it controls | Rule or object | Typical use |
| --- | --- | --- | --- |
| Marginal testing | One-test Type I error only | Reject if $p_i \leq \alpha$ | Only appropriate when there is effectively one test |
| Bonferroni | FWER | Reject if $p_i \leq \alpha/m$ | Safe default under arbitrary dependence |
| Sidak | FWER | Reject if $p_i \leq 1-(1-\alpha)^{1/m}$ | Independence case; modest improvement over Bonferroni |
| Scheffe / deduced inference | FWER over a whole family of questions | Start from one joint confidence region | Simultaneous contrasts and geometric inference |
| Simultaneous CIs | Joint coverage of all target parameters | All intervals cover together with probability at least $1-\alpha$ | Report many intervals without hidden multiplicity inflation |
| Benjamini-Hochberg | FDR | Largest $k$ with $p_{(k)} \leq \alpha k/m$ | Exploratory screening with many discoveries |
Decision summary:
Use FWER language when even one false positive is costly. Use FDR language when you expect many discoveries and can tolerate a small false
fraction among them.
Master Summary and Formula Sheet
Lecture 19 core formulas
| Concept | Formula | Comment |
| --- | --- | --- |
| Familywise error rate | $\FWER_\theta = P_\theta(\mathcal{R}\cap\mathcal{H}_0 \neq \emptyset)$ | Probability of at least one false rejection |
| Bonferroni threshold | $\alpha/m$ | Comes directly from the union bound |
| Sidak threshold | $1-(1-\alpha)^{1/m}$ | Requires independence |
| Scheffe rule | Reject if $\lvert\lambda^T X\rvert^2 \geq \chi^2_{d,\alpha}$ | Controls infinitely many Gaussian contrasts |
| Simultaneous CI guarantee | $P(g_i(\theta)\in C_i(X)\text{ for all }i)\geq 1-\alpha$ | All intervals must cover together |
Lecture 20 core formulas
| Concept | Formula | Comment |
| --- | --- | --- |
| False discovery proportion | $\FDP = V/(R\vee 1)$ | Defined as 0 when $R=0$ |
| False discovery rate | $\FDR = E[V/(R\vee 1)]$ | Expected false fraction |
| FDR-FWER comparison | $\FDR \leq \FWER$ | Because $\FDP \leq \mathbf{1}\{V>0\}$ |
| BH rejection rule | $p_{(k)} \leq \alpha k/m$ | Take the largest valid $k$ |
| Estimated FDP heuristic | $\widehat{\FDP}_t = mt/(R_t\vee 1)$ | BH picks the largest threshold with estimate $\leq \alpha$ |
| BH guarantee | $\FDR \leq \alpha m_0/m \leq \alpha$ | Under independence and null uniformity |
Fast memory aid:
Bonferroni protects against any false discovery. BH protects against too many false discoveries on average as a fraction.
Common Mistakes
1. Treating $\alpha/m$ as a magical recipe instead of a union-bound consequence.
If you remember the inequality $P(\cup A_i)\leq \sum P(A_i)$, you remember Bonferroni.
2. Saying FDR and FWER are basically the same.
They are not. FWER controls the chance of any false discovery; FDR controls the expected false proportion.
3. Forgetting the $R\vee 1$ convention.
When there are no rejections, the false discovery proportion is defined to be 0, not undefined.
4. Thinking BH guarantees there are no false positives.
No. BH allows false discoveries; it controls their average proportion.
5. Applying BH under arbitrary dependence without comment.
The standard proof needs independence; BH also remains valid under certain positive-dependence conditions, but those were not proved in class.