Data 145: Evidence and Uncertainty

Comprehensive Study Guide - Lectures 19 through 20 - Spring 2026
Instructors: Ani Adhikari, William Fithian

Bridge: From Single Tests to Many Tests
Lecture 19: Multiple Testing and Simultaneous Inference
Bridge: Why FWER Can Be Too Strict
Lecture 20: False Discovery Rate and Benjamini-Hochberg
Study Map: Procedures, Guarantees, and Assumptions
Big Comparison Map
Master Summary and Formula Sheet
Common Mistakes

Bridge: From Single Tests to Many Tests

By Lecture 18, the course had built a full toolbox for one question at a time: choose the right test, compute one $p$-value, and compare it to one level $\alpha$. That framework works well when the scientific problem itself is singular.

Lecture 19 changes the unit of analysis. The issue is no longer "what is the right test for one hypothesis?" It is "what should error control mean when we test dozens, thousands, or even infinitely many related hypotheses at once?"

This is the natural sequel to Lecture 18. Once the class has many testing procedures available, the next real-world obstacle is that modern data analysis rarely stops after one test. Variable selection, genomics, A/B testing platforms, and simultaneous confidence statements all force us to think at the level of a family of inferences.

The conceptual move is: single-test Type I error -> familywise error -> simultaneous inference -> false discovery rate.

Lecture 19: Multiple Testing and Simultaneous Inference

19.1 Motivating example: prostate cancer gene expression

Singh et al. measured expression for $6{,}033$ genes in healthy controls and prostate cancer patients. Testing each gene separately at $\alpha=0.05$ produced 478 significant genes. But if all $6{,}033$ nulls were true, we would still expect about $6{,}033 \cdot 0.05 \approx 302$ false positives just from noise.

So the existence of many small $p$-values does not automatically mean we have many trustworthy discoveries. The main danger is that false rejections accumulate when the number of tests is large.

19.2 Setup and notation

We test $m$ null hypotheses $H_{01},\ldots,H_{0m}$ with $p$-values $p_1,\ldots,p_m$.

$\mathcal{R}(X)=\{i: H_{0i} \text{ is rejected}\}$ is the rejection set, with $R=|\mathcal{R}|$.
$\mathcal{H}_0=\{i: H_{0i} \text{ is true}\}$ is the set of true nulls, with $m_0=|\mathcal{H}_0|$.
At least one false rejection occurs when $\mathcal{R}\cap\mathcal{H}_0 \neq \emptyset$.

If all $m$ nulls are true and each test is run at level $\alpha$, then under independence: $$P(\text{at least one rejection}) = 1-(1-\alpha)^m.$$ For $m=6{,}033$ and $\alpha=0.05$, this is essentially 1.
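As a quick numerical check of this formula, the minimal Python sketch below evaluates $1-(1-\alpha)^m$ for a few values of $m$, together with the expected count $m\alpha$ from Section 19.1. The function names are illustrative and not from the lecture. The first function assumes independent tests and all nulls true.

```python
def prob_any_rejection(m: int, alpha: float) -> float:
    """P(at least one rejection) when all m nulls are true and tests are independent."""
    return 1.0 - (1.0 - alpha) ** m


def expected_false_rejections(m: int, alpha: float) -> float:
    """Expected number of rejections when all m nulls are true and each test has level alpha."""
    return m * alpha


# m = 6033 is the number of genes in the prostate example.
for m in (1, 10, 100, 6033):
    print(m, prob_any_rejection(m, 0.05), expected_false_rejections(m, 0.05))
```

Even at $m=100$ the chance of at least one rejection is already above 99%, which is why the per-test level stops being a useful summary long before $m$ reaches the thousands.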

A level-$0.05$ test sounds conservative in isolation. It is not conservative when repeated thousands of times. Multiple testing is mostly about translating the words "Type I error" from the one-test world into the many-test world.

19.3 Familywise error rate (FWER)

The familywise error rate is the probability of making any false rejection: $$\FWER_\theta = P_\theta(\mathcal{R}\cap\mathcal{H}_0 \neq \emptyset).$$ We want procedures with $\sup_\theta \FWER_\theta \leq \alpha$.

FWER is strict. It treats one false rejection among a thousand true rejections as a failure. That makes sense in high-stakes confirmatory settings, but it can become very conservative in screening problems.

19.4 Bonferroni and where $\alpha/m$ comes from

Bonferroni correction. Reject $H_{0i}$ only if $p_i \leq \alpha/m$. Then $$\sup_\theta \FWER_\theta \leq \alpha.$$

This is exactly the union-bound argument from discussion: $$P(\text{any false rejection}) = P\!\left(\bigcup_{i\in\mathcal{H}_0}\{\text{reject }H_{0i}\}\right) \leq \sum_{i\in\mathcal{H}_0} P(\text{reject }H_{0i}).$$ Under Bonferroni, each true null is tested at level $\alpha/m$, so $$\sum_{i\in\mathcal{H}_0} P(\text{reject }H_{0i}) \leq m_0 \frac{\alpha}{m} \leq \alpha.$$ That is the whole reason the denominator is $m$: it is the price paid to make the sum of all possible false-rejection probabilities stay below $\alpha$.

We divide by the total number of tests $m$, not the number of true nulls $m_0$, because $m_0$ is unknown. If we knew $m_0$, we could use $\alpha/m_0$, but in practice we do not.

In the prostate study, Bonferroni uses the threshold $$\frac{0.05}{6033} \approx 8.3\times 10^{-6}.$$ The 478 marginal discoveries collapse to just 3 Bonferroni discoveries. So Bonferroni is valid and simple, but often extremely harsh.
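The Bonferroni rule is short enough to write out directly. The sketch below is a minimal illustration. The function name and the toy $p$-values are hypothetical, not taken from the prostate data.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Return the indices i with p_i <= alpha / m, where m is the number of tests."""
    m = len(p_values)
    threshold = alpha / m
    return [i for i, p in enumerate(p_values) if p <= threshold]


# Hypothetical p-values: with m = 5 the per-test threshold is 0.05 / 5 = 0.01.
toy_p_values = [1e-7, 0.004, 0.02, 0.03, 0.5]
print(bonferroni_reject(toy_p_values))

# Per-test threshold for the m = 6033 genes in the prostate example.
print(0.05 / 6033)
```

Note that 0.02 and 0.03 would be rejected marginally at level 0.05 but do not survive the correction.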

19.5 Sidak: what independence buys you

If the $p$-values are independent, the Sidak correction uses $$\tilde\alpha_m = 1-(1-\alpha)^{1/m}$$ and rejects when $p_i \leq \tilde\alpha_m$.

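Sidak is derived by asking for the probability of no false rejections to be at least $1-\alpha$ and multiplying independent terms: $(1-\tilde\alpha_m)^{m_0} \geq (1-\tilde\alpha_m)^m = 1-\alpha$. For small per-test thresholds, however, Sidak and Bonferroni are numerically very close: for large $m$, $\tilde\alpha_m \approx -\log(1-\alpha)/m$, which at $\alpha=0.05$ is only about 2.6% larger than $\alpha/m$. Independence helps a little, not dramatically.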
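To see how close the two cutoffs are, here is a small Python sketch comparing them at $m=6{,}033$ and $\alpha=0.05$. The function names are illustrative. The Sidak cutoff is computed with `log1p` and `expm1` only to avoid rounding error when $m$ is large.

```python
import math


def bonferroni_threshold(m: int, alpha: float) -> float:
    return alpha / m


def sidak_threshold(m: int, alpha: float) -> float:
    # Equals 1 - (1 - alpha) ** (1 / m), written in a numerically stable form.
    return -math.expm1(math.log1p(-alpha) / m)


m, alpha = 6033, 0.05
bonf = bonferroni_threshold(m, alpha)
sidak = sidak_threshold(m, alpha)
print(bonf, sidak, sidak / bonf)
```

The ratio of the two cutoffs stays just above 1, so the Sidak cutoff is larger than Bonferroni's by only a few percent.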

19.6 Scheffe's method and infinitely many related tests

Bonferroni works best when we can literally list the hypotheses. But some simultaneous inference problems involve infinitely many comparisons, all built from the same data vector.

Suppose $X \sim N_d(\mu, I_d)$ and we want to test $$H_{0,\lambda}: \lambda^T \mu = 0$$ simultaneously for every unit vector $\lambda$.

Why unit vectors? Because scaling $\lambda$ does not change the underlying null statement: if $c \neq 0$, then $(c\lambda)^T\mu = 0$ is the same hypothesis as $\lambda^T\mu=0$. Normalizing $\lambda$ just removes a meaningless scale choice.

Scheffe's method. Reject $H_{0,\lambda}$ when $$|\lambda^T X|^2 \geq \chi^2_{d,\alpha}.$$ This controls FWER at level $\alpha$ over all such contrasts.

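Write $Z=X-\mu \sim N_d(0,I_d)$. If $H_{0,\lambda}$ is true, then $\lambda^T X = \lambda^T Z$. By Cauchy-Schwarz, $|\lambda^T Z|^2 \leq \|\lambda\|^2\|Z\|^2 = \|Z\|^2$ for every unit vector $\lambda$, so the largest squared contrast over all unit vectors is bounded by $\|Z\|^2$, and $\|Z\|^2 \sim \chi^2_d$. A false rejection for any $\lambda$ therefore requires $\|Z\|^2 \geq \chi^2_{d,\alpha}$, which happens with probability $\alpha$. One random confidence ball in $\mu$-space controls an infinite family of contrast tests at once.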
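A small simulation can make this bound concrete. The sketch below is an illustration under simplifying assumptions, not the lecture's code: it takes $d=2$ and $\mu=0$ (so every $H_{0,\lambda}$ is true), and it approximates "all unit vectors" by a finite grid of directions. For $d=2$ the $\chi^2_2$ distribution is exponential with mean 2, so the cutoff has the closed form $-2\log\alpha$ and no statistics library is needed.

```python
import math
import random


def scheffe_cutoff_d2(alpha: float) -> float:
    # Upper-alpha quantile of chi-square with 2 degrees of freedom.
    return -2.0 * math.log(alpha)


def simulate_scheffe_fwer(alpha=0.05, n_sims=5000, n_directions=90, seed=0):
    """Estimate P(some contrast on the grid is rejected) when mu = 0 in d = 2."""
    rng = random.Random(seed)
    cutoff = scheffe_cutoff_d2(alpha)
    # Unit vectors on a half circle; lambda and -lambda give the same test.
    directions = [
        (math.cos(math.pi * j / n_directions), math.sin(math.pi * j / n_directions))
        for j in range(n_directions)
    ]
    any_rejection = 0
    for _ in range(n_sims):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        worst = max((a * z1 + b * z2) ** 2 for a, b in directions)
        if worst >= cutoff:
            any_rejection += 1
    return any_rejection / n_sims


print(scheffe_cutoff_d2(0.05))
print(simulate_scheffe_fwer())
```

The estimated familywise error should land near $\alpha$, even though 90 contrasts are tested in every simulated data set.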

19.7 Deduction principle and simultaneous confidence intervals

If $C(X)$ is any joint $(1-\alpha)$ confidence region for $\theta$, then any conclusion deduced under the assumption $\theta \in C(X)$ is automatically FWER-controlled at level $\alpha$.

Intervals $C_1(X),\ldots,C_m(X)$ are simultaneous $(1-\alpha)$ confidence intervals for $g_1(\theta),\ldots,g_m(\theta)$ if $$P_\theta\bigl(g_i(\theta)\in C_i(X)\text{ for all }i\bigr)\geq 1-\alpha.$$

Bonferroni also gives simultaneous intervals. If each individual interval has marginal coverage $1-\alpha/m$, then $$P(\text{any interval misses}) \leq \sum_{i=1}^m \alpha/m = \alpha.$$ So marginal $1-\alpha/m$ intervals become simultaneous $1-\alpha$ intervals.
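As an illustration, the sketch below builds Bonferroni simultaneous intervals in one specific setting that is assumed here and not stated in the text: each $X_i \sim N(\theta_i, \sigma^2)$ with known $\sigma$, and each interval is two-sided. The function name is hypothetical.

```python
from statistics import NormalDist


def bonferroni_intervals(x, alpha=0.05, sigma=1.0):
    """Two-sided intervals x_i +/- z * sigma, each with marginal coverage 1 - alpha / m."""
    m = len(x)
    # Each interval gets miss probability alpha / m, split evenly over the two tails.
    z = NormalDist().inv_cdf(1.0 - alpha / (2.0 * m))
    return [(xi - z * sigma, xi + z * sigma) for xi in x]


# Hypothetical observations for m = 10 coordinates.
observations = [0.3, -1.2, 2.5, 0.0, 1.1, -0.4, 0.9, 3.2, -2.0, 0.6]
for low, high in bonferroni_intervals(observations):
    print(round(low, 3), round(high, 3))
```

With $m=10$ the half-width grows from the usual 1.96 to roughly 2.81 standard deviations, which is the cost of joint coverage.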

In correlated Gaussian settings, a deduced confidence region can beat Bonferroni. For example, if $X \sim N_d(\theta,\Sigma)$ and $t_\alpha$ is the upper-$\alpha$ quantile of $\|X-\theta\|_\infty$, then $$[X_i-t_\alpha,\; X_i+t_\alpha]$$ gives exact simultaneous intervals for all coordinates. Strong positive correlation can make $t_\alpha$ noticeably smaller than the Bonferroni width.
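The gain from using the joint distribution can be estimated by simulation. The sketch below assumes one convenient covariance that is not from the lecture: unit variances with a common correlation $\rho$, generated as a shared factor plus independent noise. It estimates $t_\alpha$ by Monte Carlo and compares it with the Bonferroni half-width.

```python
import math
import random
from statistics import NormalDist


def max_abs_quantile(d, rho, alpha=0.05, n_sims=20000, seed=0):
    """Monte Carlo estimate of the upper-alpha quantile of max_i |X_i - theta_i|."""
    rng = random.Random(seed)
    a, b = math.sqrt(rho), math.sqrt(1.0 - rho)
    maxima = []
    for _ in range(n_sims):
        shared = rng.gauss(0.0, 1.0)
        # Each coordinate has variance 1 and pairwise correlation rho.
        maxima.append(max(abs(a * shared + b * rng.gauss(0.0, 1.0)) for _ in range(d)))
    maxima.sort()
    k = min(n_sims - 1, int((1.0 - alpha) * n_sims))
    return maxima[k]


d, alpha = 10, 0.05
bonferroni_width = NormalDist().inv_cdf(1.0 - alpha / (2.0 * d))
t_correlated = max_abs_quantile(d, rho=0.9, alpha=alpha)
print(bonferroni_width, t_correlated)
```

In this setup the simulated $t_\alpha$ at $\rho=0.9$ comes out clearly below the Bonferroni half-width. Rerunning with `rho=0.0` should give a value very close to Bonferroni, matching the earlier remark that independence alone buys little.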

19.8 Why Bonferroni can be wasteful under correlation

Bonferroni is intentionally blind to dependence. It protects against a false rejection in each coordinate separately, then adds those risks by the union bound. That is why it is always safe, but it can spend probability budget on combinations of errors that the joint distribution almost never produces.

Figure (not reproduced here): blue boxes represent axis-aligned Bonferroni-style simultaneous intervals, and green shapes represent the actual joint error cloud. When the coordinates are strongly correlated, the joint cloud is thinner than the box, so a method that uses the covariance structure can be tighter.

The firefly picture: the data point moves inside a joint error cloud. If the cloud is round, an axis-aligned box is not wildly inefficient. If the cloud is a long ellipse, Bonferroni still builds a box wide enough to cover the extreme coordinate tips, including corners where the firefly essentially cannot go. That unused corner area is the "wasted" probability budget.

Dependence has two slightly different stories here. For two-sided simultaneous coordinate intervals, strong correlation of either sign can make the effective dimension smaller and make Bonferroni conservative. For one-sided upper-tail testing events, positive correlation makes rejection events overlap more, while negative dependence can make them more disjoint; in that case the union bound can be closer to sharp. Always ask: which events are overlapping?

Bridge: Why FWER Can Be Too Strict

Lecture 19 solves the multiplicity problem by preventing even one false rejection. That is the cleanest extension of Type I error, but it may over-correct in exploratory settings.

In the prostate example, Bonferroni reduced 478 marginal hits to 3. That is wonderful if every false positive is unacceptable. It is less attractive if the scientific goal is to generate a manageable shortlist for follow-up experiments.

Lecture 20 keeps the multiple-testing mindset but changes the target. Instead of asking for the probability of zero false discoveries, it asks for control of the fraction of discoveries that are false.

Lecture 20: False Discovery Rate and Benjamini-Hochberg

20.1 Motivation: false fraction, not false existence

For the same $6{,}033$ prostate-gene tests:

Marginal threshold	Total rejections	Expected false rejections (roughly $m\alpha$)	Estimated false fraction
$0.05$	478	302	About 63%
$0.01$	172	60	About 35%
$0.001$	60	6	About 10%

This suggests a different goal: keep the proportion of bad discoveries small, rather than insisting on no false discovery at all.
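The last column of the table is simple arithmetic: divide the rough expected false count $m\alpha$ by the number of rejections. A minimal sketch that reproduces it from the figures quoted above (the function name is illustrative):

```python
def estimated_false_fraction(m: int, threshold: float, rejections: int) -> float:
    """Rough false fraction: expected false rejections m * threshold over total rejections."""
    return m * threshold / rejections


m = 6033
table_rows = [(0.05, 478), (0.01, 172), (0.001, 60)]  # (threshold, total rejections)
for threshold, rejections in table_rows:
    fraction = estimated_false_fraction(m, threshold, rejections)
    print(threshold, rejections, round(m * threshold), round(100 * fraction))
```

The estimate uses $m$ in place of the unknown $m_0$, so it overstates the false fraction whenever some nulls are false.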

20.2 FDP and FDR

Let

$R = |\mathcal{R}|$: total number of discoveries,
$V = |\mathcal{R}\cap\mathcal{H}_0|$: number of false discoveries.

Then the false discovery proportion is $$\FDP = \frac{V}{R \vee 1},$$ where $R\vee 1 = \max(R,1)$ so that $\FDP=0$ when $R=0$. The false discovery rate is $$\FDR = E\!\left[\frac{V}{R\vee 1}\right] = E[\FDP].$$
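In code, the $R\vee 1$ convention is just a `max` in the denominator. The sketch below is illustrative. In a real analysis the set of true nulls is unknown, so this quantity can only be computed in simulations where the truth is planted.

```python
def false_discovery_proportion(rejected, true_nulls) -> float:
    """FDP = V / max(R, 1), with V the number of rejected true nulls."""
    rejected, true_nulls = set(rejected), set(true_nulls)
    v = len(rejected & true_nulls)
    r = len(rejected)
    return v / max(r, 1)


# Hypothetical example: four rejections, one of which (index 3) is a true null.
print(false_discovery_proportion({1, 2, 3, 4}, {3, 9}))
# No rejections at all: the proportion is 0 by convention rather than 0/0.
print(false_discovery_proportion(set(), {3, 9}))
```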

FWER asks: "did we make at least one false discovery?" FDR asks: "what fraction of our discoveries are false, on average?" They answer different scientific questions.

20.3 Why FDR is weaker than FWER

The discussion notes give a clean comparison. Consider two cases:

If $V=0$, then $\FDP=0$.
If $V>0$, then $0 < \FDP = V/(R\vee 1) \leq 1 = \mathbf{1}\{V>0\}$.

So in every outcome, $$\FDP \leq \mathbf{1}\{V>0\}.$$ Taking expectations gives $$\FDR = E[\FDP] \leq E[\mathbf{1}\{V>0\}] = P(V>0) = \FWER.$$ Therefore every FWER-controlling procedure automatically controls FDR, but the reverse is not true.

This is why Bonferroni is usually much more conservative than BH. Bonferroni is solving a strictly harder problem.

20.4 The Benjamini-Hochberg (BH) procedure

Given $p$-values $p_1,\ldots,p_m$ and target FDR level $\alpha$:

Sort them: $p_{(1)} \leq p_{(2)} \leq \cdots \leq p_{(m)}$.
Find the largest $k$ such that $$p_{(k)} \leq \frac{\alpha k}{m}.$$
Reject the hypotheses corresponding to $p_{(1)},\ldots,p_{(k)}$.

BH is a step-up rule. Bonferroni uses one horizontal cutoff $\alpha/m$. BH uses the rising line $$y=\frac{\alpha}{m}k,$$ so it becomes more permissive when many small $p$-values are present.
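The three steps above can be sketched in a few lines of Python. This is a minimal illustration with a hypothetical function name and toy $p$-values, not a reference implementation. The toy values are chosen so that the step-up behavior is visible.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the sorted indices rejected by BH at target FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices from smallest p to largest
    k_star = 0
    for rank, i in enumerate(order, start=1):
        # Step-up: remember the LARGEST rank whose p-value is under the line alpha * k / m.
        if p_values[i] <= alpha * rank / m:
            k_star = rank
    return sorted(order[:k_star])


# Hypothetical p-values with m = 5, so the BH line is 0.01, 0.02, 0.03, 0.04, 0.05.
toy_p_values = [0.9, 0.025, 0.012, 0.031, 0.018]
print(benjamini_hochberg(toy_p_values))  # [1, 2, 3, 4]
```

The smallest $p$-value, 0.012, sits above its own line value 0.01, yet it is still rejected because the fourth ordered $p$-value, 0.031, is below 0.04. Bonferroni at the same $\alpha$ uses the cutoff 0.01 and rejects nothing here.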

In the prostate data, BH at FDR $0.05$ rejects 21 genes, while Bonferroni rejects only 3. At FDR $0.10$, BH rejects 60 genes. So BH preserves much more power in large-scale screening.

20.5 BH as estimated-FDP control

Suppose we reject all hypotheses with $p_i \leq t$. Let $R_t$ be the number of such rejections and $V_t$ the number of false ones. If null $p$-values are uniform, then $$E[V_t] = m_0 t \leq mt.$$ This suggests the conservative estimator $$\widehat{\FDP}_t = \frac{mt}{R_t \vee 1}.$$

BH chooses the largest threshold $t$ for which the estimated false-discovery proportion stays below $\alpha$. At a candidate threshold $t=p_{(k)}$, we have $R_t=k$, so the condition $$\widehat{\FDP}_t \leq \alpha$$ becomes $$\frac{mp_{(k)}}{k} \leq \alpha \quad\Longleftrightarrow\quad p_{(k)} \leq \frac{\alpha k}{m}.$$ So the BH rule is exactly the "control estimated FDP" idea written in ordered-$p$-value form.

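Why do we only check thresholds at observed $p$-values? Between two consecutive ordered $p$-values, the rejection count $R_t$ does not change, so $\widehat{\FDP}_t = mt/(R_t\vee 1)$ only gets larger as $t$ increases. If any threshold in such an interval satisfies $\widehat{\FDP}_t \leq \alpha$, then so does the interval's left endpoint, which is an observed ordered $p$-value and gives the same rejection set. Nothing is lost by searching only over the observed $p$-values.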
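The estimated-FDP view can be checked directly on a small example. The sketch below uses hypothetical function names and toy $p$-values. It evaluates $\widehat{\FDP}_t$ at each observed $p$-value and keeps the largest threshold whose estimate is at most $\alpha$.

```python
def estimated_fdp(p_values, t):
    """Conservative estimate m * t / max(R_t, 1), where R_t counts p-values <= t."""
    m = len(p_values)
    r_t = sum(1 for p in p_values if p <= t)
    return m * t / max(r_t, 1)


def largest_valid_threshold(p_values, alpha=0.05):
    """Largest observed p-value whose estimated FDP is at most alpha, or None."""
    valid = [p for p in p_values if estimated_fdp(p_values, p) <= alpha]
    return max(valid) if valid else None


toy_p_values = [0.9, 0.025, 0.012, 0.031, 0.018]
for t in sorted(toy_p_values):
    print(t, estimated_fdp(toy_p_values, t))

t_star = largest_valid_threshold(toy_p_values)
print(t_star, sum(1 for p in toy_p_values if p <= t_star))
```

For these values the estimate at $t=0.012$ is about 0.06, above the target, while the estimates at 0.018, 0.025, and 0.031 are all below 0.05. The selected threshold is therefore 0.031 with four rejections, matching the $k=4$ that the ordered rule $p_{(k)} \leq \alpha k/m$ selects.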

20.6 Proof sketch for FDR control

Under independence and null uniformity, BH controls $$\FDR \leq \alpha \frac{m_0}{m} \leq \alpha.$$

The course proof has four moves:

Write $V=\sum_{i\in\mathcal{H}_0} V_i$, where $V_i=\mathbf{1}\{H_{0i}\text{ rejected}\}$.
Show $\FDR = \sum_{i\in\mathcal{H}_0} E[V_i/(R\vee 1)]$.
For a fixed true null $i$, replace $p_i$ by $0$ and call the resulting BH rejection count $R^0$. On the event that $i$ is rejected, the total number of BH rejections does not change, so $R=R^0$ there.
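Condition on the other $p$-values. $R^0$ depends only on $p_{-i}$, and $H_{0i}$ is rejected exactly when $p_i \leq \alpha R^0/m$, in which case $R=R^0$. Since $p_i\sim \text{Uniform}(0,1)$ independently of $p_{-i}$, $$E\!\left[\frac{V_i}{R\vee 1}\,\Big|\,p_{-i}\right] = \frac{P(p_i \leq \alpha R^0/m \mid p_{-i})}{R^0} = \frac{\alpha R^0/m}{R^0} = \frac{\alpha}{m}, \quad\text{so}\quad E\!\left[\frac{V_i}{R\vee 1}\right] \leq \frac{\alpha}{m}.$$ Summing over the $m_0$ true nulls gives the result.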

The miracle step is the conditioning trick: once the other $p$-values are fixed, the threshold seen by one null $p$-value behaves like a fixed number, and uniformity gives exactly the cancellation needed.
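The guarantee can also be checked by simulation. The sketch below is an illustration under assumptions chosen here, not taken from the lecture: $m=50$ independent one-sided $z$-tests, $m_0=40$ true nulls with uniform $p$-values, and 10 alternatives whose test statistics are $N(3,1)$. It averages the realized FDP of BH over many simulated data sets.

```python
import random
from statistics import NormalDist


def bh_rejections(p_values, alpha):
    """Indices rejected by BH: the k smallest p-values, k the largest rank under the line."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_star = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            k_star = rank
    return set(order[:k_star])


def simulate_fdr(m=50, m0=40, shift=3.0, alpha=0.1, n_sims=4000, seed=0):
    """Monte Carlo average of FDP for BH; indices 0..m0-1 are the true nulls."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    total_fdp = 0.0
    for _ in range(n_sims):
        p = [rng.random() for _ in range(m0)]  # uniform null p-values
        p += [1.0 - std_normal.cdf(rng.gauss(shift, 1.0)) for _ in range(m - m0)]
        rejected = bh_rejections(p, alpha)
        v = sum(1 for i in rejected if i < m0)
        total_fdp += v / max(len(rejected), 1)
    return total_fdp / n_sims


print(simulate_fdr())
```

With these settings the bound $\alpha m_0/m$ equals 0.08, and the simulated average should land close to that value rather than at $\alpha=0.10$.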

20.7 Assumptions and dependence remarks

Assumption	Role in the proof	Consequence
Null uniformity	Gives $P(p_i \leq t)=t$ for true nulls	Can be weakened to conservative null $p$-values
Independence	Makes $p_i$ independent of $p_{-i}$ and hence of $R^0$	Harder to relax; BH remains valid under certain positive dependence conditions
Arbitrary dependence	Breaks the basic proof	A conservative fix is BH at level $\alpha/L_m$, where $L_m=\sum_{k=1}^m 1/k$

In the BH theorem, "positive dependence" is a specific technical condition, not merely the informal idea that the $p$-values tend to move together. The standard lecture proof used full independence. A more advanced theorem allows certain positive regression dependence conditions, under which increasing evidence against one null cannot perversely push the BH rejection threshold in the wrong direction. Arbitrary or strongly negative dependence is not covered by the basic proof.
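The arbitrary-dependence correction from the table is easy to quantify. The sketch below (with hypothetical function names) computes $L_m$ and the reduced level $\alpha/L_m$ at which BH would be run.

```python
def harmonic_number(m: int) -> float:
    """L_m = 1 + 1/2 + ... + 1/m."""
    return sum(1.0 / k for k in range(1, m + 1))


def corrected_bh_level(m: int, alpha: float) -> float:
    """Level alpha / L_m to plug into BH when the dependence is arbitrary."""
    return alpha / harmonic_number(m)


m, alpha = 6033, 0.05
print(harmonic_number(m), corrected_bh_level(m, alpha))
```

For $m=6{,}033$, $L_m$ is a little over 9, so the correction costs roughly a factor of 9 in the target level. That is far milder than the factor of $m$ that Bonferroni pays, but it is not free.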

Study Map: Procedures, Guarantees, and Assumptions

For studying, every method in Lectures 19-20 should be remembered through five questions: what is the target error concept, what is the procedure, what does it achieve, what assumptions does the guarantee need, and how does it relate to the other procedures?

Method	Process	Achieves	Assumes / proof needs	Relationship
Marginal testing	Reject $H_{0i}$ when $p_i\leq\alpha$.	Controls each individual Type I error.	Only needs each true-null $p_i$ to be valid for its own test.	Baseline; fails to control family-level error when $m$ is large.
Bonferroni	Reject $H_{0i}$ when $p_i\leq\alpha/m$.	Controls FWER: $P(V>0)\leq\alpha$.	Valid true-null $p$-values. No independence needed.	Union-bound method; safest and often most conservative.
Sidak	Reject $H_{0i}$ when $p_i\leq1-(1-\alpha)^{1/m}$.	Controls FWER, with a slightly larger cutoff than Bonferroni.	Independence among true-null $p$-values, plus null uniformity or conservativeness.	Independence-refined Bonferroni; usually numerically close.
Scheffe	Build a $\chi^2$ confidence ball, then test all contrasts from that ball.	Controls FWER over infinitely many Gaussian contrasts.	Gaussian linear-contrast setup and known covariance structure in the lecture version.	Shows that geometry can replace counting hypotheses.
Deduction principle	Start with one joint $1-\alpha$ confidence region and deduce all valid conclusions from it.	Any wrong deduced conclusion occurs only if the joint region missed $\theta$.	The starting confidence region must have joint coverage at least $1-\alpha$.	Generalizes Scheffe and simultaneous intervals.
BH	Sort $p$-values and take the largest $k$ with $p_{(k)}\leq \alpha k/m$.	Controls FDR: $E[V/(R\vee1)]\leq\alpha m_0/m\leq\alpha$.	Lecture proof needs independence and null uniformity; uniformity can relax to conservative null $p$-values.	Step-up estimated-FDP control; more powerful than FWER methods for screening.
BH under dependence	Use ordinary BH only under approved positive-dependence settings; otherwise reduce the target level.	PRDS-type positive dependence still allows BH control; arbitrary dependence needs a correction.	Arbitrary dependence uses the conservative Benjamini-Yekutieli level $\alpha/L_m$, $L_m=\sum_{k=1}^m1/k$.	This is the FDR analogue of asking when dependence breaks or preserves the proof.

Relationship in one line:
Bonferroni/Sidak/Scheffe/deduction control the chance of any false claim; BH controls the expected false fraction among the claims you make.

Big Comparison Map

Procedure / concept	What it controls	Rule or object	Typical use
Marginal testing	One-test Type I error only	Reject if $p_i \leq \alpha$	Only appropriate when there is effectively one test
Bonferroni	FWER	Reject if $p_i \leq \alpha/m$	Safe default under arbitrary dependence
Sidak	FWER	Reject if $p_i \leq 1-(1-\alpha)^{1/m}$	Independence case; modest improvement over Bonferroni
Scheffe / deduced inference	FWER over a whole family of questions	Start from one joint confidence region	Simultaneous contrasts and geometric inference
Simultaneous CIs	Joint coverage of all target parameters	All intervals cover together with probability at least $1-\alpha$	Report many intervals without hidden multiplicity inflation
Benjamini-Hochberg	FDR	Largest $k$ with $p_{(k)} \leq \alpha k/m$	Exploratory screening with many discoveries

Decision summary:
Use FWER language when even one false positive is costly. Use FDR language when you expect many discoveries and can tolerate a small false fraction among them.

Master Summary and Formula Sheet

Lecture 19 core formulas

Concept	Formula	Comment
Familywise error rate	$\FWER_\theta = P_\theta(\mathcal{R}\cap\mathcal{H}_0 \neq \emptyset)$	Probability of at least one false rejection
Bonferroni threshold	$\alpha/m$	Comes directly from the union bound
Sidak threshold	$1-(1-\alpha)^{1/m}$	Requires independence
Scheffe rule	Reject if $|\lambda^T X|^2 \geq \chi^2_{d,\alpha}$	Controls infinitely many Gaussian contrasts
Simultaneous CI guarantee	$P(g_i(\theta)\in C_i(X)\text{ for all }i)\geq 1-\alpha$	All intervals must cover together

Lecture 20 core formulas

Concept	Formula	Comment
False discovery proportion	$\FDP = V/(R\vee 1)$	Defined as 0 when $R=0$
False discovery rate	$\FDR = E[V/(R\vee 1)]$	Expected false fraction
FDR-FWER comparison	$\FDR \leq \FWER$	Because $\FDP \leq \mathbf{1}\{V>0\}$
BH rejection rule	$p_{(k)} \leq \alpha k/m$	Take the largest valid $k$
Estimated FDP heuristic	$\widehat{\FDP}_t = mt/(R_t\vee 1)$	BH picks the largest threshold with estimate $\leq \alpha$
BH guarantee	$\FDR \leq \alpha m_0/m \leq \alpha$	Under independence and null uniformity

Fast memory aid:
Bonferroni protects against any false discovery. BH protects against too many false discoveries on average as a fraction.

Common Mistakes

1. Treating $\alpha/m$ as a magical recipe instead of a union-bound consequence.
If you remember the inequality $P(\cup A_i)\leq \sum P(A_i)$, you remember Bonferroni.

2. Saying FDR and FWER are basically the same.
They are not. FWER controls the chance of any false discovery; FDR controls the expected false proportion.

3. Forgetting the $R\vee 1$ convention.
When there are no rejections, the false discovery proportion is defined to be 0, not undefined.

4. Thinking BH guarantees there are no false positives.
No. BH allows false discoveries; it controls their average proportion.

5. Applying BH under arbitrary dependence without comment.
The standard proof needs independence. Extending BH to positive dependence requires a more advanced theorem that was not proved in class, and arbitrary dependence calls for the conservative level $\alpha/L_m$.

Data 145 Study Guide - Lectures 19-20 - Standalone Review Version