Data 145: Evidence and Uncertainty

Comprehensive Study Guide - Lectures 19 through 20 - Spring 2026
Instructors: Ani Adhikari, William Fithian

Table of Contents

  1. Bridge: From Single Tests to Many Tests
  2. Lecture 19: Multiple Testing and Simultaneous Inference
  3. Bridge: Why FWER Can Be Too Strict
  4. Lecture 20: False Discovery Rate and Benjamini-Hochberg
  5. Big Comparison Map
  6. Master Summary and Formula Sheet
  7. Common Mistakes

Bridge: From Single Tests to Many Tests

By Lecture 18, the course had built a full toolbox for one question at a time: choose the right test, compute one $p$-value, and compare it to one level $\alpha$. That framework works well when the scientific problem itself is singular.

Lecture 19 changes the unit of analysis. The issue is no longer "what is the right test for one hypothesis?" It is "what should error control mean when we test dozens, thousands, or even infinitely many related hypotheses at once?"

This is the natural sequel to Lecture 18. Once the class has many testing procedures available, the next real-world obstacle is that modern data analysis rarely stops after one test. Variable selection, genomics, A/B testing platforms, and simultaneous confidence statements all force us to think at the level of a family of inferences.

The conceptual move is: single-test Type I error -> familywise error -> simultaneous inference -> false discovery rate.


Lecture 19: Multiple Testing and Simultaneous Inference

19.1 Motivating example: prostate cancer gene expression

Singh et al. measured expression for $6{,}033$ genes in healthy controls and prostate cancer patients. Testing each gene separately at $\alpha=0.05$ produced 478 significant genes. But if all $6{,}033$ nulls were true, we would still expect about $6{,}033 \cdot 0.05 \approx 302$ false positives just from noise.

So the existence of many small $p$-values does not automatically mean we have many trustworthy discoveries. The main danger is that false rejections accumulate when the number of tests is large.

19.2 Setup and notation

We test $m$ null hypotheses $H_{01},\ldots,H_{0m}$ with $p$-values $p_1,\ldots,p_m$.
If all $m$ nulls are true and each test is run at level $\alpha$, then under independence: $$P(\text{at least one rejection}) = 1-(1-\alpha)^m.$$ For $m=6{,}033$ and $\alpha=0.05$, this is essentially 1.
A level-$0.05$ test sounds conservative in isolation. It is not conservative when repeated thousands of times. Multiple testing is mostly about translating the words "Type I error" from the one-test world into the many-test world.
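As a quick numeric check, the formula can be evaluated for the prostate-study numbers (an illustrative sketch, not from the lecture):

```python
# Probability of at least one rejection when all m nulls are true and each
# test is run at level alpha, under independence: 1 - (1 - alpha)^m.
m, alpha = 6033, 0.05
p_any = 1 - (1 - alpha) ** m
print(f"P(at least one rejection) = {p_any:.6f}")

# Expected number of false rejections (valid under any dependence): m * alpha.
print(f"expected false rejections = {m * alpha:.2f}")
```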

19.3 Familywise error rate (FWER)

The familywise error rate is the probability of making any false rejection: $$\FWER_\theta = P_\theta(\mathcal{R}\cap\mathcal{H}_0 \neq \emptyset).$$ We want procedures with $\sup_\theta \FWER_\theta \leq \alpha$.

FWER is strict. It treats one false rejection among a thousand true rejections as a failure. That makes sense in high-stakes confirmatory settings, but it can become very conservative in screening problems.

19.4 Bonferroni and where $\alpha/m$ comes from

Bonferroni correction. Reject $H_{0i}$ only if $p_i \leq \alpha/m$. Then $$\sup_\theta \FWER_\theta \leq \alpha.$$
This is exactly the union-bound argument from discussion: $$P(\text{any false rejection}) = P\!\left(\bigcup_{i\in\mathcal{H}_0}\{\text{reject }H_{0i}\}\right) \leq \sum_{i\in\mathcal{H}_0} P(\text{reject }H_{0i}).$$ Under Bonferroni, each true null is tested at level $\alpha/m$, so $$\sum_{i\in\mathcal{H}_0} P(\text{reject }H_{0i}) \leq m_0 \frac{\alpha}{m} \leq \alpha.$$ That is the whole reason the denominator is $m$: it is the price paid to make the sum of all possible false-rejection probabilities stay below $\alpha$.
We divide by the total number of tests $m$, not the number of true nulls $m_0$, because $m_0$ is unknown. If we knew $m_0$, we could use $\alpha/m_0$, but in practice we do not.
In the prostate study, Bonferroni uses the threshold $$\frac{0.05}{6033} \approx 8.3\times 10^{-6}.$$ The 478 marginal discoveries collapse to just 3 Bonferroni discoveries. So Bonferroni is valid and simple, but often extremely harsh.
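The union-bound guarantee can also be checked by simulation; this is an illustrative sketch with parameters of my choosing, assuming all nulls are true and the $p$-values are independent Unif$(0,1)$:

```python
import numpy as np

# Monte Carlo estimate of Bonferroni's FWER when every null is true.
rng = np.random.default_rng(0)
m, alpha, reps = 100, 0.05, 20000

p = rng.uniform(size=(reps, m))                 # all m nulls true
any_rejection = (p <= alpha / m).any(axis=1)    # Bonferroni threshold alpha/m
fwer_hat = any_rejection.mean()
print(f"estimated FWER = {fwer_hat:.3f}  (guarantee: <= {alpha})")
```

The estimate lands near $1-(1-\alpha/m)^m \approx 1 - e^{-\alpha} \approx 0.049$, just under the guaranteed $0.05$.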

19.5 Sidak: what independence buys you

If the $p$-values are independent, the Sidak correction uses $$\tilde\alpha_m = 1-(1-\alpha)^{1/m}$$ and rejects when $p_i \leq \tilde\alpha_m$.
Sidak is derived by requiring the probability of no false rejections to be at least $1-\alpha$: under independence, when all nulls are true, $P(\text{no false rejection}) = (1-\tilde\alpha_m)^m = 1-\alpha$. But for small per-test thresholds the two rules are numerically very close, since $1-(1-\alpha)^{1/m} \approx \alpha/m$. Independence helps a little, not dramatically.
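A short numeric comparison (values of $m$ are illustrative) shows how close the two thresholds are:

```python
# Compare the Bonferroni and Sidak per-test thresholds at alpha = 0.05.
alpha = 0.05
for m in (10, 100, 6033):
    bonf = alpha / m
    sidak = 1 - (1 - alpha) ** (1 / m)
    print(f"m={m:5d}  Bonferroni={bonf:.3e}  Sidak={sidak:.3e}  "
          f"ratio={sidak / bonf:.4f}")
```

Sidak is always slightly more permissive, but the ratio stays below about $1.03$, confirming that independence buys only a marginal gain.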

19.6 Scheffe's method and infinitely many related tests

Bonferroni works best when we can literally list the hypotheses. But some simultaneous inference problems involve infinitely many comparisons, all built from the same data vector.

Suppose $X \sim N_d(\mu, I_d)$ and we want to test $$H_{0,\lambda}: \lambda^T \mu = 0$$ simultaneously for every unit vector $\lambda$.

Why unit vectors? Because scaling $\lambda$ does not change the underlying null statement: if $c \neq 0$, then $(c\lambda)^T\mu = 0$ is the same hypothesis as $\lambda^T\mu=0$. Normalizing $\lambda$ just removes a meaningless scale choice.

Scheffe's method. Reject $H_{0,\lambda}$ when $$(\lambda^T X)^2 \geq \chi^2_{d,\alpha},$$ where $\chi^2_{d,\alpha}$ is the upper-$\alpha$ quantile of the $\chi^2_d$ distribution. This controls FWER at level $\alpha$ over all such contrasts.
Write $Z=X-\mu \sim N_d(0,I_d)$. If $H_{0,\lambda}$ is true, then $\lambda^T X = \lambda^T Z$. By the Cauchy-Schwarz inequality, the worst possible contrast over all unit vectors satisfies $\sup_{\|\lambda\|=1} (\lambda^T Z)^2 = \|Z\|^2$ (attained at $\lambda = Z/\|Z\|$), and $\|Z\|^2 \sim \chi^2_d$. One random confidence ball in $\mu$-space controls an infinite family of contrast tests at once.
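A small simulation (dimension and replication count are my choices) illustrates the guarantee: the supremum of $(\lambda^T Z)^2$ over unit vectors equals $\|Z\|^2$, so the FWER over all contrasts is exactly the $\chi^2_d$ tail probability.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)
d, alpha, reps = 5, 0.05, 20000

Z = rng.standard_normal((reps, d))     # Z = X - mu under the global null
worst = (Z ** 2).sum(axis=1)           # sup over unit lambda of (lambda^T Z)^2 = ||Z||^2
cutoff = chi2.ppf(1 - alpha, df=d)     # upper-alpha quantile of chi^2_d
fwer_hat = (worst >= cutoff).mean()
print(f"estimated FWER over all contrasts = {fwer_hat:.3f}")
```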

19.7 Deduction principle and simultaneous confidence intervals

If $C(X)$ is any joint $(1-\alpha)$ confidence region for $\theta$, then any conclusion deduced under the assumption $\theta \in C(X)$ is automatically FWER-controlled at level $\alpha$.
Intervals $C_1(X),\ldots,C_m(X)$ are simultaneous $(1-\alpha)$ confidence intervals for $g_1(\theta),\ldots,g_m(\theta)$ if $$P_\theta\bigl(g_i(\theta)\in C_i(X)\text{ for all }i\bigr)\geq 1-\alpha.$$
Bonferroni also gives simultaneous intervals. If each individual interval has marginal coverage $1-\alpha/m$, then $$P(\text{any interval misses}) \leq \sum_{i=1}^m \alpha/m = \alpha.$$ So marginal $1-\alpha/m$ intervals become simultaneous $1-\alpha$ intervals.

In correlated Gaussian settings, a deduced confidence region can beat Bonferroni. For example, if $X \sim N_d(\theta,\Sigma)$ and $t_\alpha$ is the upper-$\alpha$ quantile of $\|X-\theta\|_\infty$, then $$[X_i-t_\alpha,\; X_i+t_\alpha]$$ gives exact simultaneous intervals for all coordinates. Strong positive correlation can make $t_\alpha$ noticeably smaller than the Bonferroni width.
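A sketch of this comparison under an equicorrelated covariance (the correlation $\rho = 0.9$ and dimension are my choices, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(2)
d, alpha, rho, reps = 20, 0.05, 0.9, 50000

# Equicorrelated covariance: strong positive correlation between coordinates.
Sigma = rho * np.ones((d, d)) + (1 - rho) * np.eye(d)
L = np.linalg.cholesky(Sigma)
X_minus_theta = rng.standard_normal((reps, d)) @ L.T

# t_alpha: upper-alpha quantile of the max-norm ||X - theta||_inf.
t_alpha = np.quantile(np.abs(X_minus_theta).max(axis=1), 1 - alpha)

# Bonferroni half-width: the (1 - alpha/d) quantile of |N(0,1)|,
# estimated here from fresh draws to stay self-contained.
z_bonf = np.quantile(np.abs(rng.standard_normal(reps * d)), 1 - alpha / d)
print(f"max-norm half-width {t_alpha:.2f} vs Bonferroni half-width {z_bonf:.2f}")
```

With $\rho = 0.9$ the max-norm half-width comes out well below the Bonferroni width, because the coordinates tend to miss or cover together.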


Bridge: Why FWER Can Be Too Strict

Lecture 19 solves the multiplicity problem by preventing even one false rejection. That is the cleanest extension of Type I error, but it may over-correct in exploratory settings.

In the prostate example, Bonferroni reduced 478 marginal hits to 3. That is wonderful if every false positive is unacceptable. It is less attractive if the scientific goal is to generate a manageable shortlist for follow-up experiments.

Lecture 20 keeps the multiple-testing mindset but changes the target. Instead of asking for the probability of zero false discoveries, it asks for control of the fraction of discoveries that are false.


Lecture 20: False Discovery Rate and Benjamini-Hochberg

20.1 Motivation: false fraction, not false existence

For the same $6{,}033$ prostate-gene tests:
| Marginal threshold | Total rejections | Expected false rejections (roughly $m\alpha$) | Estimated false fraction |
|---|---|---|---|
| $0.05$ | 478 | 302 | About 63% |
| $0.01$ | 172 | 60 | About 35% |
| $0.001$ | 60 | 6 | About 10% |
This suggests a different goal: keep the proportion of bad discoveries small, rather than insisting on no false discovery at all.

20.2 FDP and FDR

Let $R$ be the total number of rejections and $V$ the number of false rejections (true nulls rejected). Then the false discovery proportion is $$\FDP = \frac{V}{R \vee 1},$$ where $R\vee 1 = \max(R,1)$ so that $\FDP=0$ when $R=0$. The false discovery rate is $$\FDR = E\!\left[\frac{V}{R\vee 1}\right] = E[\FDP].$$
FWER asks: "did we make at least one false discovery?" FDR asks: "what fraction of our discoveries are false, on average?" They answer different scientific questions.

20.3 Why FDR is weaker than FWER

The discussion note gives the clean comparison. Consider two cases. If $R=0$, then $V=0$ and $\FDP=0$. If $R\geq 1$, then $\FDP = V/R \leq 1$, and $\FDP > 0$ only when $V>0$. So in every outcome, $$\FDP \leq \mathbf{1}\{V>0\}.$$ Taking expectations gives $$\FDR = E[\FDP] \leq E[\mathbf{1}\{V>0\}] = P(V>0) = \FWER.$$ Therefore every FWER-controlling procedure automatically controls FDR, but the reverse is not true.
This is why Bonferroni is usually much more conservative than BH. Bonferroni is solving a strictly harder problem.

20.4 The Benjamini-Hochberg (BH) procedure

Given $p$-values $p_1,\ldots,p_m$ and target FDR level $\alpha$:
  1. Sort them: $p_{(1)} \leq p_{(2)} \leq \cdots \leq p_{(m)}$.
  2. Find the largest $k$ such that $$p_{(k)} \leq \frac{\alpha k}{m}.$$
  3. Reject the hypotheses corresponding to $p_{(1)},\ldots,p_{(k)}$.
BH is a step-up rule. Bonferroni uses one horizontal cutoff $\alpha/m$. BH uses the rising line $$y=\frac{\alpha}{m}k,$$ so it becomes more permissive when many small $p$-values are present.
In the prostate data, BH at FDR $0.05$ rejects 21 genes, while Bonferroni rejects only 3. At FDR $0.10$, BH rejects 60 genes. So BH preserves much more power in large-scale screening.
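A minimal implementation of the three steps above (variable names and the toy $p$-values are mine):

```python
import numpy as np

def benjamini_hochberg(p, alpha):
    """Return a boolean mask of hypotheses rejected by BH at FDR level alpha."""
    p = np.asarray(p, dtype=float)
    m = p.size
    order = np.argsort(p)
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = int(np.nonzero(below)[0].max()) + 1  # largest k with p_(k) <= alpha*k/m
        reject[order[:k]] = True                 # reject the k smallest p-values
    return reject

# Toy p-values (hypothetical numbers), alpha = 0.05, m = 8:
p = [0.001, 0.010, 0.020, 0.026, 0.030, 0.035, 0.5, 0.9]
print(benjamini_hochberg(p, 0.05).sum())  # rejects 6 hypotheses
```

Note the step-up behavior: $p_{(3)} = 0.020$ exceeds its own line $3\alpha/m = 0.01875$, yet it is still rejected because a larger $k$ satisfies the condition.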

20.5 BH as estimated-FDP control

Suppose we reject all hypotheses with $p_i \leq t$. Let $R_t$ be the number of such rejections and $V_t$ the number of false ones. If null $p$-values are uniform, then $$E[V_t] = m_0 t \leq mt.$$ This suggests the conservative estimator $$\widehat{\FDP}_t = \frac{mt}{R_t \vee 1}.$$

BH chooses the largest threshold $t$ for which the estimated false-discovery proportion stays below $\alpha$. At a candidate threshold $t=p_{(k)}$, we have $R_t=k$, so the condition $$\widehat{\FDP}_t \leq \alpha$$ becomes $$\frac{mp_{(k)}}{k} \leq \alpha \quad\Longleftrightarrow\quad p_{(k)} \leq \frac{\alpha k}{m}.$$ So the BH rule is exactly the "control estimated FDP" idea written in ordered-$p$-value form.

Why do we only check thresholds at observed $p$-values? Between two consecutive ordered $p$-values, the rejection count $R_t$ does not change, so $\widehat{\FDP}_t = mt/R_t$ only gets larger as $t$ increases. The best threshold in each interval is therefore its left endpoint, namely an observed ordered $p$-value.
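The equivalence can be verified directly: pick the largest observed $p$-value whose estimated FDP is at most $\alpha$, and compare with the BH count. The simulated $p$-value mixture below is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
m, alpha = 200, 0.1

# Hypothetical mixture: 150 uniform null p-values plus 50 small non-nulls.
p = np.concatenate([rng.uniform(size=150), rng.beta(0.3, 8.0, size=50)])

sorted_p = np.sort(p)
k = np.arange(1, m + 1)                 # rejecting at t = p_(k) gives R_t = k

# Formulation 1: largest candidate threshold with estimated FDP <= alpha.
fdp_hat = m * sorted_p / k
ok = np.nonzero(fdp_hat <= alpha)[0]
n_est = 0 if ok.size == 0 else int(ok.max()) + 1

# Formulation 2: the BH rule, largest k with p_(k) <= alpha * k / m.
below = sorted_p <= alpha * k / m
n_bh = 0 if not below.any() else int(np.nonzero(below)[0].max()) + 1

print(n_est, n_bh)  # the two formulations give the same rejection count
```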

20.6 Proof sketch for FDR control

Under independence and null uniformity, BH controls $$\FDR \leq \alpha \frac{m_0}{m} \leq \alpha.$$
The course proof has four moves:
  1. Write $V=\sum_{i\in\mathcal{H}_0} V_i$, where $V_i=\mathbf{1}\{H_{0i}\text{ rejected}\}$.
  2. Show $\FDR = \sum_{i\in\mathcal{H}_0} E[V_i/(R\vee 1)]$.
  3. For a fixed true null $i$, replace $p_i$ by $0$ and call the resulting BH rejection count $R^0$. On the event that $i$ is rejected, the total number of BH rejections does not change, so $R=R^0$ there.
  4. Condition on the other $p$-values. Since $R^0$ depends only on $p_{-i}$ and $p_i\sim \text{Unif}(0,1)$ independently, $$E\!\left[\frac{V_i}{R\vee 1}\right] \leq \frac{\alpha}{m}.$$ Summing over the $m_0$ true nulls gives the result.

The miracle step is the conditioning trick: once the other $p$-values are fixed, the threshold seen by one null $p$-value behaves like a fixed number, and uniformity gives exactly the cancellation needed.
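A Monte Carlo sketch of the guarantee, with $m_0/m = 0.8$ and a non-null $p$-value distribution of my choosing:

```python
import numpy as np

rng = np.random.default_rng(4)
m, m0, alpha, reps = 100, 80, 0.2, 2000
null = np.zeros(m, dtype=bool)
null[:m0] = True                          # first m0 hypotheses are true nulls
k_grid = np.arange(1, m + 1)

fdp = np.empty(reps)
for r in range(reps):
    # Null p-values are Unif(0,1); non-null p-values are stochastically small.
    p = np.where(null, rng.uniform(size=m), rng.beta(0.2, 10.0, size=m))
    order = np.argsort(p)
    below = p[order] <= alpha * k_grid / m
    k = int(np.nonzero(below)[0].max()) + 1 if below.any() else 0
    rejected = np.zeros(m, dtype=bool)
    rejected[order[:k]] = True
    fdp[r] = (rejected & null).sum() / max(k, 1)   # V / (R v 1)

print(f"estimated FDR = {fdp.mean():.3f}, bound alpha*m0/m = {alpha * m0 / m:.3f}")
```

The average FDP concentrates near $\alpha m_0/m = 0.16$, below the nominal level $\alpha = 0.2$, matching the stated bound.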

20.7 Assumptions and dependence remarks

| Assumption | Role in the proof | Consequence |
|---|---|---|
| Null uniformity | Gives $P(p_i \leq t)=t$ for true nulls | Can be weakened to conservative null $p$-values |
| Independence | Makes $p_i$ independent of $p_{-i}$ and hence of $R^0$ | Harder to relax; BH remains valid under certain positive dependence conditions |
| Arbitrary dependence | Breaks the basic proof | A conservative fix is BH at level $\alpha/L_m$, where $L_m=\sum_{k=1}^m 1/k$ |
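The harmonic-number correction (this is the adjustment usually attributed to Benjamini and Yekutieli) is easy to compute; for the prostate-scale $m$:

```python
import numpy as np

# Under arbitrary dependence, run BH at level alpha / L_m,
# where L_m is the m-th harmonic number.
m, alpha = 6033, 0.05
L_m = np.sum(1.0 / np.arange(1, m + 1))
print(f"L_m = {L_m:.2f}, adjusted level = {alpha / L_m:.5f}")
```

Since $L_m \approx \ln m + 0.577$, the adjusted level shrinks only logarithmically in $m$, but the cost is still substantial: here the working level drops from $0.05$ to roughly $0.005$.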

Big Comparison Map

| Procedure / concept | What it controls | Rule or object | Typical use |
|---|---|---|---|
| Marginal testing | One-test Type I error only | Reject if $p_i \leq \alpha$ | Only appropriate when there is effectively one test |
| Bonferroni | FWER | Reject if $p_i \leq \alpha/m$ | Safe default under arbitrary dependence |
| Sidak | FWER | Reject if $p_i \leq 1-(1-\alpha)^{1/m}$ | Independence case; modest improvement over Bonferroni |
| Scheffe / deduced inference | FWER over a whole family of questions | Start from one joint confidence region | Simultaneous contrasts and geometric inference |
| Simultaneous CIs | Joint coverage of all target parameters | All intervals cover together with probability at least $1-\alpha$ | Report many intervals without hidden multiplicity inflation |
| Benjamini-Hochberg | FDR | Largest $k$ with $p_{(k)} \leq \alpha k/m$ | Exploratory screening with many discoveries |
Decision summary:
Use FWER language when even one false positive is costly. Use FDR language when you expect many discoveries and can tolerate a small false fraction among them.

Master Summary and Formula Sheet

Lecture 19 core formulas

| Concept | Formula | Comment |
|---|---|---|
| Familywise error rate | $\FWER_\theta = P_\theta(\mathcal{R}\cap\mathcal{H}_0 \neq \emptyset)$ | Probability of at least one false rejection |
| Bonferroni threshold | $\alpha/m$ | Comes directly from the union bound |
| Sidak threshold | $1-(1-\alpha)^{1/m}$ | Requires independence |
| Scheffe rule | Reject if $(\lambda^T X)^2 \geq \chi^2_{d,\alpha}$ | Controls infinitely many Gaussian contrasts |
| Simultaneous CI guarantee | $P(g_i(\theta)\in C_i(X)\text{ for all }i)\geq 1-\alpha$ | All intervals must cover together |

Lecture 20 core formulas

| Concept | Formula | Comment |
|---|---|---|
| False discovery proportion | $\FDP = V/(R\vee 1)$ | Defined as 0 when $R=0$ |
| False discovery rate | $\FDR = E[V/(R\vee 1)]$ | Expected false fraction |
| FDR-FWER comparison | $\FDR \leq \FWER$ | Because $\FDP \leq \mathbf{1}\{V>0\}$ |
| BH rejection rule | $p_{(k)} \leq \alpha k/m$ | Take the largest valid $k$ |
| Estimated FDP heuristic | $\widehat{\FDP}_t = mt/(R_t\vee 1)$ | BH picks the largest threshold with estimate $\leq \alpha$ |
| BH guarantee | $\FDR \leq \alpha m_0/m \leq \alpha$ | Under independence and null uniformity |
Fast memory aid:
Bonferroni protects against any false discovery. BH protects against too many false discoveries on average as a fraction.

Common Mistakes

1. Treating $\alpha/m$ as a magical recipe instead of a union-bound consequence.
If you remember the inequality $P(\cup A_i)\leq \sum P(A_i)$, you remember Bonferroni.
2. Saying FDR and FWER are basically the same.
They are not. FWER controls the chance of any false discovery; FDR controls the expected false proportion.
3. Forgetting the $R\vee 1$ convention.
When there are no rejections, the false discovery proportion is defined to be 0, not undefined.
4. Thinking BH guarantees there are no false positives.
No. BH allows false discoveries; it controls their average proportion.
5. Applying BH under arbitrary dependence without comment.
The standard proof needs independence (or stronger positive-dependence assumptions than were proved in class).
