By Lecture 18, the course had built a full toolbox for one question at a time: choose the right test, compute one $p$-value, and compare it to
one level $\alpha$. That framework works well when the scientific problem itself is singular.
Lecture 19 changes the unit of analysis. The issue is no longer "what is the right test for one hypothesis?" It is
"what should error control mean when we test dozens, thousands, or even infinitely many related hypotheses at once?"
This is the natural sequel to Lecture 18. Once the class has many testing procedures available, the next real-world obstacle is that modern data
analysis rarely stops after one test. Variable selection, genomics, A/B testing platforms, and simultaneous confidence statements all force us
to think at the level of a family of inferences.
The conceptual move is:
single-test Type I error -> familywise error -> simultaneous inference -> false discovery rate.
Lecture 19: Multiple Testing and Simultaneous Inference
19.1 Motivating example: prostate cancer gene expression
Singh et al. measured expression for $6{,}033$ genes in healthy controls and prostate cancer patients. Testing each gene separately at
$\alpha=0.05$ produced 478 significant genes. But if all $6{,}033$ nulls were true, we would still expect about $6{,}033 \cdot 0.05 \approx 302$
false positives just from noise.
So the existence of many small $p$-values does not automatically mean we have many trustworthy discoveries. The main danger is that
false rejections accumulate when the number of tests is large.
19.2 Setup and notation
We test $m$ null hypotheses $H_{01},\ldots,H_{0m}$ with $p$-values $p_1,\ldots,p_m$.
$\mathcal{R}(X)=\{i: H_{0i} \text{ is rejected}\}$ is the rejection set, with $R=|\mathcal{R}|$.
$\mathcal{H}_0=\{i: H_{0i} \text{ is true}\}$ is the set of true nulls, with $m_0=|\mathcal{H}_0|$.
At least one false rejection has occurred exactly when $\mathcal{R}\cap\mathcal{H}_0 \neq \emptyset$.
If all $m$ nulls are true and each test is run at level $\alpha$, then under independence: $$P(\text{at least one rejection}) =
1-(1-\alpha)^m.$$ For $m=6{,}033$ and $\alpha=0.05$, this is essentially 1.
A level-$0.05$ test sounds conservative in isolation. It is not conservative when repeated thousands of times. Multiple testing is mostly about
translating the words "Type I error" from the one-test world into the many-test world.
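To see how quickly the chance of at least one rejection grows, here is a minimal sketch of the formula $1-(1-\alpha)^m$, assuming all nulls are true and the tests are independent. The function name and the smaller family sizes are illustrative choices, not part of the lecture.

```python
def prob_any_rejection(m, alpha=0.05):
    """P(at least one rejection) for m independent level-alpha tests, all nulls true."""
    return 1 - (1 - alpha) ** m

# A few hypothetical family sizes, ending with the prostate study's m = 6033.
for m in (1, 10, 100, 6033):
    print(f"m={m:5d}  P(any rejection)={prob_any_rejection(m):.4f}")
```

Already at $m=100$ the probability is above 0.99, which is why the per-test level has to shrink as the family grows.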
19.3 Familywise error rate (FWER)
The familywise error rate is the probability of making any false rejection: $$\FWER_\theta =
P_\theta(\mathcal{R}\cap\mathcal{H}_0 \neq \emptyset).$$ We want procedures with $\sup_\theta \FWER_\theta \leq \alpha$.
FWER is strict. It treats one false rejection among a thousand true rejections as a failure. That makes sense in high-stakes confirmatory
settings, but it can become very conservative in screening problems.
19.4 Bonferroni and where $\alpha/m$ comes from
Bonferroni correction. Reject $H_{0i}$ only if $p_i \leq \alpha/m$. Then $$\sup_\theta \FWER_\theta \leq \alpha.$$
This is exactly the union-bound argument from discussion: $$P(\text{any false rejection}) = P\!\left(\bigcup_{i\in\mathcal{H}_0}\{\text{reject
}H_{0i}\}\right) \leq \sum_{i\in\mathcal{H}_0} P(\text{reject }H_{0i}).$$ Under Bonferroni, each true null is tested at level $\alpha/m$, so
$$\sum_{i\in\mathcal{H}_0} P(\text{reject }H_{0i}) \leq m_0 \frac{\alpha}{m} \leq \alpha.$$ That is the whole reason the denominator is $m$: it
is the price paid to make the sum of all possible false-rejection probabilities stay below $\alpha$.
We divide by the total number of tests $m$, not the number of true nulls $m_0$, because $m_0$ is unknown. If we knew $m_0$, we
could use $\alpha/m_0$, but in practice we do not.
In the prostate study, Bonferroni uses the threshold $$\frac{0.05}{6033} \approx 8.3\times 10^{-6}.$$ The 478 marginal discoveries collapse to
just 3 Bonferroni discoveries. So Bonferroni is valid and simple, but often extremely harsh.
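The Bonferroni rule and its FWER guarantee can be sketched as follows. This is an illustration only: it simulates the global null with independent Uniform(0,1) $p$-values (an assumption made for the simulation, since Bonferroni itself needs no independence), and the function names, the family size $m=100$, and the made-up $p$-values are hypothetical.

```python
import random

def bonferroni_reject(pvals, alpha=0.05):
    """Return the indices i with p_i <= alpha / m."""
    cutoff = alpha / len(pvals)
    return [i for i, p in enumerate(pvals) if p <= cutoff]

def estimate_fwer(m, cutoff, n_sims=2000, seed=0):
    """Monte Carlo FWER when all m nulls are true and p-values are iid Uniform(0,1)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        # Every null is true here, so any rejection is a false rejection.
        if any(rng.random() <= cutoff for _ in range(m)):
            hits += 1
    return hits / n_sims

m, alpha = 100, 0.05
fwer_marginal = estimate_fwer(m, cutoff=alpha)        # each test at level alpha
fwer_bonferroni = estimate_fwer(m, cutoff=alpha / m)  # each test at level alpha/m
print("marginal FWER estimate:  ", fwer_marginal)
print("Bonferroni FWER estimate:", fwer_bonferroni)
print(bonferroni_reject([1e-7, 0.004, 0.03, 0.2]))    # cutoff is 0.05/4 = 0.0125
```

The marginal rule makes a false rejection in almost every simulated dataset, while the Bonferroni estimate should land near or below $\alpha$, up to Monte Carlo noise.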
19.5 Sidak: what independence buys you
If the $p$-values are independent, the Sidak correction uses $$\tilde\alpha_m = 1-(1-\alpha)^{1/m}$$ and rejects when $p_i \leq
\tilde\alpha_m$.
Sidak is derived by asking for the probability of no false rejections to be at least $1-\alpha$ and multiplying independent terms. But
for small per-test thresholds, Sidak and Bonferroni are numerically very close. Independence helps a little, not dramatically.
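A quick numerical comparison makes the "helps a little" claim concrete. This is a small sketch; the function names are illustrative, and `log1p`/`expm1` are used only to keep the Sidak formula accurate when $m$ is large.

```python
import math

def bonferroni_cutoff(alpha, m):
    return alpha / m

def sidak_cutoff(alpha, m):
    # 1 - (1 - alpha)**(1/m), written with log1p/expm1 for numerical accuracy.
    return -math.expm1(math.log1p(-alpha) / m)

alpha = 0.05
for m in (2, 10, 100, 6033):
    b = bonferroni_cutoff(alpha, m)
    s = sidak_cutoff(alpha, m)
    print(f"m={m:5d}  Bonferroni={b:.3e}  Sidak={s:.3e}  ratio={s / b:.4f}")
```

At $\alpha=0.05$ the Sidak cutoff is larger than the Bonferroni cutoff by less than 3% for every $m$ shown.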
19.6 Scheffe's method and infinitely many related tests
Bonferroni works best when we can literally list the hypotheses. But some simultaneous inference problems involve infinitely many comparisons,
all built from the same data vector.
Suppose $X \sim N_d(\mu, I_d)$ and we want to test $$H_{0,\lambda}: \lambda^T \mu = 0$$ simultaneously for every unit vector $\lambda$.
Why unit vectors? Because scaling $\lambda$ does not change the underlying null statement: if $c \neq 0$, then $(c\lambda)^T\mu = 0$ is the same
hypothesis as $\lambda^T\mu=0$. Normalizing $\lambda$ just removes a meaningless scale choice.
Scheffe's method. Reject $H_{0,\lambda}$ when $$|\lambda^T X|^2 \geq \chi^2_{d,\alpha}.$$ This controls FWER at level $\alpha$
over all such contrasts.
Write $Z=X-\mu \sim N_d(0,I_d)$. If $H_{0,\lambda}$ is true, then $\lambda^T X = \lambda^T Z$. The worst possible contrast over all unit vectors
is bounded by $\|Z\|^2$, and $\|Z\|^2 \sim \chi^2_d$. One random confidence ball in $\mu$-space controls an infinite family of contrast tests at
once.
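The bound in the previous paragraph can be written as one chain. The only extra ingredients are the Cauchy-Schwarz inequality, which gives $\sup_{\|\lambda\|=1}|\lambda^T Z|^2=\|Z\|^2$ (attained at $\lambda=Z/\|Z\|$), and the reading of $\chi^2_{d,\alpha}$ as the upper-$\alpha$ quantile of $\chi^2_d$, which is assumed here from the form of the rejection rule.

$$
\begin{aligned}
P_\mu(\text{any false rejection})
 &= P\Bigl(\exists\,\lambda:\ \|\lambda\|=1,\ \lambda^T\mu=0,\ |\lambda^T Z|^2 \geq \chi^2_{d,\alpha}\Bigr) \\
 &\leq P\Bigl(\sup_{\|\lambda\|=1} |\lambda^T Z|^2 \geq \chi^2_{d,\alpha}\Bigr) \\
 &= P\bigl(\|Z\|^2 \geq \chi^2_{d,\alpha}\bigr) = \alpha .
\end{aligned}
$$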
19.7 Deduction principle and simultaneous confidence intervals
If $C(X)$ is any joint $(1-\alpha)$ confidence region for $\theta$, then any conclusion deduced under the assumption $\theta \in C(X)$ is
automatically FWER-controlled at level $\alpha$.
Intervals $C_1(X),\ldots,C_m(X)$ are simultaneous $(1-\alpha)$ confidence intervals for $g_1(\theta),\ldots,g_m(\theta)$ if
$$P_\theta\bigl(g_i(\theta)\in C_i(X)\text{ for all }i\bigr)\geq 1-\alpha.$$
Bonferroni also gives simultaneous intervals. If each individual interval has marginal coverage $1-\alpha/m$, then $$P(\text{any interval
misses}) \leq \sum_{i=1}^m \alpha/m = \alpha.$$ So marginal $1-\alpha/m$ intervals become simultaneous $1-\alpha$ intervals.
In correlated Gaussian settings, a deduced confidence region can beat Bonferroni. For example, if $X \sim N_d(\theta,\Sigma)$ and $t_\alpha$ is
the upper-$\alpha$ quantile of $\|X-\theta\|_\infty$, then $$[X_i-t_\alpha,\; X_i+t_\alpha]$$ gives exact simultaneous intervals for all
coordinates. Strong positive correlation can make $t_\alpha$ noticeably smaller than the Bonferroni width.
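The effect of correlation on $t_\alpha$ can be checked by simulation. The sketch below is an illustration under stated assumptions: unit variances, a common correlation $\rho$ between every pair of coordinates (generated with one shared factor), and a Monte Carlo quantile in place of the exact one. The function name and the choices $d=10$, $\rho=0.9$ are hypothetical.

```python
import random
from statistics import NormalDist

def max_abs_quantile(d, rho, alpha, n_sims=20000, seed=0):
    """Monte Carlo upper-alpha quantile of max_i |X_i - theta_i| for an
    equicorrelated Gaussian with unit variances and common correlation rho."""
    rng = random.Random(seed)
    a, b = rho ** 0.5, (1 - rho) ** 0.5
    draws = []
    for _ in range(n_sims):
        shared = rng.gauss(0, 1)  # common factor that induces the correlation
        draws.append(max(abs(a * shared + b * rng.gauss(0, 1)) for _ in range(d)))
    draws.sort()
    return draws[int((1 - alpha) * n_sims)]

d, alpha = 10, 0.05
# Bonferroni half-width: two-sided marginal interval at level alpha / d.
t_bonf = NormalDist().inv_cdf(1 - alpha / (2 * d))
t_indep = max_abs_quantile(d, rho=0.0, alpha=alpha)
t_corr = max_abs_quantile(d, rho=0.9, alpha=alpha)
print("Bonferroni half-width:     ", round(t_bonf, 3))
print("max-|X| quantile, rho=0:   ", round(t_indep, 3))
print("max-|X| quantile, rho=0.9: ", round(t_corr, 3))
```

With independent coordinates the simulated quantile sits very close to the Bonferroni half-width, while with $\rho=0.9$ it is visibly smaller, so the intervals $[X_i-t_\alpha,\,X_i+t_\alpha]$ are narrower.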
19.8 Why Bonferroni can be wasteful under correlation
Bonferroni is intentionally blind to dependence. It protects against a false rejection in each coordinate separately, then adds those risks by
the union bound. That is why it is always safe, but it can spend probability budget on combinations of errors that the joint distribution almost
never produces.
Figure (not reproduced here): blue boxes represent axis-aligned Bonferroni-style simultaneous intervals, and green shapes represent the actual joint error cloud. When the
coordinates are strongly correlated, the joint cloud is thinner than the box, so a method that uses the covariance structure can be tighter.
The firefly picture: the data point moves inside a joint error cloud. If the cloud is round, an axis-aligned box is not wildly inefficient. If
the cloud is a long ellipse, Bonferroni still builds a box wide enough to cover the extreme coordinate tips, including corners where the firefly
essentially cannot go. That unused corner area is the "wasted" probability budget.
Dependence has two slightly different stories here. For two-sided simultaneous coordinate intervals, strong correlation of either sign can make
the effective dimension smaller and make Bonferroni conservative. For one-sided upper-tail testing events, positive correlation makes rejection
events overlap more, while negative dependence can make them more disjoint; in that case the union bound can be closer to sharp. Always ask:
which events are overlapping?
Bridge: Why FWER Can Be Too Strict
Lecture 19 solves the multiplicity problem by preventing even one false rejection. That is the cleanest extension of Type I error, but it may
over-correct in exploratory settings.
In the prostate example, Bonferroni reduced 478 marginal hits to 3. That is wonderful if every false positive is unacceptable. It is less
attractive if the scientific goal is to generate a manageable shortlist for follow-up experiments.
Lecture 20 keeps the multiple-testing mindset but changes the target. Instead of asking for the probability of zero false discoveries, it asks
for control of the fraction of discoveries that are false.
Lecture 20: False Discovery Rate and Benjamini-Hochberg
20.1 Motivation: false fraction, not false existence
For the same $6{,}033$ prostate-gene tests:
| Marginal threshold | Total rejections | Expected false rejections (roughly $m\alpha$) | Estimated false fraction |
| --- | --- | --- | --- |
| $0.05$ | 478 | 302 | About 63% |
| $0.01$ | 172 | 60 | About 35% |
| $0.001$ | 60 | 6 | About 10% |
This suggests a different goal: keep the proportion of bad discoveries small, rather than insisting on no false discovery at all.
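The last two columns of the table are simple arithmetic on the first two. A minimal sketch, assuming (as the table does) that roughly all $m$ nulls are true so that about $m\alpha$ rejections are false; the function name is illustrative.

```python
def estimated_false_fraction(m, alpha, n_rejected):
    """Rough false fraction m * alpha / R, treating (nearly) all m nulls as true."""
    return m * alpha / n_rejected

m = 6033
# (marginal threshold, total rejections) pairs from the prostate example.
rows = [(0.05, 478), (0.01, 172), (0.001, 60)]
for alpha, n_rejected in rows:
    frac = estimated_false_fraction(m, alpha, n_rejected)
    print(f"alpha={alpha}: about {m * alpha:.0f} expected false, fraction about {frac:.0%}")
```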
20.2 FDP and FDR
Let
$R = |\mathcal{R}|$: total number of discoveries,
$V = |\mathcal{R}\cap\mathcal{H}_0|$: number of false discoveries.
Then the false discovery proportion is $$\FDP = \frac{V}{R \vee 1},$$ where $R\vee 1 = \max(R,1)$ so that $\FDP=0$ when $R=0$.
The false discovery rate is $$\FDR = E\!\left[\frac{V}{R\vee 1}\right] = E[\FDP].$$
FWER asks: "did we make at least one false discovery?" FDR asks: "what fraction of our discoveries are false, on average?" They answer different
scientific questions.
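The definitions translate directly into code. The sketch below computes $V$, $R$, and $\FDP$ from a rejection set and a set of true nulls; in practice the true nulls are unknown, so this is only usable in simulations. The function name and the ten-hypothesis example are hypothetical.

```python
def false_discovery_proportion(rejected, true_nulls):
    """FDP = V / max(R, 1), with V = |rejected & true_nulls| and R = |rejected|."""
    rejected, true_nulls = set(rejected), set(true_nulls)
    v = len(rejected & true_nulls)   # false discoveries
    r = len(rejected)                # all discoveries
    return v / max(r, 1)

# Hypothetical family: hypotheses 0..9, of which 0..5 are true nulls.
true_nulls = range(6)
print(false_discovery_proportion([6, 7, 8, 2], true_nulls))  # one false out of four
print(false_discovery_proportion([], true_nulls))            # no rejections, FDP is 0
```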
20.3 Why FDR is weaker than FWER
Your discussion note gives the clean comparison. Consider two cases:
If $V=0$, then $\FDP=0$.
If $V>0$, then $0 < \FDP = V/(R\vee 1) \leq 1 = \mathbf{1}\{V>0\}$.
So in every outcome, $$\FDP \leq \mathbf{1}\{V>0\}.$$ Taking expectations gives $$\FDR = E[\FDP] \leq E[\mathbf{1}\{V>0\}] = P(V>0) =
\FWER.$$ Therefore every FWER-controlling procedure automatically controls FDR, but the reverse is not true.
This is why Bonferroni is usually much more conservative than BH. Bonferroni is solving a strictly harder problem.
20.4 The Benjamini-Hochberg (BH) procedure
Given $p$-values $p_1,\ldots,p_m$ and target FDR level $\alpha$:
Find the largest $k$ such that $$p_{(k)} \leq \frac{\alpha k}{m}.$$
Reject the hypotheses corresponding to $p_{(1)},\ldots,p_{(k)}$.
BH is a step-up rule. Bonferroni uses one horizontal cutoff $\alpha/m$. BH uses the rising line $$y=\frac{\alpha}{m}k,$$ so it
becomes more permissive when many small $p$-values are present.
In the prostate data, BH at FDR $0.05$ rejects 21 genes, while Bonferroni rejects only 3. At FDR $0.10$, BH rejects 60 genes. So BH preserves
much more power in large-scale screening.
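The two-step BH rule can be sketched in a few lines. This is a minimal illustration, not a reference implementation; the function name and the five made-up $p$-values are hypothetical.

```python
def benjamini_hochberg(pvals, alpha=0.05):
    """Return the indices rejected by the BH step-up rule at FDR level alpha."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices by increasing p-value
    k_star = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= alpha * rank / m:
            k_star = rank  # keep the largest rank that passes its threshold
    return sorted(order[:k_star])

pvals = [0.001, 0.025, 0.028, 0.035, 0.5]  # made-up p-values, m = 5
print("BH:        ", benjamini_hochberg(pvals, alpha=0.05))
print("Bonferroni:", [i for i, p in enumerate(pvals) if p <= 0.05 / len(pvals)])
```

In this example the second-smallest $p$-value (0.025) misses its own threshold $2\alpha/m = 0.02$, yet it is still rejected because the fourth-smallest passes its threshold $4\alpha/m = 0.04$. That is what "step-up" means: the largest passing rank decides, and everything below it is rejected.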
20.5 BH as estimated-FDP control
Suppose we reject all hypotheses with $p_i \leq t$. Let $R_t$ be the number of such rejections and $V_t$ the number of false ones. If null
$p$-values are uniform, then $$E[V_t] = m_0 t \leq mt.$$ This suggests the conservative estimator $$\widehat{\FDP}_t = \frac{mt}{R_t \vee 1}.$$
BH chooses the largest threshold $t$ for which the estimated false-discovery proportion is at most $\alpha$. At a candidate threshold
$t=p_{(k)}$, we have $R_t=k$, so the condition $$\widehat{\FDP}_t \leq \alpha$$ becomes $$\frac{mp_{(k)}}{k} \leq \alpha
\quad\Longleftrightarrow\quad p_{(k)} \leq \frac{\alpha k}{m}.$$ So the BH rule is exactly the "control estimated FDP" idea written in
ordered-$p$-value form.
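Why do we only check thresholds at observed $p$-values? Between two consecutive ordered $p$-values, the rejection count $R_t$ does not change,
so $\widehat{\FDP}_t = mt/R_t$ only gets larger as $t$ increases. Every $t$ in such an interval yields the same rejection set, and the estimate is
smallest at the left endpoint, so if any $t$ in the interval passes the check, the observed $p$-value at its left endpoint passes too. Checking
the observed ordered $p$-values is therefore enough.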
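The estimated-FDP view can be run directly: scan the observed $p$-values as candidate thresholds and keep the largest one whose estimate is at most $\alpha$. A minimal sketch with hypothetical function names and made-up $p$-values:

```python
def estimated_fdp(pvals, t):
    """Conservative estimate m * t / max(R_t, 1), where R_t = #{i : p_i <= t}."""
    m = len(pvals)
    r_t = sum(p <= t for p in pvals)
    return m * t / max(r_t, 1)

def largest_passing_threshold(pvals, alpha):
    """Largest observed p-value t with estimated FDP <= alpha (None if none passes)."""
    passing = [t for t in pvals if estimated_fdp(pvals, t) <= alpha]
    return max(passing) if passing else None

pvals = [0.001, 0.025, 0.028, 0.035, 0.5]  # made-up p-values, m = 5
for t in sorted(pvals):
    print(f"t={t:<6} estimated FDP={estimated_fdp(pvals, t):.4f}")

t_star = largest_passing_threshold(pvals, alpha=0.05)
rejected = [i for i, p in enumerate(pvals) if p <= t_star]
print("chosen threshold:", t_star, "rejected indices:", rejected)
```

The estimate is not monotone in $t$: it fails at $t=0.025$ but passes again at $t=0.028$ and $t=0.035$. Taking the largest passing threshold gives the same four rejections as the ordered rule "largest $k$ with $p_{(k)}\leq\alpha k/m$", which here is $k=4$.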
20.6 Proof sketch for FDR control
Under independence and null uniformity, BH controls $$\FDR \leq \alpha \frac{m_0}{m} \leq \alpha.$$
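The bound can be checked by simulation before reading the proof. The sketch below is an illustration under assumptions that are not from the lecture: $m=50$ tests with $m_0=40$ true nulls, independent $p$-values, and alternative $p$-values generated as a uniform draw raised to the sixth power (an arbitrary way to push them toward zero).

```python
import random

def bh_num_rejections(sorted_pvals, alpha):
    """Largest k with p_(k) <= alpha * k / m, given p-values in increasing order."""
    m = len(sorted_pvals)
    k_star = 0
    for k, p in enumerate(sorted_pvals, start=1):
        if p <= alpha * k / m:
            k_star = k
    return k_star

def simulate_fdr(m=50, m0=40, alpha=0.10, n_sims=4000, seed=0):
    """Average FDP of BH over simulated datasets with m0 true nulls."""
    rng = random.Random(seed)
    total_fdp = 0.0
    for _ in range(n_sims):
        nulls = [rng.random() for _ in range(m0)]          # Uniform(0,1) nulls
        alts = [rng.random() ** 6 for _ in range(m - m0)]  # assumed alternative model
        ordered = sorted(nulls + alts)
        k = bh_num_rejections(ordered, alpha)
        if k > 0:
            cutoff = ordered[k - 1]                        # p_(k): reject p <= cutoff
            v = sum(p <= cutoff for p in nulls)            # false discoveries
            total_fdp += v / k
    return total_fdp / n_sims

fdr_hat = simulate_fdr()
print("simulated FDR:", round(fdr_hat, 3), " bound alpha*m0/m:", 0.10 * 40 / 50)
```

The simulated FDR should sit close to $\alpha m_0/m = 0.08$, below the nominal $\alpha=0.10$, up to Monte Carlo noise.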
The course proof has four moves:
Write $V=\sum_{i\in\mathcal{H}_0} V_i$, where $V_i=\mathbf{1}\{H_{0i}\text{ rejected}\}$.
Show $\FDR = \sum_{i\in\mathcal{H}_0} E[V_i/(R\vee 1)]$.
For a fixed true null $i$, replace $p_i$ by $0$ and call the resulting BH rejection count $R^0$. On the event that $i$ is rejected, the
total number of BH rejections does not change, so $R=R^0$ there.
Condition on the other $p$-values. Since $R^0$ depends only on $p_{-i}$ and $p_i\sim \text{Uniform}(0,1)$ independently,
$$E\!\left[\frac{V_i}{R\vee 1}\right] \leq \frac{\alpha}{m}.$$ Summing over the $m_0$ true nulls gives the result.
The miracle step is the conditioning trick: once the other $p$-values are fixed, the threshold seen by one null $p$-value behaves like a fixed
number, and uniformity gives exactly the cancellation needed.
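Written out, the conditioning step uses two facts. First, $R^0\geq 1$, because the $p$-value that was set to $0$ is always rejected. Second, $H_{0i}$ is rejected exactly when $p_i \leq \alpha R^0/m$, and on that event $R=R^0$. Given $p_{-i}$, the quantity $R^0$ is a fixed number, so for a uniform null $p_i$:

$$
\begin{aligned}
E\!\left[\frac{V_i}{R\vee 1}\,\Big|\,p_{-i}\right]
 &= E\!\left[\frac{\mathbf{1}\{p_i \leq \alpha R^0/m\}}{R^0}\,\Big|\,p_{-i}\right] \\
 &= \frac{1}{R^0}\,P\!\left(p_i \leq \frac{\alpha R^0}{m}\,\Big|\,p_{-i}\right)
  = \frac{1}{R^0}\cdot\frac{\alpha R^0}{m} = \frac{\alpha}{m}.
\end{aligned}
$$

Averaging over $p_{-i}$ and summing over the $m_0$ true nulls gives $\FDR = \alpha m_0/m$. If a null $p$-value is only conservative, meaning $P(p_i\leq t)\leq t$, the middle equality becomes an inequality and the bound $\FDR\leq\alpha m_0/m$ still holds.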
20.7 Assumptions and dependence remarks
| Assumption | Role in the proof | Consequence |
| --- | --- | --- |
| Null uniformity | Gives $P(p_i \leq t)=t$ for true nulls | Can be weakened to conservative null $p$-values |
| Independence | Makes $p_i$ independent of $p_{-i}$ and hence of $R^0$ | Harder to relax; BH remains valid under certain positive dependence conditions |
| Arbitrary dependence | Breaks the basic proof | A conservative fix is BH at level $\alpha/L_m$, where $L_m=\sum_{k=1}^m 1/k$ |
"Positive dependence" in the BH theorem is a specific technical condition, not just "the scatterplot slopes upward." The standard lecture
proof used full independence. A more advanced theorem covers positive regression dependence on the subset of true nulls (PRDS): roughly, a
larger value of one true-null $p$-value must not make the other $p$-values tend to be smaller, so the BH threshold cannot move perversely
against that null. Arbitrary or strongly negative dependence is not covered by the basic proof.
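The cost of the arbitrary-dependence correction is easy to quantify, since $L_m$ grows like $\log m$. A small sketch with illustrative function names, evaluated at the prostate study's $m=6033$:

```python
def harmonic_number(m):
    """L_m = 1 + 1/2 + ... + 1/m."""
    return sum(1.0 / k for k in range(1, m + 1))

def corrected_level(alpha, m):
    """Reduced target level alpha / L_m for running BH under arbitrary dependence."""
    return alpha / harmonic_number(m)

m, alpha = 6033, 0.05
print("L_m:", round(harmonic_number(m), 3))
print("corrected level:", corrected_level(alpha, m))
```

For $m=6033$ the correction divides the target level by a bit more than 9, which is much milder than Bonferroni's division by $m$ but still a real loss of power.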
Study Map: Procedures, Guarantees, and Assumptions
For studying, every method in Lectures 19-20 should be remembered through five questions: what is the target error concept, what is the
procedure, what does it achieve, what assumptions does the guarantee need, and how does it relate to the other procedures?
| Method | Process | Achieves | Assumes / proof needs | Relationship |
| --- | --- | --- | --- | --- |
| Marginal testing | Reject $H_{0i}$ when $p_i\leq\alpha$. | Controls each individual Type I error. | Only needs each true-null $p_i$ to be valid for its own test. | Baseline; fails to control family-level error when $m$ is large. |
| Bonferroni | Reject $H_{0i}$ when $p_i\leq\alpha/m$. | Controls FWER: $P(V>0)\leq\alpha$. | Valid true-null $p$-values. No independence needed. | Union-bound method; safest and often most conservative. |
| Sidak | Reject $H_{0i}$ when $p_i\leq1-(1-\alpha)^{1/m}$. | Controls FWER, with a slightly larger cutoff than Bonferroni. | Independence among true-null $p$-values, plus null uniformity or conservativeness. | Independence-refined Bonferroni; usually numerically close. |
| Scheffe | Build a $\chi^2$ confidence ball, then test all contrasts from that ball. | Controls FWER over infinitely many Gaussian contrasts. | Gaussian linear-contrast setup and known covariance structure in the lecture version. | Shows that geometry can replace counting hypotheses. |
| Deduction principle | Start with one joint $1-\alpha$ confidence region and deduce all valid conclusions from it. | Any wrong deduced conclusion occurs only if the joint region missed $\theta$. | The starting confidence region must have joint coverage at least $1-\alpha$. | Generalizes Scheffe and simultaneous intervals. |
| BH | Sort $p$-values and take the largest $k$ with $p_{(k)}\leq \alpha k/m$. | Controls FDR: $\FDR\leq\alpha m_0/m\leq\alpha$. | Lecture proof needs independence and null uniformity; uniformity can relax to conservative null $p$-values. | Step-up estimated-FDP control; more powerful than FWER methods for screening. |
| BH under dependence | Use ordinary BH only under recognized positive-dependence settings (such as PRDS); otherwise reduce the target level. | PRDS-type positive dependence still allows BH control; arbitrary dependence needs a correction. | Arbitrary dependence uses the conservative Benjamini-Yekutieli level $\alpha/L_m$, $L_m=\sum_{k=1}^m1/k$. | This is the FDR analogue of asking when dependence breaks or preserves the proof. |
Relationship in one line:
Bonferroni/Sidak/Scheffe/deduction control the chance of any false claim; BH controls the expected false fraction among the claims you make.
Big Comparison Map
| Procedure / concept | What it controls | Rule or object | Typical use |
| --- | --- | --- | --- |
| Marginal testing | One-test Type I error only | Reject if $p_i \leq \alpha$ | Only appropriate when there is effectively one test |
| Bonferroni | FWER | Reject if $p_i \leq \alpha/m$ | Safe default under arbitrary dependence |
| Sidak | FWER | Reject if $p_i \leq 1-(1-\alpha)^{1/m}$ | Independence case; modest improvement over Bonferroni |
| Scheffe / deduced inference | FWER over a whole family of questions | Start from one joint confidence region | Simultaneous contrasts and geometric inference |
| Simultaneous CIs | Joint coverage of all target parameters | All intervals cover together with probability at least $1-\alpha$ | Report many intervals without hidden multiplicity inflation |
| Benjamini-Hochberg | FDR | Largest $k$ with $p_{(k)} \leq \alpha k/m$ | Exploratory screening with many discoveries |
Decision summary:
Use FWER language when even one false positive is costly. Use FDR language when you expect many discoveries and can tolerate a small false
fraction among them.
| Concept | Formula | Comment |
| --- | --- | --- |
| Scheffe rule | Reject if $\lvert\lambda^T X\rvert^2 \geq \chi^2_{d,\alpha}$ | Controls infinitely many Gaussian contrasts |
| Simultaneous CI guarantee | $P(g_i(\theta)\in C_i(X)\text{ for all }i)\geq 1-\alpha$ | All intervals must cover together |
Lecture 20 core formulas
| Concept | Formula | Comment |
| --- | --- | --- |
| False discovery proportion | $\FDP = V/(R\vee 1)$ | Defined as 0 when $R=0$ |
| False discovery rate | $\FDR = E[V/(R\vee 1)]$ | Expected false fraction |
| FDR-FWER comparison | $\FDR \leq \FWER$ | Because $\FDP \leq \mathbf{1}\{V>0\}$ |
| BH rejection rule | $p_{(k)} \leq \alpha k/m$ | Take the largest valid $k$ |
| Estimated FDP heuristic | $\widehat{\FDP}_t = mt/(R_t\vee 1)$ | BH picks the largest threshold with estimate $\leq \alpha$ |
| BH guarantee | $\FDR \leq \alpha m_0/m \leq \alpha$ | Under independence and null uniformity |
Fast memory aid:
Bonferroni protects against any false discovery. BH protects against too many false discoveries on average as a fraction.
Common Mistakes
1. Treating $\alpha/m$ as a magical recipe instead of a union-bound consequence.
If you remember the inequality $P(\cup A_i)\leq \sum P(A_i)$, you remember Bonferroni.
2. Saying FDR and FWER are basically the same.
They are not. FWER controls the chance of any false discovery; FDR controls the expected false proportion.
3. Forgetting the $R\vee 1$ convention.
When there are no rejections, the false discovery proportion is defined to be 0, not undefined.
4. Thinking BH guarantees there are no false positives.
No. BH allows false discoveries; it controls their average proportion.
5. Applying BH under arbitrary dependence without comment.
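The lecture proof needs independence. BH also stays valid under PRDS-type positive dependence, but that requires a more advanced theorem than was proved in class; under arbitrary dependence, use the corrected level $\alpha/L_m$.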
Data 145 Study Guide - Lectures 19-20 - Standalone Review Version