Data 145: Evidence and Uncertainty

Comprehensive Study Guide — Lectures 14 through 15 · Spring 2026
Instructors: Ani Adhikari, William Fithian

Table of Contents

  1. Bridge: Lecture 13 to Lecture 14
  2. Lecture 14: p-values, Confidence Regions, and Test-CI Duality
  3. Bridge: Lecture 14 to Lecture 15
  4. Lecture 15: Asymptotic Tests (Wald, Score, GLRT)
  5. Big Transition Map
  6. Master Summary & Formula Sheet
  7. Common Mistakes to Avoid

Bridge: Lecture 13 to Lecture 14

By the end of Lecture 13, the course already has the full testing language: null/alternative, Type I and II error, level, power, NP lemma, and MLR/UMP logic. That part of the course solves an optimization problem: among all tests with controlled Type I error, which rejection rule is strongest against the alternative?

Lecture 13 solves: "what rejection region is best?"
Lecture 14 asks the next natural question: "once we have a test, how should we summarize the evidence in one dataset?"

In practice, a binary reject/do-not-reject answer is too thin. We usually want two richer outputs: a scalar evidence summary telling us how extreme the data are under the null, and a set-valued uncertainty summary telling us which parameter values still look plausible.

This is why Lecture 14 does not replace hypothesis testing; it asks what testing output should look like after the rejection rule has already been designed. P-values and confidence regions are the reporting layer built on top of the Lecture 12-13 testing framework.


Lecture 14: p-values, Confidence Regions, and Test-CI Duality

14.1 p-values formalized

If a test rejects for large values of a statistic $T(X)$, then for a simple null: $$p(x) = P_{H_0}(T(X) \ge T(x)).$$ For a composite null $H_0: \theta \in \Theta_0$: $$p(x) = \sup_{\theta\in\Theta_0} P_\theta(T(X) \ge T(x)).$$
A statistic $p(X)$ is a valid p-value if it is super-uniform under the null: $$P_\theta\bigl(p(X)\le\alpha\bigr) \le \alpha \quad\text{for all }\alpha\in[0,1],\ \theta\in\Theta_0.$$ So rejecting when $p\le\alpha$ always gives a valid level-$\alpha$ test.
A p-value is not $P(H_0\mid\text{data})$.
It is a tail probability of data extremeness assuming $H_0$ is true.
A p-value is not an intrinsic property of the dataset alone. It depends on the chosen test statistic $T(X)$, because $T$ determines what it means for data to be "more extreme" than what was observed.
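The super-uniformity guarantee is easy to check by simulation. A minimal sketch (illustrative numbers, not from class) using a one-sided binomial test of $H_0: p=0.5$:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, reps, alpha = 50, 20_000, 0.05

# Draw data under H0: p = 0.5; the test statistic is T(X) = number of heads,
# with rejection for large T.
heads = rng.binomial(n, 0.5, size=reps)

# One-sided p-value p(x) = P_{H0}(T >= t_obs): a tail probability under the
# null, NOT P(H0 | data).
pvals = stats.binom.sf(heads - 1, n, 0.5)   # sf(t - 1) = P(T >= t)

# Super-uniformity: P_{H0}(p <= alpha) <= alpha, so "reject iff p <= alpha"
# controls Type I error (strictly below alpha here, since T is discrete).
frac = float(np.mean(pvals <= alpha))
print(frac)
```

Because the binomial is discrete, the realized rejection rate sits strictly below $\alpha$, which is exactly what the inequality (rather than equality) allows.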

14.2 Why p-values and confidence intervals can disagree with intuition

Two binomial scenarios from class:
| Scenario | $n$ | Heads | $\hat p$ | Two-sided p-value for $H_0:p=0.5$ | 95% CI (normal approx) |
|---|---|---|---|---|---|
| A | 50 | 29 | 0.58 | 0.3222 | [0.443, 0.717] |
| B | 5000 | 2600 | 0.52 | 0.0049 | [0.506, 0.534] |
Scenario A has a larger observed departure from 0.5 but weak evidence because the sample is small. Scenario B has a tiny departure but strong evidence because precision is high.
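The numbers in the table can be reproduced directly; a sketch using scipy's exact two-sided binomial test together with the normal-approximation interval:

```python
import numpy as np
from scipy import stats

results = []
for n, k in [(50, 29), (5000, 2600)]:
    phat = k / n
    # Exact two-sided binomial test of H0: p = 0.5.
    pval = stats.binomtest(k, n, 0.5).pvalue
    # Normal-approximation (Wald) 95% CI for p.
    se = np.sqrt(phat * (1 - phat) / n)
    lo, hi = phat - 1.96 * se, phat + 1.96 * se
    results.append(pval)
    print(f"n={n}: p-hat={phat:.2f}  p-value={pval:.4f}  CI=[{lo:.3f}, {hi:.3f}]")
```

Running this recovers both rows: the large observed departure with $n=50$ is weak evidence, while the tiny departure with $n=5000$ is strong evidence.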

14.3 Confidence regions

A $(1-\alpha)$ confidence region for $g(\theta)$ is a random set $C(X)$ such that $$P_\theta\bigl(C(X)\ni g(\theta)\bigr) \ge 1-\alpha \quad \text{for all }\theta.$$
The random object is the interval/region $C(X)$; the true parameter value is fixed. So the 95% statement is about the procedure over repeated samples, not a posterior probability for one realized interval.
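Coverage-as-procedure can be seen by simulation: draw many datasets from one fixed truth and count how often the random interval catches it. A sketch with a known-variance normal mean (illustrative numbers, not from class):

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps, z = 50, 10_000, 1.96
mu_true = 3.0                      # fixed truth; the interval is what is random

covered = 0
for _ in range(reps):
    x = rng.normal(mu_true, 1.0, size=n)
    half = z * 1.0 / np.sqrt(n)    # known sigma = 1
    covered += (x.mean() - half <= mu_true <= x.mean() + half)

coverage = covered / reps
print(coverage)   # close to 0.95: a statement about the procedure
```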

14.4 Test-CI duality (the central structural idea)

If $\phi(X;a)$ is a level-$\alpha$ test of $H_0:g(\theta)=a$, then $$C(X)=\{a:\phi(X;a)<1\}$$ is a valid $(1-\alpha)$ confidence region.
Conversely, if $C(X)$ is a valid $(1-\alpha)$ confidence region, then $$\phi(x;a)=\mathbf 1\{a\notin C(x)\}$$ is a valid level-$\alpha$ test.
A confidence interval is exactly the set of null values that the data do not reject at level $\alpha$.
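Duality gives an immediate, if brute-force, recipe for a confidence set: scan candidate null values and keep the ones a level-$\alpha$ test does not reject. A sketch for the Scenario A data ($k=29$, $n=50$), using an exact test over a grid (the grid scan is an illustration, not the class's construction):

```python
import numpy as np
from scipy import stats

def inverted_ci(k, n, alpha=0.05):
    """Confidence set = all null values a whose level-alpha test does NOT reject.

    Brute-force grid scan; returns the hull of the kept values.
    """
    grid = np.linspace(0.001, 0.999, 999)
    keep = [a for a in grid if stats.binomtest(k, n, a).pvalue > alpha]
    return min(keep), max(keep)

lo, hi = inverted_ci(29, 50)   # Scenario A data
print(f"exact inverted 95% CI: [{lo:.3f}, {hi:.3f}]")
```

The resulting interval differs slightly from the normal-approximation one in the table because a different family of tests was inverted, which is the duality point in miniature.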

14.5 What to carry forward

Lecture 14 is mostly structural, not yet constructive. It explains how p-values, tests, and confidence sets are supposed to fit together, but it does not yet tell us which test family to use in a parametric model.

So the next question is practical: with likelihoods, scores, Fisher information, and MLE asymptotics already available from Lectures 3-5, what concrete tests should we actually build and invert?


Bridge: Lecture 14 to Lecture 15

Lecture 14 gives a recipe: for each candidate null value, test it at level $\alpha$, then collect the values not rejected. That recipe is only useful once we know how to manufacture a good test for every possible null value.

Lecture 14 says: confidence regions come from inverting tests.
Lecture 15 says: in parametric models, the local shape of the log-likelihood gives three canonical asymptotic tests to invert.

This bridge directly reuses Lectures 3-5: log-likelihood, score, Fisher information, and MLE asymptotics. The same local quadratic approximation that explained why the MLE is asymptotically normal will now explain why Wald, Score, and GLRT are all valid and closely related.


Lecture 15: Asymptotic Tests (Wald, Score, GLRT)

15.1 Setup recap from earlier lectures

For i.i.d. $X_1,\ldots,X_n\sim f_\theta$: $$\ell_n(\theta)=\sum_{i=1}^n\log f_\theta(X_i),\qquad S_n(\theta)=\ell_n'(\theta),\qquad I(\theta)=\mathrm{Var}_\theta\!\left(\tfrac{\partial}{\partial\theta}\log f_\theta(X_1)\right).$$ $$\sqrt n(\hat\theta_n-\theta_0)\xrightarrow{d}N\!\left(0,\frac{1}{I(\theta_0)}\right),\qquad \frac{S_n(\theta_0)}{\sqrt{nI(\theta_0)}}\xrightarrow{d}N(0,1).$$
Near $\hat\theta_n$, the log-likelihood is approximately quadratic. Wald, Score, and GLRT are three ways to measure different geometric aspects of this same local shape.

This is the key reuse of earlier material. In Lectures 4-5, the quadratic approximation around the truth produced the asymptotic distribution of the MLE. In Lecture 15, that same parabola is read in three different ways: how far the null is from the peak, how steep the likelihood is at the null, and how much log-likelihood is lost by forcing the null to hold.
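The quadratic approximation can be checked numerically. A sketch using the Poisson log-likelihood with the running example's numbers ($n=30$, $\bar X=3.07$); constants not involving $\lambda$ are dropped:

```python
import numpy as np

# Poisson log-likelihood with constants (log x_i!) dropped:
# l_n(lam) = n * (xbar * log(lam) - lam)
n, xbar = 30, 3.07
loglik = lambda lam: n * (xbar * np.log(lam) - lam)

mle = xbar                  # Poisson MLE is the sample mean
curv = -n / xbar            # l_n''(mle) = -n * xbar / mle**2 = -n / xbar
quad = lambda lam: loglik(mle) + 0.5 * curv * (lam - mle) ** 2

# The parabola is excellent near the MLE and degrades farther away.
near = abs(loglik(3.0) - quad(3.0))
far = abs(loglik(2.0) - quad(2.0))
print(near, far)
```

The growing gap away from $\hat\lambda$ is precisely why Wald, Score, and GLRT agree asymptotically but differ in finite samples.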

15.2 Wald test

$$W=\frac{\hat\theta_n-\theta_0}{\widehat{\mathrm{SE}}(\hat\theta_n)}.$$ Under $H_0$ and the usual regularity conditions, $W\xrightarrow{d}N(0,1)$; a two-sided test rejects for large $|W|$.

Standard error choices discussed in class:

- Plug-in expected information: $\widehat{\mathrm{SE}} = 1/\sqrt{nI(\hat\theta_n)}$.
- Observed information: $\widehat{\mathrm{SE}} = 1/\sqrt{-\ell_n''(\hat\theta_n)}$.

Either choice is consistent, so the resulting Wald tests are asymptotically equivalent.
Wald is convenient but can be sensitive to parameterization and can yield poor finite-sample behavior near boundaries.
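As a concrete sketch, the Wald statistic for a Poisson null $H_0:\lambda=\lambda_0$, using the plug-in SE $\sqrt{\hat\lambda/n}$ (since $I(\lambda)=1/\lambda$), with the running example's numbers and a hypothetical null value $\lambda_0=2$:

```python
import numpy as np

# Wald test of H0: lam = lam0 for Poisson data; I(lam) = 1/lam, so the
# plug-in standard error of the MLE lam_hat = xbar is sqrt(xbar / n).
n, xbar, lam0 = 30, 3.07, 2.0   # lam0 = 2 is a hypothetical null value
se = np.sqrt(xbar / n)
W = (xbar - lam0) / se
print(W)   # |W| > 1.96, so the two-sided 5% test rejects
```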

15.3 Score test

$$Z=\frac{\ell_n'(\theta_0)}{\sqrt{nI(\theta_0)}}.$$ Reject for large $|Z|$ (two-sided).
Score test advantages:

- Everything is evaluated at the null value $\theta_0$, so no MLE computation is needed.
- The statistic is invariant under smooth reparameterizations of $\theta$.
- It often behaves better than Wald in finite samples near parameter-space boundaries (e.g., positivity constraints).
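A score-test sketch for the same Poisson setup (same hypothetical $\lambda_0=2$ as in the Wald sketch):

```python
import numpy as np

# Score test of H0: lam = lam0 for Poisson data. The score is
# S_n(lam0) = sum(x_i)/lam0 - n = n*(xbar/lam0 - 1), and under H0
# Var(S_n) = n * I(lam0) = n / lam0.
n, xbar, lam0 = 30, 3.07, 2.0   # same hypothetical null as the Wald sketch
score = n * (xbar / lam0 - 1)
Z = score / np.sqrt(n / lam0)   # everything evaluated at lam0: no MLE-fitting step
print(Z)
```

Note that the variance in the denominator uses $\lambda_0$, not $\hat\lambda$; that is the sense in which the score test is calibrated entirely at the null.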

15.4 Generalized likelihood ratio test (GLRT)

$$\Lambda_n = 2\bigl[\ell_n(\hat\theta_n)-\ell_n(\theta_0)\bigr].$$ Under regularity in the one-parameter setting: $$\Lambda_n \xrightarrow{d} \chi^2_1.$$
GLRT measures vertical drop from the maximum log-likelihood to the null value. It needs the MLE but does not explicitly require Fisher information in the statistic.
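For the Poisson model, $\ell_n(\lambda)=n(\bar X\log\lambda-\lambda)$ up to a constant, so the GLRT statistic has the closed form $\Lambda_n = 2n[\bar X\log(\bar X/\lambda_0) - \bar X + \lambda_0]$. A sketch (same hypothetical $\lambda_0=2$ as above):

```python
import numpy as np
from scipy import stats

# GLRT for Poisson H0: lam = lam0. With l_n(lam) = n*(xbar*log(lam) - lam) + const,
# Lambda_n = 2*[l_n(xbar) - l_n(lam0)] = 2*n*(xbar*log(xbar/lam0) - xbar + lam0).
n, xbar, lam0 = 30, 3.07, 2.0     # same hypothetical null value
Lam = 2 * n * (xbar * np.log(xbar / lam0) - xbar + lam0)
pval = stats.chi2.sf(Lam, df=1)   # asymptotic chi-squared_1 calibration
print(Lam, pval)
```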

15.5 Poisson running example ($n=30$, $\bar X=3.07$)

For $X_i\sim\text{Pois}(\lambda)$, class comparisons give:
| Method | 95% interval for $\lambda$ | Comment |
|---|---|---|
| Wald | [2.443, 3.697] | Symmetric around $\bar X$ |
| Score | [2.504, 3.764] | Respects positivity naturally |
| GLRT | [2.485, 3.740] | Likelihood-based inversion |
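These intervals can be reproduced by inverting each test: Wald in closed form, Score and GLRT by scanning a grid of null values and keeping the ones that are not rejected (the grid scan is a sketch, not the only way to solve the inversion):

```python
import numpy as np
from scipy import stats

n, xbar, z = 30, 3.07, 1.96

# Wald: closed form, symmetric around xbar.
half = z * np.sqrt(xbar / n)
wald = (xbar - half, xbar + half)

# Score and GLRT: invert numerically by keeping non-rejected null values.
grid = np.linspace(0.01, 8.0, 8000)
Zsq = n * (xbar - grid) ** 2 / grid                       # squared score statistic
Lam = 2 * n * (xbar * np.log(xbar / grid) - xbar + grid)  # GLRT statistic
score_kept = grid[Zsq <= z ** 2]
glrt_kept = grid[Lam <= stats.chi2.ppf(0.95, df=1)]
score_ci = (score_kept.min(), score_kept.max())
glrt_ci = (glrt_kept.min(), glrt_kept.max())

for name, ci in [("Wald", wald), ("Score", score_ci), ("GLRT", glrt_ci)]:
    print(f"{name}: [{ci[0]:.3f}, {ci[1]:.3f}]")
```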

15.6 Unifying equivalence statement

Locally under $H_0$: $$W^2 \approx Z^2 \approx 2\bigl[\ell_n(\hat\theta_n)-\ell_n(\theta_0)\bigr] \xrightarrow{d} \chi^2_1.$$ So all three tests are asymptotically equivalent; finite-sample differences come from how imperfect the quadratic approximation is.
Normal mean with known variance gives an exactly quadratic log-likelihood, so Wald, Score, and GLRT coincide exactly. Poisson is only approximately quadratic near the MLE, so small finite-sample differences remain.
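A quick numeric check of the equivalence, again with the Poisson running example: near the MLE the three statistics nearly coincide, while for a distant null they separate (the $\lambda_0$ values below are hypothetical):

```python
import numpy as np

n, xbar = 30, 3.07

def three_stats(lam0):
    """Wald^2, Score^2, and GLRT statistics for Poisson H0: lam = lam0."""
    W2 = (xbar - lam0) ** 2 / (xbar / n)                      # SE from the MLE
    Z2 = n * (xbar - lam0) ** 2 / lam0                        # info at the null
    G = 2 * n * (xbar * np.log(xbar / lam0) - xbar + lam0)    # likelihood drop
    return W2, Z2, G

print(three_stats(3.0))   # near the MLE: nearly identical
print(three_stats(1.5))   # far from the MLE: they separate
```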

Big Transition Map

| Lecture stage | Main question answered | How it connects forward |
|---|---|---|
| 11 (KS / empirical CDF) | How to test whole-distribution fit | Prepares broader testing language |
| 12–13 (NP, MLR, UMP) | What rejection rule is optimal | Creates the test design framework but not yet the reporting language |
| 14 (p-values, CIs, duality) | How to summarize evidence and plausible parameter values | Turns a family of tests into a family of confidence sets |
| 15 (Wald, Score, GLRT) | How to build practical asymptotic tests from likelihood theory | Specializes the general testing framework using likelihood geometry and sets up later chi-squared/GLRT ideas |

Master Summary & Formula Sheet

Core lecture-14 formulas

| Object | Formula | Meaning |
|---|---|---|
| Simple-null p-value | $p(x)=P_{H_0}(T(X)\ge T(x))$ | Tail extremeness under null |
| Composite-null p-value | $p(x)=\sup_{\theta\in\Theta_0}P_\theta(T(X)\ge T(x))$ | Worst-case null calibration |
| Super-uniformity | $P_\theta(p\le\alpha)\le\alpha$ | Guarantees valid level control |
| Confidence region | $P_\theta(C(X)\ni g(\theta))\ge 1-\alpha$ | Procedure-level coverage |
| Test → CI inversion | $C(X)=\{a:\phi(X;a)<1\}$ | Set of non-rejected null values |

Core lecture-15 formulas

| Test | Statistic | Reads what geometric feature? |
|---|---|---|
| Wald | $W=(\hat\theta-\theta_0)/\widehat{\mathrm{SE}}$ | Horizontal distance |
| Score | $Z=\ell_n'(\theta_0)/\sqrt{nI(\theta_0)}$ | Slope at null |
| GLRT | $2[\ell_n(\hat\theta)-\ell_n(\theta_0)]$ | Vertical likelihood drop |
| Equivalence | $W^2\approx Z^2\approx 2[\ell_n(\hat\theta)-\ell_n(\theta_0)]$ | Same local quadratic backbone |
Practical guidance:
Use Wald for convenience when MLE is already in hand, Score when null-evaluation and invariance matter, and GLRT when likelihood comparison is the natural primary object.

Common Mistakes to Avoid

1. "$p<0.05$ means the null is probably false."
No. p-value is computed under the null, not a posterior null probability.
2. "$p>0.05$ means no effect."
No. It may reflect low power or poor targeting of alternatives.
3. "A realized 95% CI contains the truth with 95% probability."
No. Frequentist 95% is long-run coverage of the method.
4. "Wald, Score, and GLRT always agree."
Only asymptotically; finite-sample behavior can differ.
5. "Wald intervals are always fine."
Near boundaries or under awkward parameterization, Score/GLRT-based intervals can be safer.

Data 145 Study Guide · Lectures 14–15 · Standalone Review Version