Data 145 Study Guide - Lectures 25-26

Comprehensive Study Guide - Lectures 25 through 26 - Spring 2026
Instructors: Ani Adhikari, William Fithian

Big Picture: Computation and Borrowing Strength

Lectures 25 and 26 fit together as a single story. Lecture 25 asks: if the posterior distribution is complicated, how can we still estimate posterior expectations? Lecture 26 asks: if we have many related parameters, how can data from all of them improve each individual estimate?

The bridge is hierarchical Bayes. A hierarchical posterior is often high-dimensional and hard to integrate directly, so Gibbs sampling is a practical way to simulate from it. Once we have simulated from the posterior, we can estimate posterior means and understand shrinkage.

$$\text{Bayes estimator}=\text{posterior expectation}\quad\Longrightarrow\quad \text{if the integral is hard, simulate from the posterior.}$$

Key terms: posterior mean, MCMC, Gibbs, full conditionals, hierarchical model, shrinkage, hyperparameter concentration.

The payoff is that MCMC gives a computational tool, while hierarchical Bayes gives a statistical reason for pooling information across related problems.

The central Lecture 26 insight is: with many parallel problems, the population-level hyperparameters can become sharply learned even when each individual parameter remains noisy. That is why shrinkage can improve many individual estimates at once.

Review: Bayes Risk, Bayes Estimators, and Posterior Means

2.1 Risk versus Bayes risk

In the frequentist setup, $\theta$ is fixed and the data $X$ are random. If $T(X)$ estimates $\theta$ and $L(\theta,T(X))$ is the loss, then the risk is $$R(\theta;T)=\E_\theta[L(\theta,T(X))],$$ where the expectation is over $X$ with $\theta$ held fixed.

In the Bayesian setup, $\theta$ also has a prior distribution $\pi$. The Bayes risk averages over both $\theta$ and $X$: $$r_\pi(T)=\E[L(\theta,T(X))].$$

The same expectation can be decomposed in two useful ways: $$r_\pi(T)=\E\{R(\theta;T)\}=\E\{\E[L(\theta,T(X))\mid X]\}.$$ The second form says: after seeing $X$, minimize the expected posterior loss.

2.2 Bayes estimator

The Bayes estimator minimizes the posterior expected loss. Under squared-error loss, the answer is the posterior mean:

If $L(\theta,a)=(\theta-a)^2$, then $$T_{\text{Bayes}}(X)=\E[\theta\mid X].$$
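
The reason is a short decomposition of the posterior expected loss. The following is the standard argument, written for any candidate estimate $a$ that depends only on $X$; $\text{Var}(\theta\mid X)$ denotes the posterior variance.

$$ \begin{aligned} \E[(\theta-a)^2\mid X] &=\E\bigl[(\theta-\E[\theta\mid X])^2\mid X\bigr] +2\bigl(\E[\theta\mid X]-a\bigr)\,\E\bigl[\theta-\E[\theta\mid X]\mid X\bigr] +\bigl(\E[\theta\mid X]-a\bigr)^2\\ &=\text{Var}(\theta\mid X)+\bigl(\E[\theta\mid X]-a\bigr)^2. \end{aligned} $$

The cross term vanishes because $\E[\theta-\E[\theta\mid X]\mid X]=0$. The posterior variance does not depend on $a$, so the posterior expected loss is minimized by taking $a=\E[\theta\mid X]$.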

This is why computation matters. If $\E[\theta\mid X]$ is an intractable integral, we need another way to approximate it. Monte Carlo is the natural idea: simulate posterior draws $\theta^{(1)},\ldots,\theta^{(m)}$ and average them.
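
As a sanity check on the Monte Carlo idea, the sketch below averages independent posterior draws in a case where the exact answer is known. The Beta(3, 9) posterior (a uniform prior with 2 successes in 10 trials) is a hypothetical example chosen for illustration, not one from the lectures.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical posterior: uniform prior, 2 successes in 10 trials -> Beta(3, 9).
a_post, b_post = 3.0, 9.0
exact_mean = a_post / (a_post + b_post)

# Monte Carlo: simulate posterior draws and average them.
m = 100_000
draws = rng.beta(a_post, b_post, size=m)
mc_mean = draws.mean()
mc_se = draws.std(ddof=1) / np.sqrt(m)  # valid here because the draws are independent

print(f"exact posterior mean = {exact_mean:.4f}")
print(f"Monte Carlo estimate = {mc_mean:.4f} (standard error about {mc_se:.4f})")
```

Here direct sampling is possible, so no Markov chain is needed. MCMC is for posteriors where direct sampling like `rng.beta` is not available.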

Markov Chain Monte Carlo

3.1 The basic MCMC idea

Suppose the target distribution $p$ is complicated, high-dimensional, or known only up to a normalizing constant. MCMC builds a Markov chain whose stationary distribution is $p$. After the chain has mixed, its states behave like dependent draws from $p$.

A Markov chain Monte Carlo method constructs states $X_0,X_1,\ldots$ such that the long-run distribution of $X_t$ is the target distribution $p$. After burn-in, averages along the chain approximate expectations under $p$.

The code-breaking example from probability is a useful memory hook. The state was a decoder, the state space was enormous, and a Metropolis chain searched through likely decoders. In general Bayesian computation, we are not just searching for a mode; we want posterior averages.
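
To make "known only up to a normalizing constant" concrete, here is a minimal random-walk Metropolis sketch. The target (an unnormalized $N(2,1)$ density), the step size, the starting point, and the burn-in length are all illustrative assumptions; the lectures' own examples may differ.

```python
import numpy as np

def log_target(x):
    # Unnormalized log density of N(2, 1); the normalizing constant is dropped on purpose.
    return -0.5 * (x - 2.0) ** 2

def random_walk_metropolis(log_p, x0, n_iter, step_sd, rng):
    """Return a chain whose stationary distribution is proportional to exp(log_p)."""
    chain = np.empty(n_iter)
    x, lp = x0, log_p(x0)
    for t in range(n_iter):
        proposal = x + step_sd * rng.normal()
        lp_prop = log_p(proposal)
        # Accept with probability min(1, p(proposal) / p(x)); only the ratio is needed.
        if np.log(rng.uniform()) < lp_prop - lp:
            x, lp = proposal, lp_prop
        chain[t] = x
    return chain

rng = np.random.default_rng(1)
chain = random_walk_metropolis(log_target, x0=-10.0, n_iter=50_000, step_sd=1.0, rng=rng)

burn_in = 5_000  # discard early states that still remember the start at -10
posterior_mean_estimate = chain[burn_in:].mean()
print(f"estimated mean after burn-in: {posterior_mean_estimate:.3f}")
```

The estimate is an average along the chain, not the location of the highest-density state, which is the distinction drawn above between searching for a mode and computing a posterior average.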

3.2 Burn-in, mixing, trace plots, and thinning

Term	Meaning	How to remember it
Burn-in	The early part of the chain before it has reached its steady-state behavior.	Throw away the first $B$ draws.
Mixing	How quickly the chain forgets its starting point and explores the target distribution.	Good mixing means the trace wanders around the target region without a long trend.
Trace plot	A plot of $f(X_t)$ against iteration $t$.	Use it to diagnose drift, sticking, or slow movement.
Thinning	Keeping every $k$th draw to reduce autocorrelation or storage.	Useful sometimes, but not automatically better because it discards data.

Trace-plot intuition: before burn-in, the chain can still remember the starting point. After mixing, the trace should fluctuate around a stable region instead of drifting persistently.

MCMC samples are not independent. The law of large numbers still has Markov-chain versions, but the effective sample size can be much smaller than the number of iterations when autocorrelation is high.

Gibbs Sampling

4.1 Full conditionals

Gibbs sampling is designed for a multivariate target distribution where sampling jointly is hard, but sampling one coordinate at a time is easy. For a vector $\mathbf{x}=(x_1,\ldots,x_d)$, write $\mathbf{x}_{-i}$ for all coordinates except $x_i$.

The full conditional for coordinate $i$ is the distribution $$p(x_i\mid \mathbf{x}_{-i}).$$ Gibbs sampling repeatedly samples each coordinate from its full conditional, holding all other coordinates fixed.

4.2 The complete cycle

Starting at $\mathbf{x}$, one complete Gibbs cycle does:

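1. Draw $x_1^*\sim p(x_1\mid x_2,\ldots,x_d)$.
2. Draw $x_2^*\sim p(x_2\mid x_1^*,x_3,\ldots,x_d)$.
3. Continue in the same way through coordinate $d$, always conditioning on the most recent values of the other coordinates.
4. The resulting vector $\mathbf{x}^*$ is the next Markov-chain state.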

The process is Markov because the next complete cycle depends only on the current state, not on the entire past.
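
The cycle can be sketched in code for a target whose full conditionals are known in closed form. The example below assumes a standard bivariate normal with correlation $\rho$, for which $x_1\mid x_2\sim N(\rho x_2,\,1-\rho^2)$ and symmetrically for $x_2$; the target, the starting point, and the burn-in length are illustrative choices rather than anything specified in the lectures.

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_iter, rng, start=(0.0, 0.0)):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    cond_sd = np.sqrt(1.0 - rho**2)
    x1, x2 = start
    chain = np.empty((n_iter, 2))
    for t in range(n_iter):
        x1 = rng.normal(rho * x2, cond_sd)  # x1* ~ p(x1 | x2)
        x2 = rng.normal(rho * x1, cond_sd)  # x2* ~ p(x2 | x1*), using the new x1
        chain[t] = (x1, x2)                 # one complete cycle = one new state
    return chain

rng = np.random.default_rng(2)
chain = gibbs_bivariate_normal(rho=0.5, n_iter=20_000, rng=rng, start=(5.0, -5.0))

kept = chain[1_000:]  # drop burn-in
sample_mean = kept.mean(axis=0)
sample_corr = np.corrcoef(kept.T)[0, 1]
print("sample mean:", np.round(sample_mean, 3))
print("sample correlation:", round(float(sample_corr), 3))
```

Each pass through the loop body uses only the current `(x1, x2)`, which is the Markov property described above.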

4.3 Two-dimensional geometry: horizontal and vertical slices

For two parameters $(\alpha,\beta)$, Gibbs alternates between conditional slices. At current $\beta$, sample a new $\alpha$ from the horizontal slice. Then at the new $\alpha$, sample a new $\beta$ from the vertical slice.

Gibbs sampling in two dimensions: each move samples one coordinate conditional on the other. One horizontal move plus one vertical move is one complete cycle.

4.4 Why Gibbs can be slow

Gibbs is easy when the conditional slices are easy and the posterior is not too ridge-like. It can be painfully slow when the posterior is highly correlated because the chain moves axis-aligned, one coordinate at a time.
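
A small experiment illustrates the slowdown. The sketch below again assumes a standard bivariate normal target (an illustrative choice) and compares the lag-1 autocorrelation of the first coordinate for a weakly and a strongly correlated target. For this particular target the first coordinate of the Gibbs chain is an AR(1) process with coefficient $\rho^2$, so the autocorrelation should be near $\rho^2$.

```python
import numpy as np

def gibbs_first_coordinate(rho, n_iter, rng):
    """Run Gibbs on a standard bivariate normal and return the x1 trace."""
    cond_sd = np.sqrt(1.0 - rho**2)
    x1, x2 = 0.0, 0.0
    trace = np.empty(n_iter)
    for t in range(n_iter):
        x1 = rng.normal(rho * x2, cond_sd)
        x2 = rng.normal(rho * x1, cond_sd)
        trace[t] = x1
    return trace

def lag1_autocorr(series):
    """Sample lag-1 autocorrelation of a one-dimensional series."""
    centered = series - series.mean()
    return float(np.dot(centered[:-1], centered[1:]) / np.dot(centered, centered))

rng = np.random.default_rng(3)
acf_weak = lag1_autocorr(gibbs_first_coordinate(0.3, 20_000, rng))
acf_strong = lag1_autocorr(gibbs_first_coordinate(0.99, 20_000, rng))
print(f"rho = 0.30: lag-1 autocorrelation about {acf_weak:.2f}")
print(f"rho = 0.99: lag-1 autocorrelation about {acf_strong:.2f}")
```

With $\rho=0.99$ successive states are nearly identical, so many more iterations are needed for the same effective sample size.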

Why Gibbs Works

The rigorous details are beyond the course, but the core idea is simple: each coordinate update is built from a conditional distribution that is consistent with the target joint distribution. Therefore, if the current state already has the target distribution, the coordinate-updated state still has the target distribution.

5.1 One-coordinate update

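Fix coordinate $i$. The Gibbs transition changes only $x_i$ and leaves $\mathbf{x}_{-i}$ fixed. Its transition probability/density is $$P_i(\mathbf{x},\mathbf{x}^*)=p(x_i^*\mid \mathbf{x}_{-i})\quad\text{when }\mathbf{x}^*_{-i}=\mathbf{x}_{-i},$$ and zero otherwise.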

If $\mathbf{x}$ and $\mathbf{x}^*$ differ only in coordinate $i$, then $\mathbf{x}_{-i}=\mathbf{x}^*_{-i}$. Using the factorization $p(\mathbf{x})=p(\mathbf{x}_{-i})p(x_i\mid\mathbf{x}_{-i})$, the detailed-balance calculation is

$$ \begin{aligned} p(\mathbf{x})P_i(\mathbf{x},\mathbf{x}^*) &=p(\mathbf{x}_{-i})p(x_i\mid \mathbf{x}_{-i})p(x_i^*\mid \mathbf{x}_{-i})\\ &=p(\mathbf{x}_{-i})p(x_i^*\mid \mathbf{x}_{-i})p(x_i\mid \mathbf{x}_{-i})\\ &=p(\mathbf{x}^*)P_i(\mathbf{x}^*,\mathbf{x}). \end{aligned} $$ Thus the target distribution is stationary for this one-coordinate update.

5.2 Complete-cycle chain

A full Gibbs iteration is a composition of these coordinate updates. Since each update preserves the target distribution, the complete cycle preserves it too. Under suitable irreducibility and aperiodicity conditions, the chain converges to the target distribution.

"Stationary distribution is correct" does not automatically mean "your finite simulation is good." You still need mixing diagnostics, enough iterations, and awareness of multimodality or high autocorrelation.

5.3 Balance, detailed balance, and fixed-order Gibbs

One subtle lecture point: each one-coordinate Gibbs update satisfies detailed balance with the target distribution, but a full deterministic cycle through coordinates need not be reversible. That is not a problem for stationarity. The important fact is that each coordinate update preserves the target distribution, so composing the updates preserves it too.

If the order of coordinate updates is randomly permuted each iteration, the complete cycle can also satisfy detailed balance. With a fixed order, it is safer to remember the weaker but sufficient statement: the target distribution is stationary for the complete-cycle chain.
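
These claims can be checked numerically on a small discrete target. The sketch below uses a hypothetical $3\times 3$ joint distribution (the numbers are made up for illustration), builds the two one-coordinate transition matrices, and tests stationarity and detailed balance for each update and for the fixed-order cycle.

```python
import numpy as np

# Hypothetical joint distribution p(i, j) on a 3 x 3 grid (entries sum to 1).
p = np.array([[0.20, 0.05, 0.05],
              [0.05, 0.20, 0.10],
              [0.05, 0.10, 0.20]])
n1, n2 = p.shape
pi = p.reshape(-1)  # state (i, j) is stored at index i * n2 + j

P1 = np.zeros((n1 * n2, n1 * n2))  # update coordinate 1 given coordinate 2
P2 = np.zeros_like(P1)             # update coordinate 2 given coordinate 1
for i in range(n1):
    for j in range(n2):
        k = i * n2 + j
        for i_new in range(n1):
            P1[k, i_new * n2 + j] = p[i_new, j] / p[:, j].sum()
        for j_new in range(n2):
            P2[k, i * n2 + j_new] = p[i, j_new] / p[i, :].sum()

cycle = P1 @ P2  # fixed order: coordinate 1 first, then coordinate 2

def is_stationary(P):
    return bool(np.allclose(pi @ P, pi))

def is_reversible(P):
    flow = pi[:, None] * P  # flow[x, y] = pi(x) P(x, y)
    return bool(np.allclose(flow, flow.T))

for name, P in [("P1", P1), ("P2", P2), ("cycle", cycle)]:
    print(name, "stationary:", is_stationary(P), "| detailed balance:", is_reversible(P))
```

For this table the single-coordinate updates satisfy detailed balance, while the fixed-order cycle is stationary but not reversible, matching the statement above.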

Hierarchical Bayes: Many Related Parameters

6.1 The statistical problem

Suppose there are many related problems: many baseball players, many teams, many schools, many genes, or many coordinates. Each unit has its own parameter $\theta_i$, but the parameters are not completely unrelated. They come from a shared population.

A hierarchical Bayes model gives the individual parameters a common population distribution, usually controlled by hyperparameters. A generic version is $$\eta\sim \text{hyperprior},\qquad \theta_i\mid \eta\iid G_\eta,\qquad X_i\mid \theta_i\sim P_{\theta_i}.$$

Hierarchical Bayes: the hyperparameter controls the population distribution of all $\theta_i$'s; each observation $X_i$ is generated from its own $\theta_i$.

6.2 Borrowing strength

The phrase "borrowing strength" means that the estimate of one unit uses information from the others through the shared hyperparameter. Each unit still has its own data, but the model learns a population pattern from all units together.

Hierarchical Bayes does not literally average everybody together. It learns the population center and spread, then shrinks noisy individual estimates toward that learned population structure.

Gaussian Many-Means Model and Shrinkage

7.1 The model

$$X_i\mid \theta_i\sim N(\theta_i,1),\qquad \theta_i\iid N(\mu,\tau^2),\qquad i=1,\ldots,d.$$

If we ignored the relationship among the $\theta_i$'s, the MLE would be $\hat\theta_i=X_i$, with per-coordinate risk 1. The hierarchical model says the $\theta_i$'s are similar enough that we can improve by shrinking the noisy $X_i$'s toward a common center.

7.2 Oracle Bayes when $\mu$ and $\tau^2$ are known

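If $\mu$ and $\tau^2$ are known, the normal prior is conjugate and the posterior for each coordinate is

$$\theta_i\mid \mu,\tau^2,X_i\sim N\!\left(\mu+\frac{\tau^2}{1+\tau^2}(X_i-\mu),\frac{\tau^2}{1+\tau^2}\right).$$

Write $\zeta=1/(1+\tau^2)$ for the shrinkage factor, so that $1-\zeta=\tau^2/(1+\tau^2)$. The posterior mean, which is the oracle Bayes estimator under squared-error loss, is

$$\hat\theta_i(\mu,\zeta)=\zeta\mu+(1-\zeta)X_i.$$ This is a convex combination of the population mean $\mu$ and the individual observation $X_i$.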

Precision view: why those weights appear

Because the observation variance is 1, the data precision is 1. The prior precision is $1/\tau^2$. Posterior precision is the sum:

$$\text{posterior precision}=1+\frac{1}{\tau^2}=\frac{1+\tau^2}{\tau^2},\qquad \text{posterior variance}=\frac{\tau^2}{1+\tau^2}.$$ $$\E[\theta_i\mid X_i,\mu,\tau^2]=\frac{\tau^2}{1+\tau^2}X_i+\frac{1}{1+\tau^2}\mu.$$

The weight on a source is larger when that source is more precise. If $\tau^2$ is small, the prior population is tightly concentrated, so the prior mean $\mu$ gets more weight. If $\tau^2$ is large, the prior population is diffuse, so the current observation $X_i$ gets more weight.
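
A short numeric sketch of the oracle rule shows how the weights move with $\tau^2$. The function name and the example values ($X_i=3$, $\mu=0$) are hypothetical and chosen only for illustration.

```python
def oracle_posterior(x, mu, tau2):
    """Posterior mean and variance of theta_i given X_i = x, with known mu and tau^2."""
    zeta = 1.0 / (1.0 + tau2)            # weight on the population mean
    mean = zeta * mu + (1.0 - zeta) * x  # convex combination of mu and x
    var = tau2 / (1.0 + tau2)            # 1 / (data precision + prior precision)
    return mean, var

x_obs, mu = 3.0, 0.0
results = {tau2: oracle_posterior(x_obs, mu, tau2) for tau2 in (0.25, 1.0, 4.0)}
for tau2, (mean, var) in results.items():
    print(f"tau^2 = {tau2}: posterior mean = {mean:.2f}, posterior variance = {var:.2f}")
```

With these inputs the posterior mean moves from 0.6 (tight population, strong shrinkage) through 1.5 to 2.4 (diffuse population, little shrinkage), and the posterior variance is always below the observation variance of 1.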

7.3 Linear-function view

$$\hat\theta_i=\underbrace{\zeta\mu}_{\text{intercept}}+\underbrace{(1-\zeta)}_{\text{slope}}X_i.$$

The MLE uses the identity line $\hat\theta_i=X_i$. Shrinkage uses a flatter line that pulls extreme observations back toward the population center.

The page's main memory hook: hierarchical and empirical Bayes are not magic. They estimate the intercept and slope of this shrinkage line using all $d$ observations, then apply the same line to each individual $X_i$.

Unknown Hyperparameters: Hierarchical Bayes vs Empirical Bayes

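8.1 The two philosophies

When $\mu$ and $\tau^2$ are unknown, there are two ways to proceed. Hierarchical Bayes puts a prior on the hyperparameters, computes their posterior, and reports the posterior mean of each $\theta_i$ after averaging over hyperparameter uncertainty. Empirical Bayes estimates the hyperparameters from the marginal distribution of the data and plugs those estimates into the Bayes estimator.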

8.2 Hierarchical Bayes formula by the tower property

Let $\zeta=1/(1+\tau^2)$. Conditional on $(\mu,\zeta)$, the posterior mean is the oracle shrinkage rule. Now average over the posterior distribution of $(\mu,\zeta)$:

$$ \begin{aligned} \hat\theta_i^{\text{HB}} &=\E[\theta_i\mid X]\\ &=\E\!\left[\E[\theta_i\mid \mu,\zeta,X]\mid X\right]\\ &=\E[\zeta\mu\mid X]+\bigl(1-\E[\zeta\mid X]\bigr)X_i. \end{aligned} $$

This is the same linear shrinkage rule, except the intercept $\zeta\mu$ and shrinkage factor $\zeta$ are replaced by posterior means learned from the full data set.

This is why the lecture describes hierarchical Bayes as a kind of plug-in estimator in disguise. It does not plug in one point estimate of $(\mu,\zeta)$; instead it plugs in posterior averages of the two coefficients that determine the shrinkage line.

8.3 Empirical Bayes from the marginal distribution

$$X_i=\theta_i+\varepsilon_i,\qquad \theta_i\sim N(\mu,\tau^2),\quad \varepsilon_i\sim N(0,1),$$ $$X_i\iid N(\mu,1+\tau^2).$$

Therefore, the center of the observed $X_i$'s estimates $\mu$, and the spread estimates $1+\tau^2$. Since $1+\tau^2=1/\zeta$,

$$\hat\mu=\bar X,\qquad \hat\zeta=\frac{1}{S^2},\qquad S^2=\frac{1}{d-1}\sum_{i=1}^d(X_i-\bar X)^2.$$ If $S^2<1$, use $\hat\zeta=1$ as a clipped full-shrinkage version.

The Gaussian case makes empirical Bayes look almost too easy because the marginal mean and variance identify the hyperparameters. In more general hierarchical models, the same idea becomes: integrate out the individual $\theta_i$'s, maximize the marginal likelihood over the hyperparameters, then plug those hyperparameter estimates into the Bayes update for each unit.

$$\hat\theta_i^{\text{EB}}=\hat\zeta\hat\mu+(1-\hat\zeta)X_i =\bar X+(1-\hat\zeta)(X_i-\bar X).$$

Each individual estimate still uses its own $X_i$. The sharing happens through the common coefficients, which are estimated from all $d$ observations.
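
The simulation below sketches how the empirical Bayes rule compares with the MLE and the oracle rule. The values $d=5000$, $\mu=1$, and $\tau^2=0.5$ are hypothetical settings chosen for illustration, not numbers from the lecture.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical simulation settings.
d, mu, tau2 = 5_000, 1.0, 0.5
theta = rng.normal(mu, np.sqrt(tau2), size=d)
x = theta + rng.normal(size=d)  # X_i | theta_i ~ N(theta_i, 1)

# Empirical Bayes: estimate the hyperparameters from the marginal N(mu, 1 + tau^2).
xbar = x.mean()
s2 = x.var(ddof=1)
zeta_hat = min(1.0, 1.0 / s2)  # clipped at 1 when S^2 < 1
theta_eb = xbar + (1.0 - zeta_hat) * (x - xbar)

# Oracle Bayes uses the true hyperparameters.
zeta = 1.0 / (1.0 + tau2)
theta_oracle = zeta * mu + (1.0 - zeta) * x

def mse(estimate):
    return float(np.mean((estimate - theta) ** 2))

mse_mle, mse_oracle, mse_eb = mse(x), mse(theta_oracle), mse(theta_eb)
print(f"MLE: {mse_mle:.3f}  oracle: {mse_oracle:.3f}  empirical Bayes: {mse_eb:.3f}")
```

With this many coordinates the empirical Bayes error should sit close to the oracle value $\tau^2/(1+\tau^2)=1/3$, well below the MLE's error of about 1.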

Hyperparameter Concentration

9.1 Why many parallel problems help

One observation $X_i$ contains limited information about its own $\theta_i$. But the whole collection $X_1,\ldots,X_d$ contains many draws from the marginal population distribution $N(\mu,1+\tau^2)$. That makes $\mu$ and $\tau^2$ learnable.

With many parallel problems, hyperparameters can become sharply estimated even while individual parameters remain uncertain.

The lecture visualizes this by increasing the number of parallel Gaussian problems. For small $d$, the posterior over $\mu$ and $\tau^2$ is broad. For large $d$, the same posterior becomes tight around the population values, because the cloud of $X_i$'s reveals the population center and spread.

9.2 Posterior concentration under a flat hyperprior

After integrating out the latent $\theta_i$'s, the Gaussian hierarchy becomes ordinary inference from $X_i\iid N(\mu,1+\tau^2)$. With a flat prior on $(\mu,\tau^2)$, the posterior summaries are the familiar normal-sample forms:

$$\mu\mid X\sim \bar X+S\sqrt{\frac{d-1}{d(d-3)}}\,t_{d-3},$$ $$(1+\tau^2)\mid X\sim \IG\!\left(\frac{d-3}{2},\frac{(d-1)S^2}{2}\right).$$

As $d$ grows, these distributions tighten. In the lecture's simulated example with $d=200$, the posterior for the hyperparameters is much more concentrated than it is for small $d$.

The individual parameters $\theta_i$ each have only one noisy observation. The hyperparameters have the entire cloud of $d$ observations. That is why the population can be known better than any one individual.
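
The concentration can be seen by drawing from the two posterior formulas above for a small and a large $d$. This is a sketch under assumptions: the data are simulated with hypothetical true values $\mu=1$ and $\tau^2=0.5$, and, like the formulas, it ignores the constraint $\tau^2\ge 0$. The helper names are made up for illustration.

```python
import numpy as np

def hyper_posterior_draws(x, n_draws, rng):
    """Draw (mu, 1 + tau^2) from the flat-hyperprior posterior stated above."""
    d = x.size
    xbar, s2 = x.mean(), x.var(ddof=1)
    # mu | X ~ xbar + S * sqrt((d - 1) / (d (d - 3))) * t_{d-3}
    mu = xbar + np.sqrt(s2 * (d - 1) / (d * (d - 3))) * rng.standard_t(d - 3, size=n_draws)
    # If Y ~ Gamma(shape a, rate 1), then b / Y ~ IG(a, b).
    sigma2 = ((d - 1) * s2 / 2.0) / rng.gamma((d - 3) / 2.0, 1.0, size=n_draws)
    return mu, sigma2

def width90(draws):
    """Width of the central 90% interval of the draws."""
    lo, hi = np.quantile(draws, [0.05, 0.95])
    return float(hi - lo)

rng = np.random.default_rng(5)
mu_true, tau2_true = 1.0, 0.5  # hypothetical population values
widths = {}
for d in (10, 200):
    x = rng.normal(mu_true, np.sqrt(1.0 + tau2_true), size=d)  # marginal model for X_i
    mu_draws, sigma2_draws = hyper_posterior_draws(x, 20_000, rng)
    widths[d] = (width90(mu_draws), width90(sigma2_draws))
    print(f"d = {d}: 90% width for mu = {widths[d][0]:.2f}, for 1 + tau^2 = {widths[d][1]:.2f}")
```

Both interval widths shrink markedly when $d$ goes from 10 to 200, even though each $\theta_i$ still has only one observation.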

Gamma-Poisson Example and Gibbs Traces

10.1 The Premier League hierarchy

The lecture returns to the Gamma-Poisson Premier League model from the worksheet. For team $i$ and match $j$,

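$$\beta\sim\text{Gamma}(a,b),\qquad \theta_i\mid \beta\iid \text{Gamma}(\alpha,\beta),\qquad X_{ij}\mid \theta_i\sim \text{Pois}(\theta_i).$$

Here $\theta_i$ is the scoring rate of team $i$ (out of $d$ teams), $\alpha$ is held fixed, and $\beta$ is a hyperparameter, the rate parameter of the Gamma distribution, that controls the population of scoring rates. The hyperprior on $\beta$ is Gamma$(a,b)$ in the shape-rate parameterization, which is the prior the $\beta$ update below is conjugate to. If $G_i=\sum_j X_{ij}$ is total goals for team $i$ over $m$ matches, conjugacy gives the Gibbs updates: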

$$\theta_i\mid \beta,X\sim \text{Gamma}(\alpha+G_i,\beta+m),$$ $$\beta\mid \theta\sim \text{Gamma}\!\left(a+d\alpha,\ b+\sum_i\theta_i\right).$$

This is exactly where Lecture 25 and Lecture 26 meet. The model is hierarchical, and Gibbs sampling alternates between updating team-level rates $\theta_i$ and the population-level hyperparameter $\beta$.
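
A minimal sketch of this two-block Gibbs sampler follows. It runs on synthetic data, not the Premier League data from the worksheet: the values of $\alpha$, $a$, $b$, the true $\beta$ used to simulate, and the chain length are all hypothetical. Note that NumPy's gamma sampler takes a scale, so each rate parameter is inverted.

```python
import numpy as np

rng = np.random.default_rng(6)

# Synthetic stand-in data (hypothetical values, not the worksheet's data).
d, m = 20, 10              # teams, matches per team
alpha, a, b = 4.0, 1.0, 1.0
beta_true = 3.0
theta_true = rng.gamma(alpha, 1.0 / beta_true, size=d)
goals = rng.poisson(theta_true[:, None], size=(d, m))
G = goals.sum(axis=1)      # total goals per team

n_iter, burn_in = 5_000, 500
theta_draws = np.empty((n_iter, d))
beta_draws = np.empty(n_iter)
beta = 1.0                 # arbitrary starting value
for t in range(n_iter):
    # theta_i | beta, X ~ Gamma(alpha + G_i, rate beta + m)
    theta = rng.gamma(alpha + G, 1.0 / (beta + m))
    # beta | theta ~ Gamma(a + d * alpha, rate b + sum_i theta_i)
    beta = rng.gamma(a + d * alpha, 1.0 / (b + theta.sum()))
    theta_draws[t], beta_draws[t] = theta, beta

theta_post_mean = theta_draws[burn_in:].mean(axis=0)
beta_post_mean = beta_draws[burn_in:].mean()
raw_rates = G / m
print(f"posterior mean of beta: {beta_post_mean:.2f}")
print(f"spread of raw rates: {raw_rates.std():.3f}, of posterior means: {theta_post_mean.std():.3f}")
```

The posterior means of the team rates are less spread out than the raw rates $G_i/m$, which is the shrinkage toward the population that the hierarchy produces.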

10.2 Marginal model check before Gibbs

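Before running Gibbs, the lecture checks whether the hierarchy is a plausible model for the observed team totals. If $\theta_i\sim\text{Gamma}(\alpha,\beta)$ and $G_i\mid\theta_i\sim\text{Pois}(m\theta_i)$, then integrating out $\theta_i$ gives a negative-binomial marginal distribution:

$$G_i\sim\text{NegBin}\!\left(\alpha,\frac{\beta}{\beta+m}\right),\qquad P(G_i=g)=\frac{\Gamma(\alpha+g)}{\Gamma(\alpha)\,g!}\left(\frac{\beta}{\beta+m}\right)^{\alpha}\left(\frac{m}{\beta+m}\right)^{g},\quad g=0,1,2,\ldots$$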

This is the empirical-Bayes instinct again: the marginal distribution across teams tells us about the population hyperparameter. If the fitted marginal distribution is wildly inconsistent with the observed totals, then the hierarchy itself is suspect before we even worry about Gibbs.
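
The negative-binomial marginal can be verified by simulation. The sketch below uses hypothetical values $\alpha=4$, $\beta=3$, $m=10$, simulates totals through the two-stage Gamma-Poisson mechanism, and compares their frequencies with the pmf with size $\alpha$ and success probability $\beta/(\beta+m)$; the function name is made up for illustration.

```python
import math
import numpy as np

def negbin_pmf(g, alpha, beta, m):
    """P(G = g) when theta ~ Gamma(alpha, rate beta) and G | theta ~ Pois(m * theta)."""
    log_p = (math.lgamma(alpha + g) - math.lgamma(alpha) - math.lgamma(g + 1)
             + alpha * math.log(beta / (beta + m))
             + g * math.log(m / (beta + m)))
    return math.exp(log_p)

alpha, beta, m = 4.0, 3.0, 10  # hypothetical values
support = np.arange(401)        # truncation point; the tail beyond it is negligible here
pmf = np.array([negbin_pmf(int(g), alpha, beta, m) for g in support])

# Two-stage simulation: draw a rate, then a Poisson total.
rng = np.random.default_rng(7)
n_sim = 200_000
theta = rng.gamma(alpha, 1.0 / beta, size=n_sim)
G = rng.poisson(m * theta)
freq = np.bincount(G, minlength=support.size)[:support.size] / n_sim

pmf_mean = float((support * pmf).sum())
max_gap = float(np.abs(freq - pmf).max())
print(f"pmf mean = {pmf_mean:.3f}, simulated mean = {G.mean():.3f}, max frequency gap = {max_gap:.4f}")
```

In an applied check, `freq` would be replaced by the observed distribution of team totals and the hyperparameters by fitted values; a large gap would cast doubt on the hierarchy itself.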

10.3 Trace-plot interpretation

The lecture's trace-plot message: the population-level hyperparameter can have a narrow posterior band while individual rates still wander over wider ranges.

We do not know each team rate precisely after only 10 matches. But the population distribution of team rates can be learned much more reliably from all 20 teams. That reliable population estimate is what makes shrinkage useful.

Formula Sheet and Recall Map

11.1 Main formulas

Concept	Formula	Meaning
Bayes risk	$r_\pi(T)=\E[L(\theta,T(X))]$	Risk averaged over the prior and sampling distribution.
Squared-loss Bayes estimator	$T(X)=\E[\theta\mid X]$	Posterior mean minimizes posterior expected squared loss.
MCMC estimate	$m^{-1}\sum_{t=B+1}^{B+m}f(X_t)$	Average post-burn-in chain values.
Gibbs full conditional	$p(x_i\mid \mathbf{x}_{-i})$	Update one coordinate holding all others fixed.
Gaussian oracle shrinkage	$\hat\theta_i=\zeta\mu+(1-\zeta)X_i$	Convex combination of population mean and data.
Shrinkage factor	$\zeta=1/(1+\tau^2)$	More population spread means less shrinkage.
Hierarchical Bayes shrinkage	$\E[\zeta\mu\mid X]+(1-\E[\zeta\mid X])X_i$	Posterior-mean intercept and slope.
Empirical Bayes shrinkage	$\bar X+(1-\hat\zeta)(X_i-\bar X)$	Plug-in shrinkage using marginal estimates.
Gaussian EB hyperparameter estimates	$\hat\mu=\bar X,\ \hat\zeta=1/S^2$	Mean and spread of observed $X_i$'s identify the population.
General EB principle	Estimate hyperparameters from the marginal distribution of the data.	Integrate out individual parameters, fit the population model, then plug in.
Gamma-Poisson team update	$\theta_i\mid\beta,X\sim \text{Gamma}(\alpha+G_i,\beta+m)$	Conjugate Gibbs update for team rate.
Gamma-Poisson marginal check	$G_i\sim\text{NegBin}(\alpha,\beta/(\beta+m))$	Team totals reveal whether the population hierarchy is plausible.
Gamma-Poisson hyperparameter update	$\beta\mid\theta\sim\text{Gamma}(a+d\alpha,b+\sum_i\theta_i)$	Conjugate Gibbs update for population rate parameter.

Shrinkage line for each estimator in the Gaussian model, written as $\hat\theta_i=\text{intercept}+\text{slope}\cdot X_i$:

Estimator	Intercept	Slope
MLE	$0$	$1$
Oracle Bayes	$\zeta\mu$	$1-\zeta$
Hierarchical Bayes	$\E[\zeta\mu\mid X]$	$1-\E[\zeta\mid X]$
Empirical Bayes	$\hat\zeta\hat\mu=\bar X/S^2$	$1-\hat\zeta=1-1/S^2$

Limiting behavior of the shrinkage factor:

Situation	Limit	Estimator behavior
Large population spread	$\tau^2\to\infty$, so $\zeta\to 0$	Little shrinkage: $\hat\theta_i\approx X_i$.
Small population spread	$\tau^2\to 0$, so $\zeta\to 1$	Strong shrinkage: $\hat\theta_i\approx \mu$.

Behavior of each estimator as the number of parallel problems $d$ grows:

Estimator	What happens as $d$ grows?
MLE $\hat\theta_i=X_i$	Per-coordinate risk stays at 1 because each coordinate is treated alone.
Oracle Bayes	Uses the true $\mu,\tau^2$ and has per-coordinate risk $\tau^2/(1+\tau^2)$.
Empirical Bayes	Approaches oracle because $\bar X$ and $S^2$ learn the hyperparameters.
Hierarchical Bayes	Approaches oracle because the posterior on the hyperparameters concentrates.

Two ways to handle unknown hyperparameters:

Method	What it does with $(\mu,\tau^2)$	Output
Hierarchical Bayes	Puts a prior on hyperparameters and computes their posterior.	Posterior mean estimator after averaging over hyperparameter uncertainty.
Empirical Bayes	Estimates hyperparameters from the marginal distribution of the data.	Plug-in Bayes estimator using estimated hyperparameters.

Where borrowing strength appears in the course examples:

Example	Individual parameter	Population/hyperparameter	Borrowing strength means
Baseball Beta-Binomial	Player batting probability $p_i$	Beta$(\alpha,\beta)$	Players with few at-bats get pulled toward the league-level batting distribution.
Premier League Gamma-Poisson	Team scoring rate $\theta_i$	Gamma$(\alpha,\beta)$	Team rates get shrunk toward the population of team scoring rates.
Gaussian many means	Mean $\theta_i$	$N(\mu,\tau^2)$	Each $X_i$ gets shrunk toward the learned center $\mu$.

11.2 Recall map

$$\text{posterior mean hard to compute}\Rightarrow \text{MCMC}\Rightarrow \text{Gibbs when full conditionals are easy}.$$ $$\text{many related parameters}\Rightarrow \text{hierarchical model}\Rightarrow \text{learn population}\Rightarrow \text{shrink individuals}.$$

The conceptual one-liner for these two lectures is: Gibbs sampling is a way to compute posterior averages, and hierarchical Bayes is a reason those posterior averages can improve many related estimates by learning a shared population.

Common Mistakes

1. Confusing the posterior mode with the posterior mean.
The code-breaking MCMC example often looked for a high-scoring decoder, but Bayes estimators under squared loss require posterior means.

2. Forgetting burn-in.
Early chain states can still reflect the arbitrary starting point. Posterior averages should use draws after the chain has mixed.

3. Thinking thinning automatically improves inference.
Thinning can reduce autocorrelation or storage, but it also discards samples. It is not a universal fix.

4. Thinking Gibbs always mixes quickly.
Gibbs can be slow for multimodal targets or highly correlated coordinates, because it moves one coordinate at a time.

5. Forgetting that full conditionals must be compatible with one target joint distribution.
Gibbs works because the full conditionals come from the target distribution you want.

6. Confusing hierarchical Bayes with simply pooling all data into one mean.
Hierarchical Bayes learns a population distribution and partially pools. It does not force all $\theta_i$'s to be identical.

7. Thinking empirical Bayes is fully Bayesian.
Empirical Bayes plugs in hyperparameter estimates. Hierarchical Bayes averages over hyperparameter uncertainty.

8. Reversing the shrinkage factor interpretation.
$\zeta=1/(1+\tau^2)$ is the weight on the population mean. Larger $\tau^2$ means less shrinkage, not more.

9. Missing where information is shared.
In the Gaussian model, each estimate is a function of its own $X_i$, but the line's slope and intercept are learned from all observations.

10. Thinking individual uncertainty must vanish because hyperparameter uncertainty vanishes.
With one observation per unit, each $\theta_i$ can remain uncertain even when $\mu$ and $\tau^2$ are sharply learned from many units.

Data 145: Evidence and Uncertainty