Data 145: Evidence and Uncertainty

Comprehensive Study Guide - Lecture 27 - Spring 2026
Instructors: Ani Adhikari, William Fithian

Big Picture: Evidence for Causes
Potential Outcomes and ATE
Randomized Controlled Trials
Testing for a Treatment Effect
Estimating the Average Treatment Effect
Standard Causal Notation
Observational Studies and Confounding
Propensity Scores and IPW
Formula Sheet and Recall Map
Common Mistakes

Big Picture: Evidence for Causes

The final lecture asks a different kind of inference question. Earlier lectures often asked whether an observed pattern is surprising under a statistical model. Causal inference asks whether changing a treatment would change an outcome.

Association is about what tends to appear together. Causation is about what would happen under an intervention. The whole lecture is about the gap between those two statements.

The main arc is:

potential outcomes → random assignment → sharp vs weak nulls → ATE estimation → confounding → propensity weighting

Randomized experiments make causal inference easier because treatment assignment is independent of the potential outcomes. Observational studies are harder because treatment assignment can be entangled with confounders.

Potential Outcomes and ATE

2.1 Two possible worlds per unit

For each unit $i$, imagine two potential outcomes before assignment happens: $t_i$ if the unit receives treatment and $c_i$ if the unit receives control.

The pair $(t_i,c_i)$ contains the two outcomes unit $i$ could have under the two possible assignments. In the finite-population view from lecture, these are fixed lists of numbers; the randomness comes from which units are assigned to treatment.

The fundamental missing-data problem: for each unit, randomization reveals one potential outcome and hides the other.

2.2 Average treatment effect

The individual treatment effect is $t_i-c_i$, but we cannot observe it directly for any one unit. The main population target is the average:

$$\tau=\frac{1}{N}\sum_{i=1}^N(t_i-c_i)=\bar t-\bar c.$$ This is the average treatment effect, or ATE.

The lecture uses finite-population summaries of the two potential-outcome lists:

$$S_t^2=\frac{1}{N-1}\sum_{i=1}^N(t_i-\bar t)^2,\qquad S_c^2=\frac{1}{N-1}\sum_{i=1}^N(c_i-\bar c)^2.$$

The ATE is a comparison of two full potential-outcome lists. We observe only pieces of those lists, so causal inference is about using the assignment mechanism to justify the comparison.
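As a small illustration of the missing-data structure, the sketch below builds a made-up potential-outcome table for four units. The numbers and the choice of treated units are assumptions for illustration only, not data from the lecture.

```python
# Hypothetical potential outcomes for N = 4 units (made-up numbers).
t = [7, 5, 9, 6]   # t_i: outcome if unit i is treated
c = [4, 5, 6, 3]   # c_i: outcome if unit i is in control

N = len(t)
unit_effects = [ti - ci for ti, ci in zip(t, c)]  # t_i - c_i, never observable in practice
tau = sum(unit_effects) / N                       # ATE as the mean unit-level effect
tau_alt = sum(t) / N - sum(c) / N                 # ATE as t-bar minus c-bar

# Suppose units 0 and 2 happen to be assigned to treatment (an arbitrary choice).
treated = {0, 2}
observed = [t[i] if i in treated else c[i] for i in range(N)]  # what the study reveals
missing = [c[i] if i in treated else t[i] for i in range(N)]   # the hidden counterfactuals

print(tau, tau_alt, observed, missing)
```

Both ways of computing the ATE agree, and each unit contributes exactly one entry to `observed` and one to `missing`.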

Randomized Controlled Trials

3.1 Randomization creates the probability model

In the lecture's randomized controlled trial setup, the treatment group $T$ is a simple random sample of $n$ indices from $\{1,\ldots,N\}$. The control group $C$ is sampled from the remaining indices and has size $m$.

The observed treatment outcomes are $$X_1,\ldots,X_n \quad \text{the sampled } t_i\text{'s in }T,$$ and the observed control outcomes are $$Y_1,\ldots,Y_m \quad \text{the sampled } c_i\text{'s in }C.$$

The observed data are two disjoint lists, not paired observations. If a unit is in treatment, we observe its $t_i$ as one of the $X$'s and do not observe its $c_i$. If a unit is in control, we observe its $c_i$ as one of the $Y$'s and do not observe its $t_i$. In reality we see $$X_1,\ldots,X_n\qquad\text{and}\qquad Y_1,\ldots,Y_m,$$ not the full paired table $(t_i,c_i)$ for every person.

Because treatment assignment is random, $\bar X$ is an unbiased estimator of $\bar t$ and $\bar Y$ is an unbiased estimator of $\bar c$. Therefore $$\hat\tau=\bar X-\bar Y$$ is an unbiased estimator of $\tau=\bar t-\bar c$.
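One way to see the unbiasedness claim concretely is to enumerate every possible treatment group for a tiny population and average the resulting estimates. The sketch below uses made-up potential outcomes and assumes every unit not in treatment goes to control, so $m=N-n$.

```python
from fractions import Fraction
from itertools import combinations

# Made-up potential outcomes; a real study never sees both lists in full.
t = [7, 5, 9, 6]
c = [4, 5, 6, 3]
N, n = len(t), 2
m = N - n  # assumption: all remaining units form the control group

def diff_in_means(treated):
    """tau_hat = X-bar minus Y-bar for one particular treatment group."""
    control = [i for i in range(N) if i not in treated]
    x_bar = Fraction(sum(t[i] for i in treated), n)
    y_bar = Fraction(sum(c[i] for i in control), m)
    return x_bar - y_bar

# Under simple random sampling, each size-n treatment group is equally likely.
estimates = [diff_in_means(group) for group in combinations(range(N), n)]
expected_tau_hat = sum(estimates) / len(estimates)

tau = Fraction(sum(t) - sum(c), N)
print(expected_tau_hat, tau)
```

Individual estimates vary from one assignment to another, but their average over all equally likely assignments equals $\tau$ exactly.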

3.2 Why randomization matters

Random assignment breaks the connection between "which units are treated" and "what those units would have been like anyway." That is what turns a raw comparison of groups into causal evidence.

Setting	Assignment mechanism	Why comparison is hard or easy
Randomized trial	External random assignment	Treatment and control groups are comparable in expectation.
Observational study	Units effectively select or are selected into groups	Group differences can reflect confounders, not treatment effects.

Testing for a Treatment Effect

4.1 Two null hypotheses

The phrase "the treatment has no effect" can mean two different things.

Null	Statement	What it allows
Fisher's strong / sharp null	$t_i=c_i$ for every unit $i$.	No unit-level effect at all. If one outcome is observed, the missing counterfactual is known under the null.
Neyman's weak null	$\bar t=\bar c$, equivalently $\tau=0$.	Some units can benefit and others can be harmed, as long as the average effect is zero.

The sharp null is stronger because it fills in every missing potential outcome under the null. The weak null only says the average effect is zero, so it does not usually tell us every unit's missing counterfactual.

4.2 Binary outcomes: Fisher's exact test

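For binary outcomes, the sharp null leads to an exact hypergeometric null distribution. Suppose there are $w$ ones among the pooled $N$ units and $n$ of those units are treated. Under the sharp null, the outcomes do not depend on the treatment labels, so the treated outcomes are a simple random sample of size $n$ from the pooled list, and the number of ones in the treatment group has distribution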

$$\text{Hypergeometric}(N,w,n).$$

Fisher's exact test uses this distribution to compute the p-value for the observed treated count.

For binary outcomes, permutation of treatment labels reduces to a hypergeometric count: how many of the pooled successes land in the treatment group?

This is the same randomization idea as a permutation test. The permutation test shuffles labels and recomputes a statistic. When the outcome is binary and the statistic is the number of successes in treatment, the shuffled-label distribution has a closed form: hypergeometric.
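The hypergeometric calculation can be sketched with the standard library alone. The function names and the example counts below are assumptions for illustration, and the one-sided upper-tail p-value is just one possible choice of tail.

```python
from math import comb

def hypergeom_pmf(k, N, w, n):
    """P(K = k): k ones among n units drawn without replacement from N units containing w ones."""
    if k < max(0, n - (N - w)) or k > min(w, n):
        return 0.0
    return comb(w, k) * comb(N - w, n - k) / comb(N, n)

def fisher_upper_tail_p(k_obs, N, w, n):
    """Upper-tail p-value P(K >= k_obs) under the null."""
    return sum(hypergeom_pmf(k, N, w, n) for k in range(k_obs, min(w, n) + 1))

# Hypothetical study: N = 8 units, w = 4 ones overall, n = 4 treated, 3 ones among the treated.
p_value = fisher_upper_tail_p(3, N=8, w=4, n=4)
print(p_value)
```

For these made-up counts the tail sum is $(16+1)/70=17/70$, since there are $\binom{8}{4}=70$ equally likely treatment groups.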

4.3 Permutation test

Under Fisher's sharp null, relabeling treatment and control assignments is legitimate because treatment would not change any unit's outcome. A permutation test repeatedly shuffles the treatment labels, recomputes a statistic such as $\bar X-\bar Y$, and compares the observed statistic to this randomization distribution.

A permutation test is not just "shuffle because shuffling feels fair." It is justified by an assignment mechanism and a null hypothesis under which the observed outcomes would be unchanged by treatment labels.
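A minimal Monte Carlo sketch of the permutation test follows. The function name, the two-sided comparison, and the add-one correction in the p-value are choices made here for illustration rather than details taken from the lecture.

```python
import random

def permutation_test(x, y, num_shuffles=10_000, seed=0):
    """Two-sided Monte Carlo permutation p-value for the difference in means."""
    rng = random.Random(seed)
    n, m = len(x), len(y)
    pooled = list(x) + list(y)
    observed = sum(x) / n - sum(y) / m
    at_least_as_extreme = 0
    for _ in range(num_shuffles):
        # Under the sharp null, outcomes stay fixed and only the labels move.
        rng.shuffle(pooled)
        stat = sum(pooled[:n]) / n - sum(pooled[n:]) / m
        if abs(stat) >= abs(observed) - 1e-12:  # small tolerance for floating-point ties
            at_least_as_extreme += 1
    # Add-one correction (a common convention) keeps the estimate strictly positive.
    return observed, (at_least_as_extreme + 1) / (num_shuffles + 1)

# Made-up outcomes for four treated and four control units.
obs_stat, p_value = permutation_test([5, 6, 7, 8], [1, 2, 3, 4], num_shuffles=2000)
print(obs_stat, p_value)
```

With groups this small, the shuffles could be replaced by exact enumeration of all relabelings. Random shuffling is the practical substitute when the number of relabelings is too large to list.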

Estimating the Average Treatment Effect

5.1 Difference in observed means

Testing answers whether there is evidence of an effect. But instead of only testing hypotheses about whether the ATE is zero, it is often more useful to estimate the ATE itself. Then we learn roughly how big the effect is, not just whether it is distinguishable from zero.

$$\hat\tau=\bar X-\bar Y=\frac{1}{n}\sum_{i\in T}t_i-\frac{1}{m}\sum_{i\in C}c_i.$$ Randomization makes this an unbiased estimator of $$\tau=\bar t-\bar c.$$

5.2 Variance: the unobservable covariance problem

The variance of $\hat\tau$ depends on the variability in the treatment potential outcomes, the variability in the control potential outcomes, and the covariance between the two potential-outcome lists.

This is where the disjoint-list issue matters most. We can estimate the variation among observed treatment outcomes from $X_1,\ldots,X_n$ and the variation among observed control outcomes from $Y_1,\ldots,Y_m$. But we cannot directly estimate how $t_i$ and $c_i$ pair within the same person, because no person reveals both outcomes.

Derivation: finite-population variance and the conservative bound

First recall the variance of a simple random sample sum without replacement. Suppose a population of size $N$ has mean $\mu$ and variance $\sigma^2$ using the denominator $N$. Draw a simple random sample of size $n$, and let

$$S_n=X_1+\cdots+X_n.$$

By symmetry, $\E[S_n]=n\mu$. Also, by exchangeability,

$$\Var(S_n)=n\sigma^2+n(n-1)\Cov(X_1,X_2).$$

To find $\Cov(X_1,X_2)$, use the census case $n=N$. Then $S_N$ is the fixed population total, so $\Var(S_N)=0$. Therefore

$$0=N\sigma^2+N(N-1)\Cov(X_1,X_2),\qquad \Cov(X_1,X_2)=-\frac{\sigma^2}{N-1}.$$

Plugging this back in gives the finite-population correction:

$$\Var(S_n)=n\sigma^2\frac{N-n}{N-1},\qquad \Var(\bar X)=\frac{\sigma^2}{n}\frac{N-n}{N-1}.$$
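The finite-population correction can be checked by brute force on a tiny population. The values below are made up, and exact fractions are used so that the comparison is not affected by rounding.

```python
from fractions import Fraction
from itertools import combinations

population = [2, 4, 4, 7, 8]  # made-up values; repeated values are distinct units
N, n = len(population), 2

mu = Fraction(sum(population), N)
sigma_sq = sum((v - mu) ** 2 for v in population) / N  # population variance, denominator N

# Every size-n subset is equally likely under simple random sampling.
sample_means = [Fraction(sum(s), n) for s in combinations(population, n)]
mean_of_means = sum(sample_means) / len(sample_means)
var_enumerated = sum((xb - mean_of_means) ** 2 for xb in sample_means) / len(sample_means)

var_formula = sigma_sq / n * Fraction(N - n, N - 1)
print(var_enumerated, var_formula)
```

The enumerated variance of the sample mean matches the formula, including the factor $(N-n)/(N-1)$ that would be absent under sampling with replacement.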

Apply this to the treatment and control potential-outcome lists, writing $\sigma_t^2$ and $\sigma_c^2$ for their variances with the denominator $N$:

$$\Var(\bar X)=\frac{\sigma_t^2}{n}\frac{N-n}{N-1},\qquad \Var(\bar Y)=\frac{\sigma_c^2}{m}\frac{N-m}{N-1}.$$

The estimator is $\hat\tau=\bar X-\bar Y$, so

$$\Var(\hat\tau)=\Var(\bar X)+\Var(\bar Y)-2\Cov(\bar X,\bar Y).$$

For disjoint treatment and control samples, the cross-covariance is $$\Cov(\bar X,\bar Y)=-\frac{1}{N-1}\text{cov}(t,c),$$ where $\text{cov}(t,c)=\frac{1}{N}\sum_{i=1}^N(t_i-\bar t)(c_i-\bar c)$ is the finite-population covariance of the paired potential outcomes $(t_i,c_i)$, using the denominator $N$. Substituting and regrouping gives

$$\Var(\hat\tau)=\frac{N}{N-1}\left(\frac{\sigma_t^2}{n}+\frac{\sigma_c^2}{m}\right)-\frac{1}{N-1} \left(\sigma_t^2+\sigma_c^2-2\text{cov}(t,c)\right).$$

Since $S_t^2=\frac{N}{N-1}\sigma_t^2$ and $S_c^2=\frac{N}{N-1}\sigma_c^2$, the first term equals $S_t^2/n+S_c^2/m$. The expression in the second term is the variance of the individual treatment effects:

$$\sigma_t^2+\sigma_c^2-2\text{cov}(t,c)=\text{Var}(t_i-c_i)=\sigma_\tau^2\ge 0.$$

Putting these together gives the exact variance

$$\Var(\hat\tau)=\frac{S_t^2}{n}+\frac{S_c^2}{m}-\frac{\sigma_\tau^2}{N-1},$$ where $$\sigma_\tau^2=\frac{1}{N}\sum_{i=1}^N\{(t_i-c_i)-\tau\}^2.$$

Why is the correction term negative?

The sign of the final term can be read directly from the covariance algebra. Collecting every term that carries a factor of $1/(N-1)$ gives

$$\frac{1}{N-1}\left(2\text{cov}(t,c)-\sigma_t^2-\sigma_c^2\right) =-\frac{1}{N-1}\left(\sigma_t^2+\sigma_c^2-2\text{cov}(t,c)\right).$$

But the expression in parentheses is the variance of individual treatment effects:

$$\sigma_t^2+\sigma_c^2-2\text{cov}(t,c)=\text{Var}(t_i-c_i)\ge 0.$$

So the correction is $-\text{Var}(t_i-c_i)/(N-1)\le 0$. That is why dropping it gives an upper bound on $\Var(\hat\tau)$.

The term $\sigma_\tau^2$ is unobservable because it requires knowing both $t_i$ and $c_i$ for the same unit. That is exactly the fundamental causal missing-data problem again.

The sign comes from the identity $$\Cov(\bar X,\bar Y)=-\frac{1}{N-1}\text{cov}(t,c).$$ The $\text{cov}(t,c)$ part refers to paired potential outcomes for the same person. Those are often positively related: a person with a high control outcome might also tend to have a high treatment outcome. The negative sign comes from $\bar X$ and $\bar Y$ being computed from disjoint groups. If randomization puts many high-potential-outcome people into treatment, those same people cannot appear in control, so the control average is pushed downward. This is the usual without-replacement tradeoff.

Since $\sigma_\tau^2\ge 0$, we get the conservative upper bound

$$\Var(\hat\tau)\le \frac{S_t^2}{n}+\frac{S_c^2}{m}.$$

Estimate the bound using the observed sample variances:

$$\widehat{\Var}_{\text{cons}}(\hat\tau)=\frac{S_X^2}{n}+\frac{S_Y^2}{m}.$$
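The exact variance formula and the conservative bound can both be checked by enumerating the randomization distribution of $\hat\tau$ for a tiny population. The potential outcomes below are made up, and the sketch assumes $m=N-n$. In a real study the full lists would not be available, which is exactly why only the bound can be estimated.

```python
from fractions import Fraction
from itertools import combinations

# Made-up potential outcomes; in a real study these full lists are never observed.
t = [7, 5, 9, 6]
c = [4, 5, 6, 3]
N, n = len(t), 2
m = N - n  # assumption: every unit not treated is in control

def avg(values):
    return Fraction(sum(values)) / len(values)

def s_squared(values):
    """Finite-population variance with denominator N - 1."""
    center = avg(values)
    return sum((v - center) ** 2 for v in values) / (len(values) - 1)

def tau_hat(treated):
    control = [i for i in range(N) if i not in treated]
    return avg([t[i] for i in treated]) - avg([c[i] for i in control])

# Randomization distribution: every size-n treatment group is equally likely.
estimates = [tau_hat(group) for group in combinations(range(N), n)]
est_center = avg(estimates)
var_enumerated = avg([(est - est_center) ** 2 for est in estimates])

tau = avg([ti - ci for ti, ci in zip(t, c)])
sigma_tau_sq = avg([((ti - ci) - tau) ** 2 for ti, ci in zip(t, c)])  # denominator N

bound = s_squared(t) / n + s_squared(c) / m
var_formula = bound - sigma_tau_sq / (N - 1)
print(var_enumerated, var_formula, bound)
```

For these numbers the enumerated variance equals the formula exactly, and the bound is larger because the unit-level effects are not all equal.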

5.3 Conservative confidence interval

When the treatment and control group sizes are large enough for an approximate normal argument, use

$$\hat\tau\pm 2\sqrt{\frac{S_X^2}{n}+\frac{S_Y^2}{m}}$$ as an approximate 95% confidence interval. This is conservative because it estimates an upper bound on the true variance.

The interval is conservative for a good reason: the data cannot reveal how each unit's treatment outcome would pair with its own control outcome. Instead of pretending to know that covariance, we use a variance bound.
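A minimal sketch of the interval computation is below. It assumes $S_X^2$ and $S_Y^2$ are the usual sample variances with denominators $n-1$ and $m-1$, which is what `statistics.variance` computes. The function name and the data are made up for illustration.

```python
from math import sqrt
from statistics import mean, variance

def conservative_ci(x, y, z=2.0):
    """Interval tau_hat +/- z * sqrt(S_X^2 / n + S_Y^2 / m) from observed outcomes."""
    tau_hat = mean(x) - mean(y)
    # variance() uses the (n - 1) denominator; assumed here to match S_X^2 and S_Y^2.
    se = sqrt(variance(x) / len(x) + variance(y) / len(y))
    return tau_hat - z * se, tau_hat + z * se

# Made-up observed outcomes; real use would need much larger groups
# for the normal approximation to be reasonable.
treated_outcomes = [7, 9, 8, 10]
control_outcomes = [4, 6, 5, 5]
lower, upper = conservative_ci(treated_outcomes, control_outcomes)
print(lower, upper)
```

The groups here are far too small for the normal approximation. They only show the arithmetic.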

Standard Causal Notation

6.1 Potential-outcome notation

Causal inference usually switches to the following notation. This is the same setup as above, but now $n$ denotes the total number of units and $n_1,n_0$ denote the treatment and control counts.

Symbol	Meaning
$Z_i$	Treatment indicator: $Z_i=1$ if unit $i$ receives treatment, $0$ otherwise.
$Y_i(1)$	Potential outcome for unit $i$ under treatment.
$Y_i(0)$	Potential outcome for unit $i$ under control.
$\tau_i=Y_i(1)-Y_i(0)$	Individual treatment effect.
$\bar\tau=n^{-1}\sum_i\tau_i$	Average treatment effect.

6.2 Observed outcome

The observed outcome is whichever potential outcome corresponds to the assigned group:

$$Y_i=Z_iY_i(1)+(1-Z_i)Y_i(0).$$ Equivalently, $$Y_i=Y_i(0)+Z_i\{Y_i(1)-Y_i(0)\}=Y_i(0)+Z_i\tau_i.$$

If $n_1=\sum_i Z_i$ and $n_0=\sum_i(1-Z_i)$, then the usual difference-in-means estimator is

$$\hat\tau=\frac{1}{n_1}\sum_i Z_iY_i-\frac{1}{n_0}\sum_i(1-Z_i)Y_i.$$
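The indicator notation translates directly into code. The sketch below uses made-up potential outcomes and an arbitrary assignment vector. The helper names are hypothetical.

```python
def observed_outcomes(z, y1, y0):
    """Y_i = Z_i * Y_i(1) + (1 - Z_i) * Y_i(0): the assignment selects which outcome is revealed."""
    return [zi * a + (1 - zi) * b for zi, a, b in zip(z, y1, y0)]

def difference_in_means(z, y):
    """tau_hat computed from indicators and observed outcomes only."""
    n1 = sum(z)
    n0 = len(z) - n1
    treated_mean = sum(zi * yi for zi, yi in zip(z, y)) / n1
    control_mean = sum((1 - zi) * yi for zi, yi in zip(z, y)) / n0
    return treated_mean - control_mean

# Made-up potential outcomes and one arbitrary assignment.
y1 = [7, 5, 9, 6]
y0 = [4, 5, 6, 3]
z = [1, 0, 1, 0]

y = observed_outcomes(z, y1, y0)
tau_hat = difference_in_means(z, y)
print(y, tau_hat)
```

Note that `difference_in_means` never touches `y1` or `y0` directly. It sees only what an analyst would see.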

6.3 Assumptions in randomized experiments

The main randomized-experiment assumption is $$\{Y_i(1),Y_i(0)\}\perp\!\!\!\perp Z_i.$$ In words: treatment assignment is independent of the unit's potential outcomes.

The lecture also assumes no interference between units: one unit's potential outcomes are not affected by another unit's treatment assignment.

Randomization is powerful because it gives this independence by design. In observational studies, the same independence is usually false unless we condition on enough confounders.

Observational Studies and Confounding

7.1 Why observational studies are harder

In an observational study, units are not randomly assigned. Treatment status can be related to other variables that also affect the outcome. Those variables are confounders.

This is the key break from the randomized-trial setup. In an observational study, units effectively "assign themselves" to treatment or control, so it is no longer safe to assume $$\{Y(1),Y(0)\}\perp\!\!\!\perp Z.$$ The potential outcomes and the assignment can both be influenced by the same confounding variable. That is why the next move is not "just compare the two groups," but "compare treated and control units after accounting for the confounder."

The lecture's memory example is coffee and lung cancer in the era when smoking was common in cafes. Coffee drinking was associated with lung cancer, but smoking was a confounder: it affected the chance of being in the coffee group and the chance of lung cancer.

In that example, one useful coding is:

$$X_i=\begin{cases}1 & \text{if person }i\text{ smokes}\\ 0 & \text{if person }i\text{ does not smoke}\end{cases},\qquad Z_i=\begin{cases}1 & \text{if person }i\text{ drinks coffee}\\ 0 & \text{if person }i\text{ does not drink coffee}\end{cases},$$ $$Y_i=\begin{cases}1 & \text{if person }i\text{ gets lung cancer}\\ 0 & \text{if person }i\text{ does not get lung cancer.}\end{cases}$$

Here $X_i$ is the confounder, $Z_i$ is the treatment/group assignment, and $Y_i$ is the observed outcome. The potential outcomes would be $Y_i(1)$, the lung-cancer outcome if person $i$ drank coffee, and $Y_i(0)$, the lung-cancer outcome if person $i$ did not drink coffee.

If $X$ affects both treatment $Z$ and outcome $Y$, a treated-control comparison can mix the treatment effect with pre-existing differences.
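The coffee example can be made concrete with an exact calculation. All probabilities below are invented for illustration. The key assumption built into the sketch is that coffee has no effect at all: the cancer probability depends only on smoking.

```python
from fractions import Fraction as F

# Invented probabilities for illustration only.
p_smoke = F(1, 2)                       # P(X = 1)
p_coffee = {1: F(4, 5), 0: F(1, 5)}     # P(Z = 1 | X = x): smokers drink more coffee
p_cancer = {1: F(3, 10), 0: F(1, 20)}   # P(Y = 1 | X = x): the same whether or not Z = 1

def p_x(x):
    return p_smoke if x == 1 else 1 - p_smoke

def p_z_given_x(z, x):
    return p_coffee[x] if z == 1 else 1 - p_coffee[x]

def cancer_rate_given_coffee(z):
    """P(Y = 1 | Z = z), averaging over the smoking mix inside coffee group z."""
    joint = {x: p_x(x) * p_z_given_x(z, x) for x in (0, 1)}
    total = sum(joint.values())
    return sum(joint[x] / total * p_cancer[x] for x in (0, 1))

naive_gap = cancer_rate_given_coffee(1) - cancer_rate_given_coffee(0)
print(naive_gap)
```

The coffee group shows a higher cancer rate even though coffee does nothing in this model, because coffee drinkers are mostly smokers. Within either smoking stratum the gap is zero by construction.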

7.2 Conditional independence

The observational-study replacement for randomization is an assumption:

For each confounder value $x$, assume $$\{Y(1),Y(0)\}\perp\!\!\!\perp Z\mid X=x.$$ That is, the potential outcomes are conditionally independent of the group assignment, given the confounder.

This says that within each value of $X$, treated and control units are comparable. In the coffee example, among smokers and among non-smokers separately, coffee-drinker status is assumed independent of the potential lung-cancer outcomes.

This is a strong assumption. It says there are no unmeasured confounders after conditioning on $X$. If an important confounder is missing, the causal interpretation can fail.

7.3 CATE and ATE

The conditional average treatment effect is $$\tau(X)=\E[Y(1)-Y(0)\mid X].$$ The overall average treatment effect is then $$\tau=\E[\tau(X)].$$

CATE asks for the treatment effect within comparable strata. ATE averages those stratum-specific effects back over the population.
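For a discrete confounder, the two-step logic (estimate within strata, then average over strata) can be sketched as follows. The function name and data are made up, and the estimate has a causal reading only under the conditional independence assumption.

```python
from collections import defaultdict

def stratified_ate(x, z, y):
    """Estimate CATE in each stratum of x, then average with stratum frequencies as weights."""
    groups = defaultdict(lambda: {1: [], 0: []})
    for xi, zi, yi in zip(x, z, y):
        groups[xi][zi].append(yi)
    n = len(x)
    cate, ate = {}, 0.0
    for xv, g in groups.items():
        # Needs at least one treated and one control unit in every stratum.
        cate[xv] = sum(g[1]) / len(g[1]) - sum(g[0]) / len(g[0])
        ate += cate[xv] * (len(g[1]) + len(g[0])) / n
    return cate, ate

# Made-up data: stratum 0 is mostly control, stratum 1 is mostly treated.
x = [0, 0, 0, 0, 1, 1, 1, 1]
z = [1, 0, 0, 0, 1, 1, 1, 0]
y = [3, 1, 1, 1, 6, 6, 6, 2]

cate, ate = stratified_ate(x, z, y)
print(cate, ate)
```

In this toy data the within-stratum differences are 2 and 4, and the strata are equally large, so the stratified estimate is their simple average.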

Propensity Scores and IPW

8.1 Propensity score

The propensity score is the probability of receiving treatment given the observed confounder:

$$e(X)=\P(Z=1\mid X)=\E[Z\mid X].$$

Under conditional independence, $e(X)$ depends on the confounder but not directly on the potential outcomes: $$\P(Z=1\mid X,Y(1),Y(0))=\P(Z=1\mid X).$$

8.2 Inverse propensity weighting

If treated units are overrepresented in some stratum of $X$, each treated unit from that stratum should count less. If treated units are rare in a stratum, each treated unit from that stratum should count more. That is the inverse-weighting idea.

With estimated propensity scores $\hat e(X_i)$, the inverse propensity weighted estimator is $$\hat\tau_{\text{IPW}}=\frac{1}{n}\sum_{i=1}^n\frac{Z_iY_i}{\color{red}{\hat e(X_i)}}-\frac{1}{n}\sum_{i=1}^n \frac{(1-Z_i)Y_i}{\color{red}{1-\hat e(X_i)}}.$$

The denominators come directly from the assignment probabilities within a confounder stratum:

$$e(x)=\P(Z=1\mid X=x),\qquad 1-e(x)=\P(Z=0\mid X=x).$$

The $Z_i$ term turns on only for treated units, so a treated observed outcome is divided by the chance of being treated, $e(X_i)$. The $(1-Z_i)$ term turns on only for control units, so a control observed outcome is divided by the chance of being control, $1-e(X_i)$. Rare observations get larger weights because they stand in for many similar units who could have appeared in that group.
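A minimal sketch of the estimator for a discrete confounder follows. Estimating $\hat e(x)$ as the fraction treated within each stratum is one simple choice assumed here, not necessarily the lecture's method. The function names and data are made up.

```python
from collections import Counter

def estimate_propensity(x, z):
    """e_hat(x): fraction treated among units with X = x (discrete X assumed)."""
    totals = Counter(x)
    treated = Counter(xi for xi, zi in zip(x, z) if zi == 1)
    return {xv: treated[xv] / totals[xv] for xv in totals}

def ipw_ate(x, z, y, e_hat):
    """IPW estimate; both weighted sums are divided by the full sample size n."""
    n = len(y)
    # Every e_hat value must be strictly between 0 and 1, or a division by zero occurs.
    treated_term = sum(zi * yi / e_hat[xi] for xi, zi, yi in zip(x, z, y)) / n
    control_term = sum((1 - zi) * yi / (1 - e_hat[xi]) for xi, zi, yi in zip(x, z, y)) / n
    return treated_term - control_term

# Made-up data: stratum 0 is mostly control, stratum 1 is mostly treated.
x = [0, 0, 0, 0, 1, 1, 1, 1]
z = [1, 0, 0, 0, 1, 1, 1, 0]
y = [3, 1, 1, 1, 6, 6, 6, 2]

e_hat = estimate_propensity(x, z)
tau_ipw = ipw_ate(x, z, y, e_hat)
naive = sum(yi for zi, yi in zip(z, y) if zi) / sum(z) - sum(yi for zi, yi in zip(z, y) if not zi) / (len(z) - sum(z))
print(e_hat, tau_ipw, naive)
```

The lone treated unit in stratum 0 gets weight $1/0.25=4$, while each treated unit in stratum 1 gets weight $1/0.75$. In this toy data the weighted estimate is 3, compared with a raw difference in means of 4.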

In this Horvitz-Thompson form, both weighted sums are divided by the full sample size $n$. After weighting, the first sum estimates the population mean $\E[Y(1)]$ and the second estimates $\E[Y(0)]$. A separately normalized version also exists, but it is a slightly different estimator.

Inverse propensity weighting tries to repair imbalance by making each observed unit represent the number of similar units it stands in for.

8.3 Why IPW targets the ATE

If the true propensity score is known, if $0<e(X)<1$, and if conditional independence holds, then IPW has the right expectation.

$$\E[Y(1)]=\E\!\left[\frac{ZY}{e(X)}\right],\qquad \E[Y(0)]=\E\!\left[\frac{(1-Z)Y}{1-e(X)}\right].$$ Therefore $$\tau=\E\!\left[\frac{ZY}{e(X)}-\frac{(1-Z)Y}{1-e(X)}\right].$$

Since $Y=Y(1)$ when $Z=1$, $$\E\!\left[\frac{ZY}{e(X)}\right]=\E\!\left[\frac{ZY(1)}{e(X)}\right].$$ Condition on $X$: $$\E\!\left[\frac{ZY(1)}{e(X)}\mid X\right]=\frac{1}{e(X)}\E[ZY(1)\mid X].$$ By conditional independence, $$\E[ZY(1)\mid X]=\E[Z\mid X]\E[Y(1)\mid X]=e(X)\E[Y(1)\mid X].$$ The $e(X)$ cancels, and iterating expectations gives $\E[Y(1)]$.
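The identity can be checked exactly on a small discrete model by summing over every combination of $(X,Z,Y(1),Y(0))$. All probabilities below are invented. The sketch builds in conditional independence of $Z$ and the potential outcomes given $X$. It also treats $Y(1)$ and $Y(0)$ as independent given $X$, which is an extra simplification not needed for the identity.

```python
from fractions import Fraction as F
from itertools import product

# Invented model for illustration.
p_x = {0: F(1, 2), 1: F(1, 2)}        # P(X = x)
e = {0: F(1, 5), 1: F(4, 5)}          # true propensity e(x) = P(Z = 1 | X = x)
p_y1 = {0: F(1, 10), 1: F(2, 5)}      # P(Y(1) = 1 | X = x)
p_y0 = {0: F(1, 20), 1: F(3, 10)}     # P(Y(0) = 1 | X = x)

def bern(p, value):
    """Probability that a Bernoulli(p) variable equals value."""
    return p if value == 1 else 1 - p

ipw_expectation = F(0)
mean_y1 = F(0)
mean_y0 = F(0)
for x, z, y1, y0 in product((0, 1), repeat=4):
    # Given X, Z is drawn independently of (Y(1), Y(0)): conditional independence.
    prob = p_x[x] * bern(e[x], z) * bern(p_y1[x], y1) * bern(p_y0[x], y0)
    y = z * y1 + (1 - z) * y0  # observed outcome
    ipw_expectation += prob * (z * y / e[x] - (1 - z) * y / (1 - e[x]))
    mean_y1 += prob * y1
    mean_y0 += prob * y0

tau = mean_y1 - mean_y0
print(ipw_expectation, tau)
```

The weighted expectation uses only the observed outcome $Y$, yet it reproduces $\E[Y(1)]-\E[Y(0)]$ exactly when the true propensity score is used.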

Positivity matters: if $e(X)=0$ or $e(X)=1$ for some stratum, then one group is absent there and IPW either blows up or cannot learn that stratum's counterfactual comparison.

Estimating the propensity score adds another layer of uncertainty. The clean expectation identity above uses the true $e(X)$; in practice, $\hat e(X)$ must be estimated carefully, and the causal interpretation still depends on no unmeasured confounding.

Formula Sheet and Recall Map

9.1 Main formulas

Concept	Formula	Meaning
ATE	$\tau=N^{-1}\sum_i(t_i-c_i)=\bar t-\bar c$	Average causal effect across the finite population.
RCT estimator	$\hat\tau=\bar X-\bar Y$	Difference between observed treatment and control means.
Conservative variance estimate	$S_X^2/n+S_Y^2/m$	Usable because the exact covariance term is unobservable.
Observed outcome	$Y_i=Z_iY_i(1)+(1-Z_i)Y_i(0)$	Only one potential outcome is revealed.
Randomized experiment assumption	$\{Y_i(1),Y_i(0)\}\perp\!\!\!\perp Z_i$	Assignment independent of potential outcomes.
Conditional ignorability	$\{Y(1),Y(0)\}\perp\!\!\!\perp Z\mid X$	No unmeasured confounding after conditioning on $X$.
Propensity score	$e(X)=\P(Z=1\mid X)$	Probability of treatment within confounder strata.
IPW estimator	$n^{-1}\sum_i Z_iY_i/\hat e(X_i)-n^{-1}\sum_i(1-Z_i)Y_i/\{1-\hat e(X_i)\}$	Weighted comparison that corrects observed imbalance.

9.2 Recall map

$$\text{causal question}\Rightarrow \text{potential outcomes}\Rightarrow \text{one observed, one missing}.$$ $$\text{random assignment}\Rightarrow \text{unbiased difference in means}.$$ $$\text{observational assignment}\Rightarrow \text{confounding}\Rightarrow \text{condition or weight by }X.$$

The final lecture's one-line memory hook: causal inference is missing-data inference where the missing values are counterfactual outcomes, and the assignment mechanism determines how credible the comparison is.

Common Mistakes

1. Treating association as causation.
A difference in observed group means is causal only if the assignment mechanism or assumptions justify comparing those groups.

2. Forgetting the missing counterfactual.
We never observe both $Y_i(1)$ and $Y_i(0)$ for the same unit. This is why causal inference is fundamentally hard.

3. Confusing Fisher's sharp null with Neyman's weak null.
The sharp null says no unit-level effects. The weak null says the average effect is zero.

4. Using the exact variance formula as if all terms were observable.
The covariance between $t_i$ and $c_i$ requires seeing both potential outcomes for the same unit. Use the conservative bound instead.

5. Forgetting positivity in IPW.
If a stratum has only treated or only control units, inverse propensity weighting cannot create the missing comparison.

6. Thinking propensity scores solve unmeasured confounding.
Propensity methods help only for confounders that are measured and included in the treatment-assignment model.
