The final lecture asks a different kind of inference question. Earlier lectures often asked whether an observed pattern is surprising under a
statistical model. Causal inference asks whether changing a treatment would change an outcome.
Association is about what tends to appear together. Causation is about what would happen under an intervention. The whole lecture is about the
gap between those two statements.
The main arc is:
potential outcomes → random assignment → sharp vs. weak nulls → ATE estimation → confounding → propensity weighting
Randomized experiments make causal inference easier because treatment assignment is independent of the potential outcomes. Observational studies
are harder because treatment assignment can be entangled with confounders.
Potential Outcomes and ATE
2.1 Two possible worlds per unit
For each unit $i$, imagine two potential outcomes before assignment happens:
$t_i$ if the unit receives treatment and $c_i$ if the unit receives control.
The pair $(t_i,c_i)$ contains the two outcomes unit $i$ could have under the two possible assignments. In the finite-population view from
lecture, these are fixed lists of numbers; the randomness comes from which units are assigned to treatment.
The fundamental missing-data problem: for each unit, randomization reveals one potential outcome and hides the other.
2.2 Average treatment effect
The individual treatment effect is $t_i-c_i$, but we cannot observe it directly for any one unit. The main population target is the average:
$$\tau=\frac{1}{N}\sum_{i=1}^N(t_i-c_i)=\bar t-\bar c.$$ This is the average treatment effect, or ATE.
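The lecture uses finite-population summaries of the two potential-outcome lists, with means and variances computed over all $N$ units:
$$\bar t=\frac{1}{N}\sum_{i=1}^N t_i,\qquad \bar c=\frac{1}{N}\sum_{i=1}^N c_i,\qquad \sigma_t^2=\frac{1}{N}\sum_{i=1}^N(t_i-\bar t)^2,\qquad \sigma_c^2=\frac{1}{N}\sum_{i=1}^N(c_i-\bar c)^2.$$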
The ATE is a comparison of two full potential-outcome lists. We observe only pieces of those lists, so causal inference is about using the
assignment mechanism to justify the comparison.
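As a concrete picture of the two lists, here is a minimal Python sketch. The six units and their potential outcomes are made-up numbers, not data from the lecture, and the particular assignment is chosen by hand only to show which entries are revealed and which are hidden.

```python
# Hypothetical potential outcomes for N = 6 units (illustrative numbers only).
t = [7, 5, 9, 4, 6, 8]   # t_i: outcome if unit i is treated
c = [5, 5, 6, 3, 7, 6]   # c_i: outcome if unit i is in control

N = len(t)
unit_effects = [ti - ci for ti, ci in zip(t, c)]   # t_i - c_i, never observable in practice
ate = sum(unit_effects) / N                         # tau = mean(t) - mean(c)

# One possible assignment (chosen by hand here): units 0, 2, 3 are treated.
treated = {0, 2, 3}
observed = [t[i] if i in treated else c[i] for i in range(N)]
hidden = [c[i] if i in treated else t[i] for i in range(N)]

print("ATE:", ate)
print("observed:", observed)
print("hidden:  ", hidden)
```

Only `observed` would be available to an analyst; `ate` can be computed here only because the sketch pretends to know both lists.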
Randomized Controlled Trials
3.1 Randomization creates the probability model
In the lecture's randomized controlled trial setup, the treatment group $T$ is a simple random sample of $n$ indices from $\{1,\ldots,N\}$. The
control group $C$ is sampled from the remaining indices and has size $m$.
The observed treatment outcomes are $$X_1,\ldots,X_n \quad \text{the sampled } t_i\text{'s in }T,$$ and the observed control outcomes are
$$Y_1,\ldots,Y_m \quad \text{the sampled } c_i\text{'s in }C.$$
The observed data are two disjoint lists, not paired observations. If a unit is in treatment, we observe its $t_i$ as one of the $X$'s and do
not observe its $c_i$. If a unit is in control, we observe its $c_i$ as one of the $Y$'s and do not observe its $t_i$. In reality we see
$$X_1,\ldots,X_n\qquad\text{and}\qquad Y_1,\ldots,Y_m,$$ not the full paired table $(t_i,c_i)$ for every person.
Because treatment assignment is random, $\bar X$ is an unbiased estimator of $\bar t$ and $\bar Y$ is an unbiased estimator of $\bar c$.
Therefore $$\hat\tau=\bar X-\bar Y$$ is an unbiased estimator of $\tau=\bar t-\bar c$.
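Unbiasedness can be checked exactly on a tiny population by enumerating every possible treatment group. The sketch below uses made-up potential outcomes and assumes the control group is every unit not treated ($m=N-n$), which is one special case of the setup above. The helper `avg` is an illustrative name.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical potential outcomes (illustrative numbers only).
t = [7, 5, 9, 4, 6, 8]
c = [5, 5, 6, 3, 7, 6]
N, n = len(t), 3          # assumption: control group = all remaining units, so m = N - n

def avg(values):
    """Exact average as a Fraction, so no rounding error enters the check."""
    values = list(values)
    return Fraction(sum(values), len(values))

tau = avg(t) - avg(c)

estimates = []
for T in combinations(range(N), n):              # every equally likely treatment group
    C = [i for i in range(N) if i not in T]
    estimates.append(avg(t[i] for i in T) - avg(c[i] for i in C))

print("number of assignments:", len(estimates))
print("average of tau-hat over assignments:", avg(estimates))
print("true tau:", tau)
```

Individual estimates differ from $\tau$, sometimes by a lot; it is their average over the randomization distribution that matches.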
3.2 Why randomization matters
Random assignment breaks the connection between "which units are treated" and "what those units would have been like anyway." That is what turns
a raw comparison of groups into causal evidence.
| Setting | Assignment mechanism | Why comparison is hard or easy |
|---|---|---|
| Randomized trial | External random assignment | Treatment and control groups are comparable in expectation. |
| Observational study | Units effectively select or are selected into groups | Group differences can reflect confounders, not treatment effects. |
Testing for a Treatment Effect
4.1 Two null hypotheses
The phrase "the treatment has no effect" can mean two different things.
| Null | Statement | What it allows |
|---|---|---|
| Fisher's strong / sharp null | $t_i=c_i$ for every unit $i$. | No unit-level effect at all. If one outcome is observed, the missing counterfactual is known under the null. |
| Neyman's weak null | $\bar t=\bar c$, equivalently $\tau=0$. | Some units can benefit and others can be harmed, as long as the average effect is zero. |
The sharp null is stronger because it fills in every missing potential outcome under the null. The weak null only says the average effect is
zero, so it does not usually tell us every unit's missing counterfactual.
4.2 Binary outcomes: Fisher's exact test
For binary outcomes, testing equal proportions can lead to an exact hypergeometric null distribution. If there are $w$ ones in the pooled $N$
units and $n$ treated units, then under the no-difference null the number of ones in the treatment group has distribution
$$\text{Hypergeometric}(N,w,n).$$
Fisher's exact test uses this distribution to compute the p-value for the observed treated count.
For binary outcomes, permutation of treatment labels reduces to a hypergeometric count: how many of the pooled successes land in the treatment
group?
This is the same randomization idea as a permutation test. The permutation test shuffles labels and recomputes a statistic. When the outcome is
binary and the statistic is the number of successes in treatment, the shuffled-label distribution has a closed form: hypergeometric.
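The hypergeometric calculation can be sketched with the standard library alone. The trial counts below are hypothetical, and the one-sided p-value (treated count at least as large as observed) is one common choice, not necessarily the version used in lecture. The function names are illustrative.

```python
from math import comb

def hypergeom_pmf(k, N, w, n):
    """P(K = k): k of the w ones land among n units drawn without replacement from N."""
    if k < max(0, n - (N - w)) or k > min(w, n):
        return 0.0
    return comb(w, k) * comb(N - w, n - k) / comb(N, n)

def fisher_one_sided_p(k_obs, N, w, n):
    """P(K >= k_obs) under the null: at least k_obs ones in the treatment group."""
    return sum(hypergeom_pmf(k, N, w, n) for k in range(k_obs, min(w, n) + 1))

# Hypothetical trial: N = 20 units, n = 10 treated, w = 9 successes overall,
# of which 7 were observed in the treatment group.
p_value = fisher_one_sided_p(7, N=20, w=9, n=10)
print("one-sided p-value:", round(p_value, 4))
```

The arguments follow the $\text{Hypergeometric}(N,w,n)$ ordering used above: population size, number of ones, treatment group size.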
4.3 Permutation test
Under Fisher's sharp null, relabeling treatment and control assignments is legitimate because treatment would not change any unit's outcome. A
permutation test repeatedly shuffles the treatment labels, recomputes a statistic such as $\bar X-\bar Y$, and compares the observed statistic
to this randomization distribution.
A permutation test is not just "shuffle because shuffling feels fair." It is justified by an assignment mechanism and a null hypothesis under
which the observed outcomes would be unchanged by treatment labels.
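A Monte Carlo version of this procedure might look like the sketch below. The data are made up, the two-sided comparison and the add-one correction are conventional choices rather than something the lecture specifies, and `permutation_test` is an illustrative name.

```python
import random

def permutation_test(x, y, num_shuffles=10_000, seed=0):
    """Two-sided Monte Carlo permutation p-value for the difference in means."""
    rng = random.Random(seed)
    pooled = list(x) + list(y)
    n, m = len(x), len(y)
    observed = sum(x) / n - sum(y) / m
    extreme = 0
    for _ in range(num_shuffles):
        rng.shuffle(pooled)                          # relabel: first n become "treatment"
        diff = sum(pooled[:n]) / n - sum(pooled[n:]) / m
        if abs(diff) >= abs(observed) - 1e-12:       # small tolerance for float ties
            extreme += 1
    # Add-one correction: counts the observed labeling itself, so p is never 0.
    return observed, (extreme + 1) / (num_shuffles + 1)

# Hypothetical outcomes for 5 treated and 5 control units.
x = [12, 15, 14, 16, 13]
y = [4, 5, 6, 3, 7]
observed, p_value = permutation_test(x, y, num_shuffles=2000)
print("observed difference:", observed, "p-value:", p_value)
```

Shuffling the pooled outcomes is valid here only because, under the sharp null, each unit's outcome would have been the same under either label.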
Estimating the Average Treatment Effect
5.1 Difference in observed means
Testing asks whether there is evidence of any effect. It is often more useful to estimate the ATE itself, because an estimate says roughly how big the effect is, not just whether it is distinguishable from zero. The estimator is the difference in observed means:
$$\hat\tau=\bar X-\bar Y=\frac{1}{n}\sum_{i\in T}t_i-\frac{1}{m}\sum_{i\in C}c_i.$$ Randomization makes this an unbiased estimator of
$$\tau=\bar t-\bar c.$$
5.2 Variance: the unobservable covariance problem
The variance of $\hat\tau$ depends on the variability in the treatment potential outcomes, the variability in the control potential outcomes,
and the covariance between the two potential-outcome lists.
This is where the disjoint-list issue matters most. We can estimate the variation among observed treatment outcomes from $X_1,\ldots,X_n$ and
the variation among observed control outcomes from $Y_1,\ldots,Y_m$. But we cannot directly estimate how $t_i$ and $c_i$ pair within the same
person, because no person reveals both outcomes.
Derivation: finite-population variance and the conservative bound
First recall the variance of a simple random sample sum without replacement. Suppose a population of size $N$ has mean $\mu$ and variance
$\sigma^2$ using the denominator $N$. Draw a simple random sample of size $n$, and let
$$S_n=X_1+\cdots+X_n.$$
By symmetry, $\E[S_n]=n\mu$. Also, by exchangeability,
$$\Var(S_n)=n\sigma^2+n(n-1)\Cov(X_1,X_2).$$
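To find $\Cov(X_1,X_2)$, use the census case $n=N$. Then $S_N$ is the fixed population total, so $\Var(S_N)=0$. Therefore
$$0=N\sigma^2+N(N-1)\Cov(X_1,X_2)\quad\Longrightarrow\quad\Cov(X_1,X_2)=-\frac{\sigma^2}{N-1}.$$
Substituting back gives $$\Var(S_n)=n\sigma^2\,\frac{N-n}{N-1},\qquad\Var(\bar X)=\frac{\sigma^2}{n}\cdot\frac{N-n}{N-1}.$$
For disjoint treatment and control samples, the cross-covariance is $$\Cov(\bar X,\bar Y)=-\frac{1}{N-1}\text{cov}(t,c),$$ where
$\text{cov}(t,c)$ is the finite-population covariance of the paired potential outcomes $(t_i,c_i)$. Write $\sigma_t^2$ and $\sigma_c^2$ for the finite-population variances of the $t_i$ and the $c_i$ (denominator $N$), and $\sigma_\tau^2=\text{Var}(t_i-c_i)=\sigma_t^2+\sigma_c^2-2\,\text{cov}(t,c)$. Substituting and regrouping gives
$$\Var(\hat\tau)=\Var(\bar X)+\Var(\bar Y)-2\Cov(\bar X,\bar Y)=\frac{N}{N-1}\left(\frac{\sigma_t^2}{n}+\frac{\sigma_c^2}{m}\right)-\frac{\sigma_\tau^2}{N-1}.$$
So the correction is $-\text{Var}(t_i-c_i)/(N-1)\le 0$. That is why dropping it gives an upper bound on $\Var(\hat\tau)$.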
The term $\sigma_\tau^2$ is unobservable because it requires knowing both $t_i$ and $c_i$ for the same unit. That is exactly the fundamental
causal missing-data problem again.
The sign comes from the identity $$\Cov(\bar X,\bar Y)=-\frac{1}{N-1}\text{cov}(t,c).$$ The $\text{cov}(t,c)$ part refers to paired potential
outcomes for the same person. Those are often positively related: a person with a high control outcome might also tend to have a high treatment
outcome. The negative sign comes from $\bar X$ and $\bar Y$ being computed from disjoint groups. If randomization puts many
high-potential-outcome people into treatment, those same people cannot appear in control, so the control average is pushed downward. This is the
usual without-replacement tradeoff.
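Both the covariance identity and the resulting variance decomposition can be checked by brute force on a tiny population. The sketch below uses made-up potential outcomes, assumes the control group is every unit not treated ($m=N-n$), and enumerates all assignments exactly. It compares the randomization variance of $\hat\tau$ with $\frac{N}{N-1}\left(\sigma_t^2/n+\sigma_c^2/m\right)-\sigma_\tau^2/(N-1)$, where every variance uses denominator $N$. The helpers `avg` and `pop_cov` are illustrative names.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical potential outcomes (illustrative numbers only).
t = [7, 5, 9, 4, 6, 8]
c = [5, 5, 6, 3, 7, 6]
N, n = len(t), 3
m = N - n                                   # assumption: control = all remaining units

def avg(values):
    values = list(values)
    return Fraction(sum(values), len(values))

def pop_cov(a, b):
    """Finite-population covariance with denominator len(a); pop_cov(a, a) is the variance."""
    a_bar, b_bar = avg(a), avg(b)
    return avg((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))

x_bars, y_bars = [], []
for T in combinations(range(N), n):         # all assignments are equally likely
    C = [i for i in range(N) if i not in T]
    x_bars.append(avg(t[i] for i in T))
    y_bars.append(avg(c[i] for i in C))

tau_hats = [x - y for x, y in zip(x_bars, y_bars)]
exact_var = pop_cov(tau_hats, tau_hats)     # variance over the randomization distribution
cross_cov = pop_cov(x_bars, y_bars)         # Cov(X-bar, Y-bar)

effects = [ti - ci for ti, ci in zip(t, c)]
bound = Fraction(N, N - 1) * (pop_cov(t, t) / n + pop_cov(c, c) / m)
formula = bound - pop_cov(effects, effects) / (N - 1)

print("Cov(X-bar, Y-bar):", cross_cov, "vs", -pop_cov(t, c) / (N - 1))
print("Var(tau-hat):", exact_var, "vs formula", formula, "with bound", bound)
```

The check is possible only because the sketch knows both potential outcomes for every unit; with real data the `effects` list is exactly what is missing.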
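Since $\sigma_\tau^2\ge 0$, we get the conservative upper bound
$$\Var(\hat\tau)\le\frac{N}{N-1}\left(\frac{\sigma_t^2}{n}+\frac{\sigma_c^2}{m}\right),$$
where $\sigma_t^2$ and $\sigma_c^2$ are the finite-population variances of the two potential-outcome lists (denominator $N$). Each factor $\frac{N}{N-1}\sigma^2$ is the same variance with denominator $N-1$, which is the quantity a sample variance estimates.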
When the treatment and control group sizes are large enough for an approximate normal argument, use
$$\hat\tau\pm 2\sqrt{\frac{S_X^2}{n}+\frac{S_Y^2}{m}}.$$ This is conservative because it estimates an upper bound on the true variance.
The interval is conservative for a good reason: the data cannot reveal how each unit's treatment outcome would pair with its own control
outcome. Instead of pretending to know that covariance, we use a variance bound.
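The interval is simple to compute once the two observed lists are in hand. The sketch below assumes $S_X^2$ and $S_Y^2$ are the usual sample variances with denominators $n-1$ and $m-1$, which is what `statistics.variance` computes. The data are made up and far too small for the normal approximation, so the example only illustrates the arithmetic.

```python
from math import sqrt
from statistics import mean, variance

def conservative_interval(x, y):
    """Approximate 95% interval: tau-hat +/- 2 * sqrt(S_X^2 / n + S_Y^2 / m)."""
    tau_hat = mean(x) - mean(y)
    se = sqrt(variance(x) / len(x) + variance(y) / len(y))   # variance() divides by n - 1
    return tau_hat - 2 * se, tau_hat + 2 * se

# Hypothetical observed outcomes.
x = [12, 15, 14, 16, 13]    # treatment group
y = [9, 11, 10, 12, 8]      # control group
low, high = conservative_interval(x, y)
print("interval:", (low, high))
```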
Standard Causal Notation
6.1 Potential-outcome notation
Causal inference usually switches to the following notation. This is the same setup as above, but now $n$ denotes the total number of units and
$n_1,n_0$ denote the treatment and control counts.
| Symbol | Meaning |
|---|---|
| $Z_i$ | Treatment indicator: $Z_i=1$ if unit $i$ receives treatment, $0$ otherwise. |
| $Y_i(1)$ | Potential outcome for unit $i$ under treatment. |
| $Y_i(0)$ | Potential outcome for unit $i$ under control. |
| $\tau_i=Y_i(1)-Y_i(0)$ | Individual treatment effect. |
| $\bar\tau=n^{-1}\sum_i\tau_i$ | Average treatment effect. |
6.2 Observed outcome
The observed outcome is whichever potential outcome corresponds to the assigned group:
$$Y_i=Z_iY_i(1)+(1-Z_i)Y_i(0).$$
The main randomized-experiment assumption is $$\{Y_i(1),Y_i(0)\}\perp\!\!\!\perp Z_i.$$ In words: treatment assignment is independent of the
unit's potential outcomes.
The lecture also assumes no interference between units: one unit's potential outcomes are not affected by another unit's treatment assignment.
Randomization is powerful because it gives this independence by design. In observational studies, the same independence is usually false unless
we condition on enough confounders.
Observational Studies and Confounding
7.1 Why observational studies are harder
In an observational study, units are not randomly assigned. Treatment status can be related to other variables that also affect the outcome.
Those variables are confounders.
This is the key break from the randomized-trial setup. In an observational study, units effectively "assign themselves" to treatment or control,
so it is no longer safe to assume $$\{Y(1),Y(0)\}\perp\!\!\!\perp Z.$$ The potential outcomes and the assignment can both be influenced by the
same confounding variable. That is why the next move is not "just compare the two groups," but "compare treated and control units after
accounting for the confounder."
The lecture's memory example is coffee and lung cancer in the era when smoking was common in cafes. Coffee drinking was associated with lung
cancer, but smoking was a confounder: it affected the chance of being in the coffee group and the chance of lung cancer.
In that example, one useful coding is:
$$X_i=\begin{cases}1 & \text{if person }i\text{ smokes}\\ 0 & \text{if person }i\text{ does not smoke}\end{cases},\qquad Z_i=\begin{cases}1 &
\text{if person }i\text{ drinks coffee}\\ 0 & \text{if person }i\text{ does not drink coffee}\end{cases},$$ $$Y_i=\begin{cases}1 & \text{if
person }i\text{ gets lung cancer}\\ 0 & \text{if person }i\text{ does not get lung cancer.}\end{cases}$$
Here $X_i$ is the confounder, $Z_i$ is the treatment/group assignment, and $Y_i$ is the observed outcome. The potential outcomes would be
$Y_i(1)$, the lung-cancer outcome if person $i$ drank coffee, and $Y_i(0)$, the lung-cancer outcome if person $i$ did not drink coffee.
If $X$ affects both treatment $Z$ and outcome $Y$, a treated-control comparison can mix the treatment effect with pre-existing differences.
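The mixing can be made concrete with an exact calculation. In the sketch below every probability is hypothetical, and coffee is given no effect on lung cancer at all, so the true ATE is zero by construction. The raw coffee versus no-coffee comparison still comes out positive because smokers are overrepresented among coffee drinkers.

```python
# Hypothetical probabilities (not from the lecture). Coffee has NO effect here:
# the cancer risk depends on smoking status X only.
p_smoker = 0.3
p_x = {1: p_smoker, 0: 1 - p_smoker}      # P(X = x)
p_coffee_given = {1: 0.8, 0: 0.4}         # P(Z = 1 | X = x)
p_cancer_given = {1: 0.15, 0: 0.01}       # P(Y = 1 | X = x), with or without coffee

def cancer_rate(z):
    """P(Y = 1 | Z = z): average the risk over the smoking mix inside group z."""
    p_z_given = {x: p_coffee_given[x] if z == 1 else 1 - p_coffee_given[x] for x in (0, 1)}
    p_z = sum(p_x[x] * p_z_given[x] for x in (0, 1))
    return sum(p_cancer_given[x] * p_x[x] * p_z_given[x] for x in (0, 1)) / p_z

naive_difference = cancer_rate(1) - cancer_rate(0)
true_ate = 0.0                            # by construction in this sketch

print("P(cancer | coffee):   ", round(cancer_rate(1), 4))
print("P(cancer | no coffee):", round(cancer_rate(0), 4))
print("naive difference:", round(naive_difference, 4), "true ATE:", true_ate)
```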
7.2 Conditional independence
The observational-study replacement for randomization is an assumption:
For each confounder value $x$, assume $$\{Y(1),Y(0)\}\perp\!\!\!\perp Z\mid X=x.$$ That is, the potential outcomes are conditionally independent
of the group assignment, given the confounder.
This says that within each value of $X$, treated and control units are comparable. In the coffee example, among smokers and among non-smokers
separately, coffee-drinker status is assumed independent of the potential lung-cancer outcomes.
This is a strong assumption. It says there are no unmeasured confounders after conditioning on $X$. If an important confounder is missing, the
causal interpretation can fail.
7.3 CATE and ATE
The conditional average treatment effect is $$\tau(X)=\E[Y(1)-Y(0)\mid X].$$ The overall average treatment effect is then $$\tau=\E[\tau(X)].$$
CATE asks for the treatment effect within comparable strata. ATE averages those stratum-specific effects back over the population.
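With a single binary confounder, this averaging can be sketched as a stratified estimator: estimate the effect inside each stratum, then weight by the stratum's share of the population. The counts below are hypothetical, and the estimate has a causal reading only under the conditional independence assumption above.

```python
# Hypothetical counts: (x, z) -> (number of units, number of units with Y = 1).
counts = {
    (1, 1): (2400, 480), (1, 0): (600, 90),
    (0, 1): (2800, 56),  (0, 0): (4200, 42),
}
total = sum(n for n, _ in counts.values())

def rate(x, z):
    """Observed outcome rate among units with X = x and Z = z."""
    n, ones = counts[(x, z)]
    return ones / n

# Estimated CATE in each stratum, and each stratum's share of the population.
cate = {x: rate(x, 1) - rate(x, 0) for x in (0, 1)}
weight = {x: (counts[(x, 1)][0] + counts[(x, 0)][0]) / total for x in (0, 1)}
ate_stratified = sum(weight[x] * cate[x] for x in (0, 1))

# Raw comparison that ignores X, for contrast.
treated_n = counts[(1, 1)][0] + counts[(0, 1)][0]
treated_ones = counts[(1, 1)][1] + counts[(0, 1)][1]
control_n = counts[(1, 0)][0] + counts[(0, 0)][0]
control_ones = counts[(1, 0)][1] + counts[(0, 0)][1]
naive = treated_ones / treated_n - control_ones / control_n

print("CATE by stratum:", cate)
print("stratified ATE:", ate_stratified, "naive difference:", naive)
```

In these made-up counts the raw difference is several times larger than the stratified estimate, because the high-risk stratum is mostly treated.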
Propensity Scores and IPW
8.1 Propensity score
The propensity score is the probability of receiving treatment given the observed confounder:
$$e(X)=\P(Z=1\mid X)=\E[Z\mid X].$$
Under conditional independence, $e(X)$ depends on the confounder but not directly on the potential outcomes: $$\P(Z=1\mid
X,Y(1),Y(0))=\P(Z=1\mid X).$$
8.2 Inverse propensity weighting
If treated units are overrepresented in some stratum of $X$, each treated unit from that stratum should count less. If treated units are rare in
a stratum, each treated unit from that stratum should count more. That is the inverse-weighting idea.
With estimated propensity scores $\hat e(X_i)$, the inverse propensity weighted estimator is
$$\hat\tau_{\text{IPW}}=\frac{1}{n}\sum_{i=1}^n\frac{Z_iY_i}{\color{red}{\hat e(X_i)}}-\frac{1}{n}\sum_{i=1}^n
\frac{(1-Z_i)Y_i}{\color{red}{1-\hat e(X_i)}}.$$
The denominators come directly from the assignment probabilities within a confounder stratum:
$$\P(Z_i=1\mid X_i)=e(X_i),\qquad\P(Z_i=0\mid X_i)=1-e(X_i).$$
The $Z_i$ term turns on only for treated units, so a treated observed outcome is divided by the chance of being treated, $e(X_i)$. The $(1-Z_i)$
term turns on only for control units, so a control observed outcome is divided by the chance of being control, $1-e(X_i)$. Rare observations get
larger weights because they stand in for many similar units who could have appeared in that group.
In this Horvitz-Thompson form, both weighted sums are divided by the full sample size $n$. After weighting, the first sum estimates the
population mean $\E[Y(1)]$ and the second estimates $\E[Y(0)]$. A separately normalized version also exists, but it is a slightly different
estimator.
Inverse propensity weighting tries to repair imbalance by making each observed unit represent the number of similar units it stands in for.
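A minimal sketch of the Horvitz-Thompson form follows. The data are hypothetical counts expanded to unit level, and the propensity score is estimated by the treated fraction within each stratum of a single binary confounder. That estimation choice is an assumption made for illustration; with more confounders some model for $e(X)$ would be needed. The name `ipw_ate` is illustrative.

```python
def ipw_ate(z, y, e_hat):
    """Horvitz-Thompson IPW estimate of the ATE from unit-level lists."""
    n = len(y)
    if any(e <= 0 or e >= 1 for e in e_hat):
        raise ValueError("propensity scores must lie strictly between 0 and 1")
    treated_term = sum(zi * yi / ei for zi, yi, ei in zip(z, y, e_hat)) / n
    control_term = sum((1 - zi) * yi / (1 - ei) for zi, yi, ei in zip(z, y, e_hat)) / n
    return treated_term - control_term

# Hypothetical counts: (x, z) -> (number of units, number of units with Y = 1).
counts = {
    (1, 1): (2400, 480), (1, 0): (600, 90),
    (0, 1): (2800, 56),  (0, 0): (4200, 42),
}
x, z, y = [], [], []
for (xv, zv), (n_units, ones) in counts.items():
    x += [xv] * n_units
    z += [zv] * n_units
    y += [1] * ones + [0] * (n_units - ones)

# Assumption: estimate e(x) by the treated fraction inside each stratum of X.
treated_fraction = {
    xv: sum(zi for xi, zi in zip(x, z) if xi == xv) / x.count(xv) for xv in (0, 1)
}
e_hat = [treated_fraction[xi] for xi in x]

estimate = ipw_ate(z, y, e_hat)
print("estimated propensities:", treated_fraction)
print("IPW estimate of the ATE:", round(estimate, 4))
```

With propensities estimated this way from one discrete confounder, the IPW estimate coincides with the stratified difference in means weighted by stratum size.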
8.3 Why IPW targets the ATE
If the true propensity score is known, if $0<e(X)<1$, and if conditional independence holds, then the IPW estimator has expectation $\tau$.
Since $Y=Y(1)$ when $Z=1$, $$\E\!\left[\frac{ZY}{e(X)}\right]=\E\!\left[\frac{ZY(1)}{e(X)}\right].$$ Condition on $X$:
$$\E\!\left[\frac{ZY(1)}{e(X)}\mid X\right]=\frac{1}{e(X)}\E[ZY(1)\mid X].$$ By conditional independence, $$\E[ZY(1)\mid X]=\E[Z\mid X]\E[Y(1)\mid X]=e(X)\E[Y(1)\mid X].$$ The $e(X)$ cancels, leaving $\E[Y(1)\mid X]$, and iterating expectations gives $\E[Y(1)]$. The same argument with $1-Z$ and $1-e(X)$ gives $\E[Y(0)]$ for the control term, so the difference has expectation $\E[Y(1)]-\E[Y(0)]=\tau$.
Positivity matters: if $e(X)=0$ or $e(X)=1$ for some stratum, then one group is absent there and IPW either blows up or cannot learn that
stratum's counterfactual comparison.
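The blow-up is easy to see numerically. The short sketch below lists the weights $1/e$ and $1/(1-e)$ for a few illustrative propensity values; `ipw_weights` is a hypothetical helper, not part of any library.

```python
def ipw_weights(e):
    """Weights for a treated unit and a control unit whose propensity score is e."""
    if not 0 < e < 1:
        raise ValueError("positivity fails: e must be strictly between 0 and 1")
    return 1 / e, 1 / (1 - e)

for e in (0.5, 0.1, 0.01, 0.001):
    w_treated, w_control = ipw_weights(e)
    print(f"e = {e}: treated weight {w_treated:.1f}, control weight {w_control:.3f}")
```

As $e$ approaches 0, a single treated unit carries an enormous weight, so the estimate becomes very sensitive to that one outcome even before positivity fails outright.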
Estimating the propensity score adds another layer of uncertainty. The clean expectation identity above uses the true $e(X)$; in practice, $\hat
e(X)$ must be estimated carefully, and the causal interpretation still depends on no unmeasured confounding.
Formula Sheet and Recall Map
9.1 Main formulas
| Concept | Formula | Meaning |
|---|---|---|
| ATE | $\tau=N^{-1}\sum_i(t_i-c_i)=\bar t-\bar c$ | Average causal effect across the finite population. |
| RCT estimator | $\hat\tau=\bar X-\bar Y$ | Difference between observed treatment and control means. |
| Conservative variance estimate | $S_X^2/n+S_Y^2/m$ | Usable because the exact covariance term is unobservable. |
| Observed outcome | $Y_i=Z_iY_i(1)+(1-Z_i)Y_i(0)$ | Only one potential outcome is revealed. |
| Randomized experiment assumption | $\{Y_i(1),Y_i(0)\}\perp\!\!\!\perp Z_i$ | Assignment independent of potential outcomes. |
| Conditional ignorability | $\{Y(1),Y(0)\}\perp\!\!\!\perp Z\mid X$ | No unmeasured confounding after conditioning on $X$. |
| Propensity score | $e(X)=\P(Z=1\mid X)$ | Probability of treatment within confounder strata. |
| IPW estimator | $\hat\tau_{\text{IPW}}=\frac{1}{n}\sum_{i=1}^n\frac{Z_iY_i}{\hat e(X_i)}-\frac{1}{n}\sum_{i=1}^n\frac{(1-Z_i)Y_i}{1-\hat e(X_i)}$ | Weighted comparison that corrects observed imbalance. |
9.2 Recall map
$$\text{causal question}\Rightarrow \text{potential outcomes}\Rightarrow \text{one observed, one missing}.$$ $$\text{random
assignment}\Rightarrow \text{unbiased difference in means}.$$ $$\text{observational assignment}\Rightarrow \text{confounding}\Rightarrow
\text{condition or weight by }X.$$
The final lecture's one-line memory hook: causal inference is missing-data inference where the missing values are counterfactual outcomes, and
the assignment mechanism determines how credible the comparison is.
Common Mistakes
1. Treating association as causation.
A difference in observed group means is causal only if the assignment mechanism or assumptions justify comparing those groups.
2. Forgetting the missing counterfactual.
We never observe both $Y_i(1)$ and $Y_i(0)$ for the same unit. This is why causal inference is fundamentally hard.
3. Confusing Fisher's sharp null with Neyman's weak null.
The sharp null says no unit-level effects. The weak null says the average effect is zero.
4. Using the exact variance formula as if all terms were observable.
The covariance between $t_i$ and $c_i$ requires seeing both potential outcomes for the same unit. Use the conservative bound instead.
5. Forgetting positivity in IPW.
If a stratum has only treated or only control units, inverse propensity weighting cannot create the missing comparison.
6. Thinking propensity scores solve unmeasured confounding.
Propensity methods help only for confounders that are measured and included in the treatment-assignment model.
Data 145 Study Guide - Lecture 27 - Standalone Review Version