Recall Script: What These Lectures Are Really About
The most useful way to remember Lectures 23 and 24 is not as a list of separate tests. The better story is: normal data can be rotated into
orthogonal coordinates; those coordinates split into nuisance, signal, and residual pieces; and the classical test distributions are just ways
to compare the signal size to the noise size.
The whole lecture sequence is a machine: $$\text{normal vector}\longrightarrow\text{orthogonal projections}\longrightarrow \text{signal/noise
ratio}\longrightarrow z,\chi^2,t,\text{ or }F.$$
How to read this page:
first learn what $\chi^2$, t, and F are as mathematical objects; then see why the canonical model produces exactly those objects; then learn the
general model-space rotation $Z=Q^TY$; then use the geometric pictures as the memory hook; finally translate one-sample t, two-sample t, ANOVA,
and regression into the same canonical language.
So the goal is not to memorize many formulas independently. The goal is to recognize the same pattern every time: find the tested direction,
find the residual directions, decide whether $\sigma^2$ is known, and decide whether the signal is one-dimensional or multidimensional.
Review / Background: The Distribution Objects $\chi^2$, t, and F
1.1 The three objects before the model
Before talking about linear models, pin down the three probability objects. Each one answers a slightly different "how large is this signal?"
question.
| Object | Mathematical definition | Conceptual meaning |
| --- | --- | --- |
| $\chi^2_d$ | If $U_1,\ldots,U_d\overset{iid}{\sim}N(0,1)$, then $\sum_{i=1}^dU_i^2\sim\chi^2_d$. | Squared length of a $d$-dimensional standardized Gaussian vector. |
| $t_d$ | If $Z\sim N(0,1)$, $V\sim\chi^2_d$, and $Z\perp V$, then $Z/\sqrt{V/d}\sim t_d$. | One signed normal coordinate divided by an independent estimated noise scale. |
| $F_{d_1,d_2}$ | If $V_1\sim\chi^2_{d_1}$, $V_2\sim\chi^2_{d_2}$, and $V_1\perp V_2$, then $(V_1/d_1)/(V_2/d_2)\sim F_{d_1,d_2}$. | Ratio of two average squared Gaussian lengths. |
The memory version is: $$\chi^2=\text{squared Gaussian length},\qquad t=\frac{\text{signal}}{\text{estimated noise}},\qquad F=\frac{\text{signal
sum of squares per signal df}}{\text{residual sum of squares per residual df}}.$$ This is exactly what linear-model tests will produce.
If $T\sim t_d$, then $$T^2\sim F_{1,d}.$$ This is immediate because $Z^2\sim \chi^2_1$, so $$T^2=\frac{Z^2/1}{V/d}.$$ A two-sided t-test is
therefore the same rejection rule as the corresponding one-degree-of-freedom F-test.
The unknown $\sigma$ disappears in t and F ratios. If $\tilde Z=\sigma Z$ and $\tilde V=\sigma^2V$, then $$\frac{\tilde Z}{\sqrt{\tilde
V/d}}=\frac{\sigma Z}{\sqrt{\sigma^2V/d}}=\frac{Z}{\sqrt{V/d}}.$$ This cancellation is why t and F statistics are pivotal when the variance is
unknown.
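Both facts above are easy to check numerically. The sketch below is only an illustration: the values of `z`, `v`, `d`, and `sigma` are arbitrary made-up numbers, and `t_ratio` and `f_ratio` are hypothetical helper names, not part of the lectures.

```python
import math

def t_ratio(z, v, d):
    """Signed t-type ratio: z / sqrt(v / d)."""
    return z / math.sqrt(v / d)

def f_ratio(v1, d1, v2, d2):
    """F-type ratio: (v1 / d1) / (v2 / d2)."""
    return (v1 / d1) / (v2 / d2)

# Made-up realized values standing in for Z and V (illustration only).
z, v, d = 1.2, 7.5, 5
sigma = 3.0  # an arbitrary "unknown" noise scale

t_plain = t_ratio(z, v, d)
t_scaled = t_ratio(sigma * z, sigma**2 * v, d)  # sigma cancels in the ratio
f_from_t = f_ratio(z**2, 1, v, d)               # T^2 written as an F_{1,d} ratio

print(t_plain, t_scaled, f_from_t)
```

The scaled and unscaled ratios agree up to floating-point error, and squaring the t-type ratio reproduces the one-degree-of-freedom F ratio.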
1.2 What the degrees of freedom $d$ means
If $V\sim\chi^2_d$, then by definition $$V=U_1^2+\cdots+U_d^2$$ for $d$ independent standard normal variables. So $d$ is not a decorative
adjustment: it is the number of independent Gaussian directions being squared.
$$E[V]=d\qquad\Longrightarrow\qquad E[V/d]=1.$$
Dimension of a Gaussian subspace = degrees of freedom of the chi-squared built from that subspace. Dividing by $d$ turns total squared variation
into average squared variation per direction.
1.3 The normal fact that makes everything work
If $Z\sim N_n(0,\sigma^2I_n)$ and $Q$ is orthogonal, meaning $Q^TQ=I_n$, then $$QZ\sim N_n(0,\sigma^2I_n).$$ The spherical normal distribution
is unchanged by rotations and reflections.
Use the affine transformation formula for multivariate normals. If $Y=QZ$, then $$Y\sim
N_n(Q0,\;Q(\sigma^2I_n)Q^T)=N_n(0,\sigma^2QQ^T)=N_n(0,\sigma^2I_n).$$ The covariance stays spherical because $Q$ is orthogonal.
If $V_1$ and $V_2$ are orthogonal subspaces of dimensions $d_1$ and $d_2$, and $Z\sim N_n(0,\sigma^2I_n)$, then $$\|P_{V_1}Z\|^2\sim
\sigma^2\chi^2_{d_1},\qquad \|P_{V_2}Z\|^2\sim \sigma^2\chi^2_{d_2},$$ and these squared projection lengths are independent.
Choose an orthonormal basis adapted to the subspaces. After rotating into that basis, the coordinates are still independent $N(0,\sigma^2)$
variables. Projection lengths are just sums of squares of disjoint coordinate blocks.
A chi-squared distribution does not appear just because a variable is centered at 0. It appears because we take a sum of
squares of independent standardized Gaussian coordinates.
1.4 Translation dictionary
This dictionary is the bridge from probability objects to the canonical model. Every later example is just a different way of deciding which
projection is nuisance, which projection is signal, and which projection is residual noise.
| Geometric object | Statistical role | Distribution under the null |
| --- | --- | --- |
| Projection onto tested direction | Signal | Normal if 1D, chi-squared length if multidimensional |
| Projection onto residual directions | Noise / variance estimate | $\sigma^2\chi^2_{d_r}$ length squared |
| Projection onto nuisance directions | Unrestricted mean under both hypotheses | Accounted for, but not evidence for the target signal |
| Ratio of signal to residual scale | Test statistic when $\sigma^2$ unknown | t if 1D signal, F if multidimensional signal |
Now let's go into the main model.
The distributions above are the ingredients. The canonical model is the recipe that tells us why those ingredients show up in hypothesis tests.
It will label each coordinate as nuisance, signal, or residual, and then the right distribution will almost choose itself.
The Canonical Model: Nuisance, Signal, Residual
2.1 The organizing model
The canonical model is the clean coordinate system we wish every testing problem already came in. The observed vector is split into three
orthogonal blocks: $$Z=\begin{bmatrix}Z_0\\Z_1\\Z_r\end{bmatrix} \sim N_n\!\left( \begin{bmatrix}\mu_0\\\mu_1\\0\end{bmatrix}, \sigma^2I_n
\right),\qquad d_0+d_1+d_r=n.$$ We test $$H_0:\mu_1=0\qquad\text{vs}\qquad H_1:\mu_1\neq0.$$
| Block | Name | Why it matters |
| --- | --- | --- |
| $Z_0\in\mathbb R^{d_0}$ | Nuisance | Its mean $\mu_0$ is unknown under both $H_0$ and $H_1$, so it is not evidence for or against the target hypothesis. |
| $Z_1\in\mathbb R^{d_1}$ | Signal | Its mean is forced to be 0 under $H_0$ and allowed to move under $H_1$. |
| $Z_r\in\mathbb R^{d_r}$ | Residual noise | Its mean is known to be 0 under both hypotheses, so its squared length estimates $\sigma^2$. |
This is the important "we can use this!" moment: the residual block is known to be pure noise. Its mean is 0 and it has the same noise variance
$\sigma^2$ as the signal block. So if $\sigma^2$ is known, we scale by it directly; if $\sigma^2$ is unknown, we use the residual block to
estimate the noise level.
The whole test is now a comparison: $$\text{How large is the signal block }Z_1\text{ compared with the noise scale?}$$ Everything else is just
deciding whether the noise scale is known and whether the signal is a signed coordinate or a multidimensional length.
2.2 Why $Z_r$ estimates variance
Since the residual block has known mean 0, $$Z_r\sim N_{d_r}(0,\sigma^2I_{d_r}).$$ Written by coordinates, this means
$$Z_{r,1},\ldots,Z_{r,d_r}\overset{iid}{\sim}N(0,\sigma^2).$$ Dividing each coordinate by $\sigma$ standardizes it:
$$\frac{Z_{r,1}}{\sigma},\ldots,\frac{Z_{r,d_r}}{\sigma}\overset{iid}{\sim}N(0,1).$$
Now expand the squared length: $$\|Z_r\|^2=Z_{r,1}^2+\cdots+Z_{r,d_r}^2.$$ Dividing by $\sigma^2$ can be pushed inside the sum:
$$\frac{\|Z_r\|^2}{\sigma^2} = \frac{Z_{r,1}^2+\cdots+Z_{r,d_r}^2}{\sigma^2} = \left(\frac{Z_{r,1}}{\sigma}\right)^2+\cdots+
\left(\frac{Z_{r,d_r}}{\sigma}\right)^2 = \sum_{j=1}^{d_r}\left(\frac{Z_{r,j}}{\sigma}\right)^2.$$ Since this is a sum of $d_r$ squared standard
normals, $$\frac{\|Z_r\|^2}{\sigma^2}\sim\chi^2_{d_r}.$$
Taking expectations gives $E[\|Z_r\|^2]=d_r\sigma^2$, so dividing by $d_r$ gives an unbiased estimator of the variance: $$\hat\sigma^2=\frac{\|Z_r\|^2}{d_r}.$$ This is "average squared residual length per
residual direction."
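A small simulation makes the "average squared residual length" reading concrete. This is a minimal sketch under made-up settings: `sigma`, `d_r`, `n_rep`, and the seed are arbitrary choices, and `sigma2_hat` is a hypothetical helper name.

```python
import random

def sigma2_hat(z_r):
    """Average squared residual coordinate: ||Z_r||^2 / d_r."""
    return sum(z * z for z in z_r) / len(z_r)

rng = random.Random(0)            # fixed seed so the run is repeatable
sigma, d_r, n_rep = 2.0, 4, 20000  # arbitrary illustrative settings

# Draw many residual blocks Z_r ~ N(0, sigma^2 I) and estimate sigma^2 from each.
estimates = [
    sigma2_hat([rng.gauss(0.0, sigma) for _ in range(d_r)])
    for _ in range(n_rep)
]
mean_estimate = sum(estimates) / n_rep
print(mean_estimate)  # should land near sigma**2 = 4.0
```

Individual estimates are noisy when $d_r$ is small, but their average settles near $\sigma^2$, which is what unbiasedness means here.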
Nuisance and residual are not the same thing. $Z_0$ also contains randomness, but its mean is unknown, so its raw squared length includes
unknown mean structure. $Z_r$ has mean 0, so its squared length is interpretable as pure noise.
2.3 The two questions to ask before any test
Once the canonical blocks are understood, the rest of the test is determined by two questions.
| Question | If yes | If no |
| --- | --- | --- |
| Do we know $\sigma^2$? | Use the known $\sigma$ as the noise scale. | Use $Z_r$ to estimate noise with $\hat\sigma^2=\lVert Z_r\rVert^2/d_r$. |
| Is the signal one-dimensional? | Keep the signed coordinate $Z_1$ and use z or t. | Use the squared length $\lVert Z_1\rVert^2$ and use $\chi^2$ or F. |
This is the full canonical intuition in one sentence: $Z_0$ is fit/accounted for but not used as evidence, $Z_1$ is the signal being tested,
$Z_r$ supplies noise if needed, and the dimension of $Z_1$ decides whether we keep a signed coordinate or use a squared norm.
The subscript $r$ means residual block, not one specific coordinate. If $d_r=1$, then $Z_r$ has one coordinate and $\sqrt{\|Z_r\|^2/d_r}=|Z_r|$.
If $d_r>1$, then $Z_r$ is a vector of several residual coordinates, and $\|Z_r\|^2/d_r$ is the average squared residual coordinate.
Read the table as an algebraic summary of the story. Known variance means the noise scale is fixed. Unknown variance means the residual block
supplies the scale. A one-dimensional signal keeps its sign and gives z or t. A multidimensional signal has no single sign, so we use squared
length and get $\chi^2$ or F. If $d_1=1$, the squared-length F version is just $T^2$, so the signed t statistic is usually more informative.
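For reference, the four cases can be written out explicitly. The display below is a reconstruction from the definitions in Sections 1 and 2 rather than a copy of the lecture's own table; each distribution is the one that holds under $H_0:\mu_1=0$.

$$
\begin{array}{lll}
\text{Case} & \text{Statistic} & \text{Null distribution}\\
\hline
\sigma^2\text{ known},\ d_1=1 & Z_1/\sigma & N(0,1)\\
\sigma^2\text{ known},\ d_1>1 & \|Z_1\|^2/\sigma^2 & \chi^2_{d_1}\\
\sigma^2\text{ unknown},\ d_1=1 & Z_1\big/\sqrt{\|Z_r\|^2/d_r} & t_{d_r}\\
\sigma^2\text{ unknown},\ d_1>1 & \dfrac{\|Z_1\|^2/d_1}{\|Z_r\|^2/d_r} & F_{d_1,d_r}
\end{array}
$$

Each row follows from the definitions of $\chi^2$, t, and F in Section 1.1 applied to the independent blocks $Z_1$ and $Z_r$.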
General Linear Models: Rotate Into the Canonical Model
3.1 From general coordinates to canonical coordinates
In a general linear model, observe $$Y\sim N_n(\theta,\sigma^2I_n),$$ where the mean vector lies in a model subspace $\mathcal H\subseteq\mathbb
R^n$. Test nested subspaces $$H_0:\theta\in\mathcal H_0\qquad \text{vs.}\qquad H_1:\theta\in\mathcal H\setminus\mathcal H_0,$$ with $\mathcal
H_0\subseteq\mathcal H$.
This is the big-picture version of the rotation. The full model space $\mathcal H$ contains all mean vectors allowed by the larger model. The
null space $\mathcal H_0$ contains the mean vectors allowed if the null hypothesis is true. Points in $\mathcal H\setminus\mathcal H_0$ are
alternatives; the orthogonal part $\mathcal H\cap\mathcal H_0^\perp$ is the signal subspace we test after accounting for nuisance directions.
Model-space dictionary:
$\mathcal H$: full model space; $\mathcal H_0$: null/nuisance space; $\mathcal H\setminus\mathcal H_0$: alternative region; $\mathcal
H\cap\mathcal H_0^\perp$: orthogonal tested signal space; $\mathcal H^\perp$: residual/noise space.
$Q_0$: orthonormal basis for $\mathcal H_0$.
$Q_1$: orthonormal basis for $\mathcal H\cap\mathcal H_0^\perp$.
$Q_r$: orthonormal basis for $\mathcal H^\perp$.
Stack them: $$Q=[Q_0\mid Q_1\mid Q_r].$$ Then $Q$ is orthogonal and $Z=Q^TY$ is in canonical coordinates.
The rotated mean is $$E[Z]=E[Q^TY]=Q^T\theta = \begin{bmatrix} Q_0^T\theta\\ Q_1^T\theta\\ Q_r^T\theta \end{bmatrix} = \begin{bmatrix} \mu_0\\
\mu_1\\ 0 \end{bmatrix}.$$ The residual block is 0 in mean because every allowed model mean $\theta\in\mathcal H$ is perpendicular to $\mathcal
H^\perp$. The tested block decides the hypothesis: $$H_0:\theta\in\mathcal H_0\Longleftrightarrow \mu_1=Q_1^T\theta=0,\qquad
H_1:\theta\in\mathcal H\setminus\mathcal H_0\Longleftrightarrow \mu_1\neq0.$$
Because $Q$ is orthogonal, $$Z=Q^TY\sim N_n\!\left( \begin{bmatrix}\mu_0\\\mu_1\\0\end{bmatrix}, \sigma^2I_n \right).$$ That is exactly the
canonical block model from Section 2: nuisance, signal, residual.
The choice of basis inside each subspace is not important. The test only depends on projection lengths such as $\|Q_1^TY\|^2$ and
$\|Q_r^TY\|^2$, which are intrinsic geometric quantities.
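One way to build such a $Q$ in practice is Gram-Schmidt applied to spanning vectors listed in the order null space, rest of the model space, everything else. The sketch below is an illustration under assumptions: the $n=4$ two-group design and the data `Y` are made up, and `dot` and `gram_schmidt` are hypothetical helper names.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def gram_schmidt(vectors, tol=1e-10):
    """Orthonormalize in order, dropping vectors already in the span."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            c = dot(w, q)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:
            basis.append([wi / norm for wi in w])
    return basis

# Made-up example: n = 4, two groups of two observations.
ones = [1, 1, 1, 1]      # spans H_0 (one common mean)
group1 = [1, 1, 0, 0]    # together with ones, spans H (one mean per group)
standard = [[1 if i == j else 0 for j in range(4)] for i in range(4)]

# Order matters: H_0 first, then the rest of H, then the remaining directions.
Q = gram_schmidt([ones, group1] + standard)
Q0, Q1, Qr = Q[:1], Q[1:2], Q[2:]

Y = [3.0, 5.0, 8.0, 10.0]   # made-up data
Z = [dot(q, Y) for q in Q]  # canonical coordinates Z = Q^T Y
signal_sq = sum(dot(q, Y) ** 2 for q in Q1)
resid_sq = sum(dot(q, Y) ** 2 for q in Qr)
print(len(Q0), len(Q1), len(Qr), signal_sq, resid_sq)
```

A different spanning set for the residual space would give different residual coordinates but the same `resid_sq`, which is the sense in which the projection lengths are intrinsic.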
The Geometry: Ratios, Angles, and Rotations
Now move from the general rotation recipe to the picture. This section is about visualization: axes, ratios, angles, and rotations. In canonical
coordinates, the signal coordinate is one axis and the residual coordinate is a perpendicular axis. In original data coordinates, the same idea
may look tilted, so Section 3's $Q$ rotation turns it into the signal/residual split.
4.1 The $n=2$ canonical picture
Start with the picture, not the formula. Put the tested coordinate $Z_1$ on the horizontal axis and the residual coordinate $Z_2$ on the
vertical axis. Under the null, the cloud is centered at the origin. Under the alternative, the center slides horizontally, because only the
signal coordinate changes.
Observe $$Z\sim N_2\!\left(\begin{bmatrix}\mu_1\\0\end{bmatrix},\sigma^2I_2\right),$$ and test $$H_0:\mu_1=0 \qquad \text{vs} \qquad
H_1:\mu_1\neq 0.$$ The coordinate $Z_1$ is the tested signal direction. The coordinate $Z_2$ is pure residual noise with mean 0 under both
hypotheses.
An observed point is evidence against $H_0$ when it points too strongly in the $Z_1$ direction relative to the residual direction. In the
unknown-variance case, that comparison is angular: $$\tan(\theta)=\frac{Z_2}{Z_1},\qquad \frac{Z_1}{Z_2}=\cot(\theta),\qquad
\frac{Z_1^2}{Z_2^2}=\cot^2(\theta).$$ This tiny two-dimensional picture is the seed of the whole four-test table.
4.2 Angle reading: ratios become cotangents
The diagram below is the visual version of the algebra above. The observed point has two coordinates: its horizontal signal projection $Z_1$ and
its vertical residual projection $Z_2$. Comparing $Z_1^2/Z_2^2$ is the same as comparing the squared cotangent of the point's angle from the
signal axis.
With unknown variance, absolute scale is not reliable, but angle is. The statistic $|Z_1|/|Z_2|$ measures how closely the observed vector points
in the signal direction. A vector nearly aligned with the $Z_1$-axis is surprising under the rotationally symmetric null.
In the canonical $Z_1,Z_2$ coordinates, the observed point determines an angle $\theta$ from the signal axis. Since $\tan(\theta)=Z_2/Z_1$, the
ratio $Z_1/Z_2=\cot(\theta)=1/\tan(\theta)$. Squaring gives the $F_{1,1}$ form $Z_1^2/Z_2^2=\cot^2(\theta)$.
A useful way to remember the unknown-variance test: $$\frac{Z_1}{Z_2}=\cot(\theta),\qquad \frac{Z_1^2}{Z_2^2}=\cot^2(\theta).$$ The classical t
statistic is $Z_1/|Z_2|$: it keeps the sign of the signal coordinate and uses $|Z_2|$ as the denominator scale. The squared test is the same: $$T^2=\frac{Z_1^2}{Z_2^2}.$$ Large $|T|$ means the
observed vector is more horizontal than expected under the rotationally symmetric null.
4.3 The $n=2$ one-sample t-test is a rotation
For $X_1,X_2\overset{iid}{\sim}N(\mu,\sigma^2)$, the mean vector is $\mu(1,1)^T$. The signal direction is the diagonal line $X_1=X_2$, not the
original $X_1$ axis. A 45-degree rotation turns that diagonal into the canonical signal axis.
Nothing new is happening probabilistically. We are only changing coordinates. The observed point $X$ stays fixed in the plane, but the axes are
rotated so that one new axis points along the mean direction and the other new axis points along pure residual variation.
Use $$Q=\frac{1}{\sqrt2}\begin{bmatrix}1 & -1\\1 & 1\end{bmatrix},\qquad Z=Q^TX.$$ Since $X\sim N_2(\mu\mathbf1,\sigma^2I_2)$, the
affine transformation formula gives $$Z=Q^TX\sim N_2(Q^T\mu\mathbf1,\sigma^2Q^TQ).$$ Because $Q$ is orthogonal and $Q^T\mathbf1=(\sqrt2,0)^T$,
this becomes $$Z\sim N_2\!\left(\begin{bmatrix}\sqrt2\,\mu\\0\end{bmatrix},\sigma^2I_2\right).$$ Reading off the coordinates,
$$Z=\begin{bmatrix}(X_1+X_2)/\sqrt2\\(X_2-X_1)/\sqrt2\end{bmatrix}.$$
This is the $d_1=1$, $\sigma^2$ unknown row of the canonical table. In this $n=2$ example the residual block has only one coordinate, so
$$Z_r=Z_2,\qquad d_r=1,$$ and the table predicts $$T=\frac{Z_1}{\sqrt{\|Z_r\|^2/d_r}}=\frac{Z_1}{\sqrt{Z_2^2}}=\frac{Z_1}{|Z_2|}.$$
Equivalently, the rotated axes are the orthonormal basis vectors $$q_1=\frac{1}{\sqrt2}\begin{bmatrix}1\\1\end{bmatrix},\qquad
q_2=\frac{1}{\sqrt2}\begin{bmatrix}-1\\1\end{bmatrix}.$$ The new coordinates are projections: $$Z_1=q_1^TX,\qquad Z_2=q_2^TX.$$ So $Z_1$ is "how
far $X$ points along the 45-degree signal line," and $Z_2$ is "how far $X$ points along the perpendicular residual line."
The factor $\sqrt2$ is just normalization. The raw signal direction $(1,1)$ has length $\sqrt2$, so the unit signal vector is $(1,1)/\sqrt2$.
Projection coordinates are taken against unit vectors, which is why the signal projection is $Z_1=(X_1+X_2)/\sqrt2$ rather than $X_1+X_2$.
The rotated signal coordinate is $$Z_1=\frac{X_1+X_2}{\sqrt2}=\sqrt2\,\bar X,$$ and the residual coordinate is $$Z_2=\frac{X_2-X_1}{\sqrt2}.$$
So the one-sample t-test with $n=2$ is exactly the $n=2$ canonical model in disguised coordinates.
The transformation $Z=Q^TX$ rotates the coordinate system so the diagonal mean direction becomes the $Z_1$ signal axis. The observed point is
the same point in the plane; only the axes change. In the new axes, $Z_1=(X_1+X_2)/\sqrt2$ is the signal projection and $Z_2=(X_2-X_1)/\sqrt2$
is the residual projection. This is the rotated version of the previous canonical picture.
In the picture, this is the relationship $$\left|\frac{\sqrt2\,\bar X}{S}\right|=|\cot(\theta)|,$$ where $\theta$ is the angle from
the 45-degree signal line. The point $X$ is compared to the diagonal signal line in the original coordinates; after rotation, that same
comparison becomes the ratio $Z_1/|Z_2|$ in canonical coordinates. The signed statistic keeps the sign of the signal projection $Z_1$.
Show the algebra: the usual t-statistic equals the projection ratio
The usual one-sample statistic for testing $H_0:\mu=0$ is $$T_{\text{classical}}=\frac{\bar X}{S/\sqrt n}.$$ For $n=2$, this is
$$T_{\text{classical}}=\frac{\sqrt2\,\bar X}{S}.$$
The sample standard deviation still comes from subtracting the sample mean. For $n=2$, $$\bar X=\frac{X_1+X_2}{2},$$ so $$X_1-\bar
X=\frac{X_1-X_2}{2},\qquad X_2-\bar X=\frac{X_2-X_1}{2}.$$ Since the divisor is $n-1=1$, $$S^2=\frac{(X_1-\bar X)^2+(X_2-\bar X)^2}{n-1}
=\frac{(X_1-X_2)^2}{4}+\frac{(X_2-X_1)^2}{4} =\frac{(X_1-X_2)^2}{2},$$ so $$S=\frac{|X_1-X_2|}{\sqrt2}.$$
Since $Z_1=(X_1+X_2)/\sqrt2$ and $|Z_2|=|X_2-X_1|/\sqrt2$,
$$T_{\text{classical}}=\frac{(X_1+X_2)/\sqrt2}{|X_1-X_2|/\sqrt2}=\frac{Z_1}{|Z_2|}.$$
So the mean subtraction did not disappear. With two points, the sample mean is the midpoint, and the two centered deviations are just opposite
halves of the gap between the observations. That is why the residual scale can be written using $|X_1-X_2|$.
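The identity can be confirmed on any pair of numbers. A minimal sketch, using two made-up observations and Python's standard `statistics` module for the classical side:

```python
import math
import statistics

x1, x2 = 3.0, 1.0  # made-up observations

# Rotated coordinates Z = Q^T X for the 45-degree rotation.
z1 = (x1 + x2) / math.sqrt(2)   # signal projection
z2 = (x2 - x1) / math.sqrt(2)   # residual projection

# Classical one-sample t statistic for H0: mu = 0 with n = 2.
xbar = statistics.mean([x1, x2])
s = statistics.stdev([x1, x2])  # sample SD, divisor n - 1 = 1
t_classical = xbar / (s / math.sqrt(2))

# Projection form of the same statistic.
t_projection = z1 / abs(z2)

print(t_classical, t_projection)
```

The two printed values agree up to floating-point error, and swapping in other values of `x1` and `x2` (with `x1 != x2`) gives the same agreement.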
4.4 General $n$ one-sample t-test
Let $$X_1,\ldots,X_n\overset{iid}{\sim}N(\mu,\sigma^2),\qquad X\sim N_n(\mu\mathbf1,\sigma^2I_n).$$ The signal subspace is the line spanned by
$\mathbf1=(1,\ldots,1)^T$.
This is the same move as the $n=2$ rotation, except the residual part is no longer one perpendicular line. The signal line still has dimension
$d_1=1$, but the residual space $\mathbf1^\perp$ has dimension $d_r=n-1$.
Let $$q_1=\frac{\mathbf1}{\sqrt n}.$$ This is a unit vector because $\|\mathbf1\|=\sqrt n$. Extend $q_1$ to an orthonormal basis $$Q=[q_1\mid
Q_r],$$ where the columns of $Q_r$ span the residual space $\mathbf1^\perp$. Define the rotated coordinates
$$\begin{bmatrix}Z_1\\Z_r\end{bmatrix}=Q^TX,\qquad Z_1=q_1^TX=\sqrt n\,\bar X,\qquad Z_r=Q_r^TX.$$
This is the lecture's affine-transformation step. Since $X\sim N_n(\mu\mathbf1,\sigma^2I_n)$, $$Q^TX\sim N_n(Q^T\mu\mathbf1,\sigma^2Q^TQ).$$ The
covariance simplifies because $Q$ is orthogonal: $$\sigma^2Q^TQ=\sigma^2I_n.$$ The mean simplifies because
$$Q^T\mathbf1=\begin{bmatrix}q_1^T\mathbf1\\Q_r^T\mathbf1\end{bmatrix} =\begin{bmatrix}\sqrt n\\\mathbf{0}_{n-1}\end{bmatrix}.$$ The top entry
is $\sqrt n$ because $q_1=\mathbf1/\sqrt n$, and the residual entries are 0 because the columns of $Q_r$ are perpendicular to $\mathbf1$.
Therefore $$\begin{bmatrix}Z_1\\Z_r\end{bmatrix}=Q^TX\sim N_n\!\left(\begin{bmatrix}\sqrt
n\,\mu\\\mathbf{0}_{n-1}\end{bmatrix},\sigma^2I_n\right).$$
Now use the rotated coordinates to recognize the usual sample variance. Orthogonal decomposition gives $$\|X\|^2=Z_1^2+\|Z_r\|^2.$$ Since
$Z_1=\sqrt n\,\bar X$, $$\|Z_r\|^2=\sum_{i=1}^nX_i^2-n\bar X^2=\sum_{i=1}^n(X_i-\bar X)^2=(n-1)S^2.$$
The affine-transformation step explains why $Z_r$ is a pure-noise block with mean 0. The sum-of-squares identity explains why the squared length
of that block is the familiar centered sum of squares. Together they turn the canonical denominator $\|Z_r\|^2$ into the classical denominator
$(n-1)S^2$.
The number $n-1$ is the residual dimension: after fitting one mean direction, only $n-1$ independent noise directions remain.
Concrete bridge: more than one residual direction
For $n=3$, the signal direction is still one-dimensional: $$q_1=\frac{1}{\sqrt3}(1,1,1).$$ One possible pair of residual directions is
$$q_2=\frac{1}{\sqrt2}(1,-1,0),\qquad q_3=\frac{1}{\sqrt6}(1,1,-2).$$
Then the residual block is a vector, not a single number: $$Z_r=\begin{bmatrix}Z_2\\Z_3\end{bmatrix},\qquad d_r=2.$$ The unknown-variance t
denominator is $$\sqrt{\frac{\|Z_r\|^2}{d_r}}=\sqrt{\frac{Z_2^2+Z_3^2}{2}},$$ which is the projection version of the sample standard deviation
$S$.
Under $H_0:\mu=0$, $$\frac{Z_1}{\sigma}=\frac{\sqrt n\,\bar X}{\sigma}\sim N(0,1),$$ and
$$\frac{\|Z_r\|^2}{\sigma^2}=\frac{(n-1)S^2}{\sigma^2}\sim \chi^2_{n-1},$$ independently. Therefore
$$T=\frac{Z_1}{\sqrt{\|Z_r\|^2/(n-1)}}=\frac{\sqrt n\,\bar X}{S}\sim t_{n-1}.$$
The classical facts $\bar X\perp S^2$ and $(n-1)S^2/\sigma^2\sim\chi^2_{n-1}$ are not isolated miracles. They come from independence of
orthogonal Gaussian projections: $\bar X$ lives in the signal line, and $S^2$ lives in the residual hyperplane.
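The $n=3$ basis from the concrete bridge above can be used to check both identities numerically. This is a sketch on made-up data; `dot` is a hypothetical helper, and the basis vectors are the $q_1,q_2,q_3$ given earlier.

```python
import math
import statistics

x = [1.0, 2.0, 6.0]  # made-up data, n = 3
n = len(x)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Signal direction and one choice of residual directions.
q1 = [1 / math.sqrt(3)] * 3
q2 = [1 / math.sqrt(2), -1 / math.sqrt(2), 0.0]
q3 = [1 / math.sqrt(6), 1 / math.sqrt(6), -2 / math.sqrt(6)]

z1 = dot(q1, x)                              # equals sqrt(n) * xbar
zr_sq = dot(q2, x) ** 2 + dot(q3, x) ** 2    # squared residual length

s2 = statistics.variance(x)                  # sample variance, divisor n - 1
t_projection = z1 / math.sqrt(zr_sq / (n - 1))
t_classical = math.sqrt(n) * statistics.mean(x) / math.sqrt(s2)

print(zr_sq, (n - 1) * s2, t_projection, t_classical)
```

The residual squared length matches $(n-1)S^2$, and the projection ratio matches the classical statistic, even though $q_2$ and $q_3$ were only one of many possible residual bases.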
Recall Checkpoint: Tests and Intervals
Section 2.3 already gave the four-test table. This short section is here as the recall checkpoint: when you are solving a problem, you should be
able to rebuild the table from the two questions without memorizing it row by row.
Recall checkpoint:
known $\sigma^2$ + one signal coordinate gives z; known $\sigma^2$ + many signal coordinates gives $\chi^2$; unknown $\sigma^2$ + one signal
coordinate gives t; unknown $\sigma^2$ + many signal coordinates gives F.
For the unknown-variance tests, we still need $d_r>0$. No residual degrees of freedom means no independent residual estimate of $\sigma^2$.
5.1 Confidence intervals by inversion
Lecture 14's test-confidence interval duality returns here. For the common one-dimensional unknown-variance case, testing $H_0:\mu_1=\mu_1^0$
uses $$\frac{Z_1-\mu_1^0}{\hat\sigma}\sim t_{d_r}.$$ The non-rejected values form the interval $$\mu_1\in Z_1\pm
\hat\sigma\,t_{d_r,1-\alpha/2}.$$
Known variance gives the corresponding normal interval: $$\mu_1\in Z_1\pm \sigma z_{1-\alpha/2}.$$ Unknown variance replaces $\sigma$ by
$\hat\sigma$ and replaces the normal cutoff by the t cutoff.
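The inversion can be sketched in a few lines. Here `t_interval` is a hypothetical helper, the data are made up, and the critical value is passed in by the caller rather than computed; `4.303` is the standard table value for $t_{2,0.975}$ (rounded).

```python
import math

def t_interval(z1, z_r, t_crit):
    """Interval z1 +/- sigma_hat * t_crit, with sigma_hat from the residual block."""
    d_r = len(z_r)
    sigma_hat = math.sqrt(sum(z * z for z in z_r) / d_r)
    half_width = sigma_hat * t_crit
    return z1 - half_width, z1 + half_width

def not_rejected(mu0, z1, z_r, t_crit):
    """Two-sided t-test of H0: mu_1 = mu0 at the level matching t_crit."""
    d_r = len(z_r)
    sigma_hat = math.sqrt(sum(z * z for z in z_r) / d_r)
    return abs(z1 - mu0) / sigma_hat <= t_crit

# Made-up canonical coordinates: one signal coordinate, two residual coordinates.
lo, hi = t_interval(z1=5.0, z_r=[1.0, -1.0], t_crit=4.303)
print(lo, hi)
```

A value `mu0` lies in the interval exactly when `not_rejected(mu0, ...)` is true, which is the test-interval duality in code form.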
Applications: Identify the Blocks, Then Test
For each familiar test, resist the urge to start from the final statistic. Instead, identify the null subspace, the full model subspace, the
tested signal directions, and the residual directions. The statistic then drops out from the canonical table.
6.1 Equal-variance two-sample t-test
Group 1 has $X_1,\ldots,X_m\overset{iid}{\sim}N(\mu,\sigma^2)$ and group 2 has $Y_1,\ldots,Y_n\overset{iid}{\sim}N(\nu,\sigma^2)$. Let $N=m+n$
and stack the observations into $$W=(X_1,\ldots,X_m,Y_1,\ldots,Y_n)^T.$$ Test $H_0:\mu=\nu$.
The full model space is $$\mathcal H=\Span(a,b),$$ where $a$ is 1 on group 1 and 0 on group 2, while $b$ is 0 on group 1 and 1 on group 2. The
null space is $$\mathcal H_0=\Span(\mathbf1_N).$$ Therefore $$d_0=1,\qquad d_1=1,\qquad d_r=N-2.$$
Identify the blocks:
$\mathcal H_0$ is the nuisance direction where both groups share one mean, $\mathcal H\cap\mathcal H_0^\perp$ is the one-dimensional contrast
direction, and $\mathcal H^\perp$ is within-group residual noise. Since $d_1=1$ and $\sigma^2$ is unknown, choose the t row of the canonical
table.
Contrast direction
The signal direction is the contrast between group means. Define
$$c=\left(\underbrace{\frac1m,\ldots,\frac1m}_{m},\underbrace{-\frac1n,\ldots,-\frac1n}_{n}\right)^T.$$ Then $$c^TW=\bar X-\bar Y,\qquad
\|c\|^2=\frac1m+\frac1n.$$ The unit signal vector is $$q_1=\frac{c}{\sqrt{1/m+1/n}}.$$
The signal coordinate is $$Z_1=q_1^TW=\frac{\bar X-\bar Y}{\sqrt{1/m+1/n}}.$$ Under $H_0$, $Z_1\sim N(0,\sigma^2)$.
Residual direction and pooled variance
The residual projection has squared length $$\|Z_r\|^2=\sum_{i=1}^m(X_i-\bar X)^2+\sum_{j=1}^n(Y_j-\bar Y)^2.$$ Hence
$$\|Z_r\|^2=(m-1)S_X^2+(n-1)S_Y^2,$$ with $d_r=N-2$.
The pooled variance estimate is $$S_p^2=\frac{(m-1)S_X^2+(n-1)S_Y^2}{N-2}.$$
The canonical $d_1=1$, unknown-variance test becomes the classical pooled two-sample t-test: $$T=\frac{Z_1}{S_p}=\frac{\bar X-\bar
Y}{S_p\sqrt{1/m+1/n}}\sim t_{N-2}\quad\text{under }H_0.$$
The corresponding confidence interval for $\mu-\nu$ is $$(\bar X-\bar Y)\pm t_{N-2,1-\alpha/2}\,S_p\sqrt{\frac1m+\frac1n}.$$
This is the equal-variance two-sample t-test. The pooled variance estimate is justified by the common variance assumption. If the group
variances are not plausibly equal, the Welch test from earlier lectures is the safer default.
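The block identification translates directly into a computation. The following is a minimal sketch on made-up data, with `pooled_t` as a hypothetical helper name; it builds the statistic from the canonical pieces $Z_1$ and $\|Z_r\|^2$ rather than from the textbook formula.

```python
import math

def pooled_t(x, y):
    """Pooled two-sample t statistic and its residual degrees of freedom."""
    m, n = len(x), len(y)
    xbar, ybar = sum(x) / m, sum(y) / n
    # Signal coordinate: the mean contrast projected on the unit vector q_1.
    z1 = (xbar - ybar) / math.sqrt(1 / m + 1 / n)
    # Residual squared length: within-group sums of squares.
    zr_sq = sum((v - xbar) ** 2 for v in x) + sum((v - ybar) ** 2 for v in y)
    d_r = m + n - 2
    s_p = math.sqrt(zr_sq / d_r)  # pooled standard deviation
    return z1 / s_p, d_r

t_stat, df = pooled_t([5.0, 7.0, 9.0], [2.0, 4.0])
print(t_stat, df)
```

For these numbers the group means are 7 and 3, the within-group sum of squares is 10 on 3 degrees of freedom, and the statistic works out to about 2.4.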
6.2 Where one-way ANOVA fits
One-way ANOVA compares $G$ group means under a common normal variance assumption. If group $g$ has observations $$Y_{g,1},\ldots,Y_{g,n_g}
\overset{iid}{\sim}N(\mu_g,\sigma^2),\qquad g=1,\ldots,G,$$ then the null and alternative are $$H_0:\mu_1=\mu_2=\cdots=\mu_G
\qquad\text{vs}\qquad H_1:\text{not all }\mu_g\text{ are equal}.$$ Stack all observations into $$Y=(Y_{1,1},\ldots,Y_{1,n_1},Y_{2,1},\ldots,
Y_{G,n_G})^T\sim N_N(\theta,\sigma^2I_N),\qquad N=\sum_{g=1}^G n_g.$$
The usual one-way ANOVA model is built on three assumptions: observations are independent within and across groups, the conditional
distributions are approximately normal, and all groups share the same variance $\sigma^2$. A practical first check is to compare the group
spreads with box plots or residual plots; if one group is much more variable than the others, the pooled-variance F-test can be misleading.
For the one-way ANOVA null, the mean vector can only lie on the grand-mean line: $$\mathcal H_0=\Span(\mathbf1_N),\qquad d_0=\dim(\mathcal
H_0)=1.$$ The full model allows one constant level within each group. If $e_g$ is the group-$g$ indicator vector, for example
$e_1=(1,\ldots,1,0,\ldots,0)^T$, then $$\mathcal H=\Span(e_1,\ldots,e_G),\qquad \dim(\mathcal H)=G.$$ Therefore $$H_0:\theta\in\mathcal
H_0,\qquad H_1:\theta\in\mathcal H\setminus\mathcal H_0,$$ and $$d_1=\dim(\mathcal H)-\dim(\mathcal H_0)=G-1,\qquad d_r=N-G.$$
Identify the blocks:
the grand-mean line is nuisance, the $G-1$ independent group contrasts are signal, and the within-group deviations are residual noise. Since the
signal is multidimensional and $\sigma^2$ is unknown, choose the F row.
A useful correction to keep in mind: ANOVA is absolutely doing projections. The projection step creates the observed squared lengths
$\text{SSB}$ and $\text{SSW}$. The F distribution is the reference distribution that tells us whether the ratio of those lengths is unusually
large under $H_0$.
Projection view from the discussion section
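The null and full models are nested: $\mathcal H_0\subseteq\mathcal H$. Projecting $Y$ onto $\mathcal H_0$ fits one grand mean, while projecting
$Y$ onto $\mathcal H$ fits one mean for each group: $$P_{\mathcal H_0}Y=\bar Y_{\cdot\cdot}\,\mathbf1_N,\qquad P_{\mathcal H}Y=\sum_{g=1}^G\bar Y_{g\cdot}\,e_g.$$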
The key projection identity is $$\|Y-P_{\mathcal H_0}Y\|^2=\|Y-P_{\mathcal H}Y\|^2+\|P_{\mathcal H}Y-P_{\mathcal H_0}Y\|^2.$$ In words:
$$\text{variation not explained by the null}=\text{variation not explained by the full model}+\text{variation explained by full but not null}.$$
The reason this is a clean sum of squares is orthogonality. The vector $Y-P_{\mathcal H}Y$ lives in $\mathcal H^\perp$ as residual noise, while
$P_{\mathcal H}Y-P_{\mathcal H_0}Y$ lives in the tested group-contrast space $\mathcal H\cap\mathcal H_0^\perp$. These directions are
perpendicular, so their squared lengths add.
Discussion view: what are the two RSS values doing?
The cleanest way to remember ANOVA is to ask what each model is allowed to explain. The null model is only allowed to fit one grand mean. The
full model is allowed to fit one mean per group. So $\RSS_0-\RSS$ is not a mysterious extra formula; it is the amount of error saved when the
model is allowed to move from one shared mean to separate group means.
| Quantity | Model being fit | Plain-language meaning | Canonical block |
| --- | --- | --- | --- |
| $\RSS_0$ | Null model: one grand mean | Everything left unexplained if all groups are forced to have the same mean. | $\lVert Z_1\rVert^2+\lVert Z_r\rVert^2$ |
| $\RSS$ | Full model: one mean per group | Only the leftover within-group scatter after fitting each group mean. | $\lVert Z_r\rVert^2$ |
| $\RSS_0-\RSS$ | Improvement from null to full | The between-group signal: how much fit improves by allowing group means to differ. | $\lVert Z_1\rVert^2$ |
In one-way ANOVA, the null residual sum of squares decomposes orthogonally into between-group signal and within-group residual noise.
The algebra behind total = between + within
For every observation, split its deviation from the grand mean into two pieces: $$Y_{gi}-\bar Y_{\cdot\cdot}=(Y_{gi}-\bar Y_{g\cdot})+(\bar Y_{g\cdot}-\bar Y_{\cdot\cdot}).$$
Squaring and summing gives a cross-term, but it vanishes within each group: $$\sum_{i=1}^{n_g}(Y_{gi}-\bar Y_{g\cdot})=0.$$ Therefore
$$\sum_{g=1}^G\sum_{i=1}^{n_g}(Y_{gi}-\bar Y_{\cdot\cdot})^2 = \sum_{g=1}^G n_g(\bar Y_{g\cdot}-\bar Y_{\cdot\cdot})^2+
\sum_{g=1}^G\sum_{i=1}^{n_g}(Y_{gi}-\bar Y_{g\cdot})^2.$$ In ANOVA language, $$\text{SST}=\text{SSB}+\text{SSW}.$$ In nested-model language,
$$\RSS_0=(\RSS_0-\RSS)+\RSS.$$
The factor $n_g$ in the between-group term matters. A group mean far from the grand mean is stronger evidence when it comes from many
observations, because the fitted group-mean vector has that same shift repeated $n_g$ times.
The null model fits one shared grand mean: $$\RSS_0=\sum_{g=1}^G\sum_{i=1}^{n_g}(Y_{gi}-\bar Y_{\cdot\cdot})^2.$$ This leftover error contains
both the group-mean mismatch and the within-group noise. The full model fits a separate mean for each group:
$$\RSS=\sum_{g=1}^G\sum_{i=1}^{n_g}(Y_{gi}-\bar Y_{g\cdot})^2.$$ This is only within-group noise. Therefore $$\RSS_0-\RSS=\sum_{g=1}^G n_g(\bar
Y_{g\cdot}-\bar Y_{\cdot\cdot})^2,$$ which is the between-group signal.
This is exactly the canonical decomposition: $$\RSS_0=\|Z_1\|^2+\|Z_r\|^2,\qquad \RSS=\|Z_r\|^2,\qquad \RSS_0-\RSS=\|Z_1\|^2.$$ So ANOVA
compares the between-group signal length to the within-group residual-noise length.
Plain-language F ratio:
$$F= \frac{\text{between-group variation per signal dimension}} {\text{within-group variation per residual dimension}}.$$ Under $H_0$, both
pieces estimate the same noise variance $\sigma^2$, so the ratio should be near 1. A large value means the group means are farther apart than
the within-group noise would suggest.
Under $H_0$, $$\frac{\text{SSB}}{\sigma^2}\sim\chi^2_{G-1},\qquad \frac{\text{SSW}}{\sigma^2}\sim\chi^2_{N-G},$$ and the two pieces are
independent because they are squared lengths of orthogonal Gaussian projections. That is why $$F=\frac{\text{MSB}}{\text{MSW}}\sim
F_{G-1,N-G}.$$
If $\RSS_0$ is the total sum of squares around the grand mean and $\RSS$ is the within-group residual sum of squares, then
$$F=\frac{(\RSS_0-\RSS)/(G-1)}{\RSS/(N-G)}=\frac{\text{SSB}/(G-1)}{\text{SSW}/(N-G)}\sim F_{G-1,N-G}\quad\text{under }H_0.$$ This is one-way
ANOVA written in the same nested-subspace language as regression.
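The decomposition and the F ratio can be computed by hand in a few lines. This is a sketch on made-up data with three groups; `one_way_anova` is a hypothetical helper name, and it returns the sums of squares and the statistic only, not a p-value.

```python
def one_way_anova(groups):
    """Return (SSB, SSW, SST, F) for a list of groups of observations."""
    n_total = sum(len(g) for g in groups)
    n_groups = len(groups)
    grand = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Between-group signal: each group-mean shift is repeated n_g times.
    ssb = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    # Within-group residual noise around each group mean.
    ssw = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    # Total variation around the grand mean (RSS_0).
    sst = sum((y - grand) ** 2 for g in groups for y in g)

    f_stat = (ssb / (n_groups - 1)) / (ssw / (n_total - n_groups))
    return ssb, ssw, sst, f_stat

ssb, ssw, sst, f_stat = one_way_anova(
    [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]  # made-up groups
)
print(ssb, ssw, sst, f_stat)
```

For these groups the group means are 2, 3, and 6, giving SSB of about 26, SSW of about 6, and SST of about 32, so the statistic is about 13 on $(2,6)$ degrees of freedom. Turning that into a p-value needs the right tail of the F distribution, which the standard library does not provide.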
Practice bridge: why manual ANOVA matches software
A practical ANOVA calculation is just a computational version of the projection picture. A software routine such as `stats.f_oneway` is not
using a different idea: it computes the same between-group and within-group sums of squares, forms the same F ratio, and then reads a right-tail
probability from the same $F_{G-1,N-G}$ reference distribution.
The manual route is: compute total variation around the grand mean, compute within-group variation around each group mean, take the difference
as between-group variation, then form $$F_{\text{obs}}=\frac{\text{SSB}/(G-1)}{\text{SSW}/(N-G)}.$$ The software route and the manual route
agree because they are computing this same statistic and using the same right tail: $$p=P(F_{G-1,N-G}\geq F_{\text{obs}}).$$
Conceptually, the p-value is asking: if all groups truly had the same mean, how often would random within-group noise create a between-group
projection this large compared with the within-group projection? A small p-value says this particular signal length is too large to comfortably
explain as noise.
For example, a p-value around $0.009$ would reject the equal-means null at the 5% level. The important study takeaway is not the arithmetic
itself, but the interpretation: the observed between-group projection is large relative to the within-group projection. ANOVA says there is
evidence that not all group means are equal, but it does not identify which group comparisons are responsible.
The ANOVA F-test is an omnibus test of means. It can say that the vector of group-mean contrasts is unusually long, but it does not by itself say
which group differs from which. It also tests means under the normal, common-variance model; it is not a test that the group
distributions are identical. When the boxplots show strong skewness or unequal spreads, a rank-based method such as Kruskal-Wallis is often a
more robust alternative.
The two-sample equal-variance t-test is the $G=2$ special case of one-way ANOVA. There is only one group-contrast direction, so $d_1=G-1=1$ and
the ANOVA F statistic equals the square of the pooled two-sample t statistic: $$F=T^2.$$
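The identity $F=T^2$ is easy to check numerically. The sketch below assumes NumPy and SciPy and uses two invented samples; it computes the pooled t statistic by hand, then compares it with `scipy.stats.ttest_ind` (with `equal_var=True`) and with `scipy.stats.f_oneway`.

```python
import numpy as np
from scipy import stats

# Hypothetical two-group data (values are invented for illustration).
x = np.array([4.1, 5.0, 5.6, 4.8, 5.3])
y = np.array([6.2, 5.9, 6.8, 7.1])
m, n = x.size, y.size

# Pooled variance: residual squared length divided by d_r = m + n - 2.
sp2 = ((m - 1) * x.var(ddof=1) + (n - 1) * y.var(ddof=1)) / (m + n - 2)
t_manual = (x.mean() - y.mean()) / np.sqrt(sp2 * (1 / m + 1 / n))

# Library versions of the same two tests.
t_scipy, p_t = stats.ttest_ind(x, y, equal_var=True)
f_stat, p_f = stats.f_oneway(x, y)

print("t^2:", t_manual ** 2, " F:", f_stat)
print("two-sided t p-value:", p_t, " F p-value:", p_f)
```

The two-sided t p-value and the right-tail F p-value also coincide, since $|T|\geq c$ and $T^2\geq c^2$ are the same event.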
6.3 Linear regression as projection
In regression, $$Y_i=x_i^T\beta+\epsilon_i,\qquad \epsilon_i\overset{iid}{\sim}N(0,\sigma^2),$$ or in matrix form $$Y\sim
N_n(X\beta,\sigma^2I_n).$$ Assume the design matrix $X$ has full column rank $d$.
The model space is $$\mathcal H=\Col(X),\qquad \dim(\mathcal H)=d.$$ The fitted values are the orthogonal projection of $Y$ onto this column
space: $$\hat Y=X\hat\beta=P_{\mathcal H}Y,\qquad P_{\mathcal H}=X(X^TX)^{-1}X^T.$$
Regression is not a separate universe from ANOVA. The fitted values are the projection onto the model subspace, and the residuals are the
projection onto the orthogonal complement. Tests ask whether adding selected directions to the model subspace produces a signal length that is
large relative to residual noise.
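As a quick sanity check of the projection formula, the following sketch builds $P_{\mathcal H}=X(X^TX)^{-1}X^T$ for a small simulated design and confirms the properties a projection should have. It assumes NumPy; the design, coefficients, and seed are arbitrary choices for illustration. Forming the explicit inverse is fine for a toy example but is not how one would fit large models in practice.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])          # intercept + one predictor, d = 2
Y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)

# Projection ("hat") matrix onto Col(X).
P = X @ np.linalg.inv(X.T @ X) @ X.T

# Least-squares fit computed independently of P.
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

fitted = P @ Y            # projection of Y onto the model space
resid = Y - fitted        # projection of Y onto the orthogonal complement

print("P idempotent:      ", np.allclose(P @ P, P))
print("P symmetric:       ", np.allclose(P, P.T))
print("trace(P) = d:      ", np.isclose(np.trace(P), X.shape[1]))
print("P Y = X beta_hat:  ", np.allclose(fitted, X @ beta_hat))
print("residual orthogonal to Col(X):", np.allclose(X.T @ resid, 0.0))
```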
Practice bridge: least squares is the canonical residual
A common practice exercise is to verify that the canonical residual length $\|Z_r\|^2$ is the same object as the ordinary least-squares RSS.
This is the key bridge: least squares is not an extra procedure bolted onto the canonical model. It is exactly the projection of $Y$ onto the
model space.
| Model feature | Canonical interpretation | Least-squares meaning |
| --- | --- | --- |
| No intercept | The model space is the line generated by the predictor direction. | The fitted values are the closest point on that line; RSS is the leftover squared distance. |
| With intercept | The constant vector is a nuisance direction; the predictor signal is what remains after accounting for that constant direction. | The fitted values are the closest point in the intercept-plus-predictor plane; RSS is the squared distance left outside the plane. |
This is why subtracting the mean shows up in regression with an intercept. It is Gram-Schmidt in disguise: remove the nuisance direction
$\mathbf1$ from $x$, then test the remaining predictor direction. The slope t-test is just a signed version of "how much does $Y$ point in that
cleaned-up signal direction?"
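The Gram-Schmidt reading of mean-centering can be checked directly. In this sketch (NumPy assumed; simulated data with arbitrary parameters), the predictor is orthogonalized against $\mathbf1$ by hand, and the slope along the cleaned-up direction is compared with the slope from a least-squares fit that includes an intercept.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 12
x = rng.normal(loc=3.0, size=n)               # predictor with a nonzero mean
Y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)
ones = np.ones(n)

# Gram-Schmidt step: subtract from x its projection onto the constant vector.
x_perp = x - (x @ ones) / (ones @ ones) * ones   # identical to x - x.mean()

# Slope along the cleaned-up signal direction.
slope_gs = (x_perp @ Y) / (x_perp @ x_perp)

# Slope from the usual intercept-plus-predictor least-squares fit.
X = np.column_stack([ones, x])
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

# For contrast: the no-intercept slope projects Y onto x itself.
slope_no_intercept = (x @ Y) / (x @ x)

print("Gram-Schmidt slope: ", slope_gs)
print("with-intercept slope:", beta_hat[1])
print("no-intercept slope:  ", slope_no_intercept)
```

The first two numbers agree; the third generally differs because the no-intercept model has no nuisance direction to remove.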
The residual vector is $$r=Y-\hat Y=(I-P_{\mathcal H})Y=P_{\mathcal H^\perp}Y,$$ and $$\RSS=\|r\|^2=\|P_{\mathcal H^\perp}Y\|^2.$$ Since
$\dim(\mathcal H^\perp)=n-d$, $$\frac{\RSS}{\sigma^2}\sim \chi^2_{n-d},\qquad \hat\sigma^2=\frac{\RSS}{n-d}.$$
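One way to see that $\RSS$ is the canonical residual length is to build the rotation explicitly. The sketch below assumes NumPy and uses a complete QR factorization of $X$ as one convenient choice of orthogonal $Q$ (any orthonormal basis whose first $d$ vectors span $\Col(X)$ would do); the simulated design and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 10, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, d - 1))])
Y = X @ np.array([1.0, -0.5, 2.0]) + rng.normal(size=n)

# Complete QR: Q is n x n orthogonal; its first d columns span Col(X),
# and the remaining n - d columns span the orthogonal complement.
Q, R = np.linalg.qr(X, mode="complete")

Z = Q.T @ Y               # canonical coordinates
Z_model, Z_r = Z[:d], Z[d:]

# Ordinary least-squares RSS, computed the usual way.
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
resid = Y - X @ beta_hat
rss = resid @ resid

sigma2_hat = rss / (n - d)   # variance estimate on n - d residual df

print("||Z_r||^2:", Z_r @ Z_r)
print("OLS RSS:  ", rss)
print("rotation preserves length:", np.isclose(Y @ Y, Z @ Z))
print("sigma^2 estimate:", sigma2_hat)
```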
6.4 Regression F-test for a subset of coefficients
Partition the design as $X=[X_0\mid X_1]$, where $X_0$ contains predictors kept under the null and $X_1$ contains predictors being tested. The
null hypothesis is $$H_0:\beta_1=0,$$ so $$\mathcal H_0=\Col(X_0),\qquad \mathcal H=\Col(X).$$
Identify the blocks:
$\Col(X_0)$ is nuisance, the extra part of $\Col(X)$ not already explained by $\Col(X_0)$ is signal, and $\Col(X)^\perp$ is residual noise.
Therefore $d_1$ is the number of added independent predictor directions and $d_r=n-d$.
Let $$d_0=\rank(X_0),\qquad d_1=d-d_0,\qquad d_r=n-d.$$ Let $\RSS_0$ be the residual sum of squares for the null model and $\RSS$ for the full
model. Since $\mathcal H_0\subseteq\mathcal H$, $$\RSS_0\geq \RSS.$$
The improvement $\RSS_0-\RSS$ is the squared length of the tested signal projection: it is how much residual error drops when we add the tested
predictors. The full-model RSS estimates the remaining noise.
Same canonical bridge: $$\RSS_0=\|Z_1\|^2+\|Z_r\|^2,\qquad \RSS=\|Z_r\|^2,\qquad \RSS_0-\RSS=\|Z_1\|^2.$$ The null model cannot explain the
tested predictor directions, so that signal is counted as leftover error in $\RSS_0$. The full model can explain those directions, so only
residual noise remains in $\RSS$.
Plain-language regression F ratio:
$$F= \frac{\text{variation explained by tested predictors per tested direction}} {\text{remaining residual variation per residual direction}}.$$
This is the same signal-over-noise comparison as ANOVA, just with predictor directions instead of group-mean directions.
The regression F-statistic is $$F=\frac{(\RSS_0-\RSS)/d_1}{\RSS/d_r}\sim F_{d_1,d_r}\quad\text{under }H_0.$$ Reject for large values. This asks
whether the tested predictors improve fit more than would be expected from noise alone.
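A minimal numerical sketch of the subset F-test follows, assuming NumPy and SciPy. The data are simulated under $H_0$ with arbitrary coefficients, and the helper `rss_of` is a hypothetical name introduced only for this example. The last check confirms that $\RSS_0-\RSS$ equals $\|Z_1\|^2$ when the rotation is taken from a QR factorization of $[X_0\mid X_1]$.

```python
import numpy as np
from scipy import stats

def rss_of(design, response):
    """Residual sum of squares from a least-squares fit (illustrative helper)."""
    beta, *_ = np.linalg.lstsq(design, response, rcond=None)
    r = response - design @ beta
    return float(r @ r)

rng = np.random.default_rng(2)
n = 40
X0 = np.column_stack([np.ones(n), rng.normal(size=n)])   # kept under H0
X1 = rng.normal(size=(n, 2))                             # predictors being tested
X = np.column_stack([X0, X1])
Y = X0 @ np.array([1.0, 0.5]) + rng.normal(size=n)       # simulated with beta_1 = 0

d0 = np.linalg.matrix_rank(X0)
d = np.linalg.matrix_rank(X)
d1, dr = d - d0, n - d

rss0, rss = rss_of(X0, Y), rss_of(X, Y)
F = ((rss0 - rss) / d1) / (rss / dr)
p = stats.f.sf(F, d1, dr)        # reject for large F, so use the right tail

# Canonical check: columns d0..d-1 of Q span the tested signal directions.
Q, _ = np.linalg.qr(X, mode="complete")
Z = Q.T @ Y
Z1, Zr = Z[d0:d], Z[d:]

print("F =", F, " p =", p)
print("RSS_0 - RSS:", rss0 - rss, " ||Z_1||^2:", Z1 @ Z1)
```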
The F-test is a whole-subspace test. When $d_1>1$, it does not choose one signed direction the way a t-test does; it squares and adds the
projections over all tested directions. That is why ANOVA can detect "some group mean pattern is present" without saying which contrast caused
it. The geometry gives the observed length, and the F distribution calibrates how surprising that length is under the null.
6.5 Individual coefficient t-test and confidence interval
For a single coefficient test $H_0:\beta_j=0$, the tested subspace has $d_1=1$. The canonical F-test is equivalent to a t-test:
$$T=\frac{\hat\beta_j}{\widehat{\text{SE}}(\hat\beta_j)}\sim t_{n-d}\quad\text{under }H_0.$$
This is the one-dimensional version of the regression subset test. The signal direction is the part of predictor $j$ that remains after the
nuisance predictors have been projected out, so the t statistic is a signed signal-over-noise ratio.
The estimated standard error is $$\widehat{\text{SE}}(\hat\beta_j)=\hat\sigma\sqrt{[(X^TX)^{-1}]_{jj}}.$$ The confidence interval is
$$\hat\beta_j\pm t_{n-d,1-\alpha/2}\,\widehat{\text{SE}}(\hat\beta_j).$$
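Geometrically, the signal direction for $\beta_j$ is $X_j^\perp$, the part of predictor column $X_j$ that remains after removing its projection onto all
other predictor columns. Its length sets the standard error, since $[(X^TX)^{-1}]_{jj}=1/\|X_j^\perp\|^2$, so the coefficient t-statistic
$T=(X_j^\perp)^TY/(\hat\sigma\,\|X_j^\perp\|)$ measures how much $Y$ points in that unique direction relative to residual noise.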
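The coefficient t-test and its interval can be computed from the formulas above in a few lines. This sketch assumes NumPy and SciPy, uses simulated data with arbitrary coefficients, and picks column `j = 2` purely for illustration. It also checks two facts: $T^2$ equals the F statistic for dropping column $j$, and $T$ can be computed from the part of column $j$ left after projecting out the other columns.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
Y = X @ np.array([1.0, 2.0, 0.0]) + rng.normal(size=n)
d = X.shape[1]
j = 2                                   # coefficient under test (arbitrary choice)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
resid = Y - X @ beta_hat
rss = resid @ resid
sigma_hat = np.sqrt(rss / (n - d))

se = sigma_hat * np.sqrt(XtX_inv[j, j])
T = beta_hat[j] / se
p_two_sided = 2 * stats.t.sf(abs(T), n - d)

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, n - d)
ci = (beta_hat[j] - t_crit * se, beta_hat[j] + t_crit * se)

# F statistic for the nested test that drops column j (d_1 = 1).
X0 = np.delete(X, j, axis=1)
b0, *_ = np.linalg.lstsq(X0, Y, rcond=None)
rss0 = (Y - X0 @ b0) @ (Y - X0 @ b0)
F = (rss0 - rss) / (rss / (n - d))

# Geometric version: residual of column j after regressing it on the others.
g, *_ = np.linalg.lstsq(X0, X[:, j], rcond=None)
x_perp = X[:, j] - X0 @ g
T_geom = (x_perp @ Y) / (sigma_hat * np.linalg.norm(x_perp))

print("T =", T, " p =", p_two_sided, " CI =", ci)
print("T^2 =", T ** 2, " F =", F, " T (geometric) =", T_geom)
```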
Recall Map
Checklist for any testing problem:
identify the nuisance subspace, identify the tested signal subspace, identify the residual subspace, decide whether $\sigma^2$ is known, and
decide whether the tested signal is one-dimensional or multidimensional. Those five answers determine the row of the table.
| Problem | Subspaces / dimensions | Statistic | Reference distribution |
| --- | --- | --- | --- |
| Known-variance 1D signal | $d_1=1$, $\sigma^2$ known | $Z_1/\sigma$ | $N(0,1)$ |
| Known-variance multidimensional signal | $d_1>1$, $\sigma^2$ known | $\lVert Z_1\rVert^2/\sigma^2$ | $\chi^2_{d_1}$ |
| Unknown-variance 1D signal | $d_1=1$, residual df $d_r$ | $Z_1/\sqrt{\lVert Z_r\rVert^2/d_r}$ | $t_{d_r}$ |
| Unknown-variance multidimensional signal | $d_1>1$, residual df $d_r$ | $(\lVert Z_1\rVert^2/d_1)/(\lVert Z_r\rVert^2/d_r)$ | $F_{d_1,d_r}$ |
| One-sample t-test | Signal $\Span(\mathbf1)$, residual $\mathbf1^\perp$ | $\sqrt n\,\bar X/S$ | $t_{n-1}$ |
| Two-sample pooled t-test | $d_0=1$, $d_1=1$, $d_r=m+n-2$ | $(\bar X-\bar Y)/(S_p\sqrt{1/m+1/n})$ | $t_{m+n-2}$ |
| Regression subset test | $d_1$ tested coefficients, $d_r=n-d$ | $((\RSS_0-\RSS)/d_1)/(\RSS/d_r)$ | $F_{d_1,d_r}$ |
| One-way ANOVA | $d_0=1$, $d_1=G-1$, $d_r=N-G$ | $((\RSS_0-\RSS)/(G-1))/(\RSS/(N-G))$ | $F_{G-1,N-G}$ |
| Regression single coefficient | $d_1=1$, $d_r=n-d$ | $\hat\beta_j/\widehat{\text{SE}}(\hat\beta_j)$ | $t_{n-d}$ |
Unifying sentence:
Every statistic above compares a tested projection to either a known variance scale or an independent residual variance estimate.
Formula Sheet
Distribution facts
| Object | Formula | Use |
| --- | --- | --- |
| Chi-squared | $\sum_{i=1}^dZ_i^2\sim\chi^2_d$ for $Z_i\overset{iid}{\sim}N(0,1)$ | Squared length of a $d$-dimensional standardized Gaussian noise vector |
| t | $Z/\sqrt{V/d}\sim t_d$ | 1D Gaussian signal over independent variance estimate |
| F | $(V_1/d_1)/(V_2/d_2)\sim F_{d_1,d_2}$ | Signal sum of squares per signal df over residual sum of squares per residual df |
Common mistakes
1. Forgetting that nuisance directions are not residual directions.
Nuisance means unknown/free mean under both hypotheses. Residual means known mean 0 under both hypotheses, so it can estimate $\sigma^2$.
2. Using the pooled two-sample t-test without the common-variance assumption.
The pooled test is the canonical equal-variance normal model. If variances differ, the geometry no longer gives this exact t distribution.
3. Confusing $d_1$ and $d_r$ in F-tests.
$d_1$ is the number of tested directions. $d_r$ is the residual degrees of freedom used to estimate $\sigma^2$.
4. Thinking the rotation matrix itself is the point.
The basis is a coordinate choice. The test depends on subspaces and projection lengths, not on the particular orthonormal basis chosen inside a
subspace.
5. Forgetting why $\RSS_0-\RSS$ appears in regression.
It is the improvement in fit from adding the tested predictors, which equals the tested signal sum of squares.
6. Treating individual regression coefficients as marginal effects without considering other predictors.
The coefficient t-test is about the unique direction in $X_j$ left after projecting out the other columns of $X$.
7. Thinking an orthogonal transformation leaves the mean unchanged.
It preserves the spherical covariance shape, but it rotates the mean vector too. That is exactly why the one-sample t-test can be put into
canonical form.
8. Thinking the alternative must change the residual coordinate.
In the canonical model, the alternative shifts the signal coordinate. The residual coordinate keeps mean 0 and supplies the variance estimate.
9. Thinking the $d$ in $\chi^2_d$ or $V/d$ is arbitrary.
It is the dimension of the relevant Gaussian subspace, also called the degrees of freedom.
10. Confusing the alternative region with the orthogonal signal subspace.
$\mathcal H\setminus\mathcal H_0$ describes which means are allowed under the alternative. The coordinate block $Z_1$ comes from the orthogonal
signal space $\mathcal H\cap\mathcal H_0^\perp$, after the nuisance part has been accounted for.
Data 145 Study Guide - Lectures 23-24 - Standalone Review Version