STA6246: Theory of Linear Models
Instructor James Hobert
Taken by Yu Zheng in Fall 2022
Review of Linear Algebra
Def: A set of vectors $\mathbf x^{(1)}, \mathbf x^{(2)},\cdots,\mathbf x^{(n)}$ is linearly dependent if there exist coefficients $c_j$, $j=1,\cdots,n$, not all zero, such that $$ \sum_{j=1}^nc_j\mathbf x^{(j)}=\mathbf 0. $$ This set of vectors is linearly independent if $\sum_{j=1}^nc_j\mathbf x^{(j)}=\mathbf 0$ implies $c_j\equiv0$.
Def: Two vectors are orthogonal to each other, written $\mathbf x\perp \mathbf y$, if $$ \mathbf x^T\mathbf y=\mathbf y^T\mathbf x=\sum_j x_jy_j=0. $$ A set of vectors $\mathbf x^{(1)}, \mathbf x^{(2)}, \cdots, \mathbf x^{(n)}$ is mutually orthogonal iff $\mathbf x^{(i)T}\mathbf x^{(j)}=0$ for all $i\not=j$.
- A set of mutually orthogonal nonzero vectors is also linearly independent.
Def: A vector space $\mathcal S$ is a set of vectors that are closed under addition and scalar multiplication, that is, if $\mathbf x^{(1)}$ and $\mathbf x^{(2)}$ are in $\mathcal S$, then $c_1\mathbf x^{(1)}+c_2\mathbf x^{(2)}$ is in $\mathcal S$.
Def: A vector space $\mathcal S$ is said to be generated by a set of vectors $\mathbf x ^{(1)}, \mathbf x^{(2)},\cdots, \mathbf x^{(n)}$ if for every $\mathbf x\in\mathcal S$, there exist some coefficients $c_j$ so that we can write $$ \mathbf x=\sum_{j}c_j\mathbf x^{(j)}. $$ Def: If a vector space $\mathcal S$ is generated by a set of linearly independent vectors $\mathbf x ^{(1)}, \mathbf x^{(2)},\cdots, \mathbf x^{(n)}$, then this set of vectors forms a basis for the space $\mathcal S$.
Def: The set of all linear combinations of a set of vectors is called the span of that set.
- $\mathbf x\in span\{\mathbf x ^{(1)}, \mathbf x^{(2)},\cdots, \mathbf x^{(n)}\}$ if and only if there exist constants $c_j$ such that $\mathbf x=\sum_jc_j\mathbf x^{(j)}$.
Def: The number of vectors in the basis for a vector space $\mathcal S$ is the dimension of the space $\mathcal S$, written $dim(\mathcal S)$.
Def: The rank of a matrix $\mathbf A$ is the number of linearly independent rows or columns, denoted by $rank(\mathbf A)$ or $r(\mathbf A)$.
Def: The column space of a matrix, denoted by $\mathcal C(\mathbf A)$, is the vector space spanned by the columns of the matrix, that is $$ \mathcal C(\mathbf A )=\{\mathbf x: \text{there exists a vector }\mathbf c\text{ such that }\mathbf x=\mathbf A \mathbf c \}. $$
- $\dim(\mathcal C(\mathbf A))=rank(\mathbf A)$
- $rank(\mathbf A \mathbf B)\leq \min(rank(\mathbf A),rank(\mathbf B))$
- $\mathcal C(\mathbf A\mathbf B)\subset \mathcal C(\mathbf A)$
- If $\mathcal C(\mathbf A)\subset \mathcal C(\mathbf B)$, then there exists a matrix $\mathbf C$ such that $\mathbf A=\mathbf B\mathbf C$
Def: The null space of a matrix, denoted by $\mathcal N(\mathbf A)$, is defined by $\mathcal N(\mathbf A)=\{\mathbf y:\mathbf A\mathbf y=\mathbf 0\}$.
Notice that if the matrix $\mathbf A$ is $m\times n$, then vectors in $\mathcal N(\mathbf A)$ have dimension $n$ while vectors in $\mathcal C(\mathbf A)$ have dimension $m$. Mathematically, this may be expressed as $\mathcal C(\mathbf A)\subset \mathbb R^m$ and $\mathcal N(\mathbf A)\subset \mathbb R^n$.
If $\mathbf A$ has full-column rank, then $\mathcal N(\mathbf A)=\{\mathbf 0\}$
Suppose $\mathbf A\in\mathbb R ^{m\times n}$, then $\dim(\mathcal N(\mathbf A))=n-r$ where $r=rank(\mathbf A)$, or, more elegantly, $$ \dim(\mathcal N(\mathbf A))+\dim(\mathcal C(\mathbf A))=n. $$
Def: Two vector spaces $\mathcal S$ and $\mathcal T$ form orthogonal complements in $\mathbb R^m$ if and only if $\mathcal S$, $\mathcal T\subset\mathbb R^m$, $\mathcal S\cap \mathcal T=\{\mathbf 0\}$, $\dim(\mathcal S)+\dim(\mathcal T)=m$, and every vector in $\mathcal S$ is orthogonal to every vector in $\mathcal T$.
- Let $\mathcal S$ and $\mathcal T$ be orthogonal complements in $\mathbb R^m$, then any vector $\mathbf x\in\mathbb R^m$ can be written as $\mathbf x=\mathbf s+\mathbf t$ where $\mathbf s\in\mathcal S$ and $\mathbf t\in\mathcal T$, and this decomposition is unique
- If $\mathbf A$ is an $m\times n$ matrix, then $\mathcal C(\mathbf A)$ and $\mathcal N(\mathbf A^T)$ are orthogonal complements in $\mathbb R^m$
- Let $\mathcal S_1$ and $\mathcal T_1$ be orthogonal complements, as well as $\mathcal S_2$ and $\mathcal T_2$; then if $\mathcal S_1\subset \mathcal S_2$, then $\mathcal T_2\subset \mathcal T_1$
- Consider two vector spaces $\mathcal S$ and $\mathcal T$. If $\mathcal S\subset \mathcal T$ and $\dim(\mathcal S)=\dim(\mathcal T)=k$, then $\mathcal S=\mathcal T$
- Let $\mathbf A$ be an $m\times n$ matrix and $\mathbf b$ a fixed vector. If $\mathbf A\mathbf x+\mathbf b=\mathbf 0$ for all $\mathbf x\in\mathbb R^n$, then $\mathbf A=\mathbf 0$ and $\mathbf b=\mathbf 0$
- If $\mathbf B\mathbf x=\mathbf C\mathbf x$ for all $\mathbf x$, then $\mathbf B=\mathbf C$
- Let $\mathbf A$ have full-column rank; then if $\mathbf A\mathbf B=\mathbf A\mathbf C$, then $\mathbf B=\mathbf C$
- If $\mathbf C^T\mathbf C=\mathbf 0$, then $\mathbf C=\mathbf 0$
Def: A system of equations $\mathbf A\mathbf x=\mathbf c$ is consistent iff there exists a solution $\mathbf x^\star$ such that $\mathbf A\mathbf x^\star=\mathbf c$
- A system of equations $\mathbf A\mathbf x=\mathbf c$ is consistent iff $\mathbf c\in\mathcal C(\mathbf A)$
Def: A matrix $\mathbf G$ is a generalized inverse of the matrix $\mathbf A$ iff it satisfies $\mathbf A\mathbf G\mathbf A=\mathbf A$
Let $\mathbf A$ be an $m\times n$ matrix with rank $r$. If $\mathbf A$ can be partitioned as below, with $rank(\mathbf A)=rank(\mathbf C)=r$, $$ \mathbf A=\begin{bmatrix}\mathbf C&\mathbf D\\\mathbf E&\mathbf F\end{bmatrix}\begin{matrix}r\\m-r\end{matrix} $$ so that $\mathbf C$ is nonsingular, then the matrix $$ \mathbf G=\begin{bmatrix}\mathbf C^{-1}&\mathbf 0\\\mathbf 0&\mathbf 0\end{bmatrix}\begin{matrix}r\\n-r\end{matrix} $$ is a generalized inverse of $\mathbf A$
Let $\mathbf A$ be an $m\times n$ matrix with rank $r$. If $\mathbf A$ can be partitioned as below, with $rank(\mathbf A)=rank(\mathbf F)=r$, $$ \mathbf A=\begin{bmatrix}\mathbf C&\mathbf D\\\mathbf E&\mathbf F\end{bmatrix}\begin{matrix}m-r\\r\end{matrix} $$ so that $\mathbf F$ is nonsingular, then the matrix $$ \mathbf G=\begin{bmatrix}\mathbf 0&\mathbf 0\\\mathbf 0&\mathbf F^{-1}\end{bmatrix}\begin{matrix}n-r\\r\end{matrix} $$ is a generalized inverse of $\mathbf A$
Let $\mathbf A$ be an $m\times n$ matrix with rank $r$. Let $\mathbf P$ and $\mathbf Q$ be permutation matrices such that $$ \mathbf P\mathbf A\mathbf Q=\begin{bmatrix}\mathbf C&\mathbf D\\\mathbf E&\mathbf F\end{bmatrix} $$ where $rank(\mathbf A)=rank(\mathbf C)=r$ and $\mathbf C$ is nonsingular. Then the matrix $\mathbf G$ below is a generalized inverse of $\mathbf A$: $$ \mathbf G=\mathbf Q\begin{bmatrix}\mathbf C^{-1}&\mathbf 0\\\mathbf 0&\mathbf 0\end{bmatrix}\mathbf P $$
Let $\mathbf A\mathbf x=\mathbf c$ be a consistent system of equations and let $\mathbf G$ be a generalized inverse of $\mathbf A$; then $\mathbf G\mathbf c$ is a solution to the equations $\mathbf A\mathbf x=\mathbf c$
Let $\mathbf A\mathbf x=\mathbf c$ be a consistent system of equations and let $\mathbf G$ be a generalized inverse of $\mathbf A$; then $\tilde {\mathbf x}$ is a solution to the equations $\mathbf A\mathbf x=\mathbf c$ iff there exists a vector $\mathbf z$ such that $\tilde{\mathbf x}=\mathbf G\mathbf c+(\mathbf I-\mathbf G\mathbf A)\mathbf z$
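A minimal numerical sketch of these two facts (NumPy assumed; the rank-deficient matrix $\mathbf A$ and the vectors are made up for illustration, and the Moore–Penrose pseudoinverse is used as one convenient choice of generalized inverse):

```python
import numpy as np

# A 3x4 matrix of rank 2 (row 3 = row 1 + row 2), chosen only for illustration.
A = np.array([[1., 0., 1., 2.],
              [0., 1., 1., 1.],
              [1., 1., 2., 3.]])

# The Moore-Penrose pseudoinverse is one particular generalized inverse of A:
# it satisfies A G A = A (plus extra conditions that are not needed here).
G = np.linalg.pinv(A)
assert np.allclose(A @ G @ A, A)               # defining property of a g-inverse

# Build a consistent right-hand side by construction, so that c is in C(A).
c = A @ np.array([1., 2., 3., 4.])

# G c is one solution of A x = c ...
x0 = G @ c
assert np.allclose(A @ x0, c)

# ... and G c + (I - G A) z is a solution for every z.
rng = np.random.default_rng(0)
for _ in range(5):
    z = rng.standard_normal(4)
    assert np.allclose(A @ (x0 + (np.eye(4) - G @ A) @ z), c)
print("G c and G c + (I - G A) z all solve A x = c")
```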
Def: A square matrix $\mathbf P$ is idempotent iff $\mathbf P^2=\mathbf P$
Def: A square matrix $\mathbf P$ is a projection onto the vector space $\mathcal S$ iff
- $\mathbf P$ is idempotent,
- for any $\mathbf x$, $\mathbf P\mathbf x\in\mathcal S$, and
- if $\mathbf z\in\mathcal S$, $\mathbf P\mathbf z=\mathbf z$
$\mathbf A\mathbf A^g$ is a projection onto $\mathcal C(\mathbf A)$, where $\mathbf A^g$ denotes any generalized inverse of $\mathbf A$
$(\mathbf I-\mathbf A^g\mathbf A)$ is a projection onto $\mathcal N(\mathbf A)$
A symmetric, idempotent matrix $\mathbf P$ that projects onto the vector space $\mathcal S$ is unique
If a symmetric, idempotent matrix $\mathbf P$ projects onto $\mathcal S$, then $\mathbf I-\mathbf P$ projects onto its orthogonal complement
$trace(\mathbf A\mathbf B)=trace(\mathbf B\mathbf A)$
$trace(\mathbf A^T\mathbf A)=\sum_i\sum_jA_{ij}^2$
$|\mathbf A^{-1}|=1/|\mathbf A|$
$det\left(\begin{bmatrix}\mathbf A&\mathbf B\\\mathbf C&\mathbf D\end{bmatrix}\right)=|\mathbf A||\mathbf D-\mathbf C\mathbf A^{-1}\mathbf B|=|\mathbf D||\mathbf A-\mathbf B\mathbf D^{-1}\mathbf C|$
The spectral decomposition of a symmetric matrix is $\mathbf A=\mathbf Q\mathbf \Lambda\mathbf Q^T=\sum_j\lambda_j\mathbf q^{(j)}\mathbf q^{(j)T}$where $\mathbf \Lambda$ is a diagonal matrix of the eigenvalues, ordered in the same way the eigenvectors are stacked as columns in $\mathbf Q$
For a symmetric $n\times n$ matrix $\mathbf A$, $|\mathbf A|=\lambda_1\times\cdots\times\lambda_n$ and $trace(\mathbf A)=\lambda_1+\cdots+\lambda_n$, where $\lambda_j$ are the eigenvalues of $\mathbf A$
For a symmetric $n\times n$ matrix $\mathbf A$, the rank of it is equal to the number of nonzero eigenvalues
Def: A matrix $\mathbf A$ is nonnegative definite iff $\mathbf x^T\mathbf A\mathbf x\geq0$ for all $\mathbf x$
Def: A matrix $\mathbf A$ is positive definite iff $\mathbf x^T\mathbf A\mathbf x>0$ for all $\mathbf x$
- (Cholesky factorization) A square matrix $\mathbf A$ is positive definite iff there exists a nonsingular lower triangular matrix $\mathbf L$ such that $\mathbf A=\mathbf L\mathbf L^T$
- $\left(\mathbf A^{\frac{1}{2}}\right)^{-1}=\mathbf A^{-\frac{1}{2}}=\mathbf Q\mathbf \Lambda^{-\frac{1}{2}}\mathbf Q^T$
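A small sketch of the spectral decomposition, the eigenvalue facts above, and the symmetric square root (NumPy assumed; the positive definite matrix $\mathbf A$ is constructed arbitrarily for illustration):

```python
import numpy as np

# An arbitrary symmetric positive definite matrix: B B^T is nonnegative
# definite, and adding I pushes all eigenvalues above 1.
B = np.array([[2., 1., 0.],
              [0., 1., 1.],
              [1., 0., 2.]])
A = B @ B.T + np.eye(3)

# Spectral decomposition A = Q Lambda Q^T (eigh is for symmetric matrices).
lam, Q = np.linalg.eigh(A)
assert np.allclose(Q @ np.diag(lam) @ Q.T, A)

# |A| = product of eigenvalues, trace(A) = sum of eigenvalues.
assert np.isclose(np.linalg.det(A), np.prod(lam))
assert np.isclose(np.trace(A), np.sum(lam))

# Symmetric square root A^{1/2} = Q Lambda^{1/2} Q^T and its inverse.
A_half = Q @ np.diag(np.sqrt(lam)) @ Q.T
A_neg_half = Q @ np.diag(1.0 / np.sqrt(lam)) @ Q.T
assert np.allclose(A_half @ A_half, A)
assert np.allclose(A_neg_half, np.linalg.inv(A_half))
```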
Chapter 1: The GLM and Examples
GLM: $\mathbf y=\mathbf X\mathbf b+\mathbf e$
$\mathbf y$: $N\times 1$ random vector of observable responses
$\mathbf X$: $N\times p$ matrix of known constants
$\mathbf b$: $p\times1$ vector of unknown constants
$\mathbf e$: $N\times 1$ random vector of unobservable error such that $\mathbb E(\mathbf e)=\mathbf 0$
It is called a linear model because the expected value of $\mathbf y$, $\mathbb E(\mathbf y)$, is linear in $\mathbf b$.
Define $\mu(\mathbf b)=\mathbf X\mathbf b=\mathbb E(\mathbf y)$. $\mu(\cdot)$ is a linear function: Fix $\mathbf b^{(1)}$, $\mathbf b^{(2)}\in\mathbb R^p$, and $a_1,a_2\in\mathbb R$, then $$ \mu(a_1\mathbf b^{(1)}+a_2\mathbf b^{(2)})=\mathbf X(a_1\mathbf b^{(1)}+a_2\mathbf b^{(2)})=a_1\mathbf X\mathbf b^{(1)}+a_2\mathbf X\mathbf b^{(2)}=a_1\mu(\mathbf b^{(1)})+a_2\mu(\mathbf b^{(2)}) $$ Examples:
One-sample problem (p=1): $ y_1,\cdots,y_N$ i.i.d., $\mathbb E(y_1)=\mu$, $Var( y_1)=\sigma^2$. This is a special case of the GLM with $\mathbf y=(y_1,\cdots,y_N)^T$ and $\mathbf X\mathbf b=\mathbf 1_N\mu$; $\mathbf e$ is just a vector of i.i.d. centered errors $e_1\stackrel{d}{=}y_1-\mu$.
Simple linear regression (p=2): $y_i=\beta_0+\beta_1x_i+e_i$, $i=1,2,\cdots,N$, where $\{e_i\}_{i=1}^N$ uncorrelated r.v.s with $\mathbb E(e_i)=0$ and $Var(e_i)=\sigma^2$. This is a special case of the GLM with $\mathbf y=(y_1,\cdots,y_N)^T$, $\mathbf X\mathbf b=\begin{pmatrix}1&x_1\\\vdots&\vdots\\1&x_N\end{pmatrix}\begin{pmatrix}\beta_0\\\beta_1\end{pmatrix}$, $\mathbf e=(e_1,\cdots,e_N)^T$.
Actually we are making the assumption that the $x_i$’s are measured without error.
Eg. $y_i=\beta_0+\beta_1n_i+e_i$ (True Model)
$y_i$ is the crop yield, and $n_i$ is the nitrogen in soil. We don’t see $n_i$ exactly; we see a distorted version $x_i=n_i+u_i$, where $u_i$ is a random measurement error.
$y_i=\beta_0+\beta_1(n_i+u_i)+e_i$ (Observed Model) $\rightarrow$ Is this a special case of the GLM? No!
Multiple regression (p=k+1): $y_i=\beta_0+\beta_1x_{i1}+\beta_2x_{i2}+\cdots+\beta_kx_{ik}+e_i$, $i=1,2,\cdots,N$, where $\{e_i\}^N_{i=1}$ uncorrelated r.v.s with $\mathbb E(e_i)=0$ and $Var(e_i)=\sigma^2$. This is a special case of the GLM with $\mathbf y=(y_1,\cdots,y_N)^T$, $\mathbf X\mathbf b=\begin{pmatrix}1&x_{11}&\cdots&x_{1k}\\\vdots&\vdots&\ddots&\vdots\\1&x_{N1}&\cdots&x_{Nk}\end{pmatrix}\begin{pmatrix}\beta_0\\\vdots\\\beta_k\end{pmatrix}$, $\mathbf e=(e_1,\cdots,e_N)^T$.
- $y_i=\beta_0+\beta_1i+\beta_2i^2+e_i$ (Yes)
- $y_i=\beta_0+\beta_1\cos\left(\frac{2\pi i}{7}\right)+\beta_2\log(i+1)+e_i$ (Yes)
- $y_i=\beta_0+\beta_1 e^{-\beta_2x_i}+e_i$ (No)
- $y_i=\frac{\beta_0}{\beta_1+\beta_2 x_i}+e_i$ (No)
One-way ANOVA: $y_{ij}=\mu+\alpha_i+e_{ij}$, $i=1,\cdots, a$, $j=1,\cdots, n_i$, where $\{e_{ij}\}^{a,n_i}_{i,j=1}$ uncorrelated r.v.s with $\mathbb E(e_{ij})=0$ and $Var(e_{ij})=\sigma^2$. If we assume that $\{\alpha_i\}^a_{i=1}$ are unknown constants, then this is a special case of the GLM with $\mathbf y=(y_{11},\cdots,y_{1n_1},y_{21},\cdots,y_{2n_2},\cdots,y_{a1},\cdots, y_{an_a})^T$, $\mathbf X\mathbf b=\begin{pmatrix}\mathbf 1_{n_1}&\mathbf 1_{n_1}&\mathbf0&\cdots&\mathbf 0\\\mathbf 1_{n_2}&\mathbf 0&\mathbf 1_{n_2}&\cdots&\mathbf 0\\\vdots&\vdots&\vdots&\ddots&\vdots\\\mathbf 1_{n_a}&\mathbf0&\mathbf0&\cdots&\mathbf1_{n_a}\end{pmatrix}\begin{pmatrix}\mu\\\alpha_1\\\vdots\\\alpha_a\end{pmatrix}$, $\mathbf e=(e_{11},\cdots,e_{1n_1},e_{21},\cdots,e_{2n_2},\cdots,e_{a1},\cdots, e_{an_a})^T$.
Comments:
$\mathbf X$ is not full rank
If the $a$ treatments were randomly selected from a population of treatments, then $\{\alpha_i\}_{i=1}^a$ must be considered random, and in this case, we have a so-called mixed model. However, if we redefine $\mathbf X\mathbf b$ and $\mathbf e$, this mixed model is still a special case of the GLM with $\mathbf X\mathbf b=\begin{pmatrix}1\\1\\\vdots\\1\end{pmatrix}\mu$, $\mathbf e=\begin{pmatrix}\alpha_1+e_{11}\\\alpha_1+e_{12}\\\vdots\\\alpha_a+e_{a1}\\\vdots\\\alpha_a+e_{an_a}\end{pmatrix}$
Here elements of $\mathbf e$ are no longer uncorrelated–but that’s okay.
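As a concrete illustration of the fixed-effects one-way ANOVA design above, here is a sketch that builds $\mathbf X$ and confirms it is not of full column rank (NumPy assumed; the helper `one_way_X` and the group sizes are made up for illustration):

```python
import numpy as np

def one_way_X(n_sizes):
    """Design matrix for y_ij = mu + alpha_i + e_ij with group sizes n_1, ..., n_a."""
    a, N = len(n_sizes), sum(n_sizes)
    X = np.zeros((N, a + 1))
    X[:, 0] = 1.0                              # intercept column for mu
    row = 0
    for i, n_i in enumerate(n_sizes):
        X[row:row + n_i, i + 1] = 1.0          # indicator column for alpha_i
        row += n_i
    return X

X = one_way_X([3, 2, 4])                       # hypothetical group sizes
print(X.shape)                                 # (9, 4): N = 9, p = a + 1 = 4
print(np.linalg.matrix_rank(X))                # 3 = a < p, so X is not full rank
```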
Two-way nested model: $y_{ijk}=\mu+\alpha_i+\beta_{ij}+e_{ijk}$, $i=1,\cdots,a$, $j=1,\cdots,b_i$, $k=1,\cdots,n_{ij}$, where $\{e_{ijk}\}_{i,j,k=1}^{a,b_i,n_{ij}}$ uncorrelated r.v.s with $\mathbb E(e_{ijk})=0$ and $Var(e_{ijk})=\sigma^2$. We’ll assume the $\{\alpha_i\}_{i=1}^a$ and $\{\beta_{ij}\}_{i,j=1}^{a,b_i}$ are fixed treatment effects. This is a special case of the GLM.
Eg: Hospital costs from medical patients in two states (Florida & Pennsylvania)
$b_1=2$ Florida (Dade county, Monroe county)
$b_2=3$ Penn (A, B, E)
$\mathbf y=\begin{pmatrix}y_{F,D,1}\\\vdots\\ y_{F,D,n_{11}}\\ y_{F,M,1}\\\vdots\\ y_{F,M,n_{12}}\\ y_{P,A,1}\\\vdots\\ y_{P,E,n_{23}}\end{pmatrix}$,
$\mathbf X\mathbf b=\begin{pmatrix}\mathbf 1_{n_{11}}&\mathbf1_{n_{11}}&\mathbf0&\mathbf1_{n_{11}}&\mathbf0&\mathbf0&\mathbf0&\mathbf0\\\mathbf 1_{n_{12}}&\mathbf1_{n_{12}}&\mathbf0&\mathbf0&\mathbf1_{n_{12}}&\mathbf0&\mathbf0&\mathbf0\\\mathbf 1_{n_{21}}&\mathbf0&\mathbf1_{n_{21}}&\mathbf0&\mathbf0&\mathbf1_{n_{21}}&\mathbf0&\mathbf0\\\mathbf 1_{n_{22}}&\mathbf0&\mathbf1_{n_{22}}&\mathbf0&\mathbf0&\mathbf0&\mathbf1_{n_{22}}&\mathbf0\\\mathbf 1_{n_{23}}&\mathbf0&\mathbf1_{n_{23}}&\mathbf0&\mathbf0&\mathbf0&\mathbf0&\mathbf1_{n_{23}}\end{pmatrix}\begin{pmatrix}\mu\\\alpha_F\\\alpha_P\\\beta_{F,D}\\\beta_{F,M}\\\beta_{P,A}\\\beta_{P,B}\\\beta_{P,E}\end{pmatrix}$
$p=a+1+\sum_{i=1}^ab_i$, $N=\sum_{i=1}^a\sum_{j=1}^{b_i}n_{ij}$
Two-way crossed model: $y_{ijk}=\mu+\alpha_i+\beta_j+\gamma_{ij}+e_{ijk}$, $i=1,\cdots,a$, $j=1,\cdots,b$, $k=1,\cdots,n_{ij}$, where $\{e_{ijk}\}_{i,j,k=1}^{a,b,n_{ij}}$ uncorrelated r.v.s with $\mathbb E(e_{ijk})=0$ and $Var(e_{ijk})=\sigma^2$. Assume all fixed effects. This is a special case of the GLM.
Simple special case: $y_{ij}=\mu+\alpha_i+\beta_j+e_{ij}$, $i=1,\cdots,a$, $j=1,\cdots,b$.
$\rightarrow$Randomized block model: $\alpha_i$-treatment effects, $\beta_j$-block effects. If block effects are considered fixed, this is a special case of the GLM.
$p=a+b+1$, $N=ab$, $\mathbf y=(y_{11},\cdots,y_{1b},y_{21},\cdots,y_{2b},\cdots,y_{a1},\cdots,y_{ab})^T$, $\mathbf X\mathbf b=\begin{pmatrix}\mathbf 1_b&\mathbf 1_b&\mathbf 0&\cdots&\mathbf 0&\mathbf I_b\\\mathbf 1_b&\mathbf 0&\mathbf 1_b&\cdots&\mathbf 0&\mathbf I_b\\\vdots&\vdots&\vdots&\ddots&\vdots&\vdots\\\mathbf 1_b&\mathbf 0&\mathbf 0&\cdots&\mathbf 1_b&\mathbf I_b \end{pmatrix}\begin{pmatrix}\mu\\\alpha_1\\\vdots\\\alpha_a\\\beta_1\\\vdots\\\beta_b\end{pmatrix}$
if block effects are considered random, then $\mathbf X\mathbf b=\begin{pmatrix}\mathbf 1_b&\mathbf 1_b&\mathbf 0&\cdots&\mathbf 0\\\mathbf 1_b&\mathbf 0&\mathbf 1_b&\cdots&\mathbf 0\\\vdots&\vdots&\vdots&\ddots&\vdots\\\mathbf 1_b&\mathbf 0&\mathbf 0&\cdots&\mathbf 1_b\end{pmatrix}\begin{pmatrix}\mu\\\alpha_1\\\vdots\\\alpha_a\\\end{pmatrix}$, $\mathbf e=\begin{pmatrix}\beta_1+e_{11}\\\vdots\\\beta_b+e_{1b}\\\vdots\\\beta_1+e_{a1}\\\vdots\\\beta_b+e_{ab}\end{pmatrix}$.
Analysis of covariance: $y_{ij}=\mu+\alpha_i+\beta x_{ij}+e_{ij}$, $i=1,\cdots,a$, $j=1,\cdots,n_i$, where $\{e_{ij}\}_{i,j=1}^{a,n_i}$ uncorrelated r.v.s with $\mathbb E(e_{ij})=0$ and $Var(e_{ij})=\sigma^2$.
Eg: $y_{ij}=\text{weight gain for individual }j\text{ on diet }i$, $x_{ij}=\text{ initial weight}$
$p=a+2$, $N=\sum_{i=1}^an_i$, $\mathbf y=(y_{11},\cdots,y_{1n_1},y_{21},\cdots,y_{2n_2},\cdots,y_{a1},\cdots, y_{an_a})^T$, $\mathbf X\mathbf b=\begin{pmatrix}\mathbf 1_{n_1}&\mathbf 1_{n_1}&\mathbf0&\cdots&\mathbf 0&\tilde{\mathbf x}_1\\ \mathbf 1_{n_2}&\mathbf 0&\mathbf 1_{n_2}&\cdots&\mathbf 0&\tilde{\mathbf x}_2\\ \vdots&\vdots&\vdots&\ddots&\vdots&\vdots\\ \mathbf 1_{n_a}&\mathbf0&\mathbf0&\cdots&\mathbf1_{n_a}&\tilde{\mathbf x}_a\end{pmatrix}\begin{pmatrix}\mu\\\alpha_1\\\vdots\\\alpha_a\\\beta\end{pmatrix}$, $\tilde{\mathbf x}_i=(x_{i1},\cdots,x_{in_i})^T$, $\mathbf e=(e_{11},\cdots,e_{1n_1},e_{21},\cdots,e_{2n_2},\cdots,e_{a1},\cdots, e_{an_a})^T$.
We will gradually build up assumptions on $\mathbf e$:
- No assumptions. Just find $\mathbf b$ to minimize $|\mathbf y-\mathbf X\mathbf b|$
- Assume $\mathbb E(\mathbf e)=0$. Find unbiased estimators of $\pmb \lambda^T\mathbf b$ , where $\pmb\lambda^T=(\lambda_1,\cdots,\lambda_p)$ (Fixed vector)
- Assume $\{e_i\}_{i=1}^N$ are uncorrelated with $\mathbb E(\mathbf e)=\mathbf 0$ and $Var(e_i)=\sigma^2$. Find the minimum variance unbiased estimator of $\pmb \lambda^T\mathbf b$
- Assume $\mathbf e$ is MVN and develop tests and CIs concerning $\pmb \lambda^T\mathbf b$
Chapter 2: The Linear Least Squares Problem
2.1 The Normal Equations
Assumption: $\mathbf y\in\mathbb R^N$ and $\mathbf X\in\mathbb R^{N\times p}$
Goal: Find the value of $\mathbf b\in\mathbb R^p$ that minimizes $$ Q(\mathbf b)=(\mathbf y-\mathbf X\mathbf b)^T(\mathbf y-\mathbf X\mathbf b)=|\mathbf y-\mathbf X\mathbf b|^2 $$ A value of $\mathbf b$ that minimizes $Q(\mathbf b)$ will be called a least squares solution $$ \mathcal C(\mathbf X)=\{\mathbf v\in\mathbb R^N:\mathbf v=\mathbf X\mathbf w\text{ for some }\mathbf w\in\mathbb R^p\} $$ Mathematically, trying to find the closest point in $\mathcal C(\mathbf X)$ to the data $\mathbf y$
The gradient vector is given by $$ \frac{\partial Q}{\partial \mathbf b}=(\frac{\partial Q}{\partial b_1},\frac{\partial Q}{\partial b_2},\cdots,\frac{\partial Q}{\partial b_p})^T $$ The minimum of $Q$ will occur where the gradient is zero, so we want to solve $$ \frac{\partial Q}{\partial\mathbf b}=\mathbf 0 $$ Result 2.1: Assume $\mathbf a$ is a $p\times 1$ vector, $\mathbf b$ is a $p\times1$ vector, $\mathbf A$ is a $p\times p$ matrix. Then
- $\frac{\partial \mathbf a^T\mathbf b}{\partial \mathbf b}=\mathbf a$
- $\frac{\partial\mathbf b^T\mathbf A\mathbf b}{\partial \mathbf b}=(\mathbf A+\mathbf A^T)\mathbf b$
proof: For the first result, note that $\mathbf a^T\mathbf b=a_1b_1+a_2b_2+\cdots+a_pb_p$, so $\left(\frac{\partial\mathbf a^T\mathbf b}{\partial \mathbf b}\right)_j=\frac{\partial\mathbf a^T\mathbf b}{\partial b_j}=a_j$.
Similarly, $$ \mathbf b^T\mathbf A\mathbf b=\sum_{j=1}^p\sum_{k=1}^pA_{jk}b_jb_k, $$ and to form $\left(\frac{\partial\mathbf b^T\mathbf A\mathbf b}{\partial\mathbf b}\right)_j$, the terms that depend on $b_j$ are $$ A_{jj}b_j^2+\sum_{k\not=j}(A_{jk}+A_{kj})b_kb_j, $$ so the partial derivative w.r.t. $b_j$ is $$ 2A_{jj}b_j+\sum_{k\not=j}A_{jk}b_k+\sum_{k\not=j}A_{kj}b_k, $$ which is the $j$-th element of $(\mathbf A+\mathbf A^T)\mathbf b$.
Back to $Q(\mathbf b)$, using Result 2.1, $$ \frac{\partial Q}{\partial \mathbf b}=-2\mathbf X^T\mathbf y+2\mathbf X^T\mathbf X\mathbf b. $$ Setting the gradient to $\mathbf 0$ yields $$ \mathbf X^T\mathbf X\mathbf b=\mathbf X^T\mathbf y. $$ These are the normal equations (N.E.s).
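A quick numerical check of this gradient formula against finite differences (NumPy assumed; the design matrix and data are simulated purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 20, 3
X = rng.standard_normal((N, p))
y = rng.standard_normal(N)
b = rng.standard_normal(p)

Q = lambda b: (y - X @ b) @ (y - X @ b)        # Q(b) = |y - Xb|^2
grad = -2 * X.T @ y + 2 * X.T @ X @ b          # gradient from Result 2.1

# Central finite differences, coordinate by coordinate.
eps = 1e-6
fd = np.array([(Q(b + eps * e) - Q(b - eps * e)) / (2 * eps) for e in np.eye(p)])
assert np.allclose(grad, fd, atol=1e-4)
print("analytic gradient matches finite differences")
```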
Example:
- SLR: $y_i=\beta_0+\beta_1x_i+e_i$, $i=1,\cdots,N$ $$ \mathbf X^T\mathbf X\mathbf b=\begin{pmatrix}N&N\bar{x}\\N\bar{x}&\sum_{i=1}^Nx_i^2\end{pmatrix}\begin{pmatrix}\beta_0\\\beta_1\end{pmatrix},\quad \mathbf X^T\mathbf y=\begin{pmatrix}N\bar{y}\\\sum_{i=1}^Nx_iy_i\end{pmatrix} $$ Assume that $\sum_{i=1}^N(x_i-\bar{x})^2>0$, i.e., the $x_i$’s are NOT all the same. The solution to the N.E.s is $$ \hat \beta_1=\frac{\sum_{i=1}^N(x_i-\bar{x})y_i}{\sum_{i=1}^N(x_i-\bar{x})^2}\\ \hat\beta_0=\bar{y}-\hat \beta_1\bar{x} $$ If $\sum_{i=1}^N(x_i-\bar{x})^2=0$, then there are infinitely many solutions to the N.E.s: $$ \hat\beta_1=c\\\hat\beta_0=\bar{y}-c\bar{x} $$ for any $c\in\mathbb R$.
Balanced one-way ANOVA:
$y_{ij}=\mu+\alpha_i+e_{ij}$, $i=1,\cdots,a$, $j=1,\cdots,n$ $$ \mathbf X=\begin{pmatrix}\mathbf 1_n&\mathbf 1_n&\mathbf 0&\cdots&\mathbf 0\\\mathbf 1_n&\mathbf 0&\mathbf 1_n&\cdots&\mathbf 0\\\vdots&\vdots&\vdots&\ddots&\vdots\\\mathbf 1_n&\mathbf 0&\mathbf 0&\cdots&\mathbf 1_n\end{pmatrix} $$
$$ \mathbf X^T\mathbf X\mathbf b=\begin{pmatrix}an&n&n&\cdots&n\\n&n&0&\cdots&0\\n&0&n&\cdots&0\\\vdots&\vdots&\vdots&\ddots&\vdots\\n& 0& 0&\cdots&n\end{pmatrix}\begin{pmatrix}\mu\\\alpha_1\\\vdots\\\alpha_a\end{pmatrix},\quad \mathbf X^T\mathbf y=\begin{pmatrix}y_{\cdot\cdot}\\y_{1\cdot}\\\vdots\\y_{a\cdot}\end{pmatrix} $$
$\mathbf X^T\mathbf X$ is singular; again, there are infinitely many solutions to the N.E.s $$ \hat\mu=c\\\hat\alpha_i=\bar{y}_{i\cdot}-c $$ for any $c\in\mathbb R$.
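A sketch of the balanced one-way ANOVA normal equations and their family of solutions (NumPy assumed; $a$, $n$, and the responses are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
a, n = 3, 4                                     # a = 3 groups of n = 4 observations
X = np.column_stack([np.ones(a * n), np.kron(np.eye(a), np.ones((n, 1)))])
y = rng.standard_normal(a * n)

XtX, Xty = X.T @ X, X.T @ y
print(np.linalg.matrix_rank(XtX))               # 3 < p = 4, so X^T X is singular

ybar = y.reshape(a, n).mean(axis=1)             # group means ybar_i
for c in (0.0, 1.0, -2.5):                      # any c in R works
    b = np.concatenate([[c], ybar - c])         # (mu, alpha_1, ..., alpha_a) = (c, ybar_i - c)
    assert np.allclose(XtX @ b, Xty)
print("every choice of c solves the normal equations")
```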
Next, we will show that the N.E.s are consistent and we’ll also show that $$ Q(\mathbf b)\text{ is minimized at }\hat{\mathbf b}\quad\Leftrightarrow\quad \hat{\mathbf b}\text{ is a solution to the N.E.s} $$
2.2 The Geometry of Least Squares
Recall that a vector space, $\mathcal S$, in $\mathbb R^m$ is a collection of vectors that is closed under addition and scalar multiplication. So if $\mathbf x^{(1)}, \mathbf x^{(2)}\in\mathcal S$ and $c_1,c_2\in\mathbb R$, then $c_1\mathbf x^{(1)}+c_2\mathbf x^{(2)}\in\mathcal S$.
Def A.4: Two vector spaces $\mathcal S$ and $\mathcal T$ form orthogonal complements in $\mathbb R^m$ if and only if $\mathcal S$, $\mathcal T\subset\mathbb R^m$, $\mathcal S\cap \mathcal T=\{\mathbf 0\}$, $\dim(\mathcal S)+\dim(\mathcal T)=m$, and $\mathbf s^T\mathbf t=0$ for every vector $\mathbf s$ in $\mathcal S$ and every vector $\mathbf t$ in $\mathcal T$.
Result A.4: Let $\mathcal S$ and $\mathcal T$ be orthogonal complements in $\mathbb R^m$, then any vector $\mathbf x\in\mathbb R^m$ can be written as $\mathbf x=\mathbf s+\mathbf t$ where $\mathbf s\in\mathcal S$ and $\mathbf t\in\mathcal T$, and this decomposition is unique
Result A.5: If $\mathbf A$ is an $m\times n$ matrix, then $\mathcal C(\mathbf A)$ and $\mathcal N(\mathbf A^T)$ are orthogonal complements in $\mathbb R^m$
Result A.6: Let $\mathcal S_1$ and $\mathcal T_1$ be orthogonal complements, as well as $\mathcal S_2$ and $\mathcal T_2$; then if $\mathcal S_1\subset \mathcal S_2$, then $\mathcal T_2\subset \mathcal T_1$
Question: Are the N.E.s consistent? In other words, is it always true that $\mathbf X^T\mathbf y\in\mathcal C(\mathbf X^T\mathbf X)$?
Lemma 2.1: $\mathcal N(\mathbf X^T\mathbf X)=\mathcal N(\mathbf X)$
proof: $\mathbf w\in\mathcal N(\mathbf X)\Rightarrow \mathbf X\mathbf w=\mathbf 0\Rightarrow \mathbf X^T\mathbf X\mathbf w=\mathbf 0\Rightarrow \mathbf w\in\mathcal N(\mathbf X^T\mathbf X)$
$\mathbf w\in\mathcal N(\mathbf X^T\mathbf X)\Rightarrow \mathbf X^T\mathbf X\mathbf w=\mathbf 0\Rightarrow \mathbf w^T\mathbf X^T\mathbf X\mathbf w=0\Rightarrow |\mathbf X\mathbf w|=0\Rightarrow \mathbf X\mathbf w=\mathbf 0\Rightarrow \mathbf w\in\mathcal N(\mathbf X)$
Result 2.2: $\mathcal C(\mathbf X^T\mathbf X)=\mathcal C(\mathbf X^T)$
proof: Result A.5 $\Rightarrow$$\mathcal N(\mathbf X)$ and $\mathcal C(\mathbf X^T)$ are O.C.s, $\mathcal N(\mathbf X^T\mathbf X)$ and $\mathcal C(\mathbf X^T\mathbf X)$ are O.C.s
Result A.6 $\Rightarrow$ $\mathcal N(\mathbf X)\subset \mathcal N(\mathbf X^T\mathbf X)\Rightarrow \mathcal C(\mathbf X^T\mathbf X)\subset\mathcal C(\mathbf X^T)$, $\mathcal N(\mathbf X^T\mathbf X)\subset \mathcal N(\mathbf X)\Rightarrow \mathcal C(\mathbf X^T)\subset\mathcal C(\mathbf X^T\mathbf X)$
Corollary 2.1: The N.E.s are consistent
proof: Result 2.2: $\mathbf X^T\mathbf y\in\mathcal C(\mathbf X^T)=\mathcal C(\mathbf X^T\mathbf X) $
Result 2.3: $\hat{\mathbf b}$ is a solution to the N.E.s iff $\hat{\mathbf b}$ minimizes $Q(\mathbf b)=|\mathbf y-\mathbf X\mathbf b|^2$
*By Result A.12, $\hat{\mathbf b}=(\mathbf X^T\mathbf X)^g\mathbf X^T\mathbf y$ is a solution to the N.E.s
proof: By Corollary 2.1, we know that there exists a solution to the N.E.s, $\hat{\mathbf b}$. $$ \begin{aligned} Q(\mathbf b)&=(\mathbf y-\mathbf X\mathbf b)^T(\mathbf y-\mathbf X\mathbf b)\\ &=(\mathbf y-\mathbf X\hat{\mathbf b}+\mathbf X\hat{\mathbf b}-\mathbf X\mathbf b)^T(\mathbf y-\mathbf X\hat{\mathbf b}+\mathbf X\hat{\mathbf b}-\mathbf X\mathbf b)\\ &=(\mathbf y-\mathbf X\hat{\mathbf b})^T(\mathbf y-\mathbf X\hat{\mathbf b})+2(\hat{\mathbf b}-\mathbf b)^T\mathbf X^T(\mathbf y-\mathbf X\hat{\mathbf b})+(\mathbf X\hat{\mathbf b}-\mathbf X\mathbf b)^T(\mathbf X\hat{\mathbf b}-\mathbf X\mathbf b)\\ &=Q(\hat{\mathbf b})+|\mathbf X(\hat{\mathbf b}-\mathbf b)|^2 \end{aligned} $$ where the cross-product term vanishes since $\hat{\mathbf b}$ solves the N.E.s, i.e., $\mathbf X^T(\mathbf y-\mathbf X\hat{\mathbf b})=\mathbf 0$. Since the second term is nonnegative, no value of $\mathbf b$ can give a smaller value of $Q$ than $Q(\hat{\mathbf b})$, so $\hat{\mathbf b}$ minimizes $Q(\mathbf b)$.
For the other direction, we still have $\hat{\mathbf b}$ that solves N.E.s, and we now know that $\hat{\mathbf b}$ minimizes $Q$ as well.
Suppose $\tilde {\mathbf b}$ minimizes $Q$. So $Q(\tilde {\mathbf b})=Q(\hat{\mathbf b})+|\mathbf X(\hat{\mathbf b}-\tilde{\mathbf b})|^2$. Since $Q(\tilde {\mathbf b})=Q(\hat{\mathbf b})$, we know $|\mathbf X(\hat{\mathbf b}-\tilde{\mathbf b})|^2=0\Rightarrow \mathbf X(\hat{\mathbf b}-\tilde{\mathbf b})=\mathbf 0\Rightarrow \mathbf X\hat{\mathbf b}=\mathbf X\tilde{\mathbf b}$. Now $\mathbf X^T\mathbf y=\mathbf X^T\mathbf X\hat{\mathbf b}=\mathbf X^T\mathbf X\tilde{\mathbf b}$ , so $\tilde{\mathbf b}$ satisfies the N.E.s.
Corollary 2.3: $\mathbf X\hat{\mathbf b}$ is invariant to the choice of a solution to the N.E.s.
proof: If $\hat{\mathbf b}$ and $\tilde{\mathbf b}$ are both solutions to N.E.s, then $Q(\tilde{\mathbf b})=Q(\hat{\mathbf b})+|\mathbf X(\hat{\mathbf b}-\tilde{\mathbf b})|^2$. So $|\mathbf X(\hat{\mathbf b}-\tilde{\mathbf b})|=0$.
We have shown that the closest point in $\mathcal C(\mathbf X)$ to $\mathbf y$ is the vector of fitted values, $\hat{\mathbf y}=\mathbf X\hat{\mathbf b}$. The residual vector $\hat{\mathbf e}=\mathbf y-\hat{\mathbf y}$ is in $\mathcal N(\mathbf X^T)$ since $\mathbf X^T\hat{\mathbf e}=\mathbf X^T(\mathbf y-\hat{\mathbf y})=\mathbf X^T\mathbf y-\mathbf X^T\mathbf X\hat{\mathbf b}=\mathbf 0$.
This unique orthogonal decomposition (from Result A.4) also yields a decomposition of sums of squares from the Pythagorean Theorem: $$ |\mathbf y|^2=|\hat{\mathbf y}|^2+|\hat{\mathbf e}|^2. $$ That is, the total sum of squares $|\mathbf y|^2$ is the sum of the regression sum of squares, or $SSR = |\mathbf X\hat{\mathbf b}|^2$, and error sum of squares, or $SSE=|\hat{\mathbf e}|^2$.
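A numerical check of the orthogonal decomposition and the sum-of-squares identity (NumPy assumed; a rank-deficient one-way ANOVA design with made-up data is used for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
# One-way ANOVA design, 3 groups of 4: p = 4 columns but rank 3.
X = np.column_stack([np.ones(12), np.kron(np.eye(3), np.ones((4, 1)))])
y = rng.standard_normal(12) + np.repeat([1.0, 2.0, 3.0], 4)

b_hat = np.linalg.pinv(X.T @ X) @ (X.T @ y)    # one solution of the N.E.s
y_hat = X @ b_hat                              # fitted values, the projection of y onto C(X)
e_hat = y - y_hat                              # residuals

assert np.allclose(X.T @ e_hat, 0)             # residuals lie in N(X^T)
# Pythagorean decomposition: |y|^2 = SSR + SSE.
assert np.isclose(y @ y, y_hat @ y_hat + e_hat @ e_hat)
```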
Now let’s look at a different way of developing this decomposition using projections.
Result 2.4: $\mathbf X^T\mathbf X\mathbf A=\mathbf X^T\mathbf X\mathbf B\Leftrightarrow\mathbf X\mathbf A=\mathbf X\mathbf B$.
proof: ($\Leftarrow$) Obvious
($\Rightarrow$) We will show two different proofs:
Algebraic: $\mathbf X^T\mathbf X(\mathbf A-\mathbf B)=\mathbf 0\Rightarrow(\mathbf A-\mathbf B)^T\mathbf X^T\mathbf X(\mathbf A-\mathbf B)=0\Rightarrow\left[\mathbf X(\mathbf A-\mathbf B)\right]^T\mathbf X(\mathbf A-\mathbf B)=0$
$\Rightarrow (\text{Lemma A.1}) \mathbf X(\mathbf A-\mathbf B)=\mathbf 0\Rightarrow \mathbf X\mathbf A=\mathbf X\mathbf B$.
Geometric: $\mathbf X^T\mathbf X(\mathbf A-\mathbf B)=\mathbf 0$ implies that the columns of $\mathbf X(\mathbf A-\mathbf B)$ are in $\mathcal N(\mathbf X^T)$. But the columns of $\mathbf X(\mathbf A-\mathbf B)$ are also all in $\mathcal C(\mathbf X)$. Since $\mathcal C(\mathbf X)$ and $\mathcal N(\mathbf X^T)$ are O.C.s, the only vector they have in common is the zero vector. Thus, each column of $\mathbf X(\mathbf A-\mathbf B)$ must be $\mathbf 0$, i.e., $\mathbf X(\mathbf A-\mathbf B)=\mathbf 0$.
Some material from the appendix:
Def: A square matrix $\mathbf P$ is a projection onto the vector space $\mathcal S$ iff
- $\mathbf P$ is idempotent,
- for any $\mathbf x$, $\mathbf P\mathbf x\in\mathcal S$, and
- if $\mathbf z\in\mathcal S$, $\mathbf P\mathbf z=\mathbf z$
Result A.14: $\mathbf A\mathbf A^g$ is a projection onto $\mathcal C(\mathbf A)$.
proof:
- $\mathbf A\mathbf A^g\mathbf A\mathbf A^g=\mathbf A\mathbf A^g$, so $\mathbf A\mathbf A^g$ is idempotent.
- For any $\mathbf x$, $\mathbf A\mathbf A^g\mathbf x=\mathbf A(\mathbf A^g\mathbf x)\in\mathcal C(\mathbf A)$.
- If $\mathbf z\in\mathcal C(\mathbf A)$, then $\mathbf z=\mathbf A\mathbf y$ for some $\mathbf y$, and $\mathbf A\mathbf A^g\mathbf z=\mathbf A\mathbf A^g\mathbf A\mathbf y=\mathbf A\mathbf y=\mathbf z$.
Result A.16: There is only one symmetric projection onto $\mathcal S$.
Result 2.5: $(\mathbf X^T\mathbf X)^g\mathbf X^T$ is a generalized inverse of $\mathbf X$.
proof: $(\mathbf X^T\mathbf X)(\mathbf X^T\mathbf X)^g(\mathbf X^T\mathbf X)=\mathbf X^T\mathbf X$. By Result 2.4, we have $\mathbf X(\mathbf X^T\mathbf X)^g\mathbf X^T\mathbf X=\mathbf X$.
Theorem 2.1: Define $\mathbf P_{\mathbf X}=\mathbf X(\mathbf X^T\mathbf X)^g\mathbf X^T$. $\mathbf P_{\mathbf X}$ is the symmetric projection onto $\mathcal C(\mathbf X)$.
proof: First, note that $\mathbf P_\mathbf X=\mathbf A\mathbf A^g$ with $\mathbf A=\mathbf X$ and $\mathbf A^g=(\mathbf X^T\mathbf X)^g\mathbf X^T$, which is a G.I. of $\mathbf X$ by Result 2.5. So by Result A.14, $\mathbf P_\mathbf X$ is a projection onto $\mathcal C(\mathbf X)$.
In order to establish symmetry, we first show that $\mathbf P_\mathbf X$ is invariant to the choice of G.I. of $\mathbf X^T\mathbf X$. Let $\mathbf G_1$ and $\mathbf G_2$ be two G.I.s of $\mathbf X^T\mathbf X$. Then $(\mathbf X^T\mathbf X)\mathbf G_1(\mathbf X^T\mathbf X)=\mathbf X^T\mathbf X=(\mathbf X^T\mathbf X)\mathbf G_2(\mathbf X^T\mathbf X)$. By Result 2.4: $$ \mathbf X\mathbf G_1(\mathbf X^T\mathbf X)=\mathbf X\mathbf G_2(\mathbf X^T\mathbf X)\Rightarrow\mathbf X^T\mathbf X\mathbf G_1^T\mathbf X^T=\mathbf X^T\mathbf X\mathbf G_2^T\mathbf X^T. $$ Using Result 2.4 again: $$ \mathbf X\mathbf G_1^T\mathbf X^T=\mathbf X\mathbf G_2^T\mathbf X^T\Rightarrow\mathbf X\mathbf G_1\mathbf X^T=\mathbf X\mathbf G_2\mathbf X^T. $$ Hence, invariance.
Now, note that $$ (\mathbf X^T\mathbf X)\mathbf G_1(\mathbf X^T\mathbf X)=\mathbf X^T\mathbf X\Rightarrow \mathbf X^T\mathbf X\mathbf G_1^T\mathbf X^T\mathbf X=\mathbf X^T\mathbf X $$ which indicates that if $\mathbf G_1$ is a G.I. of $\mathbf X^T\mathbf X$, then so is $\mathbf G_1^T$.
Finally, $$ \mathbf P_\mathbf X^T=\left[\mathbf X(\mathbf X^T\mathbf X)^g\mathbf X^T\right]^T=\mathbf X\left[(\mathbf X^T\mathbf X)^g\right]^T\mathbf X^T=\mathbf X(\mathbf X^T\mathbf X)^g\mathbf X^T=\mathbf P_\mathbf X, $$ where the equality $\mathbf X\left[(\mathbf X^T\mathbf X)^g\right]^T\mathbf X^T=\mathbf X(\mathbf X^T\mathbf X)^g\mathbf X^T$ follows from the invariance established above, since $\left[(\mathbf X^T\mathbf X)^g\right]^T$ is also a G.I. of $\mathbf X^T\mathbf X$. Result 2.6: $\mathbf I-\mathbf P_\mathbf X$ is the unique symmetric projection onto $\mathcal N(\mathbf X^T)$.
proof: Follows immediately from corollary A.4.
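A sketch verifying the three claims about $\mathbf P_\mathbf X$ numerically: symmetry, idempotence, and invariance to the choice of generalized inverse (NumPy assumed; the rank-deficient ANOVA design and the second g-inverse $\mathbf G_2$ are constructed only for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
# Rank-deficient one-way ANOVA design: 3 groups of 4, p = 4, rank 3.
X = np.column_stack([np.ones(12), np.kron(np.eye(3), np.ones((4, 1)))])
A = X.T @ X

# Two different generalized inverses of X^T X: the pseudoinverse, and a
# perturbation G1 + (I - G1 A) M, which still satisfies A G2 A = A.
G1 = np.linalg.pinv(A)
M = rng.standard_normal((4, 4))
G2 = G1 + (np.eye(4) - G1 @ A) @ M
assert np.allclose(A @ G2 @ A, A)

P1 = X @ G1 @ X.T
P2 = X @ G2 @ X.T
assert np.allclose(P1, P2)                      # invariant to the choice of g-inverse
assert np.allclose(P1, P1.T)                    # symmetric
assert np.allclose(P1 @ P1, P1)                 # idempotent
```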
Example: $\mathcal S=\{\mathbf v\in\mathbb R^N:\mathbf v=c\mathbf 1_N\text{ for some }c\in\mathbb R\}=\mathcal C(\mathbf 1_N)$
$\mathcal T=\{\mathbf w\in\mathbb R^N:\mathbf 1_N^T\mathbf w=0\}=\mathcal N(\mathbf 1_N^T)$
- $\mathcal S\cap\mathcal T=\{\mathbf 0\}$
- $\dim(\mathcal S)+\dim(\mathcal T)=1+N-1=N$ (Theorem A.1)
- $\mathbf v\in\mathcal S,\mathbf w\in\mathcal T$, $\mathbf v^T\mathbf w=c\mathbf 1^T_N\mathbf w=0$
$\mathbf P_{\mathbf 1_N}=\frac{1}{N}\mathbf 1_N\mathbf 1_N^T$, $\mathbf P_{\mathbf 1_N}\mathbf y=\mathbf 1_N\frac{1}{N}\mathbf 1^T_N\mathbf y=\mathbf 1_N\frac{1}{N}\sum_{i=1}^Ny_i=\mathbf 1_N\bar{y}$, $(\mathbf I-\mathbf P_{\mathbf 1_N})\mathbf y=\mathbf y-\mathbf P_{\mathbf 1_N}\mathbf y=\mathbf y-\mathbf 1_N\bar{y}$.
Recall that we have the orthogonal decomposition: $\mathbf y=\mathbf X\hat{\mathbf b}+(\mathbf y-\mathbf X\hat{\mathbf b})=\hat{\mathbf y}+\hat{\mathbf e}$.
Now, $\hat{\mathbf y}=\mathbf X\hat{\mathbf b}=\mathbf X(\mathbf X^T\mathbf X)^g\mathbf X^T\mathbf y=\mathbf P_\mathbf X\mathbf y\in\mathcal C(\mathbf X)$ and $\hat{\mathbf e}=\mathbf y-\mathbf X\hat{\mathbf b}=(\mathbf I-\mathbf P_\mathbf X)\mathbf y\in\mathcal N(\mathbf X^T)$.
Another way to think about minimizing $Q(\mathbf b)$: $$ Q(\mathbf b)=|\mathbf y-\mathbf X\mathbf b|^2=|\mathbf y-\hat{\mathbf y}+\hat{\mathbf y}-\mathbf X\mathbf b|^2=|\hat{\mathbf y}-\mathbf X\mathbf b|^2+|\hat{\mathbf e}|^2, $$ where the second piece cannot be minimized by varying $\mathbf b$. The first piece can be minimized to zero, because the equations $$ \mathbf X\mathbf b=\mathbf P_\mathbf X\mathbf y=\hat{\mathbf y} $$ are consistent because $\mathbf P_\mathbf X\mathbf y\in\mathcal C(\mathbf X)$.
Result 2.7: The solutions to the N.E.s are the same as the solutions to the consistent equations $\mathbf X\mathbf b=\mathbf P_\mathbf X\mathbf y$
proof: Assume that $\hat{\mathbf b}$ solves $\mathbf X\mathbf b=\mathbf P_\mathbf X\mathbf y$. Then $\mathbf X^T\mathbf X\hat{\mathbf b}=\mathbf X^T\mathbf P_\mathbf X\mathbf y=\mathbf X^T\mathbf y$, so $\hat{\mathbf b}$ solves N.E.s.
Now assume that $\hat{\mathbf b}$ solves N.E.s. Then $\mathbf X^T\mathbf X\hat{\mathbf b}=\mathbf X^T\mathbf y=\mathbf X^T\mathbf P_\mathbf X\mathbf y=\mathbf X^T\mathbf X(\mathbf X^T\mathbf X)^g\mathbf X^T\mathbf y$. Using Result 2.4, we have $\mathbf X\hat{\mathbf b}=\mathbf X(\mathbf X^T\mathbf X)^g\mathbf X^T\mathbf y=\mathbf P_\mathbf X\mathbf y$, so $\hat{\mathbf b}$ solves $\mathbf X\mathbf b=\mathbf P_\mathbf X\mathbf y$.
Theorem 2.2: If $\mathcal C(\mathbf W)\subset\mathcal C(\mathbf X)$, then $\mathbf P_\mathbf X-\mathbf P_\mathbf W$ is the symmetric projection onto $\mathcal C\left((\mathbf I-\mathbf P_\mathbf W)\mathbf X\right)$.
proof: We first show that $\mathbf P_\mathbf X-\mathbf P_\mathbf W$ is idempotent. $$ (\mathbf P_\mathbf X-\mathbf P_\mathbf W)^2=\mathbf P_\mathbf X^2-\mathbf P_\mathbf X\mathbf P_\mathbf W-\mathbf P_\mathbf W\mathbf P_\mathbf X+\mathbf P^2_\mathbf W=\mathbf P_\mathbf X-\mathbf P_\mathbf X\mathbf P_\mathbf W-\mathbf P_\mathbf W\mathbf P_\mathbf X+\mathbf P_\mathbf W. $$ Now, for any $\mathbf z$, $\mathbf P_\mathbf W\mathbf z\in\mathcal C(\mathbf W)\subset\mathcal C(\mathbf X)$. So $\mathbf P_\mathbf X\mathbf P_\mathbf W\mathbf z=\mathbf P_\mathbf W\mathbf z$, $\forall \mathbf z$. Hence $\mathbf P_\mathbf X\mathbf P_\mathbf W=\mathbf P_\mathbf W$ (Corollary A.1).
Then, since $\mathbf P_\mathbf X\mathbf P_\mathbf W=\mathbf P_\mathbf W$ and both projections are symmetric, $$ \mathbf P_\mathbf W\mathbf P_\mathbf X=(\mathbf P_\mathbf X\mathbf P_\mathbf W)^T=\mathbf P_\mathbf W^T=\mathbf P_\mathbf W\\ \Rightarrow (\mathbf P_\mathbf X-\mathbf P_\mathbf W)^2=\mathbf P_\mathbf X-\mathbf P_\mathbf W-\mathbf P_\mathbf W+\mathbf P_\mathbf W=\mathbf P_\mathbf X-\mathbf P_\mathbf W. $$ Second, any vector $\mathbf u$ can be decomposed as $\mathbf u=\mathbf X\mathbf s+\mathbf t$ where $\mathbf t\in\mathcal N(\mathbf X^T)$ (Results A.4 and A.5). So $$ \begin{aligned} (\mathbf P_\mathbf X-\mathbf P_\mathbf W)\mathbf u&=(\mathbf P_\mathbf X-\mathbf P_\mathbf W)(\mathbf X\mathbf s+\mathbf t)\\ &=\mathbf P_\mathbf X\mathbf X\mathbf s-\mathbf P_\mathbf W\mathbf X\mathbf s+\mathbf P_\mathbf X\mathbf t-\mathbf P_\mathbf W\mathbf t\\ &=\mathbf X\mathbf s-\mathbf P_\mathbf W\mathbf X\mathbf s+\mathbf 0-\mathbf P_\mathbf W\mathbf P_\mathbf X\mathbf t\\ &=\mathbf X\mathbf s-\mathbf P_\mathbf W\mathbf X\mathbf s+\mathbf 0-\mathbf 0\\ &=(\mathbf I-\mathbf P_\mathbf W)\mathbf X\mathbf s. \end{aligned} $$ So for any $\mathbf u$, $(\mathbf P_\mathbf X-\mathbf P_\mathbf W)\mathbf u\in\mathcal C((\mathbf I-\mathbf P_\mathbf W)\mathbf X)$.
Third, if $\mathbf y\in\mathcal C((\mathbf I-\mathbf P_\mathbf W)\mathbf X)$, then $\mathbf y=(\mathbf I-\mathbf P_\mathbf W)\mathbf X\mathbf c$ for some $\mathbf c$. $$ \begin{aligned} (\mathbf P_\mathbf X-\mathbf P_\mathbf W)\mathbf y&=(\mathbf P_\mathbf X-\mathbf P_\mathbf W)(\mathbf I-\mathbf P_\mathbf W)\mathbf X\mathbf c\\ &=(\mathbf P_\mathbf X-\mathbf P_\mathbf W)\mathbf X\mathbf c-(\mathbf P_\mathbf X-\mathbf P_\mathbf W)\mathbf P_\mathbf W\mathbf X\mathbf c\\ &=\mathbf X\mathbf c-\mathbf P_\mathbf W\mathbf X\mathbf c-\mathbf P_\mathbf X\mathbf P_\mathbf W\mathbf X\mathbf c+\mathbf P_\mathbf W\mathbf X\mathbf c\\ &=(\mathbf I-\mathbf P_\mathbf W)\mathbf X\mathbf c=\mathbf y \end{aligned} $$
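A numerical check of Theorem 2.2 (NumPy assumed; the nested designs $\mathbf W\subset\mathbf X$ and the helper `proj` are made up for illustration):

```python
import numpy as np

def proj(X):
    """Symmetric projection onto C(X): P_X = X (X^T X)^g X^T."""
    return X @ np.linalg.pinv(X.T @ X) @ X.T

rng = np.random.default_rng(5)
N = 10
W = np.ones((N, 1))                                      # intercept-only design
X = np.column_stack([W, rng.standard_normal((N, 2))])    # contains W, so C(W) is in C(X)

PX, PW = proj(X), proj(W)
D = PX - PW
assert np.allclose(D, D.T) and np.allclose(D @ D, D)     # symmetric and idempotent

Z = (np.eye(N) - PW) @ X
assert np.allclose(D @ Z, Z)       # D fixes vectors in C((I - P_W) X)
assert np.allclose(proj(Z), D)     # and equals the symmetric projection onto that space
```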
2.3 Reparametrization
One-way ANOVA:
- $y_{ij}=\mu+\alpha_i+e_{ij}$, $i=1,2,3$
- $y_{ij}=d_i+e_{ij}$
- $y_{ij}=c_1+c_2\mathbb 1_{\{1\}}(i)+c_3\mathbb 1_{\{2\}}(i)+e_{ij}$
These three versions of the ANOVA model should lead to equivalent inferences.
Def: Two L.M.s, $\mathbf y=\mathbf X\mathbf b+\mathbf e$, where $\mathbf X$ is $N\times p$, and $\mathbf y=\mathbf W\mathbf c+\mathbf e$, where $\mathbf W$ is $N\times t$, are equivalent, or reparametrizations of each other, if the two design matrices, $\mathbf X$ and $\mathbf W$, have the same column space, i.e., $\mathcal C(\mathbf X)=\mathcal C(\mathbf W)$.
Back to ANOVA examples: $$ \mathbf X\mathbf b=\begin{pmatrix}\mathbf 1_{n_1}&\mathbf 1_{n_1}&\mathbf 0&\mathbf 0\\ \mathbf 1_{n_2}&\mathbf 0&\mathbf 1_{n_2}&\mathbf 0\\\mathbf 1_{n_3}&\mathbf 0&\mathbf 0&\mathbf 1_{n_3}\end{pmatrix}\begin{pmatrix}\mu\\\alpha_1\\\alpha_2\\\alpha_3\end{pmatrix} $$
$$ \mathbf W\mathbf c=\begin{pmatrix}\mathbf 1_{n_1}&\mathbf 1_{n_1}&\mathbf 0\\\mathbf 1_{n_2}&\mathbf 0&\mathbf 1_{n_2}\\\mathbf 1_{n_3}&\mathbf 0&\mathbf 0\end{pmatrix}\begin{pmatrix}c_1\\c_2\\c_3\\\end{pmatrix} $$
$\mathcal C(\mathbf X)=\mathcal C(\mathbf W)$: Note that in $\mathbf X$, column 4 = column 1 − column 2 − column 3.
Result 2.8: If two L.M.s are equivalent, then $\mathbf P_\mathbf X=\mathbf P_\mathbf W$.
Corollary 2.4: If two L.M.s are equivalent, then $\hat{\mathbf y}$ and $\hat{\mathbf e}$ are the same for the two models.
proof: $\hat{\mathbf y}=\mathbf P_\mathbf X\mathbf y$, $\hat{\mathbf e}=(\mathbf I-\mathbf P_\mathbf X)\mathbf y$.
Assume $\mathcal C(\mathbf X)=\mathcal C(\mathbf W)$. Result A.2 implies the existence of two matrices, $\mathbf S$ and $\mathbf T$, such that $\mathbf W=\mathbf X\mathbf T$ and $\mathbf X=\mathbf W\mathbf S$.
Result 2.9: Assume $\mathcal C(\mathbf X)=\mathcal C(\mathbf W)$. If $\hat{\mathbf c}$ solves the N.E.s $\mathbf W^T\mathbf W\mathbf c=\mathbf W^T\mathbf y$, then $\hat{\mathbf b}=\mathbf T\hat{\mathbf c}$ solves the N.E.s $\mathbf X^T\mathbf X\mathbf b=\mathbf X^T\mathbf y$.
proof:
$$
\mathbf X^T\mathbf X\mathbf T\hat{\mathbf c}=\mathbf X^T\mathbf W\hat{\mathbf c}=\mathbf X^T\mathbf P_\mathbf W\mathbf y=\mathbf X^T\mathbf P_\mathbf X\mathbf y=\mathbf X^T\mathbf y.
$$
Example: $y_{ij}=\mu+\alpha_i+e_{ij}$, $i=1,2,3$.
$$
\hat{\mathbf b}=\begin{pmatrix}\bar{y}_{\cdot\cdot}\\
\bar{y}_{1\cdot}-\bar{y}_{\cdot\cdot}\\\bar{y}_{2\cdot}-\bar{y}_{\cdot\cdot}\\\bar{y}_{3\cdot}-\bar{y}_{\cdot\cdot}\end{pmatrix}
$$
General solution:
$$
\begin{pmatrix}0\\\bar{y}_{1\cdot}\\\bar{y}_{2\cdot}\\\bar{y}_{3\cdot}\end{pmatrix}+z\begin{pmatrix}1\\-1\\-1\\-1\end{pmatrix},\quad\text{for any }z\in\mathbb R
$$
$y_{ij}=c_1+c_2\mathbb 1_{\{1\}}(i)+c_3\mathbb 1_{\{2\}}(i)+e_{ij}$, $i=1,2,3$
$$
\hat{\mathbf c}=\begin{pmatrix}\bar{y}_{3\cdot}\\\bar{y}_{1\cdot}-\bar{y}_{3\cdot}\\\bar{y}_{2\cdot}-\bar{y}_{3\cdot}\end{pmatrix}
$$
For the $\mathbf W$ and $\mathbf X$ presented before, we have $\mathbf W=\mathbf X\mathbf T$, where $\mathbf T$ is given by
$$
\mathbf T=\begin{pmatrix}1&0&0\\0&1&0\\0&0&1\\0&0&0\end{pmatrix}
$$
$$ \hat{\mathbf b}=\mathbf T\hat{\mathbf c}=\begin{pmatrix}1&0&0\\0&1&0\\0&0&1\\0&0&0\end{pmatrix}\begin{pmatrix}\bar{y}_{3\cdot}\\\bar{y}_{1\cdot}-\bar{y}_{3\cdot}\\\bar{y}_{2\cdot}-\bar{y}_{3\cdot}\end{pmatrix}=\begin{pmatrix}\bar{y}_{3\cdot}\\\bar{y}_{1\cdot}-\bar{y}_{3\cdot}\\\bar{y}_{2\cdot}-\bar{y}_{3\cdot}\\0\end{pmatrix} $$ which is the general solution above with $z=\bar{y}_{3\cdot}$.
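A sketch of this reparametrization check in code (NumPy assumed; the three group sizes and the simulated responses are made up): it confirms $\mathbf W=\mathbf X\mathbf T$, that $\hat{\mathbf b}=\mathbf T\hat{\mathbf c}$ solves the $\mathbf X$ normal equations, and that the fitted values agree.

```python
import numpy as np

rng = np.random.default_rng(6)
n = [4, 3, 5]                                   # hypothetical group sizes
g = np.repeat([0, 1, 2], n)                     # group label (0-based) of each observation
y = rng.standard_normal(len(g)) + np.array([1.0, 2.0, 3.0])[g]

# X parametrization (mu, alpha_1, alpha_2, alpha_3): rank deficient.
X = np.column_stack([np.ones(len(g))] + [(g == i).astype(float) for i in range(3)])
# W parametrization (c_1, c_2 = 1{i=1}, c_3 = 1{i=2}): full column rank.
W = np.column_stack([np.ones(len(g)), (g == 0).astype(float), (g == 1).astype(float)])

T = np.vstack([np.eye(3), np.zeros((1, 3))])    # W = X T
assert np.allclose(X @ T, W)

c_hat = np.linalg.solve(W.T @ W, W.T @ y)       # unique solution of the W normal equations
b_hat = T @ c_hat
assert np.allclose(X.T @ X @ b_hat, X.T @ y)    # T c_hat solves the X normal equations
assert np.allclose(X @ b_hat, W @ c_hat)        # identical fitted values
```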
2.4 A Version of the QR Decomposition
Result: Let $\mathbf W_{N\times p}$ have full column rank. Then $\mathbf W$ can be decomposed as $\mathbf W=\mathbf Q\mathbf R$ where $\mathbf Q_{N\times p}$ is s.t. $\mathbf Q^T\mathbf Q=\mathbf I_p$ and $\mathbf R_{p\times p}$ is upper triangular, with positive elements on the diagonal.
Idea of the proof: Write the columns of $\mathbf W$ as $\mathbf W_{\cdot1},\mathbf W_{\cdot2},\cdots,\mathbf W_{\cdot p}$. We begin by constructing a set of orthogonal vectors $\mathbf U_{\cdot1},\mathbf U_{\cdot2},\cdots,\mathbf U_{\cdot p}$, each of which is a linear combination of the columns of $\mathbf W$, and such that for each $k=1,2,\cdots,p$, we have $span\{\mathbf W_{\cdot1},\mathbf W_{\cdot2},\cdots,\mathbf W_{\cdot k}\}=span\{\mathbf U_{\cdot1},\mathbf U_{\cdot2},\cdots,\mathbf U_{\cdot k}\}$.
Taking $k=p$ shows that $\mathcal C(\mathbf W)=\mathcal C(\mathbf U)$, where $\mathbf U$ is the matrix whose columns are $\mathbf U_{\cdot1},\mathbf U_{\cdot2},\cdots,\mathbf U_{\cdot p}$.
Since the $\mathbf U_{\cdot i}$’s are orthogonal, we have $\mathbf U^T\mathbf U=\mathbf D$ where $\mathbf D$ is a diagonal matrix with $k$-th diagonal element equal to $|\mathbf U_{\cdot k}|^2$.
So, if we take $\mathbf Q=\mathbf U\mathbf D^{-\frac{1}{2}}$, then $\mathbf Q^T\mathbf Q=\mathbf D^{-1/2}\mathbf U^T\mathbf U\mathbf D^{-1/2}=\mathbf D^{-1/2}\mathbf D^{1/2}\mathbf D^{1/2}\mathbf D^{-1/2}=\mathbf I$.
proof: We construct $\mathbf U_{\cdot 1},\mathbf U_{\cdot2},\cdots,\mathbf U_{\cdot p}$ through a sequence of least squares regressions. Recall basic facts about L.S. theory:
$\mathbf y=\mathbf X\mathbf b+\mathbf e$, $\mathbf X^T\mathbf X\mathbf b=\mathbf X^T\mathbf y$, $\hat{\mathbf b}=(\mathbf X^T\mathbf X)^g\mathbf X^T\mathbf y$, $\hat{\mathbf e}=\mathbf y-\mathbf X\hat{\mathbf b}$ $$ \mathbf X^T\hat{\mathbf e}=\mathbf X^T(\mathbf y-\mathbf X\hat{\mathbf b})=\mathbf X^T\mathbf y-\mathbf X^T\mathbf X\hat{\mathbf b}=\mathbf 0\ \Rightarrow\hat{\mathbf e}\text{ is orthogonal to every column of }\mathbf X. $$ Step 1: $\mathbf U_{\cdot1}=\mathbf W_{\cdot1}$
Step 2: Regress $\mathbf W_{\cdot2}$ on $\mathbf U_{\cdot1}$; that is, fit the model $\mathbf W_{\cdot2}=\mathbf U_{\cdot1}b+\mathbf e$. Then $$ \hat{b}^{(2)}=\frac{\mathbf U^T_{\cdot1}\mathbf W_{\cdot2}}{\mathbf U^T_{\cdot1}\mathbf U_{\cdot1}}. $$ Now set $\mathbf U_{\cdot 2}=\mathbf W_{\cdot2}-\mathbf U_{\cdot1}\hat{b}^{(2)}$ (Essentially $\mathbf U_{\cdot2}$ is $\hat{\mathbf e}$). So we have $\mathbf U_{\cdot1}\perp\mathbf U_{\cdot2}$.
Step 3: Regress $\mathbf W_{\cdot3}$ on $\mathbf U_{\cdot1}$ and $\mathbf U_{\cdot2}$; that is, fit the model $\mathbf W_{\cdot3}=(\mathbf U_{\cdot1};\mathbf U_{\cdot2})\begin{pmatrix}b_1\\b_2\end{pmatrix}+\mathbf e$. Then $$ \mathbf X^T\mathbf X=\begin{pmatrix}\mathbf U^T_{\cdot1}\\ \mathbf U^T_{\cdot2}\end{pmatrix}(\mathbf U_{\cdot1};\mathbf U_{\cdot2})=\begin{pmatrix}|\mathbf U_{\cdot1}|^2&0\\0&|\mathbf U_{\cdot2}|^2\end{pmatrix} $$ and it is easy to solve N.E. $\mathbf X^T\mathbf X\mathbf b=\mathbf X^T\mathbf y$ directly. For $j=1,2$: $$ \left(\hat{\mathbf b}^{(3)}\right)_j=\frac{\mathbf U^T_{\cdot j}\mathbf W_{\cdot3}}{\mathbf U^T_{\cdot j}\mathbf U_{\cdot j}}. $$ Now set $\mathbf U_{\cdot 3}=\mathbf W_{\cdot 3}-(\mathbf U_{\cdot1};\mathbf U_{\cdot2})\hat{\mathbf b}^{(3)}\Rightarrow \mathbf U_{\cdot3}\perp\mathbf U_{\cdot j}(j=1,2)$.
Now for $i\in\{3,4,5,\cdots,p-1\}$, at step $i+1$, we regress $\mathbf W_{\cdot i+1}$ on $\mathbf U_{\cdot1},\mathbf U_{\cdot2},\cdots,\mathbf U_{\cdot i}$. As above, for $j=1,2,\cdots,i$, $$ \left(\hat{\mathbf b}^{(i+1)}\right)_j=\frac{\mathbf U^T_{\cdot j}\mathbf W_{\cdot i+1}}{\mathbf U^T_{\cdot j}\mathbf U_{\cdot j}}. $$ Now set $\mathbf U_{\cdot i+1}=\mathbf W_{\cdot i+1}-(\mathbf U_{\cdot 1};\mathbf U_{\cdot2};\cdots;\mathbf U_{\cdot i})\hat{\mathbf b}^{(i+1)}\Rightarrow \mathbf U_{\cdot i+1}\perp\mathbf U_{\cdot j}(j=1,2,\cdots,i)$.
Now collect all of the $\hat{\mathbf b}$’s in a $p\times p$ matrix as follows: $$ \mathbf S=\begin{pmatrix} 1 &(\hat{\mathbf b}^{(2)})_1 & (\hat{\mathbf b}^{(3)})_1 &\cdots & (\hat{\mathbf b}^{(p)})_1\\ 0 & 1 & (\hat{\mathbf b}^{(3)})_2 &\cdots & (\hat{\mathbf b}^{(p)})_2\\ 0 &0&1&\cdots&(\hat{\mathbf b}^{(p)})_3 \\ \vdots &\vdots &\vdots&\ddots&\vdots \\ 0 &0&0&\cdots&(\hat{\mathbf b}^{(p)})_{p-1} \\ 0 &0&0&\cdots&1 \end{pmatrix} $$ More formally, $\mathbf S_{j,i+1}=\left(\hat{\mathbf b}^{(i+1)}\right)_j$ for $i=1,2,\cdots,p-1$ and $j=1,2,\cdots,i$. Diagonal elements are all $1$, and all entries below the diagonal are $0$.
Now for $i=1,2,\cdots,p-1$, we have $$ \mathbf U_{\cdot i+1}=\mathbf W_{\cdot i+1}-\sum_{j=1}^i\left(\hat{\mathbf b}^{(i+1)}\right)_j\mathbf U_{\cdot j} =\mathbf W_{\cdot i+1}-\sum_{j=1}^i\mathbf S_{j,i+1}\mathbf U_{\cdot j}\\ \Rightarrow \mathbf W_{\cdot i+1}=\mathbf U_{\cdot i+1}+\sum_{j=1}^i\mathbf S_{j,i+1}\mathbf U_{\cdot j}=\sum_{j=1}^{i+1}\mathbf S_{j,i+1}\mathbf U_{\cdot j}\\ \Rightarrow \mathbf W_{N\times p}=\mathbf U_{N\times p}\mathbf S_{p\times p} $$ Finally, we take $\mathbf Q=\mathbf U\mathbf D^{-\frac{1}{2}}$ and $\mathbf R=\mathbf D^{\frac{1}{2}}\mathbf S$. Then $$ \mathbf Q\mathbf R=\mathbf U\mathbf D^{-\frac{1}{2}}\mathbf D^{\frac{1}{2}}\mathbf S=\mathbf U\mathbf S=\mathbf W $$
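A direct transcription of this Gram–Schmidt construction into code (NumPy assumed; the helper name `gram_schmidt_qr` and the test matrix $\mathbf W$ are made up for illustration):

```python
import numpy as np

def gram_schmidt_qr(W):
    """QR of a full-column-rank W via the successive regressions described above."""
    N, p = W.shape
    U = np.zeros((N, p))
    S = np.eye(p)                              # unit upper triangular
    U[:, 0] = W[:, 0]
    for i in range(1, p):
        for j in range(i):                     # regression coefficients (b_hat^{(i+1)})_j
            S[j, i] = (U[:, j] @ W[:, i]) / (U[:, j] @ U[:, j])
        U[:, i] = W[:, i] - U[:, :i] @ S[:i, i]   # residual, orthogonal to U_1, ..., U_i
    d = np.sqrt(np.sum(U**2, axis=0))          # |U_k|, the square roots of diag(D)
    Q = U / d                                  # Q = U D^{-1/2}
    R = d[:, None] * S                         # R = D^{1/2} S
    return Q, R

rng = np.random.default_rng(7)
W = rng.standard_normal((8, 4))                # full column rank with probability one
Q, R = gram_schmidt_qr(W)
assert np.allclose(Q.T @ Q, np.eye(4))
assert np.allclose(np.triu(R), R) and np.all(np.diag(R) > 0)
assert np.allclose(Q @ R, W)
```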
Chapter 3: Estimability and Least Squares Estimators
GLM: $\mathbf y=\mathbf X\mathbf b+\mathbf e$
We assume $\mathbb E\mathbf e=0$, so $\mathbb E\mathbf y=\mathbf X\mathbf b$.
Fix $\pmb\lambda\in\mathbb R^p$
Def 3.1: An estimator $t(\mathbf y)$ is unbiased for $\pmb\lambda^T\mathbf b$ if $\mathbb E t(\mathbf y)=\pmb\lambda^T\mathbf b$, $\forall \mathbf b\in\mathbb R^p$.
Def 3.2: An estimator $t(\mathbf y)$ is a linear estimator if $t(\mathbf y)=c+\mathbf a^T\mathbf y$ where $c\in\mathbb R$ and $\mathbf a\in\mathbb R^N$ are known constants.
Def 3.3: A function $\pmb\lambda^T\mathbf b$ is (linearly) estimable if $\exist$ a linear unbiased estimator of it; otherwise $\pmb\lambda^T\mathbf b$ is called non-estimable.
Result 3.1: $\pmb\lambda^T\mathbf b$ is estimable iff $\exist \mathbf a\in\mathbb R^N$ such that $\pmb\lambda=\mathbf X^T\mathbf a$, or, in other words, $\pmb\lambda\in\mathcal C(\mathbf X^T)$.
proof: ($\Leftarrow$) Assume there exists $\mathbf a\in\mathbb R^N$ such that $\mathbf X^T\mathbf a=\pmb\lambda$. Then $\mathbb E(\mathbf a^T\mathbf y)=\mathbf a^T\mathbb E(\mathbf y)=\mathbf a^T\mathbf X\mathbf b=\pmb\lambda^T\mathbf b$.
($\Rightarrow$) Assume that $\pmb\lambda^T\mathbf b$ is estimable. Then $\exist c,\mathbf a$ such that $\mathbb E(c+\mathbf a^T\mathbf y)=\pmb\lambda^T\mathbf b$, $\forall \mathbf b\in\mathbb R^p$. Then, $c+\mathbf a^T\mathbf X\mathbf b-\pmb\lambda^T\mathbf b=0$, $\forall \mathbf b\in\mathbb R^p\Rightarrow c+(\mathbf a^T\mathbf X-\pmb\lambda^T)\mathbf b=0$, $\forall \mathbf b\in\mathbb R^p$. Using Result A.8, we have $c=0$ and $\mathbf a^T\mathbf X-\pmb\lambda^T=\mathbf 0$ or $\pmb\lambda=\mathbf X^T\mathbf a$.
Example 3.1: One-way ANOVA with $a=n=2$. $y_{ij}=\mu+\alpha_i+e_{ij}$, $i=1,2$, $j=1,2$. $\mathbf X\mathbf b=\begin{pmatrix}1&1&0\\1&1&0\\1&0&1\\1&0&1\end{pmatrix}\begin{pmatrix}\mu\\\alpha_1\\\alpha_2\end{pmatrix}$.
Q: Is $\alpha_1$ estimable?
A: First, $\alpha_1=\pmb\lambda^T\mathbf b=(0,1,0)(\mu,\alpha_1,\alpha_2)^T$. Is $(0,1,0)^T\in\mathcal C(\mathbf X^T)$? $$ \begin{pmatrix} 1&1&1&1\\1&1&0&0\\0&0&1&1 \end{pmatrix} \begin{pmatrix} a_1\\a_2\\a_3\\a_4 \end{pmatrix}=\begin{pmatrix}0\\1\\0\end{pmatrix} $$
$$ \Rightarrow \sum_{i=1}^4a_i=0,\quad a_1+a_2=1,\quad a_3+a_4=0 $$
This is impossible. So $(0,1,0)^T\not\in\mathcal C(\mathbf X^T)\Rightarrow\alpha_1$ is not estimable.
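The same conclusion can be checked numerically: $\pmb\lambda\in\mathcal C(\mathbf X^T)$ exactly when appending $\pmb\lambda$ as an extra column does not increase the rank of $\mathbf X^T$ (a sketch assuming NumPy; the helper `is_estimable` is made up for illustration):

```python
import numpy as np

# One-way ANOVA with a = n = 2, as in Example 3.1.
X = np.array([[1., 1., 0.],
              [1., 1., 0.],
              [1., 0., 1.],
              [1., 0., 1.]])

def is_estimable(lam, X):
    """lambda^T b is estimable iff lambda lies in C(X^T)."""
    Xt = X.T
    return np.linalg.matrix_rank(np.column_stack([Xt, lam])) == np.linalg.matrix_rank(Xt)

print(is_estimable(np.array([0., 1., 0.]), X))    # alpha_1:             False
print(is_estimable(np.array([1., 1., 0.]), X))    # mu + alpha_1:        True
print(is_estimable(np.array([0., 1., -1.]), X))   # alpha_1 - alpha_2:   True
```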
Result 3.1a: $\pmb\lambda^T\mathbf b$ is estimable if and only if $\pmb\lambda^T\mathbf b$ is a linear combination of the expected values of the $ y_i$’s.
proof: ($\Rightarrow$) Assume that $\pmb\lambda^T\mathbf b$ is estimable. Then $\exist$ $\mathbf a\in\mathbb R^N$ such that $\pmb\lambda=\mathbf X^T\mathbf a$. Then $\pmb\lambda^T\mathbf b=\mathbf a^T\mathbf X\mathbf b=\mathbf a^T\mathbb E(\mathbf y)$.
($\Leftarrow$) Assume that $\pmb\lambda^T\mathbf b$ is a linear combination of the elements of $\mathbf X\mathbf b$. Then, for all $\mathbf b\in\mathbb R^p$, $\pmb\lambda^T\mathbf b=\mathbf a^T\mathbf X\mathbf b\Rightarrow\pmb\lambda^T=\mathbf a^T\mathbf X$ or $\pmb\lambda=\mathbf X^T\mathbf a$, i.e., $\pmb\lambda\in\mathcal C(\mathbf X^T)$. So $\pmb\lambda^T\mathbf b$ is estimable.
*If we take $\mathbf a^T=(\mathbf e^{(i)})^T$, where $\mathbf e^{(i)}$ is the $i$-th standard basis vector, then $\mathbf a^T\mathbb E\mathbf y=\mathbb E y_i$. So the expected value of any $y_i$ is estimable.
Suppose $\pmb\lambda^{(j)}\in\mathbb R^p$, $j=1,2,\cdots,k$. If $\pmb\lambda^{(j)T}\mathbf b$ are all estimable, so is $\sum_{j=1}^kd_j\pmb\lambda^{(j)T}\mathbf b=\left[\sum_{j=1}^kd_j\pmb\lambda^{(j)}\right]^T\mathbf b$ for any $\{d_j\}_{j=1}^k$.
Def: A set of estimable functions $\{\pmb\lambda^{(j)T}\mathbf b\}_{j=1}^k$ is called linearly independent if the $\pmb\lambda^{(j)}$’s are linearly independent.
Let $r=rank(\mathbf X^T)\leq p$, so that $\dim(\mathcal C(\mathbf X^T))=rank(\mathbf X^T)=r$. Thus, any set of L.I. estimable functions cannot have more than $r$ members.
Suppose that $\mathbf X$ has full column rank, $r=p$. Then $\mathcal C(\mathbf X^T)=\mathbb R^p$ and $\pmb\lambda^T\mathbf b$ is estimable for any $\pmb\lambda\in\mathbb R^p$.
Q: How do we establish that $\pmb\lambda^T\mathbf b$ is estimable?
A: We have three methods.
Method 3.1: Show that $\pmb\lambda^T\mathbf b$ can be written as a linear combination of the elements of $\mathbf X\mathbf b=\mathbb E(\mathbf y)$.
Method 3.2: Construct a set of basis vectors for $\mathcal C(\mathbf X^T)$, call them $\mathbf v^{(1)},\cdots,\mathbf v^{(r)}$ and show that $\pmb\lambda=\sum_{j=1}^rd_j\mathbf v^{(j)}$.
Method 3.3: Construct a set of basis vectors for $\mathcal N(\mathbf X)$, call them $\mathbf c^{(1)},\cdots,\mathbf c^{(p-r)}$. If $\pmb\lambda\perp\mathbf c^{(j)}$ for $j=1,2,\cdots,p-r$, then $\pmb\lambda\in\mathcal C(\mathbf X^T)$.
Note that $\mathbf X^T\mathbf X(\mathbf X^T\mathbf X)^g$ is a projection onto $\mathcal C(\mathbf X^T\mathbf X)=\mathcal C(\mathbf X^T)$. So if $(\mathbf X^T\mathbf X)^g$ has already been calculated and $\mathbf X^T\mathbf X(\mathbf X^T\mathbf X)^g\pmb\lambda=\pmb\lambda$, then $\pmb\lambda\in\mathcal C(\mathbf X^T)$, i.e., $\pmb\lambda^T\mathbf b$ is estimable.
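A sketch of this projection-based check (NumPy assumed; the a = n = 2 design of Example 3.1 is reused purely for illustration):

```python
import numpy as np

X = np.array([[1., 1., 0.],
              [1., 1., 0.],
              [1., 0., 1.],
              [1., 0., 1.]])
XtX = X.T @ X
P = XtX @ np.linalg.pinv(XtX)                   # projection onto C(X^T X) = C(X^T)

for lam, name in [(np.array([0., 1., 0.]), "alpha_1"),
                  (np.array([0., 1., -1.]), "alpha_1 - alpha_2")]:
    print(name, "estimable:", np.allclose(P @ lam, lam))
```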
Examples:
One-way ANOVA: $y_{ij}=\mu+\alpha_i+e_{ij}$, $i=1,\cdots,a$, $j=1,\cdots,n_i$. $\mathbf y=(y_{11},\cdots,y_{1n_1},y_{21},\cdots,y_{2n_2},\cdots,y_{a1},\cdots, y_{an_a})^T$, $\mathbf X\mathbf b=\begin{pmatrix}\mathbf 1_{n_1}&\mathbf 1_{n_1}&\mathbf0&\cdots&\mathbf 0\\\mathbf 1_{n_2}&\mathbf 0&\mathbf 1_{n_2}&\cdots&\mathbf 0\\\vdots&\vdots&\vdots&\ddots&\vdots\\\mathbf 1_{n_a}&\mathbf0&\mathbf0&\cdots&\mathbf1_{n_a}\end{pmatrix}\begin{pmatrix}\mu\\\alpha_1\\\vdots\\\alpha_a\end{pmatrix}$, $p=a+1$, $N=\sum_{i=1}^an_i$, $rank(\mathbf X)=r=a<p$.
$\dim(\mathcal C(\mathbf X^T))+\dim(\mathcal N(\mathbf X))=a+1\Rightarrow\dim(\mathcal N(\mathbf X))=1$. To find a basis for $\mathcal N(\mathbf X)$, all we need is to find a single vector, $\mathbf c$, such that $\mathbf X\mathbf c=\mathbf 0$. Note that $\mathbf c=\begin{pmatrix}1\\-\mathbf 1_a\end{pmatrix}$ is a basis for $\mathcal N(\mathbf X)$. Now let’s use this basis to figure out which functions $\pmb\lambda^T\mathbf b=\lambda_0\mu+\sum_{i=1}^a\lambda_i\alpha_i$ are estimable. According to Method 3.3, $\pmb\lambda^T\mathbf b$ is estimable iff $\pmb\lambda^T\mathbf c=0$. Hence, $\mu+\alpha_i$ and $\alpha_i-\alpha_k(i\not=k)$ are estimable, but $\alpha_i$ is not estimable. Note that $\sum_{i=1}^ad_i\alpha_i$ is estimable iff $\sum_{i=1}^ad_i=0$; when $\sum_{i=1}^ad_i=0$, $\sum_{i=1}^ad_i\alpha_i$ is called a contrast.
Def 3.4: The least squares (L.S.) estimator of the estimable function $\pmb\lambda^T\mathbf b$ is $\pmb\lambda^T\hat{\mathbf b}$, where $\hat{\mathbf b}$ is any solution to the N.E.s.
Result 3.2: $\pmb\lambda^T\hat{\mathbf b}$ is invariant to the choice of a solution to the N.E.s.
proof: Suppose that $\hat{\mathbf b}_1$ and $\hat{\mathbf b}_2$ are both solutions to the N.E.s. Then $\mathbf X^T\mathbf X(\hat{\mathbf b}_1-\hat{\mathbf b}_2)=\mathbf 0$. In other words, $\hat{\mathbf b}_1-\hat{\mathbf b}_2\in\mathcal N(\mathbf X^T\mathbf X)=\mathcal N(\mathbf X)$. But $\pmb\lambda\in\mathcal C(\mathbf X^T)\perp\mathcal N(\mathbf X)$. So $\pmb\lambda\perp(\hat{\mathbf b}_1-\hat{\mathbf b}_2)$, or $\pmb\lambda^T(\hat{\mathbf b}_1-\hat{\mathbf b}_2)=0$, or $\pmb\lambda^T\hat{\mathbf b}_1=\pmb\lambda^T\hat{\mathbf b}_2$.
Result 3.3: The L.S. estimator $\pmb\lambda^T\hat{\mathbf b}$ of an estimable function, $\pmb\lambda^T\mathbf b$, is a linear unbiased estimator of $\pmb\lambda^T\mathbf b$.
proof: Since $\pmb\lambda^T\mathbf b$ is estimable, $\pmb\lambda=\mathbf X^T\mathbf a$ for some $\mathbf a$, so $\pmb\lambda^T\hat{\mathbf b}=\mathbf a^T\mathbf X(\mathbf X^T\mathbf X)^g\mathbf X^T\mathbf y=\mathbf a^T\mathbf P_\mathbf X\mathbf y$, which is linear in $\mathbf y$.
$\mathbb E(\pmb\lambda^T\hat{\mathbf b})=\mathbb E(\mathbf a^T\mathbf P_\mathbf X\mathbf y)=\mathbf a^T\mathbf P_\mathbf X\mathbb E(\mathbf y)=\mathbf a^T\mathbf P_\mathbf X\mathbf X\mathbf b=\mathbf a^T\mathbf X\mathbf b=\pmb\lambda^T\mathbf b$.
Example 1: Back to one-way ANOVA example: $y_{ij}=\mu+\alpha_i+e_{ij}$, $i=1,\cdots,a$, $j=1,\cdots,n_i$.
Let’s look at the least squares estimators of some estimable functions. Start with a solution to the N.E.s. $$ \mathbf X^T\mathbf X=\begin{pmatrix}N&n_1&n_2&\cdots&n_a\\n_1&n_1&0&\cdots& 0\\n_2&0&n_2&\cdots&0\\\vdots&\vdots&\vdots&\ddots&\vdots\\n_a&0&0&\cdots&n_a\end{pmatrix},\quad \mathbf X^T\mathbf y= \begin{pmatrix}N\bar{y}_{\cdot\cdot}\\n_1\bar{y}_{1\cdot}\\n_2\bar{y}_{2\cdot}\\\vdots\\n_a\bar{y}_{a\cdot}\end{pmatrix} $$ A simple G.I. of $\mathbf X^T\mathbf X$ is: $$ \begin{pmatrix} 0&0&0&\cdots&0\\ 0&1/n_1&0&\cdots&0\\ 0&0&1/n_2&\cdots&0\\ \vdots&\vdots&\vdots&\ddots&\vdots\\ 0&0&0&\cdots&1/n_a \end{pmatrix} $$ This leads to the following general solution to the N.E.s: $$ \hat{\mathbf b}=\begin{pmatrix}0\\\bar{y}_{1\cdot}\\\bar{y}_{2\cdot}\\\vdots\\\bar{y}_{a\cdot}\end{pmatrix}+z\begin{pmatrix}-1\\1\\1\\\vdots\\1\end{pmatrix}\quad\text{ for any }z\in\mathbb R. $$ L.S. estimator of $\pmb\lambda^T\mathbf b$ is $\pmb\lambda^T\hat{\mathbf b}$: $$ \begin{aligned} &\mu+\alpha_i:\quad\pmb\lambda^T\hat{\mathbf b}=(0-z)+(\bar{y}_{i\cdot}+z)=\bar{y}_{i\cdot}\\ &\alpha_i-\alpha_k:\quad\pmb\lambda^T\hat{\mathbf b}=(\bar{y}_{i\cdot}+z)-(\bar{y}_{k\cdot}+z)=\bar{y}_{i\cdot}-\bar{y}_{k\cdot}\\ &\sum_{i=1}^ad_i\alpha_i:\quad\pmb\lambda^T\hat{\mathbf b}=\sum_{i=1}^ad_i(\bar{y}_{i\cdot}+z)=\sum_{i=1}^ad_i\bar{y}_{i\cdot}+z\sum_{i=1}^ad_i=\sum_{i=1}^ad_i\bar{y}_{i\cdot} \end{aligned} $$ Example 2: Two-way crossed model without interaction: $y_{ij}=\mu+\alpha_i+\beta_j+e_{ij}$, $i=1,\cdots,a$, $j=1,\cdots,b$. $N=ab$ and $p=a+b+1$. $$ \mathbf y=(y_{11},\cdots,y_{1b},y_{21},\cdots,y_{2b},\cdots,y_{a1},\cdots,y_{ab})^T $$
$$ \mathbf X\mathbf b=\begin{pmatrix}\mathbf 1_b&\mathbf 1_b&\mathbf 0&\cdots&\mathbf 0&\mathbf I_b\\\mathbf 1_b&\mathbf 0&\mathbf 1_b&\cdots&\mathbf 0&\mathbf I_b\\\vdots&\vdots&\vdots&\ddots&\vdots&\vdots\\\mathbf 1_b&\mathbf 0&\mathbf 0&\cdots&\mathbf 1_b&\mathbf I_b \end{pmatrix}\begin{pmatrix}\mu\\\alpha_1\\\vdots\\\alpha_a\\\beta_1\\\vdots\\\beta_b\end{pmatrix} $$
$rank(\mathbf X)=a+b-1$. Reason: The first column is the sum of columns $2$ through $a+1$ (and also the sum of columns $a+2$ through $a+b+1$). The second column is the sum of columns $a+2$ through $a+b+1$ minus the sum of columns $3$ through $a+1$. Remaining $a+b-1$ columns are linearly independent.
Now, $dim(\mathcal N(\mathbf X))=2$. Here is a basis for $\mathcal N(\mathbf X)$: $$ \mathbf c^{(1)}=\begin{pmatrix}1\\-\mathbf 1_a\\\mathbf 0_b\end{pmatrix},\quad \mathbf c^{(2)}=\begin{pmatrix}1\\\mathbf 0_a\\-\mathbf 1_b\end{pmatrix} $$ Now let’s write $\pmb\lambda^T\mathbf b=\lambda_0\mu+\sum_{i=1}^a\lambda_i\alpha_i+\sum_{j=1}^b\lambda_{a+j}\beta_j$.
Invoking Method 3.3: $\pmb\lambda^T\mathbf b$ is estimable iff $\lambda_0-\sum_{i=1}^a\lambda_i=0$ and $\lambda_0-\sum_{j=1}^b\lambda_{a+j}=0$.
Common estimable functions:
- $\mu+\alpha_i+\beta_j$
- $\alpha_i-\alpha_k$
- $\beta_j-\beta_k$
- $\sum_{i=1}^ad_i\alpha_i$ (with $\sum_{i=1}^ad_i=0$)
- $\sum_{j=1}^bf_j\beta_j$ (with $\sum_{j=1}^bf_j=0$)
Suppose that $a=3$ and $b=4$. $rank(\mathbf X)=6$. Here is a set of linearly independent estimable functions: $$ \{\mu+\alpha_1+\beta_1,\alpha_2-\alpha_1,\alpha_3-\alpha_1,\beta_2-\beta_1,\beta_3-\beta_1,\beta_4-\beta_1\} $$ Now, consider what happens when there is some missing data.
Eg 1:
|   | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| 1 | X | X | X |   |
| 2 | X |   | X |   |
| 3 |   |   | X | X |

*columns index the levels of B ($j$), rows the levels of A ($i$)
So we do not observe $y_{14},y_{22},y_{24},y_{31},y_{32}$. $$ \mathbf X\mathbf b=\begin{pmatrix}1&1&0&0&1&0&0&0\\ 1&1&0&0&0&1&0&0\\ 1&1&0&0&0&0&1&0\\ 1&0&1&0&1&0&0&0\\ 1&0&1&0&0&0&1&0\\ 1&0&0&1&0&0&1&0\\ 1&0&0&1&0&0&0&1\end{pmatrix} \begin{pmatrix} \mu\\\alpha_1\\\alpha_2\\\alpha_3\\\beta_1\\\beta_2\\\beta_3\\\beta_4 \end{pmatrix} $$ $rank(\mathbf X)=6$. And $\mathcal C(\mathbf X^T)$ remains the same, so the same set of $6$ linearly independent estimable functions is still valid.
Eg 2:
|   | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| 1 | X | X | X |   |
| 2 | X | X | X |   |
| 3 |   |   |   | X |

*columns: levels of B; rows: levels of A; X marks an observed cell
In this case, $rank(\mathbf X)=5$. Basis for $\mathcal N(\mathbf X)$: $$ \begin{pmatrix} 1\\-1\\-1\\-1\\0\\0\\0\\0 \end{pmatrix}, \quad \begin{pmatrix} 1\\0\\0\\0\\-1\\-1\\-1\\-1 \end{pmatrix}, \text{ and } \begin{pmatrix} 0\\0\\0\\1\\0\\0\\0\\-1 \end{pmatrix} $$ Here is a set of $5$ linearly independent estimable functions: $$ \{\mu+\alpha_1+\beta_1,\alpha_2-\alpha_1,\beta_2-\beta_1,\beta_3-\beta_1,\mu+\alpha_3+\beta_4\}. $$ Note that $\alpha_3-\alpha_1$ and $\beta_4-\beta_1$ are no longer estimable in this case. There is a simple reason for this: $$ \mathbb Ey_{34}=\mu+\alpha_3+\beta_4 $$ This is the only component of $\mathbf X\mathbf b$ that involves either $\alpha_3$ or $\beta_4$. We cannot differentiate between $\alpha_3$ and $\beta_4$.
Back to Eg 1:
$\mathbb E(y_{33})-\mathbb E(y_{13})=\mu+\alpha_3+\beta_3-(\mu+\alpha_1+\beta_3)=\alpha_3-\alpha_1$.
$\mathbb E(y_{34})-\mathbb E(y_{33})+\mathbb E(y_{13})-\mathbb E(y_{11})=(\mu+\alpha_3+\beta_4)-(\mu+\alpha_3+\beta_3)+(\mu+\alpha_1+\beta_3)-(\mu+\alpha_1+\beta_1)=\beta_4-\beta_1$.
The missing-data pattern is connected in Eg 1, but not in Eg 2.
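A small numerical check of the two missing-data patterns (a sketch under the assumption that numpy is available; the helper `design` and the cell lists are illustrative, encoding the two tables above):

```python
import numpy as np

def design(cells, a=3, b=4):
    """Two-way additive design matrix for the observed (i, j) cells (1-indexed)."""
    X = np.zeros((len(cells), 1 + a + b))
    for r, (i, j) in enumerate(cells):
        X[r, 0] = 1        # mu
        X[r, i] = 1        # alpha_i
        X[r, a + j] = 1    # beta_j
    return X

eg1 = [(1, 1), (1, 2), (1, 3), (2, 1), (2, 3), (3, 3), (3, 4)]   # connected pattern
eg2 = [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3), (3, 4)]   # disconnected pattern
lam = np.array([0, -1, 0, 1, 0, 0, 0, 0])                        # alpha_3 - alpha_1

for cells in (eg1, eg2):
    X = design(cells)
    _, s, Vt = np.linalg.svd(X)
    null_basis = Vt[np.sum(s > 1e-10):]                          # basis for N(X)
    estimable = np.all(np.abs(null_basis @ lam) < 1e-10)
    print(np.linalg.matrix_rank(X), bool(estimable))
# Eg 1: rank 6, alpha_3 - alpha_1 estimable; Eg 2: rank 5, not estimable
```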
3.7 Reparameterization Revisited
- $\mathbf y=\mathbf X\mathbf b+\mathbf e$ ($\mathbf X_{N\times p}$)
- $\mathbf y=\mathbf W\mathbf c+\mathbf e$ ($\mathbf W_{N\times t}$)
Assume the models are equivalent, so $\mathcal C(\mathbf X)=\mathcal C(\mathbf W)$. Then $\exist$ $\mathbf T$, $\mathbf S$, such that $\mathbf W=\mathbf X\mathbf T$ and $\mathbf X=\mathbf W\mathbf S$.
Result 3.4: If $\pmb\lambda^T\mathbf b$ is estimable in the $\mathbf X$ model, and $\hat{\mathbf c}$ solves the N.E.s $\mathbf W^T\mathbf W\mathbf c=\mathbf W^T\mathbf y$, then $\pmb\lambda^T\mathbf T\hat{\mathbf c}$ is the L.S. estimator of $\pmb\lambda^T\mathbf b$.
proof: If $\mathbf W^T\mathbf W\hat{\mathbf c}=\mathbf W^T\mathbf y$, then by Result 2.9, we know that $\mathbf X^T\mathbf X\mathbf T\hat{\mathbf c}=\mathbf X^T\mathbf y$, so the L.S. estimator of $\pmb\lambda^T\mathbf b$ is $\pmb\lambda^T\mathbf T\hat{\mathbf c}$.
Result 3.5: If $\mathbf q^T\mathbf c$ is estimable to the $\mathbf W$ model (so $\mathbf q\in\mathcal C(\mathbf W^T)$), then $\mathbf q^T\mathbf S\mathbf b$ is estimable in the $\mathbf X$ model, and the L.S. estimator is $\mathbf q^T\hat{\mathbf c}$ where $\mathbf W^T\mathbf W\hat{\mathbf c}=\mathbf W^T\mathbf y$.
proof: $\mathbf q\in\mathcal C(\mathbf W^T)\Rightarrow \mathbf q=\mathbf W^T\mathbf a$ for some $\mathbf a$. So $\mathbf S^T\mathbf q=\mathbf S^T\mathbf W^T\mathbf a=\mathbf X^T\mathbf a\in\mathcal C(\mathbf X)$. So $\mathbf q^T\mathbf S\mathbf b$ is estimable in the $\mathbf X$ model. Now, the estimator is $$ \mathbf q^T\mathbf S\hat{\mathbf b}=\mathbf q^T\mathbf S\mathbf T\hat{\mathbf c}=\mathbf a^T\mathbf W\mathbf S\mathbf T\hat{\mathbf c}=\mathbf a^T\mathbf X\mathbf T\hat{\mathbf c}=\mathbf a^T\mathbf W\hat{\mathbf c}=\mathbf q^T\hat{\mathbf c}. $$
*Note: The converse of Result 3.5 does not hold. That is, $\mathbf S^T\mathbf q\in\mathcal C(\mathbf X^T)\not\Rightarrow\mathbf q\in\mathcal C(\mathbf W^T)$. Indeed, $\mathbf S^T\mathbf q=\mathbf X^T\mathbf a=\mathbf S^T\mathbf W^T\mathbf a$. But this implies that $\mathbf q=\mathbf W^T\mathbf a$ only if $\mathbf S^T$ has full column rank (see Corollary A.2).
The most common goal of reparameterization is to find a full-rank version of an overparameterized model. A full-rank version always exists: simply delete columns of the design matrix that are linear combinations of the other columns until no such columns remain.
Recall that $\mathbf X_{N\times p}=\mathbf W_{N\times t}\mathbf S_{t\times p}$. If $\mathbf W$ has full column rank:
- $\mathbf W^T\mathbf W$ is non-singular, and there is a unique solution to the $\mathbf W$ N.E.s
- The matrix $\mathbf S$ is unique, so $\mathbf X\mathbf b=\mathbf W\mathbf c\Rightarrow\mathbf W\mathbf S\mathbf b=\mathbf W\mathbf c\Rightarrow\mathbf S\mathbf b=\mathbf c$. So, the parameter $\mathbf c$ has a unique representation in terms of the elements of $\mathbf b$.
Examples: Let’s look at several different full-rank versions of the one-way ANOVA model.
Standard model: $y_{ij}=\mu+\alpha_i+e_{ij}$, $i=1,\cdots,a$, $j=1,\cdots,n_i$. $$ \mathbf X\mathbf b=\begin{pmatrix}\mathbf 1_{n_1}&\mathbf 1_{n_1}&\mathbf0&\cdots&\mathbf 0\\\mathbf 1_{n_2}&\mathbf 0&\mathbf 1_{n_2}&\cdots&\mathbf 0\\\vdots&\vdots&\vdots&\ddots&\vdots\\\mathbf 1_{n_a}&\mathbf0&\mathbf0&\cdots&\mathbf1_{n_a}\end{pmatrix}\begin{pmatrix}\mu\\\alpha_1\\\vdots\\\alpha_a\end{pmatrix} $$
Cell means model: $y_{ij}=\mu_i+e_{ij}$ $$ \mathbf Z\pmb\mu=\begin{pmatrix}\mathbf 1_{n_1}&\mathbf0&\cdots&\mathbf 0\\\mathbf 0&\mathbf 1_{n_2}&\cdots&\mathbf 0\\\vdots&\vdots&\ddots&\vdots\\\mathbf0&\mathbf0&\cdots&\mathbf1_{n_a}\end{pmatrix}\begin{pmatrix}\mu_1\\\mu_2\\\vdots\\\mu_a\end{pmatrix} $$ $\mathbf X_{N\times(a+1)}=\mathbf Z_{N\times a}\mathbf S_{a\times(a+1)}$ where $\mathbf S=[\mathbf 1_a\;\mathbf I_a]$.
Let $\pmb\mu^\star=(\mu_1,\mu_2,\cdots,\mu_a)^T$. Then $\pmb\mu^\star=\mathbf S\mathbf b=\mathbf S(\mu,\alpha_1,\cdots,\alpha_a)^T=(\mu+\alpha_1,\mu+\alpha_2,\cdots,\mu+\alpha_a)^T$.
Cell reference model: $y_{ij}=c_1+c_2\mathbf I_{\{1\}}(i)+c_3\mathbf I_{\{2\}}(i)+\cdots+c_a\mathbf I_{\{a-1\}}(i)+e_{ij}$ $$ \mathbf W\mathbf c=\begin{pmatrix}\mathbf 1_{n_1}&\mathbf 1_{n_1}&\mathbf0&\cdots&\mathbf 0\\\mathbf 1_{n_2}&\mathbf 0&\mathbf 1_{n_2}&\cdots&\mathbf 0\\\vdots&\vdots&\vdots&\ddots&\vdots\\\mathbf 1_{n_{a-1}}&\mathbf0&\mathbf0&\cdots&\mathbf1_{n_{a-1}}\\\mathbf 1_{n_a}&\mathbf 0&\mathbf 0&\cdots&\mathbf0\end{pmatrix}\begin{pmatrix}c_1\\c_2\\\vdots\\c_a\end{pmatrix} $$ $\mathbf X_{N\times(a+1)}=\mathbf W_{N\times a}\mathbf S_{a\times(a+1)}$ where $\mathbf S=[\mathbf I_a;(1,-1,-1,\cdots,-1)^T]$.
Let $\mathbf c=(c_1,c_2,\cdots,c_a)^T$. Then $\mathbf c=\mathbf S\mathbf b=[\mathbf I_a;(1,-1,-1,\cdots,-1)^T]\begin{pmatrix}\mu\\\alpha_1\\\vdots\\\alpha_a\end{pmatrix}=\begin{pmatrix}\mu+\alpha_a\\\alpha_1-\alpha_a\\\vdots\\\alpha_{a-1}-\alpha_a\end{pmatrix}$.
Now, let’s look at the relationship between $\mathbf c$ and $\pmb\mu^*$: $$ \begin{pmatrix}\mu_1\\\mu_2\\\vdots\\\mu_{a-1}\\\mu_a\end{pmatrix}=\begin{pmatrix}c_1+c_2\\c_1+c_3\\\vdots\\c_1+c_a\\c_1\end{pmatrix}\Rightarrow \begin{cases}c_1=\mu_a\\c_2=\mu_1-\mu_a\\c_3=\mu_2-\mu_a\\\vdots\\c_a=\mu_{a-1}-\mu_a\end{cases} $$
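Here is a minimal numerical sketch (not from the notes; numpy assumed, and the group sizes and the vector `b` are illustrative) verifying $\mathbf X=\mathbf Z\mathbf S$ for the cell means model, $\mathbf X=\mathbf W\mathbf S$ for the cell reference model, and the corresponding parameter maps $\mathbf S\mathbf b$:

```python
import numpy as np

a, n = 3, (2, 3, 2)                 # illustrative: 3 groups with n_i observations each
N = sum(n)
Z = np.zeros((N, a))                # cell-means design
row = 0
for i, ni in enumerate(n):
    Z[row:row + ni, i] = 1
    row += ni
X = np.hstack([np.ones((N, 1)), Z])           # overparameterized design [1 | Z]
W = np.hstack([np.ones((N, 1)), Z[:, :-1]])   # cell-reference design (last group dropped)

S_cell_means = np.hstack([np.ones((a, 1)), np.eye(a)])                  # X = Z S
S_cell_ref = np.hstack([np.eye(a), np.array([[1.0], [-1.0], [-1.0]])])  # X = W S

print(np.allclose(X, Z @ S_cell_means))   # True
print(np.allclose(X, W @ S_cell_ref))     # True

b = np.array([1.0, 0.5, -0.2, 0.7])       # (mu, alpha_1, alpha_2, alpha_3), illustrative
print(S_cell_means @ b)                   # (mu + alpha_1, mu + alpha_2, mu + alpha_3)
print(S_cell_ref @ b)                     # (mu + alpha_3, alpha_1 - alpha_3, alpha_2 - alpha_3)
```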
3.8 Imposing Conditions on $\mathbf b$ to Get a Unique Solution to the Normal Equations
Example: One-way ANOVA: $y_{ij}=\mu+\alpha_i+e_{ij}$
N.E.s: $N\mu+\sum_{i=1}^an_i\alpha_i=N\bar{y}_{\cdot\cdot}$, $\mu+\alpha_i=\bar{y}_{i\cdot}$, $i=1,2,\cdots,a$
Here are some standard conditions on $\mathbf b=(\mu,\alpha_1,\cdots,\alpha_a)^T$ that are imposed to get a unique solution:
- $\alpha_a=0\Rightarrow\hat{\mu}=\bar{y}_{a\cdot},\hat{\alpha}_i=\bar{y}_{i\cdot}-\bar{y}_{a\cdot}$, $i=1,2,\cdots,a-1$, $\hat{\alpha}_a=0$
- $\sum_{i=1}^a\alpha_i=0\Rightarrow\hat \mu=\frac{1}{a}\sum_{i=1}^a\bar{y}_{i\cdot}$, $\hat\alpha_i=\bar{y}_{i\cdot}-\hat\mu$, $i=1,\cdots,a$
- $\sum_{i=1}^an_i\alpha_i=0\Rightarrow\hat{\mu}=\bar{y}_{\cdot\cdot}$, $\hat{\alpha}_i=\bar{y}_{i\cdot}-\bar{y}_{\cdot\cdot}$, $i=1,\cdots,a$
Q: Can we always impose a set of conditions on $\mathbf b$ to obtain a unique solution to the N.E.s?
A: Yes.
Define $rank(\mathbf X)=r\leq p$.
The conditions on $\mathbf b$ will be specified as follows: $$ \mathbf C\mathbf b=\mathbf 0 $$ where $\mathbf C\in\mathbb R^{(p-r)\times p}$, $\mathbf b\in\mathbb R^{p\times 1}$, $\mathbf 0\in\mathbb R^{(p-r)\times 1}$, and $rank(\mathbf C)=p-r$.
Adding these conditions to the N.E.s yields $$ \begin{pmatrix}\mathbf X^T\mathbf X\\\mathbf C\end{pmatrix}\mathbf b=\begin{pmatrix}\mathbf X^T\mathbf y\\\mathbf 0\end{pmatrix}\quad\quad\quad(1) $$ or equivalently, $$ \begin{pmatrix}\mathbf X\\\mathbf C\end{pmatrix}\mathbf b=\begin{pmatrix}\mathbf P_\mathbf X\mathbf y\\\mathbf 0\end{pmatrix}\quad\quad\quad (2) $$ ($\hat{\mathbf b}$ solves N.E.s $\Leftrightarrow$ $\hat{\mathbf b}$ solves $\mathbf X\mathbf b=\mathbf P_\mathbf X\mathbf y$)
Equations (2) will have a unique solution if $rank(\begin{pmatrix}\mathbf X\\\mathbf C\end{pmatrix})=p$. That is, if $\mathcal C(\begin{pmatrix}\mathbf X^T&\mathbf C^T\end{pmatrix})=\mathbb R^p$. Thus, choosing $\mathbf C$ is equivalent to adding L.I. rows to $\mathbf X$ until its rank is $p$. Equivalently, we can add L.I. columns to $\mathbf X^T$ until it has rank $p$. The columns of $\mathbf C^T$ must be L.I. of one another, and of the columns of $\mathbf X^T$. Thus, the columns of $\mathbf C^T$ CANNOT be in $\mathcal C(\mathbf X^T)$. In other words, the columns of $\mathbf C^T$ must correspond to non-estimable functions.
Suppose $\mathbf C$ satisfies $$ rank(\mathbf C)=p-r\quad\quad\quad(3)\\ \mathcal C(\mathbf X^T)\cap\mathcal C(\mathbf C^T)=\{0\}\quad\quad\quad(4) $$ Together, (3) and (4) imply $$ rank(\begin{pmatrix}\mathbf X\\\mathbf C\end{pmatrix})=p $$ Example: $y_{ij}=\mu+\alpha_i+e_{ij}$, $i=1,2,3$
$p=4$, $r=3$. A basis for $\mathcal C(\mathbf X^T)$ is given by $$ \begin{pmatrix}1\\1\\0\\0\end{pmatrix},\quad\begin{pmatrix}1\\0\\1\\0\end{pmatrix},\quad\begin{pmatrix}1\\0\\0\\1\end{pmatrix} $$ Consider the restriction $\sum_{i=1}^3\alpha_i=0$. In the form $\mathbf C\mathbf b=\mathbf 0$, we have $$ \begin{pmatrix}0&1&1&1\end{pmatrix}\begin{pmatrix}\mu\\\alpha_1\\\alpha_2\\\alpha_3\end{pmatrix}=\mathbf 0 $$ So $\mathbf C^T=\begin{pmatrix}0\\1\\1\\1\end{pmatrix}$.
Q: Is $(0,1,1,1)^T$ linearly independent of the basis vectors for $\mathcal C(\mathbf X^T)$?
A: Yes, because $(0,1,1,1)^T=a_1(1,1,0,0)^T+a_2(1,0,1,0)^T+a_3(1,0,0,1)^T$ is impossible: the last three coordinates would force $a_1=a_2=a_3=1$, while the first coordinate requires $a_1+a_2+a_3=0$.
Lemma 3.1: If $\mathbf C$ satisfies (3) and (4), then (1) is equivalent to $$ \begin{pmatrix}\mathbf X^T\mathbf X\\\mathbf C^T\mathbf C\end{pmatrix}\mathbf b=\begin{pmatrix}\mathbf X^T\mathbf y\\\mathbf 0\end{pmatrix}\quad\quad\quad(6) $$ And, (6) is equivalent to $$ (\mathbf X^T\mathbf X+\mathbf C^T\mathbf C)\mathbf b=\mathbf X^T\mathbf y\quad\quad\quad(7) $$ proof: Recall from Lemma 2.1 that $\mathcal N(\mathbf C)=\mathcal N(\mathbf C^T\mathbf C)$. Thus $\mathbf C\mathbf b=\mathbf 0\Leftrightarrow\mathbf C^T\mathbf C\mathbf b=\mathbf 0$.
Now, if (6) holds, then $$ \mathbf X^T\mathbf X\mathbf b=\mathbf X^T\mathbf y\quad\text{ and }\quad\mathbf C^T\mathbf C\mathbf b=\mathbf 0. $$ (7) follows by just adding these equations. Now assume that (7) holds, and rewrite it as follows: $$ \mathbf C^T\mathbf C\mathbf b=\mathbf X^T\mathbf y-\mathbf X^T\mathbf X\mathbf b $$ The RHS is in $\mathcal C(\mathbf X^T)$, but the only vectors that $\mathcal C(\mathbf X^T)$ and $\mathcal C(\mathbf C^T)$ have in common is $\mathbf 0$. So $\mathbf C^T\mathbf C\mathbf b=\mathbf 0$ and $\mathbf X^T\mathbf X\mathbf b=\mathbf X^T\mathbf y$.
Result 3.6: Assume $\mathbf C$ satisfies (3) and (4). Then
- $\mathbf X^T\mathbf X+\mathbf C^T\mathbf C$ is non-singular
- $(\mathbf X^T\mathbf X+\mathbf C^T\mathbf C)^{-1}\mathbf X^T\mathbf y$ is the unique solution to (1)
- $(\mathbf X^T\mathbf X+\mathbf C^T\mathbf C)^{-1}$ is a G.I. of $\mathbf X^T\mathbf X$
- $\mathbf C(\mathbf X^T\mathbf X+\mathbf C^T\mathbf C)^{-1}\mathbf X^T=\mathbf 0$
- $\mathbf C(\mathbf X^T\mathbf X+\mathbf C^T\mathbf C)^{-1}\mathbf C^T=\mathbf I$
proof:
$\mathbb R^p=\mathcal C(\begin{pmatrix}\mathbf X^T&\mathbf C^T\end{pmatrix})=\mathcal C\left(\begin{pmatrix}\mathbf X^T&\mathbf C^T\end{pmatrix}\begin{pmatrix}\mathbf X\\\mathbf C\end{pmatrix}\right)=\mathcal C(\mathbf X^T\mathbf X+\mathbf C^T\mathbf C)$
So, $\mathbf X^T\mathbf X+\mathbf C^T\mathbf C$ is a $p\times p$ matrix with rank $p$. So it is non-singular.
The unique solution of $(\mathbf X^T\mathbf X+\mathbf C^T\mathbf C)\mathbf b=\mathbf X^T\mathbf y$ is $(\mathbf X^T\mathbf X+\mathbf C^T\mathbf C)^{-1}\mathbf X^T\mathbf y$. Now, the result follows from Lemma 3.1
From 2, we have $\mathbf X^T\mathbf X(\mathbf X^T\mathbf X+\mathbf C^T\mathbf C)^{-1}\mathbf X^T\mathbf y=\mathbf X^T\mathbf y$ for all $\mathbf y\in\mathbb R^N$ because of the first equation in (1). So $\mathbf X^T\mathbf X(\mathbf X^T\mathbf X+\mathbf C^T\mathbf C)^{-1}\mathbf X^T=\mathbf X^T$. Postmultiply by $\mathbf X$: $$ \mathbf X^T\mathbf X(\mathbf X^T\mathbf X+\mathbf C^T\mathbf C)^{-1}\mathbf X^T\mathbf X=\mathbf X^T\mathbf X\Rightarrow(\mathbf X^T\mathbf X+\mathbf C^T\mathbf C)^{-1}\text{ is a G.I. of }\mathbf X^T\mathbf X $$
Similarly, from 2, we have $\mathbf C(\mathbf X^T\mathbf X+\mathbf C^T\mathbf C)^{-1}\mathbf X^T\mathbf y=\mathbf 0$ for all $\mathbf y\in\mathbb R^N$ because of the second equation in (1), so $\mathbf C(\mathbf X^T\mathbf X+\mathbf C^T\mathbf C)^{-1}\mathbf X^T=\mathbf 0$. Finally, expanding $\mathbf C(\mathbf X^T\mathbf X+\mathbf C^T\mathbf C)^{-1}(\mathbf X^T\mathbf X+\mathbf C^T\mathbf C)=\mathbf C$ and using $\mathbf C(\mathbf X^T\mathbf X+\mathbf C^T\mathbf C)^{-1}\mathbf X^T\mathbf X=\mathbf 0$ gives $\mathbf C(\mathbf X^T\mathbf X+\mathbf C^T\mathbf C)^{-1}\mathbf C^T\mathbf C=\mathbf C$; postmultiplying by $\mathbf C^T$ and noting that $\mathbf C\mathbf C^T$ is nonsingular (since $rank(\mathbf C)=p-r$) yields $\mathbf C(\mathbf X^T\mathbf X+\mathbf C^T\mathbf C)^{-1}\mathbf C^T=\mathbf I$.
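A quick numerical check of Result 3.6 for the one-way ANOVA with the constraint $\sum_i\alpha_i=0$ (a minimal sketch, not part of the notes; numpy is assumed, and the group sizes and simulated data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = (4, 2, 3)                                   # illustrative group sizes, a = 3
a, N = len(n), sum(n)
X = np.zeros((N, a + 1))
X[:, 0] = 1
row = 0
for i, ni in enumerate(n):
    X[row:row + ni, 1 + i] = 1
    row += ni
y = rng.normal(size=N)

C = np.array([[0.0, 1.0, 1.0, 1.0]])            # constraint: sum_i alpha_i = 0
M = np.linalg.inv(X.T @ X + C.T @ C)

print(np.allclose(X.T @ X @ M @ X.T @ X, X.T @ X))    # M is a G.I. of X'X
print(np.allclose(C @ M @ X.T, 0))                    # C M X' = 0
print(np.allclose(C @ M @ C.T, np.eye(1)))            # C M C' = I
b_hat = M @ X.T @ y                                   # unique solution to (1)
ybar = np.array([y[X[:, 1 + i] == 1].mean() for i in range(a)])
print(np.allclose(b_hat, np.r_[ybar.mean(), ybar - ybar.mean()]))  # matches the sum-to-zero solution above
```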
3.9 Constrained Parameter Space
GLM: $\mathbf y=\mathbf X\mathbf b+\mathbf e$
Suppose we wish to restrict $\mathbf b$ to the subset of $\mathbb R^p$ given by $$ \mathcal T=\{\mathbf b\in\mathbb R^p:\mathbf P^T\mathbf b=\delta\}, $$ where we will insist that the matrix $\mathbf P\in\mathbb R^{p\times q}$ is of full-column rank.
Note: This may seem similar to Section 3.8, but here we do NOT insist that the columns of $\mathbf P$ correspond to non-estimable functions.
Def: $\pmb\lambda^T\mathbf b$ is estimable in the constrained model if $\exist$ $c\in\mathbb R$ and $\mathbf a\in\mathbb R^N$ s.t. $\mathbb E(c+\mathbf a^T\mathbf y)=\pmb\lambda^T\mathbf b$, $\forall\mathbf b\in\mathcal T$.
*Note: If $\pmb\lambda^T\mathbf b$ is estimable in the unconstrained model, then it is estimable in the constrained model. So the constraint can only increase the number of estimable functions.
Result 3.7: In the constrained model, $\pmb\lambda^T\mathbf b$ is estimable iff $\exist$ $\mathbf a\in\mathbb R^N$ and $\mathbf d\in\mathbb R^q$ s.t. $\pmb\lambda=\mathbf X^T\mathbf a+\mathbf P\mathbf d$.
proof: ($\Leftarrow$): $\mathbb E(\mathbf d^T\delta+\mathbf a^T\mathbf y)=\mathbf d^T\delta+\mathbf a^T\mathbf X\mathbf b=\mathbf d^T\delta+(\pmb\lambda-\mathbf P\mathbf d)^T\mathbf b=\mathbf d^T\delta+\pmb\lambda^T\mathbf b-\mathbf d^T\mathbf P^T\mathbf b=\mathbf d^T\delta+\pmb\lambda^T\mathbf b-\mathbf d^T\delta=\pmb\lambda^T\mathbf b$.
($\Rightarrow$): Let $\mathbf W$ be a matrix such that $\mathcal C(\mathbf W)=\mathcal N(\mathbf P^T)$. Fix $\mathbf b_\star\in\mathcal T$. Then for any $\mathbf z$ we have $\mathbf P^T(\mathbf b_\star+\mathbf W\mathbf z)=\mathbf P^T\mathbf b_\star+\mathbf P^T\mathbf W\mathbf z=\delta$. Now, since $\mathbb E(c+\mathbf a^T\mathbf y)=\pmb\lambda^T\mathbf b$, $\forall \mathbf b\in\mathcal T$, we have $$ \begin{aligned} &c+\mathbf a^T\mathbf X\mathbf b=\pmb\lambda^T\mathbf b,\quad\forall\mathbf b\in\mathcal T\\ \Rightarrow&c+\mathbf a^T\mathbf X(\mathbf b_\star+\mathbf W\mathbf z)=\pmb\lambda^T(\mathbf b_\star+\mathbf W\mathbf z),\quad\forall\mathbf z\\ \Rightarrow&(\pmb\lambda-\mathbf X^T\mathbf a)^T\mathbf W\mathbf z-c-\mathbf a^T\mathbf X\mathbf b_\star+\pmb\lambda^T\mathbf b_\star=0,\quad\forall \mathbf z \end{aligned} $$ Using Result A.8, we have $(\pmb\lambda-\mathbf X^T\mathbf a)^T\mathbf W=\mathbf 0$. Thus $\mathbf W^T(\pmb\lambda-\mathbf X^T\mathbf a)=\mathbf 0\Rightarrow(\pmb\lambda-\mathbf X^T\mathbf a)\in\mathcal N(\mathbf W^T)=\mathcal C(\mathbf P)\Rightarrow\exist$ $\mathbf d\in\mathbb R^q$, s.t. $\pmb\lambda-\mathbf X^T\mathbf a=\mathbf P\mathbf d$.
Note that the existence of $\mathbf a\in\mathbb R^N$ and $\mathbf d\in\mathbb R^q$ s.t. $\pmb\lambda=\mathbf X^T\mathbf a+\mathbf P\mathbf d$ means that $$ \pmb\lambda=\begin{pmatrix}\mathbf X^T&\mathbf P\end{pmatrix}\begin{pmatrix}\mathbf a\\\mathbf d\end{pmatrix}\\ \Rightarrow\pmb\lambda\in\mathcal C\left(\begin{pmatrix}\mathbf X^T&\mathbf P\end{pmatrix}\right)\supset\mathcal C(\mathbf X^T) $$ Now consider minimizing $Q(\mathbf b)=|\mathbf y-\mathbf X\mathbf b|^2$ subject to the constraint that $\mathbf P^T\mathbf b=\delta$.
Method of Lagrange Multipliers $$ L(\mathbf b,\pmb\theta)=(\mathbf y-\mathbf X\mathbf b)^T(\mathbf y-\mathbf X\mathbf b)+2\pmb\theta^T(\mathbf P^T\mathbf b-\delta) $$
$$ \begin{aligned} &\frac{\partial L(\mathbf b,\pmb\theta)}{\partial\mathbf b}=-2\mathbf X^T(\mathbf y-\mathbf X\mathbf b)+2\mathbf P\pmb\theta\\ &\frac{\partial L(\mathbf b,\pmb\theta)}{\partial\pmb\theta}=2(\mathbf P^T\mathbf b-\delta) \end{aligned} $$
Setting these to zero yields the R.N.E.s: $$ \begin{pmatrix} \mathbf X^T\mathbf X&\mathbf P\\ \mathbf P^T&\mathbf 0 \end{pmatrix}\begin{pmatrix}\mathbf b\\\pmb\theta\end{pmatrix}= \begin{pmatrix}\mathbf X^T\mathbf y\\\delta\end{pmatrix} $$ Are the restricted normal equations (R.N.E.s) consistent?
Let’s recall how we establish that the N.E.s $(\mathbf X^T\mathbf X\mathbf b=\mathbf X^T\mathbf y)$ are consistent. We showed that $\mathcal N(\mathbf X^T\mathbf X)=\mathcal N(\mathbf X)$, which implies that $\mathcal C(\mathbf X^T\mathbf X)=\mathcal C(\mathbf X^T)$. Note that it would have been sufficient to show that $\mathcal N(\mathbf X^T\mathbf X)\subseteq\mathcal N(\mathbf X)$ because by Result A.6, it follows that $\mathcal C(\mathbf X^T\mathbf X)\supseteq\mathcal C(\mathbf X^T)$. Then $\mathbf X^T\mathbf y\in\mathcal C(\mathbf X^T)\Rightarrow \mathbf X^T\mathbf y\in\mathcal C(\mathbf X^T\mathbf X)$.
Result 3.8: The R.N.E.s are consistent.
proof: Claim $$ \begin{pmatrix} \mathbf X^T\mathbf y\\\delta \end{pmatrix}\in\mathcal C\left(\begin{pmatrix} \mathbf X^T&\mathbf 0\\ \mathbf 0&\mathbf P^T \end{pmatrix}\right) $$ Indeed, there exists a vector $$ \mathbf u=\begin{pmatrix}\mathbf u_1\in\mathbb R^{N\times1}\\\mathbf u_2\in\mathbb R^{p\times1}\end{pmatrix}\quad\text{s.t.}\quad\begin{pmatrix}\mathbf X^T&\mathbf 0\\\mathbf 0&\mathbf P^T\end{pmatrix}\begin{pmatrix}\mathbf u_1\\\mathbf u_2\end{pmatrix}=\begin{pmatrix}\mathbf X^T\mathbf y\\\delta\end{pmatrix} $$ where $\mathbf u_1=\mathbf y$ and $\mathbf u_2$ exists because we assume that $\delta\in\mathcal C(\mathbf P^T)$.
Thus, it is enough to show that $$ \mathcal C\left(\begin{pmatrix}\mathbf X^T&\mathbf 0\\\mathbf 0&\mathbf P^T\end{pmatrix}\right)\subseteq\mathcal C\left(\begin{pmatrix}\mathbf X^T\mathbf X&\mathbf P\\\mathbf P^T&\mathbf 0\end{pmatrix}\right). $$ To start, let $\mathbf v=\begin{pmatrix}\mathbf v_1\in\mathbb R^{p\times1}\\\mathbf v_2\in\mathbb R^{q\times 1}\end{pmatrix}$ be s.t. $$ \begin{pmatrix}\mathbf X^T\mathbf X&\mathbf P\\\mathbf P^T&\mathbf 0\end{pmatrix}\begin{pmatrix}\mathbf v_1\\\mathbf v_2\end{pmatrix}=\begin{pmatrix}\mathbf 0\\\mathbf 0\end{pmatrix}. $$ Then, we have (i) $\mathbf X^T\mathbf X\mathbf v_1+\mathbf P\mathbf v_2=\mathbf 0$ and (ii) $\mathbf P^T\mathbf v_1=\mathbf 0$. $$ (\text{i})\Rightarrow \mathbf v_1^T\mathbf X^T\mathbf X\mathbf v_1+\mathbf v^T_1\mathbf P\mathbf v_2=\mathbf 0\stackrel{\text{(ii)}}{\Rightarrow}\mathbf v_1^T\mathbf X^T\mathbf X\mathbf v_1=\mathbf 0\Rightarrow\mathbf X\mathbf v_1=\mathbf 0. $$ But by (i), $\mathbf P\mathbf v_2=\mathbf 0$ $$ \Rightarrow\begin{pmatrix}\mathbf X&\mathbf 0\\\mathbf 0&\mathbf P\end{pmatrix}\begin{pmatrix}\mathbf v_1\\\mathbf v_2\end{pmatrix}=\begin{pmatrix}\mathbf 0\\\mathbf 0\end{pmatrix}\\ \Rightarrow \mathcal N\left(\begin{pmatrix}\mathbf X^T\mathbf X&\mathbf P\\\mathbf P^T&\mathbf 0\end{pmatrix}\right)\subseteq\mathcal N\left(\begin{pmatrix}\mathbf X&\mathbf 0\\\mathbf0&\mathbf P\end{pmatrix}\right)\\ \stackrel{\text{Result A.6}}{\Rightarrow}\mathcal C\left(\begin{pmatrix}\mathbf X^T&\mathbf 0\\\mathbf 0&\mathbf P^T\end{pmatrix}\right)\subseteq\mathcal C\left(\begin{pmatrix}\mathbf X^T\mathbf X&\mathbf P\\\mathbf P^T&\mathbf 0\end{pmatrix}\right). $$
Let $\begin{pmatrix}\hat{\mathbf b}_H\\\hat{\pmb\theta}_H\end{pmatrix}$ denote a generic solution to the R.N.E.s
Result 3.9: $\hat{\mathbf b}_H$ minimizes $Q(\mathbf b)$ subject to $\mathbf b\in\mathcal T$.
proof: Suppose $\mathbf P^T\tilde{\mathbf b}=\delta$. $$ \begin{aligned} Q(\tilde{\mathbf b})&=(\mathbf y-\mathbf X\tilde{\mathbf b})^T(\mathbf y-\mathbf X\tilde{\mathbf b})=\left(\mathbf y-\mathbf X\hat{\mathbf b}_H+\mathbf X(\hat{\mathbf b}_H-\tilde{\mathbf b})\right)^T\left(\mathbf y-\mathbf X\hat{\mathbf b}_H+\mathbf X(\hat{\mathbf b}_H-\tilde{\mathbf b})\right)\\ &=Q(\hat{\mathbf b}_H)+|\mathbf X(\hat{\mathbf b}_H-\tilde{\mathbf b})|^2 \end{aligned} $$ since $(\hat{\mathbf b}_H-\tilde{\mathbf b})^T\mathbf X^T(\mathbf y-\mathbf X\hat{\mathbf b}_H)=(\hat{\mathbf b}_H-\tilde{\mathbf b})^T\mathbf P\hat{\pmb\theta}_H=\left[\mathbf P^T(\hat{\mathbf b}_H-\tilde{\mathbf b})\right]^T\hat{\pmb\theta}_H=0$ (by constraint). So, we have $Q(\hat{\mathbf b}_H)\leq Q(\tilde{\mathbf b})$ for all $\tilde{\mathbf b}\in\mathcal T$, with equality iff $|\mathbf X(\hat{\mathbf b}_H-\tilde{\mathbf b})|^2=0$, or $\mathbf X\hat{\mathbf b}_H=\mathbf X\tilde{\mathbf b}$.
Result 3.10: Let $\tilde{\mathbf b}$ satisfy $\mathbf P^T\tilde{\mathbf b}=\delta$. $Q(\hat{\mathbf b}_H)=Q(\tilde{\mathbf b})$ iff $\exist$ $\tilde{\pmb\theta}$ s.t. $\begin{pmatrix}\tilde{\mathbf b}&\tilde{\pmb\theta}\end{pmatrix}$ solves the R.N.E.s.
proof: ($\Leftarrow$) If $\begin{pmatrix}\hat{\mathbf b}_H&\hat{\pmb\theta}_H\end{pmatrix}$ and $\begin{pmatrix}\tilde{\mathbf b}&\tilde{\pmb\theta}\end{pmatrix}$ both solve the R.N.E.s, then $Q(\hat{\mathbf b}_H)=Q(\tilde{\mathbf b})=\text{min}$.
($\Rightarrow$) From the proof of Result 3.9, we have $Q(\hat{\mathbf b}_H)=Q(\tilde{\mathbf b})\Leftrightarrow\mathbf X\hat{\mathbf b}_H=\mathbf X\tilde{\mathbf b}\Rightarrow \mathbf X^T\mathbf X\hat{\mathbf b}_H=\mathbf X^T\mathbf X\tilde{\mathbf b}$. Now, since $\mathbf X^T\mathbf X\hat{\mathbf b}_H+\mathbf P\hat{\pmb\theta}_H=\mathbf X^T\mathbf y=\mathbf X^T\mathbf X\tilde{\mathbf b}+\mathbf P\hat{\pmb\theta}_H$, we have that $\begin{pmatrix}\tilde{\mathbf b}&\hat{\pmb\theta}_H\end{pmatrix}$ solves the R.N.E.s.
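The R.N.E.s can be solved directly as one bordered linear system. Below is a minimal sketch (not from the notes; numpy is assumed, and the SLR design and the constraint $b_0+b_1=1$ are illustrative choices) checking that the solution satisfies the constraint and that $Q(\hat{\mathbf b}_H)$ is no larger than $Q$ at other feasible points, as Result 3.9 asserts:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 20
x = rng.normal(size=N)
X = np.column_stack([np.ones(N), x])            # SLR design (full rank), illustrative
y = 1.0 + 2.0 * x + rng.normal(size=N)

P = np.array([[1.0], [1.0]])                    # constraint: b_0 + b_1 = 1
delta = np.array([1.0])

# Restricted normal equations as one bordered linear system
A = np.block([[X.T @ X, P], [P.T, np.zeros((1, 1))]])
rhs = np.r_[X.T @ y, delta]
sol = np.linalg.solve(A, rhs)
b_H = sol[:2]                                   # sol[2:] is the Lagrange multiplier theta_H
print(P.T @ b_H)                                # approx [1.]: the constraint holds

Q = lambda b: np.sum((y - X @ b) ** 2)
for _ in range(5):                              # Result 3.9: Q(b_H) <= Q at feasible points
    b0 = rng.normal()
    b_feasible = np.array([b0, 1.0 - b0])       # another point with b_0 + b_1 = 1
    print(Q(b_H) <= Q(b_feasible) + 1e-9)       # True
```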
Chapter 4: Gauss–Markov Model
Suppose $\Sigma$ is an $n\times n$ covariance matrix associated with $Z=(Z_1,\cdots,Z_n)^T$. Then for $v\in\mathbb R^n$, we have $$ v^T\Sigma v=Var(v^TZ)=Var(\sum_{i=1}^nv_iZ_i)\geq0\quad\Rightarrow\quad\Sigma\text{ is non-negative definite}. $$ When is $\Sigma$ positive definite? The only way that $v^T\Sigma v=0$ when $v\not=0$ is if $Var(\sum_{i=1}^nv_iZ_i)=0$. This means that $\sum_{i=1}^nv_iZ_i$ must be constant. If no such relationship exists, then $\Sigma$ is positive definite.
Gauss-Markov Model
$\mathbf y=\mathbf X\mathbf b+\mathbf e$, $\mathbb E\mathbf e=0$, $Cov(\mathbf e)=\sigma^2\mathbf I$.
So, each element of $\mathbf e$ has mean $0$ and variance $\sigma^2$, and $Cov(e_i,e_j)=0$ for $i\not=j$.
Let’s derive the variance of the L.S. estimator of the estimable function $\pmb\lambda^T\mathbf b$.
Of course, L.S. estimator of $\pmb\lambda^T\mathbf b$ is $\pmb\lambda^T\hat{\mathbf b}=\pmb\lambda^T(\mathbf X^T\mathbf X)^g\mathbf X^T\mathbf y$. We know that $\mathbb E(\pmb\lambda^T\hat{\mathbf b})=\pmb\lambda^T\mathbf b$. $$ \begin{aligned} Var(\pmb\lambda^T\hat{\mathbf b})&=Var(\pmb\lambda^T(\mathbf X^T\mathbf X)^g\mathbf X^T\mathbf y)\\ &=\pmb\lambda^T(\mathbf X^T\mathbf X)^g\mathbf X^TCov(\mathbf y)[\pmb\lambda^T(\mathbf X^T\mathbf X)^g\mathbf X^T]^T\\ &=\sigma^2\pmb\lambda^T(\mathbf X^T\mathbf X)^g\mathbf X^T\mathbf X[(\mathbf X^T\mathbf X)^g]^T\pmb\lambda \end{aligned} $$ Note that $\mathbf X^T\mathbf X[(\mathbf X^T\mathbf X)^g]^T$ projects onto $\mathcal C(\mathbf X^T\mathbf X)=\mathcal C(\mathbf X^T)$ and $\pmb\lambda\in\mathcal C(\mathbf X^T)$, so $\mathbf X^T\mathbf X[(\mathbf X^T\mathbf X)^g]^T\pmb\lambda=\pmb\lambda$. So $$ Var(\pmb\lambda^T\hat{\mathbf b})=\sigma^2\pmb\lambda^T(\mathbf X^T\mathbf X)^g\pmb\lambda. $$ Example:
SLR: $y_i=\beta_0+\beta_1X_i+e_i$. Find $Var(\hat{\beta}_1)$. $$ \mathbf X^T\mathbf X=\begin{pmatrix}N&N\bar X\\N\bar X&\sum_{i=1}^NX_i^2\end{pmatrix} $$
$$ \beta_1=\pmb\lambda^T\mathbf b=\begin{pmatrix}0&1\end{pmatrix}\begin{pmatrix}\beta_0\\\beta_1\end{pmatrix} $$
$$ \begin{aligned} Var(\hat{\beta}_1)&=\sigma^2\pmb\lambda^T(\mathbf X^T\mathbf X)^g\pmb\lambda\\ &=\sigma^2\begin{pmatrix}0&1\end{pmatrix}\begin{pmatrix}N&N\bar X\\N\bar X&\sum_{i=1}^NX_i^2\end{pmatrix}^{-1}\begin{pmatrix}0\\1\end{pmatrix}\\ &=\frac{\sigma^2}{NS_{XX}}\begin{pmatrix}0&1\end{pmatrix}\begin{pmatrix}\sum_{i=1}^NX_i^2 & -N\bar X\\-N\bar X&N\end{pmatrix}\begin{pmatrix}0\\1\end{pmatrix}\\ &=\frac{\sigma^2}{S_{XX}} \end{aligned} $$
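A short numerical confirmation of this variance formula (a sketch only, not part of the notes; numpy is assumed and the sample values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
N, sigma2 = 30, 1.5                       # illustrative values
x = rng.normal(size=N)
X = np.column_stack([np.ones(N), x])
lam = np.array([0.0, 1.0])                # lambda picking off beta_1

G = np.linalg.inv(X.T @ X)                # X has full rank, so the g-inverse is the inverse
var_formula = sigma2 * lam @ G @ lam      # sigma^2 * lambda' (X'X)^g lambda
Sxx = np.sum((x - x.mean()) ** 2)
print(np.isclose(var_formula, sigma2 / Sxx))   # True
```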
Theorem 4.1 (Gauss-Markov Theorem): Assume that $\pmb\lambda^T\mathbf b$ is estimable. Under the G-M model, the L.S. estimator $\pmb\lambda^T\hat{\mathbf b}$ is the best (minimum variance) linear unbiased estimator (BLUE).
proof: Suppose that $c+\mathbf d^T\mathbf y$ is an unbiased estimator of $\pmb\lambda^T\mathbf b$. Then $c+\mathbf d^T\mathbf X\mathbf b=\pmb\lambda^T\mathbf b$, $\forall \mathbf b\in\mathbb R^p$. This implies $c=0$ and $\mathbf d^T\mathbf X=\pmb\lambda^T$. Now, $$ \begin{aligned} &Var(c+\mathbf d^T\mathbf y)\\ =&Var(\mathbf d^T\mathbf y)\\ =&Var(\pmb\lambda^T\hat{\mathbf b}+\mathbf d^T\mathbf y-\pmb\lambda^T\hat{\mathbf b})\\ =&Var(\pmb\lambda^T\hat{\mathbf b})+Var(\mathbf d^T\mathbf y-\pmb\lambda^T\hat{\mathbf b})+2Cov(\pmb\lambda^T\hat{\mathbf b},\mathbf d^T\mathbf y-\pmb\lambda^T\hat{\mathbf b})\\ =&Var(\pmb\lambda^T\hat{\mathbf b})+Var(\mathbf d^T\mathbf y-\pmb\lambda^T\hat{\mathbf b})+2Cov(\pmb\lambda^T(\mathbf X^T\mathbf X)^g\mathbf X^T\mathbf y,[\mathbf d-\mathbf X[(\mathbf X^T\mathbf X)^g]^T\pmb\lambda]^T\mathbf y)\\ =&Var(\pmb\lambda^T\hat{\mathbf b})+Var(\mathbf d^T\mathbf y-\pmb\lambda^T\hat{\mathbf b})+2\pmb\lambda^T(\mathbf X^T\mathbf X)^g\mathbf X^TCov(\mathbf y)[\mathbf d-\mathbf X[(\mathbf X^T\mathbf X)^g]^T\pmb\lambda]\\ =&Var(\pmb\lambda^T\hat{\mathbf b})+Var(\mathbf d^T\mathbf y-\pmb\lambda^T\hat{\mathbf b})+2\sigma^2\pmb\lambda^T(\mathbf X^T\mathbf X)^g[\mathbf X^T\mathbf d-\mathbf X^T\mathbf X[(\mathbf X^T\mathbf X)^g]^T\pmb\lambda]\\ =&Var(\pmb\lambda^T\hat{\mathbf b})+Var(\mathbf d^T\mathbf y-\pmb\lambda^T\hat{\mathbf b})+2\sigma^2\pmb\lambda^T(\mathbf X^T\mathbf X)^g(\pmb\lambda-\pmb\lambda)\\ =&Var(\pmb\lambda^T\hat{\mathbf b})+Var(\mathbf d^T\mathbf y-\pmb\lambda^T\hat{\mathbf b})\\ \geq& Var(\pmb\lambda^T\hat{\mathbf b}) \end{aligned} $$ with equality iff $$ \begin{aligned} 0&=Var(\mathbf d^T\mathbf y-\pmb\lambda^T\hat{\mathbf b})\\ &=Var(\mathbf d^T\mathbf y-\pmb\lambda^T(\mathbf X^T\mathbf X)^g\mathbf X^T\mathbf y)\\ &=Var\left([\mathbf d-\mathbf X[(\mathbf X^T\mathbf X)^g]^T\pmb\lambda]^T\mathbf y\right)\\ &=[\mathbf d-\mathbf X[(\mathbf X^T\mathbf X)^g]^T\pmb\lambda]^TCov(\mathbf y)[\mathbf d-\mathbf X[(\mathbf X^T\mathbf X)^g]^T\pmb\lambda]\\ &=\sigma^2|\mathbf d-\mathbf X[(\mathbf X^T\mathbf X)^g]^T\pmb\lambda|^2 \end{aligned} $$ but this norm square is $0$ iff $\mathbf d^T=\pmb\lambda^T(\mathbf X^T\mathbf X)^g\mathbf X^T\Rightarrow\mathbf d^T\mathbf y=\pmb\lambda^T(\mathbf X^T\mathbf X)^g\mathbf X^T\mathbf y=\pmb\lambda^T\hat{\mathbf b}\Rightarrow$ BLUE is unique.
Note that the crucial step in the proof of G-M theorem is showing that $Cov(\pmb\lambda^T\hat{\mathbf b},\mathbf d^T\mathbf y-\pmb\lambda^T\hat{\mathbf b})=0$. Of course, $\mathbf d^T\mathbf y-\pmb\lambda^T\hat{\mathbf b}$ is an unbiased estimator of $0$.
Result 4.1: The BLUE $\pmb\lambda^T\hat{\mathbf b}$ is uncorrelated with all linear unbiased estimators of $0$.
proof: Suppose $\mathbb E(c+\mathbf a^T\mathbf y)=c+\mathbf a^T\mathbf X\mathbf b=0$, $\forall \mathbf b\in\mathbb R^p$. It follows that $c=0$ and $\mathbf a^T\mathbf X=\mathbf 0\Rightarrow\mathbf a\in\mathcal N(\mathbf X^T)$. So, $$ \begin{aligned} Cov(\pmb\lambda^T\hat{\mathbf b},\mathbf a^T\mathbf y)&=Cov(\pmb\lambda^T(\mathbf X^T\mathbf X)^g\mathbf X^T\mathbf y,\mathbf a^T\mathbf y)\\ &=\pmb\lambda^T(\mathbf X^T\mathbf X)^g\mathbf X^TCov(\mathbf y)\mathbf a\\ &=\sigma^2\pmb\lambda^T(\mathbf X^T\mathbf X)^g\mathbf X^T\mathbf a\\ &=0 \end{aligned} $$
*The G-M theorem can be extended to the case where we have $r$ linearly independent estimable functions $\{\pmb\lambda^{(j)T}\mathbf b\}_{j=1}^r$. (Assume $rank(\mathbf X)=r$)
Let $\pmb\Lambda$ be a $p\times r$ matrix whose $j$th column is $\pmb\lambda^{(j)}$. Then $\pmb\Lambda^T\hat{\mathbf b}$ is the $r\times 1$ vector of L.S. estimators, and $\mathbb E(\pmb\Lambda^T\hat{\mathbf b})=\pmb\Lambda^T\mathbf b$, $$ \begin{aligned} Cov(\pmb\Lambda^T\hat{\mathbf b})&=Cov(\pmb\Lambda^T(\mathbf X^T\mathbf X)^g\mathbf X^T\mathbf y)\\ &=\pmb\Lambda^T(\mathbf X^T\mathbf X)^g\mathbf X^TCov(\mathbf y)[\pmb\Lambda^T(\mathbf X^T\mathbf X)^g\mathbf X^T]^T\\ &=\sigma^2\pmb\Lambda^T(\mathbf X^T\mathbf X)^g\mathbf X^T\mathbf X[(\mathbf X^T\mathbf X)^g]^T\pmb\Lambda\\ &=\sigma^2\pmb\Lambda^T(\mathbf X^T\mathbf X)^g\pmb\Lambda. \end{aligned} $$ Q: Is $\sigma^2\pmb\Lambda^T(\mathbf X^T\mathbf X)^g\pmb\Lambda$ positive definite?
A: Yes. Since each $\pmb\lambda^{(j)}\in\mathcal C(\mathbf X^T)$, we can write $\pmb\Lambda=\mathbf X^T\mathbf S$ for some matrix $\mathbf S$, so $\pmb\Lambda^T(\mathbf X^T\mathbf X)^g\pmb\Lambda=\mathbf S^T\mathbf X(\mathbf X^T\mathbf X)^g\mathbf X^T\mathbf S=\mathbf S^T\mathbf P_\mathbf X\mathbf P_\mathbf X\mathbf S=(\mathbf P_\mathbf X\mathbf S)^T\mathbf P_\mathbf X\mathbf S$.
Because $\mathbf X^T\mathbf P_\mathbf X\mathbf S=\mathbf X^T\mathbf S=\pmb\Lambda$, we have $$ r=rank(\pmb\Lambda)\leq\min\{rank(\mathbf X^T),rank(\mathbf P_\mathbf X\mathbf S)\}\leq rank(\mathbf P_\mathbf X\mathbf S)\Rightarrow rank(\mathbf P_\mathbf X\mathbf S)=r. $$ So, $\sigma^2\pmb\Lambda^T(\mathbf X^T\mathbf X)^g\pmb\Lambda$ is positive definite.
Suppose that $\mathbb E(\mathbf C^T\mathbf y)=\pmb\Lambda^T\mathbf b$, $\forall\mathbf b\in\mathbb R^p$. ($\mathbf C\in\mathbb R^{N\times r}$)
So, $\mathbf C^T\mathbf X\mathbf b=\pmb\Lambda^T\mathbf b$, $\forall \mathbf b\in\mathbb R^p\Rightarrow \mathbf C^T\mathbf X=\pmb\Lambda^T$. Now, $Cov(\mathbf C^T\mathbf y)=\mathbf C^TCov(\mathbf y)\mathbf C=\sigma^2\mathbf C^T\mathbf C$ and $$ \begin{aligned} Cov(\mathbf C^T\mathbf y)-Cov(\pmb\Lambda^T\hat{\mathbf b})&=\sigma^2\left[\mathbf C^T\mathbf C-\pmb\Lambda^T(\mathbf X^T\mathbf X)^g\pmb\Lambda\right]\\ &=\sigma^2\left[\mathbf C^T\mathbf C-\mathbf C^T\mathbf X(\mathbf X^T\mathbf X)^g\mathbf X^T\mathbf C\right]\\ &=\sigma^2\mathbf C^T(\mathbf I-\mathbf P_\mathbf X)\mathbf C \end{aligned} $$ Note that, for $\mathbf v\in\mathbb R^r$, we have $$ \mathbf v^T\mathbf C^T[\mathbf I-\mathbf P_\mathbf X]\mathbf C\mathbf v=\mathbf v^T\mathbf C^T(\mathbf I-\mathbf P_\mathbf X)(\mathbf I-\mathbf P_\mathbf X)\mathbf C\mathbf v=\left((\mathbf I-\mathbf P_\mathbf X)\mathbf C\mathbf v\right)^T(\mathbf I-\mathbf P_\mathbf X)\mathbf C\mathbf v\geq0\\ \Rightarrow \mathbf C^T[\mathbf I-\mathbf P_\mathbf X]\mathbf C\text{ is non-negative definite}\\ \Rightarrow Cov(\mathbf C^T\mathbf y)-Cov(\pmb\Lambda^T\hat{\mathbf b}) \text{ is non-negative definite}. $$ Fix $\mathbf v\in\mathbb R^r$, $\mathbf v\not=\mathbf 0$. Let’s compare $\mathbf v^T\mathbf C^T\mathbf y$ with $\mathbf v^T\pmb\Lambda^T\hat{\mathbf b}$. $$ Var(\mathbf v^T\mathbf C^T\mathbf y)-Var(\mathbf v^T\pmb\Lambda^T\hat{\mathbf b})=\mathbf v^T Cov(\mathbf C^T\mathbf y)\mathbf v-\mathbf v^TCov(\pmb\Lambda^T\hat{\mathbf b})\mathbf v\\=\mathbf v^T(Cov(\mathbf C^T\mathbf y)-Cov(\pmb\Lambda^T\hat{\mathbf b}))\mathbf v\geq0. $$
Suppose $r=rank(\mathbf X)=p$. Then every function $\pmb\lambda^T\mathbf b$ is estimable. So we can take $\pmb\Lambda=\mathbf I_p$, which yields $$ Cov(\hat{\mathbf b})=\sigma^2(\mathbf X^T\mathbf X)^{-1}. $$ We also have $$ Cov(\tilde{\mathbf b})-Cov(\hat{\mathbf b})\text{ is non-negative definite}, $$ where $\tilde{\mathbf b}$ is any linear unbiased estimator of $\mathbf b$.
4.3 Variance Estimation
Let $\mathbf Z$ be a random vector such that $\mathbb E\mathbf Z=\pmb\mu$ and $Cov(\mathbf Z)=\pmb\Sigma$. Then if $\mathbf A$ is a fixed matrix, we have $$ \mathbb E(\mathbf Z^T\mathbf A\mathbf Z)=\pmb\mu^T\mathbf A\pmb\mu+tr(\mathbf A\pmb\Sigma). $$ proof: See Page 76 in the textbook.
Fact: $tr(\mathbf I-\mathbf P_\mathbf X)=N-r$, where $r=rank(\mathbf X)$.
Reason: $tr(\mathbf I-\mathbf P_\mathbf X)=tr(\mathbf I)-tr(\mathbf P_\mathbf X)$. Since $\mathbf P_\mathbf X$ is idempotent, $rank(\mathbf P_\mathbf X)=tr(\mathbf P_\mathbf X)$, but $\mathcal C(\mathbf P_\mathbf X)=\mathcal C(\mathbf X)$, so $rank(\mathbf P_\mathbf X)=rank(\mathbf X)=r$.
Recall that $SSE=\hat{\mathbf e}^T\hat{\mathbf e}=\mathbf y^T(\mathbf I-\mathbf P_\mathbf X)\mathbf y$. Define $\hat{\sigma}^2=SSE/(N-r)$.
Result 4.2: Under the G-M model, $\hat{\sigma}^2$ is an unbiased estimator of $\sigma^2$.
proof: $$ \begin{aligned} \mathbb E(\mathbf y^T(\mathbf I-\mathbf P_\mathbf X)\mathbf y)&=\mathbf b^T\mathbf X^T(\mathbf I-\mathbf P_\mathbf X)\mathbf X\mathbf b+tr((\mathbf I-\mathbf P_\mathbf X)Cov(\mathbf y))\\ &=0+\sigma^2tr(\mathbf I-\mathbf P_\mathbf X)\\ &=\sigma^2(N-r). \end{aligned} $$
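A Monte Carlo sketch of Result 4.2 for a rank-deficient one-way ANOVA design (not from the notes; numpy is assumed, and the group sizes, $\mathbf b$, and $\sigma$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma = (5, 4, 6), 2.0                   # illustrative group sizes and error sd
a, N = len(n), sum(n)
X = np.zeros((N, a + 1))
X[:, 0] = 1
row = 0
for i, ni in enumerate(n):
    X[row:row + ni, 1 + i] = 1
    row += ni
r = np.linalg.matrix_rank(X)                # = a, not a + 1 (X is rank deficient)
P = X @ np.linalg.pinv(X)                   # orthogonal projection onto C(X)
b = np.array([1.0, 0.5, -0.5, 0.0])         # any b works; SSE = e'(I - P)e does not depend on it

estimates = []
for _ in range(5000):
    y = X @ b + sigma * rng.normal(size=N)
    sse = y @ (np.eye(N) - P) @ y
    estimates.append(sse / (N - r))
print(np.mean(estimates))                   # close to sigma^2 = 4.0
```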
4.5 The Aitken Model and Generalized Least Squares
Aitken model: $\mathbf y=\mathbf X\mathbf b+\mathbf e$, $\mathbb E(\mathbf e)=\mathbf 0$, $Cov(\mathbf e)=\sigma^2\mathbf V$, where $\mathbf V$ is a known positive definite matrix.
The key to dealing with this model is to find a matrix $\mathbf R$ satisfying $\mathbf R\mathbf V\mathbf R^T=\mathbf I$.
There are two standard ways to find $\mathbf R$:
Cholesky factorization: $\mathbf V=\mathbf L\mathbf L^T$, where $\mathbf L$ is an invertible lower-triangular matrix. Then $\mathbf R=\mathbf L^{-1}$ and $\mathbf R\mathbf V\mathbf R^T=\mathbf L^{-1}\mathbf V(\mathbf L^{-1})^T=\mathbf L^{-1}\mathbf L\mathbf L^T(\mathbf L^{-1})^T=\mathbf I$.
Spectral decomposition: $\mathbf V=\mathbf Q\pmb\Lambda\mathbf Q^T$. Here $\mathbf R=\mathbf Q\pmb\Lambda^{-1/2}\mathbf Q^T$ (symmetric) and $$ \mathbf R\mathbf V\mathbf R^T=\mathbf Q\pmb\Lambda^{-1/2}\mathbf Q^T\mathbf Q\pmb\Lambda\mathbf Q^T\mathbf Q\pmb\Lambda^{-1/2}\mathbf Q^T=\mathbf Q\pmb\Lambda^{-1/2}\pmb\Lambda\pmb\Lambda^{-1/2}\mathbf Q^T=\mathbf Q\mathbf Q^T=\mathbf I. $$
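Both constructions of $\mathbf R$ are easy to check numerically. A minimal sketch (numpy is assumed; the matrix $\mathbf V$ below is an arbitrary positive definite example):

```python
import numpy as np

rng = np.random.default_rng(4)
B = rng.normal(size=(5, 5))
V = B @ B.T + 5 * np.eye(5)                    # an arbitrary positive definite V

# Cholesky route: V = L L', take R = L^{-1}
L = np.linalg.cholesky(V)
R_chol = np.linalg.inv(L)
print(np.allclose(R_chol @ V @ R_chol.T, np.eye(5)))   # True

# Spectral route: V = Q diag(w) Q', take R = Q diag(w^{-1/2}) Q' (symmetric)
w, Q = np.linalg.eigh(V)
R_spec = Q @ np.diag(w ** -0.5) @ Q.T
print(np.allclose(R_spec @ V @ R_spec.T, np.eye(5)))   # True
```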
Now, we use $\mathbf R$ to transform the Aitken model: $$ \mathbf R\mathbf y=\mathbf R\mathbf X\mathbf b+\mathbf R\mathbf e\\ \Leftrightarrow\mathbf z=\mathbf U\mathbf b+\mathbf f $$ where $\mathbf z=\mathbf R\mathbf y$, $\mathbf U=\mathbf R\mathbf X$, and $\mathbf f=\mathbf R\mathbf e$. Then $\mathbb E(\mathbf f)=\mathbb E(\mathbf R\mathbf e)=\mathbf R\mathbb E(\mathbf e)=\mathbf 0$ and $Cov(\mathbf f)=Cov(\mathbf R\mathbf e)=\mathbf RCov(\mathbf e)\mathbf R^T=\sigma^2\mathbf R\mathbf V\mathbf R^T=\sigma^2\mathbf I$.
Thus, the transformed model is nothing but a G-M model. We will solve problems using G-M theory, and then transform back.
Estimability: $\pmb\lambda^T\mathbf b$ is estimable $\Leftrightarrow$ $\pmb\lambda\in\mathcal C(\mathbf U^T)$. But $\mathcal C(\mathbf U^T)=\mathcal C(\mathbf X^T\mathbf R^T)=\mathcal C(\mathbf X^T)$ since $\mathbf R^T$ is invertible.
Generalized least squares
$\mathbf U^T\mathbf U\mathbf b=\mathbf U^T\mathbf z\Rightarrow (\mathbf R\mathbf X)^T\mathbf R\mathbf X\mathbf b=(\mathbf R\mathbf X)^T\mathbf z\Rightarrow \mathbf X^T\mathbf R^T\mathbf R\mathbf X\mathbf b=\mathbf X^T\mathbf R^T\mathbf R\mathbf y\Rightarrow \mathbf X^T\mathbf V^{-1}\mathbf X\mathbf b=\mathbf X^T\mathbf V^{-1}\mathbf y$. Here we have used $\mathbf V=\mathbf R^{-1}(\mathbf R^T)^{-1}=(\mathbf R^T\mathbf R)^{-1}\Rightarrow \mathbf V^{-1}=\mathbf R^T\mathbf R$.
We call $\mathbf X^T\mathbf V^{-1}\mathbf X\mathbf b=\mathbf X^T\mathbf V^{-1}\mathbf y$ the Aitken equations, and we call a solution $\hat{\mathbf b}_{GLS}$.
Theorem (Aitken’s Theorem): Let $\mathbf y=\mathbf X\mathbf b+\mathbf e$ with $\mathbb E \mathbf e=0$ and $Cov(\mathbf e)=\sigma^2\mathbf V$ where $\mathbf V$ is known positive definite. If $\pmb\lambda^T\mathbf b$ is estimable, then $\pmb\lambda^T\hat{\mathbf b}_{GLS}$ is the BLUE for $\pmb\lambda^T\mathbf b$.
From the results in Chapter 2, we know that $\hat{\mathbf b}_{GLS}$ minimizes $$ |\mathbf z-\mathbf U\mathbf b|^2=|\mathbf R(\mathbf y-\mathbf X\mathbf b)|^2=(\mathbf y-\mathbf X\mathbf b)^T\mathbf V^{-1}(\mathbf y-\mathbf X\mathbf b) $$ Examples:
SLR: $y_i=\beta_0+\beta_1x_i+e_i$, $\mathbb E(e_i)=0$, $Var(\mathbf e)=\sigma^2\mathbf V$ where $\mathbf V$ is diagonal. $$ (\mathbf y-\mathbf X\mathbf b)^T\mathbf V^{-1}(\mathbf y-\mathbf X\mathbf b)=\sum_{i=1}^N\frac{1}{V_{ii}}(y_i-\beta_0-\beta_1x_i)^2\quad\text{“weighted L.S.”} $$
Regression through the origin with heteroskedastic errors. $y_i=\beta_1x_i+e_i$, $\mathbb Ee_i=0$, $Var(e_i)=\sigma^2x_i^2$, $Cov(e_i,e_j)=0$ ($i\not=j$). $$ \mathbf V=\begin{pmatrix} x_1^2&&& \\ &x_2^2&&\\ &&\ddots&\\ &&&x_N^2 \end{pmatrix}\quad \mathbf R=\begin{pmatrix} x_1^{-1}&&&\\ &x_2^{-1}&&\\ &&\ddots&\\ &&&x_N^{-1} \end{pmatrix} $$ $\mathbf R\mathbf y=(y_1/x_1,y_2/x_2,\cdots,y_N/x_N)^T$, $\mathbf R\mathbf X=\mathbf 1_N$.
Transformed model: $\frac{y_i}{x_i}=\beta_1+\frac{e_i}{x_i}$.
$\mathbf X^T\mathbf V^{-1}\mathbf X=N$, $\mathbf X^T\mathbf V^{-1}\mathbf y=\sum_{i=1}^N\frac{y_i}{x_i}$, $\hat{\mathbf b}_{GLS}=(\mathbf X^T\mathbf V^{-1}\mathbf X)^{-1}\mathbf X^T\mathbf V^{-1}\mathbf y=\frac{1}{N}\sum_{i=1}^N\frac{y_i}{x_i}$, $Var(\hat{\mathbf b}_{GLS})=\frac{1}{N^2}\sum_{i=1}^NVar(y_i/x_i)=\frac{1}{N^2}\sum_{i=1}^N\sigma^2=\sigma^2/N$
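A quick numerical check that the Aitken equations reproduce $\hat{\mathbf b}_{GLS}=\frac{1}{N}\sum_{i=1}^N y_i/x_i$ in this example (a sketch only; numpy is assumed and the simulated data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
N, beta1, sigma = 12, 2.0, 0.5                  # illustrative values
x = rng.uniform(1.0, 3.0, size=N)
y = beta1 * x + sigma * x * rng.normal(size=N)  # Var(e_i) = sigma^2 x_i^2

X = x.reshape(-1, 1)
Vinv = np.diag(1.0 / x ** 2)
b_gls = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)   # Aitken equations
print(np.isclose(b_gls[0], np.mean(y / x)))               # True: b_GLS = (1/N) sum y_i / x_i
```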
Q: Is $\pmb\lambda^T\hat{\mathbf b}_{OLS}$ ever BLUE under the Aitken model?
Result 4.3 (Generalization of Result 4.1): Let $\mathbf y$ be a random vector s.t. $Var(y_i)<\infty$. The linear estimator $\mathbf t^T\mathbf y$ is BLUE for $\mathbb E(\mathbf t^T\mathbf y)$ iff it is uncorrelated with all linear unbiased estimators of $0$.
proof: ($\Leftarrow$) Let $\mathbf a^T\mathbf y$ be such that $\mathbb E(\mathbf a^T\mathbf y)=\mathbb E(\mathbf t^T\mathbf y)$. $$ \begin{aligned} Var(\mathbf a^T\mathbf y)&=Var(\mathbf t^T\mathbf y+\mathbf a^T\mathbf y-\mathbf t^T\mathbf y)\\ &=Var(\mathbf t^T\mathbf y)+Var(\mathbf a^T\mathbf y-\mathbf t^T\mathbf y)+2Cov(\mathbf t^T\mathbf y,\mathbf a^T\mathbf y-\mathbf t^T\mathbf y)\\ &=Var(\mathbf t^T\mathbf y)+Var(\mathbf a^T\mathbf y-\mathbf t^T\mathbf y)\\ &\geq Var(\mathbf t^T\mathbf y) \end{aligned} $$ ($\Rightarrow $) Suppose $\mathbb E(\mathbf h^T\mathbf y)=0$. Need to show that $Cov(\mathbf t^T\mathbf y,\mathbf h^T\mathbf y)=0$. If $\mathbf h^T\mathbf y$ is constant then the result clearly holds. So assume that $\mathbf h^T\mathbf y$ is not constant. Let $Cov(\mathbf t^T\mathbf y,\mathbf h^T\mathbf y)=c$ and $Var(\mathbf h^T\mathbf y)=d>0$. Now consider an alternative linear unbiased estimator of $\mathbb E(\mathbf t^T\mathbf y)$: $\mathbf a^T\mathbf y=\mathbf t^T\mathbf y-(\frac{c}{d})\mathbf h^T\mathbf y$, and we have $$ \begin{aligned} Var(\mathbf a^T\mathbf y)&= Var(\mathbf t^T\mathbf y)+(\frac{c}{d})^2Var(\mathbf h^T\mathbf y)-2\frac{c}{d}Cov(\mathbf t^T\mathbf y,\mathbf h^T\mathbf y)\\ &=Var(\mathbf t^T\mathbf y)+\frac{c^2}{d}-2\frac{c^2}{d}\\ &=Var(\mathbf t^T\mathbf y)-\frac{c^2}{d} \end{aligned} $$ But $\mathbf t^T\mathbf y$ is BLUE, so $c$ must be $0$.
Corollary 4.1: Under the Aitken model, $\mathbf t^T\mathbf y$ is the BLUE for $\mathbb E(\mathbf t^T\mathbf y)$ iff $\mathbf V\mathbf t\in\mathcal C(\mathbf X)$.
proof: It is enough to show that $\mathbf t^T\mathbf y$ is uncorrelated with all linear unbiased estimators of $0$ iff $\mathbf V\mathbf t\in\mathcal C(\mathbf X)$.
First, $\mathbb E(\mathbf h^T\mathbf y)=0\Leftrightarrow \mathbf h^T\mathbf X\mathbf b=0,\forall\mathbf b\in\mathbb R^p\Leftrightarrow \mathbf h^T\mathbf X=\mathbf 0\Leftrightarrow \mathbf h\in\mathcal N(\mathbf X^T)$.
($\Leftarrow$) Suppose $\mathbb E(\mathbf h^T\mathbf y)=0$. Then $\mathbf h\in\mathcal N(\mathbf X^T)$ and $Cov(\mathbf t^T\mathbf y,\mathbf h^T\mathbf y)=\mathbf t^TCov(\mathbf y)\mathbf h=\sigma^2\mathbf t^T\mathbf V\mathbf h=0$.
($\Rightarrow$) If $Cov(\mathbf t^T\mathbf y,\mathbf h^T\mathbf y)=\sigma^2\mathbf t^T\mathbf V\mathbf h=0$ for $\forall \mathbf h\in\mathcal N(\mathbf X^T)$, then $\mathbf V^T\mathbf t\perp\mathcal N(\mathbf X^T)\Rightarrow\mathbf V\mathbf t\in\mathcal C(\mathbf X)$.
Result 4.4: Under the Aitken model, all OLS estimators are BLUE (that is, each $\pmb\lambda^T\hat{\mathbf b}_{OLS}$ is BLUE for the corresponding estimable $\pmb\lambda^T\mathbf b$) iff $\exist$ a matrix $\mathbf Q$ s.t. $\mathbf V\mathbf X=\mathbf X\mathbf Q$.
proof: ($\Leftarrow$) A generic OLS estimator takes the form $$ \pmb\lambda^T\hat{\mathbf b}_{OLS}=\pmb\lambda^T(\mathbf X^T\mathbf X)^g\mathbf X^T\mathbf y=\mathbf t^T\mathbf y $$ where $\mathbf t=\mathbf X[(\mathbf X^T\mathbf X)^g]^T\pmb\lambda$.
Now, if $\mathbf V\mathbf X=\mathbf X\mathbf Q$, we have $$ \mathbf V\mathbf t=\mathbf V\mathbf X[(\mathbf X^T\mathbf X)^g]^T\pmb\lambda=\mathbf X\mathbf Q[(\mathbf X^T\mathbf X)^g]^T\pmb\lambda\in\mathcal C(\mathbf X) $$ so $\pmb\lambda^T\hat{\mathbf b}_{OLS}$ is BLUE by corollary 4.1.
($\Rightarrow$) Let $\mathbf X^T\mathbf X=(\pmb\lambda^{(1)},\pmb\lambda^{(2)},\cdots,\pmb\lambda^{(p)})$; that is, define $\pmb\lambda^{(j)}$ to be the $j$-th column of $\mathbf X^T\mathbf X$, $j=1,2,\cdots,p$.
Now, $\pmb\lambda^{(j)T}\hat{\mathbf b}_{OLS}=\mathbf t^{(j)T}\mathbf y$ where $\mathbf t^{(j)}=\mathbf X[(\mathbf X^T\mathbf X)^g]^T\pmb\lambda^{(j)}$. Since we know that $\pmb\lambda^{(j)T}\hat{\mathbf b}_{OLS}$ is BLUE, $\mathbf V\mathbf t^{(j)}\in\mathcal C(\mathbf X)$, or, in other words, $\exist$ $\mathbf q^{(j)}$ s.t. $\mathbf V\mathbf t^{(j)}=\mathbf X\mathbf q^{(j)}\Rightarrow\mathbf V\underbrace{(\mathbf t^{(1)},\mathbf t^{(2)},\cdots,\mathbf t^{(p)})}_\mathbf T=\mathbf X\underbrace{(\mathbf q^{(1)},\mathbf q^{(2)},\cdots,\mathbf q^{(p)})}_\mathbf Q$. But $$ \begin{aligned} \mathbf V\mathbf T&=\mathbf V\left[\mathbf X[(\mathbf X^T\mathbf X)^g]^T\pmb\lambda^{(1)}\quad\mathbf X[(\mathbf X^T\mathbf X)^g]^T\pmb\lambda^{(2)}\quad\cdots\quad\mathbf X[(\mathbf X^T\mathbf X)^g]^T\pmb\lambda^{(p)}\right]\\ &=\mathbf V\mathbf X[(\mathbf X^T\mathbf X)^g]^T[\pmb\lambda^{(1)}\quad\pmb\lambda^{(2)}\quad\cdots\quad\pmb\lambda^{(p)}]\\ &=\mathbf V\underbrace{\mathbf X[(\mathbf X^T\mathbf X)^g]^T\mathbf X^T}_{\mathbf P_\mathbf X}\mathbf X=\mathbf V\mathbf X \end{aligned} $$
Example: Back to heteroskedastic regression through the origin.
$y_i=\beta x_i+e_i$, $\mathbb Ee_i=0$, $Var(e_i)=\sigma^2x_i^2$, $Cov(e_i,e_j)=0$ ($i\not=j$). Assume $x_i\not=0$, $\forall i$. $$ \mathbf V=\begin{pmatrix}x_1^2&&&\\ &x_2^2&&\\ &&\ddots&\\ &&&x_N^2\end{pmatrix}\quad \mathbf X=\begin{pmatrix} x_1\\x_2\\\vdots\\x_N \end{pmatrix} $$ We showed that $\hat{\mathbf b}_{GLS}=\frac{1}{N}\sum_{i=1}^N\frac{y_i}{x_i}$ and $Var(\hat{\mathbf b}_{GLS})=\frac{\sigma^2}{N}$. Of course, $$ \hat{\mathbf b}_{OLS}=\frac{\sum_{i=1}^Nx_iy_i}{\sum_{i=1}^Nx_i^2}\quad Var(\hat{\mathbf b}_{OLS})=\frac{\sum_{i=1}^Nx_i^2Var(y_i)}{\left(\sum_{i=1}^Nx_i^2\right)^2}=\sigma^2\frac{\sum_{i=1}^Nx_i^4}{\left(\sum_{i=1}^Nx_i^2\right)^2}. $$ Here is a slick way of comparing these variances:
Let $U$ be a random variable taking values $x_1^2,x_2^2,\cdots,x_N^2$ with equal probabilities $1/N$. Then $\mathbb EU=\frac{1}{N}\sum x_i^2$ and $\mathbb EU^2=\frac{1}{N}\sum x_i^4$, and $$ Var(U)=\frac{1}{N}\sum x_i^4-\left(\frac{1}{N}\sum x_i^2\right)^2\geq0\\ \Rightarrow \frac{1}{N}\sum x_i^4\geq\frac{1}{N^2}\left(\sum x_i^2\right)^2\\ \Rightarrow N\sum x_i^4\geq\left(\sum x_i^2\right)^2,\quad\text{or}\quad \frac{\sum x_i^4}{(\sum x_i^2)^2}\geq\frac{1}{N} $$ with equality iff $Var(U)=0$; that is, iff $|x_i|=c>0$ for $\forall i=1,2,\cdots,N$.
Of course, if $|x_i|=c>0$, $\forall i=1,2,\cdots,N$, $$ \begin{aligned} \hat{\mathbf b}_{GLS}&=\frac{1}{N}\sum\frac{y_i}{x_i}=\frac{1}{N}\sum\frac{x_iy_i}{x_i^2}=\frac{1}{N}\sum\frac{x_iy_i}{c^2}\\ &=\frac{1}{Nc^2}\sum x_iy_i=\frac{\sum x_iy_i}{\sum x_i^2}=\hat{\mathbf b}_{OLS}. \end{aligned} $$ In other words, the OLS estimator $\hat{\mathbf b}_{OLS}$ is BLUE iff $|x_i|=c>0$, $\forall i$.
Now, let’s consider Result 4.4 in this context. When does there exist a $\mathbf Q$ such that $\mathbf V\mathbf X=\mathbf X\mathbf Q$? $$ \mathbf V\mathbf X=\begin{pmatrix}x_1^3\\x_2^3\\\vdots\\x_N^3\end{pmatrix}\quad\mathbf X\mathbf Q=\begin{pmatrix}x_1\\x_2\\\vdots\\x_N\end{pmatrix}q $$ The only way that $\mathbf V\mathbf X=\mathbf X\mathbf Q$ is if $x_i^3=x_iq$, $\forall i\Rightarrow $ $x_i^2=q=c^2>0$, $\forall i\Rightarrow $ $|x_i|=c>0$, $\forall i$.
Example: $\mathbf y=\mathbf X\mathbf b+\mathbf e$, $\mathbb E\mathbf e=\mathbf 0$, $Cov(\mathbf e)=\sigma^2\mathbf V=\sigma^2\mathbf I_N+\tau^2\mathbf 1_N\mathbf 1_N^T=\sigma^2(\mathbf I_N+\frac{\tau^2}{\sigma^2}\mathbf 1_N\mathbf 1_N^T)$. Assume that $\mathbf X\mathbf b=\begin{pmatrix}\mathbf 1_N&\mathbf X^\star\end{pmatrix}\begin{pmatrix}b_1\\\mathbf b_2\end{pmatrix}$
Can we find $\mathbf Q$ s.t. $\mathbf V\mathbf X=\mathbf X\mathbf Q$? $$ \begin{aligned} \sigma^2\mathbf V\mathbf X&=(\sigma^2\mathbf I_N+\tau^2\mathbf 1_N\mathbf 1_N^T)\begin{pmatrix}\mathbf 1_N&\mathbf X^\star\end{pmatrix}\\ &=\begin{pmatrix}\sigma^2\mathbf 1_N+N\tau^2\mathbf 1_N&\sigma^2\mathbf X^*+\tau^2\mathbf 1_N\mathbf 1_N^T\mathbf X^\star\end{pmatrix} \end{aligned} $$
$$ \mathbf X\mathbf Q=\begin{pmatrix}\mathbf 1_N&\mathbf X^\star\end{pmatrix}\begin{pmatrix}Q_{11}&\mathbf Q_{12}\\\mathbf Q_{21}&\mathbf Q_{22}\end{pmatrix}=\begin{pmatrix}\mathbf 1_N Q_{11}+\mathbf X^\star\mathbf Q_{21}&\mathbf 1_N\mathbf Q_{12}+\mathbf X^\star\mathbf Q_{22}\end{pmatrix} $$
Matching up terms, we have $$ \begin{aligned} &Q_{11}=\sigma^2+N\tau^2\quad\mathbf Q_{21}=\mathbf 0\quad\mathbf Q_{12}=\tau^2\mathbf 1_N^T\mathbf X^*\quad\mathbf Q_{22}=\sigma^2\mathbf I_{p-1}\\ \Rightarrow &\sigma^2\mathbf V\mathbf X=\mathbf X\mathbf Q\quad\text{or}\quad\mathbf V\mathbf X=\mathbf X\mathbf Q',\text{ where }\mathbf Q'=\mathbf Q/\sigma^2\\ \Rightarrow &\text{ OLS estimators are BLUE}. \end{aligned} $$
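The matching-up step can be verified numerically. A minimal sketch (numpy is assumed; $N$, $p$, $\sigma^2$, $\tau^2$, and $\mathbf X^\star$ below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
N, p = 10, 3
sigma2, tau2 = 1.0, 0.7                         # illustrative variance components
Xstar = rng.normal(size=(N, p - 1))
X = np.column_stack([np.ones(N), Xstar])        # design with an intercept column
V = np.eye(N) + (tau2 / sigma2) * np.ones((N, N))

# Q' built from the matching-up step above (already divided by sigma^2)
Q = np.zeros((p, p))
Q[0, 0] = (sigma2 + N * tau2) / sigma2
Q[0, 1:] = (tau2 / sigma2) * (np.ones(N) @ Xstar)
Q[1:, 1:] = np.eye(p - 1)
print(np.allclose(V @ X, X @ Q))                # True, so OLS estimators are BLUE here
```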
4.7 Best Estimation in a Constrained Parameter Space
Recall the “constrained” problem from Section 3.9: $$ \mathbf y=\mathbf X\mathbf b+\mathbf e,\quad\text{where }\mathbf b\in\mathcal T=\left\{\mathbf b\in\R^p:\mathbf P^T\mathbf b=\delta\right\} $$ $\mathbf P$ is a fixed $p\times q$ matrix of rank $q$ and $\delta\in\mathcal C(\mathbf P^T)$ is fixed.
Recall the restricted normal equations: $$ \begin{pmatrix} \mathbf X^T\mathbf X&\mathbf P\\ \mathbf P^T&\mathbf 0 \end{pmatrix}\begin{pmatrix}\mathbf b\\\pmb\theta\end{pmatrix}= \begin{pmatrix}\mathbf X^T\mathbf y\\\delta\end{pmatrix} $$ The R.N.E.s are consistent (for all $\mathbf y\in\mathbb R^N$ and $\delta\in\mathcal C(\mathbf P^T)$). As before, let $\begin{pmatrix}\hat{\mathbf b}_H\\\hat{\pmb\theta}_H\end{pmatrix}$ denote a generic solution to the R.N.E.s.
We now show that $\pmb\lambda^T\hat{\mathbf b}_H$ is BLUE for estimable $\pmb\lambda^T\mathbf b$ in the constrained model.
Lemma 4.2: Assume that $\pmb\lambda^T\mathbf b$ is estimable in the constrained model. Then the following equations are consistent: $$ \begin{pmatrix} \mathbf X^T\mathbf X&\mathbf P\\ \mathbf P^T&\mathbf 0 \end{pmatrix}\begin{pmatrix}\mathbf v_1\\\mathbf v_2\end{pmatrix}= \begin{pmatrix}\pmb\lambda\\\mathbf 0\end{pmatrix}\quad(\star) $$ proof: Result 3.7a implies that $\pmb\lambda=\mathbf X^T\mathbf a+\mathbf P\mathbf d$ for some $\mathbf a\in\R^N$ and $\mathbf d\in\R^q$.
Now consider the R.N.E.s with $\mathbf y=\mathbf a$ and $\delta=\mathbf 0$: $$ \begin{pmatrix} \mathbf X^T\mathbf X&\mathbf P\\ \mathbf P^T&\mathbf 0 \end{pmatrix}\begin{pmatrix}\mathbf w_1\\\mathbf w_2\end{pmatrix}= \begin{pmatrix}\mathbf X^T\mathbf a\\\mathbf 0\end{pmatrix} $$ Since $\mathbf 0\in\mathcal C(\mathbf P^T)$, these equations are consistent. Let $\mathbf w^\star =\begin{pmatrix}\mathbf w_1^\star\\\mathbf w^\star_2\end{pmatrix}$ be a solution. Then $$ \begin{pmatrix} \mathbf X^T\mathbf X&\mathbf P\\ \mathbf P^T&\mathbf 0 \end{pmatrix}\begin{pmatrix}\mathbf w_1^\star\\\mathbf w_2^\star+\mathbf d\end{pmatrix}= \begin{pmatrix}\mathbf X^T\mathbf X\mathbf w^*_1+\mathbf P(\mathbf w^\star_2+\mathbf d)\\\mathbf P^T\mathbf w_1^\star\end{pmatrix} =\begin{pmatrix}\mathbf X^T\mathbf a+\mathbf P\mathbf d\\\mathbf 0\end{pmatrix}=\begin{pmatrix}\pmb\lambda\\\mathbf 0\end{pmatrix}. $$
Lemma 4.3: $\pmb\lambda^T\hat{\mathbf b}_H$ is a linear unbiased estimator of $\pmb\lambda^T\mathbf b$ (in the constrained model).
proof: Let $\mathbf v^\star =\begin{pmatrix}\mathbf v^\star_1\\\mathbf v^\star_2\end{pmatrix}$ be a solution to ($\star$). Then $$ \begin{aligned} \pmb\lambda^T\hat{\mathbf b}_H&=\begin{pmatrix}\pmb\lambda^T&\mathbf 0\end{pmatrix}\begin{pmatrix}\hat{\mathbf b}_H\\\hat{\pmb\theta}_H\end{pmatrix}\stackrel{(\star)}{=}\begin{pmatrix}\mathbf v^{\star T}_1&\mathbf v^{\star T}_2\end{pmatrix}\begin{pmatrix}\mathbf X^T\mathbf X&\mathbf P\\\mathbf P^T&\mathbf 0\end{pmatrix}\begin{pmatrix}\hat{\mathbf b}_H\\\hat{\pmb\theta}_H\end{pmatrix}\\ &=\begin{pmatrix}\mathbf v^{\star T}_1&\mathbf v^{\star T}_2\end{pmatrix}\begin{pmatrix}\mathbf X^T\mathbf y\\\delta\end{pmatrix}=\mathbf v^{\star T}_1\mathbf X^T\mathbf y+\mathbf v^{\star T}_2\delta\to\text{linear estimator} \end{aligned} $$
$$ \begin{aligned} \mathbb E(\pmb\lambda^T\hat{\mathbf b}_H)&=\mathbb E(\mathbf v^{\star T}_1\mathbf X^T\mathbf y+\mathbf v^{\star T}_2\delta)\\ &=\mathbf v^{\star T}_1\mathbf X^T\mathbf X\mathbf b+\mathbf v^{\star T}_2\delta\\ &\stackrel{(\star)}{=}(\pmb\lambda-\mathbf P\mathbf v^{\star }_2)^T\mathbf b+\mathbf v^{\star T}_2\delta\\ &=\pmb\lambda^T\mathbf b-\mathbf v^{\star T}_2(\mathbf P^T\mathbf b-\delta)\\ &=\pmb\lambda^T\mathbf b. \end{aligned} $$
Result 4.5: Under the G-M assumptions, $\pmb\lambda^T\hat{\mathbf b}_H$ is the BLUE of the estimable function $\pmb\lambda^T\mathbf b$ in the constrained model.
proof: Let $\mathbf a^T\mathbf y+c$ be an unbiased estimator of $\pmb\lambda^T\mathbf b$ under the constraint. Again, $\pmb\lambda=\mathbf X^T\mathbf a+\mathbf P\mathbf d$ for some $\mathbf a\in\R^N$ and $\mathbf d\in\R^q$. Now, $$ Var(\mathbf a^T\mathbf y+c)=Var(\pmb\lambda^T\hat{\mathbf b}_H)+Var(\mathbf a^T\mathbf y+c-\pmb\lambda^T\hat{\mathbf b}_H)+2Cov(\pmb\lambda^T\hat{\mathbf b}_H,\mathbf a^T\mathbf y+c-\pmb\lambda^T\hat{\mathbf b}_H) $$ From the previous proof, we know that $\pmb\lambda^T\hat{\mathbf b}_H=\mathbf v^{\star T}_1\mathbf X^T\mathbf y+\mathbf v^{\star T}_2\delta$. Thus, $$ \begin{aligned} Cov(\pmb\lambda^T\hat{\mathbf b}_H,\mathbf a^T\mathbf y+c-\pmb\lambda^T\hat{\mathbf b}_H)&=Cov(\mathbf v_1^{\star T}\mathbf X^T\mathbf y,\mathbf a^T\mathbf y-\mathbf v_1^{\star T}\mathbf X^T\mathbf y)\\ &=Cov(\mathbf v^{\star T}_1\mathbf X^T\mathbf y,(\mathbf a-\mathbf X\mathbf v_1^{\star })^T\mathbf y)\\ &=\mathbf v_1^{\star T}\mathbf X^TCov(\mathbf y)(\mathbf a-\mathbf X\mathbf v_1^{\star })\\ &=\sigma^2\mathbf v_1^{\star T}(\mathbf X^T\mathbf a-\mathbf X^T\mathbf X\mathbf v_1^{\star })\\ &\stackrel{(\star)}{=}\sigma^2\mathbf v_1^{\star T}(\pmb\lambda-\mathbf P\mathbf d-(\pmb\lambda-\mathbf P\mathbf v_2^{\star }))\\ &=\sigma^2\mathbf v_1^{\star T}\mathbf P(\mathbf v_2^{\star }-\mathbf d)\\ &\stackrel{(\star)}{=}0. \end{aligned} $$ Hence $Var(\mathbf a^T\mathbf y+c)\geq Var(\pmb\lambda^T\hat{\mathbf b}_H)$.
Chapter 5: Distributional Theory
5.2 Multivariate Normal Distribution
Let $X\sim\mathcal N(\mu,\sigma^2)$ and $Y=aX+b$ for $a,b\in\mathbb R$. Then $Y\sim\mathcal N(a\mu+b,a^2\sigma^2)$.
What about the multivariate version of this?
In math statistics class, we defined the MVN distribution through the density. We said $\mathbf X\sim\mathcal N_p(\pmb\mu,\mathbf V)$, where $\pmb\mu\in\R^p$ and $\mathbf V_{p\times p}$ is positive definite, if $$ p_\mathbf X(\mathbf x)=\frac{1}{(2\pi)^{p/2}|\mathbf V|^{1/2}}e^{-\frac{1}{2}(\mathbf x-\pmb\mu)^T\mathbf V^{-1}(\mathbf x-\pmb\mu)}. $$ Now, what about a linear function of $\mathbf X$? Suppose $\mathbf A_{q\times p}$ and $\mathbf b\in\R^q$. Define $\mathbf Y=\mathbf A\mathbf X+\mathbf b$. Is $\mathbf Y$ MVN? $$ Cov(\mathbf Y)=\mathbf ACov(\mathbf X)\mathbf A^T=\mathbf A\mathbf V\mathbf A^T $$ If $q>p$, then $\mathbf A\mathbf V\mathbf A^T$ cannot be positive definite since $rank(\mathbf A\mathbf V\mathbf A^T)<q$.
So $\mathbf Y=\mathbf A\mathbf X+\mathbf b$ need not be MVN in our old “density” sense.
However, $\mathbf Y=\mathbf A\mathbf X+\mathbf b$ is MVN in a more general sense.
Def: The mgf of a $p$-dimensional random vector $\mathbf X$ is defined as $$ M_\mathbf X(\mathbf t)=\mathbb Ee^{\mathbf t^T\mathbf X} $$ provided there exists $h>0$ such that this expectation exists (is finite) for all $\mathbf t\in S$, where $S:=\{\mathbf t\in\mathbb R^p:t_i\in(-h,h),i=1,2,\cdots,p\}$.
Def: Let $\mathbf Z=(Z_1,\cdots,Z_p)^T$ where $\{Z_i\}_{i=1}^p$ are i.i.d. $\mathcal N(0,1)$. Then $\mathbf Z$ has the standard multivariate normal distribution, denoted by $\mathcal N_p(\mathbf 0,\mathbf I)$. Its pdf is given by $$ p_\mathbf Z(\mathbf z)=(2\pi)^{-p/2}e^{-\frac{1}{2}\mathbf z^T\mathbf z} $$ Q: What is the mgf of $\mathbf Z$ (assuming existence)?
Fix $\mathbf t\in\R^p$, $$ \begin{aligned} M_\mathbf Z(\mathbf t)=\mathbb Ee^{\mathbf t^T\mathbf Z}&=\int_{\R^p}(2\pi)^{-p/2}e^{-\frac{1}{2}\mathbf z^T\mathbf z+\mathbf t^T\mathbf z}\,d\mathbf z\\ &=\int_{\mathbb R^p}(2\pi)^{-p/2}e^{-\frac{1}{2}\sum_{i=1}^p(z_i^2-2t_iz_i)}\,d\mathbf z\\ &=\prod_{i=1}^p\int_\R(2\pi)^{-1/2}e^{-\frac{1}{2}(z_i^2-2t_iz_i)}\,dz_i\\ &=\prod_{i=1}^p\int_\mathbb R(2\pi)^{-1/2}e^{-\frac{1}{2}(z_i-t_i)^2}e^{\frac{1}{2}t_i^2}\,dz_i\\ &=\prod_{i=1}^pe^{\frac{1}{2}t_i^2}=e^{\frac{1}{2}\mathbf t^T\mathbf t} \end{aligned} $$ Note that the expectation exists (is finite) for all $\mathbf t\in\mathbb R^p$.
Result 5.1: Let $X_1$, $X_2$ be two r.v.s. If the mgf’s exist, and $M_{X_1}(t)=M_{X_2}(t)$ for all $t$ in an open square around the origin, then $X_1\stackrel{d}{=}X_2$.
Result 5.2: Suppose $X_1,X_2,\cdots, X_p$ have mgfs $M_{X_i}(t_i),i=1,2,\cdots,p$. Let $X=(X^T_1,X_2^T,\cdots,X_p^T)^T$ have mgf $M_X(t)$, where $t=(t_1^T,t_2^T,\cdots,t_p^T)^T$. Then $X_1,X_2,\cdots,X_p$ are mutually independent iff $$ M_X(t)=\prod_{i=1}^pM_{X_i}(t_i) $$ for all $t$ in an open square around the origin.
Now suppose $\mathbf Z\sim\mathcal N_p(\mathbf 0,\mathbf I)$ and let $\mathbf X=\pmb\mu+\mathbf A\mathbf Z$. Then $$ M_\mathbf X(\mathbf t)=\mathbb E\left[e^{\mathbf t^T(\pmb\mu+\mathbf A\mathbf Z)}\right]=e^{\mathbf t^T\pmb\mu}\mathbb E[e^{\mathbf t^T\mathbf A\mathbf Z}]\\=e^{\mathbf t^T\pmb\mu}\mathbb E[e^{(\mathbf A^T\mathbf t)^T\mathbf Z}]=e^{\mathbf t^T\pmb\mu}M_\mathbf Z(\mathbf A^T\mathbf t)=e^{\mathbf t^T\pmb\mu+\frac{1}{2}\mathbf t^T\mathbf A\mathbf A^T\mathbf t}. $$ Thus, the distribution of $\mathbf X=\pmb\mu+\mathbf A\mathbf Z$ depends on $\mathbf A$ only through $\mathbf A\mathbf A^T$. Indeed, if $\mathbf B\mathbf B^T=\mathbf A\mathbf A^T$, then $\pmb\mu+\mathbf A\mathbf Z\stackrel{d}{=}\pmb\mu+\mathbf B\mathbf Z$. Of course, $\mathbf A\mathbf A^T=Cov(\mathbf X)$.
Def: Let $\pmb\mu\in\R^p$ and let $\mathbf V_{p\times p}$ be non-negative definite. We say that $\mathbf X\sim\mathcal N_p(\pmb\mu,\mathbf V)$ if $$ M_\mathbf X(\mathbf t)=e^{\mathbf t^T\pmb\mu+\frac{1}{2}\mathbf t^T\mathbf V\mathbf t}. $$ If $\mathbf V$ is not positive definite, then $\mathbf X$ does not have a density.
Example: $\mathbf Z\sim\mathcal N_2(\mathbf 0,\mathbf I)$. $\mathbf X=\mathbf A\mathbf Z$ where $\mathbf A=\begin{pmatrix}1&0\\0&0\end{pmatrix}$. Then $$ \begin{pmatrix}X_1\\X_2\end{pmatrix}=\begin{pmatrix}1&0\\0&0\end{pmatrix}\begin{pmatrix}Z_1\\Z_2\end{pmatrix}=\begin{pmatrix}Z_1\\0\end{pmatrix} $$
$$ \mathbf A\mathbf A^T=\begin{pmatrix}1&0\\0&0\end{pmatrix}\begin{pmatrix}1&0\\0&0\end{pmatrix}=\begin{pmatrix}1&0\\0&0\end{pmatrix} $$
$$ \mathbf X\sim\mathcal N_2\left(\mathbf 0,\begin{pmatrix}1&0\\0&0\end{pmatrix}\right) $$
$\mathbf X$ takes values in the set $L=\{(u,v)\in\R^2:v=0\}=\R\times\{0\}$.
Note that, if $p(x,y)$ is a density on $\mathbb R^2$, that is, $p(x,y)\geq0$ and $\int_{\R^2}p(x,y)dxdy=1$, then $\int_Lp(x,y)dxdy=0$. So $\mathbf X$ cannot have a density on $\mathbb R^2$.
Result 5.3: If $\mathbf X\sim\mathcal N_p(\pmb\mu,\mathbf V)$ and $\mathbf Y=\mathbf a+\mathbf B\mathbf X$ where $\mathbf a_{q\times 1}$ and $\mathbf B_{q\times p}$, then $$ \mathbf Y\sim\mathcal N_q(\mathbf a+\mathbf B\pmb\mu,\mathbf B\mathbf V\mathbf B^T). $$ proof: $M_\mathbf Y(\mathbf t)=\mathbb Ee^{\mathbf t^T\mathbf Y}=\mathbb Ee^{\mathbf t^T(\mathbf a+\mathbf B\mathbf X)}=e^{\mathbf t^T\mathbf a}\mathbb Ee^{\mathbf t^T\mathbf B\mathbf X}=e^{\mathbf t^T\mathbf a}M_\mathbf X(\mathbf B^T\mathbf t)=e^{\mathbf t^T\mathbf a}e^{\mathbf t^T\mathbf B\pmb\mu+\frac{1}{2}\mathbf t^T\mathbf B\mathbf V\mathbf B^T\mathbf t}=e^{\mathbf t^T(\mathbf a+\mathbf B\pmb\mu)+\frac{1}{2}\mathbf t^T\mathbf B\mathbf V\mathbf B^T\mathbf t}$.
Corollary 5.1: If $\mathbf X$ is MVN, then the joint distribution of any subset of the components of $\mathbf X$ is also MVN.
proof: Partition $\mathbf X$, $\pmb\mu$ and $\mathbf V$ as follows: $$ \mathbf X=\begin{pmatrix}\mathbf X_1\\\mathbf X_2\end{pmatrix}\begin{matrix}p_1\\p_2\end{matrix},\quad\pmb\mu=\begin{pmatrix}\pmb\mu_1\\\pmb\mu_2\end{pmatrix}\begin{matrix}p_1\\p_2\end{matrix},\quad\mathbf V=\begin{pmatrix}\mathbf V_{11}&\mathbf V_{12}\\\mathbf V_{21}&\mathbf V_{22}\end{pmatrix}\begin{matrix}p_1\\p_2\end{matrix} $$ Now, let $\mathbf W=\mathbf B\mathbf X$ where $\mathbf B=\begin{pmatrix}\mathbf I_{p_1}&\mathbf 0\end{pmatrix}$, so $\mathbf W=\mathbf X_1$. Then from Result 5.3, we have $$ \mathbf X_1\sim\mathcal N_{p_1}\left(\begin{pmatrix}\mathbf I_{p_1}&\mathbf 0\end{pmatrix}\begin{pmatrix}\pmb\mu_1\\\pmb\mu_2\end{pmatrix},\begin{pmatrix}\mathbf I_{p_1}&\mathbf 0\end{pmatrix}\begin{pmatrix}\mathbf V_{11}&\mathbf V_{12}\\\mathbf V_{21}&\mathbf V_{22}\end{pmatrix}\begin{pmatrix}\mathbf I_{p_1}\\\mathbf 0\end{pmatrix}\right)\Rightarrow\mathbf X_1\sim\mathcal N_{p_1}(\pmb\mu_1,\mathbf V_{11}) $$
Corollary 5.2: Suppose $\mathbf X\sim\mathcal N_p(\pmb\mu,\mathbf V)$ and $\mathbf V$ is positive definite. Then
- $\exist$ nonsingular $\mathbf A$ s.t. $\mathbf V=\mathbf A\mathbf A^T$
- $\mathbf A^{-1}(\mathbf X-\pmb\mu)\sim\mathcal N_p(\mathbf 0,\mathbf I_p)$
- the pdf of $\mathbf X$ is $(2\pi)^{-p/2}|\mathbf V|^{-1/2}e^{-\frac{1}{2}(\mathbf x-\pmb\mu)^T\mathbf V^{-1}(\mathbf x-\pmb\mu)}$.
proof is HW.
Result 5.4: Assume $\mathbf X\sim\mathcal N_p(\pmb\mu,\mathbf V)$. Partition as follows $$ \mathbf X=\begin{pmatrix}\mathbf X_1\\\mathbf X_2\\\vdots\\\mathbf X_m\end{pmatrix}\begin{matrix}p_1\\p_2\\\vdots\\p_m\end{matrix}, \quad\pmb\mu=\begin{pmatrix}\pmb\mu_1\\\pmb\mu_2\\\vdots\\\pmb\mu_m\end{pmatrix}\begin{matrix}p_1\\p_2\\\vdots\\p_m\end{matrix}, \quad\mathbf V=\begin{pmatrix}\mathbf V_{11}&\mathbf V_{12}&\cdots&\mathbf V_{1m}\\\mathbf V_{21}&\mathbf V_{22}&\cdots&\mathbf V_{2m}\\\vdots&\vdots&\ddots&\vdots\\\mathbf V_{m1}&\mathbf V_{m2}&\cdots&\mathbf V_{mm}\end{pmatrix}\begin{matrix}p_1\\p_2\\\vdots\\p_m\end{matrix} $$ Then $\mathbf X_1,\mathbf X_2,\cdots,\mathbf X_m$ are mutually independent $\Leftrightarrow$ $\mathbf V_{ij}=\mathbf 0$, $\forall i\not=j$.
proof: ($\Rightarrow$) $\mathbf V_{ij}=\mathbb E\left[(\mathbf X_i-\pmb\mu_i)(\mathbf X_j-\pmb\mu_j)^T\right]\stackrel{ind}{=}\left[\mathbb E(\mathbf X_i-\pmb\mu_i)\right]\left[\mathbb E(\mathbf X_j-\pmb\mu_j)\right]^T=\mathbf 0$.
($\Leftarrow$) $M_\mathbf X(\mathbf t)=e^{\mathbf t^T\pmb\mu+\frac{1}{2}\mathbf t^T\mathbf V\mathbf t}=e^{\sum_{i=1}^m\mathbf t_i^T\pmb\mu_i+\frac{1}{2}\sum_{i=1}^m\mathbf t_i^T\mathbf V_{ii}\mathbf t_i}=\prod_{i=1}^me^{\mathbf t_i^T\pmb\mu_i+\frac{1}{2}\mathbf t^T_i\mathbf V_{ii}\mathbf t_i}=\prod_{i=1}^mM_{\mathbf X_i}(\mathbf t_i)$. Result now follows from Result 5.2.
Corollary 5.3: $\mathbf X\sim\mathcal N_p(\pmb\mu,\mathbf V)$. Let $\mathbf Y_1=\mathbf a_1+\mathbf B_1\mathbf X$ and $\mathbf Y_2=\mathbf a_2+\mathbf B_2\mathbf X$. Then $\mathbf Y_1$ and $\mathbf Y_2$ are independent iff $\mathbf B_1\mathbf V\mathbf B_2^T=\mathbf 0$.
proof: $\mathbf Y=\begin{pmatrix}\mathbf Y_1\\\mathbf Y_2\end{pmatrix}=\begin{pmatrix}\mathbf a_1\\\mathbf a_2\end{pmatrix}+\begin{pmatrix}\mathbf B_1\\\mathbf B_2\end{pmatrix}\mathbf X$. According to Result 5.3, we have $$ \mathbf Y\sim\mathcal N\left(\begin{pmatrix}\mathbf a_1\\\mathbf a_2\end{pmatrix}+\begin{pmatrix}\mathbf B_1\\\mathbf B_2\end{pmatrix}\pmb\mu,\begin{pmatrix}\mathbf B_1\\\mathbf B_2\end{pmatrix}\mathbf V\begin{pmatrix}\mathbf B_1&\mathbf B_2\end{pmatrix}^T\right)\\=\mathcal N\left(\begin{pmatrix}\mathbf a_1\\\mathbf a_2\end{pmatrix}+\begin{pmatrix}\mathbf B_1\\\mathbf B_2\end{pmatrix}\pmb\mu,\begin{pmatrix}\mathbf B_1\mathbf V\mathbf B_1^T&\mathbf B_1\mathbf V\mathbf B_2^T\\\mathbf B_2\mathbf V\mathbf B_1^T&\mathbf B_2\mathbf V\mathbf B_2^T\end{pmatrix}\right) $$ Result follows immediately from Result 5.4.
5.3 Chi-Square and Related Distributions
Suppose that $\mathbf Z\sim\mathcal N_p(\mathbf 0,\mathbf I)$. Define $U=\mathbf Z^T\mathbf Z=\sum_{i=1}^pZ_i^2$. $$ \begin{aligned} M_U(t)&=\mathbb Ee^{t\sum_{i=1}^pZ_i^2}=\mathbb E\prod_{i=1}^pe^{tZ_i^2}=\prod_{i=1}^p\mathbb Ee^{tZ_i^2}\\ &=\prod_{i=1}^p\int_\R(2\pi)^{-1/2}e^{-\frac{1}{2}(z_i^2-2tz_i^2)}dz_i\\ &=\prod_{i=1}^p\int_\R(2\pi)^{-1/2}e^{-\frac{1}{2}(1-2t)z_i^2}dz_i\\ &=\prod_{i=1}^p(1-2t)^{-1/2}\\&=(1-2t)^{-\frac{p}{2}}\quad(t<\frac{1}{2}) \end{aligned} $$ This implies that $$ U\sim \text{Gamma}(\frac{p}{2},2)=\chi_p^2\\ p_U(u)=\frac{1}{\Gamma(p/2)2^{p/2}}u^{\frac{p}{2}-1}e^{-\frac{u}{2}}I_{\mathbb R^+}(u). $$ Def 5.6: Let $J\sim $ Poisson($\phi$) ($\phi>0$). Let $U|J=j\sim\chi^2_{p+2j}$. Then $U\sim\chi^2_{p}(\phi)$, that is, noncentral $\chi^2$ with $p$ degrees of freedom and noncentrality parameter $\phi$. The pdf of $U\sim\chi_p^2(\phi)$ is given by $$ p_U(u)=\sum_{j=0}^\infty\left[\frac{e^{-\phi}\phi^j}{j!}\right]\left[\frac{u^{\frac{p+2j-2}{2}}e^{-\frac{u}{2}}}{\Gamma(\frac{p+2j}{2})2^{\frac{p+2j}{2}}}\right]I_{\R^+}(u). $$ Result 5.5: If $U\sim\chi_p^2(\phi)$, then $$ M_U(t)=(1-2t)^{-\frac{p}{2}}e^{\frac{2\phi t}{1-2t}} $$ proof: $$ \begin{aligned} M_U(t)&=\mathbb Ee^{tU}=\mathbb E[\mathbb E[e^{tU}|J]]=\mathbb E[(1-2t)^{-(p+2J)/2}]\quad(t<\frac{1}{2})\\ &=\sum_{j=0}^\infty (1-2t)^{-(p+2j)/2}\left[\frac{e^{-\phi}\phi^j}{j!}\right]\\ &=(1-2t)^{-p/2}e^{-\phi}\sum_{j=0}^\infty\frac{\left(\frac{\phi}{1-2t}\right)^j}{j!}\\ &=(1-2t)^{-p/2}e^{-\phi}e^{\frac{\phi}{1-2t}}\\ &=(1-2t)^{-p/2}e^{\frac{2\phi t}{1-2t}}. \end{aligned} $$
Result 5.6: If $U\sim\chi_p^2(\phi)$, then $\mathbb EU=p+2\phi$, $Var(U)=2p+8\phi$.
proof: Easy; use $\mathbb EU=\mathbb E[\mathbb E[U|J]]$ and $Var(U)=Var[\mathbb E[U|J]]+\mathbb E[Var[U|J]]$.
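*Numerical sketch: the Poisson-mixture construction in Def 5.6 is easy to sanity-check by simulation. Below is a minimal Python sketch (assuming numpy; the values of $p$ and $\phi$ are just illustrative) that draws $U$ by first drawing $J\sim$ Poisson$(\phi)$ and then $U|J\sim\chi^2_{p+2J}$, and compares the empirical mean, variance, and MGF with Results 5.5 and 5.6.

```python
import numpy as np

rng = np.random.default_rng(0)
p, phi = 4, 2.5            # degrees of freedom and noncentrality (illustrative values)
n = 200_000

# Draw U via Def 5.6: J ~ Poisson(phi), then U | J = j ~ chi^2_{p + 2j}
J = rng.poisson(phi, size=n)
U = rng.chisquare(df=p + 2 * J)

# Result 5.6: E U = p + 2 phi, Var U = 2p + 8 phi
print(U.mean(), p + 2 * phi)
print(U.var(), 2 * p + 8 * phi)

# Result 5.5: M_U(t) = (1 - 2t)^{-p/2} exp(2 phi t / (1 - 2t)), valid for t < 1/2
for t in (0.05, 0.1, 0.15):
    print(np.mean(np.exp(t * U)),
          (1 - 2 * t) ** (-p / 2) * np.exp(2 * phi * t / (1 - 2 * t)))
```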
Result 5.7: If $\{U_i\}_{i=1}^m$ are independent $\chi_{p_i}^2(\phi_i)$, then $U=\sum_{i=1}^mU_i\sim\chi^2_{\sum p_i}(\sum\phi_i)$.
proof: Easy; $M_U(t)=\prod_{i=1}^mM_{U_i}(t)$.
Result 5.8: If $X\sim\mathcal N(\mu,1)$, then $U=X^2\sim\chi^2_1(\mu^2/2)$.
proof: $$ \begin{aligned} M_U(t)&=\mathbb Ee^{tX^2}=\int_\R \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}(x-\mu)^2}e^{tx^2}\,dx\\ &=e^{-\mu^2/2}\int_\R \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}(x^2-2x\mu-2tx^2)}\,dx\\ &=e^{-\mu^2/2+\mu^2/(2(1-2t))}\int_\R\frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}(1-2t)\left(x-\frac{\mu}{1-2t}\right)^2}\,dx\\ &=(1-2t)^{-\frac{1}{2}}e^{\frac{\mu^2}{2}\frac{2t}{1-2t}}. \end{aligned} $$
Result 5.9: If $\mathbf X\sim\mathcal N_p(\pmb\mu,\mathbf I)$, then $W=\mathbf X^T\mathbf X\sim\chi^2_p(\pmb\mu^T\pmb\mu/2)$.
proof: $W=\sum_{i=1}^pU_i$ where $U_i\sim\chi^2_1(\mu_i^2/2)$ $\Rightarrow W\sim\chi^2_p(\sum\mu_i^2/2)$ (By Result 5.7).
Result 5.10: If $\mathbf X\sim\mathcal N_p(\pmb\mu,\mathbf V)$ where $\mathbf V$ is nonsingular, then $W=\mathbf X^T\mathbf V^{-1}\mathbf X\sim\chi^2_p(\frac{1}{2}\pmb\mu^T\mathbf V^{-1}\pmb\mu)$.
proof: $\exist$ a nonsingular matrix $\mathbf A$ s.t. $\mathbf A^T\mathbf A=\mathbf V$. Define $\mathbf Z=\mathbf A^{-T}\mathbf X$. Then $\mathbf Z\sim\mathcal N_p(\mathbf A^{-T}\pmb\mu,\mathbf A^{-T}\mathbf V\mathbf A^{-1})=\mathcal N_p(\mathbf A^{-T}\pmb\mu,\mathbf I)$. Now, by Result 5.9, we have $$ W=\mathbf X^T\mathbf V^{-1}\mathbf X=\mathbf X^T\mathbf A^{-1}\mathbf A^{-T}\mathbf X=\mathbf Z^T\mathbf Z\sim\chi^2_p\left(\frac{\pmb\mu^T\mathbf A^{-1}\mathbf A^{-T}\pmb\mu}{2}\right)\\ \Rightarrow W\sim\chi^2_p\left(\frac{1}{2}\pmb\mu^T\mathbf V^{-1}\pmb\mu\right). $$
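*Numerical sketch of Result 5.10 (assuming numpy; the particular $\pmb\mu$ and $\mathbf V$ below are arbitrary illustrations): simulate $\mathbf X\sim\mathcal N_p(\pmb\mu,\mathbf V)$ via a Cholesky factor of $\mathbf V$ and compare the first two moments of $W=\mathbf X^T\mathbf V^{-1}\mathbf X$ with those of $\chi^2_p(\frac{1}{2}\pmb\mu^T\mathbf V^{-1}\pmb\mu)$ from Result 5.6.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 3
mu = np.array([1.0, -0.5, 2.0])                 # illustrative mean
A = rng.standard_normal((p, p))
V = A @ A.T + p * np.eye(p)                     # an arbitrary positive definite V
Vinv = np.linalg.inv(V)

L = np.linalg.cholesky(V)                       # V = L L^T, so X = mu + L Z has Cov(X) = V
Z = rng.standard_normal((200_000, p))
X = mu + Z @ L.T
W = np.einsum('ni,ij,nj->n', X, Vinv, X)        # W = X^T V^{-1} X, one value per draw

phi = 0.5 * mu @ Vinv @ mu                      # noncentrality from Result 5.10
print(W.mean(), p + 2 * phi)                    # moments of chi^2_p(phi) from Result 5.6
print(W.var(), 2 * p + 8 * phi)
```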
Result 5.11: Let $U\sim\chi^2_p(\phi)$. Then $P(U>c)$ is strictly increasing in $\phi$ for fixed $c$ and $p$.
*More explanation: Suppose $\phi'>\phi$, $U'\sim\chi^2_p(\phi')$, $U\sim\chi^2_p(\phi)$. Then $U'$ is stochastically larger than $U$: $P(U'>c)>P(U>c)$ for each fixed $c>0$, i.e., $F_{U'}(c)<F_U(c)$.
proof: Define $$ v_k=P(\chi^2_k>c)=\int_c^\infty \frac{v^{k/2-1}e^{-v/2}}{\Gamma(k/2)2^{k/2}}\,dv $$ Claim $v_{k+2}>v_k$: $$ \begin{aligned} v_{k+2}&=\int_c^\infty \frac{v^{(k+2)/2-1}e^{-v/2}}{\Gamma((k+2)/2)2^{(k+2)/2}}\,dv\\ &=\frac{1}{k\Gamma(k/2)2^{k/2}}\int_c^\infty v^{k/2}e^{-v/2}\,dv\\ &=\frac{1}{k\Gamma(k/2)2^{k/2}}\left[-2v^{k/2}e^{-v/2}\bigg|_c^\infty +k\int_c^\infty v^{k/2-1}e^{-v/2}\,dv\right]\\ &=\frac{2c^{k/2}e^{-c/2}}{k\Gamma(k/2)2^{k/2}}+\frac{1}{\Gamma(k/2)2^{k/2}}\int_c^\infty v^{k/2-1}e^{-v/2}\,dv\\ &=\text{something positive} + v_k \end{aligned} $$ So claim is established.
Now $P(U>c)=\sum_{j=0}^\infty\left[\frac{e^{-\phi}\phi^j}{j!}\right]v_{p+2j}$ and $$ \begin{aligned} \frac{d}{d\phi}P(U>c)&=\frac{d}{d\phi}\left[e^{-\phi}v_p+\sum_{j=1}^\infty\left[\frac{e^{-\phi}\phi^j}{j!}\right]v_{p+2j}\right]\\ &=-e^{-\phi}v_p+\sum_{j=1}^\infty\left[\frac{e^{-\phi}\phi^{j-1}}{(j-1)!}-\frac{e^{-\phi}\phi^j}{j!}\right]v_{p+2j}\\ &=\sum_{j=1}^\infty\frac{e^{-\phi}\phi^{j-1}}{(j-1)!}v_{p+2j}-\sum_{j=0}^\infty\frac{e^{-\phi}\phi^j}{j!}v_{p+2j}\\ &=\sum_{k=0}^\infty\frac{e^{-\phi}\phi^k}{k!}v_{p+2(k+1)}-\sum_{k=0}^\infty\frac{e^{-\phi}\phi^k}{k!}v_{p+2k}\\ &=\sum_{k=0}^\infty\frac{e^{-\phi}\phi^k}{k!}[v_{p+2(k+1)}-v_{p+2k}]\\ &>0\quad(\text{By claim}). \end{aligned} $$
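*Numerical sketch: Result 5.11 can also be seen from the series $P(U>c)=\sum_j\frac{e^{-\phi}\phi^j}{j!}v_{p+2j}$ used in the proof. A minimal Python sketch (assuming scipy; `tail_prob` and the chosen $c$, $p$ are just illustrative, and the series is truncated):

```python
import numpy as np
from scipy.stats import chi2, poisson

def tail_prob(c, p, phi, j_max=200):
    """P(U > c) for U ~ chi^2_p(phi), via the Poisson-mixture series (truncated at j_max)."""
    j = np.arange(j_max)
    return np.sum(poisson.pmf(j, phi) * chi2.sf(c, p + 2 * j))

c, p = 5.0, 3
for phi in (0.0, 0.5, 1.0, 2.0, 4.0):
    print(phi, tail_prob(c, p, phi))    # strictly increasing in phi, as Result 5.11 says
```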
Def 5.7: $U_1\perp U_2$, $U_1\sim\chi^2_{p_1}$, $U_2\sim\chi^2_{p_2}$. Then $F=\frac{U_1/p_1}{U_2/p_2}\sim F_{p_1,p_2}$.
Def 5.8: $U_1\perp U_2$, $U_1\sim\chi^2_{p_1}(\phi)$, $U_2\sim\chi^2_{p_2}$. Then $F=\frac{U_1/p_1}{U_2/p_2}\sim F_{p_1,p_2}(\phi)$ (Noncentral F).
Result 5.13: $W\sim F_{p_1,p_2}(\phi)$. For fixed $p_1,p_2$ and $c>0$, $P(W>c)$ is strictly increasing in $\phi$.
*In other words, if $\phi'>\phi$ and $W'\sim F_{p_1,p_2}(\phi')$ and $W\sim F_{p_1,p_2}(\phi)$, then $W'$ is stochastically larger than $W$.
proof: $$ \begin{aligned} P(W>c)&=P\left(\frac{U_1/p_1}{U_2/p_2}>c\right)=P\left(U_1>\frac{p_1}{p_2}cU_2\right)\\ &=\int_0^\infty P\left(U_1>\frac{p_1}{p_2}cU_2\bigg|U_2=u_2\right)f_{U_2}(u_2)\,du_2\\ &=\int_0^\infty P\left(U_1>\frac{p_1}{p_2}cu_2\right)f_{U_2}(u_2)\,du_2 \end{aligned} $$ $W'=\frac{U_1'/p_1}{U_2/p_2}$, $W=\frac{U_1/p_1}{U_3/p_2}$, $U_1\sim\chi^2_{p_1}(\phi)$, $U'_1\sim\chi^2_{p_1}(\phi')$, $U_2\sim\chi^2_{p_2}$, $U_3\sim\chi^2_{p_2}$. $$ P(W'>c)-P(W>c)=\int_0^\infty\underbrace{\left[P\left(U'_1>\frac{p_1}{p_2}cu_2\right)-P\left(U_1>\frac{p_1}{p_2}cu_2\right)\right]}_{>0\text{ for all }u_2>0\text{ due to Result 5.11}}f_{U_2}(u_2)\,du_2>0. $$
Def 5.9: $U\sim \mathcal N(\mu,1)$ and $V\sim\chi^2_k$, $U\perp V$. Then $T=\frac{U}{\sqrt{V/k}}\sim t_k(\mu)$, the noncentral Student's $t$ distribution with $k$ d.f. and noncentrality parameter $\mu$. If $\mu=0$, this is the usual (central) Student's $t$ distribution.
*Note that $T\sim t_k(\mu)$, $T=\frac{U}{\sqrt{V/k}}$, $T^2=\frac{U^2}{V/k}\sim \frac{\chi^2_1(\mu^2/2)}{\chi^2_k}=F_{1,k}(\mu^2/2)$.
5.4 Distribution of Quadratic Forms
Lemma 5.1: Let $\mathbf A$ be a $p\times p$ symmetric matrix. Then $\mathbf A$ is idempotent with rank $s$ iff $\exist $ a $p\times s$ matrix $\mathbf G$ s.t. $\mathbf G^T\mathbf G=\mathbf I_s$ and $\mathbf G\mathbf G^T=\mathbf A$.
proof: ($\Leftarrow$) $\mathbf A^2=\mathbf G\mathbf G^T\mathbf G\mathbf G^T=\mathbf G\mathbf G^T=\mathbf A$ so $\mathbf A$ is idempotent. $s=rank(\mathbf G^T\mathbf G)=rank(\mathbf G\mathbf G^T)=rank(\mathbf A)$.
($\Rightarrow$) Since $\mathbf A^2=\mathbf A$, all eigenvalues of $\mathbf A$ are either $0$ or $1$, and $rank(\mathbf A)$ is equal to the number of nonzero eigenvalues. Since $\mathbf A$ is symmetric, we have a spectral decomposition $$ \mathbf A=\mathbf Q\pmb\Lambda\mathbf Q^T=\begin{pmatrix}\mathbf Q_1&\mathbf Q_2\end{pmatrix}\begin{pmatrix}\mathbf I_s&\mathbf 0\\\mathbf 0&\mathbf 0\end{pmatrix}\begin{pmatrix}\mathbf Q_1^T\\\mathbf Q_2^T\end{pmatrix}=\mathbf Q_1\mathbf Q_1^T $$ $\mathbf Q_1^T\mathbf Q_1=\mathbf I_s$ because the columns of $\mathbf Q_1$ are orthonormal. So we may take $\mathbf G=\mathbf Q_1$.
Result 5.14: Let $\mathbf X\sim\mathcal N_p(\pmb\mu,\mathbf I)$. If $\mathbf A$ is symmetric and idempotent with rank $s$, then $$ \mathbf X^T\mathbf A\mathbf X\sim\chi^2_s\left(\frac{1}{2}\pmb\mu^T\mathbf A\pmb\mu\right). $$ proof: By Lemma 5.1, $\exist$ $\mathbf G$ s.t. $\mathbf G^T\mathbf G=\mathbf I_s$ and $\mathbf G\mathbf G^T=\mathbf A$.
Now, $\mathbf G^T\mathbf X\sim\mathcal N_s(\mathbf G^T\pmb\mu,\mathbf I_s)$. Result 5.9 $\Rightarrow$ $\mathbf X^T\mathbf A\mathbf X=(\mathbf G^T\mathbf X)^T(\mathbf G^T\mathbf X)\sim\chi^2_s\left(\frac{1}{2}\pmb\mu^T\mathbf G\mathbf G^T\pmb\mu\right)=\chi^2_s\left(\frac{1}{2}\pmb\mu^T\mathbf A\pmb\mu\right)$.
Result 5.15: Let $\mathbf X\sim\mathcal N_p(\pmb\mu,\mathbf V)$ with $\mathbf V$ nonsingular. Let $\mathbf A$ be a symmetric matrix. If $\mathbf A\mathbf V$ is idempotent with rank $s$, then $$ \mathbf X^T\mathbf A\mathbf X\sim\chi^2_s\left(\frac{1}{2}\pmb\mu^T\mathbf A\pmb\mu\right) $$ proof: $\mathbf V=\pmb\Gamma\pmb\Gamma^T$. Define $\mathbf B=\pmb\Gamma^T\mathbf A\pmb\Gamma$ so that $\mathbf A=(\pmb\Gamma^T)^{-1}\mathbf B\pmb\Gamma^{-1}$. Now $\mathbf A\mathbf V=\mathbf A\mathbf V\mathbf A\mathbf V\Rightarrow \mathbf A=\mathbf A\mathbf V\mathbf A$ because $\mathbf V$ is nonsingular. Thus $\mathbf B=\pmb\Gamma^T\mathbf A\pmb\Gamma=\pmb\Gamma^T\mathbf A\mathbf V\mathbf A\pmb\Gamma=\pmb\Gamma^T\mathbf A\pmb\Gamma\pmb\Gamma^T\mathbf A\pmb\Gamma=\mathbf B^2$. So $\mathbf B$ is idempotent. $\mathbf B$ is also symmetric. Also, $rank(\mathbf B)=rank(\mathbf A)=rank(\mathbf A\mathbf V)=s$.
Now, $\mathbf Y=\pmb\Gamma^{-1}\mathbf X\sim\mathcal N_p(\pmb\Gamma^{-1}\pmb\mu,\mathbf I_p)$. Result 5.14 $\Rightarrow$ $\mathbf X^T\mathbf A\mathbf X=\mathbf X^T(\pmb\Gamma^T)^{-1}\pmb\Gamma^T\mathbf A\pmb\Gamma\pmb\Gamma^{-1}\mathbf X=\mathbf Y^T\mathbf B\mathbf Y\sim\chi^2_s\left(\frac{1}{2}\pmb\mu^T(\pmb\Gamma^{-1})^T\mathbf B\pmb\Gamma^{-1}\pmb\mu\right)=\chi^2_s\left(\frac{1}{2}\pmb\mu^T\mathbf A\pmb\mu\right)$.
Back to G-M model $\mathbf y\sim\mathcal N_N(\mathbf X\mathbf b,\sigma^2\mathbf I)$, $r=rank(\mathbf X)$. We will apply Result 5.15 with $\pmb\mu=\mathbf X\mathbf b$ and $\mathbf V=\sigma^2\mathbf I$.
First, take $\mathbf A=\frac{1}{\sigma^2}(\mathbf I-\mathbf P_\mathbf X)$. Then $\mathbf A\mathbf V=\mathbf I-\mathbf P_\mathbf X$ which is idempotent and has rank $N-r$. By Result 5.15, $\mathbf y^T\mathbf A\mathbf y=\frac{|\hat{\mathbf e}|^2}{\sigma^2}=\frac{SSE}{\sigma^2}\sim\chi^2_{N-r}$; noncentrality parameter is $0$ since $\frac{1}{2\sigma^2}(\mathbf X\mathbf b)^T(\mathbf I-\mathbf P_\mathbf X)(\mathbf X\mathbf b)=0$.
Second, take $\mathbf A=\frac{1}{\sigma^2}\mathbf P_\mathbf X$. Then $\mathbf A\mathbf V=\mathbf P_\mathbf X$, which is idempotent and has rank $r$. By Result 5.15, $\mathbf y^T\mathbf A\mathbf y=\frac{|\hat{\mathbf y}|^2}{\sigma^2}=\frac{SSR}{\sigma^2}\sim\chi^2_r\left(\frac{1}{2\sigma^2}|\mathbf X\mathbf b|^2\right)$; noncentrality parameter is $\frac{1}{2\sigma^2}(\mathbf X\mathbf b)^T\mathbf P_\mathbf X(\mathbf X\mathbf b)=\frac{1}{2\sigma^2}|\mathbf X\mathbf b|^2$.
Now note that $\begin{pmatrix}\hat{\mathbf y}\\\hat{\mathbf e}\end{pmatrix}=\begin{pmatrix}\mathbf P_\mathbf X\mathbf y\\(\mathbf I-\mathbf P_\mathbf X)\mathbf y\end{pmatrix}=\begin{pmatrix}\mathbf P_\mathbf X\\\mathbf I-\mathbf P_\mathbf X\end{pmatrix}\mathbf y$. By Result 5.3, we have $$ \begin{aligned} \begin{pmatrix}\hat{\mathbf y}\\\hat{\mathbf e}\end{pmatrix}&\sim\mathcal N_{2N}\left(\begin{pmatrix}\mathbf P_\mathbf X\\\mathbf I-\mathbf P_\mathbf X\end{pmatrix}\mathbf X\mathbf b,\begin{pmatrix}\mathbf P_\mathbf X\\\mathbf I-\mathbf P_\mathbf X\end{pmatrix}\sigma^2\mathbf I\begin{pmatrix}\mathbf P_\mathbf X&\mathbf I-\mathbf P_\mathbf X\end{pmatrix}\right)\\ &\sim\mathcal N_{2N}\left(\begin{pmatrix}\mathbf X\mathbf b\\\mathbf 0\end{pmatrix},\sigma^2\begin{pmatrix}\mathbf P_\mathbf X&\mathbf 0\\\mathbf 0&\mathbf I-\mathbf P_\mathbf X\end{pmatrix}\right) \end{aligned} $$ Corollary 5.3 $\Rightarrow$ $\hat{\mathbf y}$ and $\hat{\mathbf e}$ are independent. Hence, SSE and SSR are also independent. Thus $$ F=\frac{|\hat{\mathbf y}|^2/r}{|\hat{\mathbf e}|^2/(N-r)}\sim F_{r,N-r}\left(\frac{1}{2\sigma^2}|\mathbf X\mathbf b|^2\right) $$
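*Numerical sketch of the decomposition above under the G-M model (assuming numpy; the design matrix, `b`, and `sigma` below are arbitrary illustrations). `np.linalg.pinv` is used so that $\mathbf X\mathbf X^+=\mathbf X(\mathbf X^T\mathbf X)^g\mathbf X^T=\mathbf P_\mathbf X$:

```python
import numpy as np

rng = np.random.default_rng(2)
N, sigma = 60, 1.5
# An arbitrary illustrative design: intercept plus three group indicators (rank r = 3)
X = np.column_stack([np.ones(N), np.kron(np.eye(3), np.ones((20, 1)))])
b = np.array([1.0, 0.5, -0.2, 0.3])
y = X @ b + sigma * rng.standard_normal(N)

P = X @ np.linalg.pinv(X)                       # P_X, the projection onto C(X)
r = np.linalg.matrix_rank(X)
SSR = y @ P @ y                                 # |y_hat|^2
SSE = y @ (np.eye(N) - P) @ y                   # |e_hat|^2
F = (SSR / r) / (SSE / (N - r))                 # ~ F_{r, N-r}(|X b|^2 / (2 sigma^2))
print(r, SSE / (N - r), F)                      # SSE/(N-r) estimates sigma^2
```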
Result 5.16: Let $\mathbf X\sim\mathcal N_p(\pmb\mu,\mathbf V)$ and $\mathbf A$ is a symmetric matrix, and $\mathbf B$ is a $q\times p$ matrix. If $\mathbf B\mathbf V\mathbf A=\mathbf 0$, then $\mathbf B\mathbf X$ and $\mathbf X^T\mathbf A\mathbf X$ are independent.
proof: $\mathbf A=\mathbf Q\pmb\Lambda\mathbf Q^T=\begin{pmatrix}\mathbf Q_1&\mathbf Q_2\end{pmatrix}\begin{pmatrix}\pmb\Lambda_1&\mathbf 0\\\mathbf 0&\mathbf 0\end{pmatrix}\begin{pmatrix}\mathbf Q_1^T\\\mathbf Q^T_2\end{pmatrix}$ where $\mathbf Q^T\mathbf Q=\mathbf Q\mathbf Q^T=\mathbf I$ and $\pmb\Lambda_1$ is diagonal with $rank(\pmb\Lambda_1)=rank(\mathbf A)=s$. So $\mathbf A=\mathbf Q_1\pmb\Lambda_1\mathbf Q^T_1$.
Result 5.3 $\Rightarrow$ $$ \begin{pmatrix}\mathbf B\\\mathbf Q_1^T\end{pmatrix}\mathbf X\sim\mathcal N_{q+s}\left(\begin{pmatrix}\mathbf B\pmb\mu\\\mathbf Q^T_1\pmb\mu\end{pmatrix},\begin{pmatrix}\mathbf B\mathbf V\mathbf B^T&\mathbf B\mathbf V\mathbf Q_1\\\mathbf Q_1^T\mathbf V\mathbf B^T&\mathbf Q_1^T\mathbf V\mathbf Q_1\end{pmatrix}\right) $$ So, if $\mathbf B\mathbf V\mathbf Q_1=\mathbf 0$, then $\mathbf B\mathbf X$ and $\mathbf Q_1^T\mathbf X$ are independent. But if $\mathbf B\mathbf X$ and $\mathbf Q_1^T\mathbf X$ are independent, then so are $\mathbf B\mathbf X$ and $(\mathbf Q_1^T\mathbf X)^T\pmb\Lambda_1(\mathbf Q_1^T\mathbf X)=\mathbf X^T\mathbf A\mathbf X$.
Finally, if $\mathbf B\mathbf V\mathbf A=\mathbf 0$, then $\mathbf B\mathbf V\mathbf Q_1\pmb\Lambda_1\mathbf Q_1^T=\mathbf 0\Rightarrow\mathbf B\mathbf V\mathbf Q_1\pmb\Lambda_1\mathbf Q_1^T\mathbf Q_1=\mathbf 0\Rightarrow\mathbf B\mathbf V\mathbf Q_1\pmb\Lambda_1=\mathbf 0\Rightarrow\mathbf B\mathbf V\mathbf Q_1=\mathbf 0$.
Corollary 5.4: Let $\mathbf X\sim\mathcal N_p(\pmb\mu,\mathbf V)$. Suppose $\mathbf A$ and $\mathbf B$ are symmetric. If $\mathbf B\mathbf V\mathbf A=\mathbf 0$, then $\mathbf X^T\mathbf A\mathbf X$ and $\mathbf X^T\mathbf B\mathbf X$ are independent.
proof: HW.
5.5 Cochran’s Theorem
Theorem 5.1: Let $\mathbf y\sim\mathcal N_N(\pmb\mu,\sigma^2\mathbf I)$. Suppose that $\{\mathbf A_i\}_{i=1}^k$ are symmetric, idempotent matrices s.t. $rank(\mathbf A_i)=s_i$ and $\sum_{i=1}^k\mathbf A_i=\mathbf I_N$. Then $\sum_{i=1}^k s_i=N$, $\{\mathbf y^T\mathbf A_i\mathbf y\}_{i=1}^k$ are independent, and $\frac{1}{\sigma^2}\mathbf y^T\mathbf A_i\mathbf y\sim\chi^2_{s_i}\left(\frac{1}{2\sigma^2}\pmb\mu^T\mathbf A_i\pmb\mu\right)$.
proof: $N=\text{trace}(\mathbf I_N)=\text{trace}(\sum_{i=1}^k\mathbf A_i)=\sum_{i=1}^k\text{trace}(\mathbf A_i)=\sum_{i=1}^k s_i$.
By Lemma 5.1, for each $i=1,2,\cdots,k$, $\exist$ a $N\times s_i$ matrix $\mathbf Q_i$ s.t. $\mathbf Q_i\mathbf Q_i^T=\mathbf A_i$ and $\mathbf Q_i^T\mathbf Q_i=\mathbf I_{s_i}$ (Note that $rank(\mathbf Q_i)=s_i$).
Define $\mathbf Q=\begin{pmatrix}\mathbf Q_1&\mathbf Q_2&\cdots&\mathbf Q_k\end{pmatrix}_{N\times N}$. Now $\mathbf Q\mathbf Q^T=\sum_{i=1}^k\mathbf Q_i\mathbf Q_i^T=\sum_{i=1}^k\mathbf A_i=\mathbf I_N$. Since $\mathbf Q$ is square, it must be orthogonal, so $\mathbf Q^T\mathbf Q=\mathbf I_N$. $$ \mathbf Q^T\mathbf Q=\begin{pmatrix}\mathbf Q_1^T\\\mathbf Q_2^T\\\vdots\\\mathbf Q_k^T\end{pmatrix}\begin{pmatrix}\mathbf Q_1&\mathbf Q_2&\cdots&\mathbf Q_k\end{pmatrix}=\begin{pmatrix}\mathbf I_{s_1}&\mathbf 0&\cdots&\mathbf 0\\\mathbf 0&\mathbf I_{s_2}&\cdots&\mathbf 0\\\vdots&\vdots&\ddots&\vdots\\\mathbf 0&\mathbf 0&\cdots&\mathbf I_{s_k}\end{pmatrix}\Rightarrow\mathbf Q_i^T\mathbf Q_j=\mathbf 0,\forall i\not=j. $$ Now $\mathbf Q^T\mathbf y=\begin{pmatrix}\mathbf Q_1^T\mathbf y\\\mathbf Q_2^T\mathbf y\\\vdots\\\mathbf Q_k^T\mathbf y\end{pmatrix}\sim\mathcal N(\mathbf Q^T\pmb\mu,\sigma^2\mathbf I_N)$ $\Rightarrow$ $\{\mathbf Q_i^T\mathbf y\}_{i=1}^k$ are independent and $\mathbf Q_i^T\mathbf y\sim\mathcal N(\mathbf Q_i^T\pmb\mu,\sigma^2\mathbf I_{s_i})$, i.e., $\frac{1}{\sigma}\mathbf Q_i^T\mathbf y\sim\mathcal N(\mathbf Q_i^T\pmb\mu/\sigma,\mathbf I_{s_i})$.
Now, since $\mathbf y^T\mathbf A_i\mathbf y=|\mathbf Q_i^T\mathbf y|^2$, $\{\mathbf y^T\mathbf A_i\mathbf y\}_{i=1}^k$ are also independent.
And, finally, we have $$ \frac{1}{\sigma^2}\mathbf y^T\mathbf A_i\mathbf y=\frac{1}{\sigma^2}|\mathbf Q_i^T\mathbf y|^2\sim\chi^2_{s_i}\left(\frac{1}{2\sigma^2}\pmb\mu^T\mathbf A_i\pmb\mu\right). $$ Example: One-way ANOVA
$y_{ij}=\mu+\alpha_i+e_{ij}$, $i=1,2,\cdots,a$, $j=1,2,\cdots,n_i$, $e_{ij}\sim\mathcal N(0,\sigma^2)$ independent. $$ \mathbf X\mathbf b=\begin{pmatrix}\mathbf 1_{n_1}&\mathbf 1_{n_1}&\mathbf 0&\cdots&\mathbf 0\\ \mathbf 1_{n_2}&\mathbf 0&\mathbf 1_{n_2}&\cdots&\mathbf 0\\ \vdots&\vdots&\vdots&\ddots&\vdots\\ \mathbf 1_{n_a}&\mathbf 0&\mathbf 0&\cdots&\mathbf 1_{n_a}\end{pmatrix} \begin{pmatrix} \mu\\\alpha_1\\\alpha_2\\\vdots\\\alpha_a \end{pmatrix} $$
$$ \mathbf A_1=\mathbf P_{\mathbf 1_ N},\quad\mathbf A_2=\mathbf P_\mathbf X-\mathbf P_{\mathbf 1_N},\quad \mathbf A_3=\mathbf I-\mathbf P_\mathbf X $$
Clearly, $\mathbf A_1+\mathbf A_2+\mathbf A_3=\mathbf I$.
$rank(\mathbf P_{\mathbf 1_N})=1$, $rank(\mathbf I-\mathbf P_\mathbf X)=N-a$ $\Rightarrow rank(\mathbf P_\mathbf X-\mathbf P_{\mathbf 1_N})=a-1$.
$\mathcal C(\mathbf 1_N)\subset\mathcal C(\mathbf X)$ $\stackrel{\text{Thm 2.2}}{\Rightarrow}$ $\mathbf P_\mathbf X-\mathbf P_{\mathbf 1_N}$ is the symmetric projection onto $\mathcal C((\mathbf I-\mathbf P_{\mathbf 1_N})\mathbf X)$. $$ \begin{aligned} &\mathbf y^T\mathbf A_1\mathbf y=SSM=N\bar y^2\quad\quad\quad\quad\frac{SSM}{\sigma^2}\sim\chi^2_1\left(\frac{1}{2\sigma^2}(\mathbf X\mathbf b)^T\mathbf P_{\mathbf 1_N}\mathbf X\mathbf b\right)\\ &\mathbf y^T\mathbf A_2\mathbf y=SSA_{cfm}=\sum_{i=1}^an_i\bar y_{i\cdot}^2-N\bar y^2 \quad\frac{SSA_{cfm}}{\sigma^2}\sim\chi^2_{a-1}\left(\frac{1}{2\sigma^2}(\mathbf X\mathbf b)^T(\mathbf P_\mathbf X-\mathbf P_{\mathbf 1_N})\mathbf X\mathbf b\right)\\ &\mathbf y^T\mathbf A_3\mathbf y=SSE=\sum_{i=1}^a\sum_{j=1}^{n_i}(y_{ij}-\bar y_{i\cdot})^2\quad\frac{SSE}{\sigma^2}\sim\chi^2_{N-a}\left(\frac{1}{2\sigma^2}(\mathbf X\mathbf b)^T(\mathbf I-\mathbf P_\mathbf X)\mathbf X\mathbf b\right)=\chi^2_{N-a} \end{aligned} $$ ANOVA Table:
Source | DF | Projection | SS | Noncentrality |
---|---|---|---|---|
Mean | 1 | $\mathbf P_{\mathbf 1_N}$ | SSM=$N\bar y^2$ | $\frac{1}{2\sigma^2}N(\mu+\bar \alpha)^2$ |
Group | a-1 | $\mathbf P_\mathbf X-\mathbf P_{\mathbf 1_N}$ | SSA$_{cfm}=\sum_in_i\bar y^2_{i\cdot}-N\bar y^2$ | $\frac{1}{2\sigma^2}\sum_in_i(\alpha_i-\bar \alpha)^2$ |
Error | N-a | $\mathbf I-\mathbf P_\mathbf X$ | SSE=$\sum_{i,j}(y_{ij}-\bar y_{i\cdot})^2$ | 0 |
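*Numerical sketch of this table for simulated unbalanced data (assuming numpy; the group sizes and effects are illustrative), checking that SSM, SSA$_{cfm}$, and SSE add up to $\mathbf y^T\mathbf y$, as the decomposition $\mathbf A_1+\mathbf A_2+\mathbf A_3=\mathbf I$ requires:

```python
import numpy as np

rng = np.random.default_rng(3)
n_i = np.array([5, 8, 12])                      # illustrative (unbalanced) group sizes
a, N = len(n_i), n_i.sum()
groups = np.repeat(np.arange(a), n_i)
alpha = np.array([0.0, 1.0, -0.5])              # illustrative group effects
y = 2.0 + alpha[groups] + rng.standard_normal(N)

ybar = y.mean()
ybar_i = np.array([y[groups == i].mean() for i in range(a)])

SSM = N * ybar**2                               # y^T P_{1_N} y
SSA = np.sum(n_i * ybar_i**2) - N * ybar**2     # y^T (P_X - P_{1_N}) y
SSE = np.sum((y - ybar_i[groups])**2)           # y^T (I - P_X) y
print(np.isclose(SSM + SSA + SSE, y @ y))       # the three pieces sum to y^T y
```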
Chapter 6: Statistical Inference
6.2 Results from Statistical Theory
Assume $\mathbf y\sim\mathcal N_N(\mathbf X\mathbf b,\sigma^2\mathbf I)$. $$ \begin{aligned} f(\mathbf y|\mathbf b,\sigma^2)&=(2\pi\sigma^2)^{-N/2}\exp\{-\frac{1}{2\sigma^2}(\mathbf y-\mathbf X\mathbf b)^T(\mathbf y-\mathbf X\mathbf b)\}\\ &=(2\pi\sigma^2)^{-N/2}\exp\{-\frac{1}{2\sigma^2}\mathbf b^T\mathbf X^T\mathbf X\mathbf b\}\exp\{-\frac{1}{2\sigma^2}\mathbf y^T\mathbf y+\frac{1}{\sigma^2}\mathbf b^T\mathbf X^T\mathbf y\} \end{aligned} $$ By the factorization theorem, we see that $(\mathbf y^T\mathbf y,\mathbf X^T\mathbf y)$ is a sufficient statistic for $(\mathbf b,\sigma^2)$.
Result 6.1: Assume $\mathbf y\sim\mathcal N_N(\mathbf X\mathbf b,\sigma^2\mathbf I)$. $(\mathbf y^T\mathbf y,\mathbf X^T\mathbf y)$ is a minimal sufficient statistic for $(\mathbf b,\sigma^2)$.
proof: According to Theorem 6.2.B in C&B, it is enough to show that for two different responses, $\mathbf y_1,\mathbf y_2\in\mathbb R^N$, $\mathbf y_1\not=\mathbf y_2$, $$ \frac{f(\mathbf y_1|\mathbf b,\sigma^2)}{f(\mathbf y_2|\mathbf b,\sigma^2)}\text{ is constant in }(\mathbf b,\sigma^2)\quad\Leftrightarrow\quad(\mathbf y_1^T\mathbf y_1,\mathbf X^T\mathbf y_1)=(\mathbf y_2^T\mathbf y_2,\mathbf X^T\mathbf y_2) $$ First $$ \frac{f(\mathbf y_1|\mathbf b,\sigma^2)}{f(\mathbf y_2|\mathbf b,\sigma^2)}=\exp\left\{\frac{1}{2\sigma^2}(\mathbf y_2^T\mathbf y_2-\mathbf y_1^T\mathbf y_1)+\frac{1}{\sigma^2}\mathbf b^T(\mathbf X^T\mathbf y_1-\mathbf X^T\mathbf y_2)\right\} $$ ($\Leftarrow$) Obvious
($\Rightarrow$) $\forall (\mathbf b,\sigma^2)\in\R^p\times\R^+$, $\frac{1}{2\sigma^2}(\mathbf y_2^T\mathbf y_2-\mathbf y_1^T\mathbf y_1)+\frac{1}{\sigma^2}\mathbf b^T(\mathbf X^T\mathbf y_1-\mathbf X^T\mathbf y_2)=c$ where $c$ does not depend on $(\mathbf b,\sigma^2)$. For fixed $\sigma^2>0$, $\frac{1}{2}(\mathbf y_2^T\mathbf y_2-\mathbf y_1^T\mathbf y_1)-c\sigma^2+\mathbf b^T(\mathbf X^T\mathbf y_1-\mathbf X^T\mathbf y_2)=0$, $\forall\mathbf b\in\R^p$. By Result A.8, $\mathbf X^T\mathbf y_1-\mathbf X^T\mathbf y_2=\mathbf 0$ $\Rightarrow$ $\frac{1}{2\sigma^2}(\mathbf y_2^T\mathbf y_2-\mathbf y_1^T\mathbf y_1)=c$, $\forall \sigma^2>0$ $\Rightarrow$ $\mathbf y_2^T\mathbf y_2-\mathbf y_1^T\mathbf y_1=0$.
Corollary 6.1: $\mathbf y\sim\mathcal N_N(\mathbf X\mathbf b,\sigma^2\mathbf I)$. $(SSE,\mathbf X^T\mathbf y)$ is also minimal sufficient for $(\mathbf b,\sigma^2)$.
Q: How is least squares related to M.L. (maximum likelihood)?
A: Almost the same.
Result 6.3: Assume $\mathbf y\sim\mathcal N_N(\mathbf X\mathbf b,\sigma^2\mathbf I)$. Let $\hat{\mathbf b}$ be a solution to the N.E.s. $(\hat{\mathbf b},SSE/N)$ is a ML estimator of $(\mathbf b,\sigma^2)$.
proof: Recall that $Q(\mathbf b)=(\mathbf y-\mathbf X\mathbf b)^T(\mathbf y-\mathbf X\mathbf b)$. $$ L(\mathbf b,\sigma^2|\mathbf y)=(2\pi)^{-N/2}(\sigma^2)^{-N/2}e^{-\frac{1}{2\sigma^2}Q(\mathbf b)} $$ For any $\sigma^2>0$, $e^{-\frac{1}{2\sigma^2}Q(\mathbf b)}$ is maximized by minimizing $Q(\mathbf b)$. Of course, $Q(\mathbf b)$ is minimized at $\hat{\mathbf b}$. So, we now can say for all $(\mathbf b,\sigma^2)\in\R^p\times\R^+$, $$ L(\mathbf b,\sigma^2|\mathbf y)\leq L(\hat{\mathbf b},\sigma^2|\mathbf y) $$ Now $$ \log L(\hat{\mathbf b},\sigma^2|\mathbf y)=-\frac{N}{2}\log\sigma^2-\frac{1}{2\sigma^2}Q(\hat{\mathbf b})+\text{constant} $$
$$ \frac{d}{d\sigma^2}\log L(\hat{\mathbf b},\sigma^2|\mathbf y)=-\frac{N}{2\sigma^2}+\frac{1}{2(\sigma^2)^2}Q(\hat{\mathbf b})\stackrel{!}{=}0\\ \Rightarrow\hat{\sigma^2}=\frac{SSE}{N}\\ \frac{d^2}{d(\sigma^2)^2}\log L(\hat{\mathbf b},\sigma^2|\mathbf y)\bigg|_{\hat{\sigma^2}}<0\Rightarrow\text{ Maximum} $$
Finally, we can say: for all $(\mathbf b,\sigma^2)\in\R^p\times\R^+$, $L(\mathbf b,\sigma^2|\mathbf y)\leq L(\hat{\mathbf b},\sigma^2|\mathbf y)\leq L(\hat{\mathbf b},\hat{\sigma^2}|\mathbf y)$.
Corollary 6.3: Assume $\mathbf y\sim\mathcal N_N(\mathbf X\mathbf b,\sigma^2\mathbf I)$. The ML estimator of an estimable function, $\pmb\lambda^T\mathbf b$, is $\pmb\lambda^T\hat{\mathbf b}$, where $\hat{\mathbf b}$ solves the N.E.s.
proof: Invariance of MLE.
6.3 Testing the General Linear Hypothesis
$$ H:\mathbf K^T\mathbf b=\mathbf m\quad\quad A:\mathbf K^T\mathbf b\not=\mathbf m $$
Assume:
- $\mathbf K_{p\times s}$ has rank $s$. If $\mathbf K$ is not of full column rank, there will be redundant constraints.
- Components of $\mathbf K^T\mathbf b$ are estimable. In other words, each column of $\mathbf K$ is in $\mathcal C(\mathbf X^T)$. Thus $\mathbf K=\mathbf X^T\mathbf A$ for some $\mathbf A_{N\times s}$.
Example: Two-way crossed model with interaction $$ y_{ijk}=\mu+\alpha_i+\beta_j+\gamma_{ij}+e_{ijk},\quad i=1,2,\cdots,a,j=1,2,\cdots,b,k=1,2,\cdots,n_{ij} $$ Q: How do we test for no interaction?
Maybe $$ \mathbf K^T\mathbf b=\begin{pmatrix}\mathbf 0_{ab\times(a+b+1)}&\mathbf I_{ab}\end{pmatrix}\begin{pmatrix}\mu\\\alpha_1\\\vdots\\\alpha_a\\\beta_1\\\vdots\\\beta_b\\\gamma_{11}\\\vdots\\\gamma_{ab}\end{pmatrix},\quad\mathbf m=\mathbf 0 $$ Doesn’t work since $\gamma_{ij}$ is not estimable.
Let’s think about a special case with one observation per cell ($k=1$), $a=3$ and $b=2$.
Q: What does no interaction really mean?
The difference in response between levels $i$ and $i'$ of factor $A$ does not depend on factor $B$. $$ \mathbb E(y_{ij}-y_{i'j})=\mathbb E(y_{ij'}-y_{i'j'}),\quad i\not=i',j\not=j'\\ \Rightarrow \mu+\alpha_i+\beta_j+\gamma_{ij}-(\mu+\alpha_{i'}+\beta_j+\gamma_{i'j})=\mu+\alpha_i+\beta_{j'}+\gamma_{ij'}-(\mu+\alpha_{i'}+\beta_{j'}+\gamma_{i'j'})\\ \text{or }\gamma_{ij}-\gamma_{i'j}-\gamma_{ij'}+\gamma_{i'j'}=0 $$ $(i,i')=(1,2)$, $(i,i')=(1,3)$, $(i,i')=(2,3)$
When $a=3$ and $b=2$, we only need $2$ such equations to test the hypothesis on no interaction:
- $\gamma_{11}-\gamma_{21}-\gamma_{12}+\gamma_{22}=0$
- $\gamma_{11}-\gamma_{31}-\gamma_{12}+\gamma_{32}=0$
Subtracting the first from the second gives $\gamma_{21}-\gamma_{31}-\gamma_{22}+\gamma_{32}=0$.
In general, we need $(a-1)(b-1)$ equations to test for no interaction. This makes sense as the rank of $\mathbf X$ in the full model is $ab$, and in the model without interaction, the rank of $\mathbf X$ is $a+b-1$, and $ab-(a+b-1)=(a-1)(b-1)$.
Back to the general case.
Let’s use “first principles” (or “do what makes sense”) to construct a statistical test for $H:\mathbf K^T\mathbf b=\mathbf m$.
The BLUE of $\mathbf K^T\mathbf b$ is $\mathbf K^T\hat{\mathbf b}=\mathbf K^T(\mathbf X^T\mathbf X)^g\mathbf X^T\mathbf y$. $$ \mathbf K^T\hat{\mathbf b}\sim\mathcal N_s\left(\mathbf K^T(\mathbf X^T\mathbf X)^g\mathbf X^T\mathbf X\mathbf b,\sigma^2\mathbf K^T(\mathbf X^T\mathbf X)^g\mathbf X^T\mathbf X[(\mathbf X^T\mathbf X)^g]^T\mathbf K\right) $$ But $\left[\mathbf K^T(\mathbf X^T\mathbf X)^g\mathbf X^T\mathbf X\right]^T=\underbrace{(\mathbf X^T\mathbf X)\left[(\mathbf X^T\mathbf X)^g\right]^T}_{\text{projection onto }\mathcal C(\mathbf X^T\mathbf X)=\mathcal C(\mathbf X^T)}\mathbf K=\mathbf K$. So $$ \mathbf K^T\hat{\mathbf b}\sim\mathcal N_s\left(\mathbf K^T\mathbf b,\sigma^2\underbrace{\mathbf K^T(\mathbf X^T\mathbf X)^g\mathbf K}_{\mathbf H}\right) $$ Result 6.4: The $s\times s$ matrix $\mathbf H=\mathbf K^T(\mathbf X^T\mathbf X)^g\mathbf K$ is nonsingular, and hence positive definite.
proof: $\mathbf K_{p\times s}=\mathbf X^T_{p\times N}\mathbf A_{N\times s}$, $rank(\mathbf X)=r$. Now, $\mathbf K^T(\mathbf X^T\mathbf X)^g\mathbf K=\mathbf A^T\mathbf X(\mathbf X^T\mathbf X)^g\mathbf X^T\mathbf A=\mathbf A^T\mathbf P_\mathbf X\mathbf P_\mathbf X\mathbf A=(\mathbf P_\mathbf X\mathbf A)^T\mathbf P_\mathbf X\mathbf A$. If $\mathbf P_\mathbf X\mathbf A_{N\times s}$ has full rank, then $(\mathbf P_\mathbf X\mathbf A)^T\mathbf P_\mathbf X\mathbf A$ is invertible, which is what we want.
Now, $\mathbf X^T\mathbf P_\mathbf X\mathbf A=\mathbf X^T\mathbf A=\mathbf K$. $s=rank(\mathbf K)\leq\min\{rank(\mathbf X^T),rank(\mathbf P_\mathbf X\mathbf A)\}\leq rank(\mathbf P_\mathbf X\mathbf A)\Rightarrow rank(\mathbf P_\mathbf X\mathbf A)=s$.
$$ \mathbf K^T\hat{\mathbf b}-\mathbf m\sim\mathcal N_s(\mathbf K^T\mathbf b-\mathbf m,\sigma^2\mathbf H) $$ Result 5.10 $\Rightarrow $ $$ (\mathbf K^T\hat{\mathbf b}-\mathbf m)^T(\sigma^2\mathbf H)^{-1}(\mathbf K^T\hat{\mathbf b}-\mathbf m)\sim\chi^2_s(\phi) $$ where $\phi=\frac{1}{2}(\mathbf K^T\mathbf b-\mathbf m)^T(\sigma^2\mathbf H)^{-1}(\mathbf K^T\mathbf b-\mathbf m)$.
Recall that $\mathbf K=\mathbf X^T\mathbf A$, so $\mathbf K^T\hat{\mathbf b}=\mathbf A^T\mathbf X(\mathbf X^T\mathbf X)^g\mathbf X^T\mathbf y=\mathbf A^T\mathbf P_\mathbf X\mathbf y$. But we know that $\mathbf P_\mathbf X\mathbf y$ and $(\mathbf I-\mathbf P_\mathbf X)\mathbf y$ are independent, so $\mathbf K^T\hat{\mathbf b}$ and $SSE$ are also independent.
Finally, $$ F=\frac{(\mathbf K^T\hat{\mathbf b}-\mathbf m)^T\mathbf H^{-1}(\mathbf K^T\hat{\mathbf b}-\mathbf m)/s}{SSE/(N-r)}\sim F_{s,N-r}(\phi). $$ Under $H:\mathbf K^T\mathbf b=\mathbf m$, $F\sim F_{s,N-r}$.
Also, we know from Result 5.13 that $F$ is stochastically increasing in $\phi$. It therefore makes sense to reject when $F$ is large.
Let’s reject when $F>F_{s,N-r,\alpha}$. This is a level $\alpha$ test, and it is unbiased because $P(\text{reject }H)$ is smallest when $\phi=0$, i.e., under $H$.
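*Numerical sketch: the first-principles construction translates directly into a few lines of linear algebra. Here is a minimal Python sketch (assuming numpy; `glh_F` is a hypothetical helper, and $\mathbf K^T\mathbf b$ is assumed estimable with $\mathbf K$ of full column rank $s$):

```python
import numpy as np

def glh_F(y, X, K, m):
    """First-principles F statistic for H: K^T b = m.
    A sketch: K^T b is assumed estimable and K of full column rank s."""
    XtX_g = np.linalg.pinv(X.T @ X)             # one choice of (X^T X)^g
    b_hat = XtX_g @ X.T @ y                     # a solution of the normal equations
    N, r = X.shape[0], np.linalg.matrix_rank(X)
    s = K.shape[1]
    H = K.T @ XtX_g @ K                         # nonsingular by Result 6.4
    d = K.T @ b_hat - m
    SSE = y @ y - y @ X @ b_hat                 # y^T (I - P_X) y
    F = (d @ np.linalg.solve(H, d) / s) / (SSE / (N - r))
    return F, s, N - r                          # compare F with F_{s, N-r, alpha}
```

Applied to Example 6.5 below, with the one-way ANOVA design and $\mathbf K^T$ encoding the contrasts $\alpha_1-\alpha_i$, this should reproduce the usual ANOVA $F$.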
Example 6.5: One-way ANOVA $$ y_{ij}=\mu+\alpha_i+e_{ij},\quad i=1,2,\cdots,a,j=1,2,\cdots,n_i $$ $H:\alpha_1-\alpha_2=0,\alpha_1-\alpha_3=0,\cdots,\alpha_1-\alpha_a=0$. $$ \mathbf K^T=\begin{pmatrix} 0&1&-1&0&\cdots&0\\ 0&1&0&-1&\cdots&0\\ \vdots&\vdots&\vdots&\vdots&\ddots&\vdots\\ 0&1&0&0&\cdots&-1 \end{pmatrix}, \quad \mathbf b=\begin{pmatrix} \mu\\\alpha_1\\\alpha_2\\\vdots\\\alpha_a \end{pmatrix}, \quad\mathbf m=\mathbf 0. $$ If $H$ is true, then $\alpha_1=\alpha_2=\cdots=\alpha_a$.
Note that we could have specified $H$ in a different way, e.g., $H_\star:\alpha_1-\alpha_2=0,\alpha_2-\alpha_3=0,\cdots,\alpha_{a-1}-\alpha_a=0$.
Let’s construct the $F$ statistic for $H$. Here $N=\sum_{i=1}^an_i$, $p=a+1$, $s=a-1$. $$ \mathbf X=\begin{pmatrix}\mathbf 1_{n_1}&\mathbf 1_{n_1}&\mathbf 0&\cdots&\mathbf 0\\ \mathbf 1_{n_2}&\mathbf 0&\mathbf 1_{n_2}&\cdots&\mathbf 0\\ \vdots&\vdots&\vdots&\ddots&\vdots\\ \mathbf 1_{n_a}&\mathbf 0&\mathbf 0&\cdots&\mathbf 1_{n_a}\end{pmatrix},\\ (\mathbf X^T\mathbf X)^g=\begin{pmatrix} 0&&&&\\ &\frac{1}{n_1}&&&\\ &&\frac{1}{n_2}&&\\ &&&\ddots&\\ &&&&\frac{1}{n_a} \end{pmatrix}=\begin{pmatrix}0&&\\ &\frac{1}{n_1}&\\ &&\mathbf D_\star\end{pmatrix} $$ where $\mathbf D_\star=\text{diag}(\frac{1}{n_2},\cdots,\frac{1}{n_a})$. This yields $$ \hat{\mathbf b}=\begin{pmatrix} 0\\\bar y_{1\cdot}\\\bar y_{2\cdot}\\\vdots\\\bar y_{a\cdot} \end{pmatrix} $$
$$ \mathbf K^T\hat{\mathbf b}-\mathbf m=\begin{pmatrix}\mathbf 0&\mathbf 1_{a-1}&-\mathbf I_{a-1}\end{pmatrix}\begin{pmatrix}0\\\bar y_{1\cdot}\\\vdots\\\bar y_{a\cdot}\end{pmatrix}=\begin{pmatrix}\bar y_{1\cdot}-\bar y_{2\cdot}\\\bar y_{1\cdot}-\bar y_{3\cdot}\\\vdots\\\bar y_{1\cdot}-\bar y_{a\cdot}\end{pmatrix} $$
$$ \begin{aligned} \mathbf H&=\mathbf K^T(\mathbf X^T\mathbf X)^g\mathbf K=\begin{pmatrix}\mathbf 0&\mathbf 1_{a-1}&-\mathbf I_{a-1}\end{pmatrix}\begin{pmatrix}0&&\\ &\frac{1}{n_1}&\\ &&\mathbf D_\star\end{pmatrix} \begin{pmatrix}\mathbf 0\\\mathbf 1^T_{a-1}\\-\mathbf I_{a-1}\end{pmatrix}\\ &=\begin{pmatrix}\mathbf 0&\frac{1}{n_1}\mathbf 1_{a-1}&-\mathbf D_\star\end{pmatrix}\begin{pmatrix}\mathbf 0\\\mathbf 1^T_{a-1}\\-\mathbf I_{a-1}\end{pmatrix}\\ &=\mathbf D_\star+\frac{1}{n_1}\mathbf 1_{a-1}\mathbf 1^T_{a-1} \end{aligned} $$
Using the result from Problem A.75: $$ \left[\mathbf D_\star+\frac{1}{n_1}\mathbf 1_{a-1}\mathbf 1_{a-1}^T\right]^{-1}=\mathbf D_\star^{-1}-\frac{1}{N}\mathbf D_\star^{-1}\mathbf 1_{a-1}\mathbf 1_{a-1}^T\mathbf D_\star^{-1} $$
$$ \begin{aligned} &(\mathbf K^T\hat{\mathbf b}-\mathbf m)^T[\mathbf K^T(\mathbf X^T\mathbf X)^g\mathbf K]^{-1}(\mathbf K^T\hat{\mathbf b}-\mathbf m)\\ =&(\mathbf K^T\hat{\mathbf b}-\mathbf m)^T\mathbf D_\star^{-1}(\mathbf K^T\hat{\mathbf b}-\mathbf m)-\frac{1}{N}(\mathbf K^T\hat{\mathbf b}-\mathbf m)^T\mathbf D_\star^{-1}\mathbf 1_{a-1}\mathbf 1_{a-1}^T\mathbf D_\star^{-1}(\mathbf K^T\hat{\mathbf b}-\mathbf m)\\ =&\sum_{i=2}^an_i(\bar y_{1\cdot}-\bar y_{i\cdot})^2-\frac{1}{N}\left[\sum_{i=2}^an_i(\bar y_{1\cdot}-\bar y_{i\cdot})\right]^2\\ =&\sum_{i=2}^an_i(\bar y_{1\cdot}-\bar y_{i\cdot})^2-\frac{1}{N}\left[N(\bar y_{1\cdot}-\bar y_{\cdot\cdot})\right]^2\\ =&\cdots\\ =&\sum_{i=1}^an_i(\bar y_{i\cdot}-\bar y_{\cdot\cdot})^2\quad\quad\left(\text{Using } \sum_{i=1}^an_i(\bar y_{i\cdot}-\bar y_{1\cdot}+\bar y_{1\cdot}-\bar y_{\cdot\cdot})^2=\cdots\right) \end{aligned} $$
Clearly, $SSE=\sum_{i=1}^a\sum_{j=1}^{n_i}(y_{ij}-\bar y_{i\cdot})^2$. So our $F$-statistic is $$ F=\frac{\left(\sum_{i=1}^an_i(\bar y_{i\cdot}-\bar y_{\cdot\cdot})^2\right)/(a-1)}{\left(\sum_{i=1}^a\sum_{j=1}^{n_i}(y_{ij}-\bar y_{i\cdot})^2\right)/(N-a)}\stackrel{\text{under }H}{\sim} F_{a-1,N-a} $$ Recall that we could have used $H_\star$ instead of $H$.
Q: Will we get the same test?
A: Yes.
In general, suppose that we have two equivalent hypotheses: $H:\mathbf K^T\mathbf b=\mathbf m$ and $H_\star:\mathbf K^T_\star\mathbf b=\mathbf m_\star$.
Equivalence $\Rightarrow $ $S=\left\{\mathbf b\in\R^p:\mathbf K^T\mathbf b=\mathbf m\right\}=\{\mathbf b\in\R^p:\mathbf K^T_\star\mathbf b=\mathbf m_\star\}=S_\star$
Result A.13 implies that points in $S$ can be expressed as $$ \mathbf K(\mathbf K^T\mathbf K)^{-1}\mathbf m+(\mathbf I-\mathbf P_\mathbf K)\mathbf z,\quad\mathbf z\in\R^p $$ Since $S=S_\star$, we have $$ \mathbf K^T_\star\left[\mathbf K(\mathbf K^T\mathbf K)^{-1}\mathbf m+(\mathbf I-\mathbf P_\mathbf K)\mathbf z\right]=\mathbf m_\star,\quad\forall\mathbf z\in\R^p\\ \text{or }\mathbf K_\star^T\mathbf K(\mathbf K^T\mathbf K)^{-1}\mathbf m-\mathbf m_\star+\mathbf K_\star^T(\mathbf I-\mathbf P_\mathbf K)\mathbf z=\mathbf 0,\quad\forall \mathbf z\in\R^p $$ Result A.8 implies that
- $\mathbf K^T_\star(\mathbf I-\mathbf P_\mathbf K)=\mathbf 0$, or $\mathbf P_\mathbf K\mathbf K_\star=\mathbf K_\star$
- $\mathbf K^T_\star\mathbf K(\mathbf K^T\mathbf K)^{-1}\mathbf m=\mathbf m_\star$
$\mathbf P_\mathbf K\mathbf K_\star=\mathbf K_\star\Rightarrow\mathcal C(\mathbf K_\star)\subseteq\mathcal C(\mathbf K)$. If we reverse the roles of $\mathbf K$ and $\mathbf K_\star$ in the previous argument, we will get $\mathcal C(\mathbf K)\subseteq\mathcal C(\mathbf K_\star)$.
So $\mathcal C(\mathbf K)=\mathcal C(\mathbf K_\star)\Rightarrow\exist$ a nonsingular $s\times s$ matrix $\mathbf Q$ such that $\mathbf K\mathbf Q^T=\mathbf K_\star$.
It follows from $\mathbf K^T_\star\mathbf K(\mathbf K^T\mathbf K)^{-1}\mathbf m=\mathbf m_\star$ that $$ \mathbf Q\mathbf K^T\mathbf K(\mathbf K^T\mathbf K)^{-1}\mathbf m=\mathbf m_\star\Rightarrow\mathbf Q\mathbf m=\mathbf m_\star $$ Also, $$ \mathbf K\mathbf Q^T=\mathbf K_\star\Rightarrow\mathbf K^T\mathbf K\mathbf Q^T=\mathbf K^T\mathbf K_\star\Rightarrow \mathbf Q^T=(\mathbf K^T\mathbf K)^{-1}\mathbf K^T\mathbf K_\star\Rightarrow\mathbf Q=\mathbf K_\star^T\mathbf K(\mathbf K^T\mathbf K)^{-1}. $$ Now, let’s look at the numerator of the $F$-statistic for testing $H_\star:\mathbf K_\star^T\mathbf b=\mathbf m_\star$: $$ \begin{aligned} &(\mathbf K_\star^T\hat{\mathbf b}-\mathbf m_\star)^T\left[\mathbf K_\star^T(\mathbf X^T\mathbf X)^g\mathbf K_\star\right]^{-1}(\mathbf K_\star^T\hat{\mathbf b}-\mathbf m_\star)\\ =&(\mathbf Q\mathbf K^T\hat{\mathbf b}-\mathbf Q\mathbf m)^T\left[\mathbf Q\mathbf K^T(\mathbf X^T\mathbf X)^g\mathbf K\mathbf Q^T\right]^{-1}(\mathbf Q\mathbf K^T\hat{\mathbf b}-\mathbf Q\mathbf m)\\ =&(\mathbf K^T\hat{\mathbf b}-\mathbf m)^T\left[\mathbf K^T(\mathbf X^T\mathbf X)^g\mathbf K\right]^{-1}(\mathbf K^T\hat{\mathbf b}-\mathbf m) \end{aligned} $$ Hence the two $F$-statistics coincide.
What happens when $s=1$ so that $\mathbf K^T$ is a $1\times p$ vector?
$H:\mathbf K^T\mathbf b=\mathbf m$
Now we can do both one-sided and two-sided tests: $A_1:\mathbf K^T\mathbf b>\mathbf m$ or $A_2:\mathbf K^T\mathbf b\not=\mathbf m$. $$ \mathbf K^T\hat{\mathbf b}\sim\mathcal N\left(\mathbf K^T\mathbf b,\sigma^2\mathbf K^T(\mathbf X^T\mathbf X)^g\mathbf K\right) $$
$$ \Rightarrow \frac{\frac{\mathbf K^T\hat{\mathbf b}-\mathbf m}{\sqrt{\sigma^2\mathbf K^T(\mathbf X^T\mathbf X)^g\mathbf K}}}{\sqrt{\frac{SSE}{\sigma^2(N-r)}}}=\frac{\mathcal N(\mu,1)}{\sqrt{\frac{\chi^2_{N-r}}{N-r}}}\\ \Rightarrow t=\frac{\mathbf K^T\hat{\mathbf b}-\mathbf m}{\sqrt{\hat{\sigma^2}\mathbf K^T(\mathbf X^T\mathbf X)^g\mathbf K}}\sim t_{N-r}(\mu),\quad\mu=\frac{\mathbf K^T\mathbf b-\mathbf m}{\sqrt{\sigma^2\mathbf K^T(\mathbf X^T\mathbf X)^g\mathbf K}} $$
For $A_1$, we reject $H$ if $t>t_{N-r,\alpha}$
For $A_2$, we reject $H$ if $|t|>t_{N-r,\alpha/2}$
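*Numerical sketch: for the $s=1$ case the same ingredients give the $t$ statistic directly (assuming numpy and scipy; `glh_t` is a hypothetical helper and `k` a single estimable vector):

```python
import numpy as np
from scipy.stats import t as t_dist

def glh_t(y, X, k, m):
    """t statistic for H: k^T b = m when s = 1 (k a single estimable vector); a sketch."""
    XtX_g = np.linalg.pinv(X.T @ X)
    b_hat = XtX_g @ X.T @ y
    N, r = X.shape[0], np.linalg.matrix_rank(X)
    sigma2_hat = (y @ y - y @ X @ b_hat) / (N - r)          # SSE / (N - r)
    t = (k @ b_hat - m) / np.sqrt(sigma2_hat * (k @ XtX_g @ k))
    return t, N - r

# e.g. reject H in favor of A_2 at level alpha when |t| > t_dist.ppf(1 - alpha/2, N - r)
```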
Example (Full and reduced models):
Let’s return to the test for no interaction in the two-way model. $y_{ijk}=\mu+\alpha_i+\beta_j+\gamma_{ij}+e_{ijk}$, $i=1,2,\cdots,a$, $j=1,2,\cdots, b$, $k=1,2,\cdots, n$. $$ \mathbf X\mathbf b=\begin{pmatrix}\mathbf 1_n&\mathbf 1_n&\mathbf 0&\mathbf 0&\mathbf 1_n&\mathbf 0&\mathbf 1_n&\mathbf 0&\mathbf 0&\mathbf 0&\mathbf 0&\mathbf 0\\ \mathbf 1_n&\mathbf 1_n&\mathbf 0&\mathbf 0&\mathbf 0&\mathbf 1_n&\mathbf 0&\mathbf 1_n&\mathbf 0&\mathbf 0&\mathbf 0&\mathbf 0\\ \vdots&\vdots&\vdots&\vdots&\vdots&\vdots&\vdots&\vdots&\vdots&\vdots&\vdots&\vdots\\ \mathbf 1_n&\mathbf 0&\mathbf 0&\mathbf 1_n&\mathbf 0&\mathbf 1_n&\mathbf 0&\mathbf 0&\mathbf 0&\mathbf 0&\mathbf 0&\mathbf 1_n\end{pmatrix} \begin{pmatrix}\mu\\\alpha_1\\\alpha_2\\\alpha_3\\\beta_1\\\beta_2\\\gamma_{11}\\\gamma_{12}\\\vdots\\\gamma_{32}\end{pmatrix} $$
$$ \mathbb E\mathbf y=\mathbf X\mathbf b=\begin{pmatrix}\mathbf X_0&\mathbf X_1\end{pmatrix}\begin{pmatrix}\mathbf b_0\\\mathbf b_1\end{pmatrix}=\mathbf X_0\mathbf b_0+\mathbf X_1\mathbf b_1 $$
where $\mathbf X_0$ consists of the first $6$ columns of $\mathbf X$.
We would like to test $H_1:\mathbf X_1\mathbf b_1=\mathbf 0$, but components of $\mathbf X_1\mathbf b_1$ are not estimable.
Recall the alternative parametrization: $$ \mathbf y=\mathbf X_0\mathbf c_0+(\mathbf I-\mathbf P_{\mathbf X_0})\mathbf X_1\mathbf c_1+\mathbf e $$ Note that $$ \mathbb E[(\mathbf I-\mathbf P_{\mathbf X_0})\mathbf y]=(\mathbf I-\mathbf P_{\mathbf X_0})\begin{pmatrix}\mathbf X_0&(\mathbf I-\mathbf P_{\mathbf X_0})\mathbf X_1\end{pmatrix}\begin{pmatrix}\mathbf c_0\\\mathbf c_1\end{pmatrix}\\=\begin{pmatrix}\mathbf 0&(\mathbf I-\mathbf P_{\mathbf X_0})\mathbf X_1\end{pmatrix}\begin{pmatrix}\mathbf c_0\\\mathbf c_1\end{pmatrix}=(\mathbf I-\mathbf P_{\mathbf X_0})\mathbf X_1\mathbf c_1 $$ Thus the components of $(\mathbf I-\mathbf P_{\mathbf X_0})\mathbf X_1\mathbf c_1$ are estimable.
$H:(\mathbf I-\mathbf P_{\mathbf X_0})\mathbf X_1\mathbf c_1=\mathbf 0$.
Problem: $(\mathbf I-\mathbf P_{\mathbf X_0})\mathbf X_1$ may not be full rank.
Let $rank(\mathbf X)=r$, $rank(\mathbf X_0)=r_0$, and $rank((\mathbf I-\mathbf P_{\mathbf X_0})\mathbf X_1)=s=r-r_0$. Just pick out $s$ L.I. estimable functions from $(\mathbf I-\mathbf P_{\mathbf X_0})\mathbf X_1\mathbf c_1$ and perform the corresponding F-test.
6.4 The Likelihood Ratio Test (LRT)
$\mathbf y\sim\mathcal N_N(\mathbf X\mathbf b,\sigma^2\mathbf I)$
Want to test $H:\mathbf K^T\mathbf b=\mathbf m$, alternative $A:\mathbf K^T\mathbf b\not=\mathbf m$
Full parameter space: $\Omega=\{(\mathbf b,\sigma^2):\mathbf b\in\R^p,\sigma^2>0\}$
Reduced parameter space: $\Omega_0=\{(\mathbf b,\sigma^2):\mathbf b\in\R^p,\mathbf K^T\mathbf b=\mathbf m, \sigma^2>0\}$
Likelihood function: $L(\mathbf b,\sigma^2|\mathbf y)=(2\pi\sigma^2)^{-\frac{N}{2}}e^{-\frac{1}{2\sigma^2}Q(\mathbf b)}$, where $Q(\mathbf b)=(\mathbf y-\mathbf X\mathbf b)^T(\mathbf y-\mathbf X\mathbf b)$
LRT statistic: $$ \phi(\mathbf y)=\frac{\sup_{(\mathbf b,\sigma^2)\in\Omega_0}L(\mathbf b,\sigma^2|\mathbf y)}{\sup_{(\mathbf b,\sigma^2)\in\Omega}L(\mathbf b,\sigma^2|\mathbf y)} $$ Reject when $\phi(\mathbf y)$ is “small”.
Recall that we can maximize $L(\mathbf b,\sigma^2|\mathbf y)$ sequentially: first w.r.t. $\mathbf b$, and then w.r.t. $\sigma^2$. Let $Q(\hat{\mathbf b})$ and $Q(\hat{\mathbf b}_H)$ denote the minimized values of $Q(\mathbf b)$ over $\Omega$ and $\Omega_0$, respectively.
Now, we know that for fixed $\mathbf b$, the maximizer of $L(\mathbf b,\sigma^2|\mathbf y)$ in $\sigma^2$ is $\hat{\sigma^2}=\frac{Q(\mathbf b)}{N}$. Thus, $$ \phi(\mathbf y)=\frac{\left(2\pi\frac{Q(\hat{\mathbf b}_H)}{N}\right)^{-\frac{N}{2}}e^{-\frac{1}{2Q(\hat{\mathbf b}_H)/N}Q(\hat{\mathbf b}_H)}}{\left(2\pi\frac{Q(\hat{\mathbf b})}{N}\right)^{-\frac{N}{2}}e^{-\frac{1}{2Q(\hat{\mathbf b})/N}Q(\hat{\mathbf b})}}=\left(\frac{Q(\hat{\mathbf b}_H)}{Q(\hat{\mathbf b})}\right)^{-\frac{N}{2}} $$
$$ \begin{aligned} \left(\frac{Q(\hat{\mathbf b}_H)}{Q(\hat{\mathbf b})}\right)^{-\frac{N}{2}}<c&\Leftrightarrow \frac{Q(\hat{\mathbf b}_H)}{Q(\hat{\mathbf b})}>c^{-\frac{N}{2}}\\ &\Leftrightarrow \frac{Q(\hat{\mathbf b}_H)-Q(\hat{\mathbf b})}{Q(\hat{\mathbf b})}>c^{-\frac{N}{2}}-1\\ &\Leftrightarrow \frac{(Q(\hat{\mathbf b}_H)-Q(\hat{\mathbf b}))/s}{Q(\hat{\mathbf b})/(N-r)}>\frac{N-r}{s}(c^{-\frac{N}{2}}-1) \end{aligned} $$
Example 6.5 (One-way ANOVA):
$Q(\hat{\mathbf b})=\sum_{i=1}^a\sum_{j=1}^{n_i}(y_{ij}-\bar y_{i\cdot})^2=\mathbf y^T(\mathbf I-\mathbf P_\mathbf X)\mathbf y$.
How do we minimize $Q(\mathbf b)$ subject to $\alpha_1=\alpha_2=\cdots=\alpha_a$?
$$ \begin{aligned} \inf_{(\mathbf b,\sigma^2)\in\Omega_0}\sum_{i=1}^a\sum_{j=1}^{n_i}(y_{ij}-(\mu+\alpha_i))^2&=\inf_{\mu,\alpha\in\R}\sum_{i=1}^a\sum_{j=1}^{n_i}(y_{ij}-(\mu+\alpha))^2\\ &=\inf_{\delta\in\R}\sum_{i=1}^a\sum_{j=1}^{n_i}(y_{ij}-\delta)^2\\ &=\sum_{i=1}^a\sum_{j=1}^{n_i}(y_{ij}-\bar y_{\cdot\cdot})^2 \end{aligned} $$ Then $Q(\hat{\mathbf b}_H)-Q(\hat{\mathbf b})=\sum_{i=1}^an_i(\bar y_{i\cdot}-\bar y_{\cdot\cdot})^2$. $$ \frac{(Q(\hat{\mathbf b}_H)-Q(\hat{\mathbf b}))/s}{Q(\hat{\mathbf b})/(N-r)}=\frac{\sum_{i=1}^an_i(\bar y_{i\cdot}-\bar y_{\cdot\cdot})^2/(a-1)}{\sum_{i=1}^a\sum_{j=1}^{n_i}(y_{ij}-\bar y_{i\cdot})^2/(N-a)}=F\stackrel{H}{\sim} F_{a-1,N-a} $$
Full vs. Reduced Models: $$ \mathbb E\mathbf y=\mathbf X\mathbf b=\begin{pmatrix}\mathbf X_0&\mathbf X_1\end{pmatrix}\begin{pmatrix}\mathbf b_0\\\mathbf b_1\end{pmatrix}=\mathbf X_0\mathbf b_0+\mathbf X_1\mathbf b_1 $$ Want to test: $H:\mathbf X_1\mathbf b_1=\mathbf 0$, but components may not be estimable. $$ \mathbf y=\mathbf X_0\mathbf c_0+(\mathbf I-\mathbf P_{\mathbf X_0})\mathbf X_1\mathbf c_1+\mathbf e $$ $rank(\mathbf X)=r$, $rank(\mathbf X_0)=r_0$, and $rank((\mathbf I-\mathbf P_{\mathbf X_0})\mathbf X_1)=s=r-r_0$
How do we use this test with the LRT? $$ Q(\hat{\mathbf b}_H)=\mathbf y^T(\mathbf I-\mathbf P_{\mathbf X_0})\mathbf y=SSE(\text{reduced})\\ Q(\hat{\mathbf b})=\mathbf y^T(\mathbf I-\mathbf P_{\mathbf X})\mathbf y=SSE(\text{full})\\ \Rightarrow Q(\hat{\mathbf b}_H)-Q(\hat{\mathbf b})=\mathbf y^T(\mathbf P_\mathbf X-\mathbf P_{\mathbf X_0})\mathbf y $$ Now, $$ \underbrace{\mathbf P_{\mathbf X_0}}_{\text{rank }r_0}+\underbrace{\mathbf P_{\mathbf X}-\mathbf P_{\mathbf X_0}}_{\text{rank }s=r-r_0}+\underbrace{\mathbf I-\mathbf P_\mathbf X}_{\text{rank }N-r}=\mathbf I $$ All three are symmetric, idempotent projections.
By Cochran’s Theorem, $\mathbf y^T\mathbf P_{\mathbf X_0}\mathbf y/\sigma^2$, $\mathbf y^T(\mathbf P_\mathbf X-\mathbf P_{\mathbf X_0})\mathbf y/\sigma^2$, and $\mathbf y^T(\mathbf I-\mathbf P_\mathbf X)\mathbf y/\sigma^2$ are independent noncentral $\chi^2$’s with noncentrality parameters $(\mathbf X\mathbf b)^T\mathbf P_{\mathbf X_0}(\mathbf X\mathbf b)/2\sigma^2$, $(\mathbf X\mathbf b)^T(\mathbf P_{\mathbf X}-\mathbf P_{\mathbf X_0})(\mathbf X\mathbf b)/2\sigma^2$ and $0$. $$ \frac{[Q(\hat{\mathbf b}_H)-Q(\hat{\mathbf b})]/s}{Q(\hat{\mathbf b})/(N-r)}=\frac{\mathbf y^T(\mathbf P_\mathbf X-\mathbf P_{\mathbf X_0})\mathbf y/s}{\mathbf y^T(\mathbf I-\mathbf P_{\mathbf X})\mathbf y/(N-r)}=F\sim F_{s,N-r}(\phi),\\\phi=\frac{(\mathbf X\mathbf b)^T(\mathbf P_\mathbf X-\mathbf P_{\mathbf X_0})(\mathbf X\mathbf b)}{2\sigma^2} $$ It is easy to see that if $H:\mathbf X_1\mathbf b_1=\mathbf 0$, then $\phi=0$.
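*Numerical sketch: in the full-versus-reduced form, the $F$ statistic needs nothing beyond the two projections (assuming numpy; `full_vs_reduced_F` is a hypothetical helper):

```python
import numpy as np

def full_vs_reduced_F(y, X0, X1):
    """F statistic for H: X1 b1 = 0, comparing E y = X0 b0 (reduced) with the full model; a sketch."""
    X = np.column_stack([X0, X1])
    P_full = X @ np.linalg.pinv(X)              # P_X
    P_red = X0 @ np.linalg.pinv(X0)             # P_{X0}
    N = len(y)
    r, r0 = np.linalg.matrix_rank(X), np.linalg.matrix_rank(X0)
    s = r - r0
    SSE_full = y @ (np.eye(N) - P_full) @ y     # Q(b_hat)
    SSE_red = y @ (np.eye(N) - P_red) @ y       # Q(b_hat_H)
    F = ((SSE_red - SSE_full) / s) / (SSE_full / (N - r))
    return F, s, N - r                          # F ~ F_{s, N-r} under H
```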
6.5 First Principles Test and LRT
Theorem 6.1: Let $\mathbf K^T\mathbf b$ be a set of linearly independent estimable functions. Also, let $\hat{\mathbf b}_H$ be part of a solution to the R.N.E.s with constraint $\mathbf K^T\mathbf b=\mathbf m$ (instead of $\mathbf P^T\mathbf b=\pmb\delta$). Then $$ Q(\hat{\mathbf b}_H)-Q(\hat{\mathbf b})=(\hat{\mathbf b}_H-\hat{\mathbf b})^T\mathbf X^T\mathbf X(\hat{\mathbf b}_H-\hat{\mathbf b})\\=(\mathbf K^T\hat{\mathbf b}-\mathbf m)^T\left[\mathbf K^T(\mathbf X^T\mathbf X)^g\mathbf K\right]^{-1}(\mathbf K^T\hat{\mathbf b}-\mathbf m) $$ (That is, the two tests are the same)
proof: $$ \begin{aligned} Q(\hat{\mathbf b}_H)&=(\mathbf y-\mathbf X\hat{\mathbf b}_H)^T(\mathbf y-\mathbf X\hat{\mathbf b}_H)\\ &=(\mathbf y-\mathbf X\hat{\mathbf b}+\mathbf X\hat{\mathbf b}-\mathbf X\hat{\mathbf b}_H)^T(\mathbf y-\mathbf X\hat{\mathbf b}+\mathbf X\hat{\mathbf b}-\mathbf X\hat{\mathbf b}_H)\\ &=Q(\hat{\mathbf b})+(\hat{\mathbf b}-\hat{\mathbf b}_H)^T\mathbf X^T\mathbf X(\hat{\mathbf b}-\hat{\mathbf b}_H)+2(\hat{\mathbf b}-\hat{\mathbf b}_H)^T\underbrace{\mathbf X^T(\mathbf y-\mathbf X\hat{\mathbf b})}_{=\mathbf 0} \end{aligned} $$ R.N.E.s: $$ \begin{pmatrix}\mathbf X^T\mathbf X&\mathbf K\\\mathbf K^T&\mathbf 0\end{pmatrix}\begin{pmatrix}\mathbf b\\\pmb\theta\end{pmatrix}=\begin{pmatrix}\mathbf X^T\mathbf y\\\mathbf m\end{pmatrix} $$
$$ \mathbf X^T\mathbf X(\hat{\mathbf b}-\hat{\mathbf b}_H)=\mathbf X^T\mathbf y-(\mathbf X^T\mathbf y-\mathbf K\hat{\pmb\theta}_H)=\mathbf K\hat{\pmb\theta}_H\quad(\star) $$
Premultiplying $(\star)$ by $\mathbf K^T(\mathbf X^T\mathbf X)^g$ yields $$ \mathbf K^T(\mathbf X^T\mathbf X)^g\mathbf X^T\mathbf X(\hat{\mathbf b}-\hat{\mathbf b}_H)=\mathbf K^T(\mathbf X^T\mathbf X)^g\mathbf K\hat{\pmb\theta}_H $$ But $\left[\mathbf K^T(\mathbf X^T\mathbf X)^g\mathbf X^T\mathbf X\right]^T=\mathbf X^T\mathbf X[(\mathbf X^T\mathbf X)^g]^T\mathbf K=\mathbf K$ since columns of $\mathbf K$ are in $\mathcal C(\mathbf X^T)$. Thus, $$ \mathbf K^T(\hat{\mathbf b}-\hat{\mathbf b}_H)=\mathbf K^T(\mathbf X^T\mathbf X)^g\mathbf K\hat{\pmb\theta}_H\\ \Rightarrow \left[\mathbf K^T(\mathbf X^T\mathbf X)^g\mathbf K\right]^{-1}\mathbf K^T(\hat{\mathbf b}-\hat{\mathbf b}_H)=\hat{\pmb\theta}_H. $$ Now, premultiplying $(\star)$ by $(\hat{\mathbf b}-\hat{\mathbf b}_H)^T$ yields $$ \begin{aligned} Q(\hat{\mathbf b}_H)-Q(\hat{\mathbf b})&=(\hat{\mathbf b}-\hat{\mathbf b}_H)^T\mathbf X^T\mathbf X(\hat{\mathbf b}-\hat{\mathbf b}_H)\\ &=(\hat{\mathbf b}-\hat{\mathbf b}_H)^T\mathbf K\hat{\pmb\theta}_H\\ &=(\hat{\mathbf b}-\hat{\mathbf b}_H)^T\mathbf K\left[\mathbf K^T(\mathbf X^T\mathbf X)^g\mathbf K\right]^{-1}\mathbf K^T(\hat{\mathbf b}-\hat{\mathbf b}_H)\\ &=(\mathbf K^T\hat{\mathbf b}-\mathbf K^T\hat{\mathbf b}_H)^T\left[\mathbf K^T(\mathbf X^T\mathbf X)^g\mathbf K\right]^{-1}(\mathbf K^T\hat{\mathbf b}-\mathbf K^T\hat{\mathbf b}_H)\\ &=(\mathbf K^T\hat{\mathbf b}-\mathbf m)^T\left[\mathbf K^T(\mathbf X^T\mathbf X)^g\mathbf K\right]^{-1}(\mathbf K^T\hat{\mathbf b}-\mathbf m) \end{aligned} $$
Corollary 6.4: Let $\mathbf K^T\mathbf b$ be a set of L.I. estimable functions and $\hat{\mathbf b}$ a solution to the N.E.s. We can find $\hat{\mathbf b}_H$ by solving for $\mathbf b$ in the following equation: $$ \mathbf X^T\mathbf X\mathbf b=\mathbf X^T\mathbf y-\mathbf K\left[\mathbf K^T(\mathbf X^T\mathbf X)^g\mathbf K\right]^{-1}[\mathbf K^T\hat{\mathbf b}-\mathbf m] $$
proof: From R.N.E.s, we have $$ \mathbf X^T\mathbf X\hat{\mathbf b}_H+\mathbf K\hat{\pmb\theta}_H=\mathbf X^T\mathbf y $$ But we know from the proof of Theorem 6.1 that $$ \hat{\pmb\theta}_H=\left[\mathbf K^T(\mathbf X^T\mathbf X)^g\mathbf K\right]^{-1}\mathbf K^T(\hat{\mathbf b}-\hat{\mathbf b}_H) $$ Now plug in, and rearrange.
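*Numerical sketch: Corollary 6.4 gives a direct recipe for $\hat{\mathbf b}_H$ without solving the full R.N.E. system (assuming numpy; `restricted_solution` is a hypothetical helper, and $\mathbf K^T\mathbf b$ is assumed estimable so that the system below is consistent):

```python
import numpy as np

def restricted_solution(y, X, K, m):
    """One solution b_hat_H of the R.N.E.s via Corollary 6.4; a sketch
    (K^T b assumed estimable, so X^T X b = rhs below is a consistent system)."""
    XtX = X.T @ X
    XtX_g = np.linalg.pinv(XtX)
    b_hat = XtX_g @ X.T @ y                     # unrestricted solution of the N.E.s
    H = K.T @ XtX_g @ K
    rhs = X.T @ y - K @ np.linalg.solve(H, K.T @ b_hat - m)
    b_hat_H = XtX_g @ rhs                       # a solution of X^T X b = rhs
    return b_hat, b_hat_H

# Check of Theorem 6.1: Q(b_hat_H) - Q(b_hat) should equal
# (K^T b_hat - m)^T H^{-1} (K^T b_hat - m).
```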
Scheffé's method for dealing with multiple comparisons
Suppose $\mathbf A$ is a positive definite matrix and let $\mathbf w$ be a vector. We want $$ \max_{\mathbf x\not=0}\frac{(\mathbf x^T\mathbf w)^2}{\mathbf x^T\mathbf A\mathbf x} $$
$$ \frac{(\mathbf x^T\mathbf w)^2}{\mathbf x^T\mathbf A\mathbf x}=\frac{\langle\mathbf A^{1/2}\mathbf x,\mathbf A^{-1/2}\mathbf w\rangle^2}{|\mathbf A^{1/2}\mathbf x|^2}\stackrel{\text{C-S}}{\leq}|\mathbf A^{-1/2}\mathbf w|^2 $$
and equality is attained when $\mathbf A^{1/2}\mathbf x\propto\mathbf A^{-1/2}\mathbf w$, i.e., when $\mathbf x\propto\mathbf A^{-1}\mathbf w$. $$ \Rightarrow\max_{\mathbf x\not=0}\frac{(\mathbf x^T\mathbf w)^2}{\mathbf x^T\mathbf A\mathbf x}=\mathbf w^T\mathbf A^{-1}\mathbf w\quad(\star\star) $$ Now suppose that $\tau=\pmb\Lambda^T\mathbf b$ where $\pmb\Lambda_{p\times s}$ has L.I. columns in $\mathcal C(\mathbf X^T)$, and let $\hat\tau=\pmb\Lambda^T\hat{\mathbf b}$. Define $$ R(\mathbf u,c)=\left(\mathbf u^T\hat\tau-c\hat\sigma\sqrt{\mathbf u^T\mathbf H\mathbf u},\mathbf u^T\hat\tau+c\hat\sigma\sqrt{\mathbf u^T\mathbf H\mathbf u}\right) $$ where $\mathbf H=\pmb\Lambda^T(\mathbf X^T\mathbf X)^g\pmb\Lambda$, $\hat{\sigma}^2=SSE/(N-r)$, $\mathbf u\in\R^s$, $c>0$.
Goal: Find $c>0$ s.t. $$ P\left(\mathbf u^T\tau\in R(\mathbf u,c),\forall\mathbf u\in\R^s\right)=1-\alpha. $$ If the above holds, we can form individual C.I.s for as many linear combinations of the $\tau_j$’s as we like, and the whole collection holds simultaneously with probability $1-\alpha$.
First $$ P(\mathbf u^T\tau\in R(\mathbf u,c))=P\left(\left|\frac{\mathbf u^T\tau-\mathbf u^T\hat{\tau}}{\hat\sigma\sqrt{\mathbf u^T\mathbf H\mathbf u}}\right|\leq c\right). $$ Thus, $$ \begin{aligned} P(\mathbf u^T\tau\in R(\mathbf u,c),\forall \mathbf u\in\R^s)&=P\left(\max_{\mathbf u\in\R^s,\mathbf u\not=\mathbf 0}\left|\frac{\mathbf u^T\tau-\mathbf u^T\hat{\tau}}{\hat\sigma\sqrt{\mathbf u^T\mathbf H\mathbf u}}\right|\leq c\right)\\ &=P\left(\max_{\mathbf u\in\R^s,\mathbf u\not=\mathbf 0}\frac{[\mathbf u^T(\tau-\hat\tau)]^2}{\hat{\sigma}^2\mathbf u^T\mathbf H\mathbf u}\leq c^2\right)\\ &\stackrel{(\star\star)}{=}P\left((\hat\tau-\tau)^T\mathbf H^{-1}(\hat\tau-\tau)/\hat\sigma^2\leq c^2\right)\\ &=P\left((\hat\tau-\tau)^T\mathbf H^{-1}(\hat\tau-\tau)/s\hat\sigma^2\leq\frac{c^2}{s}\right) \end{aligned} $$ But $$ \frac{(\hat\tau-\tau)^T\mathbf H^{-1}(\hat\tau-\tau)/s}{\hat\sigma^2}\sim F_{s,N-r} $$ So, all we need to do is set $\frac{c^2}{s}=F_{s,N-r,\alpha}$, so that $P(\mathbf u^T\tau\in R(\mathbf u,c),\forall\mathbf u\in\R^s)=1-\alpha$. This yields $$ c=\sqrt{sF_{s,N-r,\alpha}} $$
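*Numerical sketch: a Scheffé interval for any $\mathbf u^T\tau$ only needs $\hat\tau$, $\mathbf H$, $\hat\sigma^2$, and the critical value $c=\sqrt{sF_{s,N-r,\alpha}}$ (assuming scipy; `scheffe_interval` is a hypothetical helper). Because the same $c$ works for every $\mathbf u$, all such intervals hold simultaneously with probability $1-\alpha$.

```python
import numpy as np
from scipy.stats import f as f_dist

def scheffe_interval(u, tau_hat, H, sigma2_hat, s, dfe, alpha=0.05):
    """Scheffe interval R(u, c) for u^T tau; a sketch.
    tau_hat = Lambda^T b_hat, H = Lambda^T (X^T X)^g Lambda,
    sigma2_hat = SSE/(N - r), dfe = N - r."""
    c = np.sqrt(s * f_dist.ppf(1 - alpha, s, dfe))      # c = sqrt(s F_{s, N-r, alpha})
    half = c * np.sqrt(sigma2_hat * (u @ H @ u))
    center = u @ tau_hat
    return center - half, center + half
```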