STA6246: Theory of Linear Models
Instructor James Hobert
Taken by Yu Zheng in Fall 2022
Review of Linear Algebra
Def: A set of vectors $\mathbf x^{(1)}, \mathbf x^{(2)},\cdots,\mathbf x^{(n)}$ is linearly dependent if there exist coefficients $c_j$, $j=1,\cdots,n$, not all zero, such that $$ \sum_{j=1}^nc_j\mathbf x^{(j)}=\mathbf 0. $$ This set of vectors is linearly independent if $\sum_{j=1}^nc_j\mathbf x^{(j)}=\mathbf 0$ implies $c_j\equiv0$.
Def: Two vectors are orthogonal to each other, written $\mathbf x\perp \mathbf y$, if $$ \mathbf x^T\mathbf y=\mathbf y^T\mathbf x=\sum_j x_jy_j=0. $$ A set of vectors $\mathbf x^{(1)}, \mathbf x^{(2)}, \cdots, \mathbf x^{(n)}$ is mutually orthogonal iff $\mathbf x^{(i)T}\mathbf x^{(j)}=0$ for all $i\not=j$.
- A set of mutually orthogonal nonzero vectors is also linearly independent.
Def: A vector space $\mathcal S$ is a set of vectors that are closed under addition and scalar multiplication, that is, if $\mathbf x^{(1)}$ and $\mathbf x^{(2)}$ are in $\mathcal S$, then $c_1\mathbf x^{(1)}+c_2\mathbf x^{(2)}$ is in $\mathcal S$.
Def: A vector space $\mathcal S$ is said to be generated by a set of vectors $\mathbf x ^{(1)}, \mathbf x^{(2)},\cdots, \mathbf x^{(n)}$ if for every $\mathbf x\in\mathcal S$, there exist some coefficients $c_j$ so that we can write $$ \mathbf x=\sum_{j}c_j\mathbf x^{(j)}. $$ Def: If a vector space $\mathcal S$ is generated by a set of linearly independent vectors $\mathbf x ^{(1)}, \mathbf x^{(2)},\cdots, \mathbf x^{(n)}$, then this set of vectors forms a basis for the space $\mathcal S$.
Def: The set of all linear combinations of a set of vectors is called the span of that set.
- $\mathbf x\in span\{\mathbf x ^{(1)}, \mathbf x^{(2)},\cdots, \mathbf x^{(n)}\}$ if and only if there exist constants $c_j$ such that $\mathbf x=\sum_jc_j\mathbf x^{(j)}$.
Def: The number of vectors in the basis for a vector space $\mathcal S$ is the dimension of the space $\mathcal S$, written $dim(\mathcal S)$.
Def: The rank of a matrix $\mathbf A$ is the number of linearly independent rows or columns, denoted by $rank(\mathbf A)$ or $r(\mathbf A)$.
Def: The column space of a matrix, denoted by $\mathcal C(\mathbf A)$, is the vector space spanned by the columns of the matrix, that is $$ \mathcal C(\mathbf A )=\{\mathbf x: \text{there exists a vector }\mathbf c\text{ such that }\mathbf x=\mathbf A \mathbf c \}. $$
- $\dim(\mathcal C(\mathbf A))=rank(\mathbf A)$
- $rank(\mathbf A \mathbf B)\leq \min(rank(\mathbf A),rank(\mathbf B))$
- $\mathcal C(\mathbf A\mathbf B)\subset \mathcal C(\mathbf A)$
- If $\mathcal C(\mathbf A)\subset \mathcal C(\mathbf B)$, then there exists a matrix $\mathbf C$ such that $\mathbf A=\mathbf B\mathbf C$
Def: The null space of a matrix, denoted by $\mathcal N(\mathbf A)$, is defined by $\mathcal N(\mathbf A)=\{\mathbf y:\mathbf A\mathbf y=\mathbf 0\}$.
Notice that if the matrix $\mathbf A$ is $m\times n$, then vectors in $\mathcal N(\mathbf A)$ have dimension $n$ while vectors in $\mathcal C(\mathbf A)$ have dimension $m$. Mathematically, this may be expressed as $\mathcal C(\mathbf A)\subset \mathbb R^m$ and $\mathcal N(\mathbf A)\subset \mathbb R^n$.
If $\mathbf A$ has full-column rank, then $\mathcal N(\mathbf A)=\{\mathbf 0\}$
Suppose $\mathbf A\in\mathbb R ^{m\times n}$, then $\dim(\mathcal N(\mathbf A))=n-r$ where $r=rank(\mathbf A)$, or, more elegantly, $$ \dim(\mathcal N(\mathbf A))+\dim(\mathcal C(\mathbf A))=n. $$
Def: Two vector spaces $\mathcal S$ and $\mathcal T$ form orthogonal complements in $\mathbb R^m$ if and only if $\mathcal S$, $\mathcal T\subset\mathbb R^m$, $\mathcal S\cap \mathcal T=\{\mathbf 0\}$, $\dim(\mathcal S)+\dim(\mathcal T)=m$, and every vector in $\mathcal S$ is orthogonal to every vector in $\mathcal T$.
- Let $\mathcal S$ and $\mathcal T$ be orthogonal complements in $\mathbb R^m$, then any vector $\mathbf x\in\mathbb R^m$ can be written as $\mathbf x=\mathbf s+\mathbf t$ where $\mathbf s\in\mathcal S$ and $\mathbf t\in\mathcal T$, and this decomposition is unique
- If $\mathbf A$ is an $m\times n$ matrix, then $\mathcal C(\mathbf A)$ and $\mathcal N(\mathbf A^T)$ are orthogonal complements in $\mathbb R^m$
- Let $\mathcal S_1$ and $\mathcal T_1$ be orthogonal complements, as well as $\mathcal S_2$ and $\mathcal T_2$; then if $\mathcal S_1\subset \mathcal S_2$, then $\mathcal T_2\subset \mathcal T_1$
- Consider two vector spaces $\mathcal S$ and $\mathcal T$. If $\mathcal S\subset \mathcal T$ and $\dim(\mathcal S)=\dim(\mathcal T)=k$, then $\mathcal S=\mathcal T$
- Let $\mathbf A$ be an $m\times n$ matrix and $\mathbf b$ a fixed vector. If $\mathbf A\mathbf x+\mathbf b=\mathbf 0$ for all $\mathbf x\in\mathbb R^n$, then $\mathbf A=\mathbf 0$ and $\mathbf b=\mathbf 0$
- If $\mathbf B\mathbf x=\mathbf C\mathbf x$ for all $\mathbf x$, then $\mathbf B=\mathbf C$
- Let $\mathbf A$ have full-column rank; then if $\mathbf A\mathbf B=\mathbf A\mathbf C$, then $\mathbf B=\mathbf C$
- If $\mathbf C^T\mathbf C=\mathbf 0$, then $\mathbf C=\mathbf 0$
Def: A system of equations $\mathbf A\mathbf x=\mathbf c$ is consistent iff there exists a solution $\mathbf x^\star$ such that $\mathbf A\mathbf x^\star=\mathbf c$
- A system of equations $\mathbf A\mathbf x=\mathbf c$ is consistent iff $\mathbf c\in\mathcal C(\mathbf A)$
Def: A matrix $\mathbf G$ is a generalized inverse of the matrix $\mathbf A$ iff it satisfies $\mathbf A\mathbf G\mathbf A=\mathbf A$
Let $\mathbf A$ be an $m\times n$ matrix with rank $r$. If $\mathbf A$ can be partitioned as below, with $rank(\mathbf A)=rank(\mathbf C)=r$, $$ \mathbf A=\begin{bmatrix}\mathbf C&\mathbf D\\\mathbf E&\mathbf F\end{bmatrix}\begin{matrix}r\\m-r\end{matrix} $$ so that $\mathbf C$ is nonsingular, then the matrix $$ \mathbf G=\begin{bmatrix}\mathbf C^{-1}&\mathbf 0\\\mathbf 0&\mathbf 0\end{bmatrix}\begin{matrix}r\\n-r\end{matrix} $$ is a generalized inverse of $\mathbf A$
Let $\mathbf A$ be an $m\times n$ matrix with rank $r$. If $\mathbf A$ can be partitioned as below, with $rank(\mathbf A)=rank(\mathbf F)=r$, $$ \mathbf A=\begin{bmatrix}\mathbf C&\mathbf D\\\mathbf E&\mathbf F\end{bmatrix}\begin{matrix}m-r\\r\end{matrix} $$ so that $\mathbf F$ is nonsingular, then the matrix $$ \mathbf G=\begin{bmatrix}\mathbf 0&\mathbf 0\\\mathbf 0&\mathbf F^{-1}\end{bmatrix}\begin{matrix}n-r\\r\end{matrix} $$ is a generalized inverse of $\mathbf A$
Let $\mathbf A$ be an $m\times n$ matrix with rank $r$. Let $\mathbf P$ and $\mathbf Q$ be permutation matrices such that $$ \mathbf P\mathbf A\mathbf Q=\begin{bmatrix}\mathbf C&\mathbf D\\\mathbf E&\mathbf F\end{bmatrix} $$ where $rank(\mathbf A)=rank(\mathbf C)=r$ and $\mathbf C$ is nonsingular. Then the matrix $\mathbf G$ below is a generalized inverse of $\mathbf A$: $$ \mathbf G=\mathbf Q\begin{bmatrix}\mathbf C^{-1}&\mathbf 0\\\mathbf 0&\mathbf 0\end{bmatrix}\mathbf P $$
Let $\mathbf A\mathbf x=\mathbf c$ be a consistent system of equations and let $\mathbf G$ be a generalized inverse of $\mathbf A$; then $\mathbf G\mathbf c$ is a solution to the equations $\mathbf A\mathbf x=\mathbf c$
Let $\mathbf A\mathbf x=\mathbf c$ be a consistent system of equations and let $\mathbf G$ be a generalized inverse of $\mathbf A$; then $\tilde {\mathbf x}$ is a solution to the equations $\mathbf A\mathbf x=\mathbf c$ iff there exists a vector $\mathbf z$ such that $\tilde{\mathbf x}=\mathbf G\mathbf c+(\mathbf I-\mathbf G\mathbf A)\mathbf z$
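A minimal numerical sketch of these two facts (NumPy assumed; the rank-deficient matrix $\mathbf A$ and the vectors are made up for illustration, and the Moore–Penrose pseudoinverse is used as one convenient choice of generalized inverse):

```python
import numpy as np

# A 3x4 matrix of rank 2 (row 3 = row 1 + row 2), chosen only for illustration.
A = np.array([[1., 0., 1., 2.],
              [0., 1., 1., 1.],
              [1., 1., 2., 3.]])

# The Moore-Penrose pseudoinverse is one particular generalized inverse of A:
# it satisfies A G A = A (plus extra conditions that are not needed here).
G = np.linalg.pinv(A)
assert np.allclose(A @ G @ A, A)               # defining property of a g-inverse

# Build a consistent right-hand side by construction, so that c is in C(A).
c = A @ np.array([1., 2., 3., 4.])

# G c is one solution of A x = c ...
x0 = G @ c
assert np.allclose(A @ x0, c)

# ... and G c + (I - G A) z is a solution for every z.
rng = np.random.default_rng(0)
for _ in range(5):
    z = rng.standard_normal(4)
    assert np.allclose(A @ (x0 + (np.eye(4) - G @ A) @ z), c)
print("G c and G c + (I - G A) z all solve A x = c")
```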
Def: A square matrix $\mathbf P$ is idempotent iff $\mathbf P^2=\mathbf P$
Def: A square matrix $\mathbf P$ is a projection onto the vector space $\mathcal S$ iff
- $\mathbf P$ is idempotent,
- for any $\mathbf x$, $\mathbf P\mathbf x\in\mathcal S$, and
- if $\mathbf z\in\mathcal S$, $\mathbf P\mathbf z=\mathbf z$
$\mathbf A\mathbf A^g$ is a projection onto $\mathcal C(\mathbf A)$, where $\mathbf A^g$ denotes any generalized inverse of $\mathbf A$
$(\mathbf I-\mathbf A^g\mathbf A)$ is a projection onto $\mathcal N(\mathbf A)$
A symmetric, idempotent matrix $\mathbf P$ that projects onto the vector space $\mathcal S$ is unique
If a symmetric, idempotent matrix $\mathbf P$ projects onto $\mathcal S$, then $\mathbf I-\mathbf P$ projects onto its orthogonal complement
$trace(\mathbf A\mathbf B)=trace(\mathbf B\mathbf A)$
$trace(\mathbf A^T\mathbf A)=\sum_i\sum_jA_{ij}^2$
$|\mathbf A^{-1}|=1/|\mathbf A|$
$det\left(\begin{bmatrix}\mathbf A&\mathbf B\\\mathbf C&\mathbf D\end{bmatrix}\right)=|\mathbf A||\mathbf D-\mathbf C\mathbf A^{-1}\mathbf B|=|\mathbf D||\mathbf A-\mathbf B\mathbf D^{-1}\mathbf C|$
The spectral decomposition of a symmetric matrix is $\mathbf A=\mathbf Q\mathbf \Lambda\mathbf Q^T=\sum_j\lambda_j\mathbf q^{(j)}\mathbf q^{(j)T}$where $\mathbf \Lambda$ is a diagonal matrix of the eigenvalues, ordered in the same way the eigenvectors are stacked as columns in $\mathbf Q$
For a symmetric $n\times n$ matrix $\mathbf A$, $|\mathbf A|=\lambda_1\times\cdots\times\lambda_n$ and $trace(\mathbf A)=\lambda_1+\cdots+\lambda_n$, where $\lambda_j$ are the eigenvalues of $\mathbf A$
For a symmetric $n\times n$ matrix $\mathbf A$, the rank of it is equal to the number of nonzero eigenvalues
Def: A matrix $\mathbf A$ is nonnegative definite iff $\mathbf x^T\mathbf A\mathbf x\geq0$ for all $\mathbf x$
Def: A matrix $\mathbf A$ is positive definite iff $\mathbf x^T\mathbf A\mathbf x>0$ for all $\mathbf x$
- (Cholesky factorization) A square matrix $\mathbf A$ is positive definite iff there exists a nonsingular lower triangular matrix $\mathbf L$ such that $\mathbf A=\mathbf L\mathbf L^T$
- $\left(\mathbf A^{\frac{1}{2}}\right)^{-1}=\mathbf A^{-\frac{1}{2}}=\mathbf Q\mathbf \Lambda^{-\frac{1}{2}}\mathbf Q^T$
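A small sketch of the spectral decomposition, the eigenvalue facts above, and the symmetric square root (NumPy assumed; the positive definite matrix $\mathbf A$ is constructed arbitrarily for illustration):

```python
import numpy as np

# An arbitrary symmetric positive definite matrix: B B^T is nonnegative
# definite, and adding I pushes all eigenvalues above 1.
B = np.array([[2., 1., 0.],
              [0., 1., 1.],
              [1., 0., 2.]])
A = B @ B.T + np.eye(3)

# Spectral decomposition A = Q Lambda Q^T (eigh is for symmetric matrices).
lam, Q = np.linalg.eigh(A)
assert np.allclose(Q @ np.diag(lam) @ Q.T, A)

# |A| = product of eigenvalues, trace(A) = sum of eigenvalues.
assert np.isclose(np.linalg.det(A), np.prod(lam))
assert np.isclose(np.trace(A), np.sum(lam))

# Symmetric square root A^{1/2} = Q Lambda^{1/2} Q^T and its inverse.
A_half = Q @ np.diag(np.sqrt(lam)) @ Q.T
A_neg_half = Q @ np.diag(1.0 / np.sqrt(lam)) @ Q.T
assert np.allclose(A_half @ A_half, A)
assert np.allclose(A_neg_half, np.linalg.inv(A_half))
```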
Chapter 1: The GLM and Examples
GLM: $\mathbf y=\mathbf X\mathbf b+\mathbf e$
$\mathbf y$: $N\times 1$ random vector of observable responses
$\mathbf X$: $N\times p$ matrix of known constants
$\mathbf b$: $p\times1$ vector of unknown constants
$\mathbf e$: $N\times 1$ random vector of unobservable error such that $\mathbb E(\mathbf e)=\mathbf 0$
It is called a linear model because the expected value of $\mathbf y$, $\mathbb E(\mathbf y)$, is linear in $\mathbf b$.
Define $\mu(\mathbf b)=\mathbf X\mathbf b=\mathbb E(\mathbf y)$. $\mu(\cdot)$ is a linear function: Fix $\mathbf b^{(1)}$, $\mathbf b^{(2)}\in\mathbb R^p$, and $a_1,a_2\in\mathbb R$, then $$ \mu(a_1\mathbf b^{(1)}+a_2\mathbf b^{(2)})=\mathbf X(a_1\mathbf b^{(1)}+a_2\mathbf b^{(2)})=a_1\mathbf X\mathbf b^{(1)}+a_2\mathbf X\mathbf b^{(2)}=a_1\mu(\mathbf b^{(1)})+a_2\mu(\mathbf b^{(2)}) $$ Examples:
One-sample problem (p=1): $ y_1,\cdots,y_N$ i.i.d., $\mathbb E(y_1)=\mu$, $Var( y_1)=\sigma^2$. This is a special case of the GLM with $\mathbf y=(y_1,\cdots,y_N)^T$ and $\mathbf X\mathbf b=\mathbf 1_N\mu$; $\mathbf e$ is just a vector of i.i.d. centered errors $e_1\stackrel{d}{=}y_1-\mu$.
Simple linear regression (p=2): $y_i=\beta_0+\beta_1x_i+e_i$, $i=1,2,\cdots,N$, where $\{e_i\}_{i=1}^N$ uncorrelated r.v.s with $\mathbb E(e_i)=0$ and $Var(e_i)=\sigma^2$. This is a special case of the GLM with $\mathbf y=(y_1,\cdots,y_N)^T$, $\mathbf X\mathbf b=\begin{pmatrix}1&x_1\\\vdots&\vdots\\1&x_N\end{pmatrix}\begin{pmatrix}\beta_0\\\beta_1\end{pmatrix}$, $\mathbf e=(e_1,\cdots,e_N)^T$.
Actually we are making the assumption that the $x_i$’s are measured without error.
Eg. $y_i=\beta_0+\beta_1n_i+e_i$ (True Model)
$y_i$ is the crop yield, and $n_i$ is the nitrogen in soil. We don’t see $n_i$ exactly; we see a distorted version $x_i=n_i+u_i$, where $u_i$ is a random measurement error.
$y_i=\beta_0+\beta_1(n_i+u_i)+e_i$ (Observed Model) $\rightarrow$ Is this a special case of the GLM? No!
Multiple regression (p=k+1): $y_i=\beta_0+\beta_1x_{i1}+\beta_2x_{i2}+\cdots+\beta_kx_{ik}+e_i$, $i=1,2,\cdots,N$, where $\{e_i\}^N_{i=1}$ uncorrelated r.v.s with $\mathbb E(e_i)=0$ and $Var(e_i)=\sigma^2$. This is a special case of the GLM with $\mathbf y=(y_1,\cdots,y_N)^T$, $\mathbf X\mathbf b=\begin{pmatrix}1&x_{11}&\cdots&x_{1k}\\\vdots&\vdots&\ddots&\vdots\\1&x_{N1}&\cdots&x_{Nk}\end{pmatrix}\begin{pmatrix}\beta_0\\\vdots\\\beta_k\end{pmatrix}$, $\mathbf e=(e_1,\cdots,e_N)^T$.
- $y_i=\beta_0+\beta_1i+\beta_2i^2+e_i$ (Yes)
- $y_i=\beta_0+\beta_1\cos\left(\frac{2\pi i}{7}\right)+\beta_2\log(i+1)+e_i$ (Yes)
- $y_i=\beta_0+\beta_1 e^{-\beta_2x_i}+e_i$ (No)
- $y_i=\frac{\beta_0}{\beta_1+\beta_2 x_i}+e_i$ (No)
One-way ANOVA: $y_{ij}=\mu+\alpha_i+e_{ij}$, $i=1,\cdots, a$, $j=1,\cdots, n_i$, where $\{e_{ij}\}^{a,n_i}_{i,j=1}$ uncorrelated r.v.s with $\mathbb E(e_{ij})=0$ and $Var(e_{ij})=\sigma^2$. If we assume that $\{\alpha_i\}^a_{i=1}$ are unknown constants, then this is a special case of the GLM with $\mathbf y=(y_{11},\cdots,y_{1n_1},y_{21},\cdots,y_{2n_2},\cdots,y_{a1},\cdots, y_{an_a})^T$, $\mathbf X\mathbf b=\begin{pmatrix}\mathbf 1_{n_1}&\mathbf 1_{n_1}&\mathbf0&\cdots&\mathbf 0\\\mathbf 1_{n_2}&\mathbf 0&\mathbf 1_{n_2}&\cdots&\mathbf 0\\\vdots&\vdots&\vdots&\ddots&\vdots\\\mathbf 1_{n_a}&\mathbf0&\mathbf0&\cdots&\mathbf1_{n_a}\end{pmatrix}\begin{pmatrix}\mu\\\alpha_1\\\vdots\\\alpha_a\end{pmatrix}$, $\mathbf e=(e_{11},\cdots,e_{1n_1},e_{21},\cdots,e_{2n_2},\cdots,e_{a1},\cdots, e_{an_a})^T$.
Comments:
$\mathbf X$ is not full rank
If the $a$ treatments were randomly selected from a population of treatments, then $\{\alpha_i\}_{i=1}^a$ must be considered random, and in this case, we have a so-called mixed model. However, if we redefine $\mathbf X\mathbf b$ and $\mathbf e$, this mixed model is still a special case of the GLM with $\mathbf X\mathbf b=\begin{pmatrix}1\\1\\\vdots\\1\end{pmatrix}\mu$, $\mathbf e=\begin{pmatrix}\alpha_1+e_{11}\\\alpha_1+e_{12}\\\vdots\\\alpha_a+e_{a1}\\\vdots\\\alpha_a+e_{an_a}\end{pmatrix}$
Here elements of $\mathbf e$ are no longer uncorrelated–but that’s okay.
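As a concrete illustration of the fixed-effects one-way ANOVA design above, here is a sketch that builds $\mathbf X$ and confirms it is not of full column rank (NumPy assumed; the helper `one_way_X` and the group sizes are made up for illustration):

```python
import numpy as np

def one_way_X(n_sizes):
    """Design matrix for y_ij = mu + alpha_i + e_ij with group sizes n_1, ..., n_a."""
    a, N = len(n_sizes), sum(n_sizes)
    X = np.zeros((N, a + 1))
    X[:, 0] = 1.0                              # intercept column for mu
    row = 0
    for i, n_i in enumerate(n_sizes):
        X[row:row + n_i, i + 1] = 1.0          # indicator column for alpha_i
        row += n_i
    return X

X = one_way_X([3, 2, 4])                       # hypothetical group sizes
print(X.shape)                                 # (9, 4): N = 9, p = a + 1 = 4
print(np.linalg.matrix_rank(X))                # 3 = a < p, so X is not full rank
```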
Two-way nested model: $y_{ijk}=\mu+\alpha_i+\beta_{ij}+e_{ijk}$, $i=1,\cdots,a$, $j=1,\cdots,b_i$, $k=1,\cdots,n_{ij}$, where $\{e_{ijk}\}_{i,j,k=1}^{a,b_i,n_{ij}}$ uncorrelated r.v.s with $\mathbb E(e_{ijk})=0$ and $Var(e_{ijk})=\sigma^2$. We’ll assume the $\{\alpha_i\}_{i=1}^a$ and $\{\beta_{ij}\}_{i,j=1}^{a,b_i}$ are fixed treatment effects. This is a special case of the GLM.
Eg: Hospital costs from medical patients in two states (Florida & Pennsylvania)
$b_1=2$ Florida (Dade county, Monroe county)
$b_2=3$ Penn (A, B, E)
$\mathbf y=\begin{pmatrix}y_{F,D,1}\\\vdots\\ y_{F,D,n_{11}}\\ y_{F,M,1}\\\vdots\\ y_{F,M,n_{12}}\\ y_{P,A,1}\\\vdots\\ y_{P,E,n_{23}}\end{pmatrix}$,
$\mathbf X\mathbf b=\begin{pmatrix}\mathbf 1_{n_{11}}&\mathbf1_{n_{11}}&\mathbf0&\mathbf1_{n_{11}}&\mathbf0&\mathbf0&\mathbf0&\mathbf0\\\mathbf 1_{n_{12}}&\mathbf1_{n_{12}}&\mathbf0&\mathbf0&\mathbf1_{n_{12}}&\mathbf0&\mathbf0&\mathbf0\\\mathbf 1_{n_{21}}&\mathbf0&\mathbf1_{n_{21}}&\mathbf0&\mathbf0&\mathbf1_{n_{21}}&\mathbf0&\mathbf0\\\mathbf 1_{n_{22}}&\mathbf0&\mathbf1_{n_{22}}&\mathbf0&\mathbf0&\mathbf0&\mathbf1_{n_{22}}&\mathbf0\\\mathbf 1_{n_{23}}&\mathbf0&\mathbf1_{n_{23}}&\mathbf0&\mathbf0&\mathbf0&\mathbf0&\mathbf1_{n_{23}}\end{pmatrix}\begin{pmatrix}\mu\\\alpha_F\\\alpha_P\\\beta_{F,D}\\\beta_{F,M}\\\beta_{P,A}\\\beta_{P,B}\\\beta_{P,E}\end{pmatrix}$
$p=a+1+\sum_{i=1}^ab_i$, $N=\sum_{i=1}^a\sum_{j=1}^{b_i}n_{ij}$
Two-way crossed model: $y_{ijk}=\mu+\alpha_i+\beta_j+\gamma_{ij}+e_{ijk}$, $i=1,\cdots,a$, $j=1,\cdots,b$, $k=1,\cdots,n_{ij}$, where $\{e_{ijk}\}_{i,j,k=1}^{a,b,n_{ij}}$ uncorrelated r.v.s with $\mathbb E(e_{ijk})=0$ and $Var(e_{ijk})=\sigma^2$. Assume all fixed effects. This is a special case of the GLM.
Simple special case: $y_{ij}=\mu+\alpha_i+\beta_j+e_{ij}$, $i=1,\cdots,a$, $j=1,\cdots,b$.
$\rightarrow$Randomized block model: $\alpha_i$-treatment effects, $\beta_j$-block effects. If block effects are considered fixed, this is a special case of the GLM.
$p=a+b+1$, $N=ab$, $\mathbf y=(y_{11},\cdots,y_{1b},y_{21},\cdots,y_{2b},\cdots,y_{a1},\cdots,y_{ab})^T$, $\mathbf X\mathbf b=\begin{pmatrix}\mathbf 1_b&\mathbf 1_b&\mathbf 0&\cdots&\mathbf 0&\mathbf I_b\\\mathbf 1_b&\mathbf 0&\mathbf 1_b&\cdots&\mathbf 0&\mathbf I_b\\\vdots&\vdots&\vdots&\ddots&\vdots&\vdots\\\mathbf 1_b&\mathbf 0&\mathbf 0&\cdots&\mathbf 1_b&\mathbf I_b \end{pmatrix}\begin{pmatrix}\mu\\\alpha_1\\\vdots\\\alpha_a\\\beta_1\\\vdots\\\beta_b\end{pmatrix}$
if block effects are considered random, then $\mathbf X\mathbf b=\begin{pmatrix}\mathbf 1_b&\mathbf 1_b&\mathbf 0&\cdots&\mathbf 0\\\mathbf 1_b&\mathbf 0&\mathbf 1_b&\cdots&\mathbf 0\\\vdots&\vdots&\vdots&\ddots&\vdots\\\mathbf 1_b&\mathbf 0&\mathbf 0&\cdots&\mathbf 1_b\end{pmatrix}\begin{pmatrix}\mu\\\alpha_1\\\vdots\\\alpha_a\\\end{pmatrix}$, $\mathbf e=\begin{pmatrix}\beta_1+e_{11}\\\vdots\\\beta_b+e_{1b}\\\vdots\\\beta_1+e_{a1}\\\vdots\\\beta_b+e_{ab}\end{pmatrix}$.
Analysis of covariance: $y_{ij}=\mu+\alpha_i+\beta x_{ij}+e_{ij}$, $i=1,\cdots,a$, $j=1,\cdots,n_i$, where $\{e_{ij}\}_{i,j=1}^{a,n_i}$ uncorrelated r.v.s with $\mathbb E(e_{ij})=0$ and $Var(e_{ij})=\sigma^2$.
Eg: $y_{ij}=\text{weight gain for individual }j\text{ on diet }i$, $x_{ij}=\text{ initial weight}$
$p=a+2$, $N=\sum_{i=1}^an_i$, $\mathbf y=(y_{11},\cdots,y_{1n_1},y_{21},\cdots,y_{2n_2},\cdots,y_{a1},\cdots, y_{an_a})^T$, $\mathbf X\mathbf b=\begin{pmatrix}\mathbf 1_{n_1}&\mathbf 1_{n_1}&\mathbf0&\cdots&\mathbf 0&\tilde{\mathbf x}_1\\ \mathbf 1_{n_2}&\mathbf 0&\mathbf 1_{n_2}&\cdots&\mathbf 0&\tilde{\mathbf x}_2\\ \vdots&\vdots&\vdots&\ddots&\vdots&\vdots\\ \mathbf 1_{n_a}&\mathbf0&\mathbf0&\cdots&\mathbf1_{n_a}&\tilde{\mathbf x}_a\end{pmatrix}\begin{pmatrix}\mu\\\alpha_1\\\vdots\\\alpha_a\\\beta\end{pmatrix}$, $\tilde{\mathbf x}_i=(x_{i1},\cdots,x_{in_i})^T$, $\mathbf e=(e_{11},\cdots,e_{1n_1},e_{21},\cdots,e_{2n_2},\cdots,e_{a1},\cdots, e_{an_a})^T$.
We will gradually build up assumptions on $\mathbf e$:
- No assumptions. Just find $\mathbf b$ to minimize $|\mathbf y-\mathbf X\mathbf b|$
- Assume $\mathbb E(\mathbf e)=0$. Find unbiased estimators of $\pmb \lambda^T\mathbf b$ , where $\pmb\lambda^T=(\lambda_1,\cdots,\lambda_p)$ (Fixed vector)
- Assume $\{e_i\}_{i=1}^N$ are uncorrelated with $\mathbb E(\mathbf e)=\mathbf 0$ and $Var(e_i)=\sigma^2$. Find the minimum variance unbiased estimator of $\pmb \lambda^T\mathbf b$
- Assume $\mathbf e$ is MVN and develop tests and CIs concerning $\pmb \lambda^T\mathbf b$
Chapter 2: The Linear Least Squares Problem
2.1 The Normal Equations
Assumption: $\mathbf y\in\mathbb R^N$ and $\mathbf X\in\mathbb R^{N\times p}$
Goal: Find the value of $\mathbf b\in\mathbb R^p$ that minimizes $$ Q(\mathbf b)=(\mathbf y-\mathbf X\mathbf b)^T(\mathbf y-\mathbf X\mathbf b)=|\mathbf y-\mathbf X\mathbf b|^2 $$ A value of $\mathbf b$ that minimizes $Q(\mathbf b)$ will be called a least squares solution $$ \mathcal C(\mathbf X)=\{\mathbf v\in\mathbb R^N:\mathbf v=\mathbf X\mathbf w\text{ for some }\mathbf w\in\mathbb R^p\} $$ Mathematically, trying to find the closest point in $\mathcal C(\mathbf X)$ to the data $\mathbf y$
The gradient vector is given by $$ \frac{\partial Q}{\partial \mathbf b}=(\frac{\partial Q}{\partial b_1},\frac{\partial Q}{\partial b_2},\cdots,\frac{\partial Q}{\partial b_p})^T $$ The minimum of $Q$ will occur where the gradient is zero, so we want to solve $$ \frac{\partial Q}{\partial\mathbf b}=\mathbf 0 $$ Result 2.1: Assume $\mathbf a$ is a $p\times 1$ vector, $\mathbf b$ is a $p\times1$ vector, $\mathbf A$ is a $p\times p$ matrix. Then
- $\frac{\partial \mathbf a^T\mathbf b}{\partial \mathbf b}=\mathbf a$
- $\frac{\partial\mathbf b^T\mathbf A\mathbf b}{\partial \mathbf b}=(\mathbf A+\mathbf A^T)\mathbf b$
proof: For the first result, note that $\mathbf a^T\mathbf b=a_1b_1+a_2b_2+\cdots+a_pb_p$, so $\left(\frac{\partial\mathbf a^T\mathbf b}{\partial \mathbf b}\right)_j=\frac{\partial\mathbf a^T\mathbf b}{\partial b_j}=a_j$.
Similarly, $$ \mathbf b^T\mathbf A\mathbf b=\sum_{j=1}^p\sum_{k=1}^pA_{jk}b_jb_k, $$ and to form $\left(\frac{\partial\mathbf b^T\mathbf A\mathbf b}{\partial\mathbf b}\right)_j$, the terms that depend on $b_j$ are $$ A_{jj}b_j^2+\sum_{k\not=j}(A_{jk}+A_{kj})b_kb_j, $$ so the partial derivative w.r.t. $b_j$ is $$ 2A_{jj}b_j+\sum_{k\not=j}A_{jk}b_k+\sum_{k\not=j}A_{kj}b_k, $$ which is the $j$-th element of $(\mathbf A+\mathbf A^T)\mathbf b$.
Back to $Q(\mathbf b)$, using Result 2.1, $$ \frac{\partial Q}{\partial \mathbf b}=-2\mathbf X^T\mathbf y+2\mathbf X^T\mathbf X\mathbf b. $$ Setting the gradient to $\mathbf 0$ yields $$ \mathbf X^T\mathbf X\mathbf b=\mathbf X^T\mathbf y. $$ These are the normal equations (N.E.s).
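A quick numerical check of this gradient formula against finite differences (NumPy assumed; the design matrix and data are simulated purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 20, 3
X = rng.standard_normal((N, p))
y = rng.standard_normal(N)
b = rng.standard_normal(p)

Q = lambda b: (y - X @ b) @ (y - X @ b)        # Q(b) = |y - Xb|^2
grad = -2 * X.T @ y + 2 * X.T @ X @ b          # gradient from Result 2.1

# Central finite differences, coordinate by coordinate.
eps = 1e-6
fd = np.array([(Q(b + eps * e) - Q(b - eps * e)) / (2 * eps) for e in np.eye(p)])
assert np.allclose(grad, fd, atol=1e-4)
print("analytic gradient matches finite differences")
```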
Example:
- SLR: $y_i=\beta_0+\beta_1x_i+e_i$, $i=1,\cdots,N$ $$ \mathbf X^T\mathbf X\mathbf b=\begin{pmatrix}N&N\bar{x}\\N\bar{x}&\sum_{i=1}^Nx_i^2\end{pmatrix}\begin{pmatrix}\beta_0\\\beta_1\end{pmatrix},\quad \mathbf X^T\mathbf y=\begin{pmatrix}N\bar{y}\\\sum_{i=1}^Nx_iy_i\end{pmatrix} $$ Assume that $\sum_{i=1}^N(x_i-\bar{x})^2>0$, i.e., the $x_i$’s are NOT all the same. The solution to the N.E.s is $$ \hat \beta_1=\frac{\sum_{i=1}^N(x_i-\bar{x})y_i}{\sum_{i=1}^N(x_i-\bar{x})^2}\\ \hat\beta_0=\bar{y}-\hat \beta_1\bar{x} $$ If $\sum_{i=1}^N(x_i-\bar{x})^2=0$, then there are infinitely many solutions to the N.E.s: $$ \hat\beta_1=c\\\hat\beta_0=\bar{y}-c\bar{x} $$ for any $c\in\mathbb R$.
Balanced one-way ANOVA:
$y_{ij}=\mu+\alpha_i+e_{ij}$, $i=1,\cdots,a$, $j=1,\cdots,n$ $$ \mathbf X=\begin{pmatrix}\mathbf 1_n&\mathbf 1_n&\mathbf 0&\cdots&\mathbf 0\\\mathbf 1_n&\mathbf 0&\mathbf 1_n&\cdots&\mathbf 0\\\vdots&\vdots&\vdots&\ddots&\vdots\\\mathbf 1_n&\mathbf 0&\mathbf 0&\cdots&\mathbf 1_n\end{pmatrix} $$
$$ \mathbf X^T\mathbf X\mathbf b=\begin{pmatrix}an&n&n&\cdots&n\\n&n&0&\cdots&0\\n&0&n&\cdots&0\\\vdots&\vdots&\vdots&\ddots&\vdots\\n& 0& 0&\cdots&n\end{pmatrix}\begin{pmatrix}\mu\\\alpha_1\\\vdots\\\alpha_a\end{pmatrix},\quad \mathbf X^T\mathbf y=\begin{pmatrix}y_{\cdot\cdot}\\y_{1\cdot}\\\vdots\\y_{a\cdot}\end{pmatrix} $$
$\mathbf X^T\mathbf X$ is singular; again, there are infinitely many solutions to the N.E.s $$ \hat\mu=c\\\hat\alpha_i=\bar{y}_{i\cdot}-c $$ for any $c\in\mathbb R$.
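A sketch of the balanced one-way ANOVA normal equations and their family of solutions (NumPy assumed; $a$, $n$, and the responses are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
a, n = 3, 4                                     # a = 3 groups of n = 4 observations
X = np.column_stack([np.ones(a * n), np.kron(np.eye(a), np.ones((n, 1)))])
y = rng.standard_normal(a * n)

XtX, Xty = X.T @ X, X.T @ y
print(np.linalg.matrix_rank(XtX))               # 3 < p = 4, so X^T X is singular

ybar = y.reshape(a, n).mean(axis=1)             # group means ybar_i
for c in (0.0, 1.0, -2.5):                      # any c in R works
    b = np.concatenate([[c], ybar - c])         # (mu, alpha_1, ..., alpha_a) = (c, ybar_i - c)
    assert np.allclose(XtX @ b, Xty)
print("every choice of c solves the normal equations")
```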
Next, we will show that the N.E.s are consistent and we’ll also show that $$ Q(\mathbf b)\text{ is minimized at }\hat{\mathbf b}\quad\Leftrightarrow\quad \hat{\mathbf b}\text{ is a solution to the N.E.s} $$
2.2 The Geometry of Least Squares
Recall that a vector space, $\mathcal S$, in $\mathbb R^m$ is a collection of vectors that is closed under addition and scalar multiplication. So if $\mathbf x^{(1)}, \mathbf x^{(2)}\in\mathcal S$ and $c_1,c_2\in\mathbb R$, then $c_1\mathbf x^{(1)}+c_2\mathbf x^{(2)}\in\mathcal S$.
Def A.4: Two vector spaces $\mathcal S$ and $\mathcal T$ form orthogonal complements in $\mathbb R^m$ if and only if $\mathcal S$, $\mathcal T\subset\mathbb R^m$, $\mathcal S\cap \mathcal T=\{\mathbf 0\}$, $\dim(\mathcal S)+\dim(\mathcal T)=m$, and $\mathbf s^T\mathbf t=0$ for every vector $\mathbf s$ in $\mathcal S$ and every vector $\mathbf t$ in $\mathcal T$.
Result A.4: Let $\mathcal S$ and $\mathcal T$ be orthogonal complements in $\mathbb R^m$, then any vector $\mathbf x\in\mathbb R^m$ can be written as $\mathbf x=\mathbf s+\mathbf t$ where $\mathbf s\in\mathcal S$ and $\mathbf t\in\mathcal T$, and this decomposition is unique
Result A.5: If $\mathbf A$ is an $m\times n$ matrix, then $\mathcal C(\mathbf A)$ and $\mathcal N(\mathbf A^T)$ are orthogonal complements in $\mathbb R^m$
Result A.6: Let $\mathcal S_1$ and $\mathcal T_1$ be orthogonal complements, as well as $\mathcal S_2$ and $\mathcal T_2$; then if $\mathcal S_1\subset \mathcal S_2$, then $\mathcal T_2\subset \mathcal T_1$
Question: Are the N.E.s consistent? In other words, is it always true that $\mathbf X^T\mathbf y\in\mathcal C(\mathbf X^T\mathbf X)$?
Lemma 2.1: $\mathcal N(\mathbf X^T\mathbf X)=\mathcal N(\mathbf X)$
proof: $\mathbf w\in\mathcal N(\mathbf X)\Rightarrow \mathbf X\mathbf w=\mathbf 0\Rightarrow \mathbf X^T\mathbf X\mathbf w=\mathbf 0\Rightarrow \mathbf w\in\mathcal N(\mathbf X^T\mathbf X)$
$\mathbf w\in\mathcal N(\mathbf X^T\mathbf X)\Rightarrow \mathbf X^T\mathbf X\mathbf w=\mathbf 0\Rightarrow \mathbf w^T\mathbf X^T\mathbf X\mathbf w=0\Rightarrow |\mathbf X\mathbf w|=0\Rightarrow \mathbf X\mathbf w=\mathbf 0\Rightarrow \mathbf w\in\mathcal N(\mathbf X)$
Result 2.2: $\mathcal C(\mathbf X^T\mathbf X)=\mathcal C(\mathbf X^T)$
proof: Result A.5 $\Rightarrow$$\mathcal N(\mathbf X)$ and $\mathcal C(\mathbf X^T)$ are O.C.s, $\mathcal N(\mathbf X^T\mathbf X)$ and $\mathcal C(\mathbf X^T\mathbf X)$ are O.C.s
Result A.6 $\Rightarrow$ $\mathcal N(\mathbf X)\subset \mathcal N(\mathbf X^T\mathbf X)\Rightarrow \mathcal C(\mathbf X^T\mathbf X)\subset\mathcal C(\mathbf X^T)$, $\mathcal N(\mathbf X^T\mathbf X)\subset \mathcal N(\mathbf X)\Rightarrow \mathcal C(\mathbf X^T)\subset\mathcal C(\mathbf X^T\mathbf X)$
Corollary 2.1: The N.E.s are consistent
proof: Result 2.2: $\mathbf X^T\mathbf y\in\mathcal C(\mathbf X^T)=\mathcal C(\mathbf X^T\mathbf X) $
Result 2.3: $\hat{\mathbf b}$ is a solution to the N.E.s iff $\hat{\mathbf b}$ minimizes $Q(\mathbf b)=|\mathbf y-\mathbf X\mathbf b|^2$
*By Result A.12, $\hat{\mathbf b}=(\mathbf X^T\mathbf X)^g\mathbf X^T\mathbf y$ is a solution to the N.E.s
proof: By Corollary 2.1, we know that there exists a solution to the N.E.s, $\hat{\mathbf b}$. $$ \begin{aligned} Q(\mathbf b)&=(\mathbf y-\mathbf X\mathbf b)^T(\mathbf y-\mathbf X\mathbf b)\\ &=(\mathbf y-\mathbf X\hat{\mathbf b}+\mathbf X\hat{\mathbf b}-\mathbf X\mathbf b)^T(\mathbf y-\mathbf X\hat{\mathbf b}+\mathbf X\hat{\mathbf b}-\mathbf X\mathbf b)\\ &=(\mathbf y-\mathbf X\hat{\mathbf b})^T(\mathbf y-\mathbf X\hat{\mathbf b})+2(\hat{\mathbf b}-\mathbf b)^T\mathbf X^T(\mathbf y-\mathbf X\hat{\mathbf b})+(\mathbf X\hat{\mathbf b}-\mathbf X\mathbf b)^T(\mathbf X\hat{\mathbf b}-\mathbf X\mathbf b)\\ &=Q(\hat{\mathbf b})+|\mathbf X(\hat{\mathbf b}-\mathbf b)|^2 \end{aligned} $$ where the cross-product term vanishes since $\hat{\mathbf b}$ solves the N.E.s, i.e., $\mathbf X^T(\mathbf y-\mathbf X\hat{\mathbf b})=\mathbf 0$. Since the second term is nonnegative, no value of $\mathbf b$ can give a smaller value of $Q$ than $Q(\hat{\mathbf b})$, so $\hat{\mathbf b}$ minimizes $Q(\mathbf b)$.
For the other direction, we still have $\hat{\mathbf b}$ that solves N.E.s, and we now know that $\hat{\mathbf b}$ minimizes $Q$ as well.
Suppose $\tilde {\mathbf b}$ minimizes $Q$. So $Q(\tilde {\mathbf b})=Q(\hat{\mathbf b})+|\mathbf X(\hat{\mathbf b}-\tilde{\mathbf b})|^2$. Since $Q(\tilde {\mathbf b})=Q(\hat{\mathbf b})$, we know $|\mathbf X(\hat{\mathbf b}-\tilde{\mathbf b})|^2=0\Rightarrow \mathbf X(\hat{\mathbf b}-\tilde{\mathbf b})=\mathbf 0\Rightarrow \mathbf X\hat{\mathbf b}=\mathbf X\tilde{\mathbf b}$. Now $\mathbf X^T\mathbf y=\mathbf X^T\mathbf X\hat{\mathbf b}=\mathbf X^T\mathbf X\tilde{\mathbf b}$ , so $\tilde{\mathbf b}$ satisfies the N.E.s.
Corollary 2.3: $\mathbf X\hat{\mathbf b}$ is invariant to the choice of a solution to the N.E.s.
proof: If $\hat{\mathbf b}$ and $\tilde{\mathbf b}$ are both solutions to N.E.s, then $Q(\tilde{\mathbf b})=Q(\hat{\mathbf b})+|\mathbf X(\hat{\mathbf b}-\tilde{\mathbf b})|^2$. So $|\mathbf X(\hat{\mathbf b}-\tilde{\mathbf b})|=0$.
We have shown that the closest point in $\mathcal C(\mathbf X)$ to $\mathbf y$ is the vector of fitted values, $\hat{\mathbf y}=\mathbf X\hat{\mathbf b}$. The residual vector $\hat{\mathbf e}=\mathbf y-\hat{\mathbf y}$ is in $\mathcal N(\mathbf X^T)$ since $\mathbf X^T\hat{\mathbf e}=\mathbf X^T(\mathbf y-\hat{\mathbf y})=\mathbf X^T\mathbf y-\mathbf X^T\mathbf X\hat{\mathbf b}=\mathbf 0$.
This unique orthogonal decomposition (from Result A.4) also yields a decomposition of sums of squares from the Pythagorean Theorem: $$ |\mathbf y|^2=|\hat{\mathbf y}|^2+|\hat{\mathbf e}|^2. $$ That is, the total sum of squares $|\mathbf y|^2$ is the sum of the regression sum of squares, or $SSR = |\mathbf X\hat{\mathbf b}|^2$, and error sum of squares, or $SSE=|\hat{\mathbf e}|^2$.
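A numerical check of the orthogonal decomposition and the sum-of-squares identity (NumPy assumed; a rank-deficient one-way ANOVA design with made-up data is used for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
# One-way ANOVA design, 3 groups of 4: p = 4 columns but rank 3.
X = np.column_stack([np.ones(12), np.kron(np.eye(3), np.ones((4, 1)))])
y = rng.standard_normal(12) + np.repeat([1.0, 2.0, 3.0], 4)

b_hat = np.linalg.pinv(X.T @ X) @ (X.T @ y)    # one solution of the N.E.s
y_hat = X @ b_hat                              # fitted values, the projection of y onto C(X)
e_hat = y - y_hat                              # residuals

assert np.allclose(X.T @ e_hat, 0)             # residuals lie in N(X^T)
# Pythagorean decomposition: |y|^2 = SSR + SSE.
assert np.isclose(y @ y, y_hat @ y_hat + e_hat @ e_hat)
```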
Now let’s look at a different way of developing this decomposition using projections.
Result 2.4: $\mathbf X^T\mathbf X\mathbf A=\mathbf X^T\mathbf X\mathbf B\Leftrightarrow\mathbf X\mathbf A=\mathbf X\mathbf B$.
proof: ($\Leftarrow$) Obvious
($\Rightarrow$) We will show two different proofs:
Algebraic: $\mathbf X^T\mathbf X(\mathbf A-\mathbf B)=\mathbf 0\Rightarrow(\mathbf A-\mathbf B)^T\mathbf X^T\mathbf X(\mathbf A-\mathbf B)=0\Rightarrow\left[\mathbf X(\mathbf A-\mathbf B)\right]^T\mathbf X(\mathbf A-\mathbf B)=0$
$\Rightarrow (\text{Lemma A.1}) \mathbf X(\mathbf A-\mathbf B)=\mathbf 0\Rightarrow \mathbf X\mathbf A=\mathbf X\mathbf B$.
Geometric: $\mathbf X^T\mathbf X(\mathbf A-\mathbf B)=\mathbf 0$ implies that the columns of $\mathbf X(\mathbf A-\mathbf B)$ are in $\mathcal N(\mathbf X^T)$. But the columns of $\mathbf X(\mathbf A-\mathbf B)$ are also all in $\mathcal C(\mathbf X)$. Since $\mathcal C(\mathbf X)$ and $\mathcal N(\mathbf X^T)$ are O.C.s, the only vector they have in common is the zero vector. Thus, each column of $\mathbf X(\mathbf A-\mathbf B)$ must be $\mathbf 0$, i.e., $\mathbf X(\mathbf A-\mathbf B)=\mathbf 0$.
Some material from the appendix:
Def: A square matrix $\mathbf P$ is a projection onto the vector space $\mathcal S$ iff
- $\mathbf P$ is idempotent,
- for any $\mathbf x$, $\mathbf P\mathbf x\in\mathcal S$, and
- if $\mathbf z\in\mathcal S$, $\mathbf P\mathbf z=\mathbf z$
Result A.14: $\mathbf A\mathbf A^g$ is a projection onto $\mathcal C(\mathbf A)$.
proof:
- $\mathbf A\mathbf A^g\mathbf A\mathbf A^g=\mathbf A\mathbf A^g$, so $\mathbf A\mathbf A^g$ is idempotent.
- For any $\mathbf x$, $\mathbf A\mathbf A^g\mathbf x=\mathbf A(\mathbf A^g\mathbf x)\in\mathcal C(\mathbf A)$.
- If $\mathbf z\in\mathcal C(\mathbf A)$, then $\mathbf z=\mathbf A\mathbf y$ for some $\mathbf y$, and $\mathbf A\mathbf A^g\mathbf z=\mathbf A\mathbf A^g\mathbf A\mathbf y=\mathbf A\mathbf y=\mathbf z$.
Result A.16: There is only one symmetric projection onto $\mathcal S$.
Result 2.5: $(\mathbf X^T\mathbf X)^g\mathbf X^T$ is a generalized inverse of $\mathbf X$.
proof: $(\mathbf X^T\mathbf X)(\mathbf X^T\mathbf X)^g(\mathbf X^T\mathbf X)=\mathbf X^T\mathbf X$. By Result 2.4, we have $\mathbf X(\mathbf X^T\mathbf X)^g\mathbf X^T\mathbf X=\mathbf X$.
Theorem 2.1: Define $\mathbf P_{\mathbf X}=\mathbf X(\mathbf X^T\mathbf X)^g\mathbf X^T$. $\mathbf P_{\mathbf X}$ is the symmetric projection onto $\mathcal C(\mathbf X)$.
proof: First, note that $\mathbf P_\mathbf X=\mathbf A\mathbf A^g$ with $\mathbf A=\mathbf X$ and $\mathbf A^g=(\mathbf X^T\mathbf X)^g\mathbf X^T$, which is a G.I. of $\mathbf X$ by Result 2.5. So by Result A.14, $\mathbf P_\mathbf X$ is a projection onto $\mathcal C(\mathbf X)$.
In order to establish symmetry, we first show that $\mathbf P_\mathbf X$ is invariant to the choice of G.I. of $\mathbf X^T\mathbf X$. Let $\mathbf G_1$ and $\mathbf G_2$ be two G.I.s of $\mathbf X^T\mathbf X$. Then $(\mathbf X^T\mathbf X)\mathbf G_1(\mathbf X^T\mathbf X)=\mathbf X^T\mathbf X=(\mathbf X^T\mathbf X)\mathbf G_2(\mathbf X^T\mathbf X)$. By Result 2.4: $$ \mathbf X\mathbf G_1(\mathbf X^T\mathbf X)=\mathbf X\mathbf G_2(\mathbf X^T\mathbf X)\Rightarrow\mathbf X^T\mathbf X\mathbf G_1^T\mathbf X^T=\mathbf X^T\mathbf X\mathbf G_2^T\mathbf X^T. $$ Using Result 2.4 again: $$ \mathbf X\mathbf G_1^T\mathbf X^T=\mathbf X\mathbf G_2^T\mathbf X^T\Rightarrow\mathbf X\mathbf G_1\mathbf X^T=\mathbf X\mathbf G_2\mathbf X^T. $$ Hence, invariance.
Now, note that $$ (\mathbf X^T\mathbf X)\mathbf G_1(\mathbf X^T\mathbf X)=\mathbf X^T\mathbf X\Rightarrow \mathbf X^T\mathbf X\mathbf G_1^T\mathbf X^T\mathbf X=\mathbf X^T\mathbf X $$ which indicates that if $\mathbf G_1$ is a G.I. of $\mathbf X^T\mathbf X$, then so is $\mathbf G_1^T$.
Finally, $$ \mathbf P_\mathbf X^T=\left[\mathbf X(\mathbf X^T\mathbf X)^g\mathbf X^T\right]^T=\mathbf X\left[(\mathbf X^T\mathbf X)^g\right]^T\mathbf X^T=\mathbf X(\mathbf X^T\mathbf X)^g\mathbf X^T=\mathbf P_\mathbf X, $$ where the equality $\mathbf X\left[(\mathbf X^T\mathbf X)^g\right]^T\mathbf X^T=\mathbf X(\mathbf X^T\mathbf X)^g\mathbf X^T$ follows from the invariance established above, since $\left[(\mathbf X^T\mathbf X)^g\right]^T$ is also a G.I. of $\mathbf X^T\mathbf X$. Result 2.6: $\mathbf I-\mathbf P_\mathbf X$ is the unique symmetric projection onto $\mathcal N(\mathbf X^T)$.
proof: Follows immediately from corollary A.4.
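A sketch verifying the three claims about $\mathbf P_\mathbf X$ numerically: symmetry, idempotence, and invariance to the choice of generalized inverse (NumPy assumed; the rank-deficient ANOVA design and the second g-inverse $\mathbf G_2$ are constructed only for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
# Rank-deficient one-way ANOVA design: 3 groups of 4, p = 4, rank 3.
X = np.column_stack([np.ones(12), np.kron(np.eye(3), np.ones((4, 1)))])
A = X.T @ X

# Two different generalized inverses of X^T X: the pseudoinverse, and a
# perturbation G1 + (I - G1 A) M, which still satisfies A G2 A = A.
G1 = np.linalg.pinv(A)
M = rng.standard_normal((4, 4))
G2 = G1 + (np.eye(4) - G1 @ A) @ M
assert np.allclose(A @ G2 @ A, A)

P1 = X @ G1 @ X.T
P2 = X @ G2 @ X.T
assert np.allclose(P1, P2)                      # invariant to the choice of g-inverse
assert np.allclose(P1, P1.T)                    # symmetric
assert np.allclose(P1 @ P1, P1)                 # idempotent
```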
Example: $\mathcal S=\{\mathbf v\in\mathbb R^N:\mathbf v=c\mathbf 1_N\text{ for some }c\in\mathbb R\}=\mathcal C(\mathbf 1_N)$
$\mathcal T=\{\mathbf w\in\mathbb R^N:\mathbf 1_N^T\mathbf w=0\}=\mathcal N(\mathbf 1_N^T)$
- $\mathcal S\cap\mathcal T=\{\mathbf 0\}$
- $\dim(\mathcal S)+\dim(\mathcal T)=1+N-1=N$ (Theorem A.1)
- $\mathbf v\in\mathcal S,\mathbf w\in\mathcal T$, $\mathbf v^T\mathbf w=c\mathbf 1^T_N\mathbf w=0$
$\mathbf P_{\mathbf 1_N}=\frac{1}{N}\mathbf 1_N\mathbf 1_N^T$, $\mathbf P_{\mathbf 1_N}\mathbf y=\mathbf 1_N\frac{1}{N}\mathbf 1^T_N\mathbf y=\mathbf 1_N\frac{1}{N}\sum_{i=1}^Ny_i=\mathbf 1_N\bar{y}$, $(\mathbf I-\mathbf P_{\mathbf 1_N})\mathbf y=\mathbf y-\mathbf P_{\mathbf 1_N}\mathbf y=\mathbf y-\mathbf 1_N\bar{y}$.
Recall that we have the orthogonal decomposition: $\mathbf y=\mathbf X\hat{\mathbf b}+(\mathbf y-\mathbf X\hat{\mathbf b})=\hat{\mathbf y}+\hat{\mathbf e}$.
Now, $\hat{\mathbf y}=\mathbf X\hat{\mathbf b}=\mathbf X(\mathbf X^T\mathbf X)^g\mathbf X^T\mathbf y=\mathbf P_\mathbf X\mathbf y\in\mathcal C(\mathbf X)$ and $\hat{\mathbf e}=\mathbf y-\mathbf X\hat{\mathbf b}=(\mathbf I-\mathbf P_\mathbf X)\mathbf y\in\mathcal N(\mathbf X^T)$.
Another way to think about minimizing $Q(\mathbf b)$: $$ Q(\mathbf b)=|\mathbf y-\mathbf X\mathbf b|^2=|\mathbf y-\hat{\mathbf y}+\hat{\mathbf y}-\mathbf X\mathbf b|^2=|\hat{\mathbf y}-\mathbf X\mathbf b|^2+|\hat{\mathbf e}|^2, $$ where the second piece cannot be minimized by varying $\mathbf b$. The first piece can be minimized to zero, because the equations $$ \mathbf X\mathbf b=\mathbf P_\mathbf X\mathbf y=\hat{\mathbf y} $$ are consistent because $\mathbf P_\mathbf X\mathbf y\in\mathcal C(\mathbf X)$.
Result 2.7: The solutions to the N.E.s are the same as the solutions to the consistent equations $\mathbf X\mathbf b=\mathbf P_\mathbf X\mathbf y$
proof: Assume that $\hat{\mathbf b}$ solves $\mathbf X\mathbf b=\mathbf P_\mathbf X\mathbf y$. Then $\mathbf X^T\mathbf X\hat{\mathbf b}=\mathbf X^T\mathbf P_\mathbf X\mathbf y=\mathbf X^T\mathbf y$, so $\hat{\mathbf b}$ solves N.E.s.
Now assume that $\hat{\mathbf b}$ solves N.E.s. Then $\mathbf X^T\mathbf X\hat{\mathbf b}=\mathbf X^T\mathbf y=\mathbf X^T\mathbf P_\mathbf X\mathbf y=\mathbf X^T\mathbf X(\mathbf X^T\mathbf X)^g\mathbf X^T\mathbf y$. Using Result 2.4, we have $\mathbf X\hat{\mathbf b}=\mathbf X(\mathbf X^T\mathbf X)^g\mathbf X^T\mathbf y=\mathbf P_\mathbf X\mathbf y$, so $\hat{\mathbf b}$ solves $\mathbf X\mathbf b=\mathbf P_\mathbf X\mathbf y$.
Theorem 2.2: If $\mathcal C(\mathbf W)\subset\mathcal C(\mathbf X)$, then $\mathbf P_\mathbf X-\mathbf P_\mathbf W$ is the symmetric projection onto $\mathcal C\left((\mathbf I-\mathbf P_\mathbf W)\mathbf X\right)$.
proof: We first show that $\mathbf P_\mathbf X-\mathbf P_\mathbf W$ is idempotent. $$ (\mathbf P_\mathbf X-\mathbf P_\mathbf W)^2=\mathbf P_\mathbf X^2-\mathbf P_\mathbf X\mathbf P_\mathbf W-\mathbf P_\mathbf W\mathbf P_\mathbf X+\mathbf P^2_\mathbf W=\mathbf P_\mathbf X-\mathbf P_\mathbf X\mathbf P_\mathbf W-\mathbf P_\mathbf W\mathbf P_\mathbf X+\mathbf P_\mathbf W. $$ Now, for any $\mathbf z$, $\mathbf P_\mathbf W\mathbf z\in\mathcal C(\mathbf W)\subset\mathcal C(\mathbf X)$. So $\mathbf P_\mathbf X\mathbf P_\mathbf W\mathbf z=\mathbf P_\mathbf W\mathbf z$, $\forall \mathbf z$. Hence $\mathbf P_\mathbf X\mathbf P_\mathbf W=\mathbf P_\mathbf W$ (Corollary A.1).
Then, since $\mathbf P_\mathbf X\mathbf P_\mathbf W=\mathbf P_\mathbf W$ and both projections are symmetric, $$ \mathbf P_\mathbf W\mathbf P_\mathbf X=(\mathbf P_\mathbf X\mathbf P_\mathbf W)^T=\mathbf P_\mathbf W^T=\mathbf P_\mathbf W\\ \Rightarrow (\mathbf P_\mathbf X-\mathbf P_\mathbf W)^2=\mathbf P_\mathbf X-\mathbf P_\mathbf W-\mathbf P_\mathbf W+\mathbf P_\mathbf W=\mathbf P_\mathbf X-\mathbf P_\mathbf W. $$ Second, any vector $\mathbf u$ can be decomposed as $\mathbf u=\mathbf X\mathbf s+\mathbf t$ where $\mathbf t\in\mathcal N(\mathbf X^T)$ (Results A.4 and A.5). So $$ \begin{aligned} (\mathbf P_\mathbf X-\mathbf P_\mathbf W)\mathbf u&=(\mathbf P_\mathbf X-\mathbf P_\mathbf W)(\mathbf X\mathbf s+\mathbf t)\\ &=\mathbf P_\mathbf X\mathbf X\mathbf s-\mathbf P_\mathbf W\mathbf X\mathbf s+\mathbf P_\mathbf X\mathbf t-\mathbf P_\mathbf W\mathbf t\\ &=\mathbf X\mathbf s-\mathbf P_\mathbf W\mathbf X\mathbf s+\mathbf 0-\mathbf P_\mathbf W\mathbf P_\mathbf X\mathbf t\\ &=\mathbf X\mathbf s-\mathbf P_\mathbf W\mathbf X\mathbf s+\mathbf 0-\mathbf 0\\ &=(\mathbf I-\mathbf P_\mathbf W)\mathbf X\mathbf s. \end{aligned} $$ So for any $\mathbf u$, $(\mathbf P_\mathbf X-\mathbf P_\mathbf W)\mathbf u\in\mathcal C((\mathbf I-\mathbf P_\mathbf W)\mathbf X)$.
Third, if $\mathbf y\in\mathcal C((\mathbf I-\mathbf P_\mathbf W)\mathbf X)$, then $\mathbf y=(\mathbf I-\mathbf P_\mathbf W)\mathbf X\mathbf c$ for some $\mathbf c$. $$ \begin{aligned} (\mathbf P_\mathbf X-\mathbf P_\mathbf W)\mathbf y&=(\mathbf P_\mathbf X-\mathbf P_\mathbf W)(\mathbf I-\mathbf P_\mathbf W)\mathbf X\mathbf c\\ &=(\mathbf P_\mathbf X-\mathbf P_\mathbf W)\mathbf X\mathbf c-(\mathbf P_\mathbf X-\mathbf P_\mathbf W)\mathbf P_\mathbf W\mathbf X\mathbf c\\ &=\mathbf X\mathbf c-\mathbf P_\mathbf W\mathbf X\mathbf c-\mathbf P_\mathbf X\mathbf P_\mathbf W\mathbf X\mathbf c+\mathbf P_\mathbf W\mathbf X\mathbf c\\ &=(\mathbf I-\mathbf P_\mathbf W)\mathbf X\mathbf c=\mathbf y \end{aligned} $$
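A numerical check of Theorem 2.2 (NumPy assumed; the nested designs $\mathbf W\subset\mathbf X$ and the helper `proj` are made up for illustration):

```python
import numpy as np

def proj(X):
    """Symmetric projection onto C(X): P_X = X (X^T X)^g X^T."""
    return X @ np.linalg.pinv(X.T @ X) @ X.T

rng = np.random.default_rng(5)
N = 10
W = np.ones((N, 1))                                      # intercept-only design
X = np.column_stack([W, rng.standard_normal((N, 2))])    # contains W, so C(W) is in C(X)

PX, PW = proj(X), proj(W)
D = PX - PW
assert np.allclose(D, D.T) and np.allclose(D @ D, D)     # symmetric and idempotent

Z = (np.eye(N) - PW) @ X
assert np.allclose(D @ Z, Z)       # D fixes vectors in C((I - P_W) X)
assert np.allclose(proj(Z), D)     # and equals the symmetric projection onto that space
```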
2.3 Reparametrization
One-way ANOVA:
- $y_{ij}=\mu+\alpha_i+e_{ij}$, $i=1,2,3$
- $y_{ij}=d_i+e_{ij}$
- $y_{ij}=c_1+c_2\mathbb 1_{\{1\}}(i)+c_3\mathbb 1_{\{2\}}(i)+e_{ij}$
These three versions of the ANOVA model should lead to equivalent inferences.
Def: Two L.M.s, $\mathbf y=\mathbf X\mathbf b+\mathbf e$, where $\mathbf X$ is $N\times p$, and $\mathbf y=\mathbf W\mathbf c+\mathbf e$, where $\mathbf W$ is $N\times t$, are equivalent, or reparametrizations of each other, if the two design matrices, $\mathbf X$ and $\mathbf W$, have the same column space, i.e., $\mathcal C(\mathbf X)=\mathcal C(\mathbf W)$.
Back to ANOVA examples: $$ \mathbf X\mathbf b=\begin{pmatrix}\mathbf 1_{n_1}&\mathbf 1_{n_1}&\mathbf 0&\mathbf 0\\ \mathbf 1_{n_2}&\mathbf 0&\mathbf 1_{n_2}&\mathbf 0\\\mathbf 1_{n_3}&\mathbf 0&\mathbf 0&\mathbf 1_{n_3}\end{pmatrix}\begin{pmatrix}\mu\\\alpha_1\\\alpha_2\\\alpha_3\end{pmatrix} $$
$$ \mathbf W\mathbf c=\begin{pmatrix}\mathbf 1_{n_1}&\mathbf 1_{n_1}&\mathbf 0\\\mathbf 1_{n_2}&\mathbf 0&\mathbf 1_{n_2}\\\mathbf 1_{n_3}&\mathbf 0&\mathbf 0\end{pmatrix}\begin{pmatrix}c_1\\c_2\\c_3\\\end{pmatrix} $$
$\mathcal C(\mathbf X)=\mathcal C(\mathbf W)$: Note that in $\mathbf X$, column 4 = column 1 − column 2 − column 3.
Result 2.8: If two L.M.s are equivalent, then $\mathbf P_\mathbf X=\mathbf P_\mathbf W$.
Corollary 2.4: If two L.M.s are equivalent, then $\hat{\mathbf y}$ and $\hat{\mathbf e}$ are the same for the two models.
proof: $\hat{\mathbf y}=\mathbf P_\mathbf X\mathbf y$, $\hat{\mathbf e}=(\mathbf I-\mathbf P_\mathbf X)\mathbf y$.
Assume $\mathcal C(\mathbf X)=\mathcal C(\mathbf W)$. Result A.2 implies the existence of two matrices, $\mathbf S$ and $\mathbf T$, such that $\mathbf W=\mathbf X\mathbf T$ and $\mathbf X=\mathbf W\mathbf S$.
Result 2.9: Assume $\mathcal C(\mathbf X)=\mathcal C(\mathbf W)$. If $\hat{\mathbf c}$ solves the N.E.s $\mathbf W^T\mathbf W\mathbf c=\mathbf W^T\mathbf y$, then $\hat{\mathbf b}=\mathbf T\hat{\mathbf c}$ solves the N.E.s $\mathbf X^T\mathbf X\mathbf b=\mathbf X^T\mathbf y$.
proof:
$$
\mathbf X^T\mathbf X\mathbf T\hat{\mathbf c}=\mathbf X^T\mathbf W\hat{\mathbf c}=\mathbf X^T\mathbf P_\mathbf W\mathbf y=\mathbf X^T\mathbf P_\mathbf X\mathbf y=\mathbf X^T\mathbf y.
$$
Example: $y_{ij}=\mu+\alpha_i+e_{ij}$, $i=1,2,3$.
$$
\hat{\mathbf b}=\begin{pmatrix}\bar{y}_{\cdot\cdot}\\
\bar{y}_{1\cdot}-\bar{y}_{\cdot\cdot}\\\bar{y}_{2\cdot}-\bar{y}_{\cdot\cdot}\\\bar{y}_{3\cdot}-\bar{y}_{\cdot\cdot}\end{pmatrix}
$$
General solution:
$$
\begin{pmatrix}0\\\bar{y}_{1\cdot}\\\bar{y}_{2\cdot}\\\bar{y}_{3\cdot}\end{pmatrix}+z\begin{pmatrix}1\\-1\\-1\\-1\end{pmatrix},\quad\text{for any }z\in\mathbb R
$$
$y_{ij}=c_1+c_2\mathbb 1_{\{1\}}(i)+c_3\mathbb 1_{\{2\}}(i)+e_{ij}$, $i=1,2,3$
$$
\hat{\mathbf c}=\begin{pmatrix}\bar{y}_{3\cdot}\\\bar{y}_{1\cdot}-\bar{y}_{3\cdot}\\\bar{y}_{2\cdot}-\bar{y}_{3\cdot}\end{pmatrix}
$$
For the $\mathbf W$ and $\mathbf X$ presented before, we have $\mathbf W=\mathbf X\mathbf T$, where $\mathbf T$ is given by
$$
\mathbf T=\begin{pmatrix}1&0&0\\0&1&0\\0&0&1\\0&0&0\end{pmatrix}
$$
$$ \hat{\mathbf b}=\mathbf T\hat{\mathbf c}=\begin{pmatrix}1&0&0\\0&1&0\\0&0&1\\0&0&0\end{pmatrix}\begin{pmatrix}\bar{y}_{3\cdot}\\\bar{y}_{1\cdot}-\bar{y}_{3\cdot}\\\bar{y}_{2\cdot}-\bar{y}_{3\cdot}\end{pmatrix}=\begin{pmatrix}\bar{y}_{3\cdot}\\\bar{y}_{1\cdot}-\bar{y}_{3\cdot}\\\bar{y}_{2\cdot}-\bar{y}_{3\cdot}\\0\end{pmatrix} $$ which is the general solution above with $z=\bar{y}_{3\cdot}$.
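A sketch of this reparametrization check in code (NumPy assumed; the three group sizes and the simulated responses are made up): it confirms $\mathbf W=\mathbf X\mathbf T$, that $\hat{\mathbf b}=\mathbf T\hat{\mathbf c}$ solves the $\mathbf X$ normal equations, and that the fitted values agree.

```python
import numpy as np

rng = np.random.default_rng(6)
n = [4, 3, 5]                                   # hypothetical group sizes
g = np.repeat([0, 1, 2], n)                     # group label (0-based) of each observation
y = rng.standard_normal(len(g)) + np.array([1.0, 2.0, 3.0])[g]

# X parametrization (mu, alpha_1, alpha_2, alpha_3): rank deficient.
X = np.column_stack([np.ones(len(g))] + [(g == i).astype(float) for i in range(3)])
# W parametrization (c_1, c_2 = 1{i=1}, c_3 = 1{i=2}): full column rank.
W = np.column_stack([np.ones(len(g)), (g == 0).astype(float), (g == 1).astype(float)])

T = np.vstack([np.eye(3), np.zeros((1, 3))])    # W = X T
assert np.allclose(X @ T, W)

c_hat = np.linalg.solve(W.T @ W, W.T @ y)       # unique solution of the W normal equations
b_hat = T @ c_hat
assert np.allclose(X.T @ X @ b_hat, X.T @ y)    # T c_hat solves the X normal equations
assert np.allclose(X @ b_hat, W @ c_hat)        # identical fitted values
```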
2.4 A Version of the QR Decomposition
Result: Let $\mathbf W_{N\times p}$ have full column rank. Then $\mathbf W$ can be decomposed as $\mathbf W=\mathbf Q\mathbf R$ where $\mathbf Q_{N\times p}$ is s.t. $\mathbf Q^T\mathbf Q=\mathbf I_p$ and $\mathbf R_{p\times p}$ is upper triangular, with positive elements on the diagonal.
Idea of the proof: Write the columns of $\mathbf W$ as $\mathbf W_{\cdot1},\mathbf W_{\cdot2},\cdots,\mathbf W_{\cdot p}$. We begin by constructing a set of orthogonal vectors $\mathbf U_{\cdot1},\mathbf U_{\cdot2},\cdots,\mathbf U_{\cdot p}$, each of which is a linear combination of the columns of $\mathbf W$, and such that for each $k=1,2,\cdots,p$, we have $span\{\mathbf W_{\cdot1},\mathbf W_{\cdot2},\cdots,\mathbf W_{\cdot k}\}=span\{\mathbf U_{\cdot1},\mathbf U_{\cdot2},\cdots,\mathbf U_{\cdot k}\}$.
Taking $k=p$ shows that $\mathcal C(\mathbf W)=\mathcal C(\mathbf U)$, where $\mathbf U$ is the matrix whose columns are $\mathbf U_{\cdot1},\mathbf U_{\cdot2},\cdots,\mathbf U_{\cdot p}$.
Since the $\mathbf U_{\cdot i}$’s are orthogonal, we have $\mathbf U^T\mathbf U=\mathbf D$ where $\mathbf D$ is a diagonal matrix with $k$-th diagonal element equal to $|\mathbf U_{\cdot k}|^2$.
So, if we take $\mathbf Q=\mathbf U\mathbf D^{-\frac{1}{2}}$, then $\mathbf Q^T\mathbf Q=\mathbf D^{-1/2}\mathbf U^T\mathbf U\mathbf D^{-1/2}=\mathbf D^{-1/2}\mathbf D^{1/2}\mathbf D^{1/2}\mathbf D^{-1/2}=\mathbf I$.
proof: We construct $\mathbf U_{\cdot 1},\mathbf U_{\cdot2},\cdots,\mathbf U_{\cdot p}$ through a sequence of least squares regressions. Recall basic facts about L.S. theory:
$\mathbf y=\mathbf X\mathbf b+\mathbf e$, $\mathbf X^T\mathbf X\mathbf b=\mathbf X^T\mathbf y$, $\hat{\mathbf b}=(\mathbf X^T\mathbf X)^g\mathbf X^T\mathbf y$, $\hat{\mathbf e}=\mathbf y-\mathbf X\hat{\mathbf b}$ $$ \mathbf X^T\hat{\mathbf e}=\mathbf X^T(\mathbf y-\mathbf X\hat{\mathbf b})=\mathbf X^T\mathbf y-\mathbf X^T\mathbf X\hat{\mathbf b}=\mathbf 0\ \Rightarrow\hat{\mathbf e}\text{ is orthogonal to every column of }\mathbf X. $$ Step 1: $\mathbf U_{\cdot1}=\mathbf W_{\cdot1}$
Step 2: Regress $\mathbf W_{\cdot2}$ on $\mathbf U_{\cdot1}$; that is, fit the model $\mathbf W_{\cdot2}=\mathbf U_{\cdot1}b+\mathbf e$. Then $$ \hat{b}^{(2)}=\frac{\mathbf U^T_{\cdot1}\mathbf W_{\cdot2}}{\mathbf U^T_{\cdot1}\mathbf U_{\cdot1}}. $$ Now set $\mathbf U_{\cdot 2}=\mathbf W_{\cdot2}-\mathbf U_{\cdot1}\hat{b}^{(2)}$ (Essentially $\mathbf U_{\cdot2}$ is $\hat{\mathbf e}$). So we have $\mathbf U_{\cdot1}\perp\mathbf U_{\cdot2}$.
Step 3: Regress $\mathbf W_{\cdot3}$ on $\mathbf U_{\cdot1}$ and $\mathbf U_{\cdot2}$; that is, fit the model $\mathbf W_{\cdot3}=(\mathbf U_{\cdot1};\mathbf U_{\cdot2})\begin{pmatrix}b_1\\b_2\end{pmatrix}+\mathbf e$. Then $$ \mathbf X^T\mathbf X=\begin{pmatrix}\mathbf U^T_{\cdot1}\\ \mathbf U^T_{\cdot2}\end{pmatrix}(\mathbf U_{\cdot1};\mathbf U_{\cdot2})=\begin{pmatrix}|\mathbf U_{\cdot1}|^2&0\\0&|\mathbf U_{\cdot2}|^2\end{pmatrix} $$ and it is easy to solve N.E. $\mathbf X^T\mathbf X\mathbf b=\mathbf X^T\mathbf y$ directly. For $j=1,2$: $$ \left(\hat{\mathbf b}^{(3)}\right)_j=\frac{\mathbf U^T_{\cdot j}\mathbf W_{\cdot3}}{\mathbf U^T_{\cdot j}\mathbf U_{\cdot j}}. $$ Now set $\mathbf U_{\cdot 3}=\mathbf W_{\cdot 3}-(\mathbf U_{\cdot1};\mathbf U_{\cdot2})\hat{\mathbf b}^{(3)}\Rightarrow \mathbf U_{\cdot3}\perp\mathbf U_{\cdot j}(j=1,2)$.
Now for $i\in\{3,4,5,\cdots,p-1\}$, at step $i+1$, we regress $\mathbf W_{\cdot i+1}$ on $\mathbf U_{\cdot1},\mathbf U_{\cdot2},\cdots,\mathbf U_{\cdot i}$. As above, for $j=1,2,\cdots,i$, $$ \left(\hat{\mathbf b}^{(i+1)}\right)_j=\frac{\mathbf U^T_{\cdot j}\mathbf W_{\cdot i+1}}{\mathbf U^T_{\cdot j}\mathbf U_{\cdot j}}. $$ Now set $\mathbf U_{\cdot i+1}=\mathbf W_{\cdot i+1}-(\mathbf U_{\cdot 1};\mathbf U_{\cdot2};\cdots;\mathbf U_{\cdot i})\hat{\mathbf b}^{(i+1)}\Rightarrow \mathbf U_{\cdot i+1}\perp\mathbf U_{\cdot j}(j=1,2,\cdots,i)$.
Now collect all of the $\hat{\mathbf b}$’s in a $p\times p$ matrix as follows: $$ \mathbf S=\begin{pmatrix} 1 &(\hat{\mathbf b}^{(2)})_1 & (\hat{\mathbf b}^{(3)})_1 &\cdots & (\hat{\mathbf b}^{(p)})_1\\ 0 & 1 & (\hat{\mathbf b}^{(3)})_2 &\cdots & (\hat{\mathbf b}^{(p)})_2\\ 0 &0&1&\cdots&(\hat{\mathbf b}^{(p)})_3 \\ \vdots &\vdots &\vdots&\ddots&\vdots \\ 0 &0&0&\cdots&(\hat{\mathbf b}^{(p)})_{p-1} \\ 0 &0&0&\cdots&1 \end{pmatrix} $$ More formally, $\mathbf S_{j,i+1}=\left(\hat{\mathbf b}^{(i+1)}\right)_j$ for $i=1,2,\cdots,p-1$ and $j=1,2,\cdots,i$. Diagonal elements are all $1$, and all entries below the diagonal are $0$.
Now for $i=1,2,\cdots,p-1$, we have $$ \mathbf U_{\cdot i+1}=\mathbf W_{\cdot i+1}-\sum_{j=1}^i\left(\hat{\mathbf b}^{(i+1)}\right)_j\mathbf U_{\cdot j} =\mathbf W_{\cdot i+1}-\sum_{j=1}^i\mathbf S_{j,i+1}\mathbf U_{\cdot j}\\ \Rightarrow \mathbf W_{\cdot i+1}=\mathbf U_{\cdot i+1}+\sum_{j=1}^i\mathbf S_{j,i+1}\mathbf U_{\cdot j}=\sum_{j=1}^{i+1}\mathbf S_{j,i+1}\mathbf U_{\cdot j}\\ \Rightarrow \mathbf W_{N\times p}=\mathbf U_{N\times p}\mathbf S_{p\times p} $$ Finally, we take $\mathbf Q=\mathbf U\mathbf D^{-\frac{1}{2}}$ and $\mathbf R=\mathbf D^{\frac{1}{2}}\mathbf S$. Then $$ \mathbf Q\mathbf R=\mathbf U\mathbf D^{-\frac{1}{2}}\mathbf D^{\frac{1}{2}}\mathbf S=\mathbf U\mathbf S=\mathbf W $$
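A direct transcription of this Gram–Schmidt construction into code (NumPy assumed; the helper name `gram_schmidt_qr` and the test matrix $\mathbf W$ are made up for illustration):

```python
import numpy as np

def gram_schmidt_qr(W):
    """QR of a full-column-rank W via the successive regressions described above."""
    N, p = W.shape
    U = np.zeros((N, p))
    S = np.eye(p)                              # unit upper triangular
    U[:, 0] = W[:, 0]
    for i in range(1, p):
        for j in range(i):                     # regression coefficients (b_hat^{(i+1)})_j
            S[j, i] = (U[:, j] @ W[:, i]) / (U[:, j] @ U[:, j])
        U[:, i] = W[:, i] - U[:, :i] @ S[:i, i]   # residual, orthogonal to U_1, ..., U_i
    d = np.sqrt(np.sum(U**2, axis=0))          # |U_k|, the square roots of diag(D)
    Q = U / d                                  # Q = U D^{-1/2}
    R = d[:, None] * S                         # R = D^{1/2} S
    return Q, R

rng = np.random.default_rng(7)
W = rng.standard_normal((8, 4))                # full column rank with probability one
Q, R = gram_schmidt_qr(W)
assert np.allclose(Q.T @ Q, np.eye(4))
assert np.allclose(np.triu(R), R) and np.all(np.diag(R) > 0)
assert np.allclose(Q @ R, W)
```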
Chapter 3: Estimability and Least Squares Estimators
GLM: $\mathbf y=\mathbf X\mathbf b+\mathbf e$
We assume $\mathbb E\mathbf e=0$, so $\mathbb E\mathbf y=\mathbf X\mathbf b$.
Fix $\pmb\lambda\in\mathbb R^p$
Def 3.1: An estimator $t(\mathbf y)$ is unbiased for $\pmb\lambda^T\mathbf b$ if $\mathbb E t(\mathbf y)=\pmb\lambda^T\mathbf b$, $\forall \mathbf b\in\mathbb R^p$.
Def 3.2: An estimator $t(\mathbf y)$ is a linear estimator if $t(\mathbf y)=c+\mathbf a^T\mathbf y$ where $c\in\mathbb R$ and $\mathbf a\in\mathbb R^N$ are known constants.
Def 3.3: A function $\pmb\lambda^T\mathbf b$ is (linearly) estimable if $\exist$ a linear unbiased estimator of it; otherwise $\pmb\lambda^T\mathbf b$ is called non-estimable.
Result 3.1: $\pmb\lambda^T\mathbf b$ is estimable iff $\exist \mathbf a\in\mathbb R^N$ such that $\pmb\lambda=\mathbf X^T\mathbf a$, or, in other words, $\pmb\lambda\in\mathcal C(\mathbf X^T)$.
proof: ($\Leftarrow$) Assume there exists $\mathbf a\in\mathbb R^N$ such that $\mathbf X^T\mathbf a=\pmb\lambda$. Then $\mathbb E(\mathbf a^T\mathbf y)=\mathbf a^T\mathbb E(\mathbf y)=\mathbf a^T\mathbf X\mathbf b=\pmb\lambda^T\mathbf b$.
($\Rightarrow$) Assume that $\pmb\lambda^T\mathbf b$ is estimable. Then $\exist c,\mathbf a$ such that $\mathbb E(c+\mathbf a^T\mathbf y)=\pmb\lambda^T\mathbf b$, $\forall \mathbf b\in\mathbb R^p$. Then, $c+\mathbf a^T\mathbf X\mathbf b-\pmb\lambda^T\mathbf b=0$, $\forall \mathbf b\in\mathbb R^p\Rightarrow c+(\mathbf a^T\mathbf X-\pmb\lambda^T)\mathbf b=0$, $\forall \mathbf b\in\mathbb R^p$. Using Result A.8, we have $c=0$ and $\mathbf a^T\mathbf X-\pmb\lambda^T=\mathbf 0$ or $\pmb\lambda=\mathbf X^T\mathbf a$.
Example 3.1: One-way ANOVA with $a=n=2$. $y_{ij}=\mu+\alpha_i+e_{ij}$, $i=1,2$, $j=1,2$. $\mathbf X\mathbf b=\begin{pmatrix}1&1&0\\1&1&0\\1&0&1\\1&0&1\end{pmatrix}\begin{pmatrix}\mu\\\alpha_1\\\alpha_2\end{pmatrix}$.
Q: Is $\alpha_1$ estimable?
A: First, $\alpha_1=\pmb\lambda^T\mathbf b=(0,1,0)(\mu,\alpha_1,\alpha_2)^T$. Is $(0,1,0)^T\in\mathcal C(\mathbf X^T)$? $$ \begin{pmatrix} 1&1&1&1\\1&1&0&0\\0&0&1&1 \end{pmatrix} \begin{pmatrix} a_1\\a_2\\a_3\\a_4 \end{pmatrix}=\begin{pmatrix}0\\1\\0\end{pmatrix} $$
$$ \Rightarrow \sum_{i=1}^4a_i=0,\quad a_1+a_2=1,\quad a_3+a_4=0 $$
This is impossible. So $(0,1,0)^T\not\in\mathcal C(\mathbf X^T)\Rightarrow\alpha_1$ is not estimable.
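The same conclusion can be checked numerically: $\pmb\lambda\in\mathcal C(\mathbf X^T)$ exactly when appending $\pmb\lambda$ as an extra column does not increase the rank of $\mathbf X^T$ (a sketch assuming NumPy; the helper `is_estimable` is made up for illustration):

```python
import numpy as np

# One-way ANOVA with a = n = 2, as in Example 3.1.
X = np.array([[1., 1., 0.],
              [1., 1., 0.],
              [1., 0., 1.],
              [1., 0., 1.]])

def is_estimable(lam, X):
    """lambda^T b is estimable iff lambda lies in C(X^T)."""
    Xt = X.T
    return np.linalg.matrix_rank(np.column_stack([Xt, lam])) == np.linalg.matrix_rank(Xt)

print(is_estimable(np.array([0., 1., 0.]), X))    # alpha_1:             False
print(is_estimable(np.array([1., 1., 0.]), X))    # mu + alpha_1:        True
print(is_estimable(np.array([0., 1., -1.]), X))   # alpha_1 - alpha_2:   True
```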
Result 3.1a: $\pmb\lambda^T\mathbf b$ is estimable if and only if $\pmb\lambda^T\mathbf b$ is a linear combination of the expected values of the $ y_i$’s.
proof: ($\Rightarrow$) Assume that $\pmb\lambda^T\mathbf b$ is estimable. Then $\exist$ $\mathbf a\in\mathbb R^N$ such that $\pmb\lambda=\mathbf X^T\mathbf a$. Then $\pmb\lambda^T\mathbf b=\mathbf a^T\mathbf X\mathbf b=\mathbf a^T\mathbb E(\mathbf y)$.
($\Leftarrow$) Assume that $\pmb\lambda^T\mathbf b$ is a linear combination of the elements of $\mathbf X\mathbf b$. Then, for all $\mathbf b\in\mathbb R^p$, $\pmb\lambda^T\mathbf b=\mathbf a^T\mathbf X\mathbf b\Rightarrow\pmb\lambda^T=\mathbf a^T\mathbf X$ or $\pmb\lambda=\mathbf X^T\mathbf a$, i.e., $\pmb\lambda\in\mathcal C(\mathbf X^T)$. So $\pmb\lambda^T\mathbf b$ is estimable.
*If we take $\mathbf a^T=(\mathbf e^{(i)})^T$, where $\mathbf e^{(i)}$ is the $i$-th standard basis vector, then $\mathbf a^T\mathbb E\mathbf y=\mathbb E y_i$. So the expected value of any $y_i$ is estimable.
Suppose $\pmb\lambda^{(j)}\in\mathbb R^p$, $j=1,2,\cdots,k$. If $\pmb\lambda^{(j)T}\mathbf b$ are all estimable, so is $\sum_{j=1}^kd_j\pmb\lambda^{(j)T}\mathbf b=\left[\sum_{j=1}^kd_j\pmb\lambda^{(j)}\right]^T\mathbf b$ for any $\{d_j\}_{j=1}^k$.
Def: A set of estimable functions $\{\pmb\lambda^{(j)T}\mathbf b\}_{j=1}^k$ is called linearly independent if the $\pmb\lambda^{(j)}$’s are linearly independent.
Let $r=rank(\mathbf X^T)\leq p$, so that $\dim(\mathcal C(\mathbf X^T))=rank(\mathbf X^T)=r$. Thus, any set of L.I. estimable functions cannot have more than $r$ members.
Suppose that $\mathbf X$ has full column rank, $r=p$. Then $\mathcal C(\mathbf X^T)=\mathbb R^p$ and $\pmb\lambda^T\mathbf b$ is estimable for any $\pmb\lambda\in\mathbb R^p$.
Q: How do we establish that $\pmb\lambda^T\mathbf b$ is estimable?
A: We have three methods.
Method 3.1: Show that $\pmb\lambda^T\mathbf b$ can be written as a linear combination of the elements of $\mathbf X\mathbf b=\mathbb E(\mathbf y)$.
Method 3.2: Construct a set of basis vectors for $\mathcal C(\mathbf X^T)$, call them $\mathbf v^{(1)},\cdots,\mathbf v^{(r)}$ and show that $\pmb\lambda=\sum_{j=1}^rd_j\mathbf v^{(j)}$.
Method 3.3: Construct a set of basis vectors for $\mathcal N(\mathbf X)$, call them $\mathbf c^{(1)},\cdots,\mathbf c^{(p-r)}$. If $\pmb\lambda\perp\mathbf c^{(j)}$ for $j=1,2,\cdots,p-r$, then $\pmb\lambda\in\mathcal C(\mathbf X^T)$.
Note that $\mathbf X^T\mathbf X(\mathbf X^T\mathbf X)^g$ is a projection onto $\mathcal C(\mathbf X^T\mathbf X)=\mathcal C(\mathbf X^T)$. So if $(\mathbf X^T\mathbf X)^g$ has already been calculated and $\mathbf X^T\mathbf X(\mathbf X^T\mathbf X)^g\pmb\lambda=\pmb\lambda$, then $\pmb\lambda\in\mathcal C(\mathbf X^T)$, i.e., $\pmb\lambda^T\mathbf b$ is estimable.
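A sketch of this projection-based check (NumPy assumed; the a = n = 2 design of Example 3.1 is reused purely for illustration):

```python
import numpy as np

X = np.array([[1., 1., 0.],
              [1., 1., 0.],
              [1., 0., 1.],
              [1., 0., 1.]])
XtX = X.T @ X
P = XtX @ np.linalg.pinv(XtX)                   # projection onto C(X^T X) = C(X^T)

for lam, name in [(np.array([0., 1., 0.]), "alpha_1"),
                  (np.array([0., 1., -1.]), "alpha_1 - alpha_2")]:
    print(name, "estimable:", np.allclose(P @ lam, lam))
```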
Examples:
One-way ANOVA: $y_{ij}=\mu+\alpha_i+e_{ij}$, $i=1,\cdots,a$, $j=1,\cdots,n_i$. $\mathbf y=(y_{11},\cdots,y_{1n_1},y_{21},\cdots,y_{2n_2},\cdots,y_{a1},\cdots, y_{an_a})^T$, $\mathbf X\mathbf b=\begin{pmatrix}\mathbf 1_{n_1}&\mathbf 1_{n_1}&\mathbf0&\cdots&\mathbf 0\\\mathbf 1_{n_2}&\mathbf 0&\mathbf 1_{n_2}&\cdots&\mathbf 0\\\vdots&\vdots&\vdots&\ddots&\vdots\\\mathbf 1_{n_a}&\mathbf0&\mathbf0&\cdots&\mathbf1_{n_a}\end{pmatrix}\begin{pmatrix}\mu\\\alpha_1\\\vdots\\\alpha_a\end{pmatrix}$, $p=a+1$, $N=\sum_{i=1}^an_i$, $rank(\mathbf X)=r=a<p$.
$\dim(\mathcal C(\mathbf X^T))+\dim(\mathcal N(\mathbf X))=a+1\Rightarrow\dim(\mathcal N(\mathbf X))=1$. To find a basis for $\mathcal N(\mathbf X)$, all we need is to find a single vector, $\mathbf c$, such that $\mathbf X\mathbf c=\mathbf 0$. Note that $\mathbf c=\begin{pmatrix}1\\-\mathbf 1_a\end{pmatrix}$ is a basis for $\mathcal N(\mathbf X)$. Now let’s use this basis to figure out which functions $\pmb\lambda^T\mathbf b=\lambda_0\mu+\sum_{i=1}^a\lambda_i\alpha_i$ are estimable. According to Method 3.3, $\pmb\lambda^T\mathbf b$ is estimable iff $\pmb\lambda^T\mathbf c=0$. Hence, $\mu+\alpha_i$ and $\alpha_i-\alpha_k(i\not=k)$ are estimable, but $\alpha_i$ is not estimable. Note that $\sum_{i=1}^ad_i\alpha_i$ is estimable iff $\sum_{i=1}^ad_i=0$; when $\sum_{i=1}^ad_i=0$, $\sum_{i=1}^ad_i\alpha_i$ is called a contrast.
Def 3.4: The least squares (L.S.) estimator of the estimable function $\pmb\lambda^T\mathbf b$ is $\pmb\lambda^T\hat{\mathbf b}$, where $\hat{\mathbf b}$ is any solution to the N.E.s.
Result 3.2: $\pmb\lambda^T\hat{\mathbf b}$ is invariant to the choice of a solution to the N.E.s.
proof: Suppose that $\hat{\mathbf b}_1$ and $\hat{\mathbf b}_2$ are both solutions to the N.E.s. Then $\mathbf X^T\mathbf X(\hat{\mathbf b}_1-\hat{\mathbf b}_2)=\mathbf 0$. In other words, $\hat{\mathbf b}_1-\hat{\mathbf b}_2\in\mathcal N(\mathbf X^T\mathbf X)=\mathcal N(\mathbf X)$. But $\pmb\lambda\in\mathcal C(\mathbf X^T)\perp\mathcal N(\mathbf X)$. So $\pmb\lambda\perp(\hat{\mathbf b}_1-\hat{\mathbf b}_2)$, or $\pmb\lambda^T(\hat{\mathbf b}_1-\hat{\mathbf b}_2)=0$, or $\pmb\lambda^T\hat{\mathbf b}_1=\pmb\lambda^T\hat{\mathbf b}_2$.
Result 3.3: The L.S. estimator $\pmb\lambda^T\hat{\mathbf b}$ of an estimable function, $\pmb\lambda^T\mathbf b$, is a linear unbiased estimator of $\pmb\lambda^T\mathbf b$.
proof: Since $\pmb\lambda^T\mathbf b$ is estimable, $\pmb\lambda=\mathbf X^T\mathbf a$ for some $\mathbf a$, so $\pmb\lambda^T\hat{\mathbf b}=\mathbf a^T\mathbf X(\mathbf X^T\mathbf X)^g\mathbf X^T\mathbf y=\mathbf a^T\mathbf P_\mathbf X\mathbf y$, which is linear in $\mathbf y$.
$\mathbb E(\pmb\lambda^T\hat{\mathbf b})=\mathbb E(\mathbf a^T\mathbf P_\mathbf X\mathbf y)=\mathbf a^T\mathbf P_\mathbf X\mathbb E(\mathbf y)=\mathbf a^T\mathbf P_\mathbf X\mathbf X\mathbf b=\mathbf a^T\mathbf X\mathbf b=\pmb\lambda^T\mathbf b$.
Example 1: Back to one-way ANOVA example: $y_{ij}=\mu+\alpha_i+e_{ij}$, $i=1,\cdots,a$, $j=1,\cdots,n_i$.
Let’s look at the least squares estimators of some estimable functions. Start with a solution to the N.E.s. $$ \mathbf X^T\mathbf X=\begin{pmatrix}N&n_1&n_2&\cdots&n_a\\n_1&n_1&0&\cdots& 0\\n_2&0&n_2&\cdots&0\\\vdots&\vdots&\vdots&\ddots&\vdots\\n_a&0&0&\cdots&n_a\end{pmatrix},\quad \mathbf X^T\mathbf y= \begin{pmatrix}N\bar{y}_{\cdot\cdot}\\n_1\bar{y}_{1\cdot}\\n_2\bar{y}_{2\cdot}\\\vdots\\n_a\bar{y}_{a\cdot}\end{pmatrix} $$ A simple G.I. of $\mathbf X^T\mathbf X$ is: $$ \begin{pmatrix} 0&0&0&\cdots&0\\ 0&1/n_1&0&\cdots&0\\ 0&0&1/n_2&\cdots&0\\ \vdots&\vdots&\vdots&\ddots&\vdots\\ 0&0&0&\cdots&1/n_a \end{pmatrix} $$ This leads to the following general solution to the N.E.s: $$ \hat{\mathbf b}=\begin{pmatrix}0\\\bar{y}_{1\cdot}\\\bar{y}_{2\cdot}\\\vdots\\\bar{y}_{a\cdot}\end{pmatrix}+z\begin{pmatrix}-1\\1\\1\\\vdots\\1\end{pmatrix}\quad\text{ for any }z\in\mathbb R. $$ L.S. estimator of $\pmb\lambda^T\mathbf b$ is $\pmb\lambda^T\hat{\mathbf b}$: $$ \begin{aligned} &\mu+\alpha_i:\quad\pmb\lambda^T\hat{\mathbf b}=(0-z)+(\bar{y}_{i\cdot}+z)=\bar{y}_{i\cdot}\\ &\alpha_i-\alpha_k:\quad\pmb\lambda^T\hat{\mathbf b}=(\bar{y}_{i\cdot}+z)-(\bar{y}_{k\cdot}+z)=\bar{y}_{i\cdot}-\bar{y}_{k\cdot}\\ &\sum_{i=1}^ad_i\alpha_i:\quad\pmb\lambda^T\hat{\mathbf b}=\sum_{i=1}^ad_i(\bar{y}_{i\cdot}+z)=\sum_{i=1}^ad_i\bar{y}_{i\cdot}+z\sum_{i=1}^ad_i=\sum_{i=1}^ad_i\bar{y}_{i\cdot} \end{aligned} $$ Example 2: Two-way crossed model without interaction: $y_{ij}=\mu+\alpha_i+\beta_j+e_{ij}$, $i=1,\cdots,a$, $j=1,\cdots,b$. $N=ab$ and $p=a+b+1$. $$ \mathbf y=(y_{11},\cdots,y_{1b},y_{21},\cdots,y_{2b},\cdots,y_{a1},\cdots,y_{ab})^T $$
$$ \mathbf X\mathbf b=\begin{pmatrix}\mathbf 1_b&\mathbf 1_b&\mathbf 0&\cdots&\mathbf 0&\mathbf I_b\\\mathbf 1_b&\mathbf 0&\mathbf 1_b&\cdots&\mathbf 0&\mathbf I_b\\\vdots&\vdots&\vdots&\ddots&\vdots&\vdots\\\mathbf 1_b&\mathbf 0&\mathbf 0&\cdots&\mathbf 1_b&\mathbf I_b \end{pmatrix}\begin{pmatrix}\mu\\\alpha_1\\\vdots\\\alpha_a\\\beta_1\\\vdots\\\beta_b\end{pmatrix} $$
$rank(\mathbf X)=a+b-1$. Reason: The first column is the sum of columns $2$ through $a+1$ (and also the sum of columns $a+2$ through $a+b+1$). The second column is the sum of columns $a+2$ through $a+b+1$ minus the sum of columns $3$ through $a+1$. Remaining $a+b-1$ columns are linearly independent.
Now, $dim(\mathcal N(\mathbf X))=2$. Here is a basis for $\mathcal N(\mathbf X)$: $$ \mathbf c^{(1)}=\begin{pmatrix}1\\-\mathbf 1_a\\\mathbf 0_b\end{pmatrix},\quad \mathbf c^{(2)}=\begin{pmatrix}1\\\mathbf 0_a\\-\mathbf 1_b\end{pmatrix} $$ Now let’s write $\pmb\lambda^T\mathbf b=\lambda_0\mu+\sum_{i=1}^a\lambda_i\alpha_i+\sum_{j=1}^b\lambda_{a+j}\beta_j$.
Invoking Method 3.3: $\pmb\lambda^T\mathbf b$ is estimable iff $\lambda_0-\sum_{i=1}^a\lambda_i=0$ and $\lambda_0-\sum_{j=1}^b\lambda_{a+j}=0$.
Common estimable functions:
- $\mu+\alpha_i+\beta_j$
- $\alpha_i-\alpha_k$
- $\beta_j-\beta_k$
- $\sum_{i=1}^ad_i\alpha_i$ (with $\sum_{i=1}^ad_i=0$)
- $\sum_{j=1}^bf_j\beta_j$ (with $\sum_{j=1}^bf_j=0$)
Suppose that $a=3$ and $b=4$. $rank(\mathbf X)=6$. Here is a set of linearly independent estimable functions: $$ \{\mu+\alpha_1+\beta_1,\alpha_2-\alpha_1,\alpha_3-\alpha_1,\beta_2-\beta_1,\beta_3-\beta_1,\beta_4-\beta_1\} $$ Now, consider what happens when there is some missing data.
Eg 1:
|   | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| 1 | X | X | X |   |
| 2 | X |   | X |   |
| 3 |   |   | X | X |

*columns index the levels of B ($j$), rows the levels of A ($i$)
So we do not observe $y_{14},y_{22},y_{24},y_{31},y_{32}$. $$ \mathbf X\mathbf b=\begin{pmatrix}1&1&0&0&1&0&0&0\\ 1&1&0&0&0&1&0&0\\ 1&1&0&0&0&0&1&0\\ 1&0&1&0&1&0&0&0\\ 1&0&1&0&0&0&1&0\\ 1&0&0&1&0&0&1&0\\ 1&0&0&1&0&0&0&1\end{pmatrix} \begin{pmatrix} \mu\\\alpha_1\\\alpha_2\\\alpha_3\\\beta_1\\\beta_2\\\beta_3\\\beta_4 \end{pmatrix} $$ $rank(\mathbf X)=6$. And $\mathcal C(\mathbf X^T)$ remains the same, so the same set of $6$ linearly independent estimable functions is still valid.
Eg 2:
|   | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| 1 | X | X | X |   |
| 2 | X | X | X |   |
| 3 |   |   |   | X |

*columns: levels of B; rows: levels of A; X marks an observed cell
In this case, $rank(\mathbf X)=5$. Basis for $\mathcal N(\mathbf X)$: $$ \begin{pmatrix} 1\\-1\\-1\\-1\\0\\0\\0\\0 \end{pmatrix}, \quad \begin{pmatrix} 1\\0\\0\\0\\-1\\-1\\-1\\-1 \end{pmatrix}, \text{ and } \begin{pmatrix} 0\\0\\0\\1\\0\\0\\0\\-1 \end{pmatrix} $$ Here is a set of $5$ linearly independent estimable functions: $$ \{\mu+\alpha_1+\beta_1,\alpha_2-\alpha_1,\beta_2-\beta_1,\beta_3-\beta_1,\mu+\alpha_3+\beta_4\}. $$ Note that $\alpha_3-\alpha_1$ and $\beta_4-\beta_1$ are no longer estimable in this case. There is a simple reason for this: $$ \mathbb Ey_{34}=\mu+\alpha_3+\beta_4 $$ This is the only component of $\mathbf X\mathbf b$ that involves either $\alpha_3$ or $\beta_4$. We cannot differentiate between $\alpha_3$ and $\beta_4$.
Back to Eg 1:
$\mathbb E(y_{33})-\mathbb E(y_{13})=\mu+\alpha_3+\beta_3-(\mu+\alpha_1+\beta_3)=\alpha_3-\alpha_1$.
$\mathbb E(y_{34})-\mathbb E(y_{33})+\mathbb E(y_{13})-\mathbb E(y_{11})=(\mu+\alpha_3+\beta_4)-(\mu+\alpha_3+\beta_3)+(\mu+\alpha_1+\beta_3)-(\mu+\alpha_1+\beta_1)=\beta_4-\beta_1$.
The missing-data pattern is connected in Eg 1, but not in Eg 2.
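A small numerical check of the two missing-data patterns (a sketch under the assumption that numpy is available; the helper `design` and the cell lists are illustrative, encoding the two tables above):

```python
import numpy as np

def design(cells, a=3, b=4):
    """Two-way additive design matrix for the observed (i, j) cells (1-indexed)."""
    X = np.zeros((len(cells), 1 + a + b))
    for r, (i, j) in enumerate(cells):
        X[r, 0] = 1        # mu
        X[r, i] = 1        # alpha_i
        X[r, a + j] = 1    # beta_j
    return X

eg1 = [(1, 1), (1, 2), (1, 3), (2, 1), (2, 3), (3, 3), (3, 4)]   # connected pattern
eg2 = [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3), (3, 4)]   # disconnected pattern
lam = np.array([0, -1, 0, 1, 0, 0, 0, 0])                        # alpha_3 - alpha_1

for cells in (eg1, eg2):
    X = design(cells)
    _, s, Vt = np.linalg.svd(X)
    null_basis = Vt[np.sum(s > 1e-10):]                          # basis for N(X)
    estimable = np.all(np.abs(null_basis @ lam) < 1e-10)
    print(np.linalg.matrix_rank(X), bool(estimable))
# Eg 1: rank 6, alpha_3 - alpha_1 estimable; Eg 2: rank 5, not estimable
```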
3.7 Reparameterization Revisited
- $\mathbf y=\mathbf X\mathbf b+\mathbf e$ ($\mathbf X_{N\times p}$)
- $\mathbf y=\mathbf W\mathbf c+\mathbf e$ ($\mathbf W_{N\times t}$)
Assume the models are equivalent, so $\mathcal C(\mathbf X)=\mathcal C(\mathbf W)$. Then $\exist$ $\mathbf T$, $\mathbf S$, such that $\mathbf W=\mathbf X\mathbf T$ and $\mathbf X=\mathbf W\mathbf S$.
Result 3.4: If $\pmb\lambda^T\mathbf b$ is estimable in the $\mathbf X$ model, and $\hat{\mathbf c}$ solves the N.E.s $\mathbf W^T\mathbf W\mathbf c=\mathbf W^T\mathbf y$, then $\pmb\lambda^T\mathbf T\hat{\mathbf c}$ is the L.S. estimator of $\pmb\lambda^T\mathbf b$.
proof: If $\mathbf W^T\mathbf W\hat{\mathbf c}=\mathbf W^T\mathbf y$, then by Result 2.9, we know that $\mathbf X^T\mathbf X\mathbf T\hat{\mathbf c}=\mathbf X^T\mathbf y$, so the L.S. estimator of $\pmb\lambda^T\mathbf b$ is $\pmb\lambda^T\mathbf T\hat{\mathbf c}$.
Result 3.5: If $\mathbf q^T\mathbf c$ is estimable to the $\mathbf W$ model (so $\mathbf q\in\mathcal C(\mathbf W^T)$), then $\mathbf q^T\mathbf S\mathbf b$ is estimable in the $\mathbf X$ model, and the L.S. estimator is $\mathbf q^T\hat{\mathbf c}$ where $\mathbf W^T\mathbf W\hat{\mathbf c}=\mathbf W^T\mathbf y$.
proof: $\mathbf q\in\mathcal C(\mathbf W^T)\Rightarrow \mathbf q=\mathbf W^T\mathbf a$ for some $\mathbf a$. So $\mathbf S^T\mathbf q=\mathbf S^T\mathbf W^T\mathbf a=\mathbf X^T\mathbf a\in\mathcal C(\mathbf X)$. So $\mathbf q^T\mathbf S\mathbf b$ is estimable in the $\mathbf X$ model. Now, the estimator is $$ \mathbf q^T\mathbf S\hat{\mathbf b}=\mathbf q^T\mathbf S\mathbf T\hat{\mathbf c}=\mathbf a^T\mathbf W\mathbf S\mathbf T\hat{\mathbf c}=\mathbf a^T\mathbf X\mathbf T\hat{\mathbf c}=\mathbf a^T\mathbf W\hat{\mathbf c}=\mathbf q^T\hat{\mathbf c}. $$
*Note: The converse of Result 3.5 does not hold. That is, $\mathbf S^T\mathbf q\in\mathcal C(\mathbf X^T)\not\Rightarrow\mathbf q\in\mathcal C(\mathbf W^T)$. Indeed, $\mathbf S^T\mathbf q=\mathbf X^T\mathbf a=\mathbf S^T\mathbf W^T\mathbf a$. But this implies that $\mathbf q=\mathbf W^T\mathbf a$ only if $\mathbf S^T$ has full column rank (see Corollary A.2).
The most common goal of reparameterization is to find a full-rank version of an overparameterized model. A full-rank version always exists: simply delete columns of the design matrix that are linear combinations of the other columns until no such columns remain.
Recall that $\mathbf X_{N\times p}=\mathbf W_{N\times t}\mathbf S_{t\times p}$. If $\mathbf W$ has full column rank:
- $\mathbf W^T\mathbf W$ is non-singular, and there is a unique solution to the $\mathbf W$ N.E.s
- The matrix $\mathbf S$ is unique, so $\mathbf X\mathbf b=\mathbf W\mathbf c\Rightarrow\mathbf W\mathbf S\mathbf b=\mathbf W\mathbf c\Rightarrow\mathbf S\mathbf b=\mathbf c$. So, the parameter $\mathbf c$ has a unique representation in terms of the elements of $\mathbf b$.
Examples: Let’s look at several different full-rank versions of the one-way ANOVA model.
Standard model: $y_{ij}=\mu+\alpha_i+e_{ij}$, $i=1,\cdots,a$, $j=1,\cdots,n_i$. $$ \mathbf X\mathbf b=\begin{pmatrix}\mathbf 1_{n_1}&\mathbf 1_{n_1}&\mathbf0&\cdots&\mathbf 0\\\mathbf 1_{n_2}&\mathbf 0&\mathbf 1_{n_2}&\cdots&\mathbf 0\\\vdots&\vdots&\vdots&\ddots&\vdots\\\mathbf 1_{n_a}&\mathbf0&\mathbf0&\cdots&\mathbf1_{n_a}\end{pmatrix}\begin{pmatrix}\mu\\\alpha_1\\\vdots\\\alpha_a\end{pmatrix} $$
Cell means model: $y_{ij}=\mu_i+e_{ij}$ $$ \mathbf Z\pmb\mu=\begin{pmatrix}\mathbf 1_{n_1}&\mathbf0&\cdots&\mathbf 0\\\mathbf 0&\mathbf 1_{n_2}&\cdots&\mathbf 0\\\vdots&\vdots&\ddots&\vdots\\\mathbf0&\mathbf0&\cdots&\mathbf1_{n_a}\end{pmatrix}\begin{pmatrix}\mu_1\\\mu_2\\\vdots\\\mu_a\end{pmatrix} $$ $\mathbf X_{N\times(a+1)}=\mathbf Z_{N\times a}\mathbf S_{a\times(a+1)}$ where $\mathbf S=[\mathbf 1_a\;\mathbf I_a]$.
Let $\pmb\mu^\star=(\mu_1,\mu_2,\cdots,\mu_a)^T$. Then $\pmb\mu^\star=\mathbf S\mathbf b=\mathbf S(\mu,\alpha_1,\cdots,\alpha_a)^T=(\mu+\alpha_1,\mu+\alpha_2,\cdots,\mu+\alpha_a)^T$.
Cell reference model: $y_{ij}=c_1+c_2\mathbf I_{\{1\}}(i)+c_3\mathbf I_{\{2\}}(i)+\cdots+c_a\mathbf I_{\{a-1\}}(i)+e_{ij}$ $$ \mathbf W\mathbf c=\begin{pmatrix}\mathbf 1_{n_1}&\mathbf 1_{n_1}&\mathbf0&\cdots&\mathbf 0\\\mathbf 1_{n_2}&\mathbf 0&\mathbf 1_{n_2}&\cdots&\mathbf 0\\\vdots&\vdots&\vdots&\ddots&\vdots\\\mathbf 1_{n_{a-1}}&\mathbf0&\mathbf0&\cdots&\mathbf1_{n_{a-1}}\\\mathbf 1_{n_a}&\mathbf 0&\mathbf 0&\cdots&\mathbf0\end{pmatrix}\begin{pmatrix}c_1\\c_2\\\vdots\\c_a\end{pmatrix} $$ $\mathbf X_{N\times(a+1)}=\mathbf W_{N\times a}\mathbf S_{a\times(a+1)}$ where $\mathbf S=[\mathbf I_a;(1,-1,-1,\cdots,-1)^T]$.
Let $\mathbf c=(c_1,c_2,\cdots,c_a)^T$. Then $\mathbf c=\mathbf S\mathbf b=[\mathbf I_a;(1,-1,-1,\cdots,-1)^T]\begin{pmatrix}\mu\\\alpha_1\\\vdots\\\alpha_a\end{pmatrix}=\begin{pmatrix}\mu+\alpha_a\\\alpha_1-\alpha_a\\\vdots\\\alpha_{a-1}-\alpha_a\end{pmatrix}$.
Now, let’s look at the relationship between $\mathbf c$ and $\pmb\mu^*$: $$ \begin{pmatrix}\mu_1\\\mu_2\\\vdots\\\mu_{a-1}\\\mu_a\end{pmatrix}=\begin{pmatrix}c_1+c_2\\c_1+c_3\\\vdots\\c_1+c_a\\c_1\end{pmatrix}\Rightarrow \begin{cases}c_1=\mu_a\\c_2=\mu_1-\mu_a\\c_3=\mu_2-\mu_a\\\vdots\\c_a=\mu_{a-1}-\mu_a\end{cases} $$
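Here is a minimal numerical sketch (not from the notes; numpy assumed, and the group sizes and the vector `b` are illustrative) verifying $\mathbf X=\mathbf Z\mathbf S$ for the cell means model, $\mathbf X=\mathbf W\mathbf S$ for the cell reference model, and the corresponding parameter maps $\mathbf S\mathbf b$:

```python
import numpy as np

a, n = 3, (2, 3, 2)                 # illustrative: 3 groups with n_i observations each
N = sum(n)
Z = np.zeros((N, a))                # cell-means design
row = 0
for i, ni in enumerate(n):
    Z[row:row + ni, i] = 1
    row += ni
X = np.hstack([np.ones((N, 1)), Z])           # overparameterized design [1 | Z]
W = np.hstack([np.ones((N, 1)), Z[:, :-1]])   # cell-reference design (last group dropped)

S_cell_means = np.hstack([np.ones((a, 1)), np.eye(a)])                  # X = Z S
S_cell_ref = np.hstack([np.eye(a), np.array([[1.0], [-1.0], [-1.0]])])  # X = W S

print(np.allclose(X, Z @ S_cell_means))   # True
print(np.allclose(X, W @ S_cell_ref))     # True

b = np.array([1.0, 0.5, -0.2, 0.7])       # (mu, alpha_1, alpha_2, alpha_3), illustrative
print(S_cell_means @ b)                   # (mu + alpha_1, mu + alpha_2, mu + alpha_3)
print(S_cell_ref @ b)                     # (mu + alpha_3, alpha_1 - alpha_3, alpha_2 - alpha_3)
```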
3.8 Imposing Conditions on $\mathbf b$ to Get a Unique Solution to the Normal Equations
Example: One-way ANOVA: $y_{ij}=\mu+\alpha_i+e_{ij}$
N.E.s: $N\mu+\sum_{i=1}^an_i\alpha_i=N\bar{y}_{\cdot\cdot}$, $\mu+\alpha_i=\bar{y}_{i\cdot}$, $i=1,2,\cdots,a$
Here are some standard conditions on $\mathbf b=(\mu,\alpha_1,\cdots,\alpha_a)^T$ that are imposed to get a unique solution:
- $\alpha_a=0\Rightarrow\hat{\mu}=\bar{y}_{a\cdot},\hat{\alpha}_i=\bar{y}_{i\cdot}-\bar{y}_{a\cdot}$, $i=1,2,\cdots,a-1$, $\hat{\alpha}_a=0$
- $\sum_{i=1}^a\alpha_i=0\Rightarrow\hat \mu=\frac{1}{a}\sum_{i=1}^a\bar{y}_{i\cdot}$, $\hat\alpha_i=\bar{y}_{i\cdot}-\hat\mu$, $i=1,\cdots,a$
- $\sum_{i=1}^an_i\alpha_i=0\Rightarrow\hat{\mu}=\bar{y}_{\cdot\cdot}$, $\hat{\alpha}_i=\bar{y}_{i\cdot}-\bar{y}_{\cdot\cdot}$, $i=1,\cdots,a$
Q: Can we always impose a set of conditions on $\mathbf b$ to obtain a unique solution to the N.E.s?
A: Yes.
Define $rank(\mathbf X)=r\leq p$.
The conditions on $\mathbf b$ will be specified as follows: $$ \mathbf C\mathbf b=\mathbf 0 $$ where $\mathbf C\in\mathbb R^{(p-r)\times p}$, $\mathbf b\in\mathbb R^{p\times 1}$, $\mathbf 0\in\mathbb R^{(p-r)\times 1}$, and $rank(\mathbf C)=p-r$.
Adding these conditions to the N.E.s yields $$ \begin{pmatrix}\mathbf X^T\mathbf X\\\mathbf C\end{pmatrix}\mathbf b=\begin{pmatrix}\mathbf X^T\mathbf y\\\mathbf 0\end{pmatrix}\quad\quad\quad(1) $$ or equivalently, $$ \begin{pmatrix}\mathbf X\\\mathbf C\end{pmatrix}\mathbf b=\begin{pmatrix}\mathbf P_\mathbf X\mathbf y\\\mathbf 0\end{pmatrix}\quad\quad\quad (2) $$ ($\hat{\mathbf b}$ solves N.E.s $\Leftrightarrow$ $\hat{\mathbf b}$ solves $\mathbf X\mathbf b=\mathbf P_\mathbf X\mathbf y$)
Equations (2) will have a unique solution if $rank(\begin{pmatrix}\mathbf X\\\mathbf C\end{pmatrix})=p$. That is, if $\mathcal C(\begin{pmatrix}\mathbf X^T&\mathbf C^T\end{pmatrix})=\mathbb R^p$. Thus, choosing $\mathbf C$ is equivalent to adding L.I. rows to $\mathbf X$ until its rank is $p$. Equivalently, we can add L.I. columns to $\mathbf X^T$ until it has rank $p$. The columns of $\mathbf C^T$ must be L.I. of one another, and of the columns of $\mathbf X^T$. Thus, the columns of $\mathbf C^T$ CANNOT be in $\mathcal C(\mathbf X^T)$. In other words, the columns of $\mathbf C^T$ must correspond to non-estimable functions.
Suppose $\mathbf C$ satisfies $$ rank(\mathbf C)=p-r\quad\quad\quad(3)\\ \mathcal C(\mathbf X^T)\cap\mathcal C(\mathbf C^T)=\{0\}\quad\quad\quad(4) $$ Together, (3) and (4) imply $$ rank(\begin{pmatrix}\mathbf X\\\mathbf C\end{pmatrix})=p $$ Example: $y_{ij}=\mu+\alpha_i+e_{ij}$, $i=1,2,3$
$p=4$, $r=3$. A basis for $\mathcal C(\mathbf X^T)$ is given by $$ \begin{pmatrix}1\\1\\0\\0\end{pmatrix},\quad\begin{pmatrix}1\\0\\1\\0\end{pmatrix},\quad\begin{pmatrix}1\\0\\0\\1\end{pmatrix} $$ Consider the restriction $\sum_{i=1}^3\alpha_i=0$. In the form $\mathbf C\mathbf b=\mathbf 0$, we have $$ \begin{pmatrix}0&1&1&1\end{pmatrix}\begin{pmatrix}\mu\\\alpha_1\\\alpha_2\\\alpha_3\end{pmatrix}=\mathbf 0 $$ So $\mathbf C^T=\begin{pmatrix}0\\1\\1\\1\end{pmatrix}$.
Q: Is $(0,1,1,1)^T$ linearly independent of the basis vectors for $\mathcal C(\mathbf X^T)$?
A: Yes, because $(0,1,1,1)^T=a_1(1,1,0,0)^T+a_2(1,0,1,0)^T+a_3(1,0,0,1)^T$ is impossible: the last three coordinates would force $a_1=a_2=a_3=1$, while the first coordinate requires $a_1+a_2+a_3=0$.
Lemma 3.1: If $\mathbf C$ satisfies (3) and (4), then (1) is equivalent to $$ \begin{pmatrix}\mathbf X^T\mathbf X\\\mathbf C^T\mathbf C\end{pmatrix}\mathbf b=\begin{pmatrix}\mathbf X^T\mathbf y\\\mathbf 0\end{pmatrix}\quad\quad\quad(6) $$ And, (6) is equivalent to $$ (\mathbf X^T\mathbf X+\mathbf C^T\mathbf C)\mathbf b=\mathbf X^T\mathbf y\quad\quad\quad(7) $$ proof: Recall from Lemma 2.1 that $\mathcal N(\mathbf C)=\mathcal N(\mathbf C^T\mathbf C)$. Thus $\mathbf C\mathbf b=\mathbf 0\Leftrightarrow\mathbf C^T\mathbf C\mathbf b=\mathbf 0$.
Now, if (6) holds, then $$ \mathbf X^T\mathbf X\mathbf b=\mathbf X^T\mathbf y\quad\text{ and }\quad\mathbf C^T\mathbf C\mathbf b=\mathbf 0. $$ (7) follows by just adding these equations. Now assume that (7) holds, and rewrite it as follows: $$ \mathbf C^T\mathbf C\mathbf b=\mathbf X^T\mathbf y-\mathbf X^T\mathbf X\mathbf b $$ The RHS is in $\mathcal C(\mathbf X^T)$, but the only vectors that $\mathcal C(\mathbf X^T)$ and $\mathcal C(\mathbf C^T)$ have in common is $\mathbf 0$. So $\mathbf C^T\mathbf C\mathbf b=\mathbf 0$ and $\mathbf X^T\mathbf X\mathbf b=\mathbf X^T\mathbf y$.
Result 3.6: Assume $\mathbf C$ satisfies (3) and (4). Then
- $\mathbf X^T\mathbf X+\mathbf C^T\mathbf C$ is non-singular
- $(\mathbf X^T\mathbf X+\mathbf C^T\mathbf C)^{-1}\mathbf X^T\mathbf y$ is the unique solution to (1)
- $(\mathbf X^T\mathbf X+\mathbf C^T\mathbf C)^{-1}$ is a G.I. of $\mathbf X^T\mathbf X$
- $\mathbf C(\mathbf X^T\mathbf X+\mathbf C^T\mathbf C)^{-1}\mathbf X^T=\mathbf 0$
- $\mathbf C(\mathbf X^T\mathbf X+\mathbf C^T\mathbf C)^{-1}\mathbf C^T=\mathbf I$
proof:
$\mathbb R^p=\mathcal C(\begin{pmatrix}\mathbf X^T&\mathbf C^T\end{pmatrix})=\mathcal C\left(\begin{pmatrix}\mathbf X^T&\mathbf C^T\end{pmatrix}\begin{pmatrix}\mathbf X\\\mathbf C\end{pmatrix}\right)=\mathcal C(\mathbf X^T\mathbf X+\mathbf C^T\mathbf C)$
So, $\mathbf X^T\mathbf X+\mathbf C^T\mathbf C$ is a $p\times p$ matrix with rank $p$. So it is non-singular.
The unique solution of $(\mathbf X^T\mathbf X+\mathbf C^T\mathbf C)\mathbf b=\mathbf X^T\mathbf y$ is $(\mathbf X^T\mathbf X+\mathbf C^T\mathbf C)^{-1}\mathbf X^T\mathbf y$. Now, the result follows from Lemma 3.1
From 2, we have $\mathbf X^T\mathbf X(\mathbf X^T\mathbf X+\mathbf C^T\mathbf C)^{-1}\mathbf X^T\mathbf y=\mathbf X^T\mathbf y$ for all $\mathbf y\in\mathbb R^N$ because of the first equation in (1). So $\mathbf X^T\mathbf X(\mathbf X^T\mathbf X+\mathbf C^T\mathbf C)^{-1}\mathbf X^T=\mathbf X^T$. Postmultiply by $\mathbf X$: $$ \mathbf X^T\mathbf X(\mathbf X^T\mathbf X+\mathbf C^T\mathbf C)^{-1}\mathbf X^T\mathbf X=\mathbf X^T\mathbf X\Rightarrow(\mathbf X^T\mathbf X+\mathbf C^T\mathbf C)^{-1}\text{ is a G.I. of }\mathbf X^T\mathbf X $$
Similarly, from 2, we have $\mathbf C(\mathbf X^T\mathbf X+\mathbf C^T\mathbf C)^{-1}\mathbf X^T\mathbf y=\mathbf 0$ for all $\mathbf y\in\mathbb R^N$ because of the second equation in (1), so $\mathbf C(\mathbf X^T\mathbf X+\mathbf C^T\mathbf C)^{-1}\mathbf X^T=\mathbf 0$. Finally, expanding $\mathbf C(\mathbf X^T\mathbf X+\mathbf C^T\mathbf C)^{-1}(\mathbf X^T\mathbf X+\mathbf C^T\mathbf C)=\mathbf C$ and using $\mathbf C(\mathbf X^T\mathbf X+\mathbf C^T\mathbf C)^{-1}\mathbf X^T\mathbf X=\mathbf 0$ gives $\mathbf C(\mathbf X^T\mathbf X+\mathbf C^T\mathbf C)^{-1}\mathbf C^T\mathbf C=\mathbf C$; postmultiplying by $\mathbf C^T$ and noting that $\mathbf C\mathbf C^T$ is nonsingular (since $rank(\mathbf C)=p-r$) yields $\mathbf C(\mathbf X^T\mathbf X+\mathbf C^T\mathbf C)^{-1}\mathbf C^T=\mathbf I$.
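A quick numerical check of Result 3.6 for the one-way ANOVA with the constraint $\sum_i\alpha_i=0$ (a minimal sketch, not part of the notes; numpy is assumed, and the group sizes and simulated data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = (4, 2, 3)                                   # illustrative group sizes, a = 3
a, N = len(n), sum(n)
X = np.zeros((N, a + 1))
X[:, 0] = 1
row = 0
for i, ni in enumerate(n):
    X[row:row + ni, 1 + i] = 1
    row += ni
y = rng.normal(size=N)

C = np.array([[0.0, 1.0, 1.0, 1.0]])            # constraint: sum_i alpha_i = 0
M = np.linalg.inv(X.T @ X + C.T @ C)

print(np.allclose(X.T @ X @ M @ X.T @ X, X.T @ X))    # M is a G.I. of X'X
print(np.allclose(C @ M @ X.T, 0))                    # C M X' = 0
print(np.allclose(C @ M @ C.T, np.eye(1)))            # C M C' = I
b_hat = M @ X.T @ y                                   # unique solution to (1)
ybar = np.array([y[X[:, 1 + i] == 1].mean() for i in range(a)])
print(np.allclose(b_hat, np.r_[ybar.mean(), ybar - ybar.mean()]))  # matches the sum-to-zero solution above
```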
3.9 Constrained Parameter Space
GLM: $\mathbf y=\mathbf X\mathbf b+\mathbf e$
Suppose we wish to restrict $\mathbf b$ to the subset of $\mathbb R^p$ given by $$ \mathcal T=\{\mathbf b\in\mathbb R^p:\mathbf P^T\mathbf b=\delta\}, $$ where we will insist that the matrix $\mathbf P\in\mathbb R^{p\times q}$ is of full-column rank.
Note: This may seem similar to Section 3.8, but here we do NOT insist that the columns of $\mathbf P$ correspond to non-estimable functions.
Def: $\pmb\lambda^T\mathbf b$ is estimable in the constrained model if $\exist$ $c\in\mathbb R$ and $\mathbf a\in\mathbb R^N$ s.t. $\mathbb E(c+\mathbf a^T\mathbf y)=\pmb\lambda^T\mathbf b$, $\forall\mathbf b\in\mathcal T$.
*Note: If $\pmb\lambda^T\mathbf b$ is estimable in the unconstrained model, then it is estimable in the constrained model. So the constraint can only increase the number of estimable functions.
Result 3.7: In the constrained model, $\pmb\lambda^T\mathbf b$ is estimable iff $\exist$ $\mathbf a\in\mathbb R^N$ and $\mathbf d\in\mathbb R^q$ s.t. $\pmb\lambda=\mathbf X^T\mathbf a+\mathbf P\mathbf d$.
proof: ($\Leftarrow$): $\mathbb E(\mathbf d^T\delta+\mathbf a^T\mathbf y)=\mathbf d^T\delta+\mathbf a^T\mathbf X\mathbf b=\mathbf d^T\delta+(\pmb\lambda-\mathbf P\mathbf d)^T\mathbf b=\mathbf d^T\delta+\pmb\lambda^T\mathbf b-\mathbf d^T\mathbf P^T\mathbf b=\mathbf d^T\delta+\pmb\lambda^T\mathbf b-\mathbf d^T\delta=\pmb\lambda^T\mathbf b$.
($\Rightarrow$): Let $\mathbf W$ be a matrix such that $\mathcal C(\mathbf W)=\mathcal N(\mathbf P^T)$. Fix $\mathbf b_\star\in\mathcal T$. Then for any $\mathbf z$ we have $\mathbf P^T(\mathbf b_\star+\mathbf W\mathbf z)=\mathbf P^T\mathbf b_\star+\mathbf P^T\mathbf W\mathbf z=\delta$. Now, since $\mathbb E(c+\mathbf a^T\mathbf y)=\pmb\lambda^T\mathbf b$, $\forall \mathbf b\in\mathcal T$, we have $$ \begin{aligned} &c+\mathbf a^T\mathbf X\mathbf b=\pmb\lambda^T\mathbf b,\quad\forall\mathbf b\in\mathcal T\\ \Rightarrow&c+\mathbf a^T\mathbf X(\mathbf b_\star+\mathbf W\mathbf z)=\pmb\lambda^T(\mathbf b_\star+\mathbf W\mathbf z),\quad\forall\mathbf z\\ \Rightarrow&(\pmb\lambda-\mathbf X^T\mathbf a)^T\mathbf W\mathbf z-c-\mathbf a^T\mathbf X\mathbf b_\star+\pmb\lambda^T\mathbf b_\star=0,\quad\forall \mathbf z \end{aligned} $$ Using Result A.8, we have $(\pmb\lambda-\mathbf X^T\mathbf a)^T\mathbf W=\mathbf 0$. Thus $\mathbf W^T(\pmb\lambda-\mathbf X^T\mathbf a)=\mathbf 0\Rightarrow(\pmb\lambda-\mathbf X^T\mathbf a)\in\mathcal N(\mathbf W^T)=\mathcal C(\mathbf P)\Rightarrow\exist$ $\mathbf d\in\mathbb R^q$, s.t. $\pmb\lambda-\mathbf X^T\mathbf a=\mathbf P\mathbf d$.
Note that the existence of $\mathbf a\in\mathbb R^N$ and $\mathbf d\in\mathbb R^q$ s.t. $\pmb\lambda=\mathbf X^T\mathbf a+\mathbf P\mathbf d$ means that $$ \pmb\lambda=\begin{pmatrix}\mathbf X^T&\mathbf P\end{pmatrix}\begin{pmatrix}\mathbf a\\\mathbf d\end{pmatrix}\\ \Rightarrow\pmb\lambda\in\mathcal C\left(\begin{pmatrix}\mathbf X^T&\mathbf P\end{pmatrix}\right)\supset\mathcal C(\mathbf X^T) $$ Now consider minimizing $Q(\mathbf b)=|\mathbf y-\mathbf X\mathbf b|^2$ subject to the constraint that $\mathbf P^T\mathbf b=\delta$.
Method of Lagrange Multipliers $$ L(\mathbf b,\pmb\theta)=(\mathbf y-\mathbf X\mathbf b)^T(\mathbf y-\mathbf X\mathbf b)+2\pmb\theta^T(\mathbf P^T\mathbf b-\delta) $$
$$ \begin{aligned} &\frac{\partial L(\mathbf b,\pmb\theta)}{\partial\mathbf b}=-2\mathbf X^T(\mathbf y-\mathbf X\mathbf b)+2\mathbf P\pmb\theta\\ &\frac{\partial L(\mathbf b,\pmb\theta)}{\partial\pmb\theta}=2(\mathbf P^T\mathbf b-\delta) \end{aligned} $$
Setting these to zero yields the R.N.E.s: $$ \begin{pmatrix} \mathbf X^T\mathbf X&\mathbf P\\ \mathbf P^T&\mathbf 0 \end{pmatrix}\begin{pmatrix}\mathbf b\\\pmb\theta\end{pmatrix}= \begin{pmatrix}\mathbf X^T\mathbf y\\\delta\end{pmatrix} $$ Are the restricted normal equations (R.N.E.s) consistent?
Let’s recall how we establish that the N.E.s $(\mathbf X^T\mathbf X\mathbf b=\mathbf X^T\mathbf y)$ are consistent. We showed that $\mathcal N(\mathbf X^T\mathbf X)=\mathcal N(\mathbf X)$, which implies that $\mathcal C(\mathbf X^T\mathbf X)=\mathcal C(\mathbf X^T)$. Note that it would have been sufficient to show that $\mathcal N(\mathbf X^T\mathbf X)\subseteq\mathcal N(\mathbf X)$ because by Result A.6, it follows that $\mathcal C(\mathbf X^T\mathbf X)\supseteq\mathcal C(\mathbf X^T)$. Then $\mathbf X^T\mathbf y\in\mathcal C(\mathbf X^T)\Rightarrow \mathbf X^T\mathbf y\in\mathcal C(\mathbf X^T\mathbf X)$.
Result 3.8: The R.N.E.s are consistent.
proof: Claim $$ \begin{pmatrix} \mathbf X^T\mathbf y\\\delta \end{pmatrix}\in\mathcal C\left(\begin{pmatrix} \mathbf X^T&\mathbf 0\\ \mathbf 0&\mathbf P^T \end{pmatrix}\right) $$ Indeed, there exists a vector $$ \mathbf u=\begin{pmatrix}\mathbf u_1\in\mathbb R^{N\times1}\\\mathbf u_2\in\mathbb R^{p\times1}\end{pmatrix}\quad\text{s.t.}\quad\begin{pmatrix}\mathbf X^T&\mathbf 0\\\mathbf 0&\mathbf P^T\end{pmatrix}\begin{pmatrix}\mathbf u_1\\\mathbf u_2\end{pmatrix}=\begin{pmatrix}\mathbf X^T\mathbf y\\\delta\end{pmatrix} $$ where $\mathbf u_1=\mathbf y$ and $\mathbf u_2$ exists because we assume that $\delta\in\mathcal C(\mathbf P^T)$.
Thus, it is enough to show that $$ \mathcal C\left(\begin{pmatrix}\mathbf X^T&\mathbf 0\\\mathbf 0&\mathbf P^T\end{pmatrix}\right)\subseteq\mathcal C\left(\begin{pmatrix}\mathbf X^T\mathbf X&\mathbf P\\\mathbf P^T&\mathbf 0\end{pmatrix}\right). $$ To start, let $\mathbf v=\begin{pmatrix}\mathbf v_1\in\mathbb R^{p\times1}\\\mathbf v_2\in\mathbb R^{q\times 1}\end{pmatrix}$ be s.t. $$ \begin{pmatrix}\mathbf X^T\mathbf X&\mathbf P\\\mathbf P^T&\mathbf 0\end{pmatrix}\begin{pmatrix}\mathbf v_1\\\mathbf v_2\end{pmatrix}=\begin{pmatrix}\mathbf 0\\\mathbf 0\end{pmatrix}. $$ Then, we have (i) $\mathbf X^T\mathbf X\mathbf v_1+\mathbf P\mathbf v_2=\mathbf 0$ and (ii) $\mathbf P^T\mathbf v_1=\mathbf 0$. $$ (\text{i})\Rightarrow \mathbf v_1^T\mathbf X^T\mathbf X\mathbf v_1+\mathbf v^T_1\mathbf P\mathbf v_2=\mathbf 0\stackrel{\text{(ii)}}{\Rightarrow}\mathbf v_1^T\mathbf X^T\mathbf X\mathbf v_1=\mathbf 0\Rightarrow\mathbf X\mathbf v_1=\mathbf 0. $$ But by (i), $\mathbf P\mathbf v_2=\mathbf 0$ $$ \Rightarrow\begin{pmatrix}\mathbf X&\mathbf 0\\\mathbf 0&\mathbf P\end{pmatrix}\begin{pmatrix}\mathbf v_1\\\mathbf v_2\end{pmatrix}=\begin{pmatrix}\mathbf 0\\\mathbf 0\end{pmatrix}\\ \Rightarrow \mathcal N\left(\begin{pmatrix}\mathbf X^T\mathbf X&\mathbf P\\\mathbf P^T&\mathbf 0\end{pmatrix}\right)\subseteq\mathcal N\left(\begin{pmatrix}\mathbf X&\mathbf 0\\\mathbf0&\mathbf P\end{pmatrix}\right)\\ \stackrel{\text{Result A.6}}{\Rightarrow}\mathcal C\left(\begin{pmatrix}\mathbf X^T&\mathbf 0\\\mathbf 0&\mathbf P^T\end{pmatrix}\right)\subseteq\mathcal C\left(\begin{pmatrix}\mathbf X^T\mathbf X&\mathbf P\\\mathbf P^T&\mathbf 0\end{pmatrix}\right). $$
Let $\begin{pmatrix}\hat{\mathbf b}_H\\\hat{\pmb\theta}_H\end{pmatrix}$ denote a generic solution to the R.N.E.s
Result 3.9: $\hat{\mathbf b}_H$ minimizes $Q(\mathbf b)$ subject to $\mathbf b\in\mathcal T$.
proof: Suppose $\mathbf P^T\tilde{\mathbf b}=\delta$. $$ \begin{aligned} Q(\tilde{\mathbf b})&=(\mathbf y-\mathbf X\tilde{\mathbf b})^T(\mathbf y-\mathbf X\tilde{\mathbf b})=\left(\mathbf y-\mathbf X\hat{\mathbf b}_H+\mathbf X(\hat{\mathbf b}_H-\tilde{\mathbf b})\right)^T\left(\mathbf y-\mathbf X\hat{\mathbf b}_H+\mathbf X(\hat{\mathbf b}_H-\tilde{\mathbf b})\right)\\ &=Q(\hat{\mathbf b}_H)+|\mathbf X(\hat{\mathbf b}_H-\tilde{\mathbf b})|^2 \end{aligned} $$ since $(\hat{\mathbf b}_H-\tilde{\mathbf b})^T\mathbf X^T(\mathbf y-\mathbf X\hat{\mathbf b}_H)=(\hat{\mathbf b}_H-\tilde{\mathbf b})^T\mathbf P\hat{\pmb\theta}_H=\left[\mathbf P^T(\hat{\mathbf b}_H-\tilde{\mathbf b})\right]^T\hat{\pmb\theta}_H=0$ (by constraint). So, we have $Q(\hat{\mathbf b}_H)\leq Q(\tilde{\mathbf b})$ for all $\tilde{\mathbf b}\in\mathcal T$, with equality iff $|\mathbf X(\hat{\mathbf b}_H-\tilde{\mathbf b})|^2=0$, or $\mathbf X\hat{\mathbf b}_H=\mathbf X\tilde{\mathbf b}$.
Result 3.10: Let $\tilde{\mathbf b}$ satisfy $\mathbf P^T\tilde{\mathbf b}=\delta$. $Q(\hat{\mathbf b}_H)=Q(\tilde{\mathbf b})$ iff $\exist$ $\tilde{\pmb\theta}$ s.t. $\begin{pmatrix}\tilde{\mathbf b}&\tilde{\pmb\theta}\end{pmatrix}$ solves the R.N.E.s.
proof: ($\Leftarrow$) If $\begin{pmatrix}\hat{\mathbf b}_H&\hat{\pmb\theta}_H\end{pmatrix}$ and $\begin{pmatrix}\tilde{\mathbf b}&\tilde{\pmb\theta}\end{pmatrix}$ both solve the R.N.E.s, then $Q(\hat{\mathbf b}_H)=Q(\tilde{\mathbf b})=\text{min}$.
($\Rightarrow$) From the proof of Result 3.9, we have $Q(\hat{\mathbf b}_H)=Q(\tilde{\mathbf b})\Leftrightarrow\mathbf X\hat{\mathbf b}_H=\mathbf X\tilde{\mathbf b}\Rightarrow \mathbf X^T\mathbf X\hat{\mathbf b}_H=\mathbf X^T\mathbf X\tilde{\mathbf b}$. Now, since $\mathbf X^T\mathbf X\hat{\mathbf b}_H+\mathbf P\hat{\pmb\theta}_H=\mathbf X^T\mathbf y=\mathbf X^T\mathbf X\tilde{\mathbf b}+\mathbf P\hat{\pmb\theta}_H$, we have that $\begin{pmatrix}\tilde{\mathbf b}&\hat{\pmb\theta}_H\end{pmatrix}$ solves the R.N.E.s.
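The R.N.E.s can be solved directly as one bordered linear system. Below is a minimal sketch (not from the notes; numpy is assumed, and the SLR design and the constraint $b_0+b_1=1$ are illustrative choices) checking that the solution satisfies the constraint and that $Q(\hat{\mathbf b}_H)$ is no larger than $Q$ at other feasible points, as Result 3.9 asserts:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 20
x = rng.normal(size=N)
X = np.column_stack([np.ones(N), x])            # SLR design (full rank), illustrative
y = 1.0 + 2.0 * x + rng.normal(size=N)

P = np.array([[1.0], [1.0]])                    # constraint: b_0 + b_1 = 1
delta = np.array([1.0])

# Restricted normal equations as one bordered linear system
A = np.block([[X.T @ X, P], [P.T, np.zeros((1, 1))]])
rhs = np.r_[X.T @ y, delta]
sol = np.linalg.solve(A, rhs)
b_H = sol[:2]                                   # sol[2:] is the Lagrange multiplier theta_H
print(P.T @ b_H)                                # approx [1.]: the constraint holds

Q = lambda b: np.sum((y - X @ b) ** 2)
for _ in range(5):                              # Result 3.9: Q(b_H) <= Q at feasible points
    b0 = rng.normal()
    b_feasible = np.array([b0, 1.0 - b0])       # another point with b_0 + b_1 = 1
    print(Q(b_H) <= Q(b_feasible) + 1e-9)       # True
```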
Chapter 4: Gauss–Markov Model
Suppose $\Sigma$ is an $n\times n$ covariance matrix associated with $Z=(Z_1,\cdots,Z_n)^T$. Then for $v\in\mathbb R^n$, we have $$ v^T\Sigma v=Var(v^TZ)=Var(\sum_{i=1}^nv_iZ_i)\geq0\quad\Rightarrow\quad\Sigma\text{ is non-negative definite}. $$ When is $\Sigma$ positive definite? The only way that $v^T\Sigma v=0$ when $v\not=0$ is if $Var(\sum_{i=1}^nv_iZ_i)=0$. This means that $\sum_{i=1}^nv_iZ_i$ must be constant. If no such relationship exists, then $\Sigma$ is positive definite.
Gauss-Markov Model
$\mathbf y=\mathbf X\mathbf b+\mathbf e$, $\mathbb E\mathbf e=0$, $Cov(\mathbf e)=\sigma^2\mathbf I$.
So, each element of $\mathbf e$ has mean $0$ and variance $\sigma^2$, and $Cov(e_i,e_j)=0$ for $i\not=j$.
Let’s derive the variance of the L.S. estimator of the estimable function $\pmb\lambda^T\mathbf b$.
Of course, L.S. estimator of $\pmb\lambda^T\mathbf b$ is $\pmb\lambda^T\hat{\mathbf b}=\pmb\lambda^T(\mathbf X^T\mathbf X)^g\mathbf X^T\mathbf y$. We know that $\mathbb E(\pmb\lambda^T\hat{\mathbf b})=\pmb\lambda^T\mathbf b$. $$ \begin{aligned} Var(\pmb\lambda^T\hat{\mathbf b})&=Var(\pmb\lambda^T(\mathbf X^T\mathbf X)^g\mathbf X^T\mathbf y)\\ &=\pmb\lambda^T(\mathbf X^T\mathbf X)^g\mathbf X^TCov(\mathbf y)[\pmb\lambda^T(\mathbf X^T\mathbf X)^g\mathbf X^T]^T\\ &=\sigma^2\pmb\lambda^T(\mathbf X^T\mathbf X)^g\mathbf X^T\mathbf X[(\mathbf X^T\mathbf X)^g]^T\pmb\lambda \end{aligned} $$ Note that $\mathbf X^T\mathbf X[(\mathbf X^T\mathbf X)^g]^T$ projects onto $\mathcal C(\mathbf X^T\mathbf X)=\mathcal C(\mathbf X^T)$ and $\pmb\lambda\in\mathcal C(\mathbf X^T)$, so $\mathbf X^T\mathbf X[(\mathbf X^T\mathbf X)^g]^T\pmb\lambda=\pmb\lambda$. So $$ Var(\pmb\lambda^T\hat{\mathbf b})=\sigma^2\pmb\lambda^T(\mathbf X^T\mathbf X)^g\pmb\lambda. $$ Example:
SLR: $y_i=\beta_0+\beta_1X_i+e_i$. Find $Var(\hat{\beta}_1)$. $$ \mathbf X^T\mathbf X=\begin{pmatrix}N&N\bar X\\N\bar X&\sum_{i=1}^NX_i^2\end{pmatrix} $$
$$ \beta_1=\pmb\lambda^T\mathbf b=\begin{pmatrix}0&1\end{pmatrix}\begin{pmatrix}\beta_0\\\beta_1\end{pmatrix} $$
$$ \begin{aligned} Var(\hat{\beta}_1)&=\sigma^2\pmb\lambda^T(\mathbf X^T\mathbf X)^g\pmb\lambda\\ &=\sigma^2\begin{pmatrix}0&1\end{pmatrix}\begin{pmatrix}N&N\bar X\\N\bar X&\sum_{i=1}^NX_i^2\end{pmatrix}^{-1}\begin{pmatrix}0\\1\end{pmatrix}\\ &=\frac{\sigma^2}{NS_{XX}}\begin{pmatrix}0&1\end{pmatrix}\begin{pmatrix}\sum_{i=1}^NX_i^2 & -N\bar X\\-N\bar X&N\end{pmatrix}\begin{pmatrix}0\\1\end{pmatrix}\\ &=\frac{\sigma^2}{S_{XX}} \end{aligned} $$
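A short numerical confirmation of this variance formula (a sketch only, not part of the notes; numpy is assumed and the sample values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
N, sigma2 = 30, 1.5                       # illustrative values
x = rng.normal(size=N)
X = np.column_stack([np.ones(N), x])
lam = np.array([0.0, 1.0])                # lambda picking off beta_1

G = np.linalg.inv(X.T @ X)                # X has full rank, so the g-inverse is the inverse
var_formula = sigma2 * lam @ G @ lam      # sigma^2 * lambda' (X'X)^g lambda
Sxx = np.sum((x - x.mean()) ** 2)
print(np.isclose(var_formula, sigma2 / Sxx))   # True
```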
Theorem 4.1 (Gauss-Markov Theorem): Assume that $\pmb\lambda^T\mathbf b$ is estimable. Under the G-M model, the L.S. estimator $\pmb\lambda^T\hat{\mathbf b}$ is the best (minimum variance) linear unbiased estimator (BLUE).
proof: Suppose that $c+\mathbf d^T\mathbf y$ is an unbiased estimator of $\pmb\lambda^T\mathbf b$. Then $c+\mathbf d^T\mathbf X\mathbf b=\pmb\lambda^T\mathbf b$, $\forall \mathbf b\in\mathbb R^p$. This implies $c=0$ and $\mathbf d^T\mathbf X=\pmb\lambda^T$. Now, $$ \begin{aligned} &Var(c+\mathbf d^T\mathbf y)\\ =&Var(\mathbf d^T\mathbf y)\\ =&Var(\pmb\lambda^T\hat{\mathbf b}+\mathbf d^T\mathbf y-\pmb\lambda^T\hat{\mathbf b})\\ =&Var(\pmb\lambda^T\hat{\mathbf b})+Var(\mathbf d^T\mathbf y-\pmb\lambda^T\hat{\mathbf b})+2Cov(\pmb\lambda^T\hat{\mathbf b},\mathbf d^T\mathbf y-\pmb\lambda^T\hat{\mathbf b})\\ =&Var(\pmb\lambda^T\hat{\mathbf b})+Var(\mathbf d^T\mathbf y-\pmb\lambda^T\hat{\mathbf b})+2Cov(\pmb\lambda^T(\mathbf X^T\mathbf X)^g\mathbf X^T\mathbf y,[\mathbf d-\mathbf X[(\mathbf X^T\mathbf X)^g]^T\pmb\lambda]^T\mathbf y)\\ =&Var(\pmb\lambda^T\hat{\mathbf b})+Var(\mathbf d^T\mathbf y-\pmb\lambda^T\hat{\mathbf b})+2\pmb\lambda^T(\mathbf X^T\mathbf X)^g\mathbf X^TCov(\mathbf y)[\mathbf d-\mathbf X[(\mathbf X^T\mathbf X)^g]^T\pmb\lambda]\\ =&Var(\pmb\lambda^T\hat{\mathbf b})+Var(\mathbf d^T\mathbf y-\pmb\lambda^T\hat{\mathbf b})+2\sigma^2\pmb\lambda^T(\mathbf X^T\mathbf X)^g[\mathbf X^T\mathbf d-\mathbf X^T\mathbf X[(\mathbf X^T\mathbf X)^g]^T\pmb\lambda]\\ =&Var(\pmb\lambda^T\hat{\mathbf b})+Var(\mathbf d^T\mathbf y-\pmb\lambda^T\hat{\mathbf b})+2\sigma^2\pmb\lambda^T(\mathbf X^T\mathbf X)^g(\pmb\lambda-\pmb\lambda)\\ =&Var(\pmb\lambda^T\hat{\mathbf b})+Var(\mathbf d^T\mathbf y-\pmb\lambda^T\hat{\mathbf b})\\ \geq& Var(\pmb\lambda^T\hat{\mathbf b}) \end{aligned} $$ with equality iff $$ \begin{aligned} 0&=Var(\mathbf d^T\mathbf y-\pmb\lambda^T\hat{\mathbf b})\\ &=Var(\mathbf d^T\mathbf y-\pmb\lambda^T(\mathbf X^T\mathbf X)^g\mathbf X^T\mathbf y)\\ &=Var\left([\mathbf d-\mathbf X[(\mathbf X^T\mathbf X)^g]^T\pmb\lambda]^T\mathbf y\right)\\ &=[\mathbf d-\mathbf X[(\mathbf X^T\mathbf X)^g]^T\pmb\lambda]^TCov(\mathbf y)[\mathbf d-\mathbf X[(\mathbf X^T\mathbf X)^g]^T\pmb\lambda]\\ &=\sigma^2|\mathbf d-\mathbf X[(\mathbf X^T\mathbf X)^g]^T\pmb\lambda|^2 \end{aligned} $$ but this norm square is $0$ iff $\mathbf d^T=\pmb\lambda^T(\mathbf X^T\mathbf X)^g\mathbf X^T\Rightarrow\mathbf d^T\mathbf y=\pmb\lambda^T(\mathbf X^T\mathbf X)^g\mathbf X^T\mathbf y=\pmb\lambda^T\hat{\mathbf b}\Rightarrow$ BLUE is unique.
Note that the crucial step in the proof of G-M theorem is showing that $Cov(\pmb\lambda^T\hat{\mathbf b},\mathbf d^T\mathbf y-\pmb\lambda^T\hat{\mathbf b})=0$. Of course, $\mathbf d^T\mathbf y-\pmb\lambda^T\hat{\mathbf b}$ is an unbiased estimator of $0$.
Result 4.1: The BLUE $\pmb\lambda^T\hat{\mathbf b}$ is uncorrelated with all linear unbiased estimators of $0$.
proof: Suppose $\mathbb E(c+\mathbf a^T\mathbf y)=c+\mathbf a^T\mathbf X\mathbf b=0$, $\forall \mathbf b\in\mathbb R^p$. It follows that $c=0$ and $\mathbf a^T\mathbf X=\mathbf 0\Rightarrow\mathbf a\in\mathcal N(\mathbf X^T)$. So, $$ \begin{aligned} Cov(\pmb\lambda^T\hat{\mathbf b},\mathbf a^T\mathbf y)&=Cov(\pmb\lambda^T(\mathbf X^T\mathbf X)^g\mathbf X^T\mathbf y,\mathbf a^T\mathbf y)\\ &=\pmb\lambda^T(\mathbf X^T\mathbf X)^g\mathbf X^TCov(\mathbf y)\mathbf a\\ &=\sigma^2\pmb\lambda^T(\mathbf X^T\mathbf X)^g\mathbf X^T\mathbf a\\ &=0 \end{aligned} $$
*The G-M theorem can be extended to the case where we have $r$ linearly independent estimable functions $\{\pmb\lambda^{(j)T}\mathbf b\}_{j=1}^r$. (Assume $rank(\mathbf X)=r$)
Let $\pmb\Lambda$ be a $p\times r$ matrix whose $j$th column is $\pmb\lambda^{(j)}$. Then $\pmb\Lambda^T\hat{\mathbf b}$ is the $r\times 1$ vector of L.S. estimators, and $\mathbb E(\pmb\Lambda^T\hat{\mathbf b})=\pmb\Lambda^T\mathbf b$, $$ \begin{aligned} Cov(\pmb\Lambda^T\hat{\mathbf b})&=Cov(\pmb\Lambda^T(\mathbf X^T\mathbf X)^g\mathbf X^T\mathbf y)\\ &=\pmb\Lambda^T(\mathbf X^T\mathbf X)^g\mathbf X^TCov(\mathbf y)[\pmb\Lambda^T(\mathbf X^T\mathbf X)^g\mathbf X^T]^T\\ &=\sigma^2\pmb\Lambda^T(\mathbf X^T\mathbf X)^g\mathbf X^T\mathbf X[(\mathbf X^T\mathbf X)^g]^T\pmb\Lambda\\ &=\sigma^2\pmb\Lambda^T(\mathbf X^T\mathbf X)^g\pmb\Lambda. \end{aligned} $$ Q: Is $\sigma^2\pmb\Lambda^T(\mathbf X^T\mathbf X)^g\pmb\Lambda$ positive definite?
A: Yes. Since each $\pmb\lambda^{(j)}\in\mathcal C(\mathbf X^T)$, we can write $\pmb\Lambda=\mathbf X^T\mathbf S$ for some matrix $\mathbf S$, so $\pmb\Lambda^T(\mathbf X^T\mathbf X)^g\pmb\Lambda=\mathbf S^T\mathbf X(\mathbf X^T\mathbf X)^g\mathbf X^T\mathbf S=\mathbf S^T\mathbf P_\mathbf X\mathbf P_\mathbf X\mathbf S=(\mathbf P_\mathbf X\mathbf S)^T\mathbf P_\mathbf X\mathbf S$.
Because $\mathbf X^T\mathbf P_\mathbf X\mathbf S=\mathbf X^T\mathbf S=\pmb\Lambda$, we have $$ r=rank(\pmb\Lambda)\leq\min\{rank(\mathbf X^T),rank(\mathbf P_\mathbf X\mathbf S)\}\leq rank(\mathbf P_\mathbf X\mathbf S)\Rightarrow rank(\mathbf P_\mathbf X\mathbf S)=r. $$ So, $\sigma^2\pmb\Lambda^T(\mathbf X^T\mathbf X)^g\pmb\Lambda$ is positive definite.
Suppose that $\mathbb E(\mathbf C^T\mathbf y)=\pmb\Lambda^T\mathbf b$, $\forall\mathbf b\in\mathbb R^p$. ($\mathbf C\in\mathbb R^{N\times r}$)
So, $\mathbf C^T\mathbf X\mathbf b=\pmb\Lambda^T\mathbf b$, $\forall \mathbf b\in\mathbb R^p\Rightarrow \mathbf C^T\mathbf X=\pmb\Lambda^T$. Now, $Cov(\mathbf C^T\mathbf y)=\mathbf C^TCov(\mathbf y)\mathbf C=\sigma^2\mathbf C^T\mathbf C$ and $$ \begin{aligned} Cov(\mathbf C^T\mathbf y)-Cov(\pmb\Lambda^T\hat{\mathbf b})&=\sigma^2\left[\mathbf C^T\mathbf C-\pmb\Lambda^T(\mathbf X^T\mathbf X)^g\pmb\Lambda\right]\\ &=\sigma^2\left[\mathbf C^T\mathbf C-\mathbf C^T\mathbf X(\mathbf X^T\mathbf X)^g\mathbf X^T\mathbf C\right]\\ &=\sigma^2\mathbf C^T(\mathbf I-\mathbf P_\mathbf X)\mathbf C \end{aligned} $$ Note that, for $\mathbf v\in\mathbb R^r$, we have $$ \mathbf v^T\mathbf C^T[\mathbf I-\mathbf P_\mathbf X]\mathbf C\mathbf v=\mathbf v^T\mathbf C^T(\mathbf I-\mathbf P_\mathbf X)(\mathbf I-\mathbf P_\mathbf X)\mathbf C\mathbf v=\left((\mathbf I-\mathbf P_\mathbf X)\mathbf C\mathbf v\right)^T(\mathbf I-\mathbf P_\mathbf X)\mathbf C\mathbf v\geq0\\ \Rightarrow \mathbf C^T[\mathbf I-\mathbf P_\mathbf X]\mathbf C\text{ is non-negative definite}\\ \Rightarrow Cov(\mathbf C^T\mathbf y)-Cov(\pmb\Lambda^T\hat{\mathbf b}) \text{ is non-negative definite}. $$ Fix $\mathbf v\in\mathbb R^r$, $\mathbf v\not=\mathbf 0$. Let’s compare $\mathbf v^T\mathbf C^T\mathbf y$ with $\mathbf v^T\pmb\Lambda^T\hat{\mathbf b}$. $$ Var(\mathbf v^T\mathbf C^T\mathbf y)-Var(\mathbf v^T\pmb\Lambda^T\hat{\mathbf b})=\mathbf v^T Cov(\mathbf C^T\mathbf y)\mathbf v-\mathbf v^TCov(\pmb\Lambda^T\hat{\mathbf b})\mathbf v\\=\mathbf v^T(Cov(\mathbf C^T\mathbf y)-Cov(\pmb\Lambda^T\hat{\mathbf b}))\mathbf v\geq0. $$
Suppose $r=rank(\mathbf X)=p$. Then every function $\pmb\lambda^T\mathbf b$ is estimable. So we can take $\pmb\Lambda=\mathbf I_p$, which yields $$ Cov(\hat{\mathbf b})=\sigma^2(\mathbf X^T\mathbf X)^{-1}. $$ We also have $$ Cov(\tilde{\mathbf b})-Cov(\hat{\mathbf b})\text{ is non-negative definite}, $$ where $\tilde{\mathbf b}$ is any linear unbiased estimator of $\mathbf b$.
4.3 Variance Estimation
Let $\mathbf Z$ be a random vector such that $\mathbb E\mathbf Z=\pmb\mu$ and $Cov(\mathbf Z)=\pmb\Sigma$. Then if $\mathbf A$ is a fixed matrix, we have $$ \mathbb E(\mathbf Z^T\mathbf A\mathbf Z)=\pmb\mu^T\mathbf A\pmb\mu+tr(\mathbf A\pmb\Sigma). $$ proof: See Page 76 in the textbook.
Fact: $tr(\mathbf I-\mathbf P_\mathbf X)=N-r$, where $r=rank(\mathbf X)$.
Reason: $tr(\mathbf I-\mathbf P_\mathbf X)=tr(\mathbf I)-tr(\mathbf P_\mathbf X)$. Since $\mathbf P_\mathbf X$ is idempotent, $rank(\mathbf P_\mathbf X)=tr(\mathbf P_\mathbf X)$, but $\mathcal C(\mathbf P_\mathbf X)=\mathcal C(\mathbf X)$, so $rank(\mathbf P_\mathbf X)=rank(\mathbf X)=r$.
Recall that $SSE=\hat{\mathbf e}^T\hat{\mathbf e}=\mathbf y^T(\mathbf I-\mathbf P_\mathbf X)\mathbf y$. Define $\hat{\sigma}^2=SSE/(N-r)$.
Result 4.2: Under the G-M model, $\hat{\sigma}^2$ is an unbiased estimator of $\sigma^2$.
proof: $$ \begin{aligned} \mathbb E(\mathbf y^T(\mathbf I-\mathbf P_\mathbf X)\mathbf y)&=\mathbf b^T\mathbf X^T(\mathbf I-\mathbf P_\mathbf X)\mathbf X\mathbf b+tr((\mathbf I-\mathbf P_\mathbf X)Cov(\mathbf y))\\ &=0+\sigma^2tr(\mathbf I-\mathbf P_\mathbf X)\\ &=\sigma^2(N-r). \end{aligned} $$
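A Monte Carlo sketch of Result 4.2 for a rank-deficient one-way ANOVA design (not from the notes; numpy is assumed, and the group sizes, $\mathbf b$, and $\sigma$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma = (5, 4, 6), 2.0                   # illustrative group sizes and error sd
a, N = len(n), sum(n)
X = np.zeros((N, a + 1))
X[:, 0] = 1
row = 0
for i, ni in enumerate(n):
    X[row:row + ni, 1 + i] = 1
    row += ni
r = np.linalg.matrix_rank(X)                # = a, not a + 1 (X is rank deficient)
P = X @ np.linalg.pinv(X)                   # orthogonal projection onto C(X)
b = np.array([1.0, 0.5, -0.5, 0.0])         # any b works; SSE = e'(I - P)e does not depend on it

estimates = []
for _ in range(5000):
    y = X @ b + sigma * rng.normal(size=N)
    sse = y @ (np.eye(N) - P) @ y
    estimates.append(sse / (N - r))
print(np.mean(estimates))                   # close to sigma^2 = 4.0
```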
4.5 The Aitken Model and Generalized Least Squares
Aitken model: $\mathbf y=\mathbf X\mathbf b+\mathbf e$, $\mathbb E(\mathbf e)=\mathbf 0$, $Cov(\mathbf e)=\sigma^2\mathbf V$, where $\mathbf V$ is a known positive definite matrix.
The key to dealing with this model is to find a matrix $\mathbf R$ satisfying $\mathbf R\mathbf V\mathbf R^T=\mathbf I$.
There are two standard ways to find $\mathbf R$:
Cholesky factorization: $\mathbf V=\mathbf L\mathbf L^T$, where $\mathbf L$ is an invertible lower-triangular matrix. Then $\mathbf R=\mathbf L^{-1}$ and $\mathbf R\mathbf V\mathbf R^T=\mathbf L^{-1}\mathbf V(\mathbf L^{-1})^T=\mathbf L^{-1}\mathbf L\mathbf L^T(\mathbf L^{-1})^T=\mathbf I$.
Spectral decomposition: $\mathbf V=\mathbf Q\pmb\Lambda\mathbf Q^T$. Here $\mathbf R=\mathbf Q\pmb\Lambda^{-1/2}\mathbf Q^T$ (symmetric) and $$ \mathbf R\mathbf V\mathbf R^T=\mathbf Q\pmb\Lambda^{-1/2}\mathbf Q^T\mathbf Q\pmb\Lambda\mathbf Q^T\mathbf Q\pmb\Lambda^{-1/2}\mathbf Q^T=\mathbf Q\pmb\Lambda^{-1/2}\pmb\Lambda\pmb\Lambda^{-1/2}\mathbf Q^T=\mathbf Q\mathbf Q^T=\mathbf I. $$
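Both constructions of $\mathbf R$ are easy to check numerically. A minimal sketch (numpy is assumed; the matrix $\mathbf V$ below is an arbitrary positive definite example):

```python
import numpy as np

rng = np.random.default_rng(4)
B = rng.normal(size=(5, 5))
V = B @ B.T + 5 * np.eye(5)                    # an arbitrary positive definite V

# Cholesky route: V = L L', take R = L^{-1}
L = np.linalg.cholesky(V)
R_chol = np.linalg.inv(L)
print(np.allclose(R_chol @ V @ R_chol.T, np.eye(5)))   # True

# Spectral route: V = Q diag(w) Q', take R = Q diag(w^{-1/2}) Q' (symmetric)
w, Q = np.linalg.eigh(V)
R_spec = Q @ np.diag(w ** -0.5) @ Q.T
print(np.allclose(R_spec @ V @ R_spec.T, np.eye(5)))   # True
```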
Now, we use $\mathbf R$ to transform the Aitken model: $$ \mathbf R\mathbf y=\mathbf R\mathbf X\mathbf b+\mathbf R\mathbf e\\ \Leftrightarrow\mathbf z=\mathbf U\mathbf b+\mathbf f $$ where $\mathbf z=\mathbf R\mathbf y$, $\mathbf U=\mathbf R\mathbf X$, and $\mathbf f=\mathbf R\mathbf e$. Then $\mathbb E(\mathbf f)=\mathbb E(\mathbf R\mathbf e)=\mathbf R\mathbb E(\mathbf e)=\mathbf 0$ and $Cov(\mathbf f)=Cov(\mathbf R\mathbf e)=\mathbf RCov(\mathbf e)\mathbf R^T=\sigma^2\mathbf R\mathbf V\mathbf R^T=\sigma^2\mathbf I$.
Thus, the transformed model is nothing but a G-M model. We will solve problems using G-M theory, and then transform back.
Estimability: $\pmb\lambda^T\mathbf b$ is estimable $\Leftrightarrow$ $\pmb\lambda\in\mathcal C(\mathbf U^T)$. But $\mathcal C(\mathbf U^T)=\mathcal C(\mathbf X^T\mathbf R^T)=\mathcal C(\mathbf X^T)$ since $\mathbf R^T$ is invertible.
Generalized least squares
$\mathbf U^T\mathbf U\mathbf b=\mathbf U^T\mathbf z\Rightarrow (\mathbf R\mathbf X)^T\mathbf R\mathbf X\mathbf b=(\mathbf R\mathbf X)^T\mathbf z\Rightarrow \mathbf X^T\mathbf R^T\mathbf R\mathbf X\mathbf b=\mathbf X^T\mathbf R^T\mathbf R\mathbf y\Rightarrow \mathbf X^T\mathbf V^{-1}\mathbf X\mathbf b=\mathbf X^T\mathbf V^{-1}\mathbf y$. Here we have used $\mathbf V=\mathbf R^{-1}(\mathbf R^T)^{-1}=(\mathbf R^T\mathbf R)^{-1}\Rightarrow \mathbf V^{-1}=\mathbf R^T\mathbf R$.
We call $\mathbf X^T\mathbf V^{-1}\mathbf X\mathbf b=\mathbf X^T\mathbf V^{-1}\mathbf y$ the Aitken equations, and we call a solution $\hat{\mathbf b}_{GLS}$.
Theorem (Aitken’s Theorem): Let $\mathbf y=\mathbf X\mathbf b+\mathbf e$ with $\mathbb E \mathbf e=0$ and $Cov(\mathbf e)=\sigma^2\mathbf V$ where $\mathbf V$ is known positive definite. If $\pmb\lambda^T\mathbf b$ is estimable, then $\pmb\lambda^T\hat{\mathbf b}_{GLS}$ is the BLUE for $\pmb\lambda^T\mathbf b$.
From the results in Chapter 2, we know that $\hat{\mathbf b}_{GLS}$ minimizes $$ |\mathbf z-\mathbf U\mathbf b|^2=|\mathbf R(\mathbf y-\mathbf X\mathbf b)|^2=(\mathbf y-\mathbf X\mathbf b)^T\mathbf V^{-1}(\mathbf y-\mathbf X\mathbf b) $$ Examples:
SLR: $y_i=\beta_0+\beta_1x_i+e_i$, $\mathbb E(e_i)=0$, $Var(\mathbf e)=\sigma^2\mathbf V$ where $\mathbf V$ is diagonal. $$ (\mathbf y-\mathbf X\mathbf b)^T\mathbf V^{-1}(\mathbf y-\mathbf X\mathbf b)=\sum_{i=1}^N\frac{1}{V_{ii}}(y_i-\beta_0-\beta_1x_i)^2\quad\text{“weighted L.S.”} $$
Regression through the origin with heteroskedastic errors. $y_i=\beta_1x_i+e_i$, $\mathbb Ee_i=0$, $Var(e_i)=\sigma^2x_i^2$, $Cov(e_i,e_j)=0$ ($i\not=j$). $$ \mathbf V=\begin{pmatrix} x_1^2&&& \\ &x_2^2&&\\ &&\ddots&\\ &&&x_N^2 \end{pmatrix}\quad \mathbf R=\begin{pmatrix} x_1^{-1}&&&\\ &x_2^{-1}&&\\ &&\ddots&\\ &&&x_N^{-1} \end{pmatrix} $$ $\mathbf R\mathbf y=(y_1/x_1,y_2/x_2,\cdots,y_N/x_N)^T$, $\mathbf R\mathbf X=\mathbf 1_N$.
Transformed model: $\frac{y_i}{x_i}=\beta_1+\frac{e_i}{x_i}$.
$\mathbf X^T\mathbf V^{-1}\mathbf X=N$, $\mathbf X^T\mathbf V^{-1}\mathbf y=\sum_{i=1}^N\frac{y_i}{x_i}$, $\hat{\mathbf b}_{GLS}=(\mathbf X^T\mathbf V^{-1}\mathbf X)^{-1}\mathbf X^T\mathbf V^{-1}\mathbf y=\frac{1}{N}\sum_{i=1}^N\frac{y_i}{x_i}$, $Var(\hat{\mathbf b}_{GLS})=\frac{1}{N^2}\sum_{i=1}^NVar(y_i/x_i)=\frac{1}{N^2}\sum_{i=1}^N\sigma^2=\sigma^2/N$
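A quick numerical check that the Aitken equations reproduce $\hat{\mathbf b}_{GLS}=\frac{1}{N}\sum_{i=1}^N y_i/x_i$ in this example (a sketch only; numpy is assumed and the simulated data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
N, beta1, sigma = 12, 2.0, 0.5                  # illustrative values
x = rng.uniform(1.0, 3.0, size=N)
y = beta1 * x + sigma * x * rng.normal(size=N)  # Var(e_i) = sigma^2 x_i^2

X = x.reshape(-1, 1)
Vinv = np.diag(1.0 / x ** 2)
b_gls = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)   # Aitken equations
print(np.isclose(b_gls[0], np.mean(y / x)))               # True: b_GLS = (1/N) sum y_i / x_i
```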
Q: Is $\pmb\lambda^T\hat{\mathbf b}_{OLS}$ ever BLUE under the Aitken model?
Result 4.3 (Generalization of Result 4.1): Let $\mathbf y$ be a random vector s.t. $Var(y_i)<\infty$. The linear estimator $\mathbf t^T\mathbf y$ is BLUE for $\mathbb E(\mathbf t^T\mathbf y)$ iff it is uncorrelated with all linear unbiased estimators of $0$.
proof: ($\Leftarrow$) Let $\mathbf a^T\mathbf y$ be such that $\mathbb E(\mathbf a^T\mathbf y)=\mathbb E(\mathbf t^T\mathbf y)$. $$ \begin{aligned} Var(\mathbf a^T\mathbf y)&=Var(\mathbf t^T\mathbf y+\mathbf a^T\mathbf y-\mathbf t^T\mathbf y)\\ &=Var(\mathbf t^T\mathbf y)+Var(\mathbf a^T\mathbf y-\mathbf t^T\mathbf y)+2Cov(\mathbf t^T\mathbf y,\mathbf a^T\mathbf y-\mathbf t^T\mathbf y)\\ &=Var(\mathbf t^T\mathbf y)+Var(\mathbf a^T\mathbf y-\mathbf t^T\mathbf y)\\ &\geq Var(\mathbf t^T\mathbf y) \end{aligned} $$ ($\Rightarrow $) Suppose $\mathbb E(\mathbf h^T\mathbf y)=0$. Need to show that $Cov(\mathbf t^T\mathbf y,\mathbf h^T\mathbf y)=0$. If $\mathbf h^T\mathbf y$ is constant then the result clearly holds. So assume that $\mathbf h^T\mathbf y$ is not constant. Let $Cov(\mathbf t^T\mathbf y,\mathbf h^T\mathbf y)=c$ and $Var(\mathbf h^T\mathbf y)=d>0$. Now consider an alternative linear unbiased estimator of $\mathbb E(\mathbf t^T\mathbf y)$: $\mathbf a^T\mathbf y=\mathbf t^T\mathbf y-(\frac{c}{d})\mathbf h^T\mathbf y$, and we have $$ \begin{aligned} Var(\mathbf a^T\mathbf y)&= Var(\mathbf t^T\mathbf y)+(\frac{c}{d})^2Var(\mathbf h^T\mathbf y)-2\frac{c}{d}Cov(\mathbf t^T\mathbf y,\mathbf h^T\mathbf y)\\ &=Var(\mathbf t^T\mathbf y)+\frac{c^2}{d}-2\frac{c^2}{d}\\ &=Var(\mathbf t^T\mathbf y)-\frac{c^2}{d} \end{aligned} $$ But $\mathbf t^T\mathbf y$ is BLUE, so $c$ must be $0$.
Corollary 4.1: Under the Aitken model, $\mathbf t^T\mathbf y$ is the BLUE for $\mathbb E(\mathbf t^T\mathbf y)$ iff $\mathbf V\mathbf t\in\mathcal C(\mathbf X)$.
proof: It is enough to show that $\mathbf t^T\mathbf y$ is uncorrelated with all linear unbiased estimators of $0$ iff $\mathbf V\mathbf t\in\mathcal C(\mathbf X)$.
First, $\mathbb E(\mathbf h^T\mathbf y)=0\Leftrightarrow \mathbf h^T\mathbf X\mathbf b=0,\forall\mathbf b\in\mathbb R^p\Leftrightarrow \mathbf h^T\mathbf X=\mathbf 0\Leftrightarrow \mathbf h\in\mathcal N(\mathbf X^T)$.
($\Leftarrow$) Suppose $\mathbb E(\mathbf h^T\mathbf y)=0$. Then $\mathbf h\in\mathcal N(\mathbf X^T)$ and $Cov(\mathbf t^T\mathbf y,\mathbf h^T\mathbf y)=\mathbf t^TCov(\mathbf y)\mathbf h=\sigma^2\mathbf t^T\mathbf V\mathbf h=0$.
($\Rightarrow$) If $Cov(\mathbf t^T\mathbf y,\mathbf h^T\mathbf y)=\sigma^2\mathbf t^T\mathbf V\mathbf h=0$ for $\forall \mathbf h\in\mathcal N(\mathbf X^T)$, then $\mathbf V^T\mathbf t\perp\mathcal N(\mathbf X^T)\Rightarrow\mathbf V\mathbf t\in\mathcal C(\mathbf X)$.
Result 4.4: Under the Aitken model, all OLS estimators are BLUE (that is, each $\pmb\lambda^T\hat{\mathbf b}_{OLS}$ is BLUE for the corresponding estimable $\pmb\lambda^T\mathbf b$) iff $\exist$ a matrix $\mathbf Q$ s.t. $\mathbf V\mathbf X=\mathbf X\mathbf Q$.
proof: ($\Leftarrow$) A generic OLS estimator takes the form $$ \pmb\lambda^T\hat{\mathbf b}_{OLS}=\pmb\lambda^T(\mathbf X^T\mathbf X)^g\mathbf X^T\mathbf y=\mathbf t^T\mathbf y $$ where $\mathbf t=\mathbf X[(\mathbf X^T\mathbf X)^g]^T\pmb\lambda$.
Now, if $\mathbf V\mathbf X=\mathbf X\mathbf Q$, we have $$ \mathbf V\mathbf t=\mathbf V\mathbf X[(\mathbf X^T\mathbf X)^g]^T\pmb\lambda=\mathbf X\mathbf Q[(\mathbf X^T\mathbf X)^g]^T\pmb\lambda\in\mathcal C(\mathbf X) $$ so $\pmb\lambda^T\hat{\mathbf b}_{OLS}$ is BLUE by corollary 4.1.
($\Rightarrow$) Let $\mathbf X^T\mathbf X=(\pmb\lambda^{(1)},\pmb\lambda^{(2)},\cdots,\pmb\lambda^{(p)})$; that is, define $\pmb\lambda^{(j)}$ to be the $j$-th column of $\mathbf X^T\mathbf X$, $j=1,2,\cdots,p$.
Now, $\pmb\lambda^{(j)T}\hat{\mathbf b}_{OLS}=\mathbf t^{(j)T}\mathbf y$ where $\mathbf t^{(j)}=\mathbf X[(\mathbf X^T\mathbf X)^g]^T\pmb\lambda^{(j)}$. Since we know that $\pmb\lambda^{(j)T}\hat{\mathbf b}_{OLS}$ is BLUE, $\mathbf V\mathbf t^{(j)}\in\mathcal C(\mathbf X)$, or, in other words, $\exist$ $\mathbf q^{(j)}$ s.t. $\mathbf V\mathbf t^{(j)}=\mathbf X\mathbf q^{(j)}\Rightarrow\mathbf V\underbrace{(\mathbf t^{(1)},\mathbf t^{(2)},\cdots,\mathbf t^{(p)})}_\mathbf T=\mathbf X\underbrace{(\mathbf q^{(1)},\mathbf q^{(2)},\cdots,\mathbf q^{(p)})}_\mathbf Q$. But $$ \begin{aligned} \mathbf V\mathbf T&=\mathbf V\left[\mathbf X[(\mathbf X^T\mathbf X)^g]^T\pmb\lambda^{(1)}\quad\mathbf X[(\mathbf X^T\mathbf X)^g]^T\pmb\lambda^{(2)}\quad\cdots\quad\mathbf X[(\mathbf X^T\mathbf X)^g]^T\pmb\lambda^{(p)}\right]\\ &=\mathbf V\mathbf X[(\mathbf X^T\mathbf X)^g]^T[\pmb\lambda^{(1)}\quad\pmb\lambda^{(2)}\quad\cdots\quad\pmb\lambda^{(p)}]\\ &=\mathbf V\underbrace{\mathbf X[(\mathbf X^T\mathbf X)^g]^T\mathbf X^T}_{\mathbf P_\mathbf X}\mathbf X=\mathbf V\mathbf X \end{aligned} $$
Example: Back to heteroskedastic regression through the origin.
$y_i=\beta x_i+e_i$, $\mathbb Ee_i=0$, $Var(e_i)=\sigma^2x_i^2$, $Cov(e_i,e_j)=0$ ($i\not=j$). Assume $x_i\not=0$, $\forall i$. $$ \mathbf V=\begin{pmatrix}x_1^2&&&\\ &x_2^2&&\\ &&\ddots&\\ &&&x_N^2\end{pmatrix}\quad \mathbf X=\begin{pmatrix} x_1\\x_2\\\vdots\\x_N \end{pmatrix} $$ We showed that $\hat{\mathbf b}_{GLS}=\frac{1}{N}\sum_{i=1}^N\frac{y_i}{x_i}$ and $Var(\hat{\mathbf b}_{GLS})=\frac{\sigma^2}{N}$. Of course, $$ \hat{\mathbf b}_{OLS}=\frac{\sum_{i=1}^Nx_iy_i}{\sum_{i=1}^Nx_i^2}\quad Var(\hat{\mathbf b}_{OLS})=\frac{\sum_{i=1}^Nx_i^2Var(y_i)}{\left(\sum_{i=1}^Nx_i^2\right)^2}=\sigma^2\frac{\sum_{i=1}^Nx_i^4}{\left(\sum_{i=1}^Nx_i^2\right)^2}. $$ Here is a slick way of comparing these variances:
Let $U$ be a random variable taking values $x_1^2,x_2^2,\cdots,x_N^2$ with equal probabilities $1/N$. Then $\mathbb EU=\frac{1}{N}\sum x_i^2$ and $\mathbb EU^2=\frac{1}{N}\sum x_i^4$, and $$ Var(U)=\frac{1}{N}\sum x_i^4-\left(\frac{1}{N}\sum x_i^2\right)^2\geq0\\ \Rightarrow \frac{1}{N}\sum x_i^4\geq\frac{1}{N^2}\left(\sum x_i^2\right)^2\\ \Rightarrow N\sum x_i^4\geq\left(\sum x_i^2\right)^2,\quad\text{or}\quad \frac{\sum x_i^4}{(\sum x_i^2)^2}\geq\frac{1}{N} $$ with equality iff $Var(U)=0$; that is, iff $|x_i|=c>0$ for $\forall i=1,2,\cdots,N$.
Of course, if $|x_i|=c>0$, $\forall i=1,2,\cdots,N$, $$ \begin{aligned} \hat{\mathbf b}_{GLS}&=\frac{1}{N}\sum\frac{y_i}{x_i}=\frac{1}{N}\sum\frac{x_iy_i}{x_i^2}=\frac{1}{N}\sum\frac{x_iy_i}{c^2}\\ &=\frac{1}{Nc^2}\sum x_iy_i=\frac{\sum x_iy_i}{\sum x_i^2}=\hat{\mathbf b}_{OLS}. \end{aligned} $$ In other words, the OLS estimator $\hat{\mathbf b}_{OLS}$ is BLUE iff $|x_i|=c>0$, $\forall i$.
Now, let’s consider Result 4.4 in this context. When does there exist a $\mathbf Q$ such that $\mathbf V\mathbf X=\mathbf X\mathbf Q$? $$ \mathbf V\mathbf X=\begin{pmatrix}x_1^3\\x_2^3\\\vdots\\x_N^3\end{pmatrix}\quad\mathbf X\mathbf Q=\begin{pmatrix}x_1\\x_2\\\vdots\\x_N\end{pmatrix}q $$ The only way that $\mathbf V\mathbf X=\mathbf X\mathbf Q$ is if $x_i^3=x_iq$, $\forall i\Rightarrow $ $x_i^2=q=c^2>0$, $\forall i\Rightarrow $ $|x_i|=c>0$, $\forall i$.
Example: $\mathbf y=\mathbf X\mathbf b+\mathbf e$, $\mathbb E\mathbf e=\mathbf 0$, $Cov(\mathbf e)=\sigma^2\mathbf V=\sigma^2\mathbf I_N+\tau^2\mathbf 1_N\mathbf 1_N^T=\sigma^2(\mathbf I_N+\frac{\tau^2}{\sigma^2}\mathbf 1_N\mathbf 1_N^T)$. Assume that $\mathbf X\mathbf b=\begin{pmatrix}\mathbf 1_N&\mathbf X^\star\end{pmatrix}\begin{pmatrix}b_1\\\mathbf b_2\end{pmatrix}$
Can we find $\mathbf Q$ s.t. $\mathbf V\mathbf X=\mathbf X\mathbf Q$? $$ \begin{aligned} \sigma^2\mathbf V\mathbf X&=(\sigma^2\mathbf I_N+\tau^2\mathbf 1_N\mathbf 1_N^T)\begin{pmatrix}\mathbf 1_N&\mathbf X^\star\end{pmatrix}\\ &=\begin{pmatrix}\sigma^2\mathbf 1_N+N\tau^2\mathbf 1_N&\sigma^2\mathbf X^*+\tau^2\mathbf 1_N\mathbf 1_N^T\mathbf X^\star\end{pmatrix} \end{aligned} $$
$$ \mathbf X\mathbf Q=\begin{pmatrix}\mathbf 1_N&\mathbf X^\star\end{pmatrix}\begin{pmatrix}Q_{11}&\mathbf Q_{12}\\\mathbf Q_{21}&\mathbf Q_{22}\end{pmatrix}=\begin{pmatrix}\mathbf 1_N Q_{11}+\mathbf X^\star\mathbf Q_{21}&\mathbf 1_N\mathbf Q_{12}+\mathbf X^\star\mathbf Q_{22}\end{pmatrix} $$
Matching up terms, we have $$ \begin{aligned} &Q_{11}=\sigma^2+N\tau^2\quad\mathbf Q_{21}=\mathbf 0\quad\mathbf Q_{12}=\tau^2\mathbf 1_N^T\mathbf X^*\quad\mathbf Q_{22}=\sigma^2\mathbf I_{p-1}\\ \Rightarrow &\sigma^2\mathbf V\mathbf X=\mathbf X\mathbf Q\quad\text{or}\quad\mathbf V\mathbf X=\mathbf X\mathbf Q',\text{ where }\mathbf Q'=\mathbf Q/\sigma^2\\ \Rightarrow &\text{ OLS estimators are BLUE}. \end{aligned} $$
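The matching-up step can be verified numerically. A minimal sketch (numpy is assumed; $N$, $p$, $\sigma^2$, $\tau^2$, and $\mathbf X^\star$ below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
N, p = 10, 3
sigma2, tau2 = 1.0, 0.7                         # illustrative variance components
Xstar = rng.normal(size=(N, p - 1))
X = np.column_stack([np.ones(N), Xstar])        # design with an intercept column
V = np.eye(N) + (tau2 / sigma2) * np.ones((N, N))

# Q' built from the matching-up step above (already divided by sigma^2)
Q = np.zeros((p, p))
Q[0, 0] = (sigma2 + N * tau2) / sigma2
Q[0, 1:] = (tau2 / sigma2) * (np.ones(N) @ Xstar)
Q[1:, 1:] = np.eye(p - 1)
print(np.allclose(V @ X, X @ Q))                # True, so OLS estimators are BLUE here
```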
4.7 Best Estimation in a Constrained Parameter Space
Recall the “constrained” problem from Section 3.9: $$ \mathbf y=\mathbf X\mathbf b+\mathbf e,\quad\text{where }\mathbf b\in\mathcal T=\left\{\mathbf b\in\R^p:\mathbf P^T\mathbf b=\delta\right\} $$ $\mathbf P$ is a fixed $p\times q$ matrix of rank $q$ and $\delta\in\mathcal C(\mathbf P^T)$ is fixed.
Recall the restricted normal equations: $$ \begin{pmatrix} \mathbf X^T\mathbf X&\mathbf P\\ \mathbf P^T&\mathbf 0 \end{pmatrix}\begin{pmatrix}\mathbf b\\\pmb\theta\end{pmatrix}= \begin{pmatrix}\mathbf X^T\mathbf y\\\delta\end{pmatrix} $$ The R.N.E.s are consistent (for all $\mathbf y\in\mathbb R^N$ and $\delta\in\mathcal C(\mathbf P^T)$). As before, let $\begin{pmatrix}\hat{\mathbf b}_H\\\hat{\pmb\theta}_H\end{pmatrix}$ denote a generic solution to the R.N.E.s.
We now show that $\pmb\lambda^T\hat{\mathbf b}_H$ is BLUE for estimable $\pmb\lambda^T\mathbf b$ in the constrained model.
Lemma 4.2: Assume that $\pmb\lambda^T\mathbf b$ is estimable in the constrained model. Then the following equations are consistent: $$ \begin{pmatrix} \mathbf X^T\mathbf X&\mathbf P\\ \mathbf P^T&\mathbf 0 \end{pmatrix}\begin{pmatrix}\mathbf v_1\\\mathbf v_2\end{pmatrix}= \begin{pmatrix}\pmb\lambda\\\mathbf 0\end{pmatrix}\quad(\star) $$ proof: Result 3.7a implies that $\pmb\lambda=\mathbf X^T\mathbf a+\mathbf P\mathbf d$ for some $\mathbf a\in\R^N$ and $\mathbf d\in\R^q$.
Now consider the R.N.E.s with $\mathbf y=\mathbf a$ and $\delta=\mathbf 0$: $$ \begin{pmatrix} \mathbf X^T\mathbf X&\mathbf P\\ \mathbf P^T&\mathbf 0 \end{pmatrix}\begin{pmatrix}\mathbf w_1\\\mathbf w_2\end{pmatrix}= \begin{pmatrix}\mathbf X^T\mathbf a\\\mathbf 0\end{pmatrix} $$ Since $\mathbf 0\in\mathcal C(\mathbf P^T)$, these equations are consistent. Let $\mathbf w^\star =\begin{pmatrix}\mathbf w_1^\star\\\mathbf w^\star_2\end{pmatrix}$ be a solution. Then $$ \begin{pmatrix} \mathbf X^T\mathbf X&\mathbf P\\ \mathbf P^T&\mathbf 0 \end{pmatrix}\begin{pmatrix}\mathbf w_1^\star\\\mathbf w_2^\star+\mathbf d\end{pmatrix}= \begin{pmatrix}\mathbf X^T\mathbf X\mathbf w^*_1+\mathbf P(\mathbf w^\star_2+\mathbf d)\\\mathbf P^T\mathbf w_1^\star\end{pmatrix} =\begin{pmatrix}\mathbf X^T\mathbf a+\mathbf P\mathbf d\\\mathbf 0\end{pmatrix}=\begin{pmatrix}\pmb\lambda\\\mathbf 0\end{pmatrix}. $$
Lemma 4.3: $\pmb\lambda^T\hat{\mathbf b}_H$ is a linear unbiased estimator of $\pmb\lambda^T\mathbf b$ (in the constrained model).
proof: Let $\mathbf v^\star =\begin{pmatrix}\mathbf v^\star_1\\\mathbf v^\star_2\end{pmatrix}$ be a solution to ($\star$). Then $$ \begin{aligned} \pmb\lambda^T\hat{\mathbf b}_H&=\begin{pmatrix}\pmb\lambda^T&\mathbf 0\end{pmatrix}\begin{pmatrix}\hat{\mathbf b}_H\\\hat{\pmb\theta}_H\end{pmatrix}\stackrel{(\star)}{=}\begin{pmatrix}\mathbf v^{\star T}_1&\mathbf v^{\star T}_2\end{pmatrix}\begin{pmatrix}\mathbf X^T\mathbf X&\mathbf P\\\mathbf P^T&\mathbf 0\end{pmatrix}\begin{pmatrix}\hat{\mathbf b}_H\\\hat{\pmb\theta}_H\end{pmatrix}\\ &=\begin{pmatrix}\mathbf v^{\star T}_1&\mathbf v^{\star T}_2\end{pmatrix}\begin{pmatrix}\mathbf X^T\mathbf y\\\delta\end{pmatrix}=\mathbf v^{\star T}_1\mathbf X^T\mathbf y+\mathbf v^{\star T}_2\delta\to\text{linear estimator} \end{aligned} $$
$$ \begin{aligned} \mathbb E(\pmb\lambda^T\hat{\mathbf b}_H)&=\mathbb E(\mathbf v^{\star T}_1\mathbf X^T\mathbf y+\mathbf v^{\star T}_2\delta)\\ &=\mathbf v^{\star T}_1\mathbf X^T\mathbf X\mathbf b+\mathbf v^{\star T}_2\delta\\ &\stackrel{(\star)}{=}(\pmb\lambda-\mathbf P\mathbf v^{\star }_2)^T\mathbf b+\mathbf v^{\star T}_2\delta\\ &=\pmb\lambda^T\mathbf b-\mathbf v^{\star T}_2(\mathbf P^T\mathbf b-\delta)\\ &=\pmb\lambda^T\mathbf b. \end{aligned} $$
Result 4.5: Under the G-M assumptions, $\pmb\lambda^T\hat{\mathbf b}_H$ is the BLUE of the estimable function $\pmb\lambda^T\mathbf b$ in the constrained model.
proof: Let $\mathbf a^T\mathbf y+c$ be an unbiased estimator of $\pmb\lambda^T\mathbf b$ under the constraint. Again, $\pmb\lambda=\mathbf X^T\mathbf a+\mathbf P\mathbf d$ for some $\mathbf a\in\R^N$ and $\mathbf d\in\R^q$. Now, $$ Var(\mathbf a^T\mathbf y+c)=Var(\pmb\lambda^T\hat{\mathbf b}_H)+Var(\mathbf a^T\mathbf y+c-\pmb\lambda^T\hat{\mathbf b}_H)+2Cov(\pmb\lambda^T\hat{\mathbf b}_H,\mathbf a^T\mathbf y+c-\pmb\lambda^T\hat{\mathbf b}_H) $$ From the previous proof, we know that $\pmb\lambda^T\hat{\mathbf b}_H=\mathbf v^{\star T}_1\mathbf X^T\mathbf y+\mathbf v^{\star T}_2\delta$. Thus, $$ \begin{aligned} Cov(\pmb\lambda^T\hat{\mathbf b}_H,\mathbf a^T\mathbf y+c-\pmb\lambda^T\hat{\mathbf b}_H)&=Cov(\mathbf v_1^{\star T}\mathbf X^T\mathbf y,\mathbf a^T\mathbf y-\mathbf v_1^{\star T}\mathbf X^T\mathbf y)\\ &=Cov(\mathbf v^{\star T}_1\mathbf X^T\mathbf y,(\mathbf a-\mathbf X\mathbf v_1^{\star })^T\mathbf y)\\ &=\mathbf v_1^{\star T}\mathbf X^TCov(\mathbf y)(\mathbf a-\mathbf X\mathbf v_1^{\star })\\ &=\sigma^2\mathbf v_1^{\star T}(\mathbf X^T\mathbf a-\mathbf X^T\mathbf X\mathbf v_1^{\star })\\ &\stackrel{(\star)}{=}\sigma^2\mathbf v_1^{\star T}(\pmb\lambda-\mathbf P\mathbf d-(\pmb\lambda-\mathbf P\mathbf v_2^{\star }))\\ &=\sigma^2\mathbf v_1^{\star T}\mathbf P(\mathbf v_2^{\star }-\mathbf d)\\ &\stackrel{(\star)}{=}0. \end{aligned} $$ Hence $Var(\mathbf a^T\mathbf y+c)\geq Var(\pmb\lambda^T\hat{\mathbf b}_H)$.
Chapter 5: Distributional Theory
5.2 Multivariate Normal Distribution
Let $X\sim\mathcal N(\mu,\sigma^2)$ and $Y=aX+b$ for $a,b\in\mathbb R$. Then $Y\sim\mathcal N(a\mu+b,a^2\sigma^2)$.
What about the multivariate version of this?
In math statistics class, we defined the MVN distribution through the density. We said $\mathbf X\sim\mathcal N_p(\pmb\mu,\mathbf V)$, where $\pmb\mu\in\R^p$ and $\mathbf V_{p\times p}$ is positive definite, if $$ p_\mathbf X(\mathbf x)=\frac{1}{(2\pi)^{p/2}|\mathbf V|^{1/2}}e^{-\frac{1}{2}(\mathbf x-\pmb\mu)^T\mathbf V^{-1}(\mathbf x-\pmb\mu)}. $$ Now, what about a linear function of $\mathbf X$? Suppose $\mathbf A_{q\times p}$ and $\mathbf b\in\R^q$. Define $\mathbf Y=\mathbf A\mathbf X+\mathbf b$. Is $\mathbf Y$ MVN? $$ Cov(\mathbf Y)=\mathbf ACov(\mathbf X)\mathbf A^T=\mathbf A\mathbf V\mathbf A^T $$ If $q>p$, then $\mathbf A\mathbf V\mathbf A^T$ cannot be positive definite since $rank(\mathbf A\mathbf V\mathbf A^T)<q$.
So $\mathbf Y=\mathbf A\mathbf X+\mathbf b$ need not be MVN in our old “density” sense.
However, $\mathbf Y=\mathbf A\mathbf X+\mathbf b$ is MVN in a more general sense.
Def: The mgf of a $p$-dimensional random vector $\mathbf X$ is defined as $$ M_\mathbf X(\mathbf t)=\mathbb Ee^{\mathbf t^T\mathbf X} $$ provided there exists $h>0$ such that this expectation exists (is finite) for all $\mathbf t\in S$, where $S:=\{\mathbf t\in\mathbb R^p:t_i\in(-h,h),i=1,2,\cdots,p\}$.
Def: Let $\mathbf Z=(Z_1,\cdots,Z_p)^T$ where $\{Z_i\}_{i=1}^p$ are i.i.d. $\mathcal N(0,1)$. Then $\mathbf Z$ has the standard multivariate normal distribution, denoted by $\mathcal N_p(\mathbf 0,\mathbf I)$. Its pdf is given by $$ p_\mathbf Z(\mathbf z)=(2\pi)^{-p/2}e^{-\frac{1}{2}\mathbf z^T\mathbf z} $$ Q: What is the mgf of $\mathbf Z$ (assuming existence)?
Fix $\mathbf t\in\R^p$, $$ \begin{aligned} M_\mathbf Z(\mathbf t)=\mathbb Ee^{\mathbf t^T\mathbf Z}&=\int_{\R^p}(2\pi)^{-p/2}e^{-\frac{1}{2}\mathbf z^T\mathbf z+\mathbf t^T\mathbf z}\,d\mathbf z\\ &=\int_{\mathbb R^p}(2\pi)^{-p/2}e^{-\frac{1}{2}\sum_{i=1}^p(z_i^2-2t_iz_i)}\,d\mathbf z\\ &=\prod_{i=1}^p\int_\R(2\pi)^{-1/2}e^{-\frac{1}{2}(z_i^2-2t_iz_i)}\,dz_i\\ &=\prod_{i=1}^p\int_\mathbb R(2\pi)^{-1/2}e^{-\frac{1}{2}(z_i-t_i)^2}e^{\frac{1}{2}t_i^2}\,dz_i\\ &=\prod_{i=1}^pe^{\frac{1}{2}t_i^2}=e^{\frac{1}{2}\mathbf t^T\mathbf t} \end{aligned} $$ Note that the expectation exists (is finite) for all $\mathbf t\in\mathbb R^p$.
Result 5.1: Let $X_1$, $X_2$ be two r.v.s. If the mgf’s exist, and $M_{X_1}(t)=M_{X_2}(t)$ for all $t$ in an open square around the origin, then $X_1\stackrel{d}{=}X_2$.
Result 5.2: Suppose $X_1,X_2,\cdots, X_p$ have mgfs $M_{X_i}(t_i),i=1,2,\cdots,p$. Let $X=(X^T_1,X_2^T,\cdots,X_p^T)^T$ have mgf $M_X(t)$, where $t=(t_1^T,t_2^T,\cdots,t_p^T)^T$. Then $X_1,X_2,\cdots,X_p$ are mutually independent iff $$ M_X(t)=\prod_{i=1}^pM_{X_i}(t_i) $$ for all $t$ in an open square around the origin.
Now suppose $\mathbf Z\sim\mathcal N_p(\mathbf 0,\mathbf I)$ and let $\mathbf X=\pmb\mu+\mathbf A\mathbf Z$. Then $$ M_\mathbf X(\mathbf t)=\mathbb E\left[e^{\mathbf t^T(\pmb\mu+\mathbf A\mathbf Z)}\right]=e^{\mathbf t^T\pmb\mu}\mathbb E[e^{\mathbf t^T\mathbf A\mathbf Z}]\\=e^{\mathbf t^T\pmb\mu}\mathbb E[e^{(\mathbf A^T\mathbf t)^T\mathbf Z}]=e^{\mathbf t^T\pmb\mu}M_\mathbf Z(\mathbf A^T\mathbf t)=e^{\mathbf t^T\pmb\mu+\frac{1}{2}\mathbf t^T\mathbf A\mathbf A^T\mathbf t}. $$ Thus, the distribution of $\mathbf X=\pmb\mu+\mathbf A\mathbf Z$ depends on $\mathbf A$ only through $\mathbf A\mathbf A^T$. Indeed, if $\mathbf B\mathbf B^T=\mathbf A\mathbf A^T$, then $\pmb\mu+\mathbf A\mathbf Z\stackrel{d}{=}\pmb\mu+\mathbf B\mathbf Z$. Of course, $\mathbf A\mathbf A^T=Cov(\mathbf X)$.
Def: Let $\pmb\mu\in\R^p$ and let $\mathbf V_{p\times p}$ be non-negative definite. We say that $\mathbf X\sim\mathcal N_p(\pmb\mu,\mathbf V)$ if $$ M_\mathbf X(\mathbf t)=e^{\mathbf t^T\pmb\mu+\frac{1}{2}\mathbf t^T\mathbf V\mathbf t}. $$ If $\mathbf V$ is not positive definite, then $\mathbf X$ does not have a density.
Example: $\mathbf Z\sim\mathcal N_2(\mathbf 0,\mathbf I)$. $\mathbf X=\mathbf A\mathbf Z$ where $\mathbf A=\begin{pmatrix}1&0\\0&0\end{pmatrix}$. Then $$ \begin{pmatrix}X_1\\X_2\end{pmatrix}=\begin{pmatrix}1&0\\0&0\end{pmatrix}\begin{pmatrix}Z_1\\Z_2\end{pmatrix}=\begin{pmatrix}Z_1\\0\end{pmatrix} $$
$$ \mathbf A\mathbf A^T=\begin{pmatrix}1&0\\0&0\end{pmatrix}\begin{pmatrix}1&0\\0&0\end{pmatrix}=\begin{pmatrix}1&0\\0&0\end{pmatrix} $$
$$ \mathbf X\sim\mathcal N_2\left(\mathbf 0,\begin{pmatrix}1&0\\0&0\end{pmatrix}\right) $$
$\mathbf X$ takes values in the set $L=\{(u,v)\in\R^2:v=0\}=\R\times\{0\}$.
Note that, if $p(x,y)$ is a density on $\mathbb R^2$, that is, $p(x,y)\geq0$ and $\int_{\R^2}p(x,y)dxdy=1$, then $\int_Lp(x,y)dxdy=0$. So $\mathbf X$ cannot have a density on $\mathbb R^2$.
Result 5.3: If $\mathbf X\sim\mathcal N_p(\pmb\mu,\mathbf V)$ and $\mathbf Y=\mathbf a+\mathbf B\mathbf X$ where $\mathbf a_{q\times 1}$ and $\mathbf B_{q\times p}$, then $$ \mathbf Y\sim\mathcal N_q(\mathbf a+\mathbf B\pmb\mu,\mathbf B\mathbf V\mathbf B^T). $$ proof: $M_\mathbf Y(\mathbf t)=\mathbb Ee^{\mathbf t^T\mathbf Y}=\mathbb Ee^{\mathbf t^T(\mathbf a+\mathbf B\mathbf X)}=e^{\mathbf t^T\mathbf a}\mathbb Ee^{\mathbf t^T\mathbf B\mathbf X}=e^{\mathbf t^T\mathbf a}M_\mathbf X(\mathbf B^T\mathbf t)=e^{\mathbf t^T\mathbf a}e^{\mathbf t^T\mathbf B\pmb\mu+\frac{1}{2}\mathbf t^T\mathbf B\mathbf V\mathbf B^T\mathbf t}=e^{\mathbf t^T(\mathbf a+\mathbf B\pmb\mu)+\frac{1}{2}\mathbf t^T\mathbf B\mathbf V\mathbf B^T\mathbf t}$.
Corollary 5.1: If $\mathbf X$ is MVN, then the joint distribution of any subset of the components of $\mathbf X$ is also MVN.
proof: Partition $\mathbf X$, $\pmb\mu$ and $\mathbf V$ as follows: $$ \mathbf X=\begin{pmatrix}\mathbf X_1\\\mathbf X_2\end{pmatrix}\begin{matrix}p_1\\p_2\end{matrix},\quad\pmb\mu=\begin{pmatrix}\pmb\mu_1\\\pmb\mu_2\end{pmatrix}\begin{matrix}p_1\\p_2\end{matrix},\quad\mathbf V=\begin{pmatrix}\mathbf V_{11}&\mathbf V_{12}\\\mathbf V_{21}&\mathbf V_{22}\end{pmatrix}\begin{matrix}p_1\\p_2\end{matrix} $$ Now, let $\mathbf W=\mathbf B\mathbf X$ where $\mathbf B=\begin{pmatrix}\mathbf I_{p_1}&\mathbf 0\end{pmatrix}$, so $\mathbf W=\mathbf X_1$. Then from Result 5.3, we have $$ \mathbf X_1\sim\mathcal N_{p_1}\left(\begin{pmatrix}\mathbf I_{p_1}&\mathbf 0\end{pmatrix}\begin{pmatrix}\pmb\mu_1\\\pmb\mu_2\end{pmatrix},\begin{pmatrix}\mathbf I_{p_1}&\mathbf 0\end{pmatrix}\begin{pmatrix}\mathbf V_{11}&\mathbf V_{12}\\\mathbf V_{21}&\mathbf V_{22}\end{pmatrix}\begin{pmatrix}\mathbf I_{p_1}\\\mathbf 0\end{pmatrix}\right)\Rightarrow\mathbf X_1\sim\mathcal N_{p_1}(\pmb\mu_1,\mathbf V_{11}) $$
Corollary 5.2: Suppose $\mathbf X\sim\mathcal N_p(\pmb\mu,\mathbf V)$ and $\mathbf V$ is positive definite. Then
- $\exist$ nonsingular $\mathbf A$ s.t. $\mathbf V=\mathbf A\mathbf A^T$
- $\mathbf A^{-1}(\mathbf X-\pmb\mu)\sim\mathcal N_p(\mathbf 0,\mathbf I_p)$
- the pdf of $\mathbf X$ is $(2\pi)^{-p/2}|\mathbf V|^{-1/2}e^{-\frac{1}{2}(\mathbf x-\pmb\mu)^T\mathbf V^{-1}(\mathbf x-\pmb\mu)}$.
proof is HW.
Result 5.4: Assume $\mathbf X\sim\mathcal N_p(\pmb\mu,\mathbf V)$. Partition as follows $$ \mathbf X=\begin{pmatrix}\mathbf X_1\\\mathbf X_2\\\vdots\\\mathbf X_m\end{pmatrix}\begin{matrix}p_1\\p_2\\\vdots\\p_m\end{matrix}, \quad\pmb\mu=\begin{pmatrix}\pmb\mu_1\\\pmb\mu_2\\\vdots\\\pmb\mu_m\end{pmatrix}\begin{matrix}p_1\\p_2\\\vdots\\p_m\end{matrix}, \quad\mathbf V=\begin{pmatrix}\mathbf V_{11}&\mathbf V_{12}&\cdots&\mathbf V_{1m}\\\mathbf V_{21}&\mathbf V_{22}&\cdots&\mathbf V_{2m}\\\vdots&\vdots&\ddots&\vdots\\\mathbf V_{m1}&\mathbf V_{m2}&\cdots&\mathbf V_{mm}\end{pmatrix}\begin{matrix}p_1\\p_2\\\vdots\\p_m\end{matrix} $$ Then $\mathbf X_1,\mathbf X_2,\cdots,\mathbf X_m$ are mutually independent $\Leftrightarrow$ $\mathbf V_{ij}=\mathbf 0$, $\forall i\not=j$.
proof: ($\Rightarrow$) $\mathbf V_{ij}=\mathbb E\left[(\mathbf X_i-\pmb\mu_i)(\mathbf X_j-\pmb\mu_j)^T\right]\stackrel{ind}{=}\left[\mathbb E(\mathbf X_i-\pmb\mu_i)\right]\left[\mathbb E(\mathbf X_j-\pmb\mu_j)\right]^T=\mathbf 0$.
($\Leftarrow$) $M_\mathbf X(\mathbf t)=e^{\mathbf t^T\pmb\mu+\frac{1}{2}\mathbf t^T\mathbf V\mathbf t}=e^{\sum_{i=1}^m\mathbf t_i^T\pmb\mu_i+\frac{1}{2}\sum_{i=1}^m\mathbf t_i^T\mathbf V_{ii}\mathbf t_i}=\prod_{i=1}^me^{\mathbf t_i^T\pmb\mu_i+\frac{1}{2}\mathbf t^T_i\mathbf V_{ii}\mathbf t_i}=\prod_{i=1}^mM_{\mathbf X_i}(\mathbf t_i)$. Result now follows from Result 5.2.
Corollary 5.3: $\mathbf X\sim\mathcal N_p(\pmb\mu,\mathbf V)$. Let $\mathbf Y_1=\mathbf a_1+\mathbf B_1\mathbf X$ and $\mathbf Y_2=\mathbf a_2+\mathbf B_2\mathbf X$. Then $\mathbf Y_1$ and $\mathbf Y_2$ are independent iff $\mathbf B_1\mathbf V\mathbf B_2^T=\mathbf 0$.
proof: $\mathbf Y=\begin{pmatrix}\mathbf Y_1\\\mathbf Y_2\end{pmatrix}=\begin{pmatrix}\mathbf a_1\\\mathbf a_2\end{pmatrix}+\begin{pmatrix}\mathbf B_1\\\mathbf B_2\end{pmatrix}\mathbf X$. According to Result 5.3, we have $$ \mathbf Y\sim\mathcal N\left(\begin{pmatrix}\mathbf a_1\\\mathbf a_2\end{pmatrix}+\begin{pmatrix}\mathbf B_1\\\mathbf B_2\end{pmatrix}\pmb\mu,\begin{pmatrix}\mathbf B_1\\\mathbf B_2\end{pmatrix}\mathbf V\begin{pmatrix}\mathbf B_1&\mathbf B_2\end{pmatrix}^T\right)\\=\mathcal N\left(\begin{pmatrix}\mathbf a_1\\\mathbf a_2\end{pmatrix}+\begin{pmatrix}\mathbf B_1\\\mathbf B_2\end{pmatrix}\pmb\mu,\begin{pmatrix}\mathbf B_1\mathbf V\mathbf B_1^T&\mathbf B_1\mathbf V\mathbf B_2^T\\\mathbf B_2\mathbf V\mathbf B_1^T&\mathbf B_2\mathbf V\mathbf B_2^T\end{pmatrix}\right) $$ Result follows immediately from Result 5.4.
5.3 Chi-Square and Related Distributions
Suppose that $\mathbf Z\sim\mathcal N_p(\mathbf 0,\mathbf I)$. Define $U=\mathbf Z^T\mathbf Z=\sum_{i=1}^pZ_i^2$. $$ \begin{aligned} M_U(t)&=\mathbb Ee^{t\sum_{i=1}^pZ_i^2}=\mathbb E\prod_{i=1}^pe^{tZ_i^2}=\prod_{i=1}^p\mathbb Ee^{tZ_i^2}\\ &=\prod_{i=1}^p\int_\R(2\pi)^{-1/2}e^{-\frac{1}{2}(z_i^2-2tz_i^2)}dz_i\\ &=\prod_{i=1}^p\int_\R(2\pi)^{-1/2}e^{-\frac{1}{2}(1-2t)z_i^2}dz_i\\ &=\prod_{i=1}^p(1-2t)^{-1/2}\\&=(1-2t)^{-\frac{p}{2}}\quad(t<\frac{1}{2}) \end{aligned} $$ This implies that $$ U\sim \text{Gamma}(\frac{p}{2},2)=\chi_p^2\\ p_U(u)=\frac{1}{\Gamma(p/2)2^{p/2}}u^{\frac{p}{2}-1}e^{-\frac{u}{2}}I_{\mathbb R^+}(u). $$ Def 5.6: Let $J\sim $ Poisson($\phi$) ($\phi>0$). Let $U|J=j\sim\chi^2_{p+2j}$. Then $U\sim\chi^2_{p}(\phi)$, that is, noncentral $\chi^2$ with $p$ degrees of freedom and noncentrality parameter $\phi$. The pdf of $U\sim\chi_p^2(\phi)$ is given by $$ p_U(u)=\sum_{j=0}^\infty\left[\frac{e^{-\phi}\phi^j}{j!}\right]\left[\frac{u^{\frac{p+2j-2}{2}}e^{-\frac{u}{2}}}{\Gamma(\frac{p+2j}{2})2^{\frac{p+2j}{2}}}\right]I_{\R^+}(u). $$ Result 5.5: If $U\sim\chi_p^2(\phi)$, then $$ M_U(t)=(1-2t)^{-\frac{p}{2}}e^{\frac{2\phi t}{1-2t}} $$ proof: $$ \begin{aligned} M_U(t)&=\mathbb Ee^{tU}=\mathbb E[\mathbb E[e^{tU}|J]]=\mathbb E[(1-2t)^{-(p+2J)/2}]\quad(t<\frac{1}{2})\\ &=\sum_{j=0}^\infty (1-2t)^{-(p+2j)/2}\left[\frac{e^{-\phi}\phi^j}{j!}\right]\\ &=(1-2t)^{-p/2}e^{-\phi}\sum_{j=0}^\infty\frac{\left(\frac{\phi}{1-2t}\right)^j}{j!}\\ &=(1-2t)^{-p/2}e^{-\phi}e^{\frac{\phi}{1-2t}}\\ &=(1-2t)^{-p/2}e^{\frac{2\phi t}{1-2t}}. \end{aligned} $$
Result 5.6: If $U\sim\chi_p^2(\phi)$, then $\mathbb EU=p+2\phi$, $Var(U)=2p+8\phi$.
proof: Easy; use $\mathbb EU=\mathbb E[\mathbb E[U|J]]$ and $Var(U)=Var[\mathbb E[U|J]]+\mathbb E[Var[U|J]]$.
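*Numerical sketch: the Poisson-mixture construction in Def 5.6 is easy to sanity-check by simulation. Below is a minimal Python sketch (assuming numpy; the values of $p$ and $\phi$ are just illustrative) that draws $U$ by first drawing $J\sim$ Poisson$(\phi)$ and then $U|J\sim\chi^2_{p+2J}$, and compares the empirical mean, variance, and MGF with Results 5.5 and 5.6.

```python
import numpy as np

rng = np.random.default_rng(0)
p, phi = 4, 2.5            # degrees of freedom and noncentrality (illustrative values)
n = 200_000

# Draw U via Def 5.6: J ~ Poisson(phi), then U | J = j ~ chi^2_{p + 2j}
J = rng.poisson(phi, size=n)
U = rng.chisquare(df=p + 2 * J)

# Result 5.6: E U = p + 2 phi, Var U = 2p + 8 phi
print(U.mean(), p + 2 * phi)
print(U.var(), 2 * p + 8 * phi)

# Result 5.5: M_U(t) = (1 - 2t)^{-p/2} exp(2 phi t / (1 - 2t)), valid for t < 1/2
for t in (0.05, 0.1, 0.15):
    print(np.mean(np.exp(t * U)),
          (1 - 2 * t) ** (-p / 2) * np.exp(2 * phi * t / (1 - 2 * t)))
```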
Result 5.7: If $\{U_i\}_{i=1}^m$ are independent $\chi_{p_i}^2(\phi_i)$, then $U=\sum_{i=1}^mU_i\sim\chi^2_{\sum p_i}(\sum\phi_i)$.
proof: Easy; $M_U(t)=\prod_{i=1}^mM_{U_i}(t)$.
Result 5.8: If $X\sim\mathcal N(\mu,1)$, then $U=X^2\sim\chi^2_1(\mu^2/2)$.
proof: $$ \begin{aligned} M_U(t)&=\mathbb Ee^{tX^2}=\int_\R \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}(x-\mu)^2}e^{tx^2}\,dx\\ &=e^{-\mu^2/2}\int_\R \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}(x^2-2x\mu-2tx^2)}\,dx\\ &=e^{-\mu^2/2+\mu^2/(2(1-2t))}\int_\R\frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}(1-2t)\left(x-\frac{\mu}{1-2t}\right)^2}\,dx\\ &=(1-2t)^{-\frac{1}{2}}e^{\frac{\mu^2}{2}\frac{2t}{1-2t}}. \end{aligned} $$
Result 5.9: If $\mathbf X\sim\mathcal N_p(\pmb\mu,\mathbf I)$, then $W=\mathbf X^T\mathbf X\sim\chi^2_p(\pmb\mu^T\pmb\mu/2)$.
proof: $W=\sum_{i=1}^pU_i$ where $U_i\sim\chi^2_1(\mu_i^2/2)$ $\Rightarrow W\sim\chi^2_p(\sum\mu_i^2/2)$ (By Result 5.7).
Result 5.10: If $\mathbf X\sim\mathcal N_p(\pmb\mu,\mathbf V)$ where $\mathbf V$ is nonsingular, then $W=\mathbf X^T\mathbf V^{-1}\mathbf X\sim\chi^2_p(\frac{1}{2}\pmb\mu^T\mathbf V^{-1}\pmb\mu)$.
proof: $\exist$ a nonsingular matrix $\mathbf A$ s.t. $\mathbf A^T\mathbf A=\mathbf V$. Define $\mathbf Z=\mathbf A^{-T}\mathbf X$. Then $\mathbf Z\sim\mathcal N_p(\mathbf A^{-T}\pmb\mu,\mathbf A^{-T}\mathbf V\mathbf A^{-1})=\mathcal N_p(\mathbf A^{-T}\pmb\mu,\mathbf I)$. Now, by Result 5.9, we have $$ W=\mathbf X^T\mathbf V^{-1}\mathbf X=\mathbf X^T\mathbf A^{-1}\mathbf A^{-T}\mathbf X=\mathbf Z^T\mathbf Z\sim\chi^2_p\left(\frac{\pmb\mu^T\mathbf A^{-1}\mathbf A^{-T}\pmb\mu}{2}\right)\\ \Rightarrow W\sim\chi^2_p\left(\frac{1}{2}\pmb\mu^T\mathbf V^{-1}\pmb\mu\right). $$
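*Numerical sketch of Result 5.10 (assuming numpy; the particular $\pmb\mu$ and $\mathbf V$ below are arbitrary illustrations): simulate $\mathbf X\sim\mathcal N_p(\pmb\mu,\mathbf V)$ via a Cholesky factor of $\mathbf V$ and compare the first two moments of $W=\mathbf X^T\mathbf V^{-1}\mathbf X$ with those of $\chi^2_p(\frac{1}{2}\pmb\mu^T\mathbf V^{-1}\pmb\mu)$ from Result 5.6.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 3
mu = np.array([1.0, -0.5, 2.0])                 # illustrative mean
A = rng.standard_normal((p, p))
V = A @ A.T + p * np.eye(p)                     # an arbitrary positive definite V
Vinv = np.linalg.inv(V)

L = np.linalg.cholesky(V)                       # V = L L^T, so X = mu + L Z has Cov(X) = V
Z = rng.standard_normal((200_000, p))
X = mu + Z @ L.T
W = np.einsum('ni,ij,nj->n', X, Vinv, X)        # W = X^T V^{-1} X, one value per draw

phi = 0.5 * mu @ Vinv @ mu                      # noncentrality from Result 5.10
print(W.mean(), p + 2 * phi)                    # moments of chi^2_p(phi) from Result 5.6
print(W.var(), 2 * p + 8 * phi)
```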
Result 5.11: Let $U\sim\chi^2_p(\phi)$. Then $P(U>c)$ is strictly increasing in $\phi$ for fixed $c$ and $p$.
*More explanation: Suppose $\phi'>\phi$, $U'\sim\chi^2_p(\phi')$, $U\sim\chi^2_p(\phi)$. Then $U'$ is stochastically larger than $U$: $P(U'>c)>P(U>c)$ for each fixed $c>0$, i.e., $F_{U'}(c)<F_U(c)$.
proof: Define $$ v_k=P(\chi^2_k>c)=\int_c^\infty \frac{v^{k/2-1}e^{-v/2}}{\Gamma(k/2)2^{k/2}}\,dv $$ Claim $v_{k+2}>v_k$: $$ \begin{aligned} v_{k+2}&=\int_c^\infty \frac{v^{(k+2)/2-1}e^{-v/2}}{\Gamma((k+2)/2)2^{(k+2)/2}}\,dv\\ &=\frac{1}{k\Gamma(k/2)2^{k/2}}\int_c^\infty v^{k/2}e^{-v/2}\,dv\\ &=\frac{1}{k\Gamma(k/2)2^{k/2}}\left[-2v^{k/2}e^{-v/2}\bigg|_c^\infty +k\int_c^\infty v^{k/2-1}e^{-v/2}\,dv\right]\\ &=\frac{2c^{k/2}e^{-c/2}}{k\Gamma(k/2)2^{k/2}}+\frac{1}{\Gamma(k/2)2^{k/2}}\int_c^\infty v^{k/2-1}e^{-v/2}\,dv\\ &=\text{something positive} + v_k \end{aligned} $$ So claim is established.
Now $P(U>c)=\sum_{j=0}^\infty\left[\frac{e^{-\phi}\phi^j}{j!}\right]v_{p+2j}$ and $$ \begin{aligned} \frac{d}{d\phi}P(U>c)&=\frac{d}{d\phi}\left[e^{-\phi}v_p+\sum_{j=1}^\infty\left[\frac{e^{-\phi}\phi^j}{j!}\right]v_{p+2j}\right]\\ &=-e^{-\phi}v_p+\sum_{j=1}^\infty\left[\frac{e^{-\phi}\phi^{j-1}}{(j-1)!}-\frac{e^{-\phi}\phi^j}{j!}\right]v_{p+2j}\\ &=\sum_{j=1}^\infty\frac{e^{-\phi}\phi^{j-1}}{(j-1)!}v_{p+2j}-\sum_{j=0}^\infty\frac{e^{-\phi}\phi^j}{j!}v_{p+2j}\\ &=\sum_{k=0}^\infty\frac{e^{-\phi}\phi^k}{k!}v_{p+2(k+1)}-\sum_{k=0}^\infty\frac{e^{-\phi}\phi^k}{k!}v_{p+2k}\\ &=\sum_{k=0}^\infty\frac{e^{-\phi}\phi^k}{k!}[v_{p+2(k+1)}-v_{p+2k}]\\ &>0\quad(\text{By claim}). \end{aligned} $$
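*Numerical sketch: Result 5.11 can also be seen from the series $P(U>c)=\sum_j\frac{e^{-\phi}\phi^j}{j!}v_{p+2j}$ used in the proof. A minimal Python sketch (assuming scipy; `tail_prob` and the chosen $c$, $p$ are just illustrative, and the series is truncated):

```python
import numpy as np
from scipy.stats import chi2, poisson

def tail_prob(c, p, phi, j_max=200):
    """P(U > c) for U ~ chi^2_p(phi), via the Poisson-mixture series (truncated at j_max)."""
    j = np.arange(j_max)
    return np.sum(poisson.pmf(j, phi) * chi2.sf(c, p + 2 * j))

c, p = 5.0, 3
for phi in (0.0, 0.5, 1.0, 2.0, 4.0):
    print(phi, tail_prob(c, p, phi))    # strictly increasing in phi, as Result 5.11 says
```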
Def 5.7: $U_1\perp U_2$, $U_1\sim\chi^2_{p_1}$, $U_2\sim\chi^2_{p_2}$. Then $F=\frac{U_1/p_1}{U_2/p_2}\sim F_{p_1,p_2}$.
Def 5.8: $U_1\perp U_2$, $U_1\sim\chi^2_{p_1}(\phi)$, $U_2\sim\chi^2_{p_2}$. Then $F=\frac{U_1/p_1}{U_2/p_2}\sim F_{p_1,p_2}(\phi)$ (Noncentral F).
Result 5.13: $W\sim F_{p_1,p_2}(\phi)$. For fixed $p_1,p_2$ and $c>0$, $P(W>c)$ is strictly increasing in $\phi$.
*In other words, if $\phi'>\phi$ and $W'\sim F_{p_1,p_2}(\phi')$ and $W\sim F_{p_1,p_2}(\phi)$, then $W'$ is stochastically larger than $W$.
proof: $$ \begin{aligned} P(W>c)&=P\left(\frac{U_1/p_1}{U_2/p_2}>c\right)=P\left(U_1>\frac{p_1}{p_2}cU_2\right)\\ &=\int_0^\infty P\left(U_1>\frac{p_1}{p_2}cU_2\bigg|U_2=u_2\right)f_{U_2}(u_2)\,du_2\\ &=\int_0^\infty P\left(U_1>\frac{p_1}{p_2}cu_2\right)f_{U_2}(u_2)\,du_2 \end{aligned} $$ $W'=\frac{U_1'/p_1}{U_2/p_2}$, $W=\frac{U_1/p_1}{U_3/p_2}$, $U_1\sim\chi^2_{p_1}(\phi)$, $U'_1\sim\chi^2_{p_1}(\phi')$, $U_2\sim\chi^2_{p_2}$, $U_3\sim\chi^2_{p_2}$. $$ P(W'>c)-P(W>c)=\int_0^\infty\underbrace{\left[P\left(U'_1>\frac{p_1}{p_2}cu_2\right)-P\left(U_1>\frac{p_1}{p_2}cu_2\right)\right]}_{>0\text{ for all }u_2>0\text{ due to Result 5.11}}f_{U_2}(u_2)\,du_2>0. $$
Def 5.9: $U\sim \mathcal N(\mu,1)$ and $V\sim\chi^2_k$, $U\perp V$. Then $T=\frac{U}{\sqrt{V/k}}\sim t_k(\mu)$, the noncentral Student's $t$ distribution with $k$ d.f. and noncentrality parameter $\mu$. If $\mu=0$, this is the usual (central) Student's $t$ distribution.
*Note that $T\sim t_k(\mu)$, $T=\frac{U}{\sqrt{V/k}}$, $T^2=\frac{U^2}{V/k}\sim \frac{\chi^2_1(\mu^2/2)}{\chi^2_k}=F_{1,k}(\mu^2/2)$.
5.4 Distribution of Quadratic Forms
Lemma 5.1: Let $\mathbf A$ be a $p\times p$ symmetric matrix. Then $\mathbf A$ is idempotent with rank $s$ iff $\exist $ a $p\times s$ matrix $\mathbf G$ s.t. $\mathbf G^T\mathbf G=\mathbf I_s$ and $\mathbf G\mathbf G^T=\mathbf A$.
proof: ($\Leftarrow$) $\mathbf A^2=\mathbf G\mathbf G^T\mathbf G\mathbf G^T=\mathbf G\mathbf G^T=\mathbf A$ so $\mathbf A$ is idempotent. $s=rank(\mathbf G^T\mathbf G)=rank(\mathbf G\mathbf G^T)=rank(\mathbf A)$.
($\Rightarrow$) Since $\mathbf A^2=\mathbf A$, all eigenvalues of $\mathbf A$ are either $0$ or $1$, and $rank(\mathbf A)$ is equal to the number of nonzero eigenvalues. Since $\mathbf A$ is symmetric, we have a spectral decomposition $$ \mathbf A=\mathbf Q\pmb\Lambda\mathbf Q^T=\begin{pmatrix}\mathbf Q_1&\mathbf Q_2\end{pmatrix}\begin{pmatrix}\mathbf I_s&\mathbf 0\\\mathbf 0&\mathbf 0\end{pmatrix}\begin{pmatrix}\mathbf Q_1^T\\\mathbf Q_2^T\end{pmatrix}=\mathbf Q_1\mathbf Q_1^T $$ $\mathbf Q_1^T\mathbf Q_1=\mathbf I_s$ because the columns of $\mathbf Q_1$ are orthonormal. So we may take $\mathbf G=\mathbf Q_1$.
Result 5.14: Let $\mathbf X\sim\mathcal N_p(\pmb\mu,\mathbf I)$. If $\mathbf A$ is symmetric and idempotent with rank $s$, then $$ \mathbf X^T\mathbf A\mathbf X\sim\chi^2_s\left(\frac{1}{2}\pmb\mu^T\mathbf A\pmb\mu\right). $$ proof: By Lemma 5.1, $\exist$ $\mathbf G$ s.t. $\mathbf G^T\mathbf G=\mathbf I_s$ and $\mathbf G\mathbf G^T=\mathbf A$.
Now, $\mathbf G^T\mathbf X\sim\mathcal N_s(\mathbf G^T\pmb\mu,\mathbf I_s)$. Result 5.9 $\Rightarrow$ $\mathbf X^T\mathbf A\mathbf X=(\mathbf G^T\mathbf X)^T(\mathbf G^T\mathbf X)\sim\chi^2_s\left(\frac{1}{2}\pmb\mu^T\mathbf G\mathbf G^T\pmb\mu\right)=\chi^2_s\left(\frac{1}{2}\pmb\mu^T\mathbf A\pmb\mu\right)$.
Result 5.15: Let $\mathbf X\sim\mathcal N_p(\pmb\mu,\mathbf V)$ with $\mathbf V$ nonsingular. Let $\mathbf A$ be a symmetric matrix. If $\mathbf A\mathbf V$ is idempotent with rank $s$, then $$ \mathbf X^T\mathbf A\mathbf X\sim\chi^2_s\left(\frac{1}{2}\pmb\mu^T\mathbf A\pmb\mu\right) $$ proof: $\mathbf V=\pmb\Gamma\pmb\Gamma^T$. Define $\mathbf B=\pmb\Gamma^T\mathbf A\pmb\Gamma$ so that $\mathbf A=(\pmb\Gamma^T)^{-1}\mathbf B\pmb\Gamma^{-1}$. Now $\mathbf A\mathbf V=\mathbf A\mathbf V\mathbf A\mathbf V\Rightarrow \mathbf A=\mathbf A\mathbf V\mathbf A$ because $\mathbf V$ is nonsingular. Thus $\mathbf B=\pmb\Gamma^T\mathbf A\pmb\Gamma=\pmb\Gamma^T\mathbf A\mathbf V\mathbf A\pmb\Gamma=\pmb\Gamma^T\mathbf A\pmb\Gamma\pmb\Gamma^T\mathbf A\pmb\Gamma=\mathbf B^2$. So $\mathbf B$ is idempotent. $\mathbf B$ is also symmetric. Also, $rank(\mathbf B)=rank(\mathbf A)=rank(\mathbf A\mathbf V)=s$.
Now, $\mathbf Y=\pmb\Gamma^{-1}\mathbf X\sim\mathcal N_p(\pmb\Gamma^{-1}\pmb\mu,\mathbf I_p)$. Result 5.14 $\Rightarrow$ $\mathbf X^T\mathbf A\mathbf X=\mathbf X^T(\pmb\Gamma^T)^{-1}\pmb\Gamma^T\mathbf A\pmb\Gamma\pmb\Gamma^{-1}\mathbf X=\mathbf Y^T\mathbf B\mathbf Y\sim\chi^2_s\left(\frac{1}{2}\pmb\mu^T(\pmb\Gamma^{-1})^T\mathbf B\pmb\Gamma^{-1}\pmb\mu\right)=\chi^2_s\left(\frac{1}{2}\pmb\mu^T\mathbf A\pmb\mu\right)$.
Back to G-M model $\mathbf y\sim\mathcal N_N(\mathbf X\mathbf b,\sigma^2\mathbf I)$, $r=rank(\mathbf X)$. We will apply Result 5.15 with $\pmb\mu=\mathbf X\mathbf b$ and $\mathbf V=\sigma^2\mathbf I$.
First, take $\mathbf A=\frac{1}{\sigma^2}(\mathbf I-\mathbf P_\mathbf X)$. Then $\mathbf A\mathbf V=\mathbf I-\mathbf P_\mathbf X$ which is idempotent and has rank $N-r$. By Result 5.15, $\mathbf y^T\mathbf A\mathbf y=\frac{|\hat{\mathbf e}|^2}{\sigma^2}=\frac{SSE}{\sigma^2}\sim\chi^2_{N-r}$; noncentrality parameter is $0$ since $\frac{1}{2\sigma^2}(\mathbf X\mathbf b)^T(\mathbf I-\mathbf P_\mathbf X)(\mathbf X\mathbf b)=0$.
Second, take $\mathbf A=\frac{1}{\sigma^2}\mathbf P_\mathbf X$. Then $\mathbf A\mathbf V=\mathbf P_\mathbf X$, which is idempotent and has rank $r$. By Result 5.15, $\mathbf y^T\mathbf A\mathbf y=\frac{|\hat{\mathbf y}|^2}{\sigma^2}=\frac{SSR}{\sigma^2}\sim\chi^2_r\left(\frac{1}{2\sigma^2}|\mathbf X\mathbf b|^2\right)$; noncentrality parameter is $\frac{1}{2\sigma^2}(\mathbf X\mathbf b)^T\mathbf P_\mathbf X(\mathbf X\mathbf b)=\frac{1}{2\sigma^2}|\mathbf X\mathbf b|^2$.
Now note that $\begin{pmatrix}\hat{\mathbf y}\\\hat{\mathbf e}\end{pmatrix}=\begin{pmatrix}\mathbf P_\mathbf X\mathbf y\\(\mathbf I-\mathbf P_\mathbf X)\mathbf y\end{pmatrix}=\begin{pmatrix}\mathbf P_\mathbf X\\\mathbf I-\mathbf P_\mathbf X\end{pmatrix}\mathbf y$. By Result 5.3, we have $$ \begin{aligned} \begin{pmatrix}\hat{\mathbf y}\\\hat{\mathbf e}\end{pmatrix}&\sim\mathcal N_{2N}\left(\begin{pmatrix}\mathbf P_\mathbf X\\\mathbf I-\mathbf P_\mathbf X\end{pmatrix}\mathbf X\mathbf b,\begin{pmatrix}\mathbf P_\mathbf X\\\mathbf I-\mathbf P_\mathbf X\end{pmatrix}\sigma^2\mathbf I\begin{pmatrix}\mathbf P_\mathbf X&\mathbf I-\mathbf P_\mathbf X\end{pmatrix}\right)\\ &\sim\mathcal N_{2N}\left(\begin{pmatrix}\mathbf X\mathbf b\\\mathbf 0\end{pmatrix},\sigma^2\begin{pmatrix}\mathbf P_\mathbf X&\mathbf 0\\\mathbf 0&\mathbf I-\mathbf P_\mathbf X\end{pmatrix}\right) \end{aligned} $$ Corollary 5.3 $\Rightarrow$ $\hat{\mathbf y}$ and $\hat{\mathbf e}$ are independent. Hence, SSE and SSR are also independent. Thus $$ F=\frac{|\hat{\mathbf y}|^2/r}{|\hat{\mathbf e}|^2/(N-r)}\sim F_{r,N-r}\left(\frac{1}{2\sigma^2}|\mathbf X\mathbf b|^2\right) $$
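*Numerical sketch of the decomposition above under the G-M model (assuming numpy; the design matrix, `b`, and `sigma` below are arbitrary illustrations). `np.linalg.pinv` is used so that $\mathbf X\mathbf X^+=\mathbf X(\mathbf X^T\mathbf X)^g\mathbf X^T=\mathbf P_\mathbf X$:

```python
import numpy as np

rng = np.random.default_rng(2)
N, sigma = 60, 1.5
# An arbitrary illustrative design: intercept plus three group indicators (rank r = 3)
X = np.column_stack([np.ones(N), np.kron(np.eye(3), np.ones((20, 1)))])
b = np.array([1.0, 0.5, -0.2, 0.3])
y = X @ b + sigma * rng.standard_normal(N)

P = X @ np.linalg.pinv(X)                       # P_X, the projection onto C(X)
r = np.linalg.matrix_rank(X)
SSR = y @ P @ y                                 # |y_hat|^2
SSE = y @ (np.eye(N) - P) @ y                   # |e_hat|^2
F = (SSR / r) / (SSE / (N - r))                 # ~ F_{r, N-r}(|X b|^2 / (2 sigma^2))
print(r, SSE / (N - r), F)                      # SSE/(N-r) estimates sigma^2
```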
Result 5.16: Let $\mathbf X\sim\mathcal N_p(\pmb\mu,\mathbf V)$ and $\mathbf A$ is a symmetric matrix, and $\mathbf B$ is a $q\times p$ matrix. If $\mathbf B\mathbf V\mathbf A=\mathbf 0$, then $\mathbf B\mathbf X$ and $\mathbf X^T\mathbf A\mathbf X$ are independent.
proof: $\mathbf A=\mathbf Q\pmb\Lambda\mathbf Q^T=\begin{pmatrix}\mathbf Q_1&\mathbf Q_2\end{pmatrix}\begin{pmatrix}\pmb\Lambda_1&\mathbf 0\\\mathbf 0&\mathbf 0\end{pmatrix}\begin{pmatrix}\mathbf Q_1^T\\\mathbf Q^T_2\end{pmatrix}$ where $\mathbf Q^T\mathbf Q=\mathbf Q\mathbf Q^T=\mathbf I$ and $\pmb\Lambda_1$ is diagonal with $rank(\pmb\Lambda_1)=rank(\mathbf A)=s$. So $\mathbf A=\mathbf Q_1\pmb\Lambda_1\mathbf Q^T_1$.
Result 5.3 $\Rightarrow$ $$ \begin{pmatrix}\mathbf B\\\mathbf Q_1^T\end{pmatrix}\mathbf X\sim\mathcal N_{q+s}\left(\begin{pmatrix}\mathbf B\pmb\mu\\\mathbf Q^T_1\pmb\mu\end{pmatrix},\begin{pmatrix}\mathbf B\mathbf V\mathbf B^T&\mathbf B\mathbf V\mathbf Q_1\\\mathbf Q_1^T\mathbf V\mathbf B^T&\mathbf Q_1^T\mathbf V\mathbf Q_1\end{pmatrix}\right) $$ So, if $\mathbf B\mathbf V\mathbf Q_1=\mathbf 0$, then $\mathbf B\mathbf X$ and $\mathbf Q_1^T\mathbf X$ are independent. But if $\mathbf B\mathbf X$ and $\mathbf Q_1^T\mathbf X$ are independent, then so are $\mathbf B\mathbf X$ and $(\mathbf Q_1^T\mathbf X)^T\pmb\Lambda_1(\mathbf Q_1^T\mathbf X)=\mathbf X^T\mathbf A\mathbf X$.
Finally, if $\mathbf B\mathbf V\mathbf A=\mathbf 0$, then $\mathbf B\mathbf V\mathbf Q_1\pmb\Lambda_1\mathbf Q_1^T=\mathbf 0\Rightarrow\mathbf B\mathbf V\mathbf Q_1\pmb\Lambda_1\mathbf Q_1^T\mathbf Q_1=\mathbf 0\Rightarrow\mathbf B\mathbf V\mathbf Q_1\pmb\Lambda_1=\mathbf 0\Rightarrow\mathbf B\mathbf V\mathbf Q_1=\mathbf 0$.
Corollary 5.4: Let $\mathbf X\sim\mathcal N_p(\pmb\mu,\mathbf V)$. Suppose $\mathbf A$ and $\mathbf B$ are symmetric. If $\mathbf B\mathbf V\mathbf A=\mathbf 0$, then $\mathbf X^T\mathbf A\mathbf X$ and $\mathbf X^T\mathbf B\mathbf X$ are independent.
proof: HW.
5.5 Cochran’s Theorem
Theorem 5.1: Let $\mathbf y\sim\mathcal N_N(\pmb\mu,\sigma^2\mathbf I)$. Suppose that $\{\mathbf A_i\}_{i=1}^k$ are symmetric, idempotent matrices s.t. $rank(\mathbf A_i)=s_i$ and $\sum_{i=1}^k\mathbf A_i=\mathbf I_N$. Then $\sum_{i=1}^k s_i=N$, $\{\mathbf y^T\mathbf A_i\mathbf y\}_{i=1}^k$ are independent, and $\frac{1}{\sigma^2}\mathbf y^T\mathbf A_i\mathbf y\sim\chi^2_{s_i}\left(\frac{1}{2\sigma^2}\pmb\mu^T\mathbf A_i\pmb\mu\right)$.
proof: $N=\text{trace}(\mathbf I_N)=\text{trace}(\sum_{i=1}^k\mathbf A_i)=\sum_{i=1}^k\text{trace}(\mathbf A_i)=\sum_{i=1}^k s_i$.
By Lemma 5.1, for each $i=1,2,\cdots,k$, $\exist$ a $N\times s_i$ matrix $\mathbf Q_i$ s.t. $\mathbf Q_i\mathbf Q_i^T=\mathbf A_i$ and $\mathbf Q_i^T\mathbf Q_i=\mathbf I_{s_i}$ (Note that $rank(\mathbf Q_i)=s_i$).
Define $\mathbf Q=\begin{pmatrix}\mathbf Q_1&\mathbf Q_2&\cdots&\mathbf Q_k\end{pmatrix}_{N\times N}$. Now $\mathbf Q\mathbf Q^T=\sum_{i=1}^k\mathbf Q_i\mathbf Q_i^T=\sum_{i=1}^k\mathbf A_i=\mathbf I_N$. Since $\mathbf Q$ is square, it must be orthogonal, so $\mathbf Q^T\mathbf Q=\mathbf I_N$. $$ \mathbf Q^T\mathbf Q=\begin{pmatrix}\mathbf Q_1^T\\\mathbf Q_2^T\\\vdots\\\mathbf Q_k^T\end{pmatrix}\begin{pmatrix}\mathbf Q_1&\mathbf Q_2&\cdots&\mathbf Q_k\end{pmatrix}=\begin{pmatrix}\mathbf I_{s_1}&\mathbf 0&\cdots&\mathbf 0\\\mathbf 0&\mathbf I_{s_2}&\cdots&\mathbf 0\\\vdots&\vdots&\ddots&\vdots\\\mathbf 0&\mathbf 0&\cdots&\mathbf I_{s_k}\end{pmatrix}\Rightarrow\mathbf Q_i^T\mathbf Q_j=\mathbf 0,\forall i\not=j. $$ Now $\mathbf Q^T\mathbf y=\begin{pmatrix}\mathbf Q_1^T\mathbf y\\\mathbf Q_2^T\mathbf y\\\vdots\\\mathbf Q_k^T\mathbf y\end{pmatrix}\sim\mathcal N(\mathbf Q^T\pmb\mu,\sigma^2\mathbf I_N)$ $\Rightarrow$ $\{\mathbf Q_i^T\mathbf y\}_{i=1}^k$ are independent and $\mathbf Q_i^T\mathbf y\sim\mathcal N(\mathbf Q_i^T\pmb\mu,\sigma^2\mathbf I_{s_i})$, i.e., $\frac{1}{\sigma}\mathbf Q_i^T\mathbf y\sim\mathcal N(\mathbf Q_i^T\pmb\mu/\sigma,\mathbf I_{s_i})$.
Now, since $\mathbf y^T\mathbf A_i\mathbf y=|\mathbf Q_i^T\mathbf y|^2$, $\{\mathbf y^T\mathbf A_i\mathbf y\}_{i=1}^k$ are also independent.
And, finally, we have $$ \frac{1}{\sigma^2}\mathbf y^T\mathbf A_i\mathbf y=\frac{1}{\sigma^2}|\mathbf Q_i^T\mathbf y|^2\sim\chi^2_{s_i}\left(\frac{1}{2\sigma^2}\pmb\mu^T\mathbf A_i\pmb\mu\right). $$ Example: One-way ANOVA
$y_{ij}=\mu+\alpha_i+e_{ij}$, $i=1,2,\cdots,a$, $j=1,2,\cdots,n_i$, $e_{ij}\sim\mathcal N(0,\sigma^2)$ independent. $$ \mathbf X\mathbf b=\begin{pmatrix}\mathbf 1_{n_1}&\mathbf 1_{n_1}&\mathbf 0&\cdots&\mathbf 0\\ \mathbf 1_{n_2}&\mathbf 0&\mathbf 1_{n_2}&\cdots&\mathbf 0\\ \vdots&\vdots&\vdots&\ddots&\vdots\\ \mathbf 1_{n_a}&\mathbf 0&\mathbf 0&\cdots&\mathbf 1_{n_a}\end{pmatrix} \begin{pmatrix} \mu\\\alpha_1\\\alpha_2\\\vdots\\\alpha_a \end{pmatrix} $$
$$ \mathbf A_1=\mathbf P_{\mathbf 1_ N},\quad\mathbf A_2=\mathbf P_\mathbf X-\mathbf P_{\mathbf 1_N},\quad \mathbf A_3=\mathbf I-\mathbf P_\mathbf X $$
Clearly, $\mathbf A_1+\mathbf A_2+\mathbf A_3=\mathbf I$.
$rank(\mathbf P_{\mathbf 1_N})=1$, $rank(\mathbf I-\mathbf P_\mathbf X)=N-a$ $\Rightarrow rank(\mathbf P_\mathbf X-\mathbf P_{\mathbf 1_N})=a-1$.
$\mathcal C(\mathbf 1_N)\subset\mathcal C(\mathbf X)$ $\stackrel{\text{Thm 2.2}}{\Rightarrow}$ $\mathbf P_\mathbf X-\mathbf P_{\mathbf 1_N}$ is the symmetric projection onto $\mathcal C((\mathbf I-\mathbf P_{\mathbf 1_N})\mathbf X)$. $$ \begin{aligned} &\mathbf y^T\mathbf A_1\mathbf y=SSM=N\bar y^2\quad\quad\quad\quad\frac{SSM}{\sigma^2}\sim\chi^2_1\left(\frac{1}{2\sigma^2}(\mathbf X\mathbf b)^T\mathbf P_{\mathbf 1_N}\mathbf X\mathbf b\right)\\ &\mathbf y^T\mathbf A_2\mathbf y=SSA_{cfm}=\sum_{i=1}^an_i\bar y_{i\cdot}^2-N\bar y^2 \quad\frac{SSA_{cfm}}{\sigma^2}\sim\chi^2_{a-1}\left(\frac{1}{2\sigma^2}(\mathbf X\mathbf b)^T(\mathbf P_\mathbf X-\mathbf P_{\mathbf 1_N})\mathbf X\mathbf b\right)\\ &\mathbf y^T\mathbf A_3\mathbf y=SSE=\sum_{i=1}^a\sum_{j=1}^{n_i}(y_{ij}-\bar y_{i\cdot})^2\quad\frac{SSE}{\sigma^2}\sim\chi^2_{N-a}\left(\frac{1}{2\sigma^2}(\mathbf X\mathbf b)^T(\mathbf I-\mathbf P_\mathbf X)\mathbf X\mathbf b\right)=\chi^2_{N-a} \end{aligned} $$ ANOVA Table:
Source | DF | Projection | SS | Noncentrality |
---|---|---|---|---|
Mean | 1 | $\mathbf P_{\mathbf 1_N}$ | SSM=$N\bar y^2$ | $\frac{1}{2\sigma^2}N(\mu+\bar \alpha)^2$ |
Group | a-1 | $\mathbf P_\mathbf X-\mathbf P_{\mathbf 1_N}$ | SSA$_{cfm}=\sum_in_i\bar y^2_{i\cdot}-N\bar y^2$ | $\frac{1}{2\sigma^2}\sum_in_i(\alpha_i-\bar \alpha)^2$ |
Error | N-a | $\mathbf I-\mathbf P_\mathbf X$ | SSE=$\sum_{i,j}(y_{ij}-\bar y_{i\cdot})^2$ | 0 |
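*Numerical sketch of this table for simulated unbalanced data (assuming numpy; the group sizes and effects are illustrative), checking that SSM, SSA$_{cfm}$, and SSE add up to $\mathbf y^T\mathbf y$, as the decomposition $\mathbf A_1+\mathbf A_2+\mathbf A_3=\mathbf I$ requires:

```python
import numpy as np

rng = np.random.default_rng(3)
n_i = np.array([5, 8, 12])                      # illustrative (unbalanced) group sizes
a, N = len(n_i), n_i.sum()
groups = np.repeat(np.arange(a), n_i)
alpha = np.array([0.0, 1.0, -0.5])              # illustrative group effects
y = 2.0 + alpha[groups] + rng.standard_normal(N)

ybar = y.mean()
ybar_i = np.array([y[groups == i].mean() for i in range(a)])

SSM = N * ybar**2                               # y^T P_{1_N} y
SSA = np.sum(n_i * ybar_i**2) - N * ybar**2     # y^T (P_X - P_{1_N}) y
SSE = np.sum((y - ybar_i[groups])**2)           # y^T (I - P_X) y
print(np.isclose(SSM + SSA + SSE, y @ y))       # the three pieces sum to y^T y
```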
Chapter 6: Statistical Inference
6.2 Results from Statistical Theory
Assume $\mathbf y\sim\mathcal N_N(\mathbf X\mathbf b,\sigma^2\mathbf I)$. $$ \begin{aligned} f(\mathbf y|\mathbf b,\sigma^2)&=(2\pi\sigma^2)^{-N/2}\exp\{-\frac{1}{2\sigma^2}(\mathbf y-\mathbf X\mathbf b)^T(\mathbf y-\mathbf X\mathbf b)\}\\ &=(2\pi\sigma^2)^{-N/2}\exp\{-\frac{1}{2\sigma^2}\mathbf b^T\mathbf X^T\mathbf X\mathbf b\}\exp\{-\frac{1}{2\sigma^2}\mathbf y^T\mathbf y+\frac{1}{\sigma^2}\mathbf b^T\mathbf X^T\mathbf y\} \end{aligned} $$ By the factorization theorem, we see that $(\mathbf y^T\mathbf y,\mathbf X^T\mathbf y)$ is a sufficient statistic for $(\mathbf b,\sigma^2)$.
Result 6.1: Assume $\mathbf y\sim\mathcal N_N(\mathbf X\mathbf b,\sigma^2\mathbf I)$. $(\mathbf y^T\mathbf y,\mathbf X^T\mathbf y)$ is a minimal sufficient statistic for $(\mathbf b,\sigma^2)$.
proof: According to Theorem 6.2.B in C&B, it is enough to show that for two different responses, $\mathbf y_1,\mathbf y_2\in\mathbb R^N$, $\mathbf y_1\not=\mathbf y_2$, $$ \frac{f(\mathbf y_1|\mathbf b,\sigma^2)}{f(\mathbf y_2|\mathbf b,\sigma^2)}\text{ is constant in }(\mathbf b,\sigma^2)\quad\Leftrightarrow\quad(\mathbf y_1^T\mathbf y_1,\mathbf X^T\mathbf y_1)=(\mathbf y_2^T\mathbf y_2,\mathbf X^T\mathbf y_2) $$ First $$ \frac{f(\mathbf y_1|\mathbf b,\sigma^2)}{f(\mathbf y_2|\mathbf b,\sigma^2)}=\exp\left\{\frac{1}{2\sigma^2}(\mathbf y_2^T\mathbf y_2-\mathbf y_1^T\mathbf y_1)+\frac{1}{\sigma^2}\mathbf b^T(\mathbf X^T\mathbf y_1-\mathbf X^T\mathbf y_2)\right\} $$ ($\Leftarrow$) Obvious
($\Rightarrow$) $\forall (\mathbf b,\sigma^2)\in\R^p\times\R^+$, $\frac{1}{2\sigma^2}(\mathbf y_2^T\mathbf y_2-\mathbf y_1^T\mathbf y_1)+\frac{1}{\sigma^2}\mathbf b^T(\mathbf X^T\mathbf y_1-\mathbf X^T\mathbf y_2)=c$ where $c$ does not depend on $(\mathbf b,\sigma^2)$. For fixed $\sigma^2>0$, $\frac{1}{2}(\mathbf y_2^T\mathbf y_2-\mathbf y_1^T\mathbf y_1)-c\sigma^2+\mathbf b^T(\mathbf X^T\mathbf y_1-\mathbf X^T\mathbf y_2)=0$, $\forall\mathbf b\in\R^p$. By Result A.8, $\mathbf X^T\mathbf y_1-\mathbf X^T\mathbf y_2=\mathbf 0$ $\Rightarrow$ $\frac{1}{2\sigma^2}(\mathbf y_2^T\mathbf y_2-\mathbf y_1^T\mathbf y_1)=c$, $\forall \sigma^2>0$ $\Rightarrow$ $\mathbf y_2^T\mathbf y_2-\mathbf y_1^T\mathbf y_1=0$.
Corollary 6.1: $\mathbf y\sim\mathcal N_N(\mathbf X\mathbf b,\sigma^2\mathbf I)$. $(SSE,\mathbf X^T\mathbf y)$ is also minimal sufficient for $(\mathbf b,\sigma^2)$.
Q: How is least squares related to M.L. (maximum likelihood)?
A: Almost the same.
Result 6.3: Assume $\mathbf y\sim\mathcal N_N(\mathbf X\mathbf b,\sigma^2\mathbf I)$. Let $\hat{\mathbf b}$ be a solution to the N.E.s. $(\hat{\mathbf b},SSE/N)$ is a ML estimator of $(\mathbf b,\sigma^2)$.
proof: Recall that $Q(\mathbf b)=(\mathbf y-\mathbf X\mathbf b)^T(\mathbf y-\mathbf X\mathbf b)$. $$ L(\mathbf b,\sigma^2|\mathbf y)=(2\pi)^{-N/2}(\sigma^2)^{-N/2}e^{-\frac{1}{2\sigma^2}Q(\mathbf b)} $$ For any $\sigma^2>0$, $e^{-\frac{1}{2\sigma^2}Q(\mathbf b)}$ is maximized by minimizing $Q(\mathbf b)$. Of course, $Q(\mathbf b)$ is minimized at $\hat{\mathbf b}$. So, we now can say for all $(\mathbf b,\sigma^2)\in\R^p\times\R^+$, $$ L(\mathbf b,\sigma^2|\mathbf y)\leq L(\hat{\mathbf b},\sigma^2|\mathbf y) $$ Now $$ \log L(\hat{\mathbf b},\sigma^2|\mathbf y)=-\frac{N}{2}\log\sigma^2-\frac{1}{2\sigma^2}Q(\hat{\mathbf b})+\text{constant} $$
$$ \frac{d}{d\sigma^2}\log L(\hat{\mathbf b},\sigma^2|\mathbf y)=-\frac{N}{2\sigma^2}+\frac{1}{2(\sigma^2)^2}Q(\hat{\mathbf b})\stackrel{!}{=}0\\ \Rightarrow\hat{\sigma^2}=\frac{SSE}{N}\\ \frac{d^2}{d(\sigma^2)^2}\log L(\hat{\mathbf b},\sigma^2|\mathbf y)\bigg|_{\hat{\sigma^2}}<0\Rightarrow\text{ Maximum} $$
Finally, we can say: for all $(\mathbf b,\sigma^2)\in\R^p\times\R^+$, $L(\mathbf b,\sigma^2|\mathbf y)\leq L(\hat{\mathbf b},\sigma^2|\mathbf y)\leq L(\hat{\mathbf b},\hat{\sigma^2}|\mathbf y)$.
Corollary 6.3: Assume $\mathbf y\sim\mathcal N_N(\mathbf X\mathbf b,\sigma^2\mathbf I)$. The ML estimator of an estimable function, $\pmb\lambda^T\mathbf b$, is $\pmb\lambda^T\hat{\mathbf b}$, where $\hat{\mathbf b}$ solves the N.E.s.
proof: Invariance of MLE.
6.3 Testing the General Linear Hypothesis
$$ H:\mathbf K^T\mathbf b=\mathbf m\quad\quad A:\mathbf K^T\mathbf b\not=\mathbf m $$
Assume:
- $\mathbf K_{p\times s}$ has rank $s$. If $\mathbf K$ is not of full column rank, there will be redundant constraints.
- Components of $\mathbf K^T\mathbf b$ are estimable. In other words, each column of $\mathbf K$ is in $\mathcal C(\mathbf X^T)$. Thus $\mathbf K=\mathbf X^T\mathbf A$ for some $\mathbf A_{N\times s}$.
Example: Two-way crossed model with interaction $$ y_{ijk}=\mu+\alpha_i+\beta_j+\gamma_{ij}+e_{ijk},\quad i=1,2,\cdots,a,j=1,2,\cdots,b,k=1,2,\cdots,n_{ij} $$ Q: How do we test for no interaction?
Maybe $$ \mathbf K^T\mathbf b=\begin{pmatrix}\mathbf 0_{ab\times(a+b+1)}&\mathbf I_{ab}\end{pmatrix}\begin{pmatrix}\mu\\\alpha_1\\\vdots\\\alpha_a\\\beta_1\\\vdots\\\beta_b\\\gamma_{11}\\\vdots\\\gamma_{ab}\end{pmatrix},\quad\mathbf m=\mathbf 0 $$ Doesn’t work since $\gamma_{ij}$ is not estimable.
Let’s think about a special case with one observation per cell ($k=1$), $a=3$ and $b=2$.
Q: What does no interaction really mean?
The difference in response between levels $i$ and $i'$ of factor $A$ does not depend on factor $B$. $$ \mathbb E(y_{ij}-y_{i'j})=\mathbb E(y_{ij'}-y_{i'j'}),\quad i\not=i',j\not=j'\\ \Rightarrow \mu+\alpha_i+\beta_j+\gamma_{ij}-(\mu+\alpha_{i'}+\beta_j+\gamma_{i'j})=\mu+\alpha_i+\beta_{j'}+\gamma_{ij'}-(\mu+\alpha_{i'}+\beta_{j'}+\gamma_{i'j'})\\ \text{or }\gamma_{ij}-\gamma_{i'j}-\gamma_{ij'}+\gamma_{i'j'}=0 $$ $(i,i')=(1,2)$, $(i,i')=(1,3)$, $(i,i')=(2,3)$
When $a=3$ and $b=2$, we only need $2$ such equations to test the hypothesis on no interaction:
- $\gamma_{11}-\gamma_{21}-\gamma_{12}+\gamma_{22}=0$
- $\gamma_{11}-\gamma_{31}-\gamma_{12}+\gamma_{32}=0$
Subtracting the first from the second gives $\gamma_{21}-\gamma_{31}-\gamma_{22}+\gamma_{32}=0$.
In general, we need $(a-1)(b-1)$ equations to test for no interaction. This makes sense as the rank of $\mathbf X$ in the full model is $ab$, and in the model without interaction, the rank of $\mathbf X$ is $a+b-1$, and $ab-(a+b-1)=(a-1)(b-1)$.
Back to the general case.
Let’s use “first principles” (or “do what makes sense”) to construct a statistical test for $H:\mathbf K^T\mathbf b=\mathbf m$.
The BLUE of $\mathbf K^T\mathbf b$ is $\mathbf K^T\hat{\mathbf b}=\mathbf K^T(\mathbf X^T\mathbf X)^g\mathbf X^T\mathbf y$. $$ \mathbf K^T\hat{\mathbf b}\sim\mathcal N_s\left(\mathbf K^T(\mathbf X^T\mathbf X)^g\mathbf X^T\mathbf X\mathbf b,\sigma^2\mathbf K^T(\mathbf X^T\mathbf X)^g\mathbf X^T\mathbf X[(\mathbf X^T\mathbf X)^g]^T\mathbf K\right) $$ But $\left[\mathbf K^T(\mathbf X^T\mathbf X)^g\mathbf X^T\mathbf X\right]^T=\underbrace{(\mathbf X^T\mathbf X)\left[(\mathbf X^T\mathbf X)^g\right]^T}_{\text{projection onto }\mathcal C(\mathbf X^T\mathbf X)=\mathcal C(\mathbf X^T)}\mathbf K=\mathbf K$. So $$ \mathbf K^T\hat{\mathbf b}\sim\mathcal N_s\left(\mathbf K^T\mathbf b,\sigma^2\underbrace{\mathbf K^T(\mathbf X^T\mathbf X)^g\mathbf K}_{\mathbf H}\right) $$ Result 6.4: The $s\times s$ matrix $\mathbf H=\mathbf K^T(\mathbf X^T\mathbf X)^g\mathbf K$ is nonsingular, and hence positive definite.
proof: $\mathbf K_{p\times s}=\mathbf X^T_{p\times N}\mathbf A_{N\times s}$, $rank(\mathbf X)=r$. Now, $\mathbf K^T(\mathbf X^T\mathbf X)^g\mathbf K=\mathbf A^T\mathbf X(\mathbf X^T\mathbf X)^g\mathbf X^T\mathbf A=\mathbf A^T\mathbf P_\mathbf X\mathbf P_\mathbf X\mathbf A=(\mathbf P_\mathbf X\mathbf A)^T\mathbf P_\mathbf X\mathbf A$. If $\mathbf P_\mathbf X\mathbf A_{N\times s}$ has full rank, then $(\mathbf P_\mathbf X\mathbf A)^T\mathbf P_\mathbf X\mathbf A$ is invertible, which is what we want.
Now, $\mathbf X^T\mathbf P_\mathbf X\mathbf A=\mathbf X^T\mathbf A=\mathbf K$. $s=rank(\mathbf K)\leq\min\{rank(\mathbf X^T),rank(\mathbf P_\mathbf X\mathbf A)\}\leq rank(\mathbf P_\mathbf X\mathbf A)\Rightarrow rank(\mathbf P_\mathbf X\mathbf A)=s$.
$$ \mathbf K^T\hat{\mathbf b}-\mathbf m\sim\mathcal N_s(\mathbf K^T\mathbf b-\mathbf m,\sigma^2\mathbf H) $$ Result 5.10 $\Rightarrow $ $$ (\mathbf K^T\hat{\mathbf b}-\mathbf m)^T(\sigma^2\mathbf H)^{-1}(\mathbf K^T\hat{\mathbf b}-\mathbf m)\sim\chi^2_s(\phi) $$ where $\phi=\frac{1}{2}(\mathbf K^T\mathbf b-\mathbf m)^T(\sigma^2\mathbf H)^{-1}(\mathbf K^T\mathbf b-\mathbf m)$.
Recall that $\mathbf K=\mathbf X^T\mathbf A$, so $\mathbf K^T\hat{\mathbf b}=\mathbf A^T\mathbf X(\mathbf X^T\mathbf X)^g\mathbf X^T\mathbf y=\mathbf A^T\mathbf P_\mathbf X\mathbf y$. But we know that $\mathbf P_\mathbf X\mathbf y$ and $(\mathbf I-\mathbf P_\mathbf X)\mathbf y$ are independent, so $\mathbf K^T\hat{\mathbf b}$ and $SSE$ are also independent.
Finally, $$ F=\frac{(\mathbf K^T\hat{\mathbf b}-\mathbf m)^T\mathbf H^{-1}(\mathbf K^T\hat{\mathbf b}-\mathbf m)/s}{SSE/(N-r)}\sim F_{s,N-r}(\phi). $$ Under $H:\mathbf K^T\mathbf b=\mathbf m$, $F\sim F_{s,N-r}$.
Also, we know from Result 5.13 that $F$ is stochastically increasing in $\phi$. It therefore makes sense to reject when $F$ is large.
Let’s reject when $F>F_{s,N-r,\alpha}$. This is a level $\alpha$ test, and it is unbiased because $P(\text{reject }H)$ is smallest when $\phi=0$, i.e., under $H$.
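*Numerical sketch: the first-principles construction translates directly into a few lines of linear algebra. Here is a minimal Python sketch (assuming numpy; `glh_F` is a hypothetical helper, and $\mathbf K^T\mathbf b$ is assumed estimable with $\mathbf K$ of full column rank $s$):

```python
import numpy as np

def glh_F(y, X, K, m):
    """First-principles F statistic for H: K^T b = m.
    A sketch: K^T b is assumed estimable and K of full column rank s."""
    XtX_g = np.linalg.pinv(X.T @ X)             # one choice of (X^T X)^g
    b_hat = XtX_g @ X.T @ y                     # a solution of the normal equations
    N, r = X.shape[0], np.linalg.matrix_rank(X)
    s = K.shape[1]
    H = K.T @ XtX_g @ K                         # nonsingular by Result 6.4
    d = K.T @ b_hat - m
    SSE = y @ y - y @ X @ b_hat                 # y^T (I - P_X) y
    F = (d @ np.linalg.solve(H, d) / s) / (SSE / (N - r))
    return F, s, N - r                          # compare F with F_{s, N-r, alpha}
```

Applied to Example 6.5 below, with the one-way ANOVA design and $\mathbf K^T$ encoding the contrasts $\alpha_1-\alpha_i$, this should reproduce the usual ANOVA $F$.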
Example 6.5: One-way ANOVA $$ y_{ij}=\mu+\alpha_i+e_{ij},\quad i=1,2,\cdots,a,j=1,2,\cdots,n_i $$ $H:\alpha_1-\alpha_2=0,\alpha_1-\alpha_3=0,\cdots,\alpha_1-\alpha_a=0$. $$ \mathbf K^T=\begin{pmatrix} 0&1&-1&0&\cdots&0\\ 0&1&0&-1&\cdots&0\\ \vdots&\vdots&\vdots&\vdots&\ddots&\vdots\\ 0&1&0&0&\cdots&-1 \end{pmatrix}, \quad \mathbf b=\begin{pmatrix} \mu\\\alpha_1\\\alpha_2\\\vdots\\\alpha_a \end{pmatrix}, \quad\mathbf m=\mathbf 0. $$ If $H$ is true, then $\alpha_1=\alpha_2=\cdots=\alpha_a$.
Note that we could have specified $H$ in a different way, e.g., $H_\star:\alpha_1-\alpha_2=0,\alpha_2-\alpha_3=0,\cdots,\alpha_{a-1}-\alpha_a=0$.
Let’s construct the $F$ statistic for $H$. Here $N=\sum_{i=1}^an_i$, $p=a+1$, $s=a-1$. $$ \mathbf X=\begin{pmatrix}\mathbf 1_{n_1}&\mathbf 1_{n_1}&\mathbf 0&\cdots&\mathbf 0\\ \mathbf 1_{n_2}&\mathbf 0&\mathbf 1_{n_2}&\cdots&\mathbf 0\\ \vdots&\vdots&\vdots&\ddots&\vdots\\ \mathbf 1_{n_a}&\mathbf 0&\mathbf 0&\cdots&\mathbf 1_{n_a}\end{pmatrix},\\ (\mathbf X^T\mathbf X)^g=\begin{pmatrix} 0&&&&\\ &\frac{1}{n_1}&&&\\ &&\frac{1}{n_2}&&\\ &&&\ddots&\\ &&&&\frac{1}{n_a} \end{pmatrix}=\begin{pmatrix}0&&\\ &\frac{1}{n_1}&\\ &&\mathbf D_\star\end{pmatrix} $$ where $\mathbf D_\star=\text{diag}(\frac{1}{n_2},\cdots,\frac{1}{n_a})$. This yields $$ \hat{\mathbf b}=\begin{pmatrix} 0\\\bar y_{1\cdot}\\\bar y_{2\cdot}\\\vdots\\\bar y_{a\cdot} \end{pmatrix} $$
$$ \mathbf K^T\hat{\mathbf b}-\mathbf m=\begin{pmatrix}\mathbf 0&\mathbf 1_{a-1}&-\mathbf I_{a-1}\end{pmatrix}\begin{pmatrix}0\\\bar y_{1\cdot}\\\vdots\\\bar y_{a\cdot}\end{pmatrix}=\begin{pmatrix}\bar y_{1\cdot}-\bar y_{2\cdot}\\\bar y_{1\cdot}-\bar y_{3\cdot}\\\vdots\\\bar y_{1\cdot}-\bar y_{a\cdot}\end{pmatrix} $$
$$ \begin{aligned} \mathbf H&=\mathbf K^T(\mathbf X^T\mathbf X)^g\mathbf K=\begin{pmatrix}\mathbf 0&\mathbf 1_{a-1}&-\mathbf I_{a-1}\end{pmatrix}\begin{pmatrix}0&&\\ &\frac{1}{n_1}&\\ &&\mathbf D_\star\end{pmatrix} \begin{pmatrix}\mathbf 0\\\mathbf 1^T_{a-1}\\-\mathbf I_{a-1}\end{pmatrix}\\ &=\begin{pmatrix}\mathbf 0&\frac{1}{n_1}\mathbf 1_{a-1}&-\mathbf D_\star\end{pmatrix}\begin{pmatrix}\mathbf 0\\\mathbf 1^T_{a-1}\\-\mathbf I_{a-1}\end{pmatrix}\\ &=\mathbf D_\star+\frac{1}{n_1}\mathbf 1_{a-1}\mathbf 1^T_{a-1} \end{aligned} $$
Using the result from Problem A.75: $$ \left[\mathbf D_\star+\frac{1}{n_1}\mathbf 1_{a-1}\mathbf 1_{a-1}^T\right]^{-1}=\mathbf D_\star^{-1}-\frac{1}{N}\mathbf D_\star^{-1}\mathbf 1_{a-1}\mathbf 1_{a-1}^T\mathbf D_\star^{-1} $$
$$ \begin{aligned} &(\mathbf K^T\hat{\mathbf b}-\mathbf m)^T[\mathbf K^T(\mathbf X^T\mathbf X)^g\mathbf K]^{-1}(\mathbf K^T\hat{\mathbf b}-\mathbf m)\\ =&(\mathbf K^T\hat{\mathbf b}-\mathbf m)^T\mathbf D_\star^{-1}(\mathbf K^T\hat{\mathbf b}-\mathbf m)-\frac{1}{N}(\mathbf K^T\hat{\mathbf b}-\mathbf m)^T\mathbf D_\star^{-1}\mathbf 1_{a-1}\mathbf 1_{a-1}^T\mathbf D_\star^{-1}(\mathbf K^T\hat{\mathbf b}-\mathbf m)\\ =&\sum_{i=2}^an_i(\bar y_{1\cdot}-\bar y_{i\cdot})^2-\frac{1}{N}\left[\sum_{i=2}^an_i(\bar y_{1\cdot}-\bar y_{i\cdot})\right]^2\\ =&\sum_{i=2}^an_i(\bar y_{1\cdot}-\bar y_{i\cdot})^2-\frac{1}{N}\left[N(\bar y_{1\cdot}-\bar y_{\cdot\cdot})\right]^2\\ =&\cdots\\ =&\sum_{i=1}^an_i(\bar y_{i\cdot}-\bar y_{\cdot\cdot})^2\quad\quad\left(\text{Using } \sum_{i=1}^an_i(\bar y_{i\cdot}-\bar y_{1\cdot}+\bar y_{1\cdot}-\bar y_{\cdot\cdot})^2=\cdots\right) \end{aligned} $$
Clearly, $SSE=\sum_{i=1}^a\sum_{j=1}^{n_i}(y_{ij}-\bar y_{i\cdot})^2$. So our $F$-statistic is $$ F=\frac{\left(\sum_{i=1}^an_i(\bar y_{i\cdot}-\bar y_{\cdot\cdot})^2\right)/(a-1)}{\left(\sum_{i=1}^a\sum_{j=1}^{n_i}(y_{ij}-\bar y_{i\cdot})^2\right)/(N-a)}\stackrel{\text{under }H}{\sim} F_{a-1,N-a} $$ Recall that we could have used $H_\star$ instead of $H$.
Q: Will we get the same test?
A: Yes.
In general, suppose that we have two equivalent hypotheses: $H:\mathbf K^T\mathbf b=\mathbf m$ and $H_\star:\mathbf K^T_\star\mathbf b=\mathbf m_\star$.
Equivalence $\Rightarrow $ $S=\left\{\mathbf b\in\R^p:\mathbf K^T\mathbf b=\mathbf m\right\}=\{\mathbf b\in\R^p:\mathbf K^T_\star\mathbf b=\mathbf m_\star\}=S_\star$
Result A.13 implies that points in $S$ can be expressed as $$ \mathbf K(\mathbf K^T\mathbf K)^{-1}\mathbf m+(\mathbf I-\mathbf P_\mathbf K)\mathbf z,\quad\mathbf z\in\R^p $$ Since $S=S_\star$, we have $$ \mathbf K^T_\star\left[\mathbf K(\mathbf K^T\mathbf K)^{-1}\mathbf m+(\mathbf I-\mathbf P_\mathbf K)\mathbf z\right]=\mathbf m_\star,\quad\forall\mathbf z\in\R^p\\ \text{or }\mathbf K_\star^T\mathbf K(\mathbf K^T\mathbf K)^{-1}\mathbf m-\mathbf m_\star+\mathbf K_\star^T(\mathbf I-\mathbf P_\mathbf K)\mathbf z=\mathbf 0,\quad\forall \mathbf z\in\R^p $$ Result A.8 implies that
- $\mathbf K^T_\star(\mathbf I-\mathbf P_\mathbf K)=\mathbf 0$, or $\mathbf P_\mathbf K\mathbf K_\star=\mathbf K_\star$
- $\mathbf K^T_\star\mathbf K(\mathbf K^T\mathbf K)^{-1}\mathbf m=\mathbf m_\star$
$\mathbf P_\mathbf K\mathbf K_\star=\mathbf K_\star\Rightarrow\mathcal C(\mathbf K_\star)\subseteq\mathcal C(\mathbf K)$. If we reverse the roles of $\mathbf K$ and $\mathbf K_\star$ in the previous argument, we will get $\mathcal C(\mathbf K)\subseteq\mathcal C(\mathbf K_\star)$.
So $\mathcal C(\mathbf K)=\mathcal C(\mathbf K_\star)\Rightarrow\exist$ a nonsingular $s\times s$ matrix $\mathbf Q$ such that $\mathbf K\mathbf Q^T=\mathbf K_\star$.
It follows from $\mathbf K^T_\star\mathbf K(\mathbf K^T\mathbf K)^{-1}\mathbf m=\mathbf m_\star$ that $$ \mathbf Q\mathbf K^T\mathbf K(\mathbf K^T\mathbf K)^{-1}\mathbf m=\mathbf m_\star\Rightarrow\mathbf Q\mathbf m=\mathbf m_\star $$ Also, $$ \mathbf K\mathbf Q^T=\mathbf K_\star\Rightarrow\mathbf K^T\mathbf K\mathbf Q^T=\mathbf K^T\mathbf K_\star\Rightarrow \mathbf Q^T=(\mathbf K^T\mathbf K)^{-1}\mathbf K^T\mathbf K_\star\Rightarrow\mathbf Q=\mathbf K_\star^T\mathbf K(\mathbf K^T\mathbf K)^{-1}. $$ Now, let’s look at the numerator of the $F$-statistic for testing $H_\star:\mathbf K_\star^T\mathbf b=\mathbf m_\star$: $$ \begin{aligned} &(\mathbf K_\star^T\hat{\mathbf b}-\mathbf m_\star)^T\left[\mathbf K_\star^T(\mathbf X^T\mathbf X)^g\mathbf K_\star\right]^{-1}(\mathbf K_\star^T\hat{\mathbf b}-\mathbf m_\star)\\ =&(\mathbf Q\mathbf K^T\hat{\mathbf b}-\mathbf Q\mathbf m)^T\left[\mathbf Q\mathbf K^T(\mathbf X^T\mathbf X)^g\mathbf K\mathbf Q^T\right]^{-1}(\mathbf Q\mathbf K^T\hat{\mathbf b}-\mathbf Q\mathbf m)\\ =&(\mathbf K^T\hat{\mathbf b}-\mathbf m)^T\left[\mathbf K^T(\mathbf X^T\mathbf X)^g\mathbf K\right]^{-1}(\mathbf K^T\hat{\mathbf b}-\mathbf m) \end{aligned} $$ Hence the two $F$-statistics coincide.
What happens when $s=1$ so that $\mathbf K^T$ is a $1\times p$ vector?
$H:\mathbf K^T\mathbf b=\mathbf m$
Now we can do both one-sided and two-sided tests: $A_1:\mathbf K^T\mathbf b>\mathbf m$ or $A_2:\mathbf K^T\mathbf b\not=\mathbf m$. $$ \mathbf K^T\hat{\mathbf b}\sim\mathcal N\left(\mathbf K^T\mathbf b,\sigma^2\mathbf K^T(\mathbf X^T\mathbf X)^g\mathbf K\right) $$
$$ \Rightarrow \frac{\frac{\mathbf K^T\hat{\mathbf b}-\mathbf m}{\sqrt{\sigma^2\mathbf K^T(\mathbf X^T\mathbf X)^g\mathbf K}}}{\sqrt{\frac{SSE}{\sigma^2(N-r)}}}=\frac{\mathcal N(\mu,1)}{\sqrt{\frac{\chi^2_{N-r}}{N-r}}}\\ \Rightarrow t=\frac{\mathbf K^T\hat{\mathbf b}-\mathbf m}{\sqrt{\hat{\sigma^2}\mathbf K^T(\mathbf X^T\mathbf X)^g\mathbf K}}\sim t_{N-r}(\mu),\quad\mu=\frac{\mathbf K^T\mathbf b-\mathbf m}{\sqrt{\sigma^2\mathbf K^T(\mathbf X^T\mathbf X)^g\mathbf K}} $$
For $A_1$, we reject $H$ if $t>t_{N-r,\alpha}$
For $A_2$, we reject $H$ if $|t|>t_{N-r,\alpha/2}$
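*Numerical sketch: for the $s=1$ case the same ingredients give the $t$ statistic directly (assuming numpy and scipy; `glh_t` is a hypothetical helper and `k` a single estimable vector):

```python
import numpy as np
from scipy.stats import t as t_dist

def glh_t(y, X, k, m):
    """t statistic for H: k^T b = m when s = 1 (k a single estimable vector); a sketch."""
    XtX_g = np.linalg.pinv(X.T @ X)
    b_hat = XtX_g @ X.T @ y
    N, r = X.shape[0], np.linalg.matrix_rank(X)
    sigma2_hat = (y @ y - y @ X @ b_hat) / (N - r)          # SSE / (N - r)
    t = (k @ b_hat - m) / np.sqrt(sigma2_hat * (k @ XtX_g @ k))
    return t, N - r

# e.g. reject H in favor of A_2 at level alpha when |t| > t_dist.ppf(1 - alpha/2, N - r)
```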
Example (Full and reduced models):
Let’s return to the test for no interaction in the two-way model. $y_{ijk}=\mu+\alpha_i+\beta_j+\gamma_{ij}+e_{ijk}$, $i=1,2,\cdots,a$, $j=1,2,\cdots, b$, $k=1,2,\cdots, n$. $$ \mathbf X\mathbf b=\begin{pmatrix}\mathbf 1_n&\mathbf 1_n&\mathbf 0&\mathbf 0&\mathbf 1_n&\mathbf 0&\mathbf 1_n&\mathbf 0&\mathbf 0&\mathbf 0&\mathbf 0&\mathbf 0\\ \mathbf 1_n&\mathbf 1_n&\mathbf 0&\mathbf 0&\mathbf 0&\mathbf 1_n&\mathbf 0&\mathbf 1_n&\mathbf 0&\mathbf 0&\mathbf 0&\mathbf 0\\ \vdots&\vdots&\vdots&\vdots&\vdots&\vdots&\vdots&\vdots&\vdots&\vdots&\vdots&\vdots\\ \mathbf 1_n&\mathbf 0&\mathbf 0&\mathbf 1_n&\mathbf 0&\mathbf 1_n&\mathbf 0&\mathbf 0&\mathbf 0&\mathbf 0&\mathbf 0&\mathbf 1_n\end{pmatrix} \begin{pmatrix}\mu\\\alpha_1\\\alpha_2\\\alpha_3\\\beta_1\\\beta_2\\\gamma_{11}\\\gamma_{12}\\\vdots\\\gamma_{32}\end{pmatrix} $$
$$ \mathbb E\mathbf y=\mathbf X\mathbf b=\begin{pmatrix}\mathbf X_0&\mathbf X_1\end{pmatrix}\begin{pmatrix}\mathbf b_0\\\mathbf b_1\end{pmatrix}=\mathbf X_0\mathbf b_0+\mathbf X_1\mathbf b_1 $$
where $\mathbf X_0$ consists of the first $6$ columns of $\mathbf X$.
We would like to test $H_1:\mathbf X_1\mathbf b_1=\mathbf 0$, but components of $\mathbf X_1\mathbf b_1$ are not estimable.
Recall the alternative parametrization: $$ \mathbf y=\mathbf X_0\mathbf c_0+(\mathbf I-\mathbf P_{\mathbf X_0})\mathbf X_1\mathbf c_1+\mathbf e $$ Note that $$ \mathbb E[(\mathbf I-\mathbf P_{\mathbf X_0})\mathbf y]=(\mathbf I-\mathbf P_{\mathbf X_0})\begin{pmatrix}\mathbf X_0&(\mathbf I-\mathbf P_{\mathbf X_0})\mathbf X_1\end{pmatrix}\begin{pmatrix}\mathbf c_0\\\mathbf c_1\end{pmatrix}\\=\begin{pmatrix}\mathbf 0&(\mathbf I-\mathbf P_{\mathbf X_0})\mathbf X_1\end{pmatrix}\begin{pmatrix}\mathbf c_0\\\mathbf c_1\end{pmatrix}=(\mathbf I-\mathbf P_{\mathbf X_0})\mathbf X_1\mathbf c_1 $$ Thus the components of $(\mathbf I-\mathbf P_{\mathbf X_0})\mathbf X_1\mathbf c_1$ are estimable.
$H:(\mathbf I-\mathbf P_{\mathbf X_0})\mathbf X_1\mathbf c_1=\mathbf 0$.
Problem: $(\mathbf I-\mathbf P_{\mathbf X_0})\mathbf X_1$ may not be full rank.
Let $rank(\mathbf X)=r$, $rank(\mathbf X_0)=r_0$, and $rank((\mathbf I-\mathbf P_{\mathbf X_0})\mathbf X_1)=s=r-r_0$. Just pick out $s$ L.I. estimable functions from $(\mathbf I-\mathbf P_{\mathbf X_0})\mathbf X_1\mathbf c_1$ and perform the corresponding F-test.
6.4 The Likelihood Ratio Test (LRT)
$\mathbf y\sim\mathcal N_N(\mathbf X\mathbf b,\sigma^2\mathbf I)$
Want to test $H:\mathbf K^T\mathbf b=\mathbf m$, alternative $A:\mathbf K^T\mathbf b\not=\mathbf m$
Full parameter space: $\Omega=\{(\mathbf b,\sigma^2):\mathbf b\in\R^p,\sigma^2>0\}$
Reduced parameter space: $\Omega_0=\{(\mathbf b,\sigma^2):\mathbf b\in\R^p,\mathbf K^T\mathbf b=\mathbf m, \sigma^2>0\}$
Likelihood function: $L(\mathbf b,\sigma^2|\mathbf y)=(2\pi\sigma^2)^{-\frac{N}{2}}e^{-\frac{1}{2\sigma^2}Q(\mathbf b)}$, where $Q(\mathbf b)=(\mathbf y-\mathbf X\mathbf b)^T(\mathbf y-\mathbf X\mathbf b)$
LRT statistic: $$ \phi(\mathbf y)=\frac{\sup_{(\mathbf b,\sigma^2)\in\Omega_0}L(\mathbf b,\sigma^2|\mathbf y)}{\sup_{(\mathbf b,\sigma^2)\in\Omega}L(\mathbf b,\sigma^2|\mathbf y)} $$ Reject when $\phi(\mathbf y)$ is “small”.
Recall that we can maximize $L(\mathbf b,\sigma^2|\mathbf y)$ sequentially: first w.r.t. $\mathbf b$, and then w.r.t. $\sigma^2$. Let $Q(\hat{\mathbf b})$ and $Q(\hat{\mathbf b}_H)$ denote the minimized values of $Q(\mathbf b)$ over $\Omega$ and $\Omega_0$, respectively.
Now, we know that for fixed $\mathbf b$, the maximizer of $L(\mathbf b,\sigma^2|\mathbf y)$ in $\sigma^2$ is $\hat{\sigma^2}=\frac{Q(\mathbf b)}{N}$. Thus, $$ \phi(\mathbf y)=\frac{\left(2\pi\frac{Q(\hat{\mathbf b}_H)}{N}\right)^{-\frac{N}{2}}e^{-\frac{1}{2Q(\hat{\mathbf b}_H)/N}Q(\hat{\mathbf b}_H)}}{\left(2\pi\frac{Q(\hat{\mathbf b})}{N}\right)^{-\frac{N}{2}}e^{-\frac{1}{2Q(\hat{\mathbf b})/N}Q(\hat{\mathbf b})}}=\left(\frac{Q(\hat{\mathbf b}_H)}{Q(\hat{\mathbf b})}\right)^{-\frac{N}{2}} $$
$$ \begin{aligned} \left(\frac{Q(\hat{\mathbf b}_H)}{Q(\hat{\mathbf b})}\right)^{-\frac{N}{2}}<c&\Leftrightarrow \frac{Q(\hat{\mathbf b}_H)}{Q(\hat{\mathbf b})}>c^{-\frac{N}{2}}\\ &\Leftrightarrow \frac{Q(\hat{\mathbf b}_H)-Q(\hat{\mathbf b})}{Q(\hat{\mathbf b})}>c^{-\frac{N}{2}}-1\\ &\Leftrightarrow \frac{(Q(\hat{\mathbf b}_H)-Q(\hat{\mathbf b}))/s}{Q(\hat{\mathbf b})/(N-r)}>\frac{N-r}{s}(c^{-\frac{N}{2}}-1) \end{aligned} $$
Example 6.5 (One-way ANOVA):
$Q(\hat{\mathbf b})=\sum_{i=1}^a\sum_{j=1}^{n_i}(y_{ij}-\bar y_{i\cdot})^2=\mathbf y^T(\mathbf I-\mathbf P_\mathbf X)\mathbf y$.
How do we minimize $Q(\mathbf b)$ subject to $\alpha_1=\alpha_2=\cdots=\alpha_a$?
$$ \begin{aligned} \inf_{(\mathbf b,\sigma^2)\in\Omega_0}\sum_{i=1}^a\sum_{j=1}^{n_i}(y_{ij}-(\mu+\alpha_i))^2&=\inf_{\mu,\alpha\in\R}\sum_{i=1}^a\sum_{j=1}^{n_i}(y_{ij}-(\mu+\alpha))^2\\ &=\inf_{\delta\in\R}\sum_{i=1}^a\sum_{j=1}^{n_i}(y_{ij}-\delta)^2\\ &=\sum_{i=1}^a\sum_{j=1}^{n_i}(y_{ij}-\bar y_{\cdot\cdot})^2 \end{aligned} $$ Then $Q(\hat{\mathbf b}_H)-Q(\hat{\mathbf b})=\sum_{i=1}^an_i(\bar y_{i\cdot}-\bar y_{\cdot\cdot})^2$. $$ \frac{(Q(\hat{\mathbf b}_H)-Q(\hat{\mathbf b}))/s}{Q(\hat{\mathbf b})/(N-r)}=\frac{\sum_{i=1}^an_i(\bar y_{i\cdot}-\bar y_{\cdot\cdot})^2/(a-1)}{\sum_{i=1}^a\sum_{j=1}^{n_i}(y_{ij}-\bar y_{i\cdot})^2/(N-a)}=F\stackrel{H}{\sim} F_{a-1,N-a} $$
Full vs. Reduced Models: $$ \mathbb E\mathbf y=\mathbf X\mathbf b=\begin{pmatrix}\mathbf X_0&\mathbf X_1\end{pmatrix}\begin{pmatrix}\mathbf b_0\\\mathbf b_1\end{pmatrix}=\mathbf X_0\mathbf b_0+\mathbf X_1\mathbf b_1 $$ Want to test: $H:\mathbf X_1\mathbf b_1=\mathbf 0$, but components may not be estimable. $$ \mathbf y=\mathbf X_0\mathbf c_0+(\mathbf I-\mathbf P_{\mathbf X_0})\mathbf X_1\mathbf c_1+\mathbf e $$ $rank(\mathbf X)=r$, $rank(\mathbf X_0)=r_0$, and $rank((\mathbf I-\mathbf P_{\mathbf X_0})\mathbf X_1)=s=r-r_0$
How do we use this test with the LRT? $$ Q(\hat{\mathbf b}_H)=\mathbf y^T(\mathbf I-\mathbf P_{\mathbf X_0})\mathbf y=SSE(\text{reduced})\\ Q(\hat{\mathbf b})=\mathbf y^T(\mathbf I-\mathbf P_{\mathbf X})\mathbf y=SSE(\text{full})\\ \Rightarrow Q(\hat{\mathbf b}_H)-Q(\hat{\mathbf b})=\mathbf y^T(\mathbf P_\mathbf X-\mathbf P_{\mathbf X_0})\mathbf y $$ Now, $$ \underbrace{\mathbf P_{\mathbf X_0}}_{\text{rank }r_0}+\underbrace{\mathbf P_{\mathbf X}-\mathbf P_{\mathbf X_0}}_{\text{rank }s=r-r_0}+\underbrace{\mathbf I-\mathbf P_\mathbf X}_{\text{rank }N-r}=\mathbf I $$ All three are symmetric, idempotent projections.
By Cochran’s Theorem, $\mathbf y^T\mathbf P_{\mathbf X_0}\mathbf y/\sigma^2$, $\mathbf y^T(\mathbf P_\mathbf X-\mathbf P_{\mathbf X_0})\mathbf y/\sigma^2$, and $\mathbf y^T(\mathbf I-\mathbf P_\mathbf X)\mathbf y/\sigma^2$ are independent noncentral $\chi^2$’s with noncentrality parameters $(\mathbf X\mathbf b)^T\mathbf P_{\mathbf X_0}(\mathbf X\mathbf b)/2\sigma^2$, $(\mathbf X\mathbf b)^T(\mathbf P_{\mathbf X}-\mathbf P_{\mathbf X_0})(\mathbf X\mathbf b)/2\sigma^2$ and $0$. $$ \frac{[Q(\hat{\mathbf b}_H)-Q(\hat{\mathbf b})]/s}{Q(\hat{\mathbf b})/(N-r)}=\frac{\mathbf y^T(\mathbf P_\mathbf X-\mathbf P_{\mathbf X_0})\mathbf y/s}{\mathbf y^T(\mathbf I-\mathbf P_{\mathbf X})\mathbf y/(N-r)}=F\sim F_{s,N-r}(\phi),\\\phi=\frac{(\mathbf X\mathbf b)^T(\mathbf P_\mathbf X-\mathbf P_{\mathbf X_0})(\mathbf X\mathbf b)}{2\sigma^2} $$ It is easy to see that if $H:\mathbf X_1\mathbf b_1=\mathbf 0$, then $\phi=0$.
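*Numerical sketch: in the full-versus-reduced form, the $F$ statistic needs nothing beyond the two projections (assuming numpy; `full_vs_reduced_F` is a hypothetical helper):

```python
import numpy as np

def full_vs_reduced_F(y, X0, X1):
    """F statistic for H: X1 b1 = 0, comparing E y = X0 b0 (reduced) with the full model; a sketch."""
    X = np.column_stack([X0, X1])
    P_full = X @ np.linalg.pinv(X)              # P_X
    P_red = X0 @ np.linalg.pinv(X0)             # P_{X0}
    N = len(y)
    r, r0 = np.linalg.matrix_rank(X), np.linalg.matrix_rank(X0)
    s = r - r0
    SSE_full = y @ (np.eye(N) - P_full) @ y     # Q(b_hat)
    SSE_red = y @ (np.eye(N) - P_red) @ y       # Q(b_hat_H)
    F = ((SSE_red - SSE_full) / s) / (SSE_full / (N - r))
    return F, s, N - r                          # F ~ F_{s, N-r} under H
```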
6.5 First Principles Test and LRT
Theorem 6.1: Let $\mathbf K^T\mathbf b$ be a set of linearly independent estimable functions. Also, let $\hat{\mathbf b}_H$ be part of a solution to the R.N.E.s with constraint $\mathbf K^T\mathbf b=\mathbf m$ (instead of $\mathbf P^T\mathbf b=\pmb\delta$). Then $$ Q(\hat{\mathbf b}_H)-Q(\hat{\mathbf b})=(\hat{\mathbf b}_H-\hat{\mathbf b})^T\mathbf X^T\mathbf X(\hat{\mathbf b}_H-\hat{\mathbf b})\\=(\mathbf K^T\hat{\mathbf b}-\mathbf m)^T\left[\mathbf K^T(\mathbf X^T\mathbf X)^g\mathbf K\right]^{-1}(\mathbf K^T\hat{\mathbf b}-\mathbf m) $$ (That is, the two tests are the same)
proof: $$ \begin{aligned} Q(\hat{\mathbf b}_H)&=(\mathbf y-\mathbf X\hat{\mathbf b}_H)^T(\mathbf y-\mathbf X\hat{\mathbf b}_H)\\ &=(\mathbf y-\mathbf X\hat{\mathbf b}+\mathbf X\hat{\mathbf b}-\mathbf X\hat{\mathbf b}_H)^T(\mathbf y-\mathbf X\hat{\mathbf b}+\mathbf X\hat{\mathbf b}-\mathbf X\hat{\mathbf b}_H)\\ &=Q(\hat{\mathbf b})+(\hat{\mathbf b}-\hat{\mathbf b}_H)^T\mathbf X^T\mathbf X(\hat{\mathbf b}-\hat{\mathbf b}_H)+2(\hat{\mathbf b}-\hat{\mathbf b}_H)^T\underbrace{\mathbf X^T(\mathbf y-\mathbf X\hat{\mathbf b})}_{=\mathbf 0} \end{aligned} $$ R.N.E.s: $$ \begin{pmatrix}\mathbf X^T\mathbf X&\mathbf K\\\mathbf K^T&\mathbf 0\end{pmatrix}\begin{pmatrix}\mathbf b\\\pmb\theta\end{pmatrix}=\begin{pmatrix}\mathbf X^T\mathbf y\\\mathbf m\end{pmatrix} $$
$$ \mathbf X^T\mathbf X(\hat{\mathbf b}-\hat{\mathbf b}_H)=\mathbf X^T\mathbf y-(\mathbf X^T\mathbf y-\mathbf K\hat{\pmb\theta}_H)=\mathbf K\hat{\pmb\theta}_H\quad(\star) $$
Premultiplying $(\star)$ by $\mathbf K^T(\mathbf X^T\mathbf X)^g$ yields $$ \mathbf K^T(\mathbf X^T\mathbf X)^g\mathbf X^T\mathbf X(\hat{\mathbf b}-\hat{\mathbf b}_H)=\mathbf K^T(\mathbf X^T\mathbf X)^g\mathbf K\hat{\pmb\theta}_H $$ But $\left[\mathbf K^T(\mathbf X^T\mathbf X)^g\mathbf X^T\mathbf X\right]^T=\mathbf X^T\mathbf X[(\mathbf X^T\mathbf X)^g]^T\mathbf K=\mathbf K$ since columns of $\mathbf K$ are in $\mathcal C(\mathbf X^T)$. Thus, $$ \mathbf K^T(\hat{\mathbf b}-\hat{\mathbf b}_H)=\mathbf K^T(\mathbf X^T\mathbf X)^g\mathbf K\hat{\pmb\theta}_H\\ \Rightarrow \left[\mathbf K^T(\mathbf X^T\mathbf X)^g\mathbf K\right]^{-1}\mathbf K^T(\hat{\mathbf b}-\hat{\mathbf b}_H)=\hat{\pmb\theta}_H. $$ Now, premultiplying $(\star)$ by $(\hat{\mathbf b}-\hat{\mathbf b}_H)^T$ yields $$ \begin{aligned} Q(\hat{\mathbf b}_H)-Q(\hat{\mathbf b})&=(\hat{\mathbf b}-\hat{\mathbf b}_H)^T\mathbf X^T\mathbf X(\hat{\mathbf b}-\hat{\mathbf b}_H)\\ &=(\hat{\mathbf b}-\hat{\mathbf b}_H)^T\mathbf K\hat{\pmb\theta}_H\\ &=(\hat{\mathbf b}-\hat{\mathbf b}_H)^T\mathbf K\left[\mathbf K^T(\mathbf X^T\mathbf X)^g\mathbf K\right]^{-1}\mathbf K^T(\hat{\mathbf b}-\hat{\mathbf b}_H)\\ &=(\mathbf K^T\hat{\mathbf b}-\mathbf K^T\hat{\mathbf b}_H)^T\left[\mathbf K^T(\mathbf X^T\mathbf X)^g\mathbf K\right]^{-1}(\mathbf K^T\hat{\mathbf b}-\mathbf K^T\hat{\mathbf b}_H)\\ &=(\mathbf K^T\hat{\mathbf b}-\mathbf m)^T\left[\mathbf K^T(\mathbf X^T\mathbf X)^g\mathbf K\right]^{-1}(\mathbf K^T\hat{\mathbf b}-\mathbf m) \end{aligned} $$
Corollary 6.4: Let $\mathbf K^T\mathbf b$ be a set of L.I. estimable functions and $\hat{\mathbf b}$ a solution to the N.E.s. We can find $\hat{\mathbf b}_H$ by solving for $\mathbf b$ in the following equation: $$ \mathbf X^T\mathbf X\mathbf b=\mathbf X^T\mathbf y-\mathbf K\left[\mathbf K^T(\mathbf X^T\mathbf X)^g\mathbf K\right]^{-1}[\mathbf K^T\hat{\mathbf b}-\mathbf m] $$
proof: From R.N.E.s, we have $$ \mathbf X^T\mathbf X\hat{\mathbf b}_H+\mathbf K\hat{\pmb\theta}_H=\mathbf X^T\mathbf y $$ But we know from the proof of Theorem 6.1 that $$ \hat{\pmb\theta}_H=\left[\mathbf K^T(\mathbf X^T\mathbf X)^g\mathbf K\right]^{-1}\mathbf K^T(\hat{\mathbf b}-\hat{\mathbf b}_H) $$ Now plug in, and rearrange.
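*Numerical sketch: Corollary 6.4 gives a direct recipe for $\hat{\mathbf b}_H$ without solving the full R.N.E. system (assuming numpy; `restricted_solution` is a hypothetical helper, and $\mathbf K^T\mathbf b$ is assumed estimable so that the system below is consistent):

```python
import numpy as np

def restricted_solution(y, X, K, m):
    """One solution b_hat_H of the R.N.E.s via Corollary 6.4; a sketch
    (K^T b assumed estimable, so X^T X b = rhs below is a consistent system)."""
    XtX = X.T @ X
    XtX_g = np.linalg.pinv(XtX)
    b_hat = XtX_g @ X.T @ y                     # unrestricted solution of the N.E.s
    H = K.T @ XtX_g @ K
    rhs = X.T @ y - K @ np.linalg.solve(H, K.T @ b_hat - m)
    b_hat_H = XtX_g @ rhs                       # a solution of X^T X b = rhs
    return b_hat, b_hat_H

# Check of Theorem 6.1: Q(b_hat_H) - Q(b_hat) should equal
# (K^T b_hat - m)^T H^{-1} (K^T b_hat - m).
```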
Scheffé's method for dealing with multiple comparisons
Suppose $\mathbf A$ is a positive definite matrix and let $\mathbf w$ be a vector. We want $$ \max_{\mathbf x\not=0}\frac{(\mathbf x^T\mathbf w)^2}{\mathbf x^T\mathbf A\mathbf x} $$
$$ \frac{(\mathbf x^T\mathbf w)^2}{\mathbf x^T\mathbf A\mathbf x}=\frac{\langle\mathbf A^{1/2}\mathbf x,\mathbf A^{-1/2}\mathbf w\rangle^2}{|\mathbf A^{1/2}\mathbf x|^2}\stackrel{\text{C-S}}{\leq}|\mathbf A^{-1/2}\mathbf w|^2 $$
and equality is attained when $\mathbf A^{1/2}\mathbf x\propto\mathbf A^{-1/2}\mathbf w$, i.e., when $\mathbf x\propto\mathbf A^{-1}\mathbf w$. $$ \Rightarrow\max_{\mathbf x\not=0}\frac{(\mathbf x^T\mathbf w)^2}{\mathbf x^T\mathbf A\mathbf x}=\mathbf w^T\mathbf A^{-1}\mathbf w\quad(\star\star) $$ Now suppose that $\tau=\pmb\Lambda^T\mathbf b$ where $\pmb\Lambda_{p\times s}$ has L.I. columns in $\mathcal C(\mathbf X^T)$, and let $\hat\tau=\pmb\Lambda^T\hat{\mathbf b}$. Define $$ R(\mathbf u,c)=\left(\mathbf u^T\hat\tau-c\hat\sigma\sqrt{\mathbf u^T\mathbf H\mathbf u},\mathbf u^T\hat\tau+c\hat\sigma\sqrt{\mathbf u^T\mathbf H\mathbf u}\right) $$ where $\mathbf H=\pmb\Lambda^T(\mathbf X^T\mathbf X)^g\pmb\Lambda$, $\hat{\sigma}^2=SSE/(N-r)$, $\mathbf u\in\R^s$, $c>0$.
Goal: Find $c>0$ s.t. $$ P\left(\mathbf u^T\tau\in R(\mathbf u,c),\forall\mathbf u\in\R^s\right)=1-\alpha. $$ If the above holds, we can form individual C.I.s for as many linear combinations of the $\tau_j$’s as we like, and the whole collection holds simultaneously with probability $1-\alpha$.
First $$ P(\mathbf u^T\tau\in R(\mathbf u,c))=P\left(\left|\frac{\mathbf u^T\tau-\mathbf u^T\hat{\tau}}{\hat\sigma\sqrt{\mathbf u^T\mathbf H\mathbf u}}\right|\leq c\right). $$ Thus, $$ \begin{aligned} P(\mathbf u^T\tau\in R(\mathbf u,c),\forall \mathbf u\in\R^s)&=P\left(\max_{\mathbf u\in\R^s,\mathbf u\not=\mathbf 0}\left|\frac{\mathbf u^T\tau-\mathbf u^T\hat{\tau}}{\hat\sigma\sqrt{\mathbf u^T\mathbf H\mathbf u}}\right|\leq c\right)\\ &=P\left(\max_{\mathbf u\in\R^s,\mathbf u\not=\mathbf 0}\frac{[\mathbf u^T(\tau-\hat\tau)]^2}{\hat{\sigma}^2\mathbf u^T\mathbf H\mathbf u}\leq c^2\right)\\ &\stackrel{(\star\star)}{=}P\left((\hat\tau-\tau)^T\mathbf H^{-1}(\hat\tau-\tau)/\hat\sigma^2\leq c^2\right)\\ &=P\left((\hat\tau-\tau)^T\mathbf H^{-1}(\hat\tau-\tau)/s\hat\sigma^2\leq\frac{c^2}{s}\right) \end{aligned} $$ But $$ \frac{(\hat\tau-\tau)^T\mathbf H^{-1}(\hat\tau-\tau)/s}{\hat\sigma^2}\sim F_{s,N-r} $$ So, all we need to do is set $\frac{c^2}{s}=F_{s,N-r,\alpha}$, so that $P(\mathbf u^T\tau\in R(\mathbf u,c),\forall\mathbf u\in\R^s)=1-\alpha$. This yields $$ c=\sqrt{sF_{s,N-r,\alpha}} $$
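*Numerical sketch: a Scheffé interval for any $\mathbf u^T\tau$ only needs $\hat\tau$, $\mathbf H$, $\hat\sigma^2$, and the critical value $c=\sqrt{sF_{s,N-r,\alpha}}$ (assuming scipy; `scheffe_interval` is a hypothetical helper). Because the same $c$ works for every $\mathbf u$, all such intervals hold simultaneously with probability $1-\alpha$.

```python
import numpy as np
from scipy.stats import f as f_dist

def scheffe_interval(u, tau_hat, H, sigma2_hat, s, dfe, alpha=0.05):
    """Scheffe interval R(u, c) for u^T tau; a sketch.
    tau_hat = Lambda^T b_hat, H = Lambda^T (X^T X)^g Lambda,
    sigma2_hat = SSE/(N - r), dfe = N - r."""
    c = np.sqrt(s * f_dist.ppf(1 - alpha, s, dfe))      # c = sqrt(s F_{s, N-r, alpha})
    half = c * np.sqrt(sigma2_hat * (u @ H @ u))
    center = u @ tau_hat
    return center - half, center + half
```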