2 Multiple Linear Regression: Theory

2.1 Some Useful Matrix Properties

A concise treatment of the multiple linear regression model relies quite a bit on matrix algebra. An overview of some matrix background material is presented below.

2.1.1 The Mean and Variance of Random Vectors

Definition 2.1 Suppose that \(X = (\rv X n)\) is a random vector. The mean of \(X\) denoted \(\E[X]\) is defined as \[ \E[X] = (\E[X_1], \ldots, \E[X_n]), \] and the variance matrix (also: covariance matrix, variance-covariance matrix) of \(X\) denoted \(\var(X)\) is defined as \[ \var(X) = \begin{bmatrix} \var(X_1) & \cov(X_1, X_2) & \cdots & \cov(X_1, X_n) \\ \cov(X_2,X_1) & \var(X_2) & \cdots & \cov(X_2, X_n) \\ \vdots & \vdots & \ddots & \vdots \\ \cov(X_n, X_1) & \cov(X_n, X_2) & \cdots & \var(X_n) \end{bmatrix}. \]
Remark. From now on, any (deterministic or random) \(n\)-dimensional vector \(a \in \mathbb R^n\) will be written as \(a = (\rv a n)\) when it's not involved in a matrix calculation. When we switch to matrix algebra, such vectors default to \((n\times 1)\) column vectors, \(a = a_{n\times 1}\).

It turns out that the mean and variance of a linear combination of the \(X_i\), \(y = \sum_{i=1}^n a_i X_i\), can be obtained from just the elements of \(\E[X]\) and \(\var(X)\): \[ \E[y] = \sum_{i=1}^n a_i \E[X_i], \qquad \var(y) = \sum_{i=1}^n a_i^2 \var(X_i) + \sum_{i\neq j} a_ia_j\cov(X_i, X_j). \] In fact, it will be extremely convenient for later to represent these quantities using matrix algebra. That is, consider \(X\), \(\mu = \E[X]\), and \(a = (\rv a n)\) as \((n\times 1)\) column vectors and let \(V = \var(X)\) denote the \(n\times n\) variance matrix. Then \(y = a' X\), and \[ \E[y] = a'\mu, \qquad \var(y) = a' V a. \]

Now consider a random vector of \(k\) linear combinations of \(X\): \[\begin{align*} Y_1 & = a_{11} X_1 + a_{12} X_2 + \cdots + a_{1n} X_n \\ Y_2 & = a_{21} X_1 + a_{22} X_2 + \cdots + a_{2n} X_n \\ & \vdots \\ Y_k & = a_{k1} X_1 + a_{k2} X_2 + \cdots + a_{kn} X_n. \end{align*}\] Then the column vector \(Y = (\rv Y k)\) can be expressed as the matrix-vector product \(Y = A X\), where \(A\) is a \((k\times n)\) matrix with elements \([A]_{ij} = a_{ij}\). Moreover, we can express the mean and variance matrix of \(Y\) with matrix algebra as well. To do this, recall the linearity property of covariance: \[ \cov\left(\sum_{i=1}^n a_i X_i + c, \sum_{j=1}^n b_j X_j + d\right) = \sum_{i=1}^n\sum_{j=1}^n a_ib_j \cov(X_i, X_j). \] By applying this property to each of the elements of \(\var(Y)\), we obtain \[\begin{equation} \E[Y] = A \mu, \qquad \var(Y) = A V A'. \tag{2.1} \end{equation}\]
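To see (2.1) in action, here's a small numerical sketch in R. The mean vector, the matrices B and A, and the choice of (non-normal) random inputs below are all arbitrary, picked only for illustration.

# numerical check of E[Y] = A mu and var(Y) = A V A'
mu <- c(1, -2, 3)                  # mean of X (n = 3)
B <- matrix(rnorm(9), 3, 3)        # arbitrary matrix used to construct X
V <- B %*% t(B)                    # variance of X = B Z + mu when var(Z) = I
A <- matrix(rnorm(6), 2, 3)        # k = 2 linear combinations of X

nreps <- 1e5
Z <- matrix(rexp(3*nreps) - 1, 3, nreps)  # iid entries with mean 0 and variance 1
X <- B %*% Z + mu                         # each column is one draw of X
Y <- A %*% X                              # Y = A X

rowMeans(Y)       # sample mean of Y: compare to
c(A %*% mu)
var(t(Y))         # sample variance matrix of Y: compare to
A %*% V %*% t(A)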

Remark. At first, it's easy to forget whether the transpose in the variance formula goes on the first or the second term. The trick is to make sure that the dimensions of the matrix multiplication agree. Since \(V = \var(X)\) is \((n\times n)\), it must be premultiplied by a matrix with \(n\) columns, which is \(A\). For the single linear combination case, we had \(\var(y) = a' V a\) with the transpose on the first term, but this is because we considered \(a\) as an \((n\times 1)\) column vector: the matrix multiplication \(a V a'\) is not even defined, since the dimensions don't agree.

2.1.2 Two Matrix Decompositions

Definition 2.2 The variance matrix \(V\) of a random vector \(X\) has two important properties:

  1. \(V\) must be a symmetric matrix, since \(\cov(X_i, X_j) = \cov(X_j, X_i)\).
  2. For any vector \(a \in \mathbb R^d\) with \(a\neq 0\) (i.e., at least one of the elements of \(a\) is nonzero), we must have \(a'V a \ge 0\). This is because the linear combination \(y = a'X\) is a random variable which has non-negative variance: \(\var(y) = a'V a \ge 0\).
If, in addition to these two properties, we have \(a'Va = 0 \iff a = 0\), then \(V\) is called positive definite.

Variance matrices have two very important decompositions which will come in handy later:

  1. Cholesky Decomposition: Any variance matrix \(V\) can be uniquely written as

    \[\begin{equation} V = L L', \tag{2.2} \end{equation}\]

    where \(L\) is a lower triangular matrix with non-negative entries \(L_{ii} \ge 0\) on the diagonal. When \(V\) is positive-definite then \(L_{ii} > 0\).

  2. Eigen Decomposition: Any variance matrix \(V\) can be uniquely written as

    \[\begin{equation} V = \Gamma' D \Gamma, \tag{2.3} \end{equation}\]

    where \(\Gamma\) is an orthogonal matrix, i.e. \(\Gamma^{-1} = \Gamma'\), and \(D = \left[\begin{smallmatrix} \lambda_1 & & 0 \\& \ddots & \\ 0 & & \lambda_d \end{smallmatrix}\right]\) is a diagonal matrix with \(\lambda_i \ge 0\). When \(V\) is positive-definite then \(\lambda_i > 0\).

When \(V\) is a positive-definite variance matrix, either of these matrix decompositions can be used to calculate the inverse of \(V\): \[\begin{align*} V^{-1} & = (LL')^{-1} = (L')^{-1}L^{-1} = (L^{-1})'L^{-1} \\ & = (\Gamma'D\Gamma)^{-1} = \Gamma^{-1}D^{-1}(\Gamma')^{-1} = \Gamma'\left[\begin{smallmatrix} \lambda_1^{-1} & & 0 \\& \ddots & \\ 0 & & \lambda_d^{-1} \end{smallmatrix}\right] \Gamma. \end{align*}\]
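In R, these decompositions are computed with chol() and eigen(). Here's a quick sketch with an arbitrary positive-definite V; note that chol() returns the upper triangular factor \(L'\), and the matrix of eigenvectors returned by eigen() plays the role of \(\Gamma'\) in the notation above.

# an arbitrary positive-definite variance matrix
V <- crossprod(matrix(rnorm(16), 4, 4)) + diag(4)

# Cholesky decomposition: chol(V) returns the upper triangular factor L'
L <- t(chol(V))
range(L %*% t(L) - V)         # should be numerically zero

# eigendecomposition: eigen() returns V = G D G', i.e., G = Gamma'
ed <- eigen(V, symmetric = TRUE)
G <- ed$vectors
D <- diag(ed$values)
range(G %*% D %*% t(G) - V)   # should be numerically zero

# inverse of V via either decomposition, compared to solve(V)
Vi1 <- t(solve(L)) %*% solve(L)          # (L^{-1})' L^{-1}
Vi2 <- G %*% diag(1/ed$values) %*% t(G)  # Gamma' D^{-1} Gamma
range(Vi1 - solve(V))
range(Vi2 - solve(V))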

Remark. Let’s assume that all variance matrices are positive-definite, unless explicitly stated otherwise. We’ll only need variance matrices which are not positive-definite later in these notes.

2.1.3 Block Multiplication

Suppose that we have two so-called block matrices \[ M_1 = \bordermatrix{r & s}{p \\ q}{\begin{bmatrix} A & B \\ C & D\end{bmatrix}}, \quad M_2 = \bordermatrix{t & u}{r \\ s}{\begin{bmatrix} E & F \\ G & H\end{bmatrix}}, \] where the lower case letters indicate block dimensions. As long as the block dimensions agree, i.e., the column blocks of \(M_1\) have the same dimensions (\(r\) and \(s\)) as the corresponding row blocks of \(M_2\), we can multiply these matrices exactly as we would in the \((2\times 2)\) case: \[ M_1 M_2 = \begin{bmatrix} A_{p\times r} & B_{p\times s} \\ C_{q\times r} & D_{q\times s} \end{bmatrix} \begin{bmatrix} E_{r\times t} & F_{r\times u} \\ G_{s\times t} & H_{s\times u} \end{bmatrix} = \bordermatrix{t & u}{p \\ q}{\begin{bmatrix} AE + BG & AF + BH \\ CE + DG & CF + DH\end{bmatrix}}. \] This works for other block partitions as well, e.g., a \((3\times 5)\) block matrix times a \((5 \times 2)\) block matrix: as long as the block dimensions agree, the formula is the same as for matrices composed of scalars.
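The block formula is easy to verify numerically. The block dimensions below (p = 2, q = 3, r = 2, s = 4, t = 3, u = 1) are arbitrary.

# block multiplication check (note: assigning to F masks the FALSE shortcut)
A <- matrix(rnorm(2*2), 2, 2); B <- matrix(rnorm(2*4), 2, 4)
C <- matrix(rnorm(3*2), 3, 2); D <- matrix(rnorm(3*4), 3, 4)
E <- matrix(rnorm(2*3), 2, 3); F <- matrix(rnorm(2*1), 2, 1)
G <- matrix(rnorm(4*3), 4, 3); H <- matrix(rnorm(4*1), 4, 1)

M1 <- rbind(cbind(A, B), cbind(C, D))    # (p+q) x (r+s)
M2 <- rbind(cbind(E, F), cbind(G, H))    # (r+s) x (t+u)

M12 <- rbind(cbind(A %*% E + B %*% G, A %*% F + B %*% H),
             cbind(C %*% E + D %*% G, C %*% F + D %*% H))
range(M1 %*% M2 - M12)                   # should be numerically zero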

2.1.4 Complete-The-Square

Recall the complete-the-square formula in one dimension: \[ ax^2 - 2bx + c = a(x-b/a)^2 + c - b^2/a. \] In \(n\) dimensions, there is a similar formula which will be useful later.

Suppose that \(A_{n\times n} = [a_{ij}]\) is a (positive-definite) variance matrix and consider the quadratic \[ Q(x) = x'Ax - 2b'x + c = \sum_{1\le i,j \le n} a_{ij} x_i x_j - 2 \sum_{k=1}^n b_k x_k + c. \] Then it’s straightforward to check that \[ Q(x) = (x-\mu)'A(x-\mu) + c - \mu'A\mu, \where \mu = A^{-1}b. \] By writing \(Q(x)\) this way we can see that \(\argmin_x Q(x) = \mu\). This is because \(c\) and \(\mu'A\mu\) do not depend on \(x\), and \(y'Ay \ge 0\) for any value of \(y\), with \(y'Ay = 0 \iff y = 0\), such that \((x-\mu)'A(x-\mu)\) has its unique minimum at \(x = \mu\).
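Here's a quick numerical sketch of the \(n\)-dimensional formula, checking the identity at a random point and the minimizer against optim(). All inputs (A, b, and the constant, called cc to avoid masking R's c() function) are arbitrary.

# numerical check of the n-dimensional complete-the-square formula
n <- 3
A <- crossprod(matrix(rnorm(n*n), n, n)) + diag(n)   # positive-definite A
b <- rnorm(n)
cc <- rnorm(1)                                       # the constant c
mu <- solve(A, b)                                    # mu = A^{-1} b

Q <- function(x) c(t(x) %*% A %*% x - 2 * t(b) %*% x + cc)                    # original form
Q2 <- function(x) c(t(x - mu) %*% A %*% (x - mu) + cc - t(mu) %*% A %*% mu)   # completed square

x <- rnorm(n)
Q(x) - Q2(x)                        # should be numerically zero

optim(par = rep(0, n), fn = Q)$par  # numerical minimizer: approximately equal to
mu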

2.2 The Multivariate Normal Distribution

Definition 2.3 A random vector \(Y = (\rv Y d)\) is said to have a multivariate normal distribution if it is a linear combination of iid standard normals. That is, there exists:

  • a random vector \(Z = (\rv Z k)\), \(Z_i \iid \N(0,1)\),
  • a \((d \times k)\) matrix \(A\) and vector \(\mu = (\rv \mu d)\)
such that \(Y = AZ + \mu\). Thus we have \(\E[Y] = \mu\) and \(\var(Y) = AA' = V\), and the multivariate normal distribution is denoted \(Y \sim \N_d(\mu, V)\) or simply \(\N(\mu, V)\).

2.2.1 Properties of the Multivariate Normal

Example 2.1 Let \(Z = (\rv Z d)\) be a collection of iid standard normals: \(Z_i \iid \N(0,1)\). Then \(Z\) is trivially multivariate normal, and we denote this \(Z \sim \N(0, I)\). The pdf of \(Z\) is \[ f_Z(z) = \prod_{i=1}^d \frac{1}{\sqrt{2\pi}} \exp(-\tfrac 1 2 z_i^2). \] More generally, if \(Y \sim \N(\mu, V)\) and \(V\) is invertible, then the PDF of \(y\) is \[ f_Y(y) = \frac{1}{(2\pi)^{d/2}} \exp\left\{-\frac 1 2 (y - \mu)' V^{-1} (y-\mu) - \frac 1 2 \log \abs V \right\}, \where \abs V = \tx{det}(V). \] This shows that the mean and variance of a multivariate normal distribution completely determine its PDF. That is, if for \(i=1,2\) we have \(Y_i \sim \N(\mu_i, V_i)\), and \(\mu_1 = \mu_2\) and \(V_1 = V_2\), then \(Y_1\) and \(Y_2\) have the same PDF (though of course this doesn’t mean that \(Y_1 = Y_2\)).

Example 2.2 If \(X \sim \N(\mu, V)\), then \(Y = CX + d \sim\N(C\mu + d, CVC')\).

This is because we have \(X = AZ + \mu\), where \(Z \sim \N(0, I)\), and \(Y = CAZ + (C\mu + d)\).

As a consequence of this, if \(V = LL'\) is the Cholesky decomposition, then \(Y = L^{-1}X-L^{-1}\mu\) is normal with \(Y \sim \N(0, I)\), since \(L^{-1}LL'(L^{-1})' = L^{-1}LL'(L')^{-1} = I\). In other words, \(Y = L^{-1}(X-\mu)\) is an explicit transformation of \(X\) into iid standard normals.
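The sketch below simulates \(X \sim \N(\mu, V)\) using the Cholesky factor and then checks that \(Y = L^{-1}(X-\mu)\) behaves like a vector of iid standard normals; mu and V are arbitrary.

# simulate X ~ N(mu, V) and decorrelate it with the Cholesky factor
d <- 3
mu <- c(-1, 0, 2)
V <- crossprod(matrix(rnorm(d*d), d, d)) + diag(d)  # arbitrary positive-definite V
L <- t(chol(V))                                     # V = L L'

nreps <- 1e5
Z <- matrix(rnorm(d*nreps), d, nreps)
X <- L %*% Z + mu           # each column is a draw of X ~ N(mu, V)

Y <- solve(L) %*% (X - mu)  # Y = L^{-1}(X - mu)
rowMeans(Y)                 # should be close to 0
var(t(Y))                   # should be close to the identity matrix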

Example 2.3 If \(X = (\rv X d) \sim \N(\mu, V)\) is multivariate normal, then any subset \(\tilde X = (X_{i_1}, X_{i_2}, \ldots, X_{i_m})\), with \(i_j \in \{1, \ldots, d\}\) is also multivariate normal. The mean and variance of this subset is obtained by reading off the appropriate elements of \(\mu\) and \(V\).

To illustrate the general case, let \(d = 4\) and \(\tilde X = (X_4, X_2)\). Then \[ \tilde X = A_{2\times 4} X, \qquad A = \begin{bmatrix} 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 \end{bmatrix}, \] such that \(\tilde X\) is multivariate normal, and \[ \E[\tilde X] = A\mu = (\mu_4, \mu_2), \qquad \var(\tilde X) = A V A' = \begin{bmatrix} V_{44} & V_{42} \\ V_{24} & V_{22} \end{bmatrix}. \]
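In R, the variance of the subset can be read off directly by indexing the full variance matrix; the 4-by-4 matrix V below is arbitrary.

# variance of (X4, X2) computed two equivalent ways
V <- crossprod(matrix(rnorm(16), 4, 4))        # arbitrary 4x4 variance matrix
A <- rbind(c(0, 0, 0, 1),
           c(0, 1, 0, 0))
range(A %*% V %*% t(A) - V[c(4, 2), c(4, 2)])  # should be zero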

Example 2.4 Let \(X_1 \sim \N(0,1)\) and \(B \sim \textrm{Bernoulli}(\sfrac 1 2)\) be independent of \(X_1\). Now consider the random variable \[ X_2 = \begin{cases} X_1 & B = 0, \\ -X_1 & B = 1. \end{cases} \] Convince yourself that \(X_2 \sim \N(0,1)\) by recalling that it is defined by a 50-50 mixture of two \(\N(0,1)\) PDF’s, or by running the following commands:

# simulate X2
N <- 1e6
X1 <- rnorm(N) # N(0,1)
B <- rbinom(N, size = 1, prob = .5) # Bernoulli(1/2)
X2 <- ifelse(B == 0, X1, -X1) # if B==0 X2 <- X1, else X2 <- -X1

# histogram of X2
hist(X2, breaks = 100, freq = FALSE) # histogram
curve(dnorm, add = TRUE, col = "red") # superimpose PDF of N(0,1)

Now, let \(Y = X_1 + X_2\). Then \(Y\) is not normal, since normals are continuous distributions but \(\Pr(Y = 0) = \Pr(B = 1) = \sfrac 1 2\). So what went wrong?

Answer: While \(X_1\) and \(X_2\) are each normally distributed, \(X = (X_1, X_2)\) is not multivariate normal. That is, it’s impossible to find \(Z = (\rv Z k)\), \(Z_i \iid \N(0,1)\) and \(A_{2 \times k}\) such that \(X = (X_1, X_2) = A Z\).
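As with \(X_2\) above, this can be checked by simulation (re-simulating X1, B, and X2 so that the snippet stands alone):

# Y = X1 + X2 has a point mass at zero: Pr(Y = 0) = Pr(B = 1) = 1/2
N <- 1e6
X1 <- rnorm(N)
B <- rbinom(N, size = 1, prob = .5)
X2 <- ifelse(B == 0, X1, -X1)
Y <- X1 + X2
mean(Y == 0)                                      # should be close to 0.5
hist(Y[Y != 0], breaks = 100, freq = FALSE)       # continuous part: 2*X1 given B == 0
curve(dnorm(x, sd = 2), add = TRUE, col = "red")  # N(0, 4) PDF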

Example 2.5 Suppose that \(X \sim \N(\mu, V)\) and \(V_{ij} = 0\). Then \(X_i\) and \(X_j\) are independent.

To see this, assume without loss of generality that \(i=1\) and \(j = 2\). Then using the result of Example 2.3, we have \[ \begin{bmatrix} X_1 \\ X_2 \end{bmatrix} \sim \N\left(\begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \begin{bmatrix} V_{11} & 0 \\ 0 & V_{22} \end{bmatrix} \right). \] Assuming that \(V_{11}\) and \(V_{22}\) are invertible, the joint pdf of \((X_1, X_2)\) is \[\begin{align*} f(x_1, x_2) & = \frac{1}{2\pi \sqrt{V_{11}V_{22}}} \exp\left\{-\frac 1 2 \begin{bmatrix} x_1-\mu_1 & x_2 - \mu_2\end{bmatrix} \begin{bmatrix} V_{11}^{-1} & 0 \\ 0 & V_{22}^{-1} \end{bmatrix} \begin{bmatrix} x_1-\mu_1 \\ x_2-\mu_2 \end{bmatrix} \right\} \\ & = \frac{1}{2\pi \sqrt{V_{11}V_{22}}} \exp\left\{-\frac 1 2 \frac{(x_1-\mu_1)^2}{V_{11}} - \frac 1 2 \frac{(x_2-\mu_2)^2}{V_{22}}\right\} \\ & = \frac{1}{ \sqrt{2\pi V_{11}}} \exp\left\{-\frac 1 2 \frac{(x_1-\mu_1)^2}{V_{11}}\right\} \times \frac{1}{ \sqrt{2\pi V_{22}}} \exp\left\{-\frac 1 2 \frac{(x_2-\mu_2)^2}{V_{22}}\right\} \\ & = f_{X_1}(x_1) \times f_{X_2}(x_2). \end{align*}\]

This can be generalized as follows. Suppose that \(X = (\rv X p)\) and \(Y = (\rv Y q)\) are jointly multivariate normal, with \[ \begin{bmatrix} X \\ Y \end{bmatrix} \sim \N\left(\begin{bmatrix} \mu_X \\ \mu_Y \end{bmatrix}, \begin{bmatrix} V_{X} & 0 \\ 0 & V_{Y} \end{bmatrix} \right). \] Then \(X\) and \(Y\) are independent random vectors, i.e., \(g(X) \amalg h(Y)\) for any functions \(g\) and \(h\).

2.2.2 Summary of Properties of the Multivariate Normal

If \(X = (\rv X d) \sim \N(\mu, V)\) is multivariate normal, then:

  • The mean and variance \(\mu\) and \(V\) completely determine the PDF of \(X\).
  • Any linear combination \(Y = AX + b\) is also multivariate normal with \(Y \sim \N(A\mu + b, AVA')\).
  • If \(V = LL'\), the linear combination \(Z = (\rv Z d) = L^{-1}(X - \mu)\) is a collection of iid standard normals: \(Z_i \iid \N(0,1)\).
  • Any combination of the elements of \(X\), say \(Y = (\rv Y m) = (\rv [i_1] X {i_m})\) for some set of indices \(I = (\rv i m)\), \(1 \le i_j \le d\), is also multivariate normal. The means and covariances of the elements in \(Y\) are \(\E[Y_j] = \mu_{i_j}\) and \(\cov(Y_j, Y_k) = V_{i_j,i_k}\).
  • If \(I\) is a set of indices as above, and \(\cov(X_{i_j}, X_{i_k}) = 0\) for every \(i_j,i_k \in I\) such that \(j\neq k\), then \(\rv [i_1] X {i_m}\) are independent random variables.

2.3 Multiple Linear Regression

The multiple linear regression model is a straightforward extension of simple linear regression which allows for more than one covariate. That is, each response variable is still \(y_i\), but the covariate vector for subject \(i\) is \(x_i = (x_{i1}, \ldots, x_{ip})\). The model in error form is \[\begin{equation} y_i = x_i'\beta + \eps_i, \qquad \eps_i \iid \N(0, \sigma^2), \tag{2.4} \end{equation}\] with \(\beta = (\rv \beta p)\). However, for what follows it is more convenient to write (2.4) as a multivariate normal. That is, let \[ y = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}, \quad X = \begin{bmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & & \vdots \\ x_{n1} & \cdots & x_{np} \end{bmatrix} = \begin{bmatrix} x_1' \\ \vdots \\ x_n' \end{bmatrix}, \quad \eps = \begin{bmatrix} \eps_1 \\ \vdots \\ \eps_n \end{bmatrix}. \] Then (2.4) can be rewritten in matrix form as \(y = X\beta + \eps\) with \(\eps \sim \N(0, \sigma^2 I)\), or equivalently, \[\begin{equation} y \sim \N(X\beta, \sigma^2 I). \tag{2.5} \end{equation}\]
Definition 2.4 In model (2.5) above, \(X\) is called the design matrix, and \(y\) is called the response vector.
Example 2.6 With \(p = 2\), \(x_{i1} = 1\), \(x_{i2} = x_i\), and \(\beta = (\beta_0, \beta_1)\), (2.4) reduces to \(y_i = \beta_0 + \beta_1 x_{i} + \eps_i\), which is just the simple linear regression model with one covariate.

Example 2.7 A pharmaceutical company is designing a new drug to reduce blood pressure. The drug contains ingredients \(A\), \(B\), and \(C\), and the pharmaceutical company wishes to understand how each ingredient affects blood pressure outcomes. Therefore, a clinical trial is conducted with \(n = 100\) patients with similar medical characteristics and initial blood pressure level. Each patient is administered a different concentration of the three active ingredients and the change in blood pressure after 10 days of usage is recorded. In this setting, we might model the change in blood pressure \(y_i\) in patient \(i\) as \[ y_i = \beta_A x_{A,i} + \beta_B x_{B,i} + \beta_C x_{C,i} + \eps_i, \] where \(x_{A,i}\) is the concentration of ingredient \(A\) administered to patient \(i\), and so on.

An interesting feature of this example is that unlike Dr. Gladstone’s brain data, here the covariates \(x_A\), \(x_B\), and \(x_C\) are truly fixed in that they are controlled within the experiment. That is, we can imagine giving exactly the same concentration of each ingredient to a number of patients and there would be some variability in their response \(y_i\). This example is conceptually closer to the notion of \(p(y \mid x)\) with fixed \(x\) which we’ve been assuming all along.

2.3.1 Summary of Upcoming Results

In the following few sections, we’ll be proving the following propositions:

  • The MLE of \(\beta\) is \(\hat \beta = (X'X)^{-1}X'y\), and its distribution is

    \[ \hat \beta \sim \N\big(\beta, \sigma^2(X'X)^{-1}\big). \]

  • The MLE of \(\sigma^2\) is

    \[ \hat \sigma^2_{(\tx{ML})} = \frac{\sum_{i=1}^n(y_i - x_i'\hat \beta)^2}{n} = \frac{e'e}{n}, \]

    where \(e = y - X\hat\beta\).

  • The unbiased estimator of \(\sigma^2\) is

    \[ \hat \sigma^2 = \frac{e'e}{n-p}. \]

    The distribution of the unbiased estimator is

    \[ \frac{n-p}{\sigma^2} \cdot \hat \sigma^2 \sim \chi^2_{(n-p)}. \]

  • The random vector \(\hat \beta\) and the random variable \(\hat \sigma\) are independent.

2.3.2 The MLE of \(\bm \beta\)

The likelihood function is \[ \mathcal L(\beta, \sigma^2 \mid y, X) \propto p(y \mid X, \beta, \sigma^2) \propto \exp\left\{-\frac 1 2 \frac{(y - X\beta)'(y - X\beta)}{\sigma^2} - \frac n 2 \log(\sigma^2) \right\}. \] One way of finding the MLE of \(\beta\) is to take derivatives and set them to zero. Another is to complete the square as in Section 2.1.4.

Proposition 2.1 Suppose that the design matrix \(X\) has linearly independent columns. In practice, this is almost always the case as long as \(p \le n\). Then the loglikelihood function of the multiple regression model (2.5) can be written as \[\begin{equation} \ell(\beta, \sigma \mid y, X) = -\frac 1 2 \frac{(\beta-\hat \beta)'X'X(\beta-\hat\beta) + e'e}{\sigma^2} - \frac n 2 \log(\sigma^2), \tag{2.6} \end{equation}\] where \(\hat \beta = (X'X)^{-1}X'y\) and \(e = y - X\hat\beta\).

Before proving Proposition 2.1, note that (2.6) easily shows that \(\hat \beta\) is the MLE of \(\beta\). This is because \(X'X\) is symmetric, and since \(X\) has linearly independent columns, \(b = Xa = 0 \iff a = 0\), such that \(a'X'Xa = b'b \ge 0\), with equality if and only if \(a = 0\). Therefore, \(X'X\) is positive definite. Since \(e'e\) and \(\log(\sigma^2)\) do not depend on \(\beta\), the loglikelihood is maximized when \((\beta-\hat\beta)'X'X(\beta-\hat\beta)\) is minimized, which happens at the unique value \(\beta = \hat \beta\).

Proof. Expanding \((y-X\beta)'(y-X\beta)\) and completing the square, we obtain \[\begin{align*} (y-X\beta)'(y-X\beta) & = y'y - 2\beta'X'y + \beta' X'X \beta \\ & = \beta'X'X\beta - 2 \beta'(X'X)\underbrace{(X'X)^{-1}X'y}_{\hat \beta} + y'y \\ & = (\beta - \hat \beta)'X'X(\beta - \hat \beta) + y'y - \hat \beta'X'X\hat\beta. \end{align*}\] Moreover, \[\begin{align*} e'e = (y-X\hat\beta)'(y - X\hat \beta) & = y'y - 2\hat\beta'X'y + \hat\beta'X'X\hat\beta \\ & = y'y -2\hat\beta'(X'X)(X'X)^{-1}X'y + \hat\beta'X'X\hat\beta \\ & = y'y - 2\hat\beta'X'X\hat\beta + \hat \beta'X'X\hat\beta = y'y - \hat\beta'X'X\hat\beta, \end{align*}\] such that \[ (y-X\beta)'(y-X\beta) = (\beta-\hat\beta)'X'X(\beta-\hat\beta) + e'e. \] Therefore, the loglikelihood function is \[\begin{align*} \ell(\beta, \sigma \mid y, X) & = -\frac 1 2 \frac{(y - X\beta)'(y - X\beta)}{\sigma^2} - \frac n 2 \log(\sigma^2) \\ & = -\frac 1 2 \frac{(\beta-\hat \beta)'X'X(\beta-\hat\beta) + e'e}{\sigma^2} - \frac n 2 \log(\sigma^2). \end{align*}\] which yields (2.6).

2.3.2.1 Distribution of \(\hat \beta\)

Since \(\hat \beta\) is a linear combination of the \(y \sim \N(X\beta, \sigma^2 I)\), it is also normal with \[\begin{align*} \E[\hat \beta] & = (X'X)^{-1}X'\E[y] = (X'X)^{-1}X'X\beta = \beta, \\ \var(\hat \beta) & = (X'X)^{-1}X'\var(y)X(X'X)^{-1} = \sigma^2 (X'X)^{-1}X'X(X'X)^{-1} = \sigma^2(X'X)^{-1}. \end{align*}\] That is, \(\hat \beta \sim \N\big(\beta, \sigma^2 (X'X)^{-1}\big)\).
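The sketch below checks the formula for \(\hat \beta\) against R's lm() on a simulated dataset; the design matrix, \(\beta\), and \(\sigma\) are arbitrary choices.

# MLE of beta computed directly and via lm()
set.seed(1)
n <- 100
p <- 3
X <- cbind(1, matrix(rnorm(n*(p-1)), n, p-1))  # design matrix with an intercept column
beta <- c(2, -1, 0.5)
sigma <- 1.5
y <- c(X %*% beta) + rnorm(n, sd = sigma)      # y ~ N(X beta, sigma^2 I)

beta_hat <- solve(t(X) %*% X, t(X) %*% y)      # (X'X)^{-1} X'y
cbind(beta_hat, coef(lm(y ~ X - 1)))           # the two columns should agree

sigma^2 * solve(t(X) %*% X)  # true var(beta_hat); vcov(lm(y ~ X - 1)) uses sigma_hat^2 instead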

2.3.3 Maximum Likelihood vs. Unbiased Estimator of \(\bm \sigma^2\)

To find the MLE of \(\sigma^2\), we take the derivative of the loglikelihood with respect to \(\sigma^2\) at \(\beta = \hat \beta\) and set it to zero: \[ \del {\sigma^2} \ell(\hat \beta, \sigma \mid y, X) = \frac 1 2 \frac{e'e}{(\sigma^2)^2} - \frac n {2\sigma^2} = 0 \implies \hat\sigma^2_\tx{ML} = \frac{e'e}{n} = \frac{\sum_{i=1}^n(y_i - x_i' \hat \beta)^2}{n}. \] However, it turns out that the unbiased estimator of \(\sigma^2\) has \(n-p\) rather than \(n\) in the denominator. To show this, let’s proceed in a few steps.
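To see the bias of the ML estimator in action, here's a small simulation sketch (the design matrix and parameter values are arbitrary):

# MLE vs unbiased estimator of sigma^2 over many simulated datasets
set.seed(2)
n <- 20
p <- 3
X <- cbind(1, matrix(rnorm(n*(p-1)), n, p-1))
beta <- c(2, -1, 0.5)
sigma <- 1.5

nreps <- 1e4
s2_ml <- s2_unb <- rep(NA, nreps)
for(ii in 1:nreps) {
  y <- c(X %*% beta) + rnorm(n, sd = sigma)
  e <- y - X %*% solve(t(X) %*% X, t(X) %*% y)  # residual vector
  s2_ml[ii] <- sum(e^2) / n                     # MLE
  s2_unb[ii] <- sum(e^2) / (n - p)              # unbiased estimator
}
mean(s2_ml)   # biased low: close to (n-p)/n * sigma^2 = 1.9125
mean(s2_unb)  # close to sigma^2 = 2.25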

Proposition 2.2 Consider the residual vector \(e = (\rv e n) = y - X\hat \beta\). That is, \(e_i = y_i - x_i'\hat\beta\). Then \((\hat \beta, e)\) is multivariate normal \[\begin{equation} \begin{bmatrix} \hat \beta \\ e\end{bmatrix} \sim \N\left(\begin{bmatrix} \beta \\ 0 \end{bmatrix}, \sigma^2 \begin{bmatrix} (X'X)^{-1} & 0 \\ 0 & I - H\end{bmatrix} \right), \tag{2.7} \end{equation}\] where \(H = X(X'X)^{-1}X'\) is the so-called Hat matrix. In particular, since \((\hat \beta, e)\) are multivariate normal and \(\cov(\hat \beta, e) = 0\), we have \(\hat \beta \amalg e\).
Proof. We have \(\hat \beta = (X'X)^{-1}X'y\) and \[ e = y - X \hat \beta = y - X(X'X)^{-1}X'y = (I_n - X(X'X)^{-1}X') y = (I - H) y. \] Therefore, the random vector \((\hat \beta, e)\) is a linear combination of the multivariate normal random vector \(y \sim \N(X\beta, \sigma^2 I)\), and so it is also normal. All that’s left to check are its mean and variance. We already calculated \(\E[\hat \beta] = \beta\) above, and \[\begin{align*} \E[e] = (I - H)\E[y] & = (I-H)X\beta \\ & = X\beta - X\ixtx X'X\beta = X\beta - X\beta = 0. \end{align*}\] The variance matrix is given by \[\begin{equation} \begin{split} \var\left(\begin{bmatrix} \hat \beta \\ e \end{bmatrix} \right) & = \begin{bmatrix} \cov(\hat \beta) & \cov(\hat \beta, e) \\ \cov(e, \hat \beta) & \var(e) \end{bmatrix} \\ & = \sigma^2 \begin{bmatrix} (X'X)^{-1}X' \\ I - H \end{bmatrix} I \begin{bmatrix} {[}(X'X)^{-1}X'{]}' & [I-H]' \end{bmatrix} \\ & = \sigma^2 \begin{bmatrix} (X'X)^{-1}X'X(X'X)^{-1} & (X'X)^{-1}X'(I-H') \\ (I-H)X(X'X)^{-1} & (I-H)(I-H') \end{bmatrix}. \end{split} \tag{2.8} \end{equation}\] The variance of \(\hat \beta\) was previously calculated as \(\var(\hat \beta) = \sigma^2\ixtx\). The covariance between \(\hat \beta\) and \(e\) is \[\begin{align*} \cov(\hat \beta, e) & = (X'X)^{-1}X' (I - H)' \\ & = (X'X)^{-1}X'(I - X(X'X)^{-1}X') \\ & = (X'X)^{-1}X' - (X'X)^{-1}X'X(X'X)^{-1}X' \\ & = (X'X)^{-1}X' - (X'X)^{-1}X' = 0_{p\times n}. \end{align*}\] Finally, note that \(H = X(X'X)^{-1}X'\) is a symmetric matrix, and in fact \[ H^2 = X(X'X)^{-1}X'[X(X'X)^{-1}X'] = X(X'X)^{-1}X' = H. \] A symmetric matrix \(H\) with \(H^2 = H\) is called idempotent. In fact it is easy to check that \(I-H\) is also symmetric and \((I-H)^2 = I - 2H + H^2 = I-H\), such that \(I-H\) is also idempotent. Using this result, the variance of \(e\) is calculated as \[ \var(e) = \sigma^2(I-H)^2 = \sigma^2(I-H), \] which completes the proof of Proposition 2.2.
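The algebraic facts used in the proof are easy to check numerically for any particular design matrix; the X below is arbitrary.

# numerical check that H is idempotent and that cov(beta_hat, e) = 0
set.seed(3)
n <- 10
p <- 3
X <- cbind(1, matrix(rnorm(n*(p-1)), n, p-1))  # arbitrary design matrix
H <- X %*% solve(t(X) %*% X) %*% t(X)          # hat matrix

range(H - t(H))      # H is symmetric
range(H %*% H - H)   # H is idempotent
range(solve(t(X) %*% X) %*% t(X) %*% (diag(n) - H))  # (X'X)^{-1} X'(I - H) = 0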

We are now ready to find the distribution of the sum-of-squared residuals.

Proposition 2.3 The distribution of the sum-of-squared residuals \(e'e = \sum_{i=1}^n(y_i - x_i'\hat\beta)^2\) is chi-squared: \[ \frac 1 {\sigma^2} e'e \sim \chi^2_{(n-p)}. \] In particular, this shows that \(\hat \sigma^2 = e'e/(n-p)\) is an unbiased estimator of \(\sigma^2\).

Proof. The variance matrix of \(e\) is \(\var(e) = \sigma^2(I - H)\). In deriving this above, we showed that \((I-H)\) is symmetric and that \((I-H)^2 = (I-H)\).

Let’s now consider the eigen decomposition \[ I-H = \Gamma'D\Gamma, \where \Gamma^{-1} = \Gamma', \quad D = \left[\begin{smallmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_n \end{smallmatrix}\right]. \] Then for \(\tilde e = \Gamma e\), we have \(\E[\tilde e] = 0\) and \(\var(\tilde e) = \sigma^2 \Gamma(I-H)\Gamma' = \sigma^2 \Gamma \Gamma' D \Gamma \Gamma'\), such that \[ \tilde e_i \ind \N(0, \sigma^2 \lambda_i). \] Moreover, \(\tilde e' \tilde e = e'\Gamma'\Gamma e = e'e\). Therefore, \(e'e/\sigma^2 = \sum_{i=1}^n Z_i^2\), where \(Z_i = \tilde e_i/\sigma \ind \N(0,\lambda_i)\). To complete the proof, we’ll show that exactly \((n-p)\) of the \(\lambda_i\) equal 1, and the remaining \(p\) of the \(\lambda_i\) equal 0. Thus, exactly \((n-p)\) of the \(Z_i \sim \N(0,1)\) and exactly \(p\) of the \(Z_i =0\), such that \(e'e/\sigma^2\) is the sum of \((n-p)\) squares of iid standard normals, which is \(\chi^2_{(n-p)}\).

To show the desired property of the \(\lambda_i\)’s, note that \[\begin{align*} (I-H)^2 = (\Gamma'D\Gamma)(\Gamma'D\Gamma) & = \Gamma'D^2 \Gamma \\ & = I-H = \Gamma'D\Gamma. \end{align*}\] That is, \(D^2 = D \iff \lambda_i^2 = \lambda_i\), which implies that \(\lambda_i = 0 \tx{ or } 1\).

Next, we’ll show that \(\sum_{i=1}^n \lambda_i = n-p\). To do this, recall the trace of a matrix, \(\tr(A_{n\times n}) = \sum_{i=1}^n A_{ii}\), and the trace property that \(\tr(AB) = \tr(B A)\) as long as the matrix multiplications have the right dimensions. Therefore, \[ \tr(I-H) = \tr(\Gamma'D\Gamma) = \tr(D\Gamma\Gamma') = \tr(D) = \sum_{i=1}^n \lambda_i, \] and \[\begin{align*} \tr(I-H) = \tr(I_n) - \tr(H) & = n - \tr(X(X'X)^{-1}X') \\ & = n - \tr((X'X)^{-1}X'X) = n - \tr(I_p) = n-p. \end{align*}\] Thus, \(\sum_{i=1}^n \lambda_i = n-p\), such that exactly \((n-p)\) of them are 1’s and the other \(p\) are 0’s, which completes the proof.
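The eigenvalue argument can also be checked numerically: for any design matrix with linearly independent columns, \(I-H\) should have exactly \(n-p\) unit eigenvalues and \(p\) zero eigenvalues. The X below is arbitrary.

# eigenvalues of I - H: exactly n-p ones and p zeros
set.seed(4)
n <- 10
p <- 3
X <- cbind(1, matrix(rnorm(n*(p-1)), n, p-1))
H <- X %*% solve(t(X) %*% X) %*% t(X)
lambda <- eigen(diag(n) - H, symmetric = TRUE)$values
round(lambda, digits = 10)  # n-p ones followed by p zeros
sum(lambda)                 # trace of I-H = n - p
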
Remark. While \(e = (\rv e n) \sim \N(0, \sigma^2(I-H))\) is multivariate normal, \(\lvert I-H \rvert = 0\) and so \(e\) does not have a PDF. To gain some insight as to why this might be the case, let’s consider a very basic regression model with \(p = 1\): \[ y_i \iid \N(\beta_0, \sigma^2). \] In this case, the MLE of \(\beta_0\) is \(\hat \beta_0 = \bar y\), and \(e_i = y_i - \bar y\). However, \(\sum_{i=1}^n e_i = \sum_{i=1}^n y_i - n \bar y = 0\). So if \(\rv e {n-1}\) are given, then we must have \(e_n = -\sum_{i=1}^{n-1} e_i\). In other words, the residuals for this model have only \(n-1\) “degrees of freedom”. It turns out that for the general regression model with \(p\) covariates the residuals have only \(n-p\) degrees of freedom. That is, if \(\rv e {n-p}\) are given, then \(\rv [n-p+1] e n\) are completely determined.