2 Multiple Linear Regression: Theory
2.1 Some Useful Matrix Properties
A concise treatment of the multiple linear regression model relies heavily on matrix algebra. An overview of the necessary background material is presented below.
2.1.1 The Mean and Variance of Random Vectors
Let \(X = (\rv X n)\) denote a random vector with mean vector \(\E[X] = (\E[X_1], \ldots, \E[X_n])\) and variance matrix \(\var(X)\), whose \((i,j)\) element is \(\cov(X_i, X_j)\). It turns out that the mean and variance of a linear combination of the \(X_i\), \(y = \sum_{i=1}^n a_i X_i\), can be obtained from just the elements of \(\E[X]\) and \(\var(X)\): \[ \E[y] = \sum_{i=1}^n a_i \E[X_i], \qquad \var(y) = \sum_{i=1}^n a_i^2 \var(X_i) + \sum_{i\neq j} a_ia_j\cov(X_i, X_j). \] In fact, it will be extremely convenient for later to represent these quantities using matrix algebra. That is, consider \(X\), \(\mu = \E[X]\), and \(a = (\rv a n)\) as \((n\times 1)\) column vectors and let \(V = \var(X)\) denote the \(n\times n\) variance matrix. Then \(y = a' X\), and \[ \E[y] = a'\mu, \qquad \var(y) = a' V a. \]
Now consider a random vector of \(k\) linear combinations of \(X\): \[\begin{align*} Y_1 & = a_{11} X_1 + a_{12} X_2 + \cdots + a_{1n} X_n \\ Y_2 & = a_{21} X_1 + a_{22} X_2 + \cdots + a_{2n} X_n \\ & \vdots \\ Y_k & = a_{k1} X_1 + a_{k2} X_2 + \cdots + a_{kn} X_n. \end{align*}\] Then the column vector \(Y = (\rv Y k)\) can be expressed as the matrix-vector product \(Y = A X\), where \(A\) is a \((k\times n)\) matrix with elements \([A]_{ij} = a_{ij}\). Moreover, we can express the mean and variance matrix of \(Y\) with matrix algebra as well. To do this, recall the linearity property of covariance: \[ \cov\left(\sum_{i=1}^n a_i X_i + c, \sum_{j=1}^n b_j X_j + d\right) = \sum_{i=1}^n\sum_{j=1}^n a_ib_j \cov(X_i, X_j). \] By applying this property to each of the elements of \(\var(Y)\), we obtain \[\begin{equation} \E[Y] = A \mu, \qquad \var(Y) = A V A'. \tag{2.1} \end{equation}\]
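As a numerical check of (2.1), the following R commands simulate draws of \(X = \mu + CZ\), where the \(Z_i\) are iid \(\N(0,1)\), so that \(\E[X] = \mu\) and \(\var(X) = CC'\), and then compare the sample mean and variance of \(Y = AX\) to \(A\mu\) and \(AVA'\). (The particular \(\mu\), \(C\), and \(A\) below are arbitrary choices for illustration.)

# numerical check of E[Y] = A mu and var(Y) = A V A'
set.seed(1)
mu <- c(1, 2, 3)                                    # E[X]
C <- rbind(c(1, 0, 0), c(.5, 1, 0), c(.2, -.3, 1))  # arbitrary mixing matrix
V <- C %*% t(C)                                     # var(X) when X = mu + C Z
A <- rbind(c(1, 1, 1), c(2, 0, -1))                 # coefficients of the linear combinations
nsim <- 1e5
Z <- matrix(rnorm(3 * nsim), 3, nsim)               # iid N(0,1) innovations
X <- t(mu + C %*% Z)                                # nsim draws of X, one per row
Y <- X %*% t(A)                                     # each row is a draw of Y = A X
colMeans(Y) - c(A %*% mu)                           # approximately zero
var(Y) - A %*% V %*% t(A)                           # approximately zero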
2.1.2 Two Matrix Decompositions
Definition 2.2 The variance matrix \(V\) of a random vector \(X\) has two important properties:
- \(V\) must be a symmetric matrix, since \(\cov(X_i, X_j) = \cov(X_j, X_i)\).
- For any vector \(a \in \mathbb R^d\) with \(a\neq 0\) (i.e., at least one of the elements of \(a\) is nonzero), we must have \(a'V a \ge 0\). This is because the linear combination \(y = a'X\) is a random variable which has non-negative variance: \(\var(y) = a'V a \ge 0\).
Variance matrices have two very important decompositions which will come in handy later:
Cholesky Decomposition: Any variance matrix \(V\) can be written as
\[\begin{equation} V = L L', \tag{2.2} \end{equation}\]
where \(L\) is a lower triangular matrix with non-negative entries \(L_{ii} \ge 0\) on the diagonal. When \(V\) is positive-definite, the decomposition is unique and \(L_{ii} > 0\).
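In R, the Cholesky factor can be computed with the built-in chol() function. Note that chol() returns the upper triangular factor \(U = L'\) satisfying \(V = U'U\), so \(L\) is obtained by transposing; the matrix V below is just an arbitrary positive-definite example.

V <- matrix(c(4, 2, 0, 2, 3, 1, 0, 1, 2), 3, 3)  # an example variance matrix
U <- chol(V)           # upper triangular factor: V = U'U
L <- t(U)              # lower triangular factor: V = L L'
L                      # diagonal entries are positive since V is positive-definite
range(L %*% t(L) - V)  # reconstruction error: essentially zero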
Eigen Decomposition: Any variance matrix \(V\) can be written as
\[\begin{equation} V = \Gamma' D \Gamma, \tag{2.3} \end{equation}\]
where \(\Gamma\) is an orthogonal matrix, i.e. \(\Gamma^{-1} = \Gamma'\), and \(D = \left[\begin{smallmatrix} \lambda_1 & & 0 \\& \ddots & \\ 0 & & \lambda_d \end{smallmatrix}\right]\) is a diagonal matrix with \(\lambda_i \ge 0\). When \(V\) is positive-definite then \(\lambda_i > 0\).
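The eigen decomposition is available via R's eigen(). Its vectors component stores the eigenvectors as columns, which in the notation of (2.3) corresponds to \(\Gamma'\); the same example V is used for illustration.

V <- matrix(c(4, 2, 0, 2, 3, 1, 0, 1, 2), 3, 3)  # same example variance matrix
ed <- eigen(V, symmetric = TRUE)
ed$values                            # eigenvalues: all positive since V is positive-definite
Gamma <- t(ed$vectors)               # so that V = Gamma' D Gamma
D <- diag(ed$values)
range(t(Gamma) %*% D %*% Gamma - V)  # reconstruction error: essentially zero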
When \(V\) is a positive-definite variance matrix, either of these matrix decompositions can be used to calculate the inverse of \(V\): \[\begin{align*} V^{-1} & = (LL')^{-1} = (L')^{-1}L^{-1} = (L^{-1})'L^{-1} \\ & = (\Gamma'D\Gamma)^{-1} = \Gamma^{-1}D^{-1}(\Gamma')^{-1} = \Gamma'\left[\begin{smallmatrix} \lambda_1^{-1} & & 0 \\& \ddots & \\ 0 & & \lambda_d^{-1} \end{smallmatrix}\right] \Gamma. \end{align*}\]
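Both routes to \(V^{-1}\) are easy to check against R's general-purpose solve(); the helper chol2inv() performs the Cholesky-based inversion directly. (Again, V below is an arbitrary positive-definite example.)

V <- matrix(c(4, 2, 0, 2, 3, 1, 0, 1, 2), 3, 3)       # example positive-definite V
L <- t(chol(V))                                       # Cholesky factor
Vinv_chol <- t(solve(L)) %*% solve(L)                 # (L^{-1})' L^{-1}
ed <- eigen(V, symmetric = TRUE)
Gamma <- t(ed$vectors)
Vinv_eig <- t(Gamma) %*% diag(1/ed$values) %*% Gamma  # Gamma' D^{-1} Gamma
range(Vinv_chol - solve(V))                           # essentially zero
range(Vinv_eig - solve(V))                            # essentially zero
range(chol2inv(chol(V)) - solve(V))                   # same result, computed more efficiently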
2.1.3 Block Multiplication
Suppose that we have two so-called block matrices \[ M_1 = \bordermatrix{r & s}{p \\ q}{\begin{bmatrix} A & B \\ C & D\end{bmatrix}}, \quad M_2 = \bordermatrix{t & u}{r \\ s}{\begin{bmatrix} E & F \\ G & H\end{bmatrix}}, \] where the lower case letters indicate block dimensions. As long as the block dimensions agree, i.e., the column blocks of \(M_1\) have the same sizes (\(r\) and \(s\)) as the row blocks of \(M_2\), then we can multiply these matrices exactly as we would in the \((2\times 2)\) case: \[ M_1 M_2 = \begin{bmatrix} A_{p\times r} & B_{p\times s} \\ C_{q\times r} & D_{q\times s} \end{bmatrix} \begin{bmatrix} E_{r\times t} & F_{r\times u} \\ G_{s\times t} & H_{s\times u} \end{bmatrix} = \bordermatrix{t & u}{p \\ q}{\begin{bmatrix} AE + BG & AF + BH \\ CE + DG & CF + DH\end{bmatrix}}. \] Note that this works for other block dimensions as well, e.g., a \((3\times 5)\) block matrix times a \((5 \times 2)\) block matrix: as long as the block dimensions agree, the formula is the same as for matrices composed of scalars.
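A quick sanity check in R: build \(M_1\) and \(M_2\) from randomly generated blocks (the block sizes below are arbitrary; t_ is used instead of t to avoid masking R's transpose function) and compare the whole-matrix product with the block formula.

set.seed(1)
p <- 2; q <- 3; r <- 2; s <- 4; t_ <- 3; u <- 2       # block dimensions
A <- matrix(rnorm(p * r), p, r); B <- matrix(rnorm(p * s), p, s)
C <- matrix(rnorm(q * r), q, r); D <- matrix(rnorm(q * s), q, s)
E <- matrix(rnorm(r * t_), r, t_); F <- matrix(rnorm(r * u), r, u)
G <- matrix(rnorm(s * t_), s, t_); H <- matrix(rnorm(s * u), s, u)
M1 <- rbind(cbind(A, B), cbind(C, D))
M2 <- rbind(cbind(E, F), cbind(G, H))
blockprod <- rbind(cbind(A %*% E + B %*% G, A %*% F + B %*% H),
                   cbind(C %*% E + D %*% G, C %*% F + D %*% H))
range(M1 %*% M2 - blockprod)                          # essentially zero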
2.1.4 Complete-The-Square
Recall the complete-the-square formula in one dimension: \[ ax^2 - 2bx + c = a(x-b/a)^2 + c - b^2/a. \] In \(n\) dimensions, we have a similar formula which will be useful later.
Suppose that \(A_{n\times n} = [a_{ij}]\) is a positive-definite variance matrix and consider the quadratic \[ Q(x) = x'Ax - 2b'x + c = \sum_{1\le i,j \le n} a_{ij} x_i x_j - 2 \sum_{k=1}^n b_k x_k + c. \] Then it’s straightforward to check that \[ Q(x) = (x-\mu)'A(x-\mu) + c - \mu'A\mu, \where \mu = A^{-1}b. \] By writing \(Q(x)\) this way we can see that \(\argmin_x Q(x) = \mu\). This is because \(c\) and \(\mu'A\mu\) do not depend on \(x\), and \(y'Ay \ge 0\) for any value of \(y\), with \(y'Ay = 0 \iff y = 0\), such that \((x-\mu)'A(x-\mu)\) has its unique minimum at \(x = \mu\).
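The identity and the minimizer are easy to verify numerically. The sketch below uses an arbitrary positive-definite \(A\), vector \(b\), constant \(c\), and test point \(x\), and also checks that a numerical optimizer lands near \(\mu = A^{-1}b\).

set.seed(1)
n <- 3
A <- crossprod(matrix(rnorm(n * n), n, n)) + diag(n)  # an arbitrary positive-definite matrix
b <- rnorm(n); cc <- rnorm(1)                         # cc used instead of c (the R function)
Q <- function(x) c(t(x) %*% A %*% x - 2 * t(b) %*% x + cc)
mu <- solve(A, b)                                     # A^{-1} b
x <- rnorm(n)                                         # an arbitrary test point
Q(x) - (t(x - mu) %*% A %*% (x - mu) + cc - t(mu) %*% A %*% mu)  # essentially zero
range(optim(rep(0, n), Q)$par - mu)                   # numerical minimizer is close to mu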
2.2 The Multivariate Normal Distribution
Definition 2.3 A random vector \(Y = (\rv Y d)\) is said to have a multivariate normal distribution if it is a linear combination of iid standard normals. That is, there exists:
- a random vector \(Z = (\rv Z k)\), \(Z_i \iid \N(0,1)\),
- a \((d \times k)\) matrix \(A\) and a vector \(\mu = (\rv \mu d)\),

such that \(Y = AZ + \mu\). In this case, \(\E[Y] = \mu\) and \(\var(Y) = AA' = V\), and we write \(Y \sim \N(\mu, V)\).
2.2.1 Properties of the Multivariate Normal
Example 2.2 If \(X \sim \N(\mu, V)\), then \(Y = CX + d \sim\N(C\mu + d, CVC')\).
This is because we have \(X = AZ + \mu\), where \(Z \sim \N(0, I)\), and \(Y = CAZ + (C\mu + d)\).
As a consequence of this, if \(V = LL'\) is the Cholesky decomposition, then \(Y = L^{-1}X - L^{-1}\mu\) is normal with \(Y \sim \N(0, I)\), since \(L^{-1}LL'(L^{-1})' = L^{-1}LL'(L')^{-1} = I\). In other words, \(Y = L^{-1}(X-\mu)\) is an explicit transformation of \(X\) into iid standard normals.

Example 2.3 If \(X = (\rv X d) \sim \N(\mu, V)\) is multivariate normal, then any subset \(\tilde X = (X_{i_1}, X_{i_2}, \ldots, X_{i_m})\), with \(i_j \in \{1, \ldots, d\}\), is also multivariate normal. The mean and variance of this subset are obtained by reading off the appropriate elements of \(\mu\) and \(V\).
To illustrate the general case, let \(d = 4\) and \(\tilde X = (X_4, X_2)\). Then \[ \tilde X = A_{2\times 4} X, \qquad A = \begin{bmatrix} 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 \end{bmatrix}, \] such that \(\tilde X\) is multivariate normal, and \[ \E[\tilde X] = A\mu = (\mu_4, \mu_2), \qquad \var(\tilde X) = A V A' = \begin{bmatrix} V_{44} & V_{42} \\ V_{24} & V_{22} \end{bmatrix}. \]

Example 2.4 Let \(X_1 \sim \N(0,1)\) and \(B \sim \textrm{Bernoulli}(\sfrac 1 2)\) be independent of \(X_1\). Now consider the random variable \[ X_2 = \begin{cases} X_1 & B = 0, \\ -X_1 & B = 1. \end{cases} \] Convince yourself that \(X_2 \sim \N(0,1)\) by recalling that it is defined by a 50-50 mixture of two \(\N(0,1)\) PDFs, or by running the following commands:
# simulate X2
N <- 1e6
X1 <- rnorm(N) # N(0,1)
B <- rbinom(N, size = 1, prob = .5) # Bernoulli(1/2)
X2 <- ifelse(B == 0, X1, -X1) # if B==0 X2 <- X1, else X2 <- -X1
# histogram of X2
hist(X2, breaks = 100, freq = FALSE) # histogram
curve(dnorm, add = TRUE, col = "red") # superimpose PDF of N(0,1)
Now, let \(Y = X_1 + X_2\). Then \(Y\) is not normal, since normals are continuous distributions but \(\Pr(Y = 0) = \Pr(B = 1) = \sfrac 1 2\). So what went wrong?
Answer: While \(X_1\) and \(X_2\) are each normally distributed, \(X = (X_1, X_2)\) is not multivariate normal. That is, it’s impossible to find \(Z = (\rv Z k)\), \(Z_i \iid \N(0,1)\) and \(A_{2 \times k}\) such that \(X = (X_1, X_2) = A Z\).
Example 2.5 Suppose that \(X \sim \N(\mu, V)\) and \(V_{ij} = 0\). Then \(X_i\) and \(X_j\) are independent.
To see this, assume without loss of generality that \(i=1\) and \(j = 2\). Then using the result of Example 2.3, we have \[ \begin{bmatrix} X_1 \\ X_2 \end{bmatrix} \sim \N\left(\begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \begin{bmatrix} V_{11} & 0 \\ 0 & V_{22} \end{bmatrix} \right). \] Assuming that \(V_{11}\) and \(V_{22}\) are invertible, the joint pdf of \((X_1, X_2)\) is \[\begin{align*} f(x_1, x_2) & = \frac{1}{2\pi \sqrt{V_{11}V_{22}}} \exp\left\{-\frac 1 2 \begin{bmatrix} x_1-\mu_1 & x_2 - \mu_2\end{bmatrix} \begin{bmatrix} V_{11}^{-1} & 0 \\ 0 & V_{22}^{-1} \end{bmatrix} \begin{bmatrix} x_1-\mu_1 \\ x_2-\mu_2 \end{bmatrix} \right\} \\ & = \frac{1}{2\pi \sqrt{V_{11}V_{22}}} \exp\left\{-\frac 1 2 \frac{(x_1-\mu_1)^2}{V_{11}} - \frac 1 2 \frac{(x_2-\mu_2)^2}{V_{22}}\right\} \\ & = \frac{1}{ \sqrt{2\pi V_{11}}} \exp\left\{-\frac 1 2 \frac{(x_1-\mu_1)^2}{V_{11}}\right\} \times \frac{1}{ \sqrt{2\pi V_{22}}} \exp\left\{-\frac 1 2 \frac{(x_2-\mu_2)^2}{V_{22}}\right\} \\ & = f_{X_1}(x_1) \times f_{X_2}(x_2). \end{align*}\]
This can be generalized as follows. Suppose that \(X = (\rv X p)\) and \(Y = (\rv Y q)\) are jointly multivariate normal, with \[ \begin{bmatrix} X \\ Y \end{bmatrix} \sim \N\left(\begin{bmatrix} \mu_X \\ \mu_Y \end{bmatrix}, \begin{bmatrix} V_{X} & 0 \\ 0 & V_{Y} \end{bmatrix} \right). \] Then \(X\) and \(Y\) are independent random vectors, i.e., \(g(X) \amalg h(Y)\) for any functions \(g\) and \(h\).

2.2.2 Summary of Properties of the Multivariate Normal
If \(X = (\rv X d) \sim \N(\mu, V)\) is multivariate normal, then:
- The mean and variance \(\mu\) and \(V\) completely determine the PDF of \(X\).
- Any linear combination \(Y = AX + b\) is also multivariate normal with \(Y \sim \N(A\mu + b, AVA')\).
- If \(V = LL'\), the linear combination \(Z = (\rv Z d) = L^{-1}(X - \mu)\) is a collection of iid standard normals: \(Z_i \iid \N(0,1)\).
- Any combination of the elements of \(X\), say \(Y = (\rv Y m) = (\rv [i_1] X {i_m})\) for some set of indices \(I = (\rv i m)\), \(1 \le i_j \le d\), is also multivariate normal. The means and covariances of the elements in \(Y\) are \(\E[Y_j] = \mu_{i_j}\) and \(\cov(Y_j, Y_k) = V_{i_j,i_k}\).
- If \(I\) is a set of indices as above, and \(\cov(X_{i_j}, X_{i_k}) = 0\) for every \(i_j,i_k \in I\) such that \(j\neq k\), then \(\rv [i_1] X {i_m}\) are independent random variables.
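These properties are straightforward to check by simulation. The sketch below uses a hand-picked \(\mu\) and \(V\) (arbitrary choices for illustration), draws \(X \sim \N(\mu, V)\) via the Cholesky factor of \(V\), and verifies the second and third properties empirically.

set.seed(1)
mu <- c(1, -1, 2)
V <- matrix(c(4, 2, 0, 2, 3, 1, 0, 1, 2), 3, 3)   # positive-definite variance matrix
L <- t(chol(V))                                   # V = L L'
nsim <- 1e5
Z <- matrix(rnorm(3 * nsim), 3, nsim)             # iid N(0,1)
X <- t(mu + L %*% Z)                              # nsim draws of X ~ N(mu, V), one per row

# property 2: Y = A X + b is N(A mu + b, A V A')
A <- rbind(c(1, 0, -1), c(2, 1, 0)); b <- c(0, 5)
Y <- t(A %*% t(X) + b)
colMeans(Y) - c(A %*% mu + b)                     # approximately zero
var(Y) - A %*% V %*% t(A)                         # approximately zero

# property 3: Z = L^{-1}(X - mu) recovers iid standard normals
Zhat <- t(solve(L) %*% (t(X) - mu))
colMeans(Zhat)                                    # approximately zero
var(Zhat)                                         # approximately the identity matrix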
2.3 Multiple Linear Regression
The multiple linear regression model is a straightforward extension of simple linear regression which allows for more than one covariate. That is, each response variable is still \(y_i\), but the covariate vector for subject \(i\) is \(x_i = (x_{i1}, \ldots, x_{ip})\). The model in error form is \[\begin{equation} y_i = x_i'\beta + \eps_i, \qquad \eps_i \iid \N(0, \sigma^2), \tag{2.4} \end{equation}\] with \(\beta = (\rv \beta p)\). However, for what follows it is more convenient to write (2.4) as a multivariate normal. That is, let \[ y = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}, \quad X = \begin{bmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & & \vdots \\ x_{n1} & \cdots & x_{np} \end{bmatrix} = \begin{bmatrix} x_1' \\ \vdots \\ x_n' \end{bmatrix}, \quad \eps = \begin{bmatrix} \eps_1 \\ \vdots \\ \eps_n \end{bmatrix}. \] Then (2.4) can be rewritten in matrix form as \(y = X\beta + \eps\) with \(\eps \sim \N(0, \sigma^2 I)\), or equivalently, \[\begin{equation} y \sim \N(X\beta, \sigma^2 I). \tag{2.5} \end{equation}\]

Example 2.7 A pharmaceutical company is designing a new drug to reduce blood pressure. The drug contains ingredients \(A\), \(B\), and \(C\), and the pharmaceutical company wishes to understand how each ingredient affects blood pressure outcomes. Therefore, a clinical trial is conducted with \(n = 100\) patients with similar medical characteristics and initial blood pressure levels. Each patient is administered a different concentration of the three active ingredients, and the change in blood pressure after 10 days of usage is recorded. In this setting, we might model the change in blood pressure \(y_i\) in patient \(i\) as \[ y_i = \beta_A x_{A,i} + \beta_B x_{B,i} + \beta_C x_{C,i} + \eps_i, \] where \(x_{A,i}\) is the concentration of ingredient \(A\) administered to patient \(i\), and so on.
An interesting feature of this example is that unlike Dr. Gladstone’s brain data, here the covariates \(x_A\), \(x_B\), and \(x_C\) are truly fixed in that they are controlled within the experiment. That is, we can imagine giving exactly the same concentration of each ingredient to a number of patients and there would be some variability in their response \(y_i\). This example is conceptually closer to the notion of \(p(y \mid x)\) with fixed \(x\) which we’ve been assuming all along.

2.3.1 Summary of Upcoming Results
In the following few sections, we’ll be proving the following propositions:
The MLE of \(\beta\) is \(\hat \beta = (X'X)^{-1}X'y\), and its distribution is
\[ \hat \beta \sim \N\big(\beta, \sigma^2(X'X)^{-1}\big). \]
The MLE of \(\sigma^2\) is
\[ \hat \sigma^2_{(\tx{ML})} = \frac{\sum_{i=1}^n(y_i - x_i'\hat \beta)^2}{n} = \frac{e'e}{n}, \]
where \(e = y - X\hat\beta\).
The unbiased estimator of \(\sigma^2\) is
\[ \hat \sigma^2 = \frac{e'e}{n-p}. \]
The distribution of the unbiased estimator is
\[ \frac{n-p}{\sigma^2} \cdot \hat \sigma^2 \sim \chi^2_{(n-p)}. \]
The random vector \(\hat \beta\) and the random variable \(\hat \sigma^2\) are independent.
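Before proving these, here is a minimal numerical sketch checking the formulas for \(\hat\beta\) and \(\hat\sigma^2\) against R's built-in lm(). The simulated design matrix and the values of \(\beta\) and \(\sigma\) below are arbitrary choices for illustration.

set.seed(1)
n <- 100; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))   # design matrix with an intercept column
beta <- c(1, -2, 0.5); sigma <- 1.5                   # true parameter values
y <- c(X %*% beta + rnorm(n, sd = sigma))             # simulated responses

beta_hat <- solve(crossprod(X), crossprod(X, y))      # (X'X)^{-1} X'y
e <- y - X %*% beta_hat                               # residuals
sigma2_mle <- sum(e^2) / n                            # MLE of sigma^2
sigma2_hat <- sum(e^2) / (n - p)                      # unbiased estimator of sigma^2

fit <- lm(y ~ X - 1)                # "- 1" since X already contains the intercept column
range(coef(fit) - beta_hat)         # essentially zero
sigma(fit)^2 - sigma2_hat           # essentially zero: lm() reports the unbiased version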
2.3.2 The MLE of \(\bm \beta\)
The likelihood function is \[ \mathcal L(\beta, \sigma^2 \mid y, X) = p(y \mid X, \beta, \sigma^2) \propto \exp\left\{-\frac 1 2 \frac{(y - X\beta)'(y - X\beta)}{\sigma^2} - \frac n 2 \log(\sigma^2) \right\}. \] One way of finding the MLE of \(\beta\) is to take derivatives and set them to zero. Another is to complete the square as in Section 2.1.4.
Before proving Proposition 2.1, note that (2.6), i.e., the decomposition \((y - X\beta)'(y - X\beta) = (\beta - \hat\beta)'X'X(\beta - \hat\beta) + e'e\), easily shows that \(\hat \beta\) is the MLE of \(\beta\). This is because \(X'X\) is symmetric, and since \(X\) has linearly independent columns, \(b = Xa = 0 \iff a = 0\), such that \(a'X'Xa = b'b \ge 0\), with equality if and only if \(a = 0\). Therefore, \(X'X\) is positive definite. Since \(e'e\) and \(\log(\sigma^2)\) do not depend on \(\beta\), the log-likelihood is maximized when \((\beta-\hat\beta)'X'X(\beta-\hat\beta)\) is minimized, which happens at the unique value \(\beta = \hat \beta\).
2.3.2.1 Distribution of \(\hat \beta\)
Since \(\hat \beta\) is a linear combination of the \(y \sim \N(X\beta, \sigma^2 I)\), it is also normal with \[\begin{align*} \E[\hat \beta] & = (X'X)^{-1}X'\E[y] = (X'X)^{-1}X'X\beta = \beta, \\ \var(\hat \beta) & = (X'X)^{-1}X'\var(y)X(X'X)^{-1} = \sigma^2 (X'X)^{-1}X'X(X'X)^{-1} = \sigma^2(X'X)^{-1}. \end{align*}\] That is, \(\hat \beta \sim \N\big(\beta, \sigma^2 (X'X)^{-1}\big)\).
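A short simulation can corroborate this sampling distribution: holding \(X\) fixed and regenerating \(y\) many times (all settings below are arbitrary), the empirical mean and variance of \(\hat\beta\) should match \(\beta\) and \(\sigma^2(X'X)^{-1}\).

set.seed(1)
n <- 50; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))   # fixed design matrix
beta <- c(1, -2, 0.5); sigma <- 1.5
XtXinv <- solve(crossprod(X))                         # (X'X)^{-1}
beta_hat <- replicate(1e4, {
  y <- c(X %*% beta + rnorm(n, sd = sigma))           # new responses, same X
  c(XtXinv %*% crossprod(X, y))                       # (X'X)^{-1} X'y
})
rowMeans(beta_hat) - beta                             # approximately zero
var(t(beta_hat)) - sigma^2 * XtXinv                   # approximately zero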
2.3.3 Maximum Likelihood vs. Unbiased Estimator of \(\bm \sigma^2\)
To find the MLE of \(\sigma^2\), we take the derivative of the log-likelihood at \(\hat \beta\) and set it to zero: \[ \del {\sigma^2} \ell(\hat \beta, \sigma^2 \mid y, X) = \frac 1 2 \frac{e'e}{(\sigma^2)^2} - \frac n {2\sigma^2} = 0 \implies \hat \sigma^2_{(\tx{ML})} = \frac{e'e}{n} = \frac{\sum_{i=1}^n(y_i - x_i' \hat \beta)^2}{n}. \] However, it turns out that the unbiased estimator of \(\sigma^2\) divides by \(n - p\) rather than \(n\), reflecting the degrees of freedom used to estimate \(\beta\). To show this, let’s proceed in a few steps.
We are now ready to find the distribution of the sum-of-squared residuals.
Proof. The variance matrix of \(e\) is \(\var(e) = \sigma^2(I - H)\). In deriving this above, we showed that \((I-H)\) is symmetric and that \((I-H)^2 = (I-H)\).
Let’s now consider the eigen decomposition \[ I-H = \Gamma'D\Gamma, \where \Gamma^{-1} = \Gamma', \quad D = \left[\begin{smallmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_n \end{smallmatrix}\right]. \] Then for \(\tilde e = \Gamma e\), we have \(\E[\tilde e] = 0\) and \(\var(\tilde e) = \sigma^2 \Gamma(I-H)\Gamma' = \sigma^2 \Gamma \Gamma' D \Gamma \Gamma' = \sigma^2 D\), such that \[ \tilde e_i \ind \N(0, \sigma^2 \lambda_i). \] Moreover, \(\tilde e' \tilde e = e'\Gamma'\Gamma e = e'e\). Therefore, \(e'e/\sigma^2 = \sum_{i=1}^n Z_i^2\), where \(Z_i = \tilde e_i/\sigma \ind \N(0,\lambda_i)\). To complete the proof, we’ll show that exactly \((n-p)\) of the \(\lambda_i\) equal 1, and the remaining \(p\) of the \(\lambda_i\) equal 0. Thus, exactly \((n-p)\) of the \(Z_i \sim \N(0,1)\) and exactly \(p\) of the \(Z_i = 0\), such that \(e'e/\sigma^2\) is the sum of \((n-p)\) squares of iid standard normals, which is \(\chi^2_{(n-p)}\).
To show the desired property of the \(\lambda_i\)’s, note that \[\begin{align*} (I-H)^2 = (\Gamma'D\Gamma)(\Gamma'D\Gamma) & = \Gamma'D^2 \Gamma \\ & = I-H = \Gamma'D\Gamma. \end{align*}\] That is, \(D^2 = D \iff \lambda_i^2 = \lambda_i\), which implies that \(\lambda_i = 0 \tx{ or } 1\).
Next, we’ll show that \(\sum_{i=1}^n \lambda_i = n-p\). To do this, recall the trace of a matrix, \(\tr(A_{n\times n}) = \sum_{i=1}^n A_{ii}\), and the trace property that \(\tr(AB) = \tr(B A)\) as long as the matrix multiplications have the right dimensions. Therefore, \[ \tr(I-H) = \tr(\Gamma'D\Gamma) = \tr(D\Gamma\Gamma') = \tr(D) = \sum_{i=1}^n \lambda_i, \] and \[\begin{align*} \tr(I-H) = \tr(I_n) - \tr(H) & = n - \tr(X(X'X)^{-1}X') \\ & = n - \tr((X'X)^{-1}X'X) = n - \tr(I_p) = n-p. \end{align*}\] Thus, \(\sum_{i=1}^n \lambda_i = n-p\), such that exactly \((n-p)\) of them are 1’s and the other \(p\) are 0’s, which completes the proof.
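The key facts in this proof are easy to confirm numerically for a particular design matrix (simulated below, with arbitrary dimensions): the eigenvalues of \(I - H\) are \((n-p)\) ones and \(p\) zeros, \(\tr(I-H) = n - p\), and \(e'e/\sigma^2\) behaves like a \(\chi^2_{(n-p)}\) random variable.

set.seed(1)
n <- 20; p <- 4
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))   # design matrix with full column rank
H <- X %*% solve(crossprod(X)) %*% t(X)               # hat matrix
lambda <- eigen(diag(n) - H, symmetric = TRUE)$values
round(lambda, 10)                                     # (n - p) ones and p zeros
sum(diag(diag(n) - H))                                # trace: equals n - p

# Monte Carlo check that e'e / sigma^2 ~ chi^2_(n-p)
sigma <- 2
ee <- replicate(1e4, {
  y <- c(X %*% rnorm(p) + rnorm(n, sd = sigma))       # distribution of e does not depend on beta
  e <- y - H %*% y                                    # residuals: e = (I - H) y
  sum(e^2) / sigma^2
})
mean(ee)                                              # compare to n - p = 16
var(ee)                                               # compare to 2 * (n - p) = 32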