Chapter 1 Introduction to Bayesian Inference
Note: This chapter is still under construction. The corresponding slides are available here.
1.1 Recap of Frequentist Inference
In order to fix ideas, let’s start with a simple data-generating model
\[\begin{equation} \YY = (\yy_1, \ldots, \yy_n) \iid f(\yy \mid \tth), \tag{1.1} \end{equation}\]
where \(\yy_i\) is a \(d\)-dimensional random vector corresponding to observation \(i\), and \(\tth\) is a \(p\)-dimensional parameter to estimate.
The likelihood function is defined as
\[ \mathcal L(\tth \mid \YY) \propto p(\YY \mid \tth) = \prod_{i=1}^n f(\yy_i \mid \tth). \]
For calculations, it’s often more useful to work with the loglikelihood function
\[ \ell(\tth \mid \YY) = \log \mathcal L(\tth \mid \YY). \]
1.1.1 Parameter Inference
At a basic level, parameter inference consists of producing point and interval estimates for each element of \(\tth\). For the former, the go-to Frequentist estimator is the maximum likelihood estimator (MLE)
\[ \hat \tth = \hat \tth_{\tx{ML}} = \argmax_{\tth} \ell(\tth \mid \YY). \]
The MLE enjoys the following asymptotic property:
As \(n \to \infty\), we have approximately \(\hat \tth \sim \N(\tth_0, \FI^{-1}(\tth_0))\), where \(\tth_0\) is the true parameter value and \(\FI(\tth_0)\) is the expected Fisher Information:
\[ \FI(\tth_0) = -E\left[\del[2]{\tth}\ell(\tth_0 \mid \YY)\right] = - \int \del[2]{\tth}\ell(\tth_0 \mid \YY) \cdot p(\YY \mid \tth_0) \,\ud \YY. \]
Theorem 1.1 Let \(\tilde \tth\) be any other estimator of \(\tth\). Then as \(n \to \infty\), either \(\tilde \tth \not\to \tth_0\), or \(\var(\tilde \tth) \ge \var(\hat \tth)\), or both.
In other words, the MLE is asymptotically the most efficient consistent estimator of \(\tth\).
For interval estimates, we want 95% confidence intervals for each element \(\theta_j\) of \(\tth\). In other words, we are looking for a pair of random variables \(L = L(\YY)\) and \(U = U(\YY)\) such that \(\Pr(L < \theta_j < U) = 95\%\). This can be done asymptotically using the observed Fisher Information
\[ \hat \FI = -\del[2]{\tth}\ell(\hat \tth \mid \YY) \stackrel{n}{\to} \FI(\tth_0). \]
Therefore we have
\[ \big[{\hat \FI}^{-1/2}\big]^{-1}(\hat \tth - \tth_0) \to \N(\bz, \bm{I}_{p \times p}), \]
where \({\hat \FI}^{-1/2}\) is a matrix square root of \({\hat \FI}^{-1}\), and with some algebra¹ we find this implies that
\[ \frac{\hat \theta_j - \theta_{0j}}{\se(\hat \theta_j)} \to \N(0, 1), \]
where \(\se(\hat \theta_j) = \sqrt{[\hat \FI^{-1}]_{jj}}\) is the standard error of \(\hat \theta_j\). Thus, an asymptotic 95% confidence interval for \(\theta_j\) is \(\hat \theta_j \pm 1.96 \times \se(\hat \theta_j)\).
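As a concrete illustration of this recipe, here is a minimal sketch for a one-parameter model that is not discussed in this chapter: iid exponential observations with rate \(\theta\), for which \(\ell(\theta \mid \YY) = n \log \theta - \theta \sum_i y_i\). The model, the function name `exponential_mle_ci`, and the toy data are all assumptions made for illustration.

```python
import math

def exponential_mle_ci(y, z=1.96):
    """MLE, standard error and asymptotic CI for the rate of iid exponential data.

    Assumed model (illustration only): f(y | theta) = theta * exp(-theta * y).
    """
    n = len(y)
    theta_hat = n / sum(y)            # root of d/dtheta [n*log(theta) - theta*sum(y)]
    obs_info = n / theta_hat ** 2     # minus the second derivative of the loglikelihood at theta_hat
    se = math.sqrt(1.0 / obs_info)    # square root of the (1 x 1) inverse observed information
    return theta_hat, se, (theta_hat - z * se, theta_hat + z * se)

y = [0.5, 1.5, 1.0, 1.0]              # toy data; far too few points for the asymptotics to be accurate
theta_hat, se, ci = exponential_mle_ci(y)
print(theta_hat, se, ci)
```

With only four observations the interval is purely illustrative; the normal approximation behind it is justified only for large \(n\).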
1.1.2 Hypothesis Testing
Suppose we wish to test the null hypothesis
\[ H_0: \tth \in \TTh_0. \]
We then construct a test statistic \(T = T(\YY)\), for which large values of \(T\) are evidence against \(H_0\). The p-value is then defined as
\[ \pv = \Pr(T > T_\obs \mid H_0) = \Pr(T > T_\obs \mid \tth \in \TTh_0), \]
where \(T_\obs = T(\yy_\obs)\) is calculated for the current dataset, and \(T = T(\yy)\) is for a new dataset. In other words, \(\pv\) is the probability of observing more evidence against \(H_0\) in new data than in the current data, given that \(H_0\) is true.
Typically the distribution \(p(T \mid H_0)\) is not well defined, because \(\TTh_0\) contains more than one value of \(\tth\); only \(p(T \mid \tth)\) for a specific value of \(\tth\) is. So we often rely on an asymptotic \(p\)-value
\[ \pv \approx \Pr(T > T_\obs \mid \tth = \hat \tth_0), \]
where \(\hat \tth_0 = \argmax_{\tth \in \TTh_0} \ell(\tth \mid \YY)\) is the MLE restricted to values of \(\tth\) within the null hypothesis \(H_0\).
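To make the plug-in idea concrete, here is a minimal sketch under assumptions chosen for illustration: \(y_i \iid \N(\mu, 1)\), the one-sided null \(H_0: \mu \le 0\), and the test statistic \(T = \bar y\). In this case the restricted MLE is \(\hat \mu_0 = \min(\bar y, 0)\), and \(T \sim \N(\hat \mu_0, 1/n)\) when \(\mu = \hat \mu_0\). The function name `plugin_p_value` is hypothetical.

```python
from statistics import NormalDist, fmean

def plugin_p_value(y, mu_max=0.0):
    """Plug-in p-value for H0: mu <= mu_max when y_i ~ iid N(mu, 1) and T = ybar."""
    n = len(y)
    t_obs = fmean(y)
    mu_hat0 = min(t_obs, mu_max)                 # MLE restricted to the null region
    null_dist = NormalDist(mu_hat0, n ** -0.5)   # distribution of ybar when mu = mu_hat0
    return 1.0 - null_dist.cdf(t_obs)            # Pr(T > T_obs | mu = mu_hat0)

p = plugin_p_value([0.5, 0.5, 0.5, 0.5])         # toy data with ybar = 0.5
print(p)
```

When \(\bar y \le 0\) the restricted MLE coincides with \(\bar y\) and the function returns 0.5, reflecting that such data carry no evidence against this particular null.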
1.2 Basics of Bayesian Inference
We begin with the same model (1.1) and corresponding likelihood function \(\mathcal{L}(\tth \mid \YY) \propto p(\YY \mid \tth)\). We previously assumed that the unknown parameter \(\tth\) was fixed. Let us now pretend that it is random with prior distribution \(\tth \sim \pi(\tth)\). In this course, we won’t spend much time discussing the philosophical implications of pretending that something fixed is in fact random. We’ll simply take this as a computational device which is sometimes useful (more on this shortly).
Now that the parameter \(\tth\) is random, we can use Bayes’ theorem to obtain its posterior distribution
\[ p(\tth \mid \YY) = \frac{p(\YY \mid \tth) \pi(\tth)}{p(\YY)} \propto \mathcal L(\tth \mid \YY) \cdot \pi(\tth). \]
Note that we didn’t even bother in the third expression to derive the exact value of the denominator, which incidentally is \(p(\YY) = \int p(\YY \mid \tth) \pi(\tth) \ud \tth\). This is because we’re not going to need it any time soon², so for now it’s just whatever (unique) number makes the posterior distribution integrate to one.
1.2.1 Parameter Inference
Point Estimate: \(\hat \tth = \E[\tth \mid \YY]\).
Interval Estimate: \((L, U)\) such that \(\Pr(L < \theta_j < U \mid \YY) = 95\%\).
No asymptotics, and conditioned on this particular dataset \(\YY\).
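For a one-dimensional parameter, these quantities can be approximated directly on a grid. The sketch below is one simple way to do so, not a method prescribed by this chapter: it normalizes \(\mathcal L(\theta \mid \YY) \cdot \pi(\theta)\) with a Riemann sum, then reads off the posterior mean and an equal-tailed 95% interval. The toy model (3 successes and 1 failure from Bernoulli(\(\theta\)) with a flat prior on \((0, 1)\)) and the function name `grid_posterior_summary` are assumptions.

```python
import math

def grid_posterior_summary(log_lik, log_prior, grid, level=0.95):
    """Posterior mean, equal-tailed interval and normalizing constant on an equally spaced grid."""
    h = grid[1] - grid[0]
    unnorm = [math.exp(log_lik(t) + log_prior(t)) for t in grid]
    evidence = sum(unnorm) * h                   # Riemann sum for the denominator in Bayes' theorem
    probs = [u * h / evidence for u in unnorm]   # posterior mass assigned to each grid cell
    mean = sum(t * p for t, p in zip(grid, probs))
    alpha = (1.0 - level) / 2.0
    cdf, lower, upper = 0.0, None, None
    for t, p in zip(grid, probs):
        cdf += p
        if lower is None and cdf >= alpha:
            lower = t                            # first grid point with at least alpha mass below it
        if upper is None and cdf >= 1.0 - alpha:
            upper = t                            # first grid point with at least 1 - alpha mass below it
    return mean, (lower, upper), evidence

# Toy example (an assumption): 3 successes, 1 failure, flat prior on (0, 1).
m = 2000
grid = [(k + 0.5) / m for k in range(m)]         # cell midpoints, so theta is never exactly 0 or 1
mean, (L, U), evidence = grid_posterior_summary(
    lambda t: 3 * math.log(t) + math.log(1 - t), # Bernoulli loglikelihood
    lambda t: 0.0,                               # log of the flat prior density
    grid,
)
print(mean, (L, U), evidence)
```

The grid approach scales poorly with the dimension \(p\) of \(\tth\), so it is best thought of as a check for small problems.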
1.2.2 Hypothesis Testing
Method 1: Simply calculate \(\Pr(H_0 \mid \YY) = \Pr(\tth \in \TTh_0 \mid \YY)\).
Method 2: Given a test statistic \(T = T(\YY) \sim f(T \mid \tth)\), calculate the posterior p-value
\[ \Pr(T > T_\obs \mid \YY_\obs, H_0) = \int_{\tth \in \TTh_0} \Pr(T > T_\obs \mid \tth) \cdot p(\tth \mid \YY_\obs, H_0) \ud \tth. \]
Again, no asymptotics!
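Method 1 is easy to approximate whenever we can sample from the posterior. The sketch below assumes a toy setting not covered in this chapter: Bernoulli(\(\theta\)) data with a flat prior on \((0, 1)\), for which the posterior is the standard conjugate result Beta(successes + 1, failures + 1), and the null hypothesis \(H_0: \theta \le 0.5\). The function name `posterior_prob_null` and all numerical values are hypothetical.

```python
import random

def posterior_prob_null(successes, failures, theta_max=0.5, n_draws=50_000, seed=0):
    """Monte Carlo estimate of Pr(theta <= theta_max | Y) under a flat prior on theta."""
    rng = random.Random(seed)
    # Assumed conjugate result: flat prior + Bernoulli likelihood gives a
    # Beta(successes + 1, failures + 1) posterior, which we sample directly.
    draws = (rng.betavariate(successes + 1, failures + 1) for _ in range(n_draws))
    # Fraction of posterior draws that fall inside the null region.
    return sum(d <= theta_max for d in draws) / n_draws

prob_h0 = posterior_prob_null(successes=3, failures=1)
print(prob_h0)
```

The estimate carries Monte Carlo error of order \(1/\sqrt{\texttt{n\_draws}}\), which is a simulation error rather than an asymptotic approximation in the sample size \(n\).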
1.3 Example 1: Normal with Unknown Mean
Model: \(\yy = (\rv y n) \iid \N(\mu, 1)\)
Likelihood:
\[ \ell(\mu \mid \yy) = - \frac 1 2 \sumi 1 n (y_i - \mu)^2 = -\frac n 2 (\bar y - \mu)^2, \]
where \(\bar y = \frac 1 n \sumi 1 n y_i\), and both equalities hold up to additive constants that do not depend on \(\mu\).
Prior Specification: Consider the following questions, preferably in this order:
1. What prior information do we have about \(\mu\)?
2. What would make calculations simple?
In this case, a convenient choice is \(\mu \sim \N(\lambda, \tau^2)\), since
\[ \begin{aligned} \log p(\mu \mid \yy) & = \ell(\mu \mid \yy) + \log \pi(\mu) \\ & = -\frac{n(\bar y - \mu)^2}{2} - \frac {(\lambda-\mu)^2}{2\tau^2} = -\frac {(\mu - B\lambda - (1-B)\bar y)^2}{2(1-B)/n}, \end{aligned} \]
where \(B = \tfrac 1 n / (\tfrac 1 n + \tau^2) \in (0,1)\) is called the shrinkage factor, and additive constants that do not depend on \(\mu\) have been dropped. The posterior distribution is then
\[ \mu \mid \yy \sim \N\left(B \lambda + (1-B) \bar y, \frac{1-B} n \right). \]
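The posterior formula translates directly into code. The following is a minimal sketch; the function name `normal_posterior` and the numerical inputs are assumptions chosen so that the arithmetic is easy to follow.

```python
import math

def normal_posterior(ybar, n, lam, tau2):
    """Posterior mean, variance and shrinkage factor for y_i ~ iid N(mu, 1), mu ~ N(lam, tau2)."""
    B = (1.0 / n) / (1.0 / n + tau2)     # shrinkage factor, strictly between 0 and 1
    mean = B * lam + (1.0 - B) * ybar    # weighted average of the prior mean and the sample mean
    var = (1.0 - B) / n                  # equals 1 / (n + 1/tau2)
    return mean, var, B

# Toy inputs (assumed): n = 4, ybar = 1, prior N(0, 0.25), so that n * tau2 = 1 and B = 1/2.
mean, var, B = normal_posterior(ybar=1.0, n=4, lam=0.0, tau2=0.25)
interval = (mean - 1.96 * math.sqrt(var), mean + 1.96 * math.sqrt(var))  # 95% posterior interval
print(B, mean, var, interval)
```

With these inputs the prior carries as much information as the data, so the posterior mean lands halfway between \(\lambda\) and \(\bar y\).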
With the above prior,
\[ \log p(\mu \mid \yy) = -\frac 1 2 [n(\bar{y} - \mu)^2 + \tau^{-2} (\lambda-\mu)^2] = \ell(\mu \mid \yy, \tilde{\yy}), \]
where \(\tilde{\yy}\) consists of \(\tau^{-2}\) additional data points with mean \(\lambda\). In other words, we can think of the prior as adding “fake” data to the data we already have.
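The "fake data" interpretation can be checked numerically. The sketch below uses assumed toy values with \(\tau^{-2}\) an integer, appends \(\tau^{-2}\) pseudo-observations all equal to \(\lambda\), and compares the normal kernel implied by the augmented loglikelihood with the closed-form posterior from the shrinkage formula.

```python
from statistics import fmean

y = [0.2, 1.4, 0.9, 1.5]             # toy data (assumed): n = 4, ybar = 1.0
lam, tau2 = 0.0, 0.25                # prior N(lam, tau2), chosen so that 1/tau2 is an integer
n_fake = round(1 / tau2)             # tau^{-2} = 4 pseudo-observations
y_aug = y + [lam] * n_fake           # fake data all equal to lam, hence with mean lam

# The augmented loglikelihood is -(n_aug / 2) * (mean_aug - mu)^2 + const,
# i.e. a normal kernel in mu centred at mean_aug with variance 1 / n_aug.
n_aug = len(y_aug)
mean_aug = fmean(y_aug)
var_aug = 1.0 / n_aug

# Closed-form posterior from the shrinkage formula, for comparison.
n, ybar = len(y), fmean(y)
B = (1.0 / n) / (1.0 / n + tau2)
post_mean, post_var = B * lam + (1 - B) * ybar, (1 - B) / n

print((mean_aug, var_aug), (post_mean, post_var))
```

Any set of pseudo-observations with mean \(\lambda\) would give the same result, since the loglikelihood depends on the data only through the sample mean, up to a constant.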
As \(\tau \to \infty\), the posterior converges to \(\mu \mid \yy \sim \N(\bar y, \frac 1 n)\).
This gives exactly the same point and interval estimates as Frequentist inference.
But as \(\tau \to \infty\) we have \(\pi(\mu) \propto 1\) which is not a PDF…
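The convergence can be seen numerically. This small sketch, with assumed summary statistics, compares the 95% posterior interval for increasing \(\tau\) with the Frequentist interval \(\bar y \pm 1.96/\sqrt n\), which follows from the MLE \(\hat \mu = \bar y\) and observed Fisher Information \(n\).

```python
import math

n, ybar, lam = 25, 1.2, 0.0          # toy summary statistics and prior mean (assumed values)
half_width = 1.96 / math.sqrt(n)     # the standard error of ybar is 1/sqrt(n)
freq_ci = (ybar - half_width, ybar + half_width)

for tau in (1.0, 10.0, 1000.0):
    B = (1.0 / n) / (1.0 / n + tau ** 2)            # shrinkage factor, tends to 0 as tau grows
    mean = B * lam + (1 - B) * ybar                  # posterior mean
    sd = math.sqrt((1 - B) / n)                      # posterior standard deviation
    bayes_ci = (mean - 1.96 * sd, mean + 1.96 * sd)  # 95% posterior interval
    print(tau, bayes_ci, freq_ci)
```

The two intervals agree numerically in the limit even though the limiting prior is improper; their interpretations remain different.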
1. Since any square root of \({\hat \FI}^{-1}\) will do, assume wlog that we are interested in \(\theta_1\) and let \({\hat \FI}^{-1/2}\) denote the lower Cholesky factor of \({\hat \FI}^{-1}\). Then the first component of the above reduces to \((\hat \theta_1 - \theta_{01})/\sqrt{[\hat \FI^{-1}]_{11}} \to \N(0,1)\). By permuting the entries of \(\tth\) we can get a similar result for any \(\theta_j\), \(j = 1, \ldots, p\).
2. The normalizing constant \(p(\YY)\) is often used for Bayesian model selection.