
Introduction

The GaussCop package provides a simple, unified framework for the basic d/p/q/r functions (density, CDF, quantile, and random sampling) of arbitrary continuous one-dimensional distributions. This can be useful for working with various Copula models, including the Gaussian Copula in the example below.

xDensity Objects

Let $X$ denote an unbounded, continuous random variable with PDF $X \sim f(x)$. An xDensity object essentially consists of the following elements:

  • PDF values on an evenly-spaced grid.
  • The mean and standard deviation of a Normal distribution to be used outside the grid.

Based on this representation, the d/p/q/r functions are implemented as follows:

  • PDF (d): Step function inside the grid, Normal outside the grid.
  • CDF (p): Piecewise linear inside the grid, Normal outside the grid.
  • Inverse-CDF (q): Exact numerical inversion of the CDF.
  • Sampling (r): Inverse-CDF method. That is, if $F(x)$ is a CDF and $U \sim \mathrm{Unif}(0,1)$, then $X = F^{-1}(U)$ has CDF $F(x)$.
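
As a quick illustration of the inverse-CDF method, here is a minimal sketch using the built-in Exponential distribution in place of an xDensity object:

set.seed(1)
U <- runif(1e4)    # U ~ Unif(0,1)
X <- qexp(U)       # X = F^{-1}(U), so X ~ Expo(1)
ks.test(X, "pexp") # Kolmogorov-Smirnov test against the Expo(1) CDF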

Basic Usage

xDensity objects can be constructed either from iid samples $X_1, \ldots, X_n \overset{iid}{\sim} f(x)$ or from a functional representation of $f(x)$ itself. For example, consider the following xDensity representation based on a random sample from the log-Exponential distribution, i.e., the distribution of $X = \log(E)$ for $E \sim \mathrm{Expo}(1)$, which has PDF $f(x) = \exp(-e^x + x)$.

require(GaussCop) # load package
## Loading required package: GaussCop
# simulate data
nsamples <- 1e5
X <- log(rexp(nsamples)) # iid samples from the log-Exponential

xDens <- kernelXD(X) # basic xDensity constructor
names(xDens)
## [1] "ndens" "xrng"  "ypdf"  "ylpdf" "ycdf"  "mean"  "sd"

The elements of xDens are:

  • ndens, xrng: a compact representation of the x-value grid, i.e., the number of grid points and the range of the grid.
  • ypdf, ylpdf, ycdf: the PDF, log-PDF, and CDF evaluated at the ndens grid points. While the latter two elements can be reconstructed from the first, they are stored here in order to speed up subsequent calculations.
  • mean, sd: The mean and standard deviation of the Normal distribution used outside the range of the grid. By default, these values are chosen by matching the Normal PDF to the ypdf values at the endpoints of the grid; details of the calculation can be found in Appendix A. The grid itself is easily reconstructed, as shown below.
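
For instance, the full grid of x-values can be recovered from its compact representation (a small sketch, using the xDens object from the example above):

xgrid <- seq(xDens$xrng[1], xDens$xrng[2], len = xDens$ndens) # full x-grid
plot(xgrid, xDens$ypdf, type = "l") # PDF values at the grid points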

The d/p/q/r functions for xDensity objects are invoked as follows:

x <- c(-1, 0, 1) # points at which to evaluate the PDF/CDF
n <- 10          # number of random draws
dXD(x, xDens = xDens, log = TRUE) # log-PDF
pXD(x, xDens = xDens, lower.tail = FALSE) # survival function (1-CDF)
qXD(p = .5, xDens = xDens) # median
rXD(n, xDens = xDens) # random sample

To assess the quality of the xDensity approximation to the log-Exponential distribution, consider the following checks:

  1. That the histogram of an rXD() sample matches dXD() and the true PDF (recall that rXD() applies qXD() to draws from stats::runif()).
  2. That pXD() matches the true CDF.

These graphical checks are implemented by the function xdens_test() (R code in Appendix B), as illustrated below:

# check that xDensity d/p/q/r functions match the true distribution,
# by plotting true PDF/CDF vs xDensity representation.

dlexp <- function(x) exp(-exp(x) + x) # true PDF
plexp <- function(x) pexp(exp(x)) # true CDF

xdens_test(xDens, # xDensity approximation 
           dtrue = dlexp, # true PDF 
           ptrue = plexp, # true CDF
           dname = expression(log(X)%~%"Expo"*(1)), # name of true distribution (for legend)
           dlpos = "topleft", plpos = "topleft") # legend position for pdf/cdf plot

Graphical assessment of xDensity approximation to the log-Exponential distribution.

Constructors for xDensity Objects

The GaussCop package currently implements three methods of constructing xDensity objects. Two take as input a random sample $X_1, \ldots, X_n \sim f(x)$. The third is based on a functional approximation $\hat f(x)$, e.g., as resulting from a parametric model $\hat f(x) = f(x \mid \hat\theta)$, where $\hat\theta$ is the parameter MLE.

Kernel Density Estimation

The kernelXD() function uses the stats::density() function to fit a Gaussian kernel to a random sample from $f(x)$. It is the most flexible constructor; however, it also requires the most data points for stable estimation. Below is a comparison with $M$ iid samples from a $\chi^2_{(4)}$ distribution, using $M = 1000$ and $M = 10000$ data points.
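
Under the hood, the kernel step presumably amounts to something like the following sketch (the actual constructor also computes the log-PDF, CDF, and tail parameters):

dens <- stats::density(rchisq(1e3, df = 4), n = 512) # Gaussian kernel estimate
head(cbind(x = dens$x, y = dens$y)) # evenly-spaced grid and PDF values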

# true distribution: chi^2_(4)
df_true <- 4
dtrue <- function(x) dchisq(x, df = df_true)
ptrue <- function(x) pchisq(x, df = df_true)
rtrue <- function(n) rchisq(n, df = df_true)
dname <- paste0("X%~%chi[(",df_true,")]^2")

# simulate data
nsamples <- 1e3
X <- rtrue(nsamples)
xDens <- kernelXD(X) # xDensity constructor

par(oma = c(0,2,0,0)) # put sample size in outer margin
xdens_test(xDens, dtrue = dtrue, ptrue = ptrue,
           dname = dname,
           dlpos = "topright", plpos = "bottomright")
mtext(text = paste0("M = ", nsamples),
      side = 2, line = .5, outer = TRUE)

# increase sample size
nsamples <- 1e4
X <- rtrue(nsamples)
xDens <- kernelXD(X)
xdens_test(xDens, dtrue = dtrue, ptrue = ptrue,
           dname = dname,
           dlpos = "topright", plpos = "bottomright")
mtext(text = paste0("M = ", nsamples),
      side = 2, line = .5, outer = TRUE)
par(oma = c(0,0,0,0)) # remove outer margin

Impact of sample size on quality of kernelXD() approximation.

Gram-Charlier Expansion

The gc4XD() constructor implements a moment-based density approximation, which works best for unimodal distributions that are not too far from normal. The estimator proceeds in two steps:

  1. The data $X_1, \ldots, X_n \overset{iid}{\sim} f(x)$ are first normalized by applying a Box-Cox transformation to each observation $X_i$:
$$Z_i = \mathrm{BoxCox}(X_i \mid \hat\alpha, \hat\lambda), \qquad \mathrm{BoxCox}(x \mid \alpha, \lambda) = \begin{cases} \dfrac{(x+\alpha)^\lambda - 1}{\lambda C^{\lambda-1}} & \lambda \neq 0 \\ C \log(x + \alpha) & \lambda = 0, \end{cases}$$
where $C$ is a scaling constant and $(\hat\alpha, \hat\lambda)$ are obtained by maximum likelihood.
  2. The transformed data $Z_1, \ldots, Z_n$ are then fit with a 4th-order Gram-Charlier expansion (e.g., Draper and Tierney 1972). That is, let $\hat\mu$, $\hat\sigma$, $\hat\kappa_3$, and $\hat\kappa_4$ denote the sample-based estimates of the first four cumulants of $Z = \mathrm{BoxCox}(X \mid \hat\alpha, \hat\lambda)$. The 4th-order Gram-Charlier approximation to the true PDF $f(z)$ is
$$\hat f(z) = \frac{1}{\sqrt{2\pi}\,\hat\sigma} \exp\left[-\frac{(z-\hat\mu)^2}{2\hat\sigma^2}\right] \times \left[1 + \frac{\hat\kappa_3}{3!\,\hat\sigma^3} H_3\!\left(\frac{z-\hat\mu}{\hat\sigma}\right) + \frac{\hat\kappa_4}{4!\,\hat\sigma^4} H_4\!\left(\frac{z-\hat\mu}{\hat\sigma}\right)\right],$$
where $H_3(z) = z^3 - 3z$ and $H_4(z) = z^4 - 6z^2 + 3$ are the 3rd and 4th Hermite polynomials. The approximation $\hat f(z)$ is not guaranteed to be positive, but generally is for reasonably unimodal distributions, in which case $\hat f(z)$ is a (normalized) PDF with mean, variance, skewness, and kurtosis exactly matching those of the transformed data $Z_1, \ldots, Z_n$ (a small numerical sketch follows this list).
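
To make step 2 concrete, here is a minimal sketch of evaluating the 4th-order Gram-Charlier density from sample cumulant estimates; gc4_pdf() is a hypothetical helper (not part of the package), and the Box-Cox step is omitted:

gc4_pdf <- function(z, Z) {
  mu <- mean(Z)
  sig <- sd(Z)
  zs <- (Z - mu)/sig
  k3 <- mean(zs^3) * sig^3       # 3rd cumulant estimate
  k4 <- (mean(zs^4) - 3) * sig^4 # 4th cumulant estimate
  u <- (z - mu)/sig
  H3 <- u^3 - 3*u                # 3rd Hermite polynomial
  H4 <- u^4 - 6*u^2 + 3          # 4th Hermite polynomial
  dnorm(z, mu, sig) * (1 + k3/(6*sig^3)*H3 + k4/(24*sig^4)*H4)
}
Z <- rchisq(1e4, df = 4)
curve(gc4_pdf(x, Z), from = 0, to = 15, ylab = "density") # GC4 approximation
curve(dchisq(x, df = 4), add = TRUE, col = "blue")        # true PDF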

The gc4XD() constructor estimates the PDF of the data $X_1, \ldots, X_n$ on the original scale by inverting steps 1 and 2 above. The example below illustrates the quality of the approximation for $M = 1000$ observations of $X \sim \chi^2_{(4)}$.

nsamples <- 1e3
X <- rtrue(nsamples)
xDens <- gc4XD(X) # xDensity constructor
xdens_test(xDens, dtrue = dtrue, ptrue = ptrue,
           dname = dname,
           dlpos = "topright", plpos = "bottomright")

gc4XD() approximation with $M = 1000$ samples.

Known Density Values

The matrixXD() constructor converts functional density representations into the standardized xDensity format. Its input is a two-column matrix with equally-spaced x-grid values in the first column and the corresponding density values in the second. Below is an illustration using the Student-t distributions $X \sim t_{(2)}$ and $X \sim t_{(1)}$ (i.e., Cauchy). For extremely heavy-tailed distributions such as the Cauchy, the Normal tails used by xDensity clearly underestimate the tail probabilities.

# true distribution: t(2)
dtrue <- list(t2 = function(x) dt(x, df = 2),
              Cauchy = function(x) dt(x, df = 1))
ptrue <- list(t2 = function(x) pt(x, df = 2),
              Cauchy = function(x) pt(x, df = 1))
dname <- expression(X%~%t[(2)], X%~%"Cauchy")

# matrix density representation
xgrid <- seq(from=-15, to=15, len = 500) # grid
for(ii in 1:2) {
  ypdf <- dtrue[[ii]](xgrid) # true density
  xDens <- matrixXD(cbind(X = xgrid, Y = ypdf)) # xDensity constructor
  xdens_test(xDens, dtrue = dtrue[[ii]], ptrue = ptrue[[ii]],
             dname = dname[ii],
             dlpos = "topright", plpos = "bottomright")
}

matrixXD() approximation for $t_{(\nu)}$ distributions with $\nu = 1, 2$.

Application: The Gaussian Copula Distribution

Consider a $d$-dimensional random variable $\boldsymbol{X} = (X_1, \ldots, X_d)$ with unknown PDF $\boldsymbol{X} \sim f(\boldsymbol{x})$. A popular approach to statistical dependence modeling is to transform $\boldsymbol{X}$ to have uniform univariate margins. That is, if $F_i(x_i)$ is the CDF of $X_i$, then $U_i = F_i(X_i) \sim \mathrm{Unif}(0,1)$, such that the joint PDF of $\boldsymbol{X}$ can be written as
$$f(\boldsymbol{x}) = c(\boldsymbol{u}) \times \prod_{i=1}^d f_i(x_i), \qquad u_i = F_i(x_i).$$
Thus $c(\boldsymbol{u})$ is the joint PDF of marginally Uniform random variables, $U_i \sim \mathrm{Unif}(0,1)$, and is known as a Copula distribution. Numerous R packages provide methods for fitting and sampling from Copula distributions, e.g., copula (Hofert et al. 2017), VineCopula (Schepsmeier et al. 2017), and many others listed in the Distributions Task View. The xDensity methods for modeling the marginal distributions $f_i(x_i)$ can easily be combined with the Copula dependence models $c(\boldsymbol{u})$ provided by these packages. GaussCop itself provides a simple and specific interface to the so-called Gaussian Copula distribution, which approximates the true $c(\boldsymbol{u})$ as follows:

  1. Let $\varphi(z)$ and $\Phi(z)$ denote the PDF and CDF of the standard normal $\mathcal{N}(0,1)$. Now, consider the change-of-variables $Z_i = \Phi^{-1}(U_i)$, for which the joint PDF of $\boldsymbol{Z} = (Z_1, \ldots, Z_d)$ is
$$g(\boldsymbol{z}) = c(\boldsymbol{u}) \times \prod_{i=1}^d \varphi(z_i), \qquad u_i = \Phi(z_i).$$
  2. Since marginally $Z_i \sim \mathcal{N}(0,1)$, the Gaussian Copula approximates the true distribution $g(\boldsymbol{z})$ by the best-matching multivariate normal, in the sense of minimizing the Kullback-Leibler divergence:
$$\hat g(\boldsymbol{z}) = \varphi(\boldsymbol{z} \mid \hat R), \qquad \hat R = \operatorname*{argmin}_{R} \int \log\left[\frac{g(\boldsymbol{z})}{\varphi(\boldsymbol{z} \mid R)}\right] g(\boldsymbol{z}) \, d\boldsymbol{z},$$
where $\varphi(\boldsymbol{z} \mid R)$ is the joint PDF of a multivariate normal $\mathcal{N}(0, R)$ with correlation matrix $R$, i.e., $R_{ii} = 1$. Both steps are sketched right after this list.
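
The following sketch illustrates both steps "by hand" on bivariate data with known Expo(1) margins; the correlation value 0.7 and the use of cor() to estimate $R$ are illustrative assumptions:

set.seed(1)
n <- 1e4
R0 <- matrix(c(1, .7, .7, 1), 2, 2)         # target correlation matrix
Z0 <- matrix(rnorm(2*n), n, 2) %*% chol(R0) # correlated N(0,1) variables
X <- qexp(pnorm(Z0)) # data with Expo(1) margins and Gaussian dependence
U <- pexp(X)  # step 1a: U_i = F_i(X_i) ~ Unif(0,1)
Z <- qnorm(U) # step 1b: Z_i = Phi^{-1}(U_i) ~ N(0,1)
cor(Z)        # step 2: estimated R, close to R0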

To evaluate the accuracy of the Gaussian Copula approximation, consider applying it to the $p(p+1)/2$-dimensional Cholesky decomposition of a "unit" $p \times p$ Wishart distribution, $\boldsymbol{X}_{p \times p} \sim \mathrm{Wishart}(\boldsymbol{I}, p)$. That is, $\boldsymbol{X} = \boldsymbol{Z}'\boldsymbol{Z}$, where $\boldsymbol{Z}$ is a $p \times p$ matrix of iid $\mathcal{N}(0,1)$ random variables.

By construction, the Gaussian Copula perfectly captures the marginal distributions of $\mathrm{chol}(\boldsymbol{X})$. Therefore, consider the error in the marginal distributions of the eigenvalues of $\boldsymbol{X}$ induced by approximating the copula dependence between the Cholesky variables as multivariate normal. In the example below with $p = 4$, the error appears to be relatively small:

# simulate data from unit Wishart distribution
p <- 4 # size of matrix
nsamples <- 1e4
X <- replicate(nsamples, crossprod(matrix(rnorm(p^2),p,p)))

# Cholesky decomposition
Xchol <- apply(X, 3, function(V) chol(V)[upper.tri(V,diag = TRUE)]) 
Xchol <- t(Xchol)

# fit Gaussian Copula approximation
gCop <- gcopFit(X = Xchol, # iid sample from multivariate distribution
                fitXD = "kernel") # xDensity approximation to each margin

# Compare approximation to true distribution of eigenvalues of X

# extract sorted eigenvalues of a positive definite matrix
eig_val <- function(V) {
  sort(eigen(V, symmetric = TRUE)$values, decreasing = TRUE)
}

Xeig <- t(apply(X, 3, eig_val)) # true distribution
Xeig2 <- rgcop(nsamples, gCop = gCop) # Gaussian Copula approximation
Xeig2 <- t(apply(Xeig2, 1, function(U) {
  # recover full matrix
  V <- matrix(0,p,p)
  V[upper.tri(V,diag=TRUE)] <- U
  V <- crossprod(V)
  eig_val(V)
}))

# plot true and approximate eigenvalue distributions
main <- parse(text = paste0("lambda[", 1:p, "]"))
par(mfrow = c(1,p), mar = c(2.5,2.5,2,.1)+.1, oma = c(0,2,0,0))
for(ii in 1:p) {
  dtrue <- density(Xeig[,ii])
  dapprox <- density(Xeig2[,ii])
  xlim <- range(dtrue$x, dapprox$x)
  ylim <- range(dtrue$y, dapprox$y)
  plot(dtrue$x, dtrue$y, type = "l", xlim = xlim, ylim = ylim,
       main = main[ii], xlab = "", ylab = "", col = "blue")
  lines(dapprox$x, dapprox$y, col = "red")
}
mtext(text = "Density",
      side = 2, line = .5, outer = TRUE)
legend("topright", legend = c("True PDF", "gCop Approx."),
       fill = c("blue", "red"))
par(oma = c(0,0,0,0)) # remove outer margin

Gaussian Copula approximation to the eigenvalues of $\boldsymbol{X} \sim \mathrm{Wishart}(\boldsymbol{I}, 4)$, as fitted to $\mathrm{chol}(\boldsymbol{X})$.

Appendix A: Endpoint Matching

We wish to find the parameters $(\mu, \sigma)$ of a Normal distribution whose PDF passes through the points $(x_i, y_i)$, $i = 1, 2$. That is, we wish to find $(\mu, \sigma)$ satisfying
$$\log y_i = -\frac{1}{2}\left[\frac{(x_i - \mu)^2}{\sigma^2} + \log \sigma^2 + \log 2\pi\right], \qquad i = 1, 2.$$
Subtracting one equation from the other, we obtain
$$\log(y_1/y_2) = \frac{(x_2 - \mu)^2}{2\sigma^2} - \frac{(x_1 - \mu)^2}{2\sigma^2},$$
such that for any value of $\sigma$, the required value of $\mu$ is
$$\mu(\sigma) = \frac{\sigma^2 \log(y_1/y_2)}{x_1 - x_2} + \frac{x_1 + x_2}{2} = b\sigma^2 + a.$$
Substituting this back into one of the two original equations leads to the root-finding problem
$$g(\tau) = \frac{A^2}{\tau} + B^2\tau + \log \tau - C = 0,$$
where $\tau = \sigma^2$ and
$$A = x_i - a, \qquad B = b, \qquad C = 2\left[b(x_i - a) - \log y_i\right] - \log 2\pi.$$
A solution to this equation is not guaranteed to exist. For instance, let $x_1 = -1$, $x_2 = 1$, and $y_i = 1$: a Normal PDF passing through these points must have $\mu = 0$ and hence density at least 1 on $(-1, 1)$, for a total mass of at least 2. However, in our case the $y_i$ will typically be very small, in which case a root is likely to exist. Even then, there could be two roots to the equation: one with $\mu$ between $x_1$ and $x_2$, and one with $\mu$ (way) off to one side. We focus here on finding the former.

The root we seek can be calculated numerically with the (built-in) R function stats::uniroot(), which requires the specification of an interval containing the root. To obtain this interval, first notice that
$$g'(\tau) = -\frac{A^2}{\tau^2} + B^2 + \frac{1}{\tau},$$
of which the only positive root is
$$\tau_L = \frac{-1 + \sqrt{1 + 4(AB)^2}}{2B^2}.$$
Since $g(\tau) \to \infty$ as $\tau \to 0$ and as $\tau \to \infty$, we conclude that $\tau_L$ is the location of the minimum of $g(\tau)$ on $\tau \in (0, \infty)$. Let us now find a value $\tau_U < \tau_L$ at which $g(\tau_U) > 0$ (doing this on the side $\tau > \tau_L$ would likely produce the root with $\mu$ outside of $(x_1, x_2)$).

In order to find a lower bound for the root, note that for $0 < \tau < A^2/2$, we have
$$\log \tau > -\frac{A^2/2}{\tau} + D, \qquad D = 1 + \log(A^2/2).$$
This is because the derivative of the right-hand side is larger than that of the left-hand side for $\tau < A^2/2$, so the right-hand side falls to $-\infty$ faster as $\tau \to 0$; the value of $D$ is chosen such that the two sides are equal at $\tau = A^2/2$. Substituting the inequality above into the root-finding equation, we obtain
$$g(\tau) > \frac{A^2}{\tau} + \log \tau - C > \frac{A^2/2}{\tau} + D - C,$$
such that $g(\tau) > 0$ for $\tau < \frac{A^2}{2(C - D)}$.

Note that when $y_1 = y_2$, the endpoint-matching method breaks down; in this case, the mean and standard deviation of the approximated density itself are used for the Normally distributed tails. A sketch of the full calculation follows.
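
For concreteness, here is a minimal sketch of the whole calculation; match_endpoint() is a hypothetical helper (not the package's internal function), and it assumes $y_1 \neq y_2$ with $y_i$ small enough for a root to exist:

match_endpoint <- function(x1, y1, x2, y2) {
  a <- (x1 + x2)/2          # intercept of mu(tau), tau = sigma^2
  b <- log(y1/y2)/(x1 - x2) # slope of mu(tau)
  A <- x1 - a
  B <- b
  C <- 2*(b*(x1 - a) - log(y1)) - log(2*pi)
  g <- function(tau) A^2/tau + B^2*tau + log(tau) - C
  tau_L <- (-1 + sqrt(1 + 4*(A*B)^2))/(2*B^2) # location of min of g
  tau_U <- A^2/(2*(C - 1 - log(A^2/2)))       # lower bound with g(tau_U) > 0
  tau <- uniroot(g, interval = c(tau_U, tau_L))$root
  c(mean = b*tau + a, sd = sqrt(tau))
}
match_endpoint(-2, dnorm(-2), 2.5, dnorm(2.5)) # recovers mean 0, sd 1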

Appendix B: Code For Test Function xdens_test

# check that xDensity d/p/q/r functions match the true distribution,
# by plotting true PDF/CDF vs xDensity representation.
# xDens: xDensity object.
# dtrue/ptrue: single-variable function of the true pdf/cdf.
# dname: name of true distribution (for legend)
# dlpos, plpos: legend position for pdf/cdf plot.
xdens_test <- function(xDens, dtrue, ptrue, dname, dlpos, plpos) {
  op <- par(no.readonly = TRUE)
  par(mfrow = c(1,2), mar = c(4,4,2,.5)+.1)
  # pdf and sampling check
  nsamples <- 1e5
  Xsim <- rXD(nsamples, xDens = xDens) # xDensity: random sampling
  # extend xDens range to show endpoint matching
  xlim <- xDens$xrng
  xlim <- (xlim - mean(xlim)) * 1.10 + mean(xlim)
  hist(Xsim, breaks = 100, freq = FALSE, # xDensity random sample
       xlab = "x", main = "PDF", xlim = xlim)
  curve(dXD(x, xDens = xDens), add = TRUE, col = "red") # xDensity PDF
  curve(dtrue, add = TRUE, col = "blue") # true PDF
  abline(v = xDens$xrng, lty = 2) # grid endpoints
  legend(dlpos, parse(text = c(dname, "dXD", "rXD", "xrng")),
         pch = c(22,22,22,NA), pt.cex = 1.5,
         pt.bg = c("blue", "red", "white", "black"),
         lty = c(NA, NA, NA, 2))
  # cdf check
  curve(pXD(x, xDens), # xDensity CDF
        from = xlim[1], to = xlim[2], col = "red",
        xlab = "x", main = "CDF", ylab = "Cumulative Probability")
  curve(ptrue, add = TRUE, col = "blue") # true CDF
  abline(v = xDens$xrng, lty = 2) # grid endpoints
  legend(plpos, parse(text = c(dname, "pXD", "xrng")),
         pch = c(22,22,NA), pt.cex = 1.5,
         pt.bg = c("blue", "red", "black"),
         lty = c(NA, NA, 2))
  invisible(par(op))
}

References

Draper, N.R. and Tierney, D.E., 1972. Regions of positive and unimodal series expansion of the Edgeworth and Gram-Charlier approximations. Biometrika, 59 (2), 463–465.

Hofert, M., Kojadinovic, I., Maechler, M., and Yan, J., 2017. copula: Multivariate dependence with copulas. R package.

Schepsmeier, U., Stoeber, J., Brechmann, E.C., Graeler, B., Nagler, T., and Erhardt, T., 2017. VineCopula: Statistical inference of vine copulas. R package.