Statistics – deep mind

Introduction

Copulas are an important concept in statistics and beyond to describe dependency structures of continuous distributions. However, what can we do if we have multivariate data at hand. And how can we do that?

The following blog post provides a hands-on approach to empirical copulas. It provides insight and illustrations.

Let us assume we have a sample of two random variables $X$ and $Y$ at hand, and we would like to know more about how both variables are interrelated to each other. The elements of the sample of size $n$ of both random variables shall be denoted as a tuple $(x_i, y_i)$ with $1\leq i \leq n$ .

The empirical (bivariate) copula is defined as the discrete function $C_n$ given by

(1) $\begin{equation*} C_n \left(\frac{i}{n}, \frac{j}{n} \right) = \frac{ \# \{(x,y) | \ x\leq x_{(i)} \text{ and } y\leq y_{(j)} \} }{n} \end{equation*}$

where $x_{(i)}$ and $y_{(j)}$ , $1\leq i,j \leq n$ denote the order statistics of the sample and $\#$ provides the cardinality of the subsequent set. For instance, $\#\{a, b, c\} = 3$ .

Notice that the definition is actually a special case of the definition of an empirical distribution function. However, empirical copula functions are only defined on $[0,1]^n$ and they represent a dependency structure and not a distribution.

Hands-on Approach

Uniformly with a sample size of n=5

First, let us assume that $X$ and $Y$ follow an uniform distribution on $[0,1]$ . For both random variables we draw a small sample of size $n=5$ . The sample is listed in the following table and illustrated in the plot below (blue points).

$i/ j$	$x_i$	$y_j$	$x_{(i)}$	$y_{(j)}$
1	0.95	0.24	0.19	0.16
2	0.53	0.16	0.32	0.24
3	0.77	0.56	0.53	0.33
4	0.19	0.33	0.77	0.56
5	0.32	0.80	0.95	0.80

According to the definition of the empirical copula, we also need to know the corresponding ordered samples $x_{(i)}$ and $y_{(j)}$ , which can also be found in the table above. In addition, the tuples of the order statistic $(x_{(i)}, y_{(i)})$ are also sketched in the plot below (orange triangles).

Plot of original sample $(x_i, y_i)$ as well as of the order statistic $(x_{(i)},y_{(i)})$

The domain of the empirical copula is the Cartesian product of the values $\{\frac{1}{5}, \frac{2}{5}, \frac{3}{5}, \frac{4}{5}, 1=\frac{5}{5}\}$ . For instance, $C\left(\frac{1}{5}, \frac{1}{5} \right) = C\left(\frac{2}{5}, \frac{2}{5} \right) = 0$ since no tupel $(x_i, y_i)$ exists such that $x_i \leq x_{(1)} = 0.19$ and $y_i \leq y_{(1)} = 0.16$ for $1\leq i \leq 5$ . However, for $i=2$ and $j=3$ the tuple $(0.19, 0.33)$ meet the conditions $x_i \leq x_{(2)} = 0.32$ and $y_j \leq y_{(3)} = 0.33$ with $1\leq i, j \leq 5$ .

By knowing the set in the numerator of equation (1), we are able to derive the corresponding empirical copula value easily. We just need to count the elements of this set and divide the number by the sample size $n$ . The result is the desired empirical copula value which is a (cumulative) frequency.

Note that the set of equation (1) corresponds to all points $(x_{i}, y_{j})$ that lie within the rectangles intersecting the $x$ and $y$ axes, respectively, passing through $(x_{(i)}, y_{(j)})$ .

The values $C(x,y)$ are listed in the following table.

$C(x,y)$	$\frac{1}{5}$	$\frac{2}{5}$	$\frac{3}{5}$	$\frac{4}{5}$	$\frac{5}{5}$
$\frac{1}{5}$	0.00	0.00	0.20	0.20	0.20
$\frac{2}{5}$	0.00	0.00	0.20	0.20	0.40
$\frac{3}{5}$	0.20	0.20	0.40	0.40	0.60
$\frac{4}{5}$	0.20	0.20	0.40	0.60	0.80
$\frac{5}{5}$	0.20	0.40	0.60	0.80	1.00

Graphically the empirical copula looks as follows:

Empirical Copula of the sample drawn from $(X,Y)$ where both variables are distributed uniformly on $[0,1]$

Normally distributed with a sample size of n=100

Second, let us now assume that $(X, Y)$ follows a bivariate normal distributed with mean $\mu := (0,0)^T$ and

$\Sigma := \left(\begin{array}{ccc} 1 & \rho \\ \rho & 1 \\ \end{array}\right)$

.
Here, $\rho\in [0, \infty)$ is the correlation coefficient between $X$ and $Y$ . We draw a bivariate sample of size $n=100$ with $\rho = 0.3$ using R.

library(mvtnorm) #package to simulate joint normal distribution

n <- 100 #sample size
# parameters of the bivariate normal
mean <- c(0,0) # mean for distribution
sigma <- matrix(c(1,0.3,0.3,1), ncol=2) # correlation matrix

# sample the bivariate normal
Z <- rmvnorm(n=n, mean = mean, sigma = sigma) # simulate bivariate normally distributed sample
plot(Z, main="Bivariate Normal, Correlation 0.3, X vs. Y")

# divide Z to get X and Y
X <- Z[,1]
Y <- Z[,2]

# sort sample
X.ascending <- sort(X)
Y.ascending <- sort(Y)
Z.ascending <- cbind(X.ascending, Y.ascending)

# prepare data structure
C.n <- as.data.frame(matrix(nrow = n, ncol = n), row.names = paste0("X", 1:n))
colnames(C.n) <- paste0("Y", 1:n)
  
# run through the indizes i (of X) and j (of Y)
for( i in 1:n){
  for( j in 1:n){
    C.n[i, j] <- sum(apply(X=Z, MARGIN = 1, FUN = function(Z.row, x.sorted, y.sorted) { sum(Z.row <= c(x.sorted, y.sorted), na.rm = TRUE) >=2 }, X.ascending[i], Y.ascending[j]), na.rm = TRUE)/n
  }
}

# plot the empirical copula
library(plot3D)
x <- (1:n)/n
y <- x 
persp3D(x = x, y = y, z = as.matrix(C.n), phi = 45, theta = 45, xlab = "X", ylab = "Y", main = "Empirical Copula")

We basically apply the same algorithm to calculate the empirical copula values, which are computed within the loops and named by C.n[i, j]. The arguments i and j represent the values of the domain of the empirical copula. The defined function then just counts whether what conditions are met and divides by $n$ .

The plot created by R (for my specific random variate) can be seen in the following.

Empirical Copula of Bivariate Normally Distributed Sample

Transforming the normally distributed sample

Let us further use the random sample $(x_i, y_i)$ with $1 \leq i,j \leq n$ that we have just simulated. However, let us define new random variables $R:= \exp(X)$ and $S:= \exp(Y)$ by transforming the samples $r_i = \exp(x_i)$ , $s_j= \exp(y_j)$ with $1 \leq i, j\leq n$ accordingly.

What do you think how the corresponding empirical copula will look like?

Well, given that $\exp$ is a monotone function nothing will change since the order of the sample will not change. The following R code demonstrates that.

library(mvtnorm) #package to simulate joint normal distribution
library(mvtnorm) #package to simulate joint normal distribution

n <- 100 #sample size

# parameters of the bivariate normal
mean <- c(0,0) # mean for distribution
sigma <- matrix(c(1,0.3,0.3,1), ncol=2) # correlation matrix

# sample the bivariate normal
Z <- rmvnorm(n=n, mean = mean, sigma = sigma) # simulate bivariate normally distributed sample
plot(Z, main="Bivariate Normal, Correlation 0.3, X vs. Y")

# divide Z to get X and Y
X <- Z[,1]
Y <- Z[,2]
cor(X,Y)

# apply a monotone transformation to X and Y
R <- exp(X)
S <- exp(Y)
cor(R, S)
T <- cbind(R, S)

# sort simulated sample
X.ascending <- sort(X)
Y.ascending <- sort(Y)
Z.ascending <- cbind(X.ascending, Y.ascending)

# sort transformed sample
R.ascending <- sort(R)
S.ascending <- sort(S)
T.ascending <- cbind(R.ascending, S.ascending)

# prepare data structures
C.n <- as.data.frame(matrix(nrow = n, ncol = n), row.names = paste0("X", 1:n))
colnames(C.n) <- paste0("Y", 1:n)

C.n.trans <- as.data.frame(matrix(nrow = n, ncol = n), row.names = paste0("R", 1:n))
colnames(C.n.trans) <- paste0("S", 1:n)
  
# run through the indizes i (of X/R) and j (of Y/S)
for( i in 1:n){
  for( j in 1:n){
    C.n[i, j] <- sum(apply(X=Z, MARGIN = 1, FUN = function(Z.row, x.sorted, y.sorted) { sum(Z.row <= c(x.sorted, y.sorted), na.rm = TRUE) >=2 }, X.ascending[i], Y.ascending[j]), na.rm = TRUE)/n
    C.n.trans[i, j] <- sum(apply(X=T, MARGIN = 1, FUN = function(T.row, r.sorted, s.sorted) { sum(T.row <= c(r.sorted, s.sorted), na.rm = TRUE) >=2 }, R.ascending[i], S.ascending[j]), na.rm = TRUE)/n
  }
}

# plot the empirical copula
library(plot3D)
x <- (1:n)/n
y <- x

par(mfrow=c(1, 2)); #prepare canvas
persp3D(x = x, y = y, z = as.matrix(C.n), phi = 45, theta = 45, xlab = "X", ylab = "Y", main = "Empirical Copula of (X,Y)")
persp3D(x = x, y = y, z = as.matrix(C.n.trans), phi = 45, theta = 45, xlab = "Exp(X)", ylab = "Exp(Y)", main = "Empirical Copula of (Exp(X), Exp(Y))")

Comparison of empirical copulas of $(X,Y)$ and transformed $(\exp(X), \exp(Y))$

The comparison of both plots show graphically that the empirical copula does not change by the transformation.

For all who would like to know more about empirical copulas I recommend [¹] by Genest et al.

Genest C, Nešlehová J. A Primer on Copulas for Count Data. ASTIN Bulletin. 2007;37(02):475-515. doi: 10.1017/s0515036100014963

Tag: Statistics

What is an Empirical Copula?

Introduction

Hands-on Approach

Uniformly with a sample size of n=5

Normally distributed with a sample size of n=100

Transforming the normally distributed sample