Contents
Introduction
Copulas are an important concept in statistics and beyond to describe dependency structures of continuous distributions. However, what can we do if we have multivariate data at hand. And how can we do that?
The following blog post provides a hands-on approach to empirical copulas. It provides insight and illustrations.
Let us assume we have a sample of two random variables and at hand, and we would like to know more about how both variables are interrelated to each other. The elements of the sample of size of both random variables shall be denoted as a tuple with .
The empirical (bivariate) copula is defined as the discrete function given by
(1)
where and , denote the order statistics of the sample and provides the cardinality of the subsequent set. For instance, .
Notice that the definition is actually a special case of the definition of an empirical distribution function. However, empirical copula functions are only defined on and they represent a dependency structure and not a distribution.
Hands-on Approach
Uniformly with a sample size of n=5
First, let us assume that and follow an uniform distribution on . For both random variables we draw a small sample of size . The sample is listed in the following table and illustrated in the plot below (blue points).
1 | 0.95 | 0.24 | 0.19 | 0.16 |
2 | 0.53 | 0.16 | 0.32 | 0.24 |
3 | 0.77 | 0.56 | 0.53 | 0.33 |
4 | 0.19 | 0.33 | 0.77 | 0.56 |
5 | 0.32 | 0.80 | 0.95 | 0.80 |
According to the definition of the empirical copula, we also need to know the corresponding ordered samples and , which can also be found in the table above. In addition, the tuples of the order statistic are also sketched in the plot below (orange triangles).
The domain of the empirical copula is the Cartesian product of the values . For instance, since no tupel exists such that and for . However, for and the tuple meet the conditions and with .
By knowing the set in the numerator of equation (1), we are able to derive the corresponding empirical copula value easily. We just need to count the elements of this set and divide the number by the sample size . The result is the desired empirical copula value which is a (cumulative) frequency.
Note that the set of equation (1) corresponds to all points that lie within the rectangles intersecting the and axes, respectively, passing through .
The values are listed in the following table.
0.00 | 0.00 | 0.20 | 0.20 | 0.20 | |
0.00 | 0.00 | 0.20 | 0.20 | 0.40 | |
0.20 | 0.20 | 0.40 | 0.40 | 0.60 | |
0.20 | 0.20 | 0.40 | 0.60 | 0.80 | |
0.20 | 0.40 | 0.60 | 0.80 | 1.00 |
Graphically the empirical copula looks as follows:
Normally distributed with a sample size of n=100
Second, let us now assume that follows a bivariate normal distributed with mean and
.
Here, is the correlation coefficient between and . We draw a bivariate sample of size with using R.
library(mvtnorm) #package to simulate joint normal distribution n <- 100 #sample size # parameters of the bivariate normal mean <- c(0,0) # mean for distribution sigma <- matrix(c(1,0.3,0.3,1), ncol=2) # correlation matrix # sample the bivariate normal Z <- rmvnorm(n=n, mean = mean, sigma = sigma) # simulate bivariate normally distributed sample plot(Z, main="Bivariate Normal, Correlation 0.3, X vs. Y") # divide Z to get X and Y X <- Z[,1] Y <- Z[,2] # sort sample X.ascending <- sort(X) Y.ascending <- sort(Y) Z.ascending <- cbind(X.ascending, Y.ascending) # prepare data structure C.n <- as.data.frame(matrix(nrow = n, ncol = n), row.names = paste0("X", 1:n)) colnames(C.n) <- paste0("Y", 1:n) # run through the indizes i (of X) and j (of Y) for( i in 1:n){ for( j in 1:n){ C.n[i, j] <- sum(apply(X=Z, MARGIN = 1, FUN = function(Z.row, x.sorted, y.sorted) { sum(Z.row <= c(x.sorted, y.sorted), na.rm = TRUE) >=2 }, X.ascending[i], Y.ascending[j]), na.rm = TRUE)/n } } # plot the empirical copula library(plot3D) x <- (1:n)/n y <- x persp3D(x = x, y = y, z = as.matrix(C.n), phi = 45, theta = 45, xlab = "X", ylab = "Y", main = "Empirical Copula")
We basically apply the same algorithm to calculate the empirical copula values, which are computed within the loops and named by C.n[i, j]
. The arguments i
and j
represent the values of the domain of the empirical copula. The defined function then just counts whether what conditions are met and divides by .
The plot created by R (for my specific random variate) can be seen in the following.
Transforming the normally distributed sample
Let us further use the random sample with that we have just simulated. However, let us define new random variables and by transforming the samples , with accordingly.
What do you think how the corresponding empirical copula will look like?
Well, given that is a monotone function nothing will change since the order of the sample will not change. The following R code demonstrates that.
library(mvtnorm) #package to simulate joint normal distribution library(mvtnorm) #package to simulate joint normal distribution n <- 100 #sample size # parameters of the bivariate normal mean <- c(0,0) # mean for distribution sigma <- matrix(c(1,0.3,0.3,1), ncol=2) # correlation matrix # sample the bivariate normal Z <- rmvnorm(n=n, mean = mean, sigma = sigma) # simulate bivariate normally distributed sample plot(Z, main="Bivariate Normal, Correlation 0.3, X vs. Y") # divide Z to get X and Y X <- Z[,1] Y <- Z[,2] cor(X,Y) # apply a monotone transformation to X and Y R <- exp(X) S <- exp(Y) cor(R, S) T <- cbind(R, S) # sort simulated sample X.ascending <- sort(X) Y.ascending <- sort(Y) Z.ascending <- cbind(X.ascending, Y.ascending) # sort transformed sample R.ascending <- sort(R) S.ascending <- sort(S) T.ascending <- cbind(R.ascending, S.ascending) # prepare data structures C.n <- as.data.frame(matrix(nrow = n, ncol = n), row.names = paste0("X", 1:n)) colnames(C.n) <- paste0("Y", 1:n) C.n.trans <- as.data.frame(matrix(nrow = n, ncol = n), row.names = paste0("R", 1:n)) colnames(C.n.trans) <- paste0("S", 1:n) # run through the indizes i (of X/R) and j (of Y/S) for( i in 1:n){ for( j in 1:n){ C.n[i, j] <- sum(apply(X=Z, MARGIN = 1, FUN = function(Z.row, x.sorted, y.sorted) { sum(Z.row <= c(x.sorted, y.sorted), na.rm = TRUE) >=2 }, X.ascending[i], Y.ascending[j]), na.rm = TRUE)/n C.n.trans[i, j] <- sum(apply(X=T, MARGIN = 1, FUN = function(T.row, r.sorted, s.sorted) { sum(T.row <= c(r.sorted, s.sorted), na.rm = TRUE) >=2 }, R.ascending[i], S.ascending[j]), na.rm = TRUE)/n } } # plot the empirical copula library(plot3D) x <- (1:n)/n y <- x par(mfrow=c(1, 2)); #prepare canvas persp3D(x = x, y = y, z = as.matrix(C.n), phi = 45, theta = 45, xlab = "X", ylab = "Y", main = "Empirical Copula of (X,Y)") persp3D(x = x, y = y, z = as.matrix(C.n.trans), phi = 45, theta = 45, xlab = "Exp(X)", ylab = "Exp(Y)", main = "Empirical Copula of (Exp(X), Exp(Y))")
The comparison of both plots show graphically that the empirical copula does not change by the transformation.
For all who would like to know more about empirical copulas I recommend [1] by Genest et al.