Introduction
The assumption of independent and identically distributed (i.i.d.) random variables is quite handy since it simplifies many statistical methods. The strong law of large numbers as well as the central limit theorem, for instance, assume independent variables. Benefiting from this independence while still being able to incorporate dependency structures between the random variables would therefore be very useful. And indeed, there is a way to sneak correlations into a given setting of i.i.d. random variables. One such technique is the mixture model, for example the binomial- and the Poisson-mixture model, both of which are studied in this post. The following is mainly based on [1].
Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a suitable probability space and assume that we are interested in a set of random variables $L_1, \ldots, L_m$.
Bernoulli-Mixture Model
A Bernoulli trial (or binomial trial) is a random experiment with exactly two possible outcomes, which we can call “success” and “failure”. The “success” outcome, often represented by 1, appears with probability $p$, while the “failure” outcome, represented by 0, appears with the complementary probability $1 - p$.
One of the simplest stochastic processes, the so-called Bernoulli process, is simply a sequence of independent Bernoulli trials.
Example 1:
Let us assume that $L_i$ represents the creditworthiness of counterparty $i$ of a credit portfolio with $m$ counterparties. It is natural to model the two states “survival” and “default” by using a Bernoulli process with $L_i \in \{0, 1\}$ for all counterparties $i = 1, \ldots, m$. That is, $L_i$ can take on the following two states:
- $L_i = 1$ with probability $p_i$, if the counterparty goes bankrupt, and,
- $L_i = 0$ with probability $1 - p_i$, if the counterparty survives.
Here, $p_i$ is the probability of default of counterparty $i$, which is usually derived from a corresponding credit rating and suitable historical default rates.
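To make this concrete, here is a minimal R sketch that simulates one realization of such a Bernoulli process for a small portfolio; the default probabilities pd are hypothetical values chosen purely for illustration.

# hypothetical default probabilities for a portfolio of 5 counterparties
pd <- c(0.01, 0.02, 0.05, 0.10, 0.15)
set.seed(1)
# one realization of the Bernoulli process: 1 = default, 0 = survival
L <- rbinom(n = length(pd), size = 1, prob = pd)
L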
Another popular example of a Bernoulli trial/process is tossing a coin, where one side appears with probability $p$ and the other with probability $1 - p$.
More formally, a Bernoulli process is a (finite) sequence of independent random variables $L_1, \ldots, L_m$ with

(1) $L_i \sim B(1, p_i)$, i.e. $\mathbb{P}[L_i = 1] = p_i$ and $\mathbb{P}[L_i = 0] = 1 - p_i$,

where each $L_i$ is a Bernoulli trial with probability $p_i$.
A Bernoulli process is therefore a sequence of independent identically distributed Bernoulli trials. Given the discrete nature of the Bernoulli trial, the expectation as well as the variance for counterparty $i$ is given by

(2) $\mathbb{E}[L_i] = p_i, \qquad \mathbb{V}[L_i] = p_i (1 - p_i).$
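As a quick sanity check, the moments in (2) can be verified by simulation; the probability p below is an arbitrary illustrative value.

p <- 0.2                                    # illustrative probability
x <- rbinom(n = 100000, size = 1, prob = p)
mean(x)                                     # should be close to p
var(x)                                      # should be close to p * (1 - p)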
In the next sections we will have a closer look at the Bernoulli process and we will relax the assumption of independence between the trials.
Sum of i.i.d. Bernoulli Random Variables with constant Parameter
For the sake of illustration let us assume that all random variables $L_1, \ldots, L_m$ are mutually independent and that all of them have the same probability $p$ assigned. That is, $(L_i)_{i=1,\ldots,m}$ is a Bernoulli process with probability $p$. In this case the distribution of the sum

(3) $L := \sum_{i=1}^{m} L_i$

is governed by the binomial distribution $B(m, p)$.
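A small simulation illustrates this; m and p below are illustrative values only.

m <- 10
p <- 0.1
n.sim <- 100000
# simulate n.sim Bernoulli processes of length m and sum each of them
L.sum <- colSums(matrix(rbinom(n.sim * m, size = 1, prob = p), nrow = m))
# compare the empirical distribution of the sum with the binomial pmf
round(rbind(empirical = table(factor(L.sum, levels = 0:m)) / n.sim,
            binomial  = dbinom(0:m, size = m, prob = p)), 4)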
Example 2:
We continue Example 1 and consider the sum $L = \sum_{i=1}^{m} L_i$ of the corresponding Bernoulli process, which is distributed according to $B(m, p)$. The probability that $k$ out of $m$ counterparties are going to default (“success”) is $\mathbb{P}[L = k] = \binom{m}{k} p^k (1 - p)^{m - k}$.
For instance, with, say, $m = 10$ counterparties and $p = 0.1$ (illustrative values), exactly two defaults will occur with probability $\binom{10}{2} \, 0.1^2 \, 0.9^{8} \approx 0.194$.
As illustrated, it is easy to calculate any feature of the distribution of $L$. For instance, the expectation, the variance as well as the probability of a specific value $k$ are given by:

(4) $\mathbb{E}[L] = m p, \qquad \mathbb{V}[L] = m p (1 - p), \qquad \mathbb{P}[L = k] = \binom{m}{k} p^k (1 - p)^{m - k}.$
Besides that, it would also be quite simple to determine the expectation $\mathbb{E}[L]$ as well as the variance $\mathbb{V}[L]$ by just knowing (2) and applying the additivity of the expectation and of the variance (the latter thanks to independence).
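Sticking to the illustrative values m = 10 and p = 0.1 from above, the quantities in (4) can be computed directly in R:

m <- 10
p <- 0.1
m * p                              # expectation E[L]
m * p * (1 - p)                    # variance V[L]
dbinom(2, size = m, prob = p)      # probability of exactly two defaults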
In the next step we are going to incorporate correlation between the variables $L_1, \ldots, L_m$ by using the distribution parameters $p_1, \ldots, p_m$. Notice that each $p_i$ is a probability, i.e. $p_i \in [0, 1]$.
Sum of Independent Bernoulli Random Variables with Random Parameter
By treating the distribution parameters $p_1, \ldots, p_m$ of the outlined Bernoulli process as random variables $P_1, \ldots, P_m$, it is possible to incorporate a correlation. That is, we assume that $P = (P_1, \ldots, P_m)$ follows a joint distribution represented by the distribution function $F$ with support in $[0, 1]^m$. Correspondingly, we assume $L_i \sim B(1, P_i)$, where $P_i$ itself is considered to be a random variable.
This kills two birds with one stone. First, we drop the assumption that all Bernoulli trials have the same probability. Second, we can incorporate a correlation between the random variables $L_i$ and $L_j$ since their correlation only depends on $P_i$ and $P_j$.
In more mathematical terms, let

(5) $L_i \mid P_i = p_i \;\sim\; B(1, p_i), \qquad i = 1, \ldots, m.$
Conditional on a realization of the parameters $P_1, \ldots, P_m$, the random variables $L_1, \ldots, L_m$ are still independent. One could, for instance, assume that each $P_i$ is uniformly distributed on $[0, 1]$, as in the R code example below.
The joint distribution of $L_1, \ldots, L_m$ is given by

(6) $\mathbb{P}[L_1 = l_1, \ldots, L_m = l_m] = \int_{[0,1]^m} \prod_{i=1}^{m} p_i^{l_i} (1 - p_i)^{1 - l_i} \, dF(p_1, \ldots, p_m),$

where $l_i \in \{0, 1\}$ = {“failure”, “success”}. The representation (6) is possible since we can leverage the conditional independence.
The covariance between the marginal random variables equals

(7) $\mathrm{Cov}[L_i, L_j] = \mathrm{Cov}[P_i, P_j], \qquad i \neq j,$

and therefore the correlation can be determined by

(8) $\mathrm{Corr}[L_i, L_j] = \dfrac{\mathrm{Cov}[P_i, P_j]}{\sqrt{\mathbb{E}[P_i](1 - \mathbb{E}[P_i])} \, \sqrt{\mathbb{E}[P_j](1 - \mathbb{E}[P_j])}}.$
That is, the correlation between the two variables $L_i$ and $L_j$ is fully determined by the variables $P_i$ and $P_j$. We can therefore use this backdoor to sneak in correlation using conditional independence.
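For completeness, here is a short sketch of the derivation behind (7): by the tower property of conditional expectation and the conditional independence of $L_i$ and $L_j$ given $(P_i, P_j)$,

$\mathbb{E}[L_i L_j] = \mathbb{E}\bigl[\mathbb{E}[L_i L_j \mid P_i, P_j]\bigr] = \mathbb{E}[P_i P_j] \qquad \text{and} \qquad \mathbb{E}[L_i] = \mathbb{E}\bigl[\mathbb{E}[L_i \mid P_i]\bigr] = \mathbb{E}[P_i],$

so that

$\mathrm{Cov}[L_i, L_j] = \mathbb{E}[P_i P_j] - \mathbb{E}[P_i]\,\mathbb{E}[P_j] = \mathrm{Cov}[P_i, P_j], \qquad i \neq j.$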
The binomial-mixture model is analytically tractable. Nonetheless, we can simulate both sides of equations (7) and (8) to double-check them by example. The following R code is one way to do so.
# Binomial Mixture Model:
# ---------
# Produce dependent uniformly distributed variables by starting with
# multivariate normal variables. We therefore need the following package
library(mvtnorm)

m <- 20                    # nbr of variables
simulation.nbr <- 1000000  # nbr of simulations

# get correlated normal random variables
sigma <- diag(1, nrow = m, ncol = m)
# variables 2 and 3 shall be correlated with 0.5
sigma[2, 3] <- 0.5
sigma[3, 2] <- 0.5
# variables 12 and 20 shall be correlated with 0.8
sigma[12, 20] <- 0.8
sigma[20, 12] <- 0.8

# simulate the m-dimensional joint normal distribution
X <- rmvnorm(n = simulation.nbr, mean = rep(0, m), sigma = sigma)

# transform the data into correlated probabilities on [0,1]^m
P <- pnorm(X)

# prepare the data structure for the variables L_i
L <- matrix(nrow = simulation.nbr, ncol = m)

# simulate the Bernoulli distribution for each variable i,
# i.e. L[, i] = 1 occurs with probability P[, i]
for (i in 1:m) {
  L[, i] <- ifelse(runif(simulation.nbr) < P[, i], 1, 0)
}

# By construction the following covariances should be of the same order
cov(P[, 2], P[, 3])
cov(L[, 2], L[, 3])
cov(P[, 12], P[, 20])
cov(L[, 12], L[, 20])

# By construction the following correlations should be of the same order,
# cf. equation (8)
cov(P[, 2], P[, 3]) / (sqrt(mean(P[, 2]) * (1 - mean(P[, 2]))) *
                       sqrt(mean(P[, 3]) * (1 - mean(P[, 3]))))
cor(L[, 2], L[, 3])
cov(P[, 12], P[, 20]) / (sqrt(mean(P[, 12]) * (1 - mean(P[, 12]))) *
                         sqrt(mean(P[, 20]) * (1 - mean(P[, 20]))))
cor(L[, 12], L[, 20])
Please notice that we simulate a suitable Bernoulli process using the code ifelse(runif(simulation.nbr) < P[,i], 1, 0), where P[,i] carries the correlation information of the parameters $P_i$ already in it.
Poisson-Mixture Model
A similar pattern is also used for Poisson-distributed random variables. Let us therefore assume that our random variables $L_1, \ldots, L_m$ are distributed according to the Poisson distribution with intensity parameter $\lambda_i$, denoted by $L_i \sim \mathrm{Pois}(\lambda_i)$. The expectation as well as the variance of a Poisson-distributed random variable is given by its intensity parameter, i.e.,

(9) $\mathbb{E}[L_i] = \mathbb{V}[L_i] = \lambda_i.$
The distribution of the sum of independent Poisson-distributed variables is determined by

(10) $L := \sum_{i=1}^{m} L_i \;\sim\; \mathrm{Pois}\!\left(\sum_{i=1}^{m} \lambda_i\right).$
Given that we know the distribution of $L$, it is easy to derive any feature such as expectation and variance:

(11) $\mathbb{E}[L] = \mathbb{V}[L] = \sum_{i=1}^{m} \lambda_i.$
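Both (9) and (10) are easy to verify by simulation; the intensities below are arbitrary illustrative values.

lambda <- c(0.5, 1.0, 2.5)                  # illustrative intensities
n.sim <- 100000
# simulate three independent Poisson variables and sum them row-wise
X <- sapply(lambda, function(l) rpois(n.sim, lambda = l))
L <- rowSums(X)
mean(L)                                     # should be close to sum(lambda) = 4
var(L)                                      # should also be close to sum(lambda) = 4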
Just as in the binomial case we introduce correlation between the variables $L_i$ and $L_j$ with $i \neq j$ by assuming that $\Lambda = (\Lambda_1, \ldots, \Lambda_m)$ is a random vector with distribution function $F$ and support in $[0, \infty)^m$.
Further we assume that conditional on a realization $(\lambda_1, \ldots, \lambda_m)$ of $\Lambda$ the variables $L_i$ with $i = 1, \ldots, m$ are independent:

(12) $L_i \mid \Lambda_i = \lambda_i \;\sim\; \mathrm{Pois}(\lambda_i).$
The joint distribution of $L_1, \ldots, L_m$ is given by

(13) $\mathbb{P}[L_1 = l_1, \ldots, L_m = l_m] = \int_{[0,\infty)^m} \prod_{i=1}^{m} e^{-\lambda_i} \frac{\lambda_i^{l_i}}{l_i!} \, dF(\lambda_1, \ldots, \lambda_m),$

where $l_i \in \mathbb{N}_0 = \{0, 1, 2, \ldots\}$.
The covariance between the marginal random variables equals

(14) $\mathrm{Cov}[L_i, L_j] = \mathrm{Cov}[\Lambda_i, \Lambda_j], \qquad i \neq j,$

and therefore the correlation can be determined by

(15) $\mathrm{Corr}[L_i, L_j] = \dfrac{\mathrm{Cov}[\Lambda_i, \Lambda_j]}{\sqrt{\mathbb{V}[\Lambda_i] + \mathbb{E}[\Lambda_i]} \, \sqrt{\mathbb{V}[\Lambda_j] + \mathbb{E}[\Lambda_j]}}.$
That is, the correlation between the two variables $L_i$ and $L_j$ is fully determined by the variables $\Lambda_i$ and $\Lambda_j$. We can therefore use this backdoor to sneak the correlation in using conditional independence.
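The denominator of (15) uses the fact that the variance of a mixed Poisson variable equals $\mathbb{V}[\Lambda_i] + \mathbb{E}[\Lambda_i]$. The following minimal sketch checks this for a single variable whose intensity is, purely for illustration, uniformly distributed on [1, 3]:

n.sim <- 100000
Lambda <- runif(n.sim, min = 1, max = 3)    # illustrative mixing distribution
L <- rpois(n.sim, lambda = Lambda)          # mixed Poisson draws
var(L)                                      # empirical variance
var(Lambda) + mean(Lambda)                  # should be of the same order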
The Poisson-mixture model is analytically tractable as well. Nonetheless, we can simulate both sides of equations (14) and (15) to double-check them by example. The following R code is one way to do so.
# Poisson Mixture Model:
# ---------
# Produce dependent intensity variables on the positive reals using
# multivariate discrete random variables and a Gaussian copula.
# To this end we use the GenOrd package:
library(GenOrd)

m <- 20                    # nbr of variables
simulation.nbr <- 1000000  # nbr of simulations

# since we want to have correlated random variables we need a corr matrix
sigma <- diag(1, nrow = m, ncol = m)
# variables 2 and 3 shall be correlated with 0.2
sigma[2, 3] <- 0.2
sigma[3, 2] <- 0.2
# variables 12 and 20 shall be correlated with 0.7
sigma[12, 20] <- 0.7
sigma[20, 12] <- 0.7

# to make it reproducible
set.seed(1)

# list of m vectors of cumulative probabilities defining the i-th marginal
# distribution via its CDF; they are all the same in our simple example
marginal.CDF <- vector("list", m)
for (i in 1:m) {
  marginal.CDF[[i]] <- c(0.1, 0.3, 0.5, 0.6, 0.75)
}

# checks the lower and upper bounds of the correlation coefficients;
# in our case the two correlations lie within the thresholds
corrcheck(marginal.CDF)

# create a correlated sample with the given marginals
Lambda <- ordsample(n = simulation.nbr, marginal = marginal.CDF, Sigma = sigma)

# prepare the data structure for the Poisson-distributed variables
# using the realized lambdas from the variable Lambda
L <- array(0, dim = c(simulation.nbr * 10, m))

# simulate the Poisson distribution for each variable i
for (i in 1:m) {
  # do for all variables L.i; Lambda[, i] is recycled ten times
  L[, i] <- rpois(n = simulation.nbr * 10, lambda = Lambda[, i])
}

# By construction the following covariances should be of the same order
cov(Lambda[, 2], Lambda[, 3])
cov(L[, 2], L[, 3])
cov(Lambda[, 12], Lambda[, 20])
cov(L[, 12], L[, 20])

# By construction the following correlations should be of the same order,
# cf. equation (15)
cov(Lambda[, 2], Lambda[, 3]) / (sqrt(var(Lambda[, 2]) + mean(Lambda[, 2])) *
                                 sqrt(var(Lambda[, 3]) + mean(Lambda[, 3])))
cor(L[, 2], L[, 3])
cov(Lambda[, 12], Lambda[, 20]) / (sqrt(var(Lambda[, 12]) + mean(Lambda[, 12])) *
                                   sqrt(var(Lambda[, 20]) + mean(Lambda[, 20])))
cor(L[, 12], L[, 20])
Please notice that rpois(n = 20, lambda = c(1, 500)), for instance, generates 20 random variates by recycling the two different intensity parameters 1 and 500, so that each intensity parameter is used for 10 of the 20 draws. We therefore multiply the variable simulation.nbr by 10 so that each realized intensity in Lambda is reused ten times, which increases the quality of the approximation.
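The recycling behaviour can be inspected directly:

set.seed(1)
# lambda = c(1, 500) is recycled: odd draws use intensity 1, even draws 500
matrix(rpois(n = 20, lambda = c(1, 500)), ncol = 2, byrow = TRUE)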
Literature:
[1]