Probability Integral Transform & Quantile Function Theorem

Introduction

We present simple proofs for two very important theorems,

  • the probability integral transformation, and,
  • the quantile function theorem.

Both theorems are important in statistics, computational math, machine learning and beyond.

In particular, we shed light on the background of these two well-known theorems by investigating the “why-it-works” question. To this end, we take a closer look at the difference between continuous and discrete distributions with respect to both theorems. For instance, we demonstrate why the Probability Integral Transformation Theorem in general does not work for discrete distributions. In addition, we outline under what circumstances a discrete distribution at least “converges towards the assertion” of the Probability Integral Transform Theorem. It will become clear what is meant by that in the course of the post.

Let us first state the actual theorems in the next section.
Afterwards, we are going to illustrate both theorems by example, and finally we prove them using the ideas of Angus [1]. Note that we will also show why many proofs provided in textbooks or internet resources are incomplete and rather misleading. We will then apply the theory by generating normally distributed samples in Microsoft Excel. Finally, we provide an overview of the connection to the concept of copulas for the multi-dimensional case.

Theorems

Let X be a real-valued random variable defined on a probability space (\Omega, \mathcal{A}, \mathbb{P}). Let F: \mathbb{R} \rightarrow [0,1], defined by x \mapsto F(x):= \mathbb{P}(\{\omega: \ X(\omega)\leq x\}) for x\in \mathbb{R}, be the distribution function of X.

Theorem I (Probability Integral Transformation):
If X has a continuous distribution function F, then the random variable Y:=F(X) has the distribution function of \text{Uniform}(0,1).

\square

The second theorem that we are going to prove is as follows.

Theorem II (Quantile Function):
Let F be a distribution function. If F^{-1}: \ ]0,1[ \rightarrow \mathbb{R} is defined by F^{-1}(y)=\inf\{ x \ | \ F(x)\geq y \} for 0<y<1, and U has the distribution \text{Uniform}(0,1), then X=F^{-1}(U) has distribution function F.

\square

But why are both theorems considered to be so important?

Whenever simulations are required, the Quantile Function Theorem is of importance as it is the basis for generating samples from probability distributions. The Probability Integral Transformation Theorem is the basis for many statistical tests (an important field of frequentist statistics) and for the definition and interpretation of a copula.

Both theorems are fundamental results that are applied in many important areas, and it is definitely worth understanding both of them. Let us now try to understand what both theorems state by applying a hands-on approach.

Illustrations

Probability Integral Transformation

Let us draw a large sample of a standard normally distributed random variable X \sim \text{Normal}(0,1). We can use a computer and R, for instance, to perform that kind of task.

X <- rnorm(n=10^6, mean=0, sd=1)
hist(X)

A simulated sample of a standard normally distributed random variable X is shown in Fig. 1.

Histogram of Standard Normal Distribution
Fig. 1: Histogram of Standard Normal Distribution

If we now apply the corresponding distribution function F:\mathbb{R}\rightarrow [0,1] of the standard normal distribution to the output of the random variable X, we get Y=F(X), which is governed by the uniform distribution on [0,1], as we can see in the histogram in Fig. 2.

Again, we use a computer and R to perform the second step as well.

Y <- pnorm(X, mean=0, sd=1)
hist(Y)

Realize that the range of any distribution function F is contained in [0,1], which equals the support of a random variable governed by \text{Uniform}(0,1).

Histogram of Uniform Distribution on [0,1]
Fig. 2: Histogram of Uniform Distribution on [0,1] generated by X\sim \text{Normal}(0,1) and its distribution function F, i.e. Y=F(X)

The demonstrated effect is not a coincidence but is based on the Probability Integral Transform Theorem. That is, we could take X to be any random variable with a continuous distribution function and the result would always be the same.

For instance, let us do this again with the exponential distribution and with the Cauchy distribution. Both do have a continuous distribution function:

X <- rexp(n=10^6, rate = 1)
hist(X)
Y <- pexp(X, rate=1)
hist(Y)
X <- rcauchy(n=10^6, location=0, scale=1)
hist(X)
Y <- pcauchy(X, location=0, scale=1)
hist(Y)

You will notice that outputs similar to Fig. 2 are generated, which indicates that Y=F(X)\sim \text{Uniform}(0,1).
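As a rough quantitative check (a sketch, not part of the theory), we can compare the last simulated sample Y against the Uniform(0,1) distribution using base R:

# Informal check that Y = F(X) is (approximately) Uniform(0,1):
ks.test(Y, "punif")
# Alternatively, compare the empirical CDF of Y with the identity function:
plot(ecdf(Y))
abline(a = 0, b = 1, col = "red", lty = 2)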

If we do the same exercise with a discrete (i.e. not continuous) random variable such as the Binomial distribution X\sim \text{Binomial}(n, p), where n is the number of independent Bernoulli trials and p is the success probability of each trial, the resulting histogram is quite different. Note that we denote the discrete distribution function of X by G_X.

Example 1 (Binomial Variable)

Let us consider the case with X \sim \text{Binomial}(n=2, p=0.5). A binomial distribution with parameters n and p comprises n independent trials, each with success probability p. Hence, in our case, we have the situation outlined in Tab. 1.

1st Bernoulli Trial | 2nd Bernoulli Trial
Success             | Success
Failure             | Failure
Success             | Failure
Failure             | Success
Tab. 1: Possible outcomes of two independent Bernoulli trials with p=0.5

That is, the basic set \Omega consists of all sequences of n Bernoulli outcomes, represented by corresponding n-tuples. Thus, \Omega=\{ (failure,failure), (success,failure), (failure,success), (success,success)\} is the basic set of this example.
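To make this concrete, here is a small R sketch (our own illustrative helper code, not part of the simulation below) that enumerates \Omega explicitly and derives the distribution of X from it:

# Enumerate the basic set Omega of Example 1 and derive the distribution of
# X = number of successes from it.
outcomes <- expand.grid(trial1 = c("failure", "success"),
                        trial2 = c("failure", "success"))
X_of_omega <- rowSums(outcomes == "success")   # X(omega) for each omega
table(X_of_omega) / nrow(outcomes)             # P(X=0)=0.25, P(X=1)=0.5, P(X=2)=0.25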

Let us simulate the Binomial variable X as defined above by using a computer and R.

X <- rbinom(n=10^6, size=2, prob=0.5)
hist(X)
Y <- pbinom(X, size=2, prob=0.5)
hist(Y)

The corresponding histogram of Y=G_X(X) can be seen in Fig. 3.

Fig. 3: Histogram of Y=G_X(X) for X\sim \text{Binomial}(n=2, p=0.5), i.e. n=2 independent Bernoulli experiments with success probability p=0.5

Without continuity, the controlled correspondence between an element x\in \mathbb{R} of the domain and its image F(x)=:u\in [0,1] gets lost. The corresponding discrete distribution function G_X is depicted in Fig. 4.

Fig. 4: Plot of discrete distribution function G_X of X

Failure in both trials occurs in one out of four possible cases, i.e., with probability 0.25. There are two ways to get one success and one failure, i.e., this happens with probability 0.5. Two successes in a row occur with probability 0.25. Putting these pieces together, one ends up with the discrete distribution function of X shown in Fig. 4.
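The step heights of G_X can also be computed directly in R (a one-line sketch using the built-in binomial distribution function):

# Exact values of the discrete distribution function G_X at its jump points,
# i.e. the levels plotted in Fig. 4:
pbinom(0:2, size = 2, prob = 0.5)   # 0.25 0.75 1.00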

Let us consider how this is reflected in the R simulation. To this end, we use the table command in R:

X <- rbinom(n=10^6, size=2, prob = 0.5)
table(X)
Y <- pbinom(X, size=2, prob = 0.5)
table(Y)

We can see that the R variable X contains about 25% 0s, about 50% 1s and about 25% 2s. These values are then entered into the distribution function G_X, which translates them into the accumulated probabilities p\in \{0.25, 0.75, 1\}. Hence, we get about 25% of p=0.25, about 50% of p=0.75 and about 25% of p=1. This is exactly what the histogram in Fig. 3 shows.
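The same counts can be expressed directly as proportions (a small optional check using base R's prop.table on the variables from the snippet above):

# Relative frequencies instead of absolute counts:
prop.table(table(X))   # approx. 0.25, 0.50, 0.25 for the values 0, 1, 2
prop.table(table(Y))   # approx. 0.25, 0.50, 0.25 for the values 0.25, 0.75, 1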

\square

Ultimately, the Probability Integral Transformation cannot be applied to Example 1 (Binomial Variable) since the jumps of the discrete distribution function are not of equal size (i.e. ‘uneven jumps’). In this specific case, we have an increase of 0.25 twice and an increase of 0.5 once.

In continuous distribution functions these increases behave in a controlled manner, and the distribution function balances the likelihood of events with the corresponding pace of the accumulation. For instance, the simulated standard normal distribution shown in Fig. 1 contains only few samples in the tail left of -3. This, however, is compensated by the almost flat distribution function curve of the standard normal distribution around -3, so that a wider range of the domain of X is needed to fill up the corresponding probability bucket. Note that in Fig. 2, for instance, each probability bucket comprises a probability mass of \frac{1}{20} since 20 buckets have been used to cluster the entire probability mass of 1.
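This “probability bucket” intuition can be checked numerically (a small sketch; the 20 equal-width buckets mirror the bucketing described above):

# Split [0,1] into 20 equal-width buckets and verify that each bucket receives
# roughly 1/20 = 5% of the transformed sample Y = F(X).
X <- rnorm(n = 10^6, mean = 0, sd = 1)
Y <- pnorm(X, mean = 0, sd = 1)
round(prop.table(table(cut(Y, breaks = seq(0, 1, by = 0.05)))), 3)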

Example 2 (Binomial Variable)

The binomial distribution converges towards a normal distribution if we increase the number of trials (in R, this parameter is called size). This implies that the ‘uneven jumps’ of a discrete Binomial distribution become less and less meaningful. Indeed, if we repeat the simulation as follows, the result looks quite promising:

size <- 10^5
X <- rbinom(n=10^6, size=size, prob = 0.5)
hist(X)
Y <- pbinom(X, size=size, prob = 0.5)
hist(Y)

Ultimately, all continuous distributions need to be discretized due to the limitations of a computer. So, if we vary the size from 10^3 to 10^5, we get the histograms shown in Fig. 5 to Fig. 7. Apparently, as the binomially distributed X approaches the normal distribution, Y approaches the standard uniform distribution:

Fig. 5: Histogram of Y = G_X(X) with size set to 10^3
Fig. 6: Histogram of Y = G_X(X) with size set to 10^4
Fig. 7: Histogram of Y = G_X(X) with size set to 10^5

\square

That is, if a discrete distribution either gets close enough to a continuous distribution or has even jumps (i.e. is discretely uniform), then the corresponding histogram of Y:=G_X(X) will show a pattern similar to the one described in Theorem I.
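The case of even jumps can be illustrated with a fair die (a minimal sketch; the die is our own example): its distribution function is G(x) = x/6 for x \in \{1, \ldots, 6\}, so Y = G(X) takes the values 1/6, 2/6, \ldots, 1, each with probability 1/6.

# Discrete distribution with even jumps: a fair die.
X <- sample(1:6, size = 10^6, replace = TRUE)
Y <- X / 6              # G(X) for the fair die
prop.table(table(Y))    # each value occurs with relative frequency ~ 1/6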

Quantile Function Theorem

This time we do not generate a uniform sample but start with one. Hence, let us now draw a large uniformly distributed sample U \sim \text{Uniform}(0,1). We use R to perform the simulation.

U <- runif(n=10^6, min=0, max=1)
hist(U)

Starting off with a uniformly distributed sample U, we need to “invert” the steps performed to illustrate the Probability Integral Transform Theorem. That is, we now have to take the generalized inverse distribution function of the intended distribution and apply it to U \sim \text{Uniform}(0,1).

If we would like to obtain the standard normal distribution, for instance, U needs to be fed into the (generalized) inverse F^{-1} of the standard normal distribution function, shown in Fig. 8:

Fig. 8: Inverse F^{-1} of distribution function F of Standard Normal distribution

Take a while and think about what the function F^{-1} actually does: it takes a value in (0,1) (i.e. a probability) and assigns a real value to it. For instance, the probability 0.5 is assigned to the real value 0. Probabilities close to 0 are mapped to large negative values, while probabilities close to 1 are mapped to large positive values. That is, in the tails a small increase in probability corresponds to a large step along the real line, which makes a lot of sense given the shape of the normal distribution.
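To see this numerically, we can evaluate the quantile function of the standard normal distribution at a few probabilities (a small sketch):

# F^{-1} maps probabilities to quantiles: values near 0 go to large negative
# numbers, 0.5 goes to 0, and values near 1 go to large positive numbers.
qnorm(c(0.001, 0.025, 0.5, 0.975, 0.999), mean = 0, sd = 1)
# approx. -3.09 -1.96  0.00  1.96  3.09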

In R we need to take the following steps:

X <- qnorm(U, mean=0, sd=1)
hist(X)

The histogram of X=F^{-1}(U), as shown in Fig. 9, is governed by the standard normal distribution. Just as desired.

Fig. 9: Histogram of a sample X = F^{-1}(U) \sim \text{Normal}(0,1)

Please note, however, that the inverse distribution function F^{-1} is not necessarily an inverse function in the usual sense. Distribution functions are in general not strictly increasing but only non-decreasing. For further details we highly recommend P. Embrechts’ & M. Hofert’s paper ‘A note on generalized inverses’. The difference between inverse and generalized inverse functions will also play a key role in the proof of the Probability Integral Transformation Theorem.
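To make the distinction concrete, here is a minimal sketch of the generalized inverse F^{-1}(u)=\inf\{ x \ | \ F(x)\geq u \} evaluated numerically on a finite grid. The helper gen_inv and the grid restriction are our own illustrative choices, not a standard R API:

# Generalized inverse inf{x : F(x) >= u}, approximated on a finite grid.
# This also works for flat or discrete distribution functions, where an
# ordinary inverse does not exist.
gen_inv <- function(u, cdf, grid) {
  sapply(u, function(p) min(grid[cdf(grid) >= p]))
}

grid <- seq(-10, 10, by = 0.001)
gen_inv(c(0.025, 0.5, 0.975), pnorm, grid)   # approx. -1.96, 0, 1.96 (cf. qnorm)
gen_inv(c(0.1, 0.5, 0.9),
        function(x) pbinom(x, size = 2, prob = 0.5),
        grid = 0:2)                          # 0, 1, 2 (cf. qbinom)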

In addition, Theorem II (Quantile Function) does NOT require the distribution function to be continuous. At first sight, this might seem strange since the Quantile Function Theorem is, in a sense, just an inversion of the Probability Integral Transformation Theorem.

Example 3 (Binomial Variable)

Let us continue the Binomial example (see above) with respect to the Quantile Function Theorem: We draw a large uniformly distributed sample U \sim \text{Uniform}(0,1) and put it into the generalized inverse distribution function G_X^{-1}: \ ]0,1[ \rightarrow \mathbb{R} of \text{Binomial}(n, p) with n=2 and p=0.5. The graph of the corresponding quantile function is illustrated in Fig. 10.

Fig. 10: Plot of quantile function of G_X with X \sim \text{Binomial}(2,0.5)

Again, let us take the time to think about the meaning of this generalized inverse (i.e. quantile function) of the Binomial (cumulative) distribution function. It maps probabilities to integers: values of u in ]0, 0.25] are mapped to 0, values in ]0.25, 0.75] to 1, and values in ]0.75, 1[ to 2. Consequently, a fraction 0.25 of the uniform input is sent to the integer 0, a fraction 0.5 to the integer 1 and a fraction 0.25 to the integer 2.
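We can evaluate this quantile function at a few probabilities in R (a small sketch):

# Quantile function of Binomial(2, 0.5) at selected probabilities:
qbinom(c(0.1, 0.25, 0.5, 0.75, 0.9), size = 2, prob = 0.5)
# returns 0 0 1 1 2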

Let us do this in R:

U <- runif(n=10^6, min=0, max=1)
hist(U)
X <- qbinom(U, size=2, prob = 0.5)
hist(X)

The result depicted in Fig. 11 is as expected.

Fig. 11: Histogram of X=G_X^{-1}(U) that is distributed according to \text{Binomial}(2,0.5)

Apparently, we have obtained the desired Binomial distribution by first generating a standard uniform sample and then applying the quantile function to it.

\square

Proofs

This section is based on the paper [1] by John E. Angus. The problem with the usual attempts to prove the Probability Integral Transformation Theorem is that it is often not made clear how the “inverse” of a distribution function is to be understood. Please also refer to this post on StackExchange. For instance, the proof provided on Wikipedia reads as follows:

<< Given any random continuous variable X, define Y=F_X(X). Then:

    \begin{align*}F_Y(y) &=P(Y \leq y) \\&= P(F_X(X) \leq y) \\&= P(X\leq F_X^{-1}(y)) \\&= F_X(F_X^{-1}(y)) \\&= y\end{align*}

F_Y is just the CDF of a Uniform(0,1) random variable. Thus, Y has a uniform distribution on the interval [0,1]. >>

Wikipedia retrieved in Nov. 2020.

The cited “proof” above seems quite straightforward. However, it is not clarified what is meant by applying the “inverse” F_X^{-1}(y), since F_X need not be invertible. In the next subsection, Lemma 1 makes this step precise.

Probability Integral Transformation

The following lemma is the key to the proof of Theorem I. The principal idea is that the basic set can be decomposed into \Omega = \{\omega: X(\omega) \leq x\} \sqcup \{\omega: X(\omega) > x\}, where \sqcup stands for the disjoint union.

In addition, recall that a distribution function is monotone (non-decreasing): x \leq y \Rightarrow F(x) \leq F(y).

Lemma 1:
Let X have a distribution function F. Then for all real x\in \mathbb{R} the following holds:

\ \  \mathbb{P} \left(\{ \omega: \ F(X(\omega)) \leq F(x)\} \right) \ = \ F(x).

Proof of Lemma 1:
Decompose the event \{\omega : F(X(\omega)) \leq F(x)\} as follows:

    \begin{align*}&\{\omega: F(X(\omega)) \leq F(x) \}  \\&= \left[ \{\omega: F(X(\omega)) \leq F(x) \} \cap \{\omega: X(\omega) \leq x \} \right]  \sqcup \\&\ \ \left[ \{\omega: F(X(\omega)) \leq F(x) \} \cap \{\omega: X(\omega) > x \} \right]\end{align*}

Note that \{\omega: X(\omega) \leq x\} \subset \{\omega: F(X(\omega)) \leq F(x)\} since F is monotone. In addition, due to the monotonicity of F, \{\omega: F(X(\omega)) < F(x)\} \subset \{\omega: X(\omega)<x\}. Hence, the intersection \{\omega: F(X(\omega)) < F(x)\} \cap \{\omega: X(\omega) > x\} is empty. Considering these facts, we can derive

(1)   \begin{align*} & \{\omega: F(X(\omega)) \leq F(x) \}  \\&= \{\omega: X(\omega) \leq x \} \sqcup \\& \ \left[  \{ \omega: X(\omega) > x \}  \cap \{ \omega: F(X(\omega)) = F(x) \} \right].\end{align*}

Taking probabilities in (1), the assertion follows since the event \{ \omega: X(\omega) > x \} \cap \{ \omega: F(X(\omega)) = F(x) \} has probability zero. Indeed, on this event X takes values in the set \{t > x: F(t) = F(x)\} on which F is flat; for every t in this set, \mathbb{P}(x < X \leq t) = F(t) - F(x) = 0, and hence the set carries no probability mass.

\square

Example 4 (Binomial Variable)

Let us re-consider the discrete variable X \sim \text{Binomial}(n=2, p=0.5) as treated in Examples 1 to 3. Recall that we denote the distribution function of the discrete random variable X by G_X. Let us further denote the basic set that represents the outcomes of the n Bernoulli trials by \Omega. X is defined by \omega \mapsto X(\omega) := the number of successes in \omega. We therefore have

    \begin{align*}  & \{ \omega: G_X(X(\omega)) \leq G_X(k)  \}  \\  &= \{ \omega: X(\omega) \leq k \}  \sqcup \\  &  \  \left[  \{  \omega: X(\omega) > k \}  \cap  \{ \omega: G_X(X(\omega)) = G_X(k) \} \right] \\&= \ \{ \omega: X(\omega) \leq k \} \end{align*}

with k \in \{0, 1,  \ldots, n\}, since the intersection in brackets is empty. In order to further illustrate that, let us consider the set \{ \omega: G_X(X(\omega)) = G_X(k) \} for k=0. The resulting set \{(failure, failure)\} has empty intersection with \{ \omega: X(\omega) > k \} = \{(failure,success),(success,failure),(success,success)\}.

\square
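Lemma 1 can also be checked numerically for the binomial variable of Example 1 (a small simulation sketch; the exact equality holds in distribution, the simulation only approximates it):

# Numerical check of Lemma 1 for X ~ Binomial(2, 0.5):
# P(G_X(X) <= G_X(k)) should equal G_X(k) for k = 0, 1, 2.
X <- rbinom(n = 10^6, size = 2, prob = 0.5)
G <- function(x) pbinom(x, size = 2, prob = 0.5)
sapply(0:2, function(k) mean(G(X) <= G(k)))   # approx. 0.25, 0.75, 1.00
G(0:2)                                        # exact:   0.25, 0.75, 1.00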

Let us ultimately prove the Probability Integral Transformation Theorem.

Proof Theorem I (Probability Integral Transformation):
Let u\in (0,1). Since F is continuous with limits 0 at -\infty and 1 at +\infty, the intermediate value theorem yields a real x such that F(x)=u. Then, by Lemma 1, \mathbb{P} ( \{ \omega: Y(\omega) \leq u \} ) = \mathbb{P}( \{ \omega: F( X(\omega) )  \leq  F(x) \} ) = F(x) = u. This implies that Y is distributed as \text{Uniform}(0,1).

\square

Quantile Function Theorem

Proof Theorem II (Quantile Function Theorem):
Recall that F^{-1}(u)= \inf\{x: F(x) \geq u\}. The key observation is that for any x\in \mathbb{R} and any u\in (0,1), F(x)\geq u if, and only if, x\geq F^{-1}(u).

Indeed, since F is non-decreasing and right-continuous, the set \{ x: F(x) \geq u \} is an interval of the form [F^{-1}(u), \infty), i.e. it contains its left-hand endpoint F^{-1}(u). Hence, x \geq F^{-1}(u) implies F(x) \geq u.

Conversely, if u \leq F(x), then x belongs to \{ t: F(t) \geq u \} and therefore x \geq \inf\{ t: F(t) \geq u \} = F^{-1}(u). Using this equivalence, it follows that \mathbb{P}( \{ F^{-1}(U) \leq x \} ) = \mathbb{P}( \{ U \leq F(x) \} ) = F(x), completing the proof.

\square
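The proof can be accompanied by a quick numerical check (a sketch; the Exponential(1) distribution is our own choice of example): the empirical distribution function of X = F^{-1}(U) should be close to F.

# Numerical check of Theorem II with F the Exponential(1) distribution function:
U <- runif(n = 10^5, min = 0, max = 1)
X <- qexp(U, rate = 1)
x_grid <- c(0.5, 1, 2, 3)
rbind(empirical = ecdf(X)(x_grid),
      exact     = pexp(x_grid, rate = 1))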

Generating Normally Distributed Samples in Excel

Given the explanations above, it is quite obvious how to generate normally distributed samples in Microsoft Excel. We need the following ingredients in Excel:

  • Uniformly distributed samples provided in Excel by RAND();
  • Generalized inverse of normal distribution provided in Excel by NORM.S.INV(probability);

According to the Quantile Function Theorem, the outcome will be a standard normally distributed sample.

In the first step, we need to generate a uniformly distributed sample by entering the formula RAND() in as many cells as we would like to have samples. Then, by filling corresponding cells with the formula NORM.S.INV(probability), each uniformly distributed value is transformed into a standard normally distributed value, as illustrated in Fig. 13. Note that NORM.S.INV is the (generalized) inverse distribution function of the standard normal distribution.

Excel Example showing how to generate a normally distributed sample
Fig. 13: Excel Example showing how to generate a standard normally distributed sample using NORM.S.INV(probability)

We could also use the formula NORM.INV(probability, mean, standard deviation) to generate samples from any other normal distribution. Refer to Fig. 14 for an additional illustration.

Fig. 14: Excel Example showing how to generate a (standard) normally distributed sample using NORM.INV(probability, mean, standard deviation)

Microsoft provides a list of available statistical functions. All inverse distribution functions in that list can be used to generate a corresponding sample.

Given that continuity is not needed to apply the Quantile Function Theorem, we can also generate discretely distributed samples by applying the same simple algorithm as outlined in the continuous case.
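For instance, the same recipe, shown here in R for brevity with a Poisson distribution (our own choice of a discrete example), reads:

# Generate Poisson(3) samples by feeding uniform samples into the quantile
# function, exactly as in the continuous case.
U <- runif(n=10^6, min=0, max=1)
X <- qpois(U, lambda=3)
barplot(table(X))   # matches the Poisson(3) probability mass function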

Multivariate Case & Copulas

This section serves as an outlook on the application of both treated theorems to copulas. It is therefore only a brief treatment of the topic. Nonetheless, it will become clear how the Probability Integral Transform is used to define copulas.

First, we need some basic notation for the multivariate case: Let d\in \mathbb{N} denote the dimension of the considered Euclidean vector space \mathbb{R}^d. We define a multivariate random variable or random vector X:\Omega \rightarrow \mathbb{R}^d as a measurable function with values in \mathbb{R}^d. A d-interval is the Cartesian product of d real intervals and a d-box is a closed d-interval. The unit d-cube [0,1]^d is the d-box [(0, \ldots, 0), (1, \ldots, 1)].

A continuous multivariate distribution can be separated into its univariate marginals and a dependency structure called the copula. If we apply the Probability Integral Transformation Theorem to each component of a given random vector X=(X_1, \ldots, X_d), we get a multivariate distribution

(2)   \begin{align*}   U = (U_1, \ldots, U_d) = \left(F_{X_1}(X_1), \ldots, F_{X_d}(X_d) \right)\end{align*}

with uniform marginals U_i, i\in \{1, 2, \ldots, d\} reflecting the exposed dependency structure. This motivates the following definition of a copula.

Let (\Omega, \mathcal{A}, \mathbb{P}) be a probability space and let X=(X_1, \ldots, X_d) be a random vector with continuous marginal distribution functions F_{X_1}, \ldots, F_{X_d}, and let U be defined as in (2). For every d\geq 2, the d-dimensional copula of X is the joint d-dimensional distribution function C:[0,1]^d \rightarrow [0,1] of U, defined by

(3)   \begin{align*}C(u_1, \ldots, u_d)  &:= \mathbb{P}(U_1 \leq u_1, \ldots, U_d \leq u_d)\\                                        &= \mathbb{P} \left( \bigcap_{j=1}^{d}{ \{ U_j \leq u_j \} } \right).\end{align*}

Copulas embody the way in which multidimensional distribution functions are coupled to their one-dimensional margins. By applying the Probability Integral Transform to each component of a random vector as in (2), its multivariate dependency structure can be exposed. Note that the Quantile Function Theorem can, in turn, be used to simulate from copulas.
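To make (2) concrete, here is a minimal R sketch that applies the Probability Integral Transform componentwise to a bivariate normal random vector. It assumes the MASS package for mvrnorm, and the correlation value 0.7 is an arbitrary choice:

library(MASS)   # for mvrnorm

# Bivariate normal random vector X = (X_1, X_2) with correlation 0.7:
Sigma <- matrix(c(1, 0.7, 0.7, 1), nrow = 2)
X <- mvrnorm(n = 10^5, mu = c(0, 0), Sigma = Sigma)

# Componentwise Probability Integral Transform as in (2):
U <- cbind(pnorm(X[, 1]), pnorm(X[, 2]))

hist(U[, 1])   # each margin is (approximately) Uniform(0,1) ...
plot(U, pch = ".", xlab = "U_1", ylab = "U_2")   # ... while the dependence (the Gaussian copula) remains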

Literature

1.
Angus, J. E. The Probability Integral Transform and Related Results. SIAM Review 36, 652–654 (1994).