Probability Integral Transform & Quantile Function Theorem

Introduction

We present simple illustrations, explanations and proofs for two very important theorems,

  • the probability integral transformation, and,
  • the quantile function theorem.

Both theorems are important in statistics, computational math, machine learning and beyond.

We shed light on the background of these two well-known theorems by investigating the “why-it-works” question. To this end, we take a closer look at the difference between continuous and discrete distributions with respect to both theorems. For instance, we demonstrate why the Probability Integral Transformation Theorem in general does not work for discrete distributions.

In addition, we also outline under what circumstances a discrete distribution at least “converges towards the assertion” of the Probability Integral Transform Theorem. It will become clear what is meant by that in the course of the post.

Let us first state the actual theorems in the next section.
Afterwards, we are going to illustrate both theorems by example and finally we prove them using the ideas of Angus [1]. Note that we will also show why many ‘proofs’ provided in textbooks or internet resources are incomplete and rather misleading.

We will also apply the derived theory by generating normally distributed samples in Microsoft Excel. Finally, we provide an overview of the connection to the concept of copulas in the multi-dimensional case.

Theorems

Let X be a real-valued random variable defined on a probability space (\Omega, \mathcal{A}, \mathbb{P}), and let F: \mathbb{R} \rightarrow [0,1], defined by x \mapsto F(x):= \mathbb{P}(\{\omega: \ X(\omega)\leq x\}) for x\in \mathbb{R}, be its distribution function.

Theorem I (Probability Integral Transformation):
If X has distribution function F which is continuous, then the random variable Y:=F(X) has the distribution function of \text{Uniform}(0,1).

\square

The second theorem that we are going to prove is as follows.

Theorem II (Quantile Function):
Let F be a distribution function. If F^{-1}: \ ]0,1[ \rightarrow \mathbb{R} is defined by F^{-1}(y)=\inf\{ x| F(x)\geq y \}, 0<y<1 and U has the distribution of \text{Uniform}(0,1), then X=F^{-1}(U) has distribution function F.

\square

Note that the distribution function F in Theorem I needs to be continuous while the distribution function in Theorem II can also be discrete.

Why are both theorems considered to be so important?

Whenever simulations are required, the Quantile Function Theorem is of importance since it is the basis for generating samples from probability distributions. The Probability Integral Transformation Theorem is the basis for many statistical tests (an important field of frequentist statistics) as well as for the definition and interpretation of a copula. Moreover, both theorems are interrelated, as we will see in the proofs.

Both theorems are fundamental results that are applied in many important areas and it is definitely worth understanding them. Let us now try to understand what the theorems actually state by considering examples.

Illustrations

Probability Integral Transformation

Let us draw a large sample of a standard normally distributed random variable X \sim \text{Normal}(0,1). We can use a computer and R, for instance, to perform that kind of task.

X <- rnorm(n=10^6, mean=0, sd=1)
hist(X)

A simulated sample of a standard normally distributed random variable X is shown in Fig. 1.

Histogram of Standard Normal Distribution
Fig. 1: Histogram of Standard Normal Distribution

If we now apply the corresponding distribution function F:\mathbb{R}\rightarrow [0,1] of the standard normal distribution to the output of the random variable X, we get Y=F(X), which is governed by the uniform distribution on [0,1], as we can see in the histogram in Fig. 2.

Again, we use a computer and R to perform the second step.

Y <- pnorm(X, mean=0, sd=1)
hist(Y)

Note that the range of any distribution function F is contained in [0,1], which is exactly the support of a random variable governed by \text{Uniform}(0,1).

Histogram of Uniform Distribution on [0,1]
Fig. 2: Histogram of Uniform Distribution on [0,1] generated by X\sim \text{Normal}(0,1) and its distribution function F, i.e. Y=F(X)

The demonstrated effect is not a coincidence but a consequence of the Probability Integral Transform Theorem. That is, we could let X be governed by any continuous distribution and the result would always be the same. For continuous random variables, the distribution function is a monotonically non-decreasing continuous function. In general, the distribution function of a continuous random variable does not need to be strictly increasing. For instance, the continuous uniform distribution function and the trapezoidal distribution function are not strictly increasing.

Let us do the simulation again with the exponential distribution and with the Cauchy distribution. Both have continuous distribution functions.

X <- rexp(n=10^6, rate = 1)
hist(X)
Y <- pexp(X, rate=1)
hist(Y)
X <- rcauchy(n=10^6, location=0, scale=1)
hist(X)
Y <- pcauchy(X, location=0, scale=1)
hist(Y)

You will notice that similar outputs as in Fig. 2 are generated, which indicates that Y=F(X)\sim \text{Uniform}(0,1).
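We can also back this up with a quick numerical check rather than just eyeballing histograms. The following sketch (using a somewhat smaller sample) compares Y against \text{Uniform}(0,1) with a Kolmogorov-Smirnov test via ks.test; a large p-value indicates no evidence against uniformity.

# Numerical check of uniformity via the Kolmogorov-Smirnov test
X <- rexp(n=10^5, rate=1)      # any continuous distribution works here
Y <- pexp(X, rate=1)           # probability integral transform
ks.test(Y, "punif")            # large p-value: no evidence against Uniform(0,1)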

If you want to test even more distributions, you can pick one from the list of continuous distributions, provided that you can find corresponding functions or packages in R.

If we do the same exercise with a discrete (i.e. not continuous) random variable such as the Binomial distribution, X\sim \text{Binomial}(n, p), where n is the number of independent Bernoulli trials and p is the success probability of each trial, the resulting histogram is quite different. Note that we denote the discrete distribution function of X by G_X.

Example 1 (Binomial Variable)

Let us consider the case with X \sim \text{Binomial}(n=2, p=0.5). A binomial distribution \text{Binomial}(n, p) consists of n independent trials with a success rate p for each trial. Hence, in our specific example we have the situation as outlined in Tab. 1.

1. Bernoulli Trial | 2. Bernoulli Trial
Success            | Success
Failure            | Failure
Success            | Failure
Failure            | Success
Tab. 1: Possible outcomes of two independent Bernoulli trials with p=0.5

That is, the basic set \Omega consists of a sequence of n Bernoulli outcomes represented by corresponding n-tuples. Thus, \Omega=\{ (failure,failure), (success,failure), (failure,success), (success,success)\} is the basic set of this example.

Let us simulate the Binomial variable X as defined above by using a computer and R.

X <- rbinom(n=10^6, size=2, prob=0.5)
hist(X)
Y <- pbinom(X, size=2, prob=0.5)
hist(Y)

The corresponding histogram of Y=G_X(X) can be seen in Fig. 3.

Fig. 3: Histogram of Y=G_X(X) with X\sim \text{Binomial}(n, p). The parameters are set to n=2 independent Bernoulli experiments with success rate p=0.5

Without continuity, the controlled correspondence between an element x\in \mathbb{R} of the domain and its image F(x)=:u\in [0,1] gets lost: the distribution function skips entire sub-intervals of [0,1]. The corresponding discrete distribution function G_X is depicted in Fig. 4.

Fig. 4: Plot of discrete distribution function G_X of X

Failure in both trials occurs in one out of four possible cases, i.e. with probability 0.25. There are two ways to get one success and one failure, i.e. probability 0.5. Two successes in a row occur with probability 0.25. Putting these pieces together, one ends up with the discrete distribution function of X shown in Fig. 4.

Let us consider how this is reflected in the R simulation. To this end, we use the table command in R:

X <- rbinom(n=10^6, size=2, prob = 0.5)
table(X)
Y <- pbinom(X, size=2, prob = 0.5)
table(Y)

We can see that the R variable X contains about 25% of 0s, about 50% of 1s and about 25% of 2s. These values are then entered into the distribution function G_X, which translates the input into the cumulative probabilities p\in \{0.25, 0.75, 1\}. Hence, we get about 25% of p=0.25, about 50% of p=0.75 and about 25% of p=1. This matches exactly the histogram in Fig. 3.
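For reference, the relative frequencies described above can be read off directly with prop.table (a small sketch of the same check):

# Relative frequencies instead of raw counts
prop.table(table(X))   # roughly 0.25, 0.50, 0.25 for the values 0, 1, 2
prop.table(table(Y))   # roughly 0.25, 0.50, 0.25 for the values 0.25, 0.75, 1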

\square

Ultimately, the Probability Integral Transformation cannot work for Example 1 (Binomial Variable) since the jumps of the discrete distribution function are not of equal size (i.e. ‘uneven jumps’). In this specific case, we have two jumps of size 0.25 and one jump of size 0.5.

For continuous distribution functions these increases behave in a controlled manner: the distribution function balances the likelihood of events with the corresponding pace of the accumulation. For instance, the simulated standard normal sample shown in Fig. 1 contains only a few samples in the tail left of -3. This, however, is compensated by the almost flat distribution function curve of the standard normal distribution around -3. The effect is that a wider range of the domain of X is needed to fill up the corresponding probability bucket of the uniform distribution. Note that in Fig. 2, for instance, each probability bucket comprises \frac{1}{20} since 20 buckets have been used to partition the entire probability mass of 1.
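This bucket argument can be checked directly in R. The following sketch cuts Y=F(X) for the standard normal example into 20 equally wide buckets and confirms that each bucket receives roughly \frac{1}{20}=0.05 of the samples:

X <- rnorm(n=10^6, mean=0, sd=1)
Y <- pnorm(X, mean=0, sd=1)
# each of the 20 buckets should contain roughly 1/20 = 0.05 of the samples
round(prop.table(table(cut(Y, breaks=seq(0, 1, by=0.05)))), 3)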

Example 2 (Binomial Variable)

The binomial distribution converges towards a normal distribution if we increase the number of trials (in R this parameter is called size). This implies that the ‘uneven jumps’ of a discrete Binomial distribution become less and less meaningful. Indeed, if we repeat the simulation as follows, the result looks quite promising:

size <- 10^5
X <- rbinom(n=10^6, size=size, prob = 0.5)
hist(X)
Y <- pbinom(X, size=size, prob = 0.5)
hist(Y)

If we vary the size from 10^3 to 10^5, we get the histograms shown in Fig. 5 to Fig. 7. Apparently, the approximation of the binomially distributed X to the normal distribution ultimately implies the approximation of Y towards the standard uniform distribution:

Fig. 5: Histogram of Y = G_X(X) with size set to 10^3
Fig. 6: Histogram of Y = G_X(X) with size set to 10^4
Fig. 7: Histogram of Y = G_X(X) with size set to 10^5
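The three histograms can be reproduced in one go with a small loop (a sketch; the figures above were generated with the analogous individual calls):

# Reproduce Fig. 5 to Fig. 7 by increasing the number of trials
for (size in c(10^3, 10^4, 10^5)) {
  X <- rbinom(n=10^6, size=size, prob=0.5)
  Y <- pbinom(X, size=size, prob=0.5)
  hist(Y, main=paste("size =", size))
}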

\square

That is, if a discrete distribution either gets close enough to a continuous distribution or has jumps of equal size (i.e. it is a discrete uniform distribution), then the corresponding histogram of Y:=G_X(X) will show a pattern similar to the one described in Theorem I.
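The case of even jumps is easy to check as well. For a discrete uniform variable on \{1, \ldots, 10\}, the distribution function jumps by exactly \frac{1}{10} at each support point, so the histogram of Y=G_X(X) again looks flat (a minimal sketch):

# Discrete uniform on {1, ..., 10}: G_X(k) = k/10, so all jumps have size 1/10
X <- sample(1:10, size=10^6, replace=TRUE)
Y <- X/10                # equals G_X(X) for this distribution
hist(Y, breaks=10)       # roughly flat: each value 0.1, ..., 1 occurs with probability 1/10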

In addition, note that on a computer all continuous distributions are discretized anyway due to the finite precision of floating-point arithmetic.

Quantile Function Theorem

A relative of the Probability Integral Transform is the Quantile Function Theorem. However, this time we do not obtain a uniform sample as a result but start with one. Hence, let us now draw a large uniformly distributed sample U \sim \text{Uniform}(0,1). We use R to perform the simulation.

U <- runif(n=10^6, min=0, max=1)
hist(U)

Starting off with a uniformly distributed sample U, we need to “invert” the steps performed to illustrate the Probability Integral Transform Theorem. That is, we now take the generalized inverse distribution function of the intended distribution and feed U \sim \text{Uniform}(0,1) into it.

If we would like to obtain the standard normal distribution, for instance, U needs to be fed into the generalized inverse F^{-1} of the standard normal distribution function, shown in Fig. 8.

Be aware that a generalized inverse does in general not equal an inverse function, since a distribution function may have flat parts (and, in the discrete case, jumps) in its graph. The distribution function of the standard normal distribution is continuous and strictly increasing, and its (generalized) inverse is depicted in Fig. 8.

Fig. 8: Inverse F^{-1} of distribution function F of Standard Normal distribution

Take a while and think about what the function F^{-1} actually does: it takes a probability in (0,1) and assigns a real value to it. For instance, the probability 0.5 is assigned to the real value 0. Probabilities close to 0 are mapped far into the left tail of the distribution, while probabilities close to 1 are mapped far into the right tail. That is, the increase in probability is quite small in the tails, which is in line with the nature of a normal distribution.
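A few evaluations of qnorm, R’s implementation of F^{-1} for the normal distribution, make this concrete:

qnorm(0.5)                            # 0: the median of the standard normal
qnorm(c(0.001, 0.025, 0.975, 0.999))  # approx. -3.09, -1.96, 1.96, 3.09
# probabilities near 0 are mapped far into the left tail,
# probabilities near 1 far into the right tail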

In R we need to take the following steps:

X <- qnorm(U, mean=0, sd=1)
hist(X)

The histogram of X=F^{-1}(U), as shown in Fig. 9, is governed by the standard normal distribution. Just as desired.

Fig. 9: Histogram of a sample X = F^{-1}(U) \sim \text{Normal}(0,1)

Again, please note that the inverse distribution function F^{-1} does not necessarily mean the inverse function. Distribution functions are in general not strictly increasing but only non-decreasing. For further details we highly recommend P. Embrechts’ and M. Hofert’s paper ‘A note on generalized inverses’. The difference between inverse and generalized inverse functions will also play a key role in the proof of the Probability Integral Transformation Theorem.
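To make the notion of a generalized inverse tangible, here is a minimal numerical sketch. The helper ginv below is hypothetical (not part of any package) and simply evaluates F^{-1}(u)=\inf\{x: F(x)\geq u\} on a grid; the grid limits and step size are assumptions of this illustration:

# Hypothetical helper: generalized inverse of a distribution function on a grid
ginv <- function(cdf, u, lower=-10, upper=10, step=1e-4) {
  x <- seq(lower, upper, by=step)
  # smallest grid point with cdf(x) >= u approximates inf{x : F(x) >= u}
  sapply(u, function(p) min(x[cdf(x) >= p]))
}
ginv(pnorm, c(0.025, 0.5, 0.975))                                # close to qnorm(c(0.025, 0.5, 0.975))
ginv(function(x) pbinom(x, size=2, prob=0.5), c(0.2, 0.5, 0.9))  # approximately 0, 1, 2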

In addition, Theorem II (Quantile Function) does NOT require the distribution function to be continuous. At first sight, this might seem strange since the Quantile Function Theorem is just a kind of inversion of the Probability Integral Transform.

Example 3 (Binomial Variable)

Let us continue the Binomial example (see above) with respect to the Quantile Function Theorem. We draw a large uniformly distributed sample U \sim \text{Uniform}(0,1) and put it into the generalized inverse distribution function G_X^{-1}: \ ]0,1[ \rightarrow \mathbb{R} of \text{Binomial}(n, p) with n=2 and p=0.5. The graph of the corresponding quantile function is illustrated in Fig. 10.

Fig. 10: Plot of quantile function of G_X with X \sim \text{Binomial}(2,0.5)

Again, let us take the time to think about the meaning of this generalized inverse (i.e. quantile function) of the Binomial (cumulative) distribution function. It assigns an integer to each probability in ]0,1]. In our example, it assigns the integer 0 to the probabilities in ]0,0.25], the integer 1 to the probabilities in ]0.25,0.75] and the integer 2 to the probabilities in ]0.75,1].
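A few calls of qbinom, R’s implementation of this quantile function, make the mapping explicit (a small spot check; the behaviour at the jump points follows the convention G_X^{-1}(u)=\inf\{k: G_X(k)\geq u\}):

qbinom(c(0.2, 0.5, 0.9), size=2, prob=0.5)   # 0 1 2
qbinom(0.25, size=2, prob=0.5)               # 0: the smallest k with G_X(k) >= 0.25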

Let us now run the full simulation in R:

U <- runif(n=10^6, min=0, max=1)
hist(U)
X <- qbinom(U, size=2, prob = 0.5)
hist(X)

The result –depicted in Fig. 11– is as expected.

Fig. 11: Histogram of X=G_X^{-1}(U) that is distributed according to \text{Binomial}(2,0.5)

Apparently, we have obtained the desired Binomial distribution by first generating a standard uniform sample and then applying the quantile function to it.

\square

Proofs

This section is based on the paper [1] by John E. Angus. The problem with the usual attempts to prove the Probability Integral Transformation Theorem is that it is often not made clear what is meant by the inverse of a distribution function. Please also refer to this post on StackExchange. For instance, the proof provided on Wikipedia is as follows:

<< Given any random continuous variable X, define Y=F_X(X). Then:

    \begin{align*}F_Y(y) &=P(Y \leq y) \\&= P(F_X(X) \leq y) \\&= P(X\leq F_X^{-1}(y)) \\&= F_X(F_X^{-1}(y)) \\&= y\end{align*}

F_Y is just the CDF of a Uniform(0,1) random variable. Thus, Y has a uniform distribution on the interval [0,1]. >>

Wikipedia retrieved in Nov. 2020.

The cited “proof” above seems quite straightforward. However, it is not clarified what is meant by applying the “inverse” F_X^{-1}(y), which need not exist as an ordinary inverse function. In the next subsection, Lemma 1 will make this step precise.

Some textbooks even require a ‘strictly increasing distribution function’ to apply the Probability Integral Transform Theorem, even though continuity is sufficient. Again, please refer to this post on StackExchange.

Probability Integral Transformation

The following lemma is the key to the proof of Theorem I. The principal idea is that the basic set can be decomposed into \Omega = \{\omega: X(\omega) \leq x\} \sqcup \{\omega: X(\omega) > x\}, where \sqcup stands for the disjoint union.

In addition, recall that a distribution function is monotone (non-decreasing): x \leq y implies F(x) \leq F(y).

Lemma 1:
Let X have a distribution function F. Then for all real x\in \mathbb{R} the following holds:

\ \  \mathbb{P} \left(\{ \omega: \ F(X(\omega)) \leq F(x)\} \right) \ = \ F(x).

Proof of Lemma 1:
Decompose the event \{\omega : F(X(\omega)) \leq F(x)\} as follows:

    \begin{align*}&\{\omega: F(X(\omega)) \leq F(x) \}  \\&= \left[ \{\omega: F(X(\omega)) \leq F(x) \} \cap \{\omega: X(\omega) \leq x \} \right]  \sqcup \\&\left[ \{\omega: F(X(\omega)) \leq F(x) \} \cap \{ X(\omega) > x \} \right]\end{align*}

Note that \{\omega: X(\omega) \leq x\} \subseteq \{\omega: F(X(\omega)) \leq F(x)\} since F is monotone; hence the first intersection equals \{\omega: X(\omega) \leq x\}. For the second intersection, observe that on \{\omega: X(\omega) > x\} monotonicity yields F(X(\omega)) \geq F(x), so that F(X(\omega)) \leq F(x) forces F(X(\omega)) = F(x) there. Considering these facts, we can derive

(1)   \begin{align*} & \{\omega: F(X(\omega)) \leq F(x) \}  \\&= \{\omega: X(\omega) \leq x \} \sqcup \\& \ \left[  \{ \omega: X(\omega) > x \}  \cap \{ \omega: F(X(\omega)) = F(x) \} \right].\end{align*}

Taking probabilities in (1), the assertion follows since the event \{ \omega: X(\omega) > x \} \cap \{ \omega: F(X(\omega)) = F(x) \} has zero probability. Indeed, this event equals \{\omega: X(\omega) \in A\} with A:=\{t > x: \ F(t)=F(x)\}, an interval on which F is constant; by continuity from below of \mathbb{P}, its probability equals \sup_{t\in A}\left(F(t)-F(x)\right)=0.

\square

Example 4 (Binomial Variable)

Let us reconsider the discrete variable X \sim \text{Binomial}(n=2, p=0.5) treated in Examples 1 to 3. Recall that we denote the distribution function of the discrete random variable X by G_X. Let us further denote the basic set, which represents the outcomes of the n Bernoulli trials, by \Omega. X is defined by \omega \mapsto X(\omega) := \text{number of successes in } \omega. We therefore have

    \begin{align*}  & \{ \omega: G_X(X(\omega)) \leq G_X(k)  \}  \\  &= \{ \omega: X(\omega) \leq k \}  \sqcup \\  &  \  \left[  \{  \omega: X(\omega) > k \}  \cap  \{ \omega: G_X(X(\omega)) = G_X(k) \} \right] \\&= \ \{ \omega: X(\omega) \leq k \} \end{align*}

with k \in \{0, 1,  \ldots, n\}, where the second set of the disjoint union is empty since G_X is strictly increasing on the support \{0, 1, \ldots, n\}. In order to illustrate this, let us consider the set \{ \omega: G_X(X(\omega)) = G_X(k) \} for k=0. The resulting set \{(failure, failure)\} has empty intersection with \{ \omega: X(\omega) > k \} = \{(failure,success),(success,failure),(success,success)\}.
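Lemma 1 can also be checked empirically for this binomial example. The following sketch compares the relative frequency of the event \{G_X(X) \leq G_X(k)\} with G_X(k) for k=0,1,2:

# Empirical check of Lemma 1 for X ~ Binomial(2, 0.5)
X <- rbinom(n=10^6, size=2, prob=0.5)
for (k in 0:2) {
  lhs <- mean(pbinom(X, size=2, prob=0.5) <= pbinom(k, size=2, prob=0.5))
  rhs <- pbinom(k, size=2, prob=0.5)
  cat("k =", k, ": empirical", round(lhs, 4), "vs. G_X(k) =", rhs, "\n")
}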

\square

Let us ultimately prove the Probability Integral Transformation Theorem.

Proof Theorem I (Probability Integral Transformation):
Let u\in (0,1). Since F is continuous, there exists a real x such that F(x)=u (intermediate value theorem). Then by Lemma 1, \mathbb{P} ( \{ \omega: Y(\omega) \leq u \} ) = \mathbb{P}( \{ \omega: F( X(\omega) )  \leq  F(x) \} ) = F(x) = u, which implies that Y=F(X) is governed by \text{Uniform}(0,1).

\square

Quantile Function Theorem

Proof Theorem II (Quantile Function Theorem):
Recall that F^{-1}(u)= \inf \{x: F(x) \geq u \}. We first show that for any real x and any u\in (0,1), F(x)\geq u if, and only if, x\geq F^{-1}(u).

Suppose that x \geq F^{-1}(u). Since F is non-decreasing and right-continuous, the set \{t: F(t) \geq u\} is an interval of the form [F^{-1}(u), \infty), i.e. it contains its left-hand endpoint F^{-1}(u). Hence x belongs to this set, which means F(x) \geq u.

Conversely, suppose that u \leq F(x). Then x belongs to the set \{t: F(t) \geq u\} and therefore x \geq \inf \{t: F(t) \geq u \} = F^{-1}(u). Using this equivalence with U \sim \text{Uniform}(0,1), it follows that \mathbb{P}( \{ F^{-1}(U) \leq x \} ) = \mathbb{P}( \{ U \leq F(x) \} ) = F(x), completing the proof.

\square

Generating Normally Distributed Samples in Excel

Given the explanations above, it is quite obvious how to generate normally distributed samples in Microsoft Excel. What we need are the following ingredients in Excel:

  • Uniformly distributed samples provided in Excel by RAND();
  • Generalized inverse of the standard normal distribution provided in Excel by NORM.S.INV(probability);

According to the Quantile Function Theorem, the outcome will be a standard normally distributed sample.

In the first step, we generate a uniformly distributed sample by entering the formula RAND() in as many cells as we would like to have samples. Then, by applying the formula NORM.S.INV(probability) to the corresponding RAND() cells, each uniformly distributed value is transformed into a standard normally distributed value as illustrated in Fig. 13. Note that NORM.S.INV(probability) is the (generalized) inverse of the standard normal distribution function.

Excel Example showing how to generate a normally distributed sample
Fig. 13: Excel Example showing how to generate a standard normally distributed sample using NORM.S.INV(probability)

We could also use the formula NORM.INV(probability, mean, standard deviation) in order to generate samples from any other normal distribution. Refer to Fig. 14 for an additional illustration.

Fig. 14: Excel Example showing how to generate a (standard) normally distributed sample using NORM.INV(probability, mean, standard deviation)

According to Microsoft, this list of statistical functions is available in Excel. Each of the inverse distribution functions in that list can be used to generate a corresponding sample.

Given that continuity is not needed to apply the Quantile Function Theorem, we can also generate discretely distributed samples by applying the same simple algorithm as outlined in the continuous case.
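For comparison, the same two-step recipe (uniform sample first, then quantile function) reads as follows in R; the Normal(5, 2) and Poisson(3) targets are merely illustrative choices:

U <- runif(n=10^6, min=0, max=1)
X_normal  <- qnorm(U, mean=5, sd=2)   # analogue of NORM.INV(RAND(), 5, 2)
X_poisson <- qpois(U, lambda=3)       # a discrete target works just as well
hist(X_normal)
hist(X_poisson)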

Multivariate Case & Copulas

This section serves as an outlook on the application of both treated theorems to copulas. It is therefore only a brief treatment of the topic. Nonetheless, it will become clear how the Probability Integral Transform is used to define copulas.

First, we need some basic notation for the multivariate case: Let d\in \mathbb{N} denote the dimension of the considered Euclidean vector space \mathbb{R}^d. We define a multivariate random variable, or random vector, X:\Omega \rightarrow \mathbb{R}^d as a measurable function. A d-interval is the Cartesian product of d real intervals and a d-box is a closed d-interval. The unit d-cube [0,1]^d is the d-box [(0, \ldots, 0), (1, \ldots, 1)].

A continuous multivariate distribution can be separated into its univariate marginals and a dependency structure called a copula. If we apply the Probability Integral Transformation Theorem to each component of a given random vector X=(X_1, \ldots, X_d), we get a multivariate distribution

(2)   \begin{align*}   U = (U_1, \ldots, U_d) = \left(F_{X_1}(X_1), \ldots, F_{X_d}(X_d) \right)\end{align*}

with uniform marginals U_i, i\in \{1, 2, \ldots, d\} reflecting the exposed dependency structure. This motivates the following definition of a copula.

Let (\Omega, \mathcal{A}, \mathbb{P}) be a probability space and X=(X_1, \ldots, X_d) a random vector with continuous marginal distribution functions F_{X_1}, \ldots, F_{X_d}. For every d\geq 2, a d-dimensional copula of X is the joint distribution function C:[0,1]^d \rightarrow [0,1] of U as in (2), defined by

(3)   \begin{align*}C(u_1, \ldots, u_d)  &:= \mathbb{P}(U_1 \leq u_1, \ldots, U_d \leq u_d)\\                                        &= \mathbb{P} \left( \bigcap_{j=1}^{d}{ \{ U_j \leq u_j \} } \right).\end{align*}

Copulas embody the way in which multidimensional distribution functions are coupled to their 1-dimensional margins. By applying the Probability Integral Transform to each component of a random vector as denoted in (2), its multivariate dependency structure can be exposed. Please also note that the Quantile Function Theorem can, in turn, be used to simulate from copulas.
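As a closing sketch, the componentwise Probability Integral Transform in (2) can be carried out in R for a bivariate normal vector; the resulting pairs have uniform marginals while retaining the dependence structure (a Gaussian copula). The use of MASS::mvrnorm and the correlation of 0.7 are assumptions of this illustration:

library(MASS)  # for mvrnorm
# correlated bivariate normal sample
Sigma <- matrix(c(1, 0.7, 0.7, 1), nrow=2)
X <- mvrnorm(n=10^4, mu=c(0, 0), Sigma=Sigma)
# componentwise Probability Integral Transform, cf. (2)
U <- cbind(pnorm(X[, 1]), pnorm(X[, 2]))
hist(U[, 1])        # uniform marginal
plot(U, pch=".")    # dependence structure of the Gaussian copula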

Literature

[1] Angus, J. E.: The Probability Integral Transform and Related Results. SIAM Review 36, 652–654 (1994).