Conditional Probability & Bayes Rule

This article covers conditional probabilities and Bayes' Rule (also known as Bayes' Theorem). A second part will delve into conditional expectations.


Conditional probabilities are a fundamental concept in probability theory: they quantify the likelihood of an event given certain conditions or information. They serve as a crucial tool for understanding and modeling events in fields such as finance, economics, engineering, statistics, machine learning, and decision analysis.

Conditional probability answers the question of ‘how does the probability of an event change if we have extra information’. It is therefore the foundation of Bayesian statistics.

Throughout this post, let P: \mathcal{A} \rightarrow \mathbb{R} be a probability measure on a measurable space (\Omega,  \mathcal{A}), and let A, B\in \mathcal{A}.

Conditional Probability


Let us directly start with the formal definition of a conditional probability. Illustrations and explanations follow immediately afterwards.

Definition (Conditional Probability)
Let (\Omega, \mathcal{A}, P) be a probability space and P(B)>0. The real value

(1)   \begin{align*} P(A|B) := \frac{P(A \cap B)}{P(B)} \end{align*}

is the probability of A given that B has occurred or can be assumed.
P(A \cap B) is the probability that both events A and B occur, and B is the new basic set since P(\Omega \setminus B | B)=0.


A conditional probability, denoted by P(A|B), is the probability of an event A occurring, given that another event B has already occurred or can be assumed. That is, P(A|B) reflects the probability that both events A and B occur relative to the new basic set B as illustrated in Fig. 1.


Figure 1: Venn diagram of a possible constellation of the sets A and B

The objective of P(A|B) is two-fold:

  1. Determine the probability of A \in \mathcal{A} while
  2. Considering that B\in \mathcal{A} has already occurred or can be assumed.

The latter item actually means P(B|B)=1 since we know (by assumption, presumption, assertion or evidence) that B has occurred or can be assumed. Graphically, the conditional probability P(A|B) is simply the ratio of the intersection A \cap B to the new basic set B.

Thereby we can clearly see that formula (1) is a generalization of the way probabilities P(A)=\frac{P(A \cap \Omega)}{P(\Omega)}=\frac{P(A)}{1} are calculated in general on \Omega.

B cannot be a null set since P(B)>0. Moreover, P(\Omega \setminus B | B)=\frac{P((\Omega \setminus B) \cap B)}{P(B)}=\frac{P(\emptyset)}{P(B)}=0, i.e. outcomes outside of B carry no weight once B is known. The knowledge about B might be interpreted as an additional piece of information that we have received over time.

The following examples are going to illustrate this very basic concept.

Example 1.1 (Dice)
A fair die is thrown once but the outcome is not known. Let’s denote the event of rolling a 1 or a 6 as A. Furthermore, assume it is known that the resulting number is even; we denote this event by B.

How does the probability of getting \{1, 6\} change given the additional information?

Without the information B, the probability of rolling an element of \{1, 6\} is \frac{2}{6}=\frac{1}{3} since the die is (assumed to be) fair. Taking the additional information into account, formula (1) yields

    \begin{align*} P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{ \frac{1}{6} }{ \frac{1}{2} } = \frac{1}{3}. \end{align*}

The set of the event A \cap B, where both conditions need to be fulfilled, is

    \begin{align*} A \cap B = \{1,6\} \cap \{2,4,6\} = \{6\}. \end{align*}

The corresponding probability on the basic set \Omega = \{1,2, \ldots, 6\} is \frac{1}{6}. The probability of B = \{2,4,6\} on \Omega equals \frac{1}{2}, such that we get P(A|B) = \frac{ \frac{1}{6} }{ \frac{1}{2} } = \frac{1}{3}. In this particular example, the additional information does not change the probability, i.e. P(A|B)=P(A).


The corresponding situation is illustrated in the table above, where the set B is highlighted in blue and A in red. Hence, the intersection A \cap B is highlighted in purple.


A heuristic that is sometimes applied to calculate P(A|B) is as follows:

Take the number of favorable outcomes and divide it by the total number of possible outcomes.

This heuristic is derived from the interpretation where \Omega is countable. If you consider the relative frequency (empirical probability) h_n:\mathcal{A} = \mathcal{P}(\mathbb{N}) \rightarrow [0,1] of the events A\cap B and B, you arrive at the aforementioned heuristic, where \mathcal{P}(\cdot) denotes the power set of a given set. We get

    \begin{align*} \frac{h_n(A\cap B)}{h_n(B)}. \end{align*}

Applied to Example 1.1, the total number of possible outcomes can be narrowed down to the set \{2, 4, 6\}. Out of these, only one is favorable and corresponds to rolling a 6. Notice that rolling a 1 is not favorable since it is not part of the set \{2, 4, 6\}. Hence, the result is \frac{h_n(\{6\}) }{h_n(\{2, 4, 6\})} =\frac{1}{3}.
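The counting heuristic can be checked against formula (1) with a few lines of Python. The following is only a minimal sketch under the assumption of a finite sample space with equally likely outcomes; the function name conditional_probability is chosen for illustration and is not part of the original text.

```python
from fractions import Fraction

def conditional_probability(a, b, omega):
    """P(A|B) via formula (1), assuming all outcomes in omega are equally likely."""
    a, b, omega = set(a), set(b), set(omega)
    if not (b & omega):
        raise ValueError("P(B) must be positive")
    p_b = Fraction(len(b & omega), len(omega))
    p_a_and_b = Fraction(len(a & b & omega), len(omega))
    return p_a_and_b / p_b

omega = {1, 2, 3, 4, 5, 6}   # fair die
A = {1, 6}                   # rolling a 1 or a 6
B = {2, 4, 6}                # rolling an even number

p_formula = conditional_probability(A, B, omega)
# Heuristic: favorable outcomes divided by the outcomes that are still possible.
p_counting = Fraction(len(A & B), len(B))

print(p_formula, p_counting)  # 1/3 1/3
```

Exact fractions are used instead of floating point numbers so that both results can be compared without rounding issues.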

Probability Measure

Let us consider the probability measure derived from the conditional probability in more detail.

Theorem 1.1:
Let (\Omega, \mathcal{A}, P) be a probability space, A, B\in \mathcal{A} and P(B)>0. The map

A \mapsto P(A|B) =  \frac{P(A \cap B)}{P(B)}

defines a probability measure on \mathcal{A}.

Proof: Clearly, P(A|B)\geq 0 since P(A\cap B)\geq 0 and P(B)>0 for all A\in \mathcal{A}. Further, P(\Omega|B)= \frac{P(\Omega\cap B)}{P(B)}=\frac{P(B)}{P(B)}=1. For pairwise disjoint sets A_1, A_2, \ldots \in \mathcal{A}, the \sigma-additivity follows by

    \begin{align*} P \left(\sum_{i=1}^{\infty}{A_i} | B \right)  &=  \frac{ P(\sum_{i=1}^{\infty}{A_i \cap B}) }{ P(B) } \\  &=  \frac{  \sum_{i=1}^{\infty} { P(A_i \cap B) } }{ P(B) } \\  &=   \sum_{i=1}^{\infty}{ P(A_i | B)}.  \end{align*}
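For a finite sample space, the three properties used in the proof can be verified by brute force. The sketch below does this for the die of Example 1.1 with B = \{2, 4, 6\}; the helper names (prob, cond_prob, power_set) are illustrative assumptions, and the check is of course no substitute for the general proof.

```python
from fractions import Fraction
from itertools import chain, combinations

omega = frozenset(range(1, 7))   # fair die
B = frozenset({2, 4, 6})         # conditioning event with P(B) > 0

def prob(event):
    """Uniform probability measure on omega."""
    return Fraction(len(event), len(omega))

def cond_prob(event):
    """The map A -> P(A|B) from Theorem 1.1."""
    return prob(event & B) / prob(B)

def power_set(s):
    """All subsets of s as frozensets."""
    items = list(s)
    subsets = chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )
    return [frozenset(c) for c in subsets]

events = power_set(omega)

non_negative = all(cond_prob(e) >= 0 for e in events)
normalised = cond_prob(omega) == 1
# On a finite space, sigma-additivity reduces to additivity for disjoint pairs.
additive = all(
    cond_prob(e | f) == cond_prob(e) + cond_prob(f)
    for e in events
    for f in events
    if not (e & f)
)

print(non_negative, normalised, additive)  # True True True
```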


Multiplication Rule

The following formula is called the multiplication rule and is simply a rewriting of formula (1) of the conditional probability.

(2)   \begin{align*} P(A \cap B) = P(A|B) \cdot P(B) \end{align*}

Note that this formula also holds if the events A and B are not independent. The multiplication rule is thus a generalization of the product rule P(A \cap B) = P(A) \cdot P(B), which requires two independent events.

The following example will illustrate this relationship.

Example 1.2 (Drawing balls from an urn without replacement)
An urn contains 5 red and 5 black balls. Two balls will be drawn successively without putting the balls back to the urn. We are interested in the event

A:= \{ “red ball in the second draw” \}

The probability of A depends obviously on the result of the first draw. To find the probability of event A (a red ball being drawn in the second draw), given that a red or a black ball was drawn in the first draw, we can use the concept of conditional probability. We distinguish the following two cases:

  • First draw results in a black ball, which is reflected in the event
    B_b := \{ “First draw results in a black ball” \}
  • First draw results in a red ball, which is reflected in the event
    B_r := \{ “First draw results in a red ball” \}

Note that P(B_b)=P(B_r)=\frac{5}{10}=\frac{1}{2} and that the events of the first and second draw are dependent since the balls are not put back to the urn. Hence, we cannot use independence to calculate the numerator P(A \cap B) of formula (1). However, it is actually quite straightforward to figure out the conditional probabilities P(A | B_r) as well as P(A | B_b) by simply using the relative frequencies heuristic.

If the first draw resulted in a black ball, P(A|B_b) must be \frac{5}{9} since 4 black and 5 red balls are left. If the first draw resulted in a red ball instead, P(A|B_r)=\frac{4}{9} since 5 black and 4 red balls are left.

Knowing the conditional probabilities, we can also derive the probabilities P(A\cap B_b) and P(A\cap B_r) via the multiplication rule even though the events are not independent.

    \begin{align*} P(A \cap B_b) &=  P(A | B_b) \cdot P(B_b) = \frac{5}{9} \cdot \frac{1}{2} = \frac{5}{18}. \\ P(A \cap B_r) &=  P(A | B_r) \cdot P(B_r) = \frac{4}{9} \cdot \frac{1}{2} = \frac{4}{18} = \frac{2}{9}. \end{align*}
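The numbers of Example 1.2 can be reproduced by enumerating all ordered draws of two distinct balls. The following sketch assumes that the balls are labelled 0 to 9 purely for bookkeeping (0 to 4 red, 5 to 9 black); all function names are illustrative. The conditional probabilities are obtained by counting within the reduced sample space, and the joint probabilities are counted separately so that the multiplication rule (2) is actually tested.

```python
from fractions import Fraction
from itertools import permutations

# Balls 0-4 are red, balls 5-9 are black (labels only serve as bookkeeping).
colors = ["red"] * 5 + ["black"] * 5

# All ordered pairs of distinct balls: 10 * 9 = 90 equally likely draws.
draws = list(permutations(range(10), 2))

def prob(predicate):
    """Probability of the set of draws satisfying the predicate."""
    return Fraction(sum(1 for d in draws if predicate(d)), len(draws))

def conditional(predicate, given):
    """Counting heuristic: restrict the sample space to draws satisfying `given`."""
    reduced = [d for d in draws if given(d)]
    return Fraction(sum(1 for d in reduced if predicate(d)), len(reduced))

def second_is_red(draw):
    return colors[draw[1]] == "red"

def first_is(color):
    return lambda draw: colors[draw[0]] == color

results = {}
for first_color in ("black", "red"):
    given = first_is(first_color)
    p_b = prob(given)                                    # P(B_b) or P(B_r)
    p_a_given_b = conditional(second_is_red, given)      # P(A | B)
    p_joint = prob(lambda d: given(d) and second_is_red(d))  # P(A and B), counted directly
    # Multiplication rule (2): P(A and B) = P(A|B) * P(B)
    assert p_joint == p_a_given_b * p_b
    results[first_color] = (p_a_given_b, p_joint)

print(results["black"])  # (Fraction(5, 9), Fraction(5, 18))
print(results["red"])    # (Fraction(4, 9), Fraction(2, 9))
```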


Law of Total Probability

The Law of Total Probability relates the probability of an event A to the conditional probabilities P(A|B_i) based on different “partitions” or “cases” B_i of the sample space \Omega. The “partitions” or “cases” are indexed by an index set I.

The theorem is often used to compute probabilities by considering categories that are mutually exclusive or disjoint.

Theorem 1.2 (Law of Total Probability):
Let (\Omega, \mathcal{A}, P) be a probability space, A, B_i\in \mathcal{A} with P(B_i)>0 for all indices i\in I with I=\mathbb{N} or I=\{1, 2, \ldots, n\}, and \Omega = \sum_{i\in I}{B_i} a partition of the basic set. Then

    \begin{align*} P(A) = \sum_{i\in I}{ P(A \cap B_i) } = \sum_{i\in I}{ P(B_i) P(A|B_i) }. \end{align*}

Proof: The equality A=\sum_{i\in I}{A \cap B_i} holds true for every set A \in \mathcal{A}. Due to the \sigma-additivity of P, we can deduce

    \begin{align*} P(A) &= P\left(\sum_{i\in I}{ A \cap B_i } \right) \\      &= \sum_{i\in I}{ P(A \cap B_i) } \\      &= \sum_{i\in I}{P(B_i) \cdot \frac{P(A \cap B_i)}{P(B_i)} } \\      &= \sum_{i\in I}{P(B_i) \cdot P(A | B_i) }. \end{align*}

The situation with a finite index set I is illustrated below.




Let us consider also a quite simple and totally fictional example.

Example 1.3 (Math Enthusiast of a School Class)
A class consists of 55% boys and 45% girls. 40% of the boys state that they are math enthusiasts, while 35% of the girls are excited by math. How likely is it to pick a math enthusiast when choosing a kid of this particular school class at random?

Let us define the event A:={pick a math enthusiast of this school class}. Furthermore, we set B_b :={set of kids that are boys} and B_g :={set of kids that are girls}. By applying the Law of Total Probability, we get

    \begin{align*} P(A) &= P\left( (A \cap B_b) \cup (A \cap B_g) \right) \\      &=  P(A \cap B_b) +  P(A \cap B_g) \\      &= P(B_b) P(A|B_b) + P(B_g) P(A|B_g) \\      &= 0.55 \cdot 0.4 + 0.45 \cdot 0.35 \\      &= 0.3775. \end{align*}
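The Law of Total Probability translates into a weighted sum, which the following Python sketch applies to the numbers of Example 1.3. The function name total_probability and the dictionary keys are assumptions made for illustration only; the partition is represented by its probabilities P(B_i) and the conditional probabilities P(A|B_i).

```python
from fractions import Fraction

def total_probability(priors, likelihoods):
    """P(A) = sum over i of P(B_i) * P(A|B_i) for a partition (B_i) of the sample space."""
    if sum(priors.values()) != 1:
        raise ValueError("the probabilities of a partition must sum to 1")
    return sum(priors[k] * likelihoods[k] for k in priors)

# P(B_b), P(B_g): share of boys and girls in the class
priors = {"boy": Fraction(55, 100), "girl": Fraction(45, 100)}
# P(A|B_b), P(A|B_g): share of math enthusiasts within each group
likelihoods = {"boy": Fraction(40, 100), "girl": Fraction(35, 100)}

p_enthusiast = total_probability(priors, likelihoods)
print(p_enthusiast, float(p_enthusiast))  # 151/400 0.3775
```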


Bayes Rule

The conditional probability P(A|B) is the probability that both events A and B occur relative to the new basic set B. Let us rearrange the conditional probability formula (1) as follows:

    \begin{align*} P(A|B) &= \frac{P(A \cap B)}{P(B)} \\ \Leftrightarrow P(A|B) P(B) &= P(A\cap B). \end{align*}

If we swap the roles of the sets A and B and assume P(A)>0, we analogously get

    \begin{align*} P(B|A) &= \frac{P(B \cap A)}{P(A)} \\ \Leftrightarrow P(B|A) P(A) &= P(A\cap B). \end{align*}

Let us pause here for a moment. Why does it make sense to actually swap the roles of the sets A and B?

Since both rearrangements have the same right-hand side P(A\cap B), we can conclude for two sets A and B with P(A)>0 and P(B)>0 that

(3)   \begin{align*} P(B|A) P(A) &= P(A|B) P(B)\\ \Leftrightarrow P(A|B) &= \frac{P(A) P(B|A)}{P(B)}. \end{align*}

Now we are able to answer the question why it made sense to swap the roles of the sets A and B. We were able to derive a formula that contains both P(A|B) and P(A). Hence, we have derived a formula that connects the original probability P(A) with the probability P(A|B), which encodes the situation after we have received the additional information B.

Formula (3) is a special case of Bayes’ Rule or Bayes’ Theorem. Let us formulate the more general theorem.

Theorem 1.3 (Bayes’ Theorem):
Let (\Omega, \mathcal{A}, P) be a probability space, A, B_i\in \mathcal{A} with P(A)>0 and P(B_i)>0 for all indices i\in I with I=\mathbb{N} or I=\{1, 2, \ldots, n\}, and \Omega = \sum_{i\in I}{B_i} a partition of the basic set. Then

    \begin{align*} P(B_k| A) = \frac{P(A|B_k) \cdot P(B_k)}{ \sum_{i\in I}{P(A|B_i) \cdot P(B_i)} } \quad (k\in I). \end{align*}

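Proof: Apply formula (3) with the roles of the sets swapped, i.e. P(B_k|A) = \frac{P(A|B_k) \cdot P(B_k)}{P(A)}, and replace the denominator P(A) by means of the Law of Total Probability (Theorem 1.2).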


Example 2.1 (Medical Test for Rare Disease)
Suppose there is a rare disease that affects only 1 in 1000 people, i.e. a randomly chosen person has the disease with a probability of 0.1%. A medical test has been developed to diagnose this disease. The test is highly accurate: it correctly identifies the disease 99% of the time (sensitivity), and it also correctly identifies the absence of the disease 99% of the time (specificity).

Let us define the following events/sets:

  • Event D: a person has the disease
  • Event D^C: a person does not have the disease.
    This event is the complementary event of D
  • Event T_p: the medical test is positive

We want to find the probability that a person who tests positive actually has the disease, P(D|T_p).

Bayes’ Theorem helps us figure out the probability that someone actually has the disease D given that they have tested positive T_p. This is important because even though a test might be accurate, it is possible for false positives to happen. That is,

    \begin{align*} P(D| T_p) &= \frac{P(T_p | D) P(D) }{ P(T_p) } \end{align*}


  • P(T_p|D) is the probability of testing positive given that a person has the disease. This is the sensitivity of the test, which is 0.99 in this case.
  • P(D) is the prior probability of having the disease, which is 0.001 (1 in 1000 people)
  • P(T_p) is the probability of testing positive. Since a test can also be positive when a person doesn’t have the disease, we need to consider both scenarios and apply the Law of Total Probability when calculating

    \begin{align*} P(T_p) &= P(T_p|D) \cdot P(D) + P(T_p| D^C) \cdot P(D^C)\\        &= 0.99\cdot 0.001 + 0.01\cdot 0.999 \\        &= 0.01098. \end{align*}

Note that P(T_p| D^C) is the probability of testing positive given that a person doesn’t have the disease. This is 1 minus the specificity, so P(T_p|D^C) = 1 – 0.99 = 0.01. In addition, P(D^C) is the probability of not having the disease, which is 1-P(D)=0.999.

Putting it all together we get

    \begin{align*} P(D| T_p) &= \frac{P(T_p | D) P(D) }{ P(T_p) } \\ &= \frac{ 0.99 \cdot 0.001 }{ 0.01098 }\\ &= 0.0901... \\ &\approx 9\%. \end{align*}
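The calculation above is an instance of Bayes’ Theorem for the partition \{D, D^C\} and can be sketched in Python as follows. The function name bayes_posterior and the dictionary keys are hypothetical and only serve as an illustration; the denominator is computed via the Law of Total Probability.

```python
from fractions import Fraction

def bayes_posterior(priors, likelihoods):
    """Return P(B_k|A) for every k, given P(B_k) and P(A|B_k) for a partition (B_k)."""
    # Numerators of Bayes' Theorem: P(A|B_k) * P(B_k) = P(A and B_k)
    joint = {k: priors[k] * likelihoods[k] for k in priors}
    # Denominator: P(A) by the Law of Total Probability
    evidence = sum(joint.values())
    if evidence == 0:
        raise ValueError("P(A) must be positive")
    return {k: joint[k] / evidence for k in joint}

# P(D) and P(D^C)
priors = {"D": Fraction(1, 1000), "not D": Fraction(999, 1000)}
# P(T_p|D) = sensitivity, P(T_p|D^C) = 1 - specificity
likelihoods = {"D": Fraction(99, 100), "not D": Fraction(1, 100)}

posterior = bayes_posterior(priors, likelihoods)
print(posterior["D"])                    # 11/122
print(round(float(posterior["D"]), 4))   # 0.0902
```

Since the posterior probabilities of a partition sum to one, the remaining 111/122 is the probability that a person with a positive test does not have the disease.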

Imagine you have a sample of 1000 people as illustrated by the dots in Fig. 2. One out of 1000 actually has the disease and is correctly identified by the test. Roughly another 10 test positive although they do not have the disease. In total, 11 people have tested positive, but only one actually has the disease, i.e., \frac{1}{11}\approx 9\%.

Figure 2: Sample of 1000 people where one is true positive and 10 are false positive.
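The counting argument can also be mimicked by a small Monte Carlo simulation. The sketch below is an illustration under stated assumptions rather than part of the original example: the function name simulate_test, the sample size of 500,000 people and the fixed seed are all chosen freely. A larger sample than 1000 is used because only about one diseased person is expected per 1000 people, which would make the estimate very noisy.

```python
import random

def simulate_test(n_people, prevalence, sensitivity, specificity, seed=0):
    """Simulate disease status and test result; return (true positives, false positives)."""
    rng = random.Random(seed)  # fixed seed so that the run is reproducible
    true_pos = 0
    false_pos = 0
    for _ in range(n_people):
        diseased = rng.random() < prevalence
        if diseased:
            positive = rng.random() < sensitivity    # detected with P(T_p|D)
        else:
            positive = rng.random() >= specificity   # false alarm with 1 - specificity
        if positive and diseased:
            true_pos += 1
        elif positive:
            false_pos += 1
    return true_pos, false_pos

true_pos, false_pos = simulate_test(500_000, 0.001, 0.99, 0.99)
estimate = true_pos / (true_pos + false_pos)
print(true_pos, false_pos, round(estimate, 3))
```

The estimated share of true positives among all positive tests should be close to the exact value of roughly 9%, with some random fluctuation depending on the seed and the sample size.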


Formula (3) is also called Bayes’ Rule or Bayes’ Theorem.