# Causality, Fairness, and Statistical Parity

## Introduction

This post explores fairness, specifically the notion of statistical parity, from a causality viewpoint. Its goal is to gently introduce the reader to causality frameworks, presenting the basics of Bayesian networks, potential outcomes, randomized controlled trials, and causal graphs. The post was inspired by an example in the AAAI 2018 paper *Fair Inference on Outcomes*, which shows how statistical parity may fail to reveal an underlying causal effect that is discriminatory.

Consider the following hypothetical scenario. A company employs an algorithm for its hiring decisions. The algorithm bases its decisions on the education level of the candidate, and on whether the candidate has a prior conviction. Let’s further assume that according to law a hiring decision should not discriminate on the grounds of prior conviction.

Our goal is to investigate whether the algorithm is non-discriminatory, or fair.

## Bayesian Networks

Let’s denote the hiring decision with the boolean random variable $H$, where $H=1$ means the candidate is hired. Similarly, let $E$ and $C$ denote the boolean random variables for education level and prior conviction, respectively, where $E=1$ means the candidate has finished high school and $C=1$ means the candidate has a prior conviction.

Let’s assume we have observed the hiring decisions for several candidates over time. For each candidate we observe their $H, E, C$ values, and collect their joint distribution, shown in the following table.

Any joint probability distribution, such as $p(H,E,C)$, can be represented, or factorized, as a product of marginal and conditional probability distributions by applying the chain rule. Note that there are multiple factorizations for a given joint probability. One such factorization of our joint distribution is

$$p(H,E,C) = p(H|E,C) \cdot p(C|E) \cdot p(E). \label{eq:factorization}$$

Factorizations are visually represented by Bayesian networks. These are directed acyclic graphs, where nodes represent random variables, and edges encode conditional dependences. Each node is associated with a function that specifies its probability distribution conditional on its parents’ values; if a node has no parents, the function is the node’s marginal distribution. Below is the Bayesian network corresponding to the aforementioned factorization. The depicted probabilities can be computed directly from the joint probability distribution by appropriately marginalizing and conditioning. (In fact, to illustrate this example, the opposite direction was taken: the conditional and marginal probabilities were set, and the joint distribution was then derived, which explains the strange values in the latter.) The direction of the edges indicates forward probabilities. For example, the edge from $E$ to $C$ means we have the conditional $p(C|E)$ in our factorization. To compute reverse probabilities, e.g., $p(E|C)$, one has to apply Bayes’ rule, i.e., $p(E|C) = p(C|E)\cdot p(E)/p(C)$; hence the name Bayesian networks!
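As a rough illustration of how a joint distribution follows from its factorization, the sketch below multiplies the three factors $p(H|E,C)$, $p(C|E)$, and $p(E)$ for every configuration. The numeric values are the ones quoted elsewhere in this post; the dictionary and function names are made up for this example.

```python
from itertools import product

# Values quoted in this post, assumed here to be the network's tables.
p_E = {0: 0.2, 1: 0.8}                          # p(E)
p_C1_given_E = {0: 0.85, 1: 0.10}               # p(C=1 | E)
p_H1_given_EC = {(0, 0): 0.10, (0, 1): 0.40,    # p(H=1 | E, C), keyed by (E, C)
                 (1, 0): 0.30, (1, 1): 0.05}

def bernoulli(p_one, value):
    """Probability that a binary variable with p(X=1) = p_one takes `value`."""
    return p_one if value == 1 else 1.0 - p_one

def joint(h, e, c):
    """Chain rule: p(H, E, C) = p(H | E, C) * p(C | E) * p(E)."""
    return (bernoulli(p_H1_given_EC[(e, c)], h)
            * bernoulli(p_C1_given_E[e], c)
            * p_E[e])

table = {(h, e, c): joint(h, e, c) for h, e, c in product((0, 1), repeat=3)}
for (h, e, c), p in sorted(table.items()):
    print(f"H={h} E={e} C={c}: {p:.3f}")
```

Going the other way, from `table` back to the conditionals, would only require summing and dividing entries of the same dictionary.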

So if there are multiple factorizations (and thus Bayesian networks) for a given joint probability distribution, why might we care about a specific one? For one, certain forward probabilities might make more sense to explicitly encode than others. For example, $p(C|E)$ might be observable or could be easier to model compared to $p(E|C)$. As we later discuss, it is often the case that we encode conditional probabilities that reflect cause-effect relationships, e.g., $E$ causes $C$ and not the other way around.

Bayesian networks visualize (some of) the probabilistic dependencies in the joint distribution. If conditional independencies exist, then the factorization is a compact representation of the joint distribution. This is not the case in our example, where 7 probability values are required in the Bayesian network (1 for $p(E)$, 2 for $p(C|E)$, and 4 for $p(H|E,C)$), exactly as many as the $2^3-1$ free parameters needed to specify the probabilities of the $2^3$ possible configurations of $H, E, C$. One can see that there are no conditional independencies in this Bayesian network, because there is at least one non-blocked undirected path between each pair of nodes: the edge connecting them. In general, the notion of d-separation defines what blocking means and reveals the conditional independence relationships that are encoded in a Bayesian network.

Using this Bayesian network, let’s examine what the observations tell us. Finishing high school is very common, with a probability of $p(E=1)=0.8$. Moreover, education level and prior conviction are strongly correlated. There is an $85\%$ chance that a candidate will have a prior conviction if they have not finished high school, i.e., $p(C=1|E=0)=0.85$. Conversely, there is a $90\%$ chance that a candidate will not have a prior conviction if they have finished high school, i.e., $p(C=0|E=1)=0.9$. Lastly, observe that the hiring decisions are summarized by the conditional probability table of $p(H|E,C)$. One thing to notice is that, among the non-high school graduates ($E=0$), the hiring decision appears to favor prior convicts ($C=1$). The situation is reversed for high school graduates ($E=1$), where non-convicts ($C=0$) are favored. So, what can we tell about the hiring decision in terms of discrimination based on prior conviction?

## Fairness in Terms of Statistical Parity

Let’s return to our main goal and investigate whether the algorithm responsible for the hiring decision is fair. First, we need to formalize what fairness, or no discrimination, means. As mentioned, the algorithm should not discriminate on prior conviction. This can be conceptualized as the requirement for statistical parity, which says that the conditional probabilities of hiring given a prior conviction and given no prior conviction should be equal, i.e.,

$$p(H=1|C=0) = p(H=1|C=1). \label{eq:stat_parity}$$

To compute these conditional probabilities, we need to first compute $p(H,C)$ and $p(C)$.

We can compute $p(H,C)$ using the conditional probabilities available in the Bayesian network: $p(H,C) = \sum_{E} p(H|E,C) \cdot p(C|E) \cdot p(E)$. Therefore, we get:

$$p(H=0,C=0) = 0.9 \cdot 0.15 \cdot 0.2 + 0.7 \cdot 0.9 \cdot 0.8 = 0.531$$

$$p(H=1,C=0) = 0.1 \cdot 0.15 \cdot 0.2 + 0.3 \cdot 0.9 \cdot 0.8 = 0.219$$

$$p(H=0,C=1) = 0.6 \cdot 0.85 \cdot 0.2 + 0.95 \cdot 0.1 \cdot 0.8 = 0.178$$

$$p(H=1,C=1) = 0.4 \cdot 0.85 \cdot 0.2 + 0.05 \cdot 0.1 \cdot 0.8 = 0.072$$

From $p(H,C)$, we can marginalize $H$ out to compute $p(C) = \sum_{H} p(H,C)$. Therefore, we get:

$$p(C=0) = 0.531 + 0.219 = 0.75$$

$$p(C=1) = 0.178 + 0.072 = 0.25$$

Plugging in these probabilities, we can straightforwardly compute $p(H=1|C=0) = p(H=1, C=0) / p(C=0)$ and $p(H=1|C=1) = p(H=1, C=1) / p(C=1)$.

It turns out that $p(H=1|C=0) = 0.292$ and $p(H=1|C=1) = 0.288$. Seeing they are roughly equal, we conclude that there is no discrimination according to the statistical parity definition.
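The same check can be scripted. The following is a minimal sketch, assuming the conditional probabilities used in the computations above; the helper names are hypothetical and chosen only for this example.

```python
# Conditional probability tables, as used in the computations above.
p_E = {0: 0.2, 1: 0.8}                          # p(E)
p_C1_given_E = {0: 0.85, 1: 0.10}               # p(C=1 | E)
p_H1_given_EC = {(0, 0): 0.10, (0, 1): 0.40,    # p(H=1 | E, C), keyed by (E, C)
                 (1, 0): 0.30, (1, 1): 0.05}

def p_c_given_e(c, e):
    """p(C=c | E=e)."""
    return p_C1_given_E[e] if c == 1 else 1.0 - p_C1_given_E[e]

def p_h1_and_c(c):
    """p(H=1, C=c), obtained by summing E out of the factorized joint."""
    return sum(p_H1_given_EC[(e, c)] * p_c_given_e(c, e) * p_E[e] for e in (0, 1))

def p_c(c):
    """p(C=c), obtained by summing E out of p(C | E) * p(E)."""
    return sum(p_c_given_e(c, e) * p_E[e] for e in (0, 1))

def p_h1_given_c(c):
    """p(H=1 | C=c), the quantity compared by statistical parity."""
    return p_h1_and_c(c) / p_c(c)

for c in (0, 1):
    print(f"p(H=1 | C={c}) = {p_h1_given_c(c):.3f}")
```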

## Causality

Let’s re-examine what statistical parity tells us. The probability of hiring a convict is almost the same as that of hiring a non-convict — this is what the conditional probability $p(H|C)$ reveals. But to convincingly argue there is no discrimination, we need to show that prior conviction plays no role in the hiring decision, or, in other words, that prior conviction does not cause the hiring outcome. We are thus looking for the causal effect of $C$ on $H$.

We can conceptualize and calculate the causal effect in two different but equivalent ways, one based on the potential outcomes framework of J. Neyman and D. Rubin, and one based on the causal graphs of S. Wright and J. Pearl.

### Potential Outcomes Framework

Consider some candidate. The potential outcome $H_{C=0}$ for this candidate corresponds to the value $H$ would have taken, had variable $C$ been assigned the value $0$. Since the variable $C$ is binary, there is just one other potential outcome, $H_{C=1}$, for this candidate. Note that the observed outcome, which is denoted as $H$, can only be one of these two potential outcomes. The potential outcome that coincides with the observed one is called the factual outcome, while the other is a counterfactual outcome that corresponds to an imaginary world where the candidate’s prior conviction is reversed.

So let’s assume we can create worlds where we can assign to candidates their prior conviction status. We’re going to create world 0 where every candidate has no prior conviction, and world 1 where every candidate has a prior conviction. In world 0, we observe the potential outcome $H_{C=0}$ for all candidates, and $p(H_{C=0}=1)$ denotes the probability that a candidate is hired in this world. Similarly, $p(H_{C=1}=1)$ denotes the same probability for world 1. We can say that there is no discrimination on prior conviction $C$ from the hiring decision process if getting hired is equally likely in both worlds, that is:

$$p(H_{C=0}=1) = p(H_{C=1}=1). \label{eq:causal_fairness}$$

This is equivalent to saying there is no average causal effect of $C$ on $H$, i.e., the expected values of the potential outcomes are equal, $\mathbb{E}[H_{C=1}] - \mathbb{E}[H_{C=0}] = 0$. The equivalence holds because $\mathbb{E}[X] = p(X=1)$ for a binary variable $X$.

Assuming we cannot create worlds at will, the gold standard for measuring causal effects is to perform a randomized controlled trial (RCT), which helps quantify the causal effect of a treatment on an outcome. An RCT assigns individuals randomly to either the control group, where they will not be treated, or to the experimental group, where they will be treated. Observe that the control and experimental groups can be seen as random samples of hypothetical worlds 0 and 1, respectively. Thus we can estimate the expected value of the potential outcomes by computing the means of the outcomes in the groups.

In our case, to perform an RCT means we should treat prior conviction $C$ as the treatment, and the hiring decision $H$ as the outcome. Clearly, this is not feasible: we cannot force people to become convicts. Whenever it’s impossible or unethical to control the treatment (e.g., race when studying discrimination, or smoking habits when studying lung cancer), we cannot use RCTs to measure causal effects.

But not all is lost. Causal inference frameworks allow us to identify causal effects from observations (i.e., not experiments such as RCTs) under some causality assumptions. In the potential outcomes framework, the most important assumption we need to make is conditional ignorability. This specifies a set of variables, called confounders, given which the potential outcome becomes independent of the assigned treatment. To make this more concrete, let’s assume in our case that each potential outcome $H_{C=c}$ becomes independent of the actual value of $C$ given the value of the confounder $E$. Ignorability can be expressed as $p(H_{C=c}|E, C) = p(H_{C=c}|E)$ for $c \in \{0, 1\}$. Essentially, we can ignore the treatment assignment if we control for the confounders.

Ignorability, together with the fact that the observed outcome $H$ coincides with the potential outcome $H_{C=c}$ for candidates who actually have $C=c$ (consistency), implies that the probability of a potential outcome conditional on $E$ equals the probability of the observed outcome conditional on $E$ and $C$: $p(H_C|E) = p(H_C|E,C) = p(H|E,C)$. In other words, computing the hiring probability among the candidates with some fixed education level $E$ in a hypothetical world, say world 0 where all candidates are assigned $C=0$, is the same as computing the hiring probability among the candidates who have $C=0$ and that fixed education level $E$. So instead of performing an RCT to randomize the assignment of $C$, we can just condition on $C$ and measure probabilities per stratum of $E$. We can then aggregate these per-stratum probabilities to compute the probability of the potential outcome. This gives us the famous adjustment formula, which in our setting, where $H$ is the outcome, $H_C$ is the potential outcome, $C$ is the treatment, and $E$ is the confounder, is expressed as:

$$p(H_C) = \sum_E p(H_C|E) p(E) = \sum_E p(H|E,C) p(E). \label{eq:adjustment}$$

The causal effect can be now computed. But we’ll come back to it.

### Causal Graphs

As the previous section illustrated, determining causal effects from observations is only possible if we make some causality assumptions. Specifically, we previously made the ignorability assumption that $E$ acts as a confounder in the relationship of $C$ and $H$, which led us to use the adjustment formula as the means to compute the causal effect of $C$ on $H$.

In the causal graphs framework, the causality assumptions are encoded directly in a graph. A causal graph (more accurately a causal Bayesian network) is a Bayesian network where the direction of the edges implies a causal relationship between the two variables. The edge $A \to B$ means that the value of $B$ is derived from that of $A$, or as J. Pearl puts it, $B$ listens to $A$. Essentially, the set of edges in a causal graph expresses the data generating process. While there can be many Bayesian networks that give rise to the same observed data (by representing different factorizations of one joint distribution), there is a single Bayesian network, the causal graph, that explains how the data is generated. This ability is endowed to a Bayesian network by the causality assumptions encoded in its edges. One could write down such a mnemonic:

$$\textit{causal graph} = \textit{Bayesian network} + \textit{causality assumptions}.$$

As a side note, there are some causality questions, e.g., computing counterfactuals about individuals, that require stronger causality assumptions. Briefly, in structural causal models, the value of a variable $X$ is determined by some function $f_X$ of the values of $X$’s parents in the causal graph, i.e., $X = f_X(pa(X))$, where $pa(X)$ denotes the set of $X$’s parents. These structural equations, or response functions, replace the conditional probability functions, with stronger causality assumptions.

Let’s return to our example, and let’s assume that the Bayesian network we have seen so far is a causal one. That is, education level affects both prior conviction and the hiring outcome, and the hiring outcome is affected by prior conviction and education level alone. This is what the direction of edges suggests, and this is what the causality assumptions we have made imply. So given this causal graph, let’s see how we can determine if there is discrimination of the hiring decision based on prior conviction. We will build our answer on the observations we have collected, which are summarized in the conditional probability tables.

To reach there, we first have to talk about interventions and distinguish them from observations. Consider our causal graph that encodes all causality assumptions regarding the hiring decision, prior conviction, and education level. To answer causality questions, we have to argue about hypothetical situations, and we need a language to express these situations. Suppose we were to change the value of prior conviction for all candidates and set it to 0. This would be an intervention, and we use the do operator to express it, here $do(C=0)$. Intervening on some variable causes changes in those variables that causally depend on it, but will leave the rest of the variables unaffected.

To argue about hypothetical situations, we have to express the effect of an intervention on some variable. Let $p(H|do(C=0))$ denote the probability of $H$ given the intervention of assigning to $C$ the value 0. To better understand what this means, let’s contrast it to the conditional probability $p(H|C=0)$. The latter tells us what is the probability of $H$ given that we have observed $C$ to be 0. It’s the answer to this question: among those who happen to have $C=0$, what is their $H$ value? The former is the probability of $H$ if we intervene and set $C$ to 0. It’s the answer to a different, hypothetical question: if everyone were to have $C=0$, what would their $H$ value be? It should be clear that $p(H|do(C=0))$ and $p(H|C=0)$ are different probabilities. They are only equal if there is no confounding, i.e., common cause of $C$ and $H$. This is not the case in our example, and in fact this is the reason why statistical parity, \eqref{eq:stat_parity}, and causal non-discrimination, \eqref{eq:causal_fairness} and \eqref{eq:causal_fairness_do}, are distinct notions of non-discrimination.

Let’s now express non-discrimination in our example with the causal graphs language. We argue there is no discrimination on prior conviction $C$ from the hiring decision process if getting hired $H$ remains equally likely whichever way we intervene and change the value of $C$, that is:

$$p(H=1|do(C=0)) = p(H=1|do(C=1)). \label{eq:causal_fairness_do}$$

Contrast this with \eqref{eq:causal_fairness}. Observe that the hypothetical situation where we intervene with $do(C=0)$ corresponds to world 0 in the potential outcomes framework. So the probability $p(H|do(C=0))$ is exactly the probability $p(H_{C=0})$ of the potential outcome $H_{C=0}$ in world 0. Although \eqref{eq:causal_fairness} and \eqref{eq:causal_fairness_do} use different notation, they convey the same non-discrimination concept. The benefit of expressing this concept using the causal graphs framework is that we can employ a powerful toolset, the do calculus, to compute probabilities given interventions.

Recall that an intervention, like $do(C=1)$, means that we explicitly set the value of some variable, in our case $C$ to $1$. In the causal graphs framework, this translates to the intervened variable no longer listening to any other variables. As a result, all edges leading to the intervened variable should be removed. Equivalently, its conditional probability table is reduced to $p(C=1)=1$; in structural causal models, the structural equation for $C$ is replaced by $C=1$. The causal graph below illustrates the effect of the $do(C=1)$ intervention.
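To make the idea of replacing a structural equation more tangible, here is a small simulation sketch. It assumes the example network is read causally and uses the probabilities quoted in this post; the uniform random draws standing in for unobserved noise, as well as the function names, are assumptions made purely for illustration.

```python
import random

# Hiring probabilities p(H=1 | E, C), keyed by (E, C), as quoted in this post.
P_HIRE = {(0, 0): 0.10, (0, 1): 0.40, (1, 0): 0.30, (1, 1): 0.05}

def sample_candidate(rng, do_c=None):
    """Draw one (E, C, H) triple; passing do_c replaces the equation for C."""
    e = int(rng.random() < 0.8)                    # E has no parents
    if do_c is None:
        p_convicted = 0.10 if e == 1 else 0.85     # C listens to E
        c = int(rng.random() < p_convicted)
    else:
        c = do_c                                   # do(C=c): C listens to nothing
    h = int(rng.random() < P_HIRE[(e, c)])         # H listens to E and C
    return e, c, h

def hiring_rate(n, do_c=None, seed=0):
    """Fraction of hired candidates among n simulated draws."""
    rng = random.Random(seed)
    return sum(sample_candidate(rng, do_c)[2] for _ in range(n)) / n

rate_do0 = hiring_rate(200_000, do_c=0)
rate_do1 = hiring_rate(200_000, do_c=1)
print(f"hiring rate under do(C=0): {rate_do0:.3f}")
print(f"hiring rate under do(C=1): {rate_do1:.3f}")
```

Sampling noise aside, the two rates estimate $p(H=1|do(C=0))$ and $p(H=1|do(C=1))$. Note that only the line computing `c` changes under the intervention, while the equations for `e` and `h` are left untouched.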

We will now compute $p(H|do(C=1))$ from this causal graph, using the three rules of do calculus. The goal is to rewrite this interventional probability in a way that contains only observational probabilities, i.e., there is no conditioning on the intervention $do(C=1)$. In our case, this is simple and we only need to apply a rule we have briefly mentioned. If there is no common cause of $C$ and $H$, then $p(H|do(C=1)) = p(H|C=1)$. Examining the causal graph after the intervention, it should be clear that this property now holds. $E$ used to be a common cause for $C$ and $H$. Now that we have intervened and assigned its value, $C$ is no longer affected by any other variable.

So it turns out we need to compute $p(H|C)$ in the intervened causal graph. Using the probabilities encoded in this new Bayesian network, we proceed by computing $p(H,C)$ and $p(C)$. The latter is trivial because for the intervention $do(C=1)$ it holds that $p(C=1)=1$. To compute the former, we will marginalize $E$ out of the joint distribution. As we are working on this new Bayesian network, note that the joint distribution is now factorized differently as $p(H,E,C) = p(H|E,C) \cdot p(E) \cdot p(C)$. Therefore, for either intervention $do(C=1)$ or $do(C=0)$, we derive that:

$$p(H|do(C)) = p(H|C) = p(H,C) = \sum_E p(H|E,C) \cdot p(E). \label{eq:adjustment_do}$$

This is again the adjustment formula of \eqref{eq:adjustment}, which should come as no surprise. The only difference is the way we have derived it; using the language of causal graphs and the do calculus might appear to be more intuitive.
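The derivation can also be checked numerically. The sketch below, which assumes the conditional probability values quoted in this post and uses hypothetical function names, builds the joint distribution of the intervened network, conditions on $C$ inside it, and compares the result with the adjustment sum.

```python
p_E = {0: 0.2, 1: 0.8}                          # p(E)
p_H1_given_EC = {(0, 0): 0.10, (0, 1): 0.40,    # p(H=1 | E, C), keyed by (E, C)
                 (1, 0): 0.30, (1, 1): 0.05}

def intervened_joint_h1(e, c, do_c):
    """p(H=1, E=e, C=c) after do(C=do_c): p(H=1 | E, C) * p(E) * p(C)."""
    p_c = 1.0 if c == do_c else 0.0             # C is now a constant
    return p_H1_given_EC[(e, c)] * p_E[e] * p_c

def p_h1_do(do_c):
    """p(H=1 | C=do_c) computed inside the intervened network."""
    p_h1_and_c = sum(intervened_joint_h1(e, do_c, do_c) for e in (0, 1))
    p_c = 1.0                                   # p(C=do_c) = 1 under do(C=do_c)
    return p_h1_and_c / p_c

def adjustment(c):
    """Adjustment formula: sum over E of p(H=1 | E, C=c) * p(E)."""
    return sum(p_H1_given_EC[(e, c)] * p_E[e] for e in (0, 1))

for c in (0, 1):
    print(f"do(C={c}): surgery {p_h1_do(c):.3f}, adjustment {adjustment(c):.3f}")
```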

## Fairness in Terms of Causality

We have now formulated two identical causality-based fairness notions, \eqref{eq:causal_fairness} and \eqref{eq:causal_fairness_do}, and two identical ways to compute them, \eqref{eq:adjustment} and \eqref{eq:adjustment_do}. Let’s pick the do notation and continue to test whether the hiring decision is non-discriminatory from our causality viewpoint.

We compute $p(H=1|do(C=0))=0.260$ and $p(H=1|do(C=1))=0.120$, showing that hiring a candidate without prior conviction is more than twice as likely as hiring a prior convict. There is a strong causal effect of prior conviction $C$ on the hiring decision $H$, and therefore we find there is discrimination against prior convicts.
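For reference, these two values follow from the adjustment formula with $p(E=0)=0.2$, $p(E=1)=0.8$, and the hiring probabilities $p(H=1|E,C)$ used in the statistical parity computation:

$$p(H=1|do(C=0)) = 0.1 \cdot 0.2 + 0.3 \cdot 0.8 = 0.260$$

$$p(H=1|do(C=1)) = 0.4 \cdot 0.2 + 0.05 \cdot 0.8 = 0.120$$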

### Observation vs Intervention

Based on the notion of statistical parity, we find no discrimination, while with a causality viewpoint, we find discrimination. So why do we see this discrepancy? Consider a dataset of 1,000 candidates, and let’s compare two situations.

The first, termed Observation, is when the $E$, $C$, and $H$ values are distributed as in their observed joint probability distribution $p(H,E,C)$, also captured by the Bayesian network. This situation corresponds to the left half of the table. The number of candidates in the $E$, $C$ groups is governed by the joint distribution $p(E,C)$ shown in the table below. The number of hired candidates depends on the number of candidates in the $E$, $C$ groups and the hiring probability $p(H|E,C)$, also depicted; e.g., there are 30 candidates with $E=0, C=0$ and, because $p(H=1|E=0, C=0)=0.1$, only 3 among them are hired. Moreover, observe that a total of 219 (resp. 72) out of 750 non-convicted (resp. 250 convicted) candidates are hired, resulting in the two $p(H=1|C)$ probabilities being roughly equal. Note that the $p(H=1,E,C)$ probabilities match the observed ones shown in the beginning of this post.

In the second situation, termed Intervention, we’re making a $do(C)$ intervention. Specifically, in a random half of the candidates, we’re assigning no prior conviction, $do(C=0)$, while in the other half we’re assigning a prior conviction, $do(C=1)$. This situation represents an RCT, where the $do(C=0)$ half corresponds to World 0, and the $do(C=1)$ half to World 1. This situation is depicted in the right half of the table, where the top (resp. bottom) is World 0 (resp. 1). The number of candidates in the $E$, $C$ groups has now changed, because we have intervened and manually set $C$ independently of $E$. The joint distribution now becomes $p(E,C) = p(E)\cdot p(C) = 0.5 \cdot p(E)$, since being assigned a prior conviction or not are equally likely. The following table shows the $p(E,C)$ under the intervention, and the hiring probability $p(H|E,C)$, which is unchanged.

The most significant change in Intervention concerns the prior convicted candidates. There are now twice as many previously convicted candidates. More importantly, there are five times as many prior convicts with high school education (the $E=1,C=1$ group). Because the algorithm hires candidates in this population group with a low probability $p(H=1|E=1,C=1)=0.05$, it turns out that the total number of prior convicts hired is lower than before, even though there are twice as many. Specifically, only 60 out of 500 prior convicts are hired, which results in the low interventional probability $p(H=1|do(C=1))=0.12$.

Note how the $p(H=1,E,C)$ probabilities in Intervention differ from the corresponding ones in Observation. This results in the discrepancy between $p(H|C)$ and $p(H|do(C))$. Because we have made some causality assumptions, captured in the causal graph, we are able to derive $p(H|do(C))$ in Intervention from the probabilities in Observation.
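The head counts discussed in this section can be recomputed with a short script. This is only a sketch: it works with expected counts for 1,000 candidates rather than a simulated dataset, it assumes the probabilities quoted in this post, and its function names are invented for the example.

```python
N = 1000                                        # number of candidates
p_E = {0: 0.2, 1: 0.8}                          # p(E)
p_C1_given_E = {0: 0.85, 1: 0.10}               # p(C=1 | E)
p_H1_given_EC = {(0, 0): 0.10, (0, 1): 0.40,    # p(H=1 | E, C), keyed by (E, C)
                 (1, 0): 0.30, (1, 1): 0.05}

def p_ec_observed(e, c):
    """Observation: p(E, C) = p(C | E) * p(E)."""
    p_c1 = p_C1_given_E[e]
    return p_E[e] * (p_c1 if c == 1 else 1.0 - p_c1)

def p_ec_intervened(e, c):
    """Intervention: C is assigned by a fair coin, so p(E, C) = 0.5 * p(E)."""
    return p_E[e] * 0.5

def counts(p_ec):
    """Expected (group size, number hired) for every (E, C) group."""
    rows = {}
    for e in (0, 1):
        for c in (0, 1):
            size = N * p_ec(e, c)
            rows[(e, c)] = (size, size * p_H1_given_EC[(e, c)])
    return rows

for name, p_ec in (("Observation", p_ec_observed), ("Intervention", p_ec_intervened)):
    rows = counts(p_ec)
    print(name)
    for (e, c), (size, hired) in rows.items():
        print(f"  E={e} C={c}: {size:.0f} candidates, {hired:.0f} hired")
    for c in (0, 1):
        size = sum(rows[(e, c)][0] for e in (0, 1))
        hired = sum(rows[(e, c)][1] for e in (0, 1))
        print(f"  C={c}: {hired:.0f} of {size:.0f} hired ({hired / size:.3f})")
```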

### Alternate Causality Assumptions

Let’s now explore what would happen if we had made an alternate set of causality assumptions. Let’s now assume that the education level $E$ does not cause the prior conviction $C$, and rather it’s $C$ that causes $E$. This alternate assumption is possibly not as plausible, but it serves to illustrate the implications of causality assumptions.

The alternate causal graph is shown below in the left part, where only the direction of the edge between $E$ and $C$ is reversed. This corresponds to a different factorization of the same joint distribution. Note that we do not show in the graph the marginal and conditional probability distributions.

Since just the causal graph has changed and all observations are still valid, the conditional hiring probability $p(H|C)$ given prior conviction is unchanged. Therefore, the hiring decision is still fair in terms of statistical parity.

Let’s now investigate fairness in terms of causality. To determine the effect of $C$ on $H$, we have to intervene on $C$, and compute $p(H|do(C))$. Graphically, this intervention means we have to sever all incoming edges to $C$. As there are none, we get the identical causal graph, shown in the right part of the figure. Because the causal graph has not changed, we now have that the interventional probability $p(H|do(C))$ is equal to the observational probability $p(H|C)$. Therefore, the hiring decision is also fair in terms of causality. No discrepancy this time.
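A quick numerical check of this claim is sketched below. Under the alternate graph, intervening on $C$ leaves the factors $p(E|C)$ and $p(H|E,C)$ in place, so $p(H|do(C)) = \sum_E p(H|E,C) \cdot p(E|C)$. The code assumes the probabilities quoted in this post, obtains $p(E|C)$ through Bayes’ rule, and uses hypothetical function names.

```python
p_E = {0: 0.2, 1: 0.8}                          # p(E)
p_C1_given_E = {0: 0.85, 1: 0.10}               # p(C=1 | E)
p_H1_given_EC = {(0, 0): 0.10, (0, 1): 0.40,    # p(H=1 | E, C), keyed by (E, C)
                 (1, 0): 0.30, (1, 1): 0.05}

def p_c_given_e(c, e):
    """p(C=c | E=e)."""
    return p_C1_given_E[e] if c == 1 else 1.0 - p_C1_given_E[e]

def p_e_given_c(e, c):
    """Bayes' rule: p(E=e | C=c) = p(C=c | E=e) * p(E=e) / p(C=c)."""
    p_c = sum(p_c_given_e(c, e2) * p_E[e2] for e2 in (0, 1))
    return p_c_given_e(c, e) * p_E[e] / p_c

def p_h1_do_alternate(c):
    """Alternate graph (C causes E): E still responds to the assigned C."""
    return sum(p_H1_given_EC[(e, c)] * p_e_given_c(e, c) for e in (0, 1))

def p_h1_do_original(c):
    """Original graph (E causes C): adjust for the confounder E."""
    return sum(p_H1_given_EC[(e, c)] * p_E[e] for e in (0, 1))

for c in (0, 1):
    print(f"C={c}: alternate {p_h1_do_alternate(c):.3f}, "
          f"original {p_h1_do_original(c):.3f}")
```

The only difference between the two functions is the weight given to each education stratum: $p(E|C)$ when $E$ is a mediator, and $p(E)$ when $E$ is a confounder.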

So, we see that causality-based fairness crucially depends on the causality assumptions we introduce. In contrast, statistical parity is agnostic to any causality assumptions and only depends on the observed data.

As a general note, the initial causal assumptions suggest that $E$ is a common cause, or a confounder, of $C$ and $H$. Therefore, we have to control for $E$ in order to discover the causal effect of $C$ on $H$. In contrast, the alternate causal assumptions suggest that $E$ is just a mediator of the effect of $C$ on $H$. There are no common causes of $C$ and $H$, so there is nothing to control for. The observed relationship between $C$ and $H$ corresponds to their causal relationship.

In general, if fairness or discrimination is with respect to a variable (such as $C$ in our case) that is not caused by any other variable (as in the alternate causality assumptions), we will find no distinction between statistical parity and causality-based fairness. This is typically the case when we study discrimination in terms of attributes such as race and gender, whose assignment is random by nature.