Probabilistic Oddities - Making Sense of Conditional Probability

Consider the following problem:

One in 1000 people has a particular disease. There is a test for this disease, which is 95% accurate. A doctor offers a patient this test, and the patient tests positive for this disease. What is the probability that the patient has the disease?

If you haven't seen problems like this before, you are advised to spend a couple of minutes thinking about it before you read on.

Gut feeling

When a similar question to this was posed to a group of Harvard medical students, the majority of them said 95%. Perhaps this is what you said too?

However, the actual probability is much lower. Out of, say, 100,000 people, 100 will have the disease and 99,900 will not. If we gave all of them the test, 95 of the 100 people with the disease will test positive, and 94,905 of the 99,900 people without it will test negative. This means 4,995 people will test positive and not have the disease! Out of all the 5,090 people who test positive, only 95 have the disease. This means the probability of the patient having the disease is actually 95/5090, or about 2% - much less than the 95% that intuition suggests. In fact, this type of question, phrased in terms of cancer screening, has been replicated in many surveys, and as many as 85% of doctors get it wrong!
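The head count can be reproduced with a few lines of Python. This is only a sketch: it assumes that "95% accurate" means the test is right 95% of the time for both sick and healthy people, and the variable names are made up for illustration.

```python
# Natural-frequency version of the disease problem.
population = 100_000
prevalence = 1 / 1000   # one in 1000 people has the disease
accuracy = 0.95         # assumed to apply to both sick and healthy people

sick = round(population * prevalence)
healthy = population - sick

true_positives = round(sick * accuracy)                # sick and test positive
false_positives = healthy - round(healthy * accuracy)  # healthy but test positive
all_positives = true_positives + false_positives

prob_disease_given_positive = true_positives / all_positives

print(f"True positives:  {true_positives}")
print(f"False positives: {false_positives}")
print(f"P(disease | positive) = {prob_disease_given_positive:.3f}")
```

Changing `prevalence` to something like 1 / 10 shows how quickly the answer climbs once the disease is no longer rare.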

If any of your friends are still in any doubt, perhaps a tree diagram would convince them:

The reason so many people get it wrong is that they overlook the fact that, although the test is very accurate, the rarity of the disease means false positives far outnumber true positives. In short, when they were told about the results of the test, they promptly forgot what they already knew about the disease - that it was very rare indeed.

We can re-phrase the question in more mathematical terms. We will write P(A) as the probability of event A occurring, and P(A|B) (read A|B as "A given B") as the probability of event A occurring if we know event B occurs. In our example, we can call "the patient having the disease" event A, and "the patient testing positive" event B. P(A) is 0.001, P(B|A) is 0.95, and P(B|not A) is 0.05. The probability we are looking for is P(A|B) - the probability of having the disease, if we know that the test was positive.

It's a common mistake to confuse P(A|B) with P(B|A), but a simple example should make matters clear: let A be the event that you meet David Cameron tomorrow, and B be the event that you meet any British people at all tomorrow. Clearly, P(B|A) is 1 - if you meet David Cameron tomorrow then you will definitely have met at least one British person. But P(A|B) is still minuscule - of all the people who meet a British person tomorrow, few will be meeting David Cameron.

Bayes' Theorem

The mathematical tool that allows us to calculate P(A|B) from P(B|A) is called Bayes' Theorem. The usual statement of Bayes' Theorem is:

$$P(A|B) = \frac{P(A) P(B|A)}{P(B)} $$

That is, to convert from P(B|A) to P(A|B), we need to know two more things: P(A) and P(B). In the above situation, P(A) was given to us, and we (essentially) calculated P(B) as P(A)P(B|A) + P(not A)P(B|not A). You can take a brief moment to think about why this works.

Using this equation then, we arrive at

$$P(A|B) = \frac{0.001 \times 0.95}{0.001 \times 0.95+0.999 \times 0.05} \approx 0.0187$$
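The same calculation can be written as a small function. This is a minimal sketch; the function and argument names are invented for illustration.

```python
def posterior(prior, p_b_given_a, p_b_given_not_a):
    """Return P(A|B) from P(A), P(B|A) and P(B|not A) using Bayes' Theorem."""
    # Total probability of the evidence: P(B) = P(A)P(B|A) + P(not A)P(B|not A)
    p_b = prior * p_b_given_a + (1 - prior) * p_b_given_not_a
    return prior * p_b_given_a / p_b


# Disease example: P(A) = 0.001, P(B|A) = 0.95, P(B|not A) = 0.05
p = posterior(0.001, 0.95, 0.05)
print(f"P(disease | positive test) = {p:.4f}")
```

This agrees with the head count of 95 true positives out of 5,090 positive tests, i.e. just under 2%.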

To translate this back to common sense: we can't simply look at the likelihood of the evidence, we have to use it in context with what we believed beforehand - we use the evidence from event B (that is, P(B|A) and P(B) in general) to change what we already knew about event A (i.e. work out how to get from P(A) to P(A|B)). We call P(A) and P(not A) our prior distribution, and P(A|B) and P(not A|B) our posterior distribution, for obvious reasons. In light of evidence from B, we revised our estimate of the likelihood of the patient having the disease up, from 0.001 to about 0.02 (this is quite a big increase, which shows that although our patient still has a large probability of being healthy, the test was not useless).

If we find out another piece of evidence from event C, for example, if our patient tests negative in another test for the same disease, we can apply the same formula again to take us from P(A|B) to P(A|B and C) - the probability might be revised downwards, and how much it changes depends on how accurate the new test is: P(C|A) and P(C) generally.
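To see such a second update in action, here is a sketch under two assumptions that are not stated in the problem: the second test is also 95% accurate, and its result is independent of the first test once we know whether the patient is ill. The function name is made up.

```python
def update(prior, p_evidence_if_ill, p_evidence_if_healthy):
    """One Bayesian update: return P(ill | evidence)."""
    p_evidence = prior * p_evidence_if_ill + (1 - prior) * p_evidence_if_healthy
    return prior * p_evidence_if_ill / p_evidence


prior = 0.001

# First test comes back positive: P(+|ill) = 0.95, P(+|healthy) = 0.05
after_positive = update(prior, 0.95, 0.05)

# Second test (assumed 95% accurate) comes back negative:
# P(-|ill) = 0.05, P(-|healthy) = 0.95
after_negative = update(after_positive, 0.05, 0.95)

print(f"After the positive test:  {after_positive:.4f}")
print(f"After the negative test:  {after_negative:.4f}")
```

With these particular numbers, the negative result exactly cancels the positive one and the probability returns to the prior of 0.001. A more accurate second test would push it lower still.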

Courtroom Drama

The fallacy of P(A|B) = P(B|A), as we saw above, can be dangerous in the medical profession - getting such probabilities wrong can mean lots of unnecessary tests, unwarranted stress, and even overtreatment. However, there is another situation where the mistake can be even more costly.

Suppose you were on the jury of a criminal trial. The prosecution has just informed you that the DNA sample from the crime scene matches the defendant's DNA. The chance of this happening to a random person is 1 in 10 million. How likely is the defendant to be innocent?

Hopefully, you will have learned enough from the last exercise to not say "1 in 10 million" (this is the so-called "Prosecutor's Fallacy"). If you've been paying attention, you will realise that we need to know how likely the defendant is to have committed the crime without this piece of information.

If the suspect was recognised by witnesses, had clear motives or a shaky alibi, then the DNA evidence is going to be pretty conclusive (try calculating how this might change a prior probability of guilt of 0.6). If, however, there was nothing else tying the defendant to the crime - if without this evidence, they are no less likely to be innocent than any other person in a large population - then the prior probability is so tiny that they can't be convicted on this evidence alone. In a population of 60 million, all of whom are equally likely to be guilty, there might be 6 matches to the DNA. Without anything to single out the suspect, they have a 5/6 chance of being innocent!
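Both scenarios can be tried out numerically. The sketch below assumes that the guilty person's DNA is certain to match (a simplification not stated above), and the function name is hypothetical.

```python
def posterior_guilt(prior_guilt, p_match_if_innocent=1e-7, p_match_if_guilty=1.0):
    """Return P(guilty | DNA match) for a given prior probability of guilt."""
    numerator = prior_guilt * p_match_if_guilty
    return numerator / (numerator + (1 - prior_guilt) * p_match_if_innocent)


# Other evidence already points to the defendant: prior of 0.6
strong_case = posterior_guilt(0.6)

# Nothing else singles out the defendant among 60 million people
no_other_evidence = posterior_guilt(1 / 60_000_000)

print(f"Prior 0.6:          P(guilty | match) = {strong_case:.8f}")
print(f"Prior 1/60 million: P(guilty | match) = {no_other_evidence:.3f}")
```

The second figure comes out near 1/7 rather than 1/6, because the formula counts the culprit separately from the six or so expected innocent matches, whereas the rough count above treats the six matches as including the culprit. Either way, the defendant is far more likely to be innocent than guilty on this evidence alone.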

Notice, however, that it would be equally fallacious to argue that because there are probably several other DNA matches out there, the evidence does not demonstrate the defendant's guilt and should be disregarded - it could well be that those people have a much lower probability of guilt, due to other evidence pointing towards the defendant. This is the "Defendant's Fallacy".

Unfortunately, juries are not often presented with statistics in a clear and non-misleading way. There have been many cases where a person who was likely to be innocent was convicted because of a lack of understanding about the subtleties of conditional probability. For more information on trials whose outcomes the prosecutor's fallacy may have affected, you can look up the Sally Clark case, the Lucia de Berk case and the Denis John Adams case.

Scientific Investigation

The scientific method itself is another example of Bayesian analysis, in some sense - you have a set of assumptions about the universe, which is your prior; you do experiments, and the results are your evidence; and based on them, you decide how to modify your assumptions, to reach your posterior. As you might expect, though, the probabilities are often not as easy to unravel as in the first example.

In a scientific hypothesis test, one usually assumes a null hypothesis H₀. Then an experiment and calculations are done to produce a p-value, which usually represents the probability that results at least as extreme as those obtained could have been obtained if H₀ were true. For example, one might have a null hypothesis of "this drug is no more effective than a placebo", and a p-value of 0.01 would mean "if this drug is no more effective than a placebo, then there is only a 1% chance that we could have got our results or better" - convincing evidence that the drug is probably more effective than a placebo.
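As a concrete illustration, here is a sketch of how a p-value might be computed for an entirely hypothetical trial: 100 matched pairs of patients, with the patient on the drug doing better in 60 of them. Under H₀, each pair is assumed to be a fair coin flip, so the p-value is a binomial tail probability. The numbers and the function name are invented for this example.

```python
from math import comb


def one_sided_p_value(successes, trials, p_null=0.5):
    """P(X >= successes) where X ~ Binomial(trials, p_null), i.e. under H0."""
    return sum(
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Hypothetical data: the drug "wins" in 60 of 100 pairs
p_value = one_sided_p_value(60, 100)
print(f"p-value = {p_value:.4f}")
```

For these made-up numbers the p-value comes out just under 0.03: if the drug were no better than the placebo, a result this good or better would turn up in roughly 3 experiments out of 100.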

It is important to bear in mind, however, that this only gives P(results obtained|H₀). In reality, P(H₀|results obtained) is a more useful number, as it actually gives the probability that H₀ is true. In most statistical tests, instead of calculating P(H₀|results obtained), we agree beforehand on a "significance level", for example 0.05, which is the maximum p-value for which we reject the null hypothesis. If the p-value is above the significance level, we conclude that there is not enough evidence to reject the null hypothesis.

It is important to note that the significance test is only a statistical tool - without using something like Bayes' theorem, we still cannot find out the actual probability that the null hypothesis is false! It should be very predictable by now that to find the probability we want, we'll want an estimate of how likely H₀ is to have been true in the first place.

The Prior Matters

It is a common mistake to believe that the p-value is the probability of H₀ being true. Most of the time this mistake is harmless - if we take the "H₀ is equally likely to be right or wrong" as our prior (called a uniform prior: one where all the possibilities are equally likely), and remember Bayes' formula

$$P(H_0|\text{results}) = \frac{P(H_0)}{P(\text{results})} P(\text{results}|H_0)$$

Then we find that the conversion factor between our two probabilities is $\frac{P(H_0)}{P(\text{results})}$. If we take the results as being very unlikely if H₀ is true and very likely if H₀ is not true, then, with a little careful estimation, we have 0.5 divided by a number that is a little different from 0.5 - which should give a ratio that is not too different from 1. This means that in the case where the uniform prior is a good approximation, the fallacy of taking the p-value as the probability of H₀ being true is not too damaging.
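As a rough check with made-up numbers (assumptions for illustration, not values from any study), take a uniform prior with P(results|H₀) = 0.03 and P(results|not H₀) = 0.9. The conversion factor is then

$$\frac{P(H_0)}{P(\text{results})} = \frac{0.5}{0.5 \times 0.03 + 0.5 \times 0.9} = \frac{0.5}{0.465} \approx 1.08$$

so P(H₀|results) is about 1.08 × 0.03 ≈ 0.032, which is close to the p-value itself.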

But what if the prior is, as is most likely, very different from the uniform? With some more careful estimating (or by trying out some sensible numbers), we realise that the conversion ratio is now likely to be quite different from one.

Consider the position of a researcher studying psychic ability. They carefully set up an experiment to investigate if two people might be able to communicate at a distance. They obtain a p-value of 0.03. In the researcher's opinion, the null hypothesis, here that the two people cannot psychically communicate, is correct with probability 0.7. What would their revised probability of H₀ be, using Bayes'? Would it be wise to reject H₀ based on this evidence?

Suppose another researcher came by the same data. They presume that psychic ability is almost impossible, and that the null is correct with probability 0.995. What would their revised probability of H₀ be, using Bayes'? Would it be wise to reject H₀ based on this evidence?
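One way to try out some numbers for these two researchers is sketched below. It rests on two assumptions that the questions leave open: the p-value of 0.03 is used as P(results|H₀), and the results are given a probability of 0.5 if H₀ is false. The second figure is purely illustrative, and the function name is made up.

```python
def posterior_null(prior_null, p_results_if_null, p_results_if_not_null):
    """Return P(H0 | results) for a given prior P(H0)."""
    p_results = (prior_null * p_results_if_null
                 + (1 - prior_null) * p_results_if_not_null)
    return prior_null * p_results_if_null / p_results


P_RESULTS_IF_NULL = 0.03      # the p-value, used as P(results | H0)
P_RESULTS_IF_NOT_NULL = 0.5   # assumed value, chosen for illustration only

open_minded = posterior_null(0.7, P_RESULTS_IF_NULL, P_RESULTS_IF_NOT_NULL)
sceptic = posterior_null(0.995, P_RESULTS_IF_NULL, P_RESULTS_IF_NOT_NULL)

print(f"Prior 0.7:   P(H0 | results) = {open_minded:.3f}")
print(f"Prior 0.995: P(H0 | results) = {sceptic:.3f}")
```

Under these assumptions, the first researcher ends up thinking H₀ is probably false, while the second still thinks it is very probably true - the same data, but opposite conclusions.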

This poses a problem for scientists investigating unlikely theories - the prior distribution is subjective. Different researchers will come to different conclusions based on their initial assumptions, even with the same experimental data. To take an extreme example, if you were certain that your null hypothesis was true, then P(H₀|evidence) stays at one (just put the numbers in the formula and see). Bayesian inference states that no evidence is going to change your view of the world if you were already so convinced of your views that nothing was going to change them. The significance test is only useful if you make sensible hypotheses, and use significance levels that are sensible for the hypotheses and the experiment.

A tool like Bayes' Theorem provides fairly good probabilistic analysis in most cases. However, when the prior is uncertain, there can be a lot of different interpretations of the same result. And in a black-and-white situation like a significance test, where you either reject a hypothesis or you don't, it is still, sadly, not enough.