A few days ago, news reports claimed that 16 per cent of cancers around the world were caused by infections. This isn't an especially new or controversial statement, as there's clear evidence that some viruses, bacteria and parasites can cause cancer (think HPV, which we now have a vaccine against). It's not inaccurate either. The paper that triggered the reports did indeed conclude that "of the 12.7 million new cancer cases that occurred in 2008, the population attributable fraction (PAF) for infectious agents was 16·1%".
But for me, the reports aggravated an old itch. I used to work at a cancer charity. We'd get frequent requests for such numbers (e.g. how many cancers are caused by tobacco?). However, whenever such reports actually came out, we got a lot of confused questions and comments. The problem is that many (most?) people have no idea what it actually means to say that X% of cancers are caused by something, where those numbers come from, or how they should be used.
Formally, these numbers – the population attributable fractions (PAFs) – represent the proportion of cases of a disease that could be avoided if something linked to the disease (a risk factor) were avoided. So, in this case, we're saying that if no one caught HPV or any other cancer-causing infection, then 16.1% of cancers would never happen. That's around 2 million cases attributable to these causes.
From answering enquiries and talking to people, I reckon that your average reader believes that we get these numbers because keen scientists examined lots of medical records, and did actual tallies. We used to get questions like "How do you know they didn't get cancer because of something else?" and "What, did they actually count the people who got cancer because of [insert risk factor here]?"
No, they didn't. Those numbers are not counts.
Those 2 million cases don't correspond to actual specific people. I can't tell you their names.
Instead, PAFs are the results of statistical models that mash together a lot of data from previous studies, along with many assumptions.
At a basic level, the models need a handful of ingredients. You need to know how common the risk factor is – so, for example, what proportion of cancer patients carry the relevant infections? You need to know how big the effect is – if someone is infected, their risk of cancer goes up by how many times? If you have these two figures, you can calculate a PAF as a percentage. If you also know the incidence of a cancer in a certain population during a certain year, you can convert that percentage into a number of cases.
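To make that recipe concrete, here's a minimal sketch of the arithmetic in Python. The formula is a standard textbook one (Miettinen's case-based version, which matches the two ingredients above); the infection prevalence and relative risk are numbers I've invented purely for illustration, and the real analyses behind the headline figure are far more elaborate.

```python
# A rough sketch of the PAF recipe, with made-up inputs.

def paf(prevalence_in_cases, relative_risk):
    """Miettinen's formula: PAF = p_c * (RR - 1) / RR, where p_c is the
    proportion of cancer patients who carry the risk factor and RR is how
    many times the factor multiplies cancer risk. (Levin's formula is the
    equivalent that starts from prevalence in the whole population.)"""
    return prevalence_in_cases * (relative_risk - 1) / relative_risk

# Hypothetical infection: carried by 25% of patients with a given cancer,
# with carriers at 3 times the cancer risk of non-carriers.
fraction = paf(prevalence_in_cases=0.25, relative_risk=3.0)
print(f"PAF: {fraction:.1%}")  # 16.7%

# Converting the percentage into a number of cases is just multiplication
# by incidence: here, the paper's headline figures of 16.1% of 12.7 million
# new cases worldwide in 2008.
print(f"Attributable cases: {0.161 * 12.7e6:,.0f}")  # ~2 million
```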
There's always a certain degree of subjectivity. Consider the size of the effect – different studies will produce different estimates, and the value you choose to put into the model has a big influence on the numbers that come out. And people who do these analyses will typically draw their data from dozens if not hundreds of sources.
In the infection example, some sources are studies that compare cancer rates among people with or without the infections. Others measure proteins or antibodies in blood samples to see who is infected. Some are international registries of varying quality. The new infection paper alone combines data from over 50 papers and sources, and some of these are themselves analyses of many earlier papers. Bung these all into one statistical pot, simmer gently with assumptions and educated guesses, and voila – you have your numbers.
This is not to say that these methods aren't sound (they are) or that these analyses aren't valuable (they can tell public health workers about the scale of different challenges). But it's important to understand what's actually been done, because it shows us why PAFs can be so easily misconstrued.
The numbers aren't about assigning blame.
For a start, PAFs don't necessarily add up. Many causes of cancer interact with one another. For example, being very fat and being very inactive can both increase the risk of cancer, but they are obviously linked. You can't calculate the PAFs for different causes of cancer, and bung them all into a nice pie chart, because the slices of the pie will overlap.
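To see why the slices overlap, here's a toy calculation with entirely invented numbers: two correlated risk factors can each claim roughly a third of cases on their own, yet removing both together only prevents about half.

```python
# Toy illustration of overlapping PAFs for two correlated risk factors, A and B.
# All numbers are invented; the point is the arithmetic, not the values.

groups = {
    # (exposed to A, exposed to B): share of the population
    (False, False): 0.5,
    (True,  False): 0.1,
    (False, True):  0.1,
    (True,  True):  0.3,   # the two exposures tend to go together
}
baseline_risk = 0.01       # cancer risk with neither exposure
rr_a, rr_b = 2.0, 2.0      # relative risks, assumed to multiply when combined

def average_risk(remove_a=False, remove_b=False):
    """Average cancer risk, optionally pretending an exposure never happened."""
    total = 0.0
    for (a, b), share in groups.items():
        rr = (rr_a if a and not remove_a else 1.0) * (rr_b if b and not remove_b else 1.0)
        total += share * baseline_risk * rr
    return total

observed = average_risk()
paf_a    = (observed - average_risk(remove_a=True)) / observed
paf_b    = (observed - average_risk(remove_b=True)) / observed
paf_both = (observed - average_risk(remove_a=True, remove_b=True)) / observed

print(f"PAF for A: {paf_a:.0%}, PAF for B: {paf_b:.0%}")  # 33% each
print(f"PAF for removing both: {paf_both:.0%}")           # 52%, not 67%
```

The two individual slices add up to about 67%, but the combined figure is only about 52%, because the cases shared by both exposures get counted twice.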
Cancers are also complex diseases. Individual tumours arise because of a number of different genetic mutations that build up over the years, potentially due to different causes. You can't take a single patient and assign them to a "radiation" or "infection" or "smoking" bucket. Those 16.1% of cancers that are linked to infections may also have other "causes". Cancer is more like poverty (caused by a number of events throughout one's life, some inherited and some not) than like malaria (caused by a very specific infection delivered via mosquito).
You can't find trends by comparing PAFs across different studies.
The latest paper tells us that 16.1% of cancers are attributable to infections. In 2006, a similar analysis concluded that 17.8% of cancers are attributable to infections. And in 1997, yet another study put the figure at 15.6%. If you didn't know how the numbers were derived, you might think: Aha! A trend! The number of infection-related cancers was on the rise but then it went down again.
That's wrong. All these studies relied on slightly different methods and different sets of data. The fact that the numbers vary tells us nothing about whether the problem of infection-related cancers has got 'better' or 'worse'. (In this case, the estimates are actually pretty close, which is reassuring. I have seen ones that vary more wildly. Try looking for the number of cancers caused by alcohol or poor diets, if you want some examples).
Unfortunately, we have this tricky habit of seeing narratives even when there aren't any. Journalists do this all the time. A typical interview would go like this: "So, you're saying infections cause 16.1% of cancers, but a few years ago, you said they cause 17.8% of cancers." And then, the best-case scenario would be: "So, why did it go down?" And the worst-case one: "Scientists are always changing their minds. How can we trust you if you can't get a simple thing like this right?"
The numbers are hard to compare, and obscure crucial information.
Executives and policy-makers love PAFs, and they especially love comparing them across different risk factors. They are nice, solid numbers that make for strong bullet points and eye-grabbing PowerPoint slides. They have a nasty habit of becoming influential well beyond their actual scientific value. I have seen them used as the arbiters of decisions, lined up on a single graphic that supposedly illustrates the magnitude of different problems. But of course, they do no such thing.
For a start, the PAF model relies on a strong assumption of causality. You're implying that the risk factor you're studying clearly causes the disease in question. That's warranted in some cases, including many of the infections discussed in the new paper. In others… well, not so much.
Here's an example. I could do two sets of calculations using exactly the same methods and tell you how many cases of cancer were attributable to radon gas, or not eating enough fruit and vegetables. A casual passer-by might compare the two, look at which number was bigger, and draw conclusions about which risk factor was more important. But this would completely obscure the fact that there is very strong evidence that radon gas causes cancer, but only tenuous evidence that a lack of fruit and vegetables does. Comparing the two numbers makes absolutely no sense.
There are other subtle questions you might also need to ask if you were going to commit money to a campaign, or call for policy changes, or define your strategy. How easily could you actually alter exposure to a risk factor? Does the risk factor cause cancers that have no screening programmes, or that are particularly hard to treat? Is it becoming more of a problem? PAFs obscure all of these issues. That would be fine if they were used appropriately, with due caution and caveats. But from experience, they're not.
What PAFs are good for
They're basically a way of saying that a problem is this big (I hold my hands about an inch apart), or that it's this big (they're a foot apart now) or THIS big (stretched out to the sides). They're our best guess based on the best available data. In the case of infections, the message is that they cause more cancers than people might expect.
I have no real problem with PAFs when they're used carefully, but I think that they're blunt instruments, often wielded clumsily. We could do a much better job at communicating what they actually mean, and how they are derived. I'd be happier if we quoted ranges based on confidence intervals. I'd be even happier if we stopped presenting them to one decimal place – that imbues them with a rigour that I honestly don't think they deserve. And if, whenever we talked about PAFs, we liberally used the suffix "-ish"? Well, I'd be this happy.