Mirages of Bias

A new scientific method called exploratory experimentation is being developed. Rather than attempting to predict results in advance (hypothesising), these experiments are run just to discover what happens, after which relevant conclusions can be drawn. The change came about as a result of growing concern about the inability to investigate environmental causes of cancer — previously, scientists were required to predict the factors they expected to influence cancer rates, an arduous task considering the range of candidates for such influence. Now, a database of cancer rate data can simply be compared alongside a host of possible factors, all simultaneously examined so that we can quickly see which of them cause cancer rates to increase.

Question: Why is the above paragraph nonsense?

One of the less intuitive (and most critical) aspects of the scientific method is the necessity of predicting results in advance. If we compile a nationwide database of cancer rates in different regions, as suggested above, alongside various environmental causes such as overhead power lines, inland water supply, etc., and find that cancer rates do increase significantly in areas with more oak trees, what does it matter if we said so in advance? The effect is still there, right?

Wrong. Maybe. Ultimately, this misunderstanding goes deeper, to concepts of probability and randomness. When we wish to investigate differences between two groups of people, or Petri dishes, or buckets of water, etc. that have been exposed to different experimental conditions, there are two types of differences. The first is genuine, systematic, biased difference between the two groups, which is precisely what we want to find. The second is the error we get from individual differences within each group. This error is random in the sense that its presence is completely independent of any grouping or experimental manipulation, and we hate this type of difference. It is the reason we need large sample sizes: the more individual points of data there are, the more likely their individual differences will cancel one another out, making the systematic effects much easier to detect in comparison.

Think of the experiment as an effort to understand a low-quality, noisy radio transmission – say, a ten-second repeating message. We want to retrieve the meaningful, systematic sounds from beneath the annoying static. Firstly, to have any hope of understanding it, we definitely need to listen to it multiple times — the more, the better. In fact, if we record a large number of the ten-second loops, and then play them all back at once (turning the volume knob way down), we would find the message surprisingly clear. This is because the static averages out at roughly the same level, since it is random in each ten-second loop, whereas the sounds of the message are biased to play at the same moment each loop. This is a beautiful illustration of randomness versus bias, and how large samples can help bias overcome randomness (the problem of unintended bias in science is a colossal one that will not be addressed here).
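The radio trick can be tried numerically. The following is only a minimal sketch under made-up assumptions: the "message" is an arbitrary square wave, the static is Gaussian noise with a standard deviation of 2, and the names `noisy_loop`, `average_loops` and `rms_error` are invented for this illustration.

```python
import random

def noisy_loop(message, noise_sd, rng):
    """One playback: the same message plus fresh random static on every sample."""
    return [s + rng.gauss(0.0, noise_sd) for s in message]

def average_loops(message, n_loops, noise_sd, rng):
    """Overlay n_loops recordings, then divide by n_loops (the volume knob)."""
    totals = [0.0] * len(message)
    for _ in range(n_loops):
        for i, sample in enumerate(noisy_loop(message, noise_sd, rng)):
            totals[i] += sample
    return [t / n_loops for t in totals]

def rms_error(estimate, message):
    """Root-mean-square gap between what we hear and the true message."""
    squared = [(e - m) ** 2 for e, m in zip(estimate, message)]
    return (sum(squared) / len(squared)) ** 0.5

rng = random.Random(0)  # fixed seed so the run is repeatable
# Assumed message: a square wave flipping between +1 and -1 every 10 samples.
message = [1.0 if (i // 10) % 2 == 0 else -1.0 for i in range(200)]

error_single = rms_error(average_loops(message, 1, 2.0, rng), message)
error_averaged = rms_error(average_loops(message, 400, 2.0, rng), message)
print("one loop:", error_single)
print("400 loops overlaid:", error_averaged)
```

With independent static, the leftover noise should shrink roughly with the square root of the number of loops, so 400 loops ought to leave about a twentieth of the static of a single loop. The message, being the same every time, is untouched by the averaging.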

How does this relate to the need for a hypothesis? Because error never goes away completely. We can only analyse the probability that a result represents a real effect (i.e. that the result is from bias, not randomness). Say you want to compare two types of arrows. Two archers matched for strength fire them off at the same angle, at the same time. One arrow goes 103 metres, the other goes 104 metres. “Our experiment shows the second arrow to be superior,” you might confidently conclude in your report. But is this really justified? We can’t perfectly match the archers on their strength and ability – surely some inaccuracy in our effort to match them could account for this less-than-one-percent difference in distance?

So maybe we accept that a one-metre gain isn’t good enough. But where would we have drawn the line? At what point are we justified in concluding that the second arrow is any different to the first? Statistics, as well as some basic research methodology, are dedicated to answering this question, and in the end they can only give us a probability: how likely a difference at least this large would be if chance alone were at work (we want this to be as small as possible). The “cut-off” point for this probability is a matter of convention rather than principle. The widespread choice is p < 0.05, i.e. we accept that the second arrow is better than the first only if chance alone would produce a difference this large less than 5% of the time (we call this “statistically significant”).

Of course, that’s still a 5% chance. That means that when there is no genuine difference at all, 5% of the time we will still accept an effect as genuine, say the superiority of an arrow, when it is actually just random chance, merely the result of individual errors.

And now we can understand the importance of prediction, because probabilities are only meaningful in advance. Imagine we are comparing twenty arrows to find one that is superior. We shoot them out of twenty identical bow machines, indoors, at the same angle, and sure enough we find one arrow that beat all the others with statistical significance. If that was the exact arrow we had predicted would do better, then that would mean something. If we didn’t predict any arrow, but were performing an “exploratory experiment” to find the best one, then it’s less impressive. The probability of that particular arrow winning due to chance may have been 5% or less, but we just fired twenty of them. The chance that some arrow among the twenty does significantly better than the others by luck alone is much higher than 5%.
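To put a rough number on "much higher than 5%", here is a small sketch. It assumes, purely for illustration, that the twenty arrows amount to twenty independent tests, each with a 5% chance of a false alarm when no arrow is genuinely better. Real comparisons between arrows would not be perfectly independent, so treat the figure as a ballpark. The function names are invented for this example.

```python
import random

def chance_of_false_alarm(n_tests, alpha=0.05):
    """Probability that at least one of n_tests independent null tests
    comes out 'significant' at level alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests

def simulate_false_alarms(n_tests, alpha=0.05, n_experiments=20_000, seed=1):
    """Monte Carlo check: when nothing is going on, each test's p-value is
    uniform on [0, 1), so a false alarm is simply a draw below alpha."""
    rng = random.Random(seed)
    experiments_with_alarm = 0
    for _ in range(n_experiments):
        if any(rng.random() < alpha for _ in range(n_tests)):
            experiments_with_alarm += 1
    return experiments_with_alarm / n_experiments

analytic = chance_of_false_alarm(20)
simulated = simulate_false_alarms(20)
print(round(analytic, 3))  # 0.642
print(simulated)           # should land close to the analytic value
```

Under these assumptions, an unpredicted "significant" arrow turns up in roughly two out of every three experiments in which no arrow is actually any better.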

As with any methodology, there are other answers. If we fired the twenty arrows over and over, randomising which machines they were fired out of and randomising the line-up, then a few genuinely superior arrows would tend to win more while the chance-winners would tend to cancel out their own successes with just as many losses. Averaging a large number of trials could reveal a more reliable effect, just like overlaying the bad radio signal from above.
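A toy version of the repeated firing, with invented numbers: twenty arrows, of which arrow 7 is assumed to be genuinely one metre better on average, and every shot scattered by random error with a standard deviation of three metres. Machine and line-up differences are folded into that random error, which is what the randomisation is meant to achieve.

```python
import random

rng = random.Random(3)  # fixed seed so the run is repeatable
n_arrows, n_rounds = 20, 2000
noise_sd = 3.0                    # assumed shot-to-shot scatter, in metres
true_mean = [100.0] * n_arrows
true_mean[7] = 101.0              # hypothetical: arrow 7 really is better

totals = [0.0] * n_arrows
round_wins = [0] * n_arrows
for _ in range(n_rounds):
    distances = [rng.gauss(mean, noise_sd) for mean in true_mean]
    round_wins[distances.index(max(distances))] += 1  # who won this round
    for i, d in enumerate(distances):
        totals[i] += d

averages = [t / n_rounds for t in totals]
best_on_average = averages.index(max(averages))
share_of_rounds_lost = 1 - round_wins[7] / n_rounds

print("best arrow on average:", best_on_average)
print("share of single rounds arrow 7 did NOT win:", share_of_rounds_lost)
```

In a setup like this the genuinely better arrow loses most individual rounds to some lucky rival, yet its average over many rounds should stand clear of the rest, because the lucky rivals are different each time.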

But in psychology, where individual differences are massive and the scope for repeated experiments is limited, this is not always viable, and so we need to predict things. Ironically, prediction increases our relative chances of finding a genuine significant effect, because without it we’d be finding illusory effects all over the place. The bad logic is the same as hitting a golf ball and claiming the ball was biased to land on the blade of grass that it did. Unless you were able to claim that beforehand, no one’s impressed: they know there were a vast number of other possible ball-hitting results (in this case, basically all of them) you would have deemed “significant”. The same goes for numerology, and various religious/supernatural claims.

Finally, let us tie this back to the hypothetical Great Cancer Database. There are just too many possible environmental factors, and not enough control over cancer itself, to justify any conclusions drawn from such a wide search. Sure, your analysis of the effects of oak trees may have revealed a very low probability of chance occurrence, maybe 0.1%, but your search looked at thousands of possibilities!
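A mirage like this is easy to manufacture. The sketch below builds a pretend database in which every number is pure noise: 30 regions and 1000 environmental factors (both sizes are arbitrary assumptions), none of which has anything to do with the made-up cancer rates. It then compares the one factor we "named in advance" (simply the first in the list) with the strongest link found by searching all of them. `pearson_r` is an ordinary correlation coefficient written out by hand.

```python
import random

def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(2011)         # fixed seed so the run is repeatable
n_regions, n_factors = 30, 1000   # hypothetical database size

# Everything is random: no factor has any real effect on the cancer rates.
cancer_rates = [rng.gauss(0, 1) for _ in range(n_regions)]
factors = [[rng.gauss(0, 1) for _ in range(n_regions)]
           for _ in range(n_factors)]

strengths = [abs(pearson_r(factor, cancer_rates)) for factor in factors]
predicted_in_advance = strengths[0]    # the single factor named beforehand
best_found_afterwards = max(strengths)  # the "oak trees" of this database

print("factor predicted in advance:", predicted_in_advance)
print("strongest factor after searching:", best_found_afterwards)
```

The strongest factor found after the fact will look like a substantial effect even though, by construction, there is nothing there to find.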

We expect these mirages of bias. In fact, a complete lack of them would suggest that cancer is spread incredibly uniformly, like with a butter knife, with no random clusters — our bias-focused minds often believe that this is how randomness should work. This is why science exists, and exploratory science does not.

~ by Grimrukh on August 3, 2011.
