The True Value of P-value

Last year, at the end of October, we attended the school Analysis of patterns in Sicily. There were several interesting new things that I personally took home with me from this event, but the thing that probably provided most food for thought for me was the notion of "multiple hypothesis testing". I've long been meaning to write something down on this topic, but it's only now that I've found the time and desire to do it. Note that I haven't yet taken the trouble of reading more in-depth literature about this, so what follows is not necessarily flawless, but here we go.

Suppose that you are given some string of length 100 consisting of characters {A, T, C, G} and you are asked whether it was produced randomly or is a magic string. According to recent magic research, magic strings are known to contain the substring GATGAG 5 times in them. The problem is: how do you test whether the string is magic or not?

If the string does not contain the given substring the given number of times, that's simple: you just state that it's not magic. But what if it does? Then you don't really know. Most probably the string is magic, but it's also possible that a random string has GATGAG repeated 5 times. So a consistent answer in this case can only be probabilistic: you say that you believe the string is magic and indicate the chance that you are mistaken. The measure of this chance (loosely speaking, the probability that pure randomness would fool you into calling the string magic) is known as the "p-value". In your case the p-value will represent the probability for a random string of length 100 to have GATGAG repeated 5 times. Once you calculate it and see that p is less than, say, 2^-20, you may state that the string is most probably magic after all. Done.

The next day you are involved in a search for extraterrestrial activity and once again you are analyzing strings. This time you are interested in whether a given string has come from an intelligent martian or is just random noise. It is well known that a true martian string must always contain a substring of length at least 5 repeated at least 3 times. You have analyzed your data and found that the substring ATATA is repeated 3 times in it. The question is: how sure can you be that this is a martian string? Remembering your previous work, you calculate the p-value as the probability for a random string to have ATATA repeated 3 times. You see that it is somewhere around 2^-10 and conclude that, unless we believe in miracles, the string must be extraterrestrial.

There is one terribly unintuitive mistake in the previous example. The fact that you were searching for any repeating substring of length at least 5 means that the correct p-value, the one you could use to test your results for significance, must represent the probability that a random string would contain some substring of length at least 5 repeated at least 3 times. And this is way greater than the probability of having exactly the ATATA string repeating! It might very well be over 10 to 20% (sorry, I don't really know the exact value).
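To get a feeling for the size of this gap, here is a small simulation sketch. Everything specific in it is my assumption rather than part of the story: strings of length 100 with uniformly random letters (the length is borrowed from the magic-string example), overlapping occurrences counted as separate repeats, and function names made up for illustration. Note that a longer substring repeated 3 times always contains a length-5 substring repeated 3 times, so checking length exactly 5 is enough.

```python
import random
from collections import Counter

ALPHABET = "ATCG"

def random_string(length, rng):
    """Uniformly random string over ALPHABET (assumed null model)."""
    return "".join(rng.choice(ALPHABET) for _ in range(length))

def count_overlapping(text, pattern):
    """Number of positions where pattern occurs; overlaps are counted."""
    last_start = len(text) - len(pattern)
    return sum(text.startswith(pattern, i) for i in range(last_start + 1))

def max_kmer_count(text, k):
    """Largest occurrence count among all substrings of length k."""
    counts = Counter(text[i:i + k] for i in range(len(text) - k + 1))
    return max(counts.values()) if counts else 0

def estimate(length=100, k=5, repeats=3, pattern="ATATA",
             trials=20000, seed=0):
    """Monte Carlo estimates of two different 'p-values':
    P(this particular pattern repeats) and P(some k-mer repeats)."""
    rng = random.Random(seed)
    specific = 0
    any_pattern = 0
    for _ in range(trials):
        s = random_string(length, rng)
        if count_overlapping(s, pattern) >= repeats:
            specific += 1
        if max_kmer_count(s, k) >= repeats:
            any_pattern += 1
    return specific / trials, any_pattern / trials

if __name__ == "__main__":
    p_specific, p_any = estimate()
    print("P(ATATA repeated >= 3 times):      ", p_specific)
    print("P(some 5-mer repeated >= 3 times): ", p_any)
```

The second number is the one that matches the question you were actually asking. The estimates are only as good as the null model, so treat them as an illustration rather than as the exact value.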

But alright, suppose that you have corrected the glitch, and even managed to find a martian civilization. The martians were so happy to find out that they are not alone in the universe that they started sending enormous numbers of messages, and now you are facing the following problem: you have 1000 messages and you need to find the martian messages among them. Remembering your previous experience, you correctly calculate that the probability for a random message to look martian is somewhere about 1%, you are happy with it, and you select the 10 messages that had properly repeating substrings.

Hopefully, the reader can already guess the mistake in the previous paragraph. Although the probability for a single random string to look martian is only 1%, among 1000 purely random strings you would expect about 1000 × 1% = 10 to look martian by chance alone. Finding ten such strings is therefore exactly what random noise would produce, and it tells you nothing. In fact, there is no way left to assess "significance" properly, because the probability of finding some martian-looking strings in such amounts of random data is too large.
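The arithmetic behind this can be checked in a few lines. This is just a sketch under the assumption that the 1000 messages are independent and each random one looks martian with probability exactly 1%; the function name is mine.

```python
from math import comb

def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), via the complement sum."""
    below = sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k))
    return 1.0 - below

n_messages = 1000
false_positive_rate = 0.01  # chance that a random string looks martian

# Expected number of martian-looking strings in pure noise.
expected_false_hits = n_messages * false_positive_rate

# Chance that pure noise yields at least 1, or at least 10, such strings.
p_at_least_one = prob_at_least(1, n_messages, false_positive_rate)
p_at_least_ten = prob_at_least(10, n_messages, false_positive_rate)

if __name__ == "__main__":
    print("expected false hits:", expected_false_hits)
    print("P(at least 1 hit in noise): ", p_at_least_one)
    print("P(at least 10 hits in noise):", p_at_least_ten)
```

Under these assumptions, getting ten or more hits from noise alone happens more often than not, which is why the ten selected messages prove nothing.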

And what if you knew that martian strings could have substrings repeating 3, 4, or 5 times? A string having 5 repetitions is certainly more special than one having 3, right? Does it mean that the probability of finding one of the interesting patterns in random data is lower if the pattern you've found is very complex? I'd say no. The concept of p-value, as it is used in standard hypothesis testing, simply does not apply to pattern search problems.

When you are searching for "something interesting", you are more-or-less bound to find this "something". Don't be misled into thinking that you are "deciding upon the significance of a pattern depending on how probable it is for you to find it in random data". That's simply not true. Don't imagine that you work with "probabilities". It's better to think that you just have some "goodness measure" assigned to each potential pattern, and you are interested in finding the patterns with maximal goodness. Yes, it is reasonable to derive the goodness measure using probabilistic thinking, but once you've found the pattern, don't interpret its measure as a probability.

To mess things up completely I'll conclude with yet another example. There is this popular general pattern-search technique that goes like this:

  • Mine the data for all patterns present there. To each pattern assign a "p-value", which measures the probability of finding exactly this pattern in random data.
  • Decide upon the "threshold" t: events "more probable" than t will be considered insignificant.
  • Select all patterns whose p-value is less than the threshold.

If you've read the whole text attentively enough, you'll see the same problem in this example: both the probabilities and the threshold represent the case of searching a single data item for a single pattern, whereas in fact we are searching for any good patterns in the whole dataset. The probabilities and threshold in this case are completely different. In order to assess them you should generate the same amount of random data, find the best patterns in it, and only then compare the results.
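Here is a rough sketch of what that last suggestion could look like for the martian strings. The "goodness" of a dataset is taken to be the highest repeat count of any length-5 substring in any of its strings, and the null model is uniformly random letters; both choices, and all the names, are my assumptions for illustration.

```python
import random
from collections import Counter

ALPHABET = "ATCG"

def best_repeat_count(text, k=5):
    """Goodness of one string: highest count of any length-k substring."""
    counts = Counter(text[i:i + k] for i in range(len(text) - k + 1))
    return max(counts.values()) if counts else 0

def dataset_best(strings, k=5):
    """Goodness of the best pattern found anywhere in the dataset."""
    return max(best_repeat_count(s, k) for s in strings)

def empirical_p_value(observed_best, n_strings, length,
                      k=5, n_null=200, seed=0):
    """Fraction of random datasets of the same size whose best pattern
    is at least as good as the observed one (with the usual +1
    correction so the estimate is never exactly zero)."""
    rng = random.Random(seed)
    at_least_as_good = 0
    for _ in range(n_null):
        null_data = [
            "".join(rng.choice(ALPHABET) for _ in range(length))
            for _ in range(n_strings)
        ]
        if dataset_best(null_data, k) >= observed_best:
            at_least_as_good += 1
    return (at_least_as_good + 1) / (n_null + 1)

if __name__ == "__main__":
    # Hypothetical observation: the best pattern in 1000 messages of
    # length 100 repeats 3 times.
    print(empirical_p_value(observed_best=3, n_strings=1000, length=100))
```

The point of the design is that the random data goes through exactly the same "search for the best pattern" step as the real data, so the comparison is between like and like.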

That's it. Hope I've given you some food for thought. As I've already noted, I myself am a bit confused by this topic, and I might be wrong about the whole idea, so your comments are welcome.

PS: There are some nice ideas related to the problem such as the "Bonferroni correction" or the "Šidák correction", which basically state that you should use much lower thresholds. However, my main point here is the problem of interpretation, and I don't see how any correction of thresholds could solve that.
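For the curious, the two corrections are short enough to write down. This is a minimal sketch using the standard formulas, where alpha is the overall error rate you are willing to tolerate and m is the number of tests; the function names are mine.

```python
def bonferroni_threshold(alpha, m):
    """Per-test threshold that keeps the chance of any false positive
    among m tests at or below alpha (no independence needed)."""
    return alpha / m

def sidak_threshold(alpha, m):
    """Per-test threshold giving an overall false positive chance of
    exactly alpha when the m tests are independent."""
    return 1.0 - (1.0 - alpha) ** (1.0 / m)

if __name__ == "__main__":
    alpha, m = 0.05, 1000  # e.g. the 1000 messages from above
    print("Bonferroni:", bonferroni_threshold(alpha, m))
    print("Sidak:     ", sidak_threshold(alpha, m))
```

Both thresholds come out far below the naive 1% used in the martian example, and the Šidák one is always slightly more permissive than Bonferroni.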