## The True Value of P-value

Last year, at the end of October, we attended the "Analysis of Patterns" school in Sicily. I took home several interesting new things from this event, but the one that probably gave me the most food for thought was the notion of "multiple hypothesis testing". I have long been meaning to write something down on this topic, and it's only now that I've found the time and desire to do it. Note that I haven't yet taken the trouble to read the more in-depth literature on the subject, so what follows is not necessarily flawless, but here we go.

Suppose that you are given a string of length 100 over the alphabet `{A, T, C, G}`, and you are asked whether it was produced randomly or whether it is a *magic string*. According to recent magic research, *magic strings* are known to contain the substring `GATGAG` repeated 5 times. The problem is: how do you test whether the string is magic or not?

If the string does not contain the given substring the given number of times, that's simple: you just state that it's *not magic*. But what if it does? Then you don't really know. Most probably the string *is* magic, but it's also possible that a *random* string happens to have `GATGAG` repeated 5 times. So a consistent answer in this case can only be probabilistic: you say that you believe the string is magic and indicate the chance that you are mistaken. The measure of this chance (which is *nearly* the same as the probability of making a mistake) is known as the *p*-value. In our case the *p*-value is the probability for a random string of length 100 to have `GATGAG` repeated 5 times. Once you calculate it and see that *p* is less than, say, 2^{-20}, you may state that the string is most probably *magic* after all. Done.
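
To make this first example concrete, here is a rough sketch of how one might compute such a *p*-value. The function names are my own, and I use a Poisson approximation for the number of occurrences, which ignores overlap effects between occurrences; that's a common rough model and is fine for a pattern like `GATGAG`:

```python
import math

def count_occurrences(s, pat):
    """Number of (possibly overlapping) occurrences of pat in s."""
    return sum(s[i:i + len(pat)] == pat for i in range(len(s) - len(pat) + 1))

def magic_pvalue(n=100, pat="GATGAG", k=5):
    """Rough P(a uniform random {A,T,C,G} string of length n contains
    pat at least k times), treating the occurrence count as Poisson
    with the matching mean (this ignores overlap effects)."""
    lam = (n - len(pat) + 1) * 0.25 ** len(pat)  # expected number of occurrences
    return 1 - sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))
```

For these numbers the result comes out far below the 2^{-20} bar, so a string with five occurrences would indeed look magic under this test.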

The next day you are involved in a search for extraterrestrial activity, and once again you are analyzing strings. This time you are interested in whether a given string has come from an intelligent martian or is just random noise. It is well known that a true martian string must always contain a substring of length at least 5 repeated at least 3 times. You have analyzed your data and found that the substring `ATATA` is repeated 3 times in it. The question is: how sure can you be that this is a martian string? Remembering your previous work, you calculate the *p*-value as the probability for a random string to have `ATATA` repeated 3 times. You see that it is somewhere around 2^{-10} and conclude that, unless we believe in miracles, the string must be extraterrestrial.

There is one terribly unintuitive mistake in the previous example. The fact that you were searching for *any* repeating substring of length at least 5 means that the correct *p*-value, the one you should use to test your results for significance, must be the probability that a random string contains *some* substring of length at least 5 repeated at least 3 times. And this is way greater than the probability of having exactly the `ATATA` string repeating! It might very well be over 10-20% (sorry, I don't know the exact value).
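
A quick Monte Carlo simulation shows the gap between the two probabilities. The sketch below (function names and trial count are my own choices) estimates both the chance that a random string contains `ATATA` at least three times and the chance that it contains *some* length-5 substring repeated at least three times:

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the run is reproducible

def random_dna(n=100):
    return "".join(random.choice("ATCG") for _ in range(n))

def occurrences(s, pat):
    """Number of (possibly overlapping) occurrences of pat in s."""
    return sum(s[i:i + len(pat)] == pat for i in range(len(s) - len(pat) + 1))

def has_some_repeat(s, length=5, k=3):
    """Does *any* substring of the given length occur at least k times?"""
    counts = Counter(s[i:i + length] for i in range(len(s) - length + 1))
    return max(counts.values()) >= k

trials = 5000
p_specific = sum(occurrences(random_dna(), "ATATA") >= 3 for _ in range(trials)) / trials
p_any = sum(has_some_repeat(random_dna()) for _ in range(trials)) / trials
```

In my runs `p_specific` is essentially zero, while `p_any` lands in the region of the 10-20% mentioned above: searching for *any* pattern is a very different event from searching for one fixed pattern.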

But alright, suppose that you have corrected the glitch and even managed to find a martian civilization. The martians were so happy to find out that they are not alone in the universe that they started sending enormous numbers of messages, and now you are faced with the following problem: you have 1000 messages and you need to find the martian ones among them. Remembering your previous experience, you correctly calculate that the probability for a random message to look martian is somewhere around 1%, you are happy with it, and you select the 10 messages that had properly repeating substrings.

Hopefully, the reader can already guess the mistake in the previous paragraph. Although the probability for a *single* random string to look martian is only 1%, the expected number of such strings among 1000 random ones is exactly 10, so finding ten of them is not significant at all. In fact, there is *no way* left to assess "significance" properly, because the probability of finding *some* martian-looking strings in such an amount of random data is too large.
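
The arithmetic here is simple enough to check directly: if each random message independently looks martian with probability 1%, the number of false martians among 1000 messages follows a Binomial(1000, 0.01) distribution, whose tail can be computed with nothing but the standard library:

```python
import math

def binom_tail(n, p, k):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Chance that pure noise alone produces at least ten "martian" messages
p_ten_or_more = binom_tail(1000, 0.01, 10)
```

The probability of seeing at least ten false positives comes out above one half: your ten "martian" messages are exactly what pure noise would be expected to produce.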

And what if you knew that martian strings could have substrings repeating 3, 4, or 5 times? A string with 5 repetitions is certainly more special than one with 3, right? Does that mean the chance of finding *one of the interesting patterns* in random data is smaller when the pattern you've found is more complex? I'd say **no**. The concept of the *p*-value, as it is used in standard hypothesis testing, simply does *not* apply to pattern search problems.

When you are searching for "something interesting", you are more-or-less *bound* to find this "something". Don't be misled by thinking that you are "deciding upon the significance of a pattern depending on how probable it is for you to find it in random data". That's simply not true. Don't imagine that you work with "probabilities". It's better to think that you just have some "goodness measure" assigned to each potential pattern, and you are interested in finding the patterns with maximal goodness. Yes, it is reasonable to derive the goodness measure using probabilistic thinking, but *once you've found the pattern* don't interpret its measure as probability.

To mess things up completely, I'll conclude with yet another example. There is a popular general pattern-search technique that goes like this:

- Mine the data for all patterns present in it. For each pattern, assign a *p*-value, which measures the probability of finding *exactly this* pattern in random data.
- Decide upon a threshold *t*: events "less probable" than *t* will be considered significant.
- Select all patterns whose *p*-value is less than the threshold.

If you've read the whole text attentively enough, you'll see the same problem in this example: both the probabilities and the threshold represent the case of searching a single data item for a single pattern, whereas in fact we are searching for *any good* patterns in the *whole* dataset. The probabilities and threshold in this case are completely different. In order to assess them you should generate the *same* amount of random data, find the best patterns in it, and only then compare the results.
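
That comparison could be sketched as follows. The "goodness" measure here is a toy one of my own choosing (the highest occurrence count of any fixed-length substring across the whole dataset); the point is only the shape of the procedure, where the null distribution is built from whole-dataset searches:

```python
import random
from collections import Counter

def best_pattern_score(strings, length=5):
    """Toy 'goodness' of the best pattern in a dataset: the highest total
    occurrence count of any substring of the given length."""
    counts = Counter()
    for s in strings:
        counts.update(s[i:i + length] for i in range(len(s) - length + 1))
    return max(counts.values())

def empirical_pvalue(observed_score, n_strings, string_len, trials=100):
    """Chance that searching an equally sized *random* dataset for its
    best pattern yields a score at least as good as the observed one."""
    null_scores = []
    for _ in range(trials):
        data = ["".join(random.choice("ATCG") for _ in range(string_len))
                for _ in range(n_strings)]
        null_scores.append(best_pattern_score(data))
    return sum(s >= observed_score for s in null_scores) / trials
```

Because each null sample is itself the result of a full "find the best pattern anywhere" search, the comparison with the observed score is apples to apples.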

That's it. I hope I've given you some food for thought. As I've already noted, I myself am a bit confused by this topic, and I might be wrong about the whole idea, so your comments are welcome.

PS: There are some nice ideas related to the problem, such as the "Bonferroni correction" or the "Šidák correction", which basically state that you should use much lower thresholds. However, my main point here is the problem of interpretation, and I don't see how any correction of thresholds could solve that.
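
For completeness, both corrections are one-liners: they shrink the per-test threshold so that the chance of *any* false positive among *m* tests stays below the original level α. This fixes the arithmetic, though not, as said, the interpretation:

```python
def bonferroni(alpha, m):
    """Per-test threshold via the union bound; valid under any dependence."""
    return alpha / m

def sidak(alpha, m):
    """Slightly looser per-test threshold, assuming the m tests are independent."""
    return 1 - (1 - alpha) ** (1 / m)
```

With α = 0.05 and m = 1000 tests, Bonferroni gives a per-test threshold of 5·10^{-5}, and Šidák a hair more.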