### Statistical significance

**Statistical significance** is the notion used in hypothesis testing to reject or accept hypotheses. Generally, a null hypothesis *H0* may be rejected and an alternative hypothesis *H1* accepted if the observed events provide sufficient statistical evidence favouring *H1* over *H0*. Statistical significance is calculated as a **p-value**, the probability of observing the given event or any more extreme event, assuming the null hypothesis is true.

In graphical output, significant results are printed in black and insignificant results in gray. In textual output, significant p-values are preceded by an exclamation mark (`!`).

In g:GOSt, the analysis tests **the following pair of hypotheses**:

**Null hypothesis H0**: Given a set of input genes *G* as the query, and the set of genes *Q* associated to a GO (KEGG, TRANSFAC, ...) term *T*, the number of elements common to *G* and *Q*, i.e. the intersection *G&Q*, has probably appeared by random chance.

**Alternative hypothesis H1**: The intersection *G&Q* is proportionally large enough that it has probably not appeared by random chance, and the event may therefore reveal relevant biological information about the given gene group, i.e. overrepresented GO terms, KEGG pathways, regulatory motifs, etc.

g:GOSt uses **Fisher's one-tailed test**, also known as **cumulative hypergeometric probability**, as the p-value measuring the randomness of the observed intersection *G&Q*. Generally, the smaller the p-value, the stronger the evidence that the match between the term and the input query is meaningful and has not appeared by mere chance. The p-value is the sum of the probability of the observed intersection and the probabilities of all larger, more extreme intersections.
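The cumulative hypergeometric probability described above can be sketched in a few lines of Python. This is a minimal illustration, not g:GOSt's own implementation: the function name `hypergeom_tail` and its parameter names are hypothetical, and the background population is assumed to be the set of all annotated genes.

```python
from math import comb


def hypergeom_tail(overlap, query_size, term_size, population):
    """Return P(X >= overlap) for a hypergeometric draw.

    Assumed (hypothetical) parameters:
      overlap     size of the intersection G&Q
      query_size  number of genes in the query G
      term_size   number of genes annotated to the term (Q)
      population  number of genes in the background set
    """
    max_overlap = min(query_size, term_size)
    # Number of ways to pick the query from the whole background.
    total = comb(population, query_size)
    # Sum the observed intersection and all larger, more extreme ones.
    favourable = sum(
        comb(term_size, i) * comb(population - term_size, query_size - i)
        for i in range(overlap, max_overlap + 1)
    )
    return favourable / total


# Toy numbers: 10 background genes, 4 annotated to the term,
# a query of 3 genes, 2 of which are annotated.
print(hypergeom_tail(2, 3, 4, 10))  # (6*6 + 4*1) / 120, i.e. one third
```

An intersection of zero always gives a p-value of 1, since every possible outcome is at least that extreme.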

The null hypothesis may be rejected if the p-value is sufficiently small. The borderline between "normal" and "significantly small" is called the **significance threshold**. Traditional significance thresholds include 0.05, 0.01, and 0.001.

The problem of **multiple testing** occurs when an analysis involves many rounds of testing the same pair of hypotheses. Intuitively, in a long series of tests one will sooner or later observe a small p-value that has occurred purely by chance. It is therefore reasonable to lower the significance threshold as the number of performed tests grows.
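A short calculation makes this intuition concrete. The sketch below assumes that the tests are independent, which is a simplification that generally does not hold for overlapping terms such as those in GO. The function name `prob_any_false_positive` is hypothetical.

```python
def prob_any_false_positive(alpha, n_tests):
    """Chance of at least one p-value below alpha when every null
    hypothesis is true, ASSUMING the tests are independent."""
    return 1.0 - (1.0 - alpha) ** n_tests


# The chance of a spurious "significant" result grows quickly
# with the number of tests performed at an uncorrected threshold.
for n_tests in (1, 10, 100, 1000):
    print(n_tests, prob_any_false_positive(0.05, n_tests))
```

With a single test the chance equals the threshold itself, and with a hundred tests at 0.05 a false positive is almost certain.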

Every analysis of a gene list in g:GOSt involves a series of comparisons, as the intersection *G&Q* and the corresponding p-value are calculated for a large number of terms from GO, KEGG, TRANSFAC, and other data sources. Since all of these p-values are compared against a threshold, the g:GOSt analysis involves multiple testing, which makes uncorrected traditional significance thresholds unreliable.

g:GOSt uses **multiple testing correction algorithms** to distinguish significant results from random matches. These include **Bonferroni correction**, **Benjamini-Hochberg False Discovery Rate**, and the g:GOSt native method **g:SCS**. The latter is used by default.

In multiple testing correction, all p-values in a series of tests are transformed to more conservative values based on the number and distribution of the initial p-values. In g:GOSt, the cutoff value after correction is *0.05*, denoting the accepted level of **false positives** in a normal g:GOSt analysis.
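To illustrate how p-values are transformed, here is a minimal sketch of the two standard corrections in their textbook form. It is not taken from g:GOSt and may differ from its implementation in detail. The function names are hypothetical, and the native g:SCS method is not shown because its algorithm is not described in this section.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Textbook Benjamini-Hochberg adjusted p-values, in input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so that adjusted values
    # never increase as the raw p-values get smaller.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.5]  # made-up p-values for illustration
cutoff = 0.05
print([p <= cutoff for p in bonferroni(raw)])
print([p <= cutoff for p in benjamini_hochberg(raw)])
```

In this toy example only the smallest p-value stays below the 0.05 cutoff after either correction, although three of the four raw values were below it.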