g:Profiler – a web server for functional enrichment analysis and conversions of gene lists

Welcome to g:Profiler

g:Profiler is a public web server for characterising and manipulating gene lists. g:Profiler has a simple user-friendly web interface with powerful visualisations and is currently available for 400+ species, including mammals, plants, fungi and insects from Ensembl and Ensembl Genomes. g:Profiler is updated approximately every three months and follows the quarterly releases of Ensembl databases. The g:Profiler tool set consists of the following tools:

g:GOSt, the core of g:Profiler, performs statistical enrichment analysis to provide interpretation of user-provided gene lists. The gene lists can be either flat or ordered. We accept the majority of identifier types, chromosomal regions and term IDs as input. We provide data from multiple sources of functional evidence, including Gene Ontology terms, biological pathways, regulatory motifs of transcription factors and microRNAs, human disease annotations and protein-protein interactions.
g:Convert is a gene identifier conversion tool. It uses information in Ensembl databases to handle hundreds of types of IDs for genes, proteins, transcripts, microarray probesets, etc, for many species, experimental platforms and biological databases. g:Convert is flexible: it accepts a mixed list of IDs and recognises their types automatically. It can also serve as a service to get all genes belonging to a particular functional category.
g:Orth is a tool for mapping homologous genes across related organisms based on Ensembl data. Given a selected target organism, g:Orth retrieves the genes of the target organism that are similar in sequence to the initial genes in the input.
g:SNPense is a tool for mapping human single nucleotide polymorphisms (SNP) to gene names, chromosomal locations and variant consequence terms from Sequence Ontology.

g:Profiler has an R package and other programmatic interfaces for integration into your codebase.

About g:Profiler

g:Profiler is developed and maintained in Estonia, at the University of Tartu, Institute of Computer Science, Bioinformatics, Algorithmics and Data Mining Group BIIT. Currently g:Profiler is developed and maintained by a team of professional software developers, statisticians and researchers: Uku Raudvere, Ivan Kuzmin, Liis Kolberg, Priit Adler, Hedi Peterson and Jaak Vilo. Previously, major contributors have been Jüri Reimand and Tambet Arak. Over time, g:Profiler has also received valuable contributions from BIIT members, notably Jaanus Hansen, Raivo Kolde, Meelis Kull and Sulev Reisberg. The first version of g:Profiler was known as GOSt (Gene Ontology Statistics) and became available in early 2005. The g:Profiler tool is freely available through a web application and various programmatic access points. g:Profiler is an ELIXIR project and its development is supported through the European Union European Regional Development Fund project "Estonian Life Science Infrastructure for Biological Information".

Publications and theses

Uku Raudvere, Liis Kolberg, Ivan Kuzmin, Tambet Arak, Priit Adler, Hedi Peterson, Jaak Vilo: g:Profiler: a web server for functional enrichment analysis and conversions of gene lists (2019 update) Nucleic Acids Research 2019; doi: 10.1093/nar/gkz369 [PDF].
Jüri Reimand, Tambet Arak, Priit Adler, Liis Kolberg, Sulev Reisberg, Hedi Peterson, Jaak Vilo: g:Profiler—a web server for functional interpretation of gene lists (2016 update) Nucleic Acids Research 2016; doi: 10.1093/nar/gkw199 [PDF].
Jüri Reimand, Tambet Arak, Jaak Vilo: g:Profiler—a web server for functional interpretation of gene lists (2011 update) Nucleic Acids Research 2011; doi: 10.1093/nar/gkr378 [PDF].
Jüri Reimand, Meelis Kull, Hedi Peterson, Jaanus Hansen, Jaak Vilo: g:Profiler—a web-based toolset for functional profiling of gene lists from large-scale experiments (2007) Nucleic Acids Research 35 Web Server issue [PDF].
Jüri Reimand: Functional analysis of gene lists, networks and regulatory systems (2010) Doctor of Philosophy (PhD) thesis, University of Tartu, Estonia. [PDF].
Jüri Reimand: Gene Ontology mining tool GOSt (2006) Master of Science thesis, University of Tartu [PDF].

Funding

We acknowledge financial support from Estonia’s Integration to the European Bioinformatics Infrastructure (ELIXIR); European Union through the Structural Fund (Project No 2014-2020.4.01.16-0271, ELIXIR); Estonian Research Council (IUT34-4); European Regional Development Fund for CoE of Estonian ICT research EXCITE projects. Estonian Scientific Computing Infrastructure has provided valuable computing resources.

Tech notes

Check out g:Profiler Beta for most recent developments and data updates.
Previous versions of g:Profiler software and data are available in the archives.

Support

Please use our online contact form or {biit.support}@ut.ee for technical questions and software support.

g:GOSt

g:GOSt performs functional profiling of gene lists using various kinds of biological evidence. The tool performs statistical enrichment analysis to find over-representation of information from Gene Ontology terms, biological pathways, regulatory DNA elements, human disease gene annotations, and protein-protein interaction networks.

g:GOSt uses Fisher's one-tailed test, also known as cumulative hypergeometric probability, as the p-value measuring the randomness of the intersection between the query and the ontology term. The p-value represents the probability of the observed intersection plus probabilities of all larger, more extreme intersections.
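As a rough illustration of the test described above, the following sketch computes the cumulative (upper-tail) hypergeometric probability in plain Python. The function name and the parameter names k, n, K and N are chosen here for illustration only; this is not g:Profiler's own code.

```python
from math import comb


def hypergeom_upper_tail(k: int, n: int, K: int, N: int) -> float:
    """Probability of an intersection of size k or larger.

    k -- observed intersection between the query and the term
    n -- number of query genes
    K -- number of genes annotated to the term
    N -- statistical domain size (all genes considered)
    """
    if k <= 0:
        return 1.0  # an intersection of at least 0 genes is certain
    largest = min(n, K)  # the intersection cannot exceed either set
    total = comb(N, n)   # number of possible queries of size n
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, largest + 1))
    return tail / total


if __name__ == "__main__":
    # Toy numbers: all 5 query genes fall into a term of 5 genes, domain of 10.
    print(hypergeom_upper_tail(k=5, n=5, K=5, N=10))
```

The sum runs over the observed intersection and every larger one, which is the "observed plus more extreme" reading given in the text.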

Using g:GOSt

Visual reference for g:GOSt [Results] -- Manhattan-like-plot:

The first visible output in g:GOSt is an interactive Manhattan plot that illustrates the enrichment analysis results (see the figure above). This example figure represents the result of a multiquery with two input gene lists. In the case of a single query, only one such Manhattan plot is generated. All the features remain the same.
The x-axis represents functional terms that are grouped and colour-coded by data sources (e.g. Molecular Function from GO is red; the sources that were not included in the analysis are shown in grey). The y-axis shows the adjusted enrichment p-values in negative log10 scale. The light circles represent insignificant terms (if available).
For example, the circle number 1 in the figure illustrates the enrichment of the term GO:0009888 (tissue development; includes 1980 annotated genes) and shows that the p-value in the corresponding query is 2.584x10^-10. This information is shown if hovering over the circle. The circle sizes are in accordance with the corresponding term size, i.e. larger terms have larger circles.
The term location on the x-axis is fixed and terms from the same GO subtree are located closer to each other (circles number 4 and 5). This is helpful for faster and more intuitive interpretation of the results. The number behind the source name in the x-axis labels shows how many significantly enriched terms there were from this source.
Clicking on a circle will pin the circle and create a result table below the figure. The circle will be highlighted with an identifier which is then also referred to in the table showing the detailed information such as data source, id and name of the term together with corresponding p-value. The selection can be removed by clicking again on the circle.
There are two views for Manhattan plot, capped and uncapped, which can be selected from the dedicated selection box in the upper left corner. By default, we show the capped version which collects the term circles with p-values less than 10^-16. This fixes the scale of y-axis to keep Manhattan plots from different queries comparable and is also intuitive as, statistically, p-values smaller than that can all be summarised as highly significant. The same threshold is also used in the statistical tests in R. This selection can also be switched off to show the p-values in a wider scale range.
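The capped and uncapped views can be pictured with a small sketch. It assumes that capping means clamping the plotted value at 16 on the negative log10 scale, which is our reading of the description above; the function name manhattan_y is hypothetical.

```python
import math

CAP = 16.0  # corresponds to a p-value of 10^-16 (assumed clamp level)


def manhattan_y(adjusted_p: float, capped: bool = True) -> float:
    """Return the y-axis value (-log10 of the adjusted p-value)."""
    if adjusted_p <= 0:
        y = math.inf  # a p-value reported as 0 has no finite -log10
    else:
        y = -math.log10(adjusted_p)
    return min(y, CAP) if capped else y


if __name__ == "__main__":
    for p in (2.584e-10, 1e-30):
        print(p, manhattan_y(p), manhattan_y(p, capped=False))
```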
In the case of a multiquery, a separate Manhattan plot for every input query is shown with the corresponding query names on top of the image (>polysomal and >total in the figure). These images are interlinked, so hovering over a term circle in any of the individual plots highlights the same term across all queries. Clicking will pin the corresponding circle in every plot. This makes it easy to compare the results from multiple queries, both in the image and in the results table.

Visual reference for [Detailed results] output:

Query

The default input of g:GOSt is a list of genes/proteins. g:GOSt accepts simple whitespace-separated gene lists that can consist of mixed types of gene IDs (proteins, transcripts, microarray IDs, etc), SNP IDs, chromosomal intervals or term IDs.

In the case of chromosomal regions, ENSG genes from these regions are retrieved automatically. Genes need not fit the region fully, and hence one may even study single nucleotides (SNPs). g:Profiler uses a chromosome:start:end format for chromosomal regions (e.g. X:1:2000000).
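The chromosome:start:end format and the "genes need not fit the region fully" rule can be sketched as follows. The names Region, parse_region and overlaps are hypothetical helpers written for this illustration, and the overlap test is our interpretation of the rule rather than g:Profiler's actual lookup.

```python
import re
from typing import NamedTuple, Optional


class Region(NamedTuple):
    chromosome: str
    start: int
    end: int


# chromosome name, then two non-negative integers, separated by colons
_REGION = re.compile(r"([^:\s]+):([0-9]+):([0-9]+)")


def parse_region(token: str) -> Optional[Region]:
    """Return a Region for tokens such as 'X:1:2000000', otherwise None."""
    match = _REGION.fullmatch(token)
    if match is None:
        return None
    start, end = int(match.group(2)), int(match.group(3))
    if start > end:
        return None
    return Region(match.group(1), start, end)


def overlaps(region: Region, gene_start: int, gene_end: int) -> bool:
    """True if the gene shares at least one base with the region."""
    return gene_start <= region.end and gene_end >= region.start


if __name__ == "__main__":
    print(parse_region("X:1:2000000"))
    print(parse_region("GO:0007507"))  # a term ID, not a region
```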

In case of term ID, g:GOSt retrieves all genes of a given organism associated to the given term, and analyses this set of genes as a query. For example, when queried for GO:0007507 (heart development) with organism H. sapiens, g:GOSt retrieves about a hundred human genes associated to heart development, and performs an analysis for this gene set. One may then observe statistically significant related pathways from KEGG and REACTOME, or putative regulatory elements from TRANSFAC.

g:Profiler tools accept mixed types of gene IDs as input. These are automatically converted via g:Convert to an internal format based on Ensembl genes (ENSG). Note that chromosomal regions and term IDs can also be mixed in with a regular gene list query.

All g:Profiler tools assume whitespace-separated queries, i.e. spaces, tabs, and newlines are used to distinguish gene IDs. All queries are case insensitive.

Fully numeric IDs need to be prefixed prior to gene list submission. All encountered numeric IDs will be prefixed by g:Profiler automatically, using the prefix selected in the "Numeric IDs treated as" drop-down menu. See advanced options to change the default type of prefixes.

A random query may also be submitted (button just below query box). This is constructed by randomly picking 50% of the query symbols over all GO terms and 2*25% from two mid-sized GO terms, hence usually generating a statistically significant result.

Multi-queries consist of several gene lists that can be submitted to g:GOSt for comparative enrichment analysis. Each individual gene list needs a FASTA style header line (>QueryTitle) that can include a query title.

Organism

Organism is one of the most important input parameters to g:Profiler tools. This parameter defines the organism to which the input genes, proteins and probes belong.

g:Profiler supports organism specific queries. Any identifiers not recognised within a given organism are marked as unknown ('?' or 'N/A').

The default organism in g:Profiler is human (Homo sapiens). The organism drop-down list has the 10 most frequently queried organisms at the top and all the other supported organisms grouped by sources (ensembl, fungi, metazoa, parasites and plants). The organism list has flexible search functionality for easier species selection. The search matches across Latin and common names (e.g. both human and Homo sapiens can be searched for).

A rule of thumb: If your query gives you no results, you have most probably chosen an incorrect organism.

A full list of available organisms supported by g:Profiler can be found here.

Highlight driver terms in GO

This feature uses a two-stage algorithm for filtering GO enrichment results, providing a more efficient and reliable approach compared to traditional clustering methods. The first stage involves grouping significant terms into sub-ontologies based on their relations, while the second stage focuses on identifying the leading gene sets that give rise to other significant functions in the neighbourhood. This is done using a simple greedy search strategy that recalculates hypergeometric p-values with new parameters, resulting in at least one function being presented from every connected component. This approach ensures that multiple leading terms in a component are considered, rather than simply selecting the term with the highest significance level. For more information, see the full feature description.

Ordered query

g:Profiler gene lists can be interpreted as ordered lists, where elements (genes, proteins, probesets) are arranged in decreasing order of importance. The "ordered query" option in g:Profiler is useful when genes can be ordered based on some biologically meaningful criteria, such as differential expression in an RNA-seq experiment, the number of protein-protein interactions, or absolute expression values, but there is no clear way to determine how many genes to include in the enrichment analysis. The incremental enrichment analysis in g:Profiler for ordered queries involves testing increasingly larger numbers of genes starting from the top of the list. This option allows users to determine if functional terms are evenly distributed across the gene list or enriched primarily at the top. By running an ordered query, specific functional terms associated with the most significant changes in the experimental setup can be identified, as well as broader terms that characterize the gene set as a whole.
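One possible way to picture incremental enrichment is to test every prefix of the ordered list against a term and keep the prefix with the smallest hypergeometric tail probability. The sketch below does exactly that; it is an illustration under our own assumptions (function names, exhaustive scan over all prefixes), not a description of g:Profiler's internal implementation. Because the reported value is a minimum over many prefixes, it is a score rather than a plain p-value.

```python
from math import comb
from typing import Iterable, Set, Tuple


def upper_tail(k: int, n: int, K: int, N: int) -> float:
    """Cumulative hypergeometric probability of k or more hits."""
    if k <= 0:
        return 1.0
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return tail / comb(N, n)


def best_prefix(ordered_genes: Iterable[str], term_genes: Set[str],
                domain_size: int) -> Tuple[int, float]:
    """Return (prefix length, score) for the most enriched list prefix."""
    best_len, best_score, hits = 0, 1.0, 0
    for length, gene in enumerate(ordered_genes, start=1):
        if gene in term_genes:
            hits += 1
        score = upper_tail(hits, length, len(term_genes), domain_size)
        if score < best_score:
            best_len, best_score = length, score
    return best_len, best_score


if __name__ == "__main__":
    ranked = ["a", "b", "c", "d", "e", "f"]  # most important gene first
    print(best_prefix(ranked, {"a", "b"}, domain_size=20))
```

A term whose best prefix is short is concentrated at the top of the list, while a term whose best prefix is close to the full list length is spread more evenly.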

It's important to note that the results of ordered queries in g:Profiler should not be treated as p-values. Instead, users should only infer whether genes belonging to a term are evenly distributed across the query or primarily located at the top.

Run as multiquery

Queries are separated by lines that start with a > symbol and optionally contain a title for particular gene list. Any number of following lines until the next > belong to the query and should contain gene names, chromosomal regions, etc. Note that the query header is not needed in case of a single list of genes.
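The multiquery layout described above can be parsed with a few lines of Python. This is a minimal sketch: the function name parse_multiquery and the fallback titles of the form query_1 are hypothetical choices made for the example.

```python
from typing import Dict, List, Optional


def parse_multiquery(text: str) -> Dict[str, List[str]]:
    """Split FASTA-style input into {query title: list of identifiers}."""
    queries: Dict[str, List[str]] = {}
    current: Optional[List[str]] = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # header line: the title is optional, so fall back to a counter
            title = line[1:].strip() or f"query_{len(queries) + 1}"
            current = queries.setdefault(title, [])
        else:
            if current is None:
                # a single list of genes needs no header
                current = queries.setdefault("query_1", [])
            # identifiers are separated by spaces, tabs or newlines
            current.extend(line.split())
    return queries


if __name__ == "__main__":
    example = ">polysomal\nKLF4 TP53\nSOX2\n>total\nX:1:2000000 MYC\n"
    print(parse_multiquery(example))
```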

All results

By default g:GOSt shows only the results that are statistically significant after multiple testing correction is applied. However, when a user wants to obtain the full list of gene-term relationships or check particular terms, the "All results" option can be applied.

Measure underrepresentation

Check this option if you wish to determine significantly under-represented functional terms. By default over-representation is measured, i.e. the probability that the intersection of query and a functional category has arisen by chance. In contrast, measuring under-representation tests whether a particular term has fewer genes overlapping with the query genes than expected by chance.
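Under-representation can be pictured as the opposite tail of the same hypergeometric distribution: the probability of an intersection as small as the observed one, or smaller. The sketch below assumes this lower-tail reading, which the text does not spell out; the function and parameter names are illustrative.

```python
from math import comb


def hypergeom_lower_tail(k: int, n: int, K: int, N: int) -> float:
    """Probability of an intersection of size k or smaller.

    k -- observed intersection, n -- query size,
    K -- term size, N -- statistical domain size.
    """
    smallest = max(0, n - (N - K))  # overlap forced when the query is large
    largest = min(k, n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i)
               for i in range(smallest, largest + 1))
    return tail / comb(N, n)


if __name__ == "__main__":
    # Toy numbers: none of 5 query genes hit a term of 5 genes, domain of 10.
    print(hypergeom_lower_tail(k=0, n=5, K=5, N=10))
```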

Statistical domain scope

Statistical domain size N describes the total number of genes used for random selection and is one of the four parameters for the hypergeometric probability function of statistical significance used in g:GOSt.

Only annotated genes -- The default behaviour of g:GOSt considers only genes with at least one annotation to be part of the domain. Less-studied mammalian genomes involve a large fraction of genes that lack even "unknown" tags and the most generic annotations (biological process, molecular function, cell component). These unverified genes are often spurious and in most cases could be omitted from statistics. This results in a smaller amount of significant results everywhere; very large terms should almost never appear statistically highlighted.

All known genes -- g:GOSt can also calculate statistical significance considering all genes of the given organism in the Ensembl database. Since the number of annotated genes may be much smaller than the number of all genes, this setting delivers more results that appear significant. In larger queries, topmost GO terms such as Biological Process, Molecular Function, Cell Component tend to come out as highly significant using all known genes. This is misleading in queries with few unannotated genes. A statistical domain size of all known genes makes sense in situations where the input includes a lot of unknown and unannotated genes.

Custom -- In order to compute functional enrichments of gene lists, g:GOSt uses the background set of all organism-specific genes annotated in the Ensembl database. On several occasions, it is advisable to limit the background set for more accurate statistics. For instance, one may use a custom background when the number of genes and corresponding probesets of a microarray platform is considerably smaller than the number of known genes, or only genes of a specific chromosome are considered. g:GOSt provides means to define the custom background as a mixed list of gene, probeset and protein IDs in the corresponding form field. It is also possible to select a predefined custom background from a list of popular microarray platforms.

Custom over annotated genes -- Use the set of all genes that are annotated in the data source and also included in the custom background. This mimics the default setting ("Only annotated genes") while limiting the background genes to the ones provided.

Custom over all known genes -- Use the set of all known genes in the custom background as the statistical domain.
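The domain scopes listed above can be thought of as simple set operations. The sketch below is only a way to picture them: the scope labels ("annotated", "known", "custom_annotated", "custom_known") and the function name are hypothetical and do not correspond to parameter values of any g:Profiler interface.

```python
from typing import AbstractSet, Set


def statistical_domain(scope: str,
                       all_genes: AbstractSet[str],
                       annotated_genes: AbstractSet[str],
                       custom_background: AbstractSet[str] = frozenset()) -> Set[str]:
    """Return the gene set whose size is used as the domain size N."""
    if scope == "annotated":          # only annotated genes (the default)
        return set(annotated_genes)
    if scope == "known":              # all known genes of the organism
        return set(all_genes)
    if scope == "custom_annotated":   # custom background, annotated genes only
        return set(custom_background) & set(annotated_genes)
    if scope == "custom_known":       # custom background, all known genes
        return set(custom_background) & set(all_genes)
    raise ValueError(f"unknown scope: {scope}")


if __name__ == "__main__":
    known = {"g1", "g2", "g3", "g4"}
    annotated = {"g1", "g2"}
    background = {"g2", "g3", "not_a_gene"}
    for scope in ("annotated", "known", "custom_annotated", "custom_known"):
        print(scope, len(statistical_domain(scope, known, annotated, background)))
```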

Significance threshold

The multiple testing problem is a statistical concept that relates to the increased chance of getting significant-looking false positive results when evaluating a large number of alternative hypotheses simultaneously. It is an important issue in functional enrichment analysis, since each input query is compared against thousands of Gene Ontology terms, pathways, regulatory motifs, etc. Multiple testing correction systematically reduces the significance of detected p-values to discard false positives.

g:GOSt uses multiple testing correction by default and applies our tailor-made algorithm g:SCS for reducing significance scores. Alternatively, one may select Bonferroni correction (BC) or Benjamini-Hochberg FDR (False Discovery Rate) -- these two standard solutions to the multiple testing problem are available under Advanced options. In comparison to the latter two, our algorithm takes into account the unevenly distributed structure of functionally annotated gene sets. Our simulations in (Nucleic Acids Research, 2007) show that g:SCS provides a better threshold between significant and non-significant results than FDR or BC.

g:SCS algorithm -- g:SCS method is the default method for computing multiple testing correction for p-values gained from GO and pathway enrichment analysis. It corresponds to an experiment-wide threshold of a=0.05, i.e. at least 95% of matches above threshold are statistically significant.

This approach is based on the idea that standard multiple testing corrections such as Bonferroni correction and Benjamini-Hochberg FDR are designed for multiple tests that are independent of each other. This is not correct for the analysis in g:GOSt, since GO consists of hierarchically related general and specific terms. The True Path Rule of GO states that genes associated to a given GO term are implicitly associated to all more general parents of this term.

The g:SCS threshold is a value pre-calculated for query list sizes up to 1000 genes. Given a fixed input query size, g:SCS analytically approximates a threshold t corresponding to the 5% upper quantile of randomly generated queries of that size. All actual p-values resulting from the query are transformed to corrected p-values by multiplying them by the ratio of the approximate threshold t and the initial experiment-wide threshold a=0.05.

The algorithm considers the set structure underlying gene sets annotated to terms of each organism, and should therefore give a tighter threshold to significant results. g:SCS thresholds perfectly agreed in simulations with randomly generated gene sets of fixed input query sizes.

Bonferroni correction -- Bonferroni correction [1] is a simple and well-known Family Wise Error Rate p-value correction for multiple testing. The Family Wise Error Rate measures the probability of at least one random result being considered significant (Type I error) within an experiment. Given an experiment-wide significance threshold a, Bonferroni correction takes into account only the number n of tests performed in the given experiment, whether independent or dependent, and defines the individual significance level as a* = a/n. Every match with a p-value above a* is considered insignificant.

In g:GOSt analysis, the expected experiment-wide significance level is a=0.05. The variable n denotes the number of independent tests, i.e. the number of GO terms, KEGG pathways, etc, compared against the given input query. Two different approaches for n are mentioned in literature. A more common approach considers only terms that have some genes common with input query. Other approach suggests that n should be considered equal to the number of all annotated terms in the given genome. The first case would involve correction for a few hundred tests, while the second case observes tests with several thousand terms. g:GOSt follows the first approach for calculating the correction.

Bonferroni correction is considered rather conservative in the sense that it increases the amount of discarded significant results. Matches with smaller annotation sets never appear significant, as even a 100% overlap of input query and given term will not result in a sufficiently low p-value.

[1] C. E. Bonferroni. Teoria statistica delle classi e calcolo delle probabilità. Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commerciali di Firenze, 8:3-62, 1936.
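A minimal sketch of the Bonferroni rule a* = a/n follows. It assumes the caller passes only the terms that share at least one gene with the query, in line with the first approach for n described above; the function name and the term labels in the example are hypothetical.

```python
from typing import Dict


def bonferroni_significant(p_values: Dict[str, float],
                           alpha: float = 0.05) -> Dict[str, bool]:
    """Flag each term as significant under Bonferroni correction.

    p_values -- raw p-values of the terms tested against the query
    alpha    -- experiment-wide significance threshold a
    """
    n = len(p_values)  # number of tests performed
    if n == 0:
        return {}
    individual_level = alpha / n  # a* = a / n
    return {term: p <= individual_level for term, p in p_values.items()}


if __name__ == "__main__":
    raw = {"term_A": 0.001, "term_B": 0.02, "term_C": 0.04}
    print(bonferroni_significant(raw))
```

With three tests the individual level is 0.05/3, so in this toy example only the first term passes, even though all three raw p-values are below 0.05.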

False Discovery Rate -- False Discovery Rate (FDR) [1] is a simple multiple testing correction that measures the expected proportion of false significant matches (Type I errors) within the results. The Benjamini-Hochberg FDR approach takes into account the p-values observed in the experiment. It is often considered more applicable for Gene Ontology analysis than Bonferroni correction.

Let a be the defined significance level for the whole experiment of n statistical tests. The FDR method sorts the n p-values from the tests in increasing order, and picks a* to be the largest p-value, p_j, that is smaller than its proportional significance level a_j. The proportional significance level a_i for the ith result is calculated as a_i = i*a/n. Every match with a p-value above the Benjamini-Hochberg corrected significance level a* is discarded as insignificant.
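The Benjamini-Hochberg step-up procedure described above can be sketched as follows. The function name is illustrative, and the comparison uses "less than or equal", the usual textbook convention, which differs from a strict "smaller than" only when a p-value lands exactly on its level.

```python
from typing import List


def benjamini_hochberg(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Return one flag per input p-value: True if it is kept as significant."""
    n = len(p_values)
    # indices of the tests, ordered by increasing p-value
    order = sorted(range(n), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        # proportional significance level a_i = i * a / n
        if p_values[index] <= rank * alpha / n:
            cutoff_rank = rank  # remember the largest rank that passes
    keep = [False] * n
    for rank, index in enumerate(order, start=1):
        keep[index] = rank <= cutoff_rank
    return keep


if __name__ == "__main__":
    print(benjamini_hochberg([0.001, 0.02, 0.04, 0.5]))
```

Note that every p-value ranked at or below the cutoff is kept, even if it did not pass its own proportional level; this is what makes the procedure "step-up".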

In g:GOSt analysis, the expected experiment-wide significance level is a=0.05. The variable n denotes the number of independent tests, i.e. the number of GO terms, KEGG pathways, etc, compared against the given input query. Two different approaches for n are mentioned in literature. A more common approach considers only terms that have some genes common with input query. Other approach suggests that n should be considered equal to the number of all annotated terms in the given genome. The first case would involve correction for a few hundred tests, while the second case observes tests with several thousand terms. g:GOSt follows the first approach for calculating the correction.

The FDR method is argued to be improper for the analysis of gene queries: GO terms are hierarchically related and gene annotation sets are highly intersecting because of the True Path Rule, so statistical tests with many matching terms should not be considered entirely independent of each other. It is not yet clear whether the Gene Ontology hierarchy complies with variants of FDR designed for dependent testing.

[1] Yoav Benjamini and Yosef Hochberg. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society B, 57(1):289-300, 1995.

User threshold

User-defined p-value threshold provides a possibility to additionally filter results. The threshold defaults to p=0.05, meaning that all significant results are shown. If the threshold is set to less than 0.05, matches with p-values above threshold are not shown.

User-defined threshold differs from significance thresholds, since it does not attempt to decide over significance, i.e. change the colour of results in output.

Numeric IDs treated as

Fully numeric gene IDs need an appropriate prefix prior to gene list submission. All encountered numeric IDs are automatically detected and a prefix is added by g:Profiler based on the namespace selected in the "Numeric IDs treated as" drop-down menu. For all namespaces supported by g:Profiler, see the list of namespaces.

Data sources

Gene Ontology
Gene Ontology (GO) is a directed acyclic graph structure that defines biological functions and their relationships to one another. The GO annotations, accompanied by evidence-based statements, describe the relationship between a specific gene product and a specific ontology term (biological function). GO has three major subontologies: Molecular Function (MF), which describes the functions of the gene product; Biological Process (BP), which describes the biological processes in which the gene product participates; and Cellular Component (CC), which describes the part of the cell where the gene product is physically located.

Electronic GO annotations [IEA] -- A large proportion of functional annotations of Gene Ontology are assigned to genes using in silico curation methods and have the IEA evidence code (Inferred from Electronic Annotation). While IEA annotations are an invaluable resource in mapping gene functions, manually curated annotations of experimental and computational studies are generally of higher confidence. Therefore it might be advisable to exclude electronically inferred annotations from enrichment analysis and focus on annotations with stronger evidence when higher confidence is wanted for the enrichment results. Excluding IEA annotations may also help to reduce bias towards abundant and ubiquitous housekeeping genes like the ribosomes. These annotations can be excluded under the Data sources selection.

Biological Pathway databases
In g:GOSt we support Reactome, KEGG and WikiPathways. Reactome pathways are centrally curated by selected domain experts. KEGG pathways are a mixture of species-specific curation and orthologous knowledge mapping between species. WikiPathways is a community-driven platform where pathways are curated by anyone interested.

Regulatory motifs in DNA
TRANSFAC -- Putative transcription factor binding sites (TFBSs) from the TRANSFAC database are retrieved into g:GOSt through a special prediction pipeline. First, TFBSs are found by matching TRANSFAC position-specific matrices using the program Match on a range of +/-1kb from the TSS as provided by APPRIS (Annotating principal splice isoforms) via Ensembl BioMart. For genes with multiple primary TSS annotations we selected the one with the most TF matches. The matching range for C. elegans, D. melanogaster and S. cerevisiae is 1kb upstream from ATG (translation start site). A cut-off value to minimize the number of false positive matches (provided by TRANSFAC) is then applied to remove spurious motifs. The remaining matches are split into two inclusive groups based on the number of matches: a base class that contains all genes that match the given motif, and match class 1, which contains TFBSs that have at least 2 matches per gene.
mirTarBase -- mirTarBase is a database that holds experimentally validated information about genes that are targeted by miRNAs. We include all the organisms that are covered by mirTarBase.

Protein databases
CORUM is a database of manually annotated protein complexes from mammalian organisms. g:Profiler includes gene annotations for all CORUM protein complexes in human, mouse and rat.

Human Protein Atlas -- Protein expression measured in normal tissues. The data is based on The Human Protein Atlas version 25.0 and Ensembl version 109. We count more highly expressed genes among less expressed genes, and all expressed genes belong to tissue-specific terms. Reliability assessment is displayed as evidence codes in the detailed results view. 'not detected' and 'uncertain' genes have been omitted from our ontology as their interpretation might be ambiguous. We attempt to preserve term IDs between versions. Unfortunately, tissue or cell type renaming will change the term ID as well.

Human Phenotype Ontology
We include gene annotations from the Human Phenotype Ontology (HP), a standardized vocabulary of phenotypic abnormalities encountered in human disease. Due to ethical constraints, a significant portion of research on human disease is conducted in model organisms like mouse and rat. However, HP currently provides annotations only for human genes. To complement HP annotations of human genes, we have extended their gene database to genes of other organisms in g:Profiler using gene orthology information from Ensembl. As a result, researchers can directly see enrichments of human disease associations in their gene lists of model organisms.

Bring your data (Custom GMT)

Users can upload their own annotation data using files in Gene Matrix Transposed file format (GMT). The GMT file format is a tab-delimited file format that describes the gene sets/functions. Each row represents a gene set that is described by a unique identifier, a name, and the genes annotated with the function. For more information, please see the format description here. Users can either compose GMT files themselves, use pre-compiled gene sets from our server (for example, the example GMT of PharmGKB gene-drug relationships), or use gene sets available at dedicated websites, for example at Molecular Signatures Database (MSigDB) or at Bader lab. Note that in the case of using a GMT file, only the genes present in the GMT file are known. g:Profiler doesn't include any information from its default database in case of custom annotations.
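To make the row layout concrete, here is a small sketch that reads GMT text into a dictionary and derives the set of "known" genes as the union of all gene sets, following the note above. The function names and the toy identifiers in the example are hypothetical.

```python
from typing import Dict, Set, Tuple


def parse_gmt(text: str) -> Dict[str, Tuple[str, Set[str]]]:
    """Map each gene set identifier to (name, set of genes)."""
    gene_sets: Dict[str, Tuple[str, Set[str]]] = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")  # GMT is tab-delimited
        if len(fields) < 2:
            raise ValueError(f"expected identifier and name, got: {line!r}")
        identifier, name, *genes = fields
        gene_sets[identifier] = (name, {g.strip() for g in genes if g.strip()})
    return gene_sets


def gmt_domain(gene_sets: Dict[str, Tuple[str, Set[str]]]) -> Set[str]:
    """All genes present in the file, i.e. the only genes that are known."""
    known: Set[str] = set()
    for _name, genes in gene_sets.values():
        known |= genes
    return known


if __name__ == "__main__":
    toy = "SET:1\tfirst toy set\tGENE1\tGENE2\nSET:2\tsecond toy set\tGENE2\tGENE3\n"
    sets = parse_gmt(toy)
    print(sets)
    print(sorted(gmt_domain(sets)))
```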

It is possible to generate a custom annotation set for any organism if you have a set of annotations for it. We support uploading annotations as GMT files. There are step-by-step instructions and a helper tool for all steps of the process available at https://biit.cs.ut.ee/gmt-helper/. More information is available at our FAQ page.

Highlighting

We have implemented a novel two-stage hybrid term list filtering algorithm for GO enrichment results that takes into account the underlying topology of annotations, but without introducing additional hyperparameters to tune. The main objective for developing this approach was to systematically reduce the term lists retrieved by g:Profiler.

The first step, once the list of significantly enriched GO functions is detected, is to reorganise the significant terms based on their relations. That is, terms that share a GO-defined relation, i.e. have connecting edges between the term nodes, are grouped together. Therefore, instead of detecting ambiguous clusters of terms by using some similarity measure (e.g. semantic similarity) and a clustering algorithm, we use the reliable knowledge from the manually curated resource of GO. Hereinafter, these term groups are referred to as connected components, and one can think of them as sub-ontologies of GO. The components alone are already helpful for summarizing the results as the terms in the same connected component describe similar biological contexts and most likely share a large part of the genes. These components are shown in the ‘GO Context’ tab.
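The grouping step can be sketched as a connected-components search over the significant terms. The sketch assumes that two significant terms are linked only when GO has a direct edge between them, which is our reading of "connecting edges between the term nodes"; the function name and the toy term labels are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def connected_components(significant: Set[str],
                         edges: Iterable[Tuple[str, str]]) -> List[Set[str]]:
    """Group significant terms that are linked by (child, parent) edges."""
    neighbours: Dict[str, Set[str]] = defaultdict(set)
    for child, parent in edges:
        # keep an edge only if both of its ends are significant terms
        if child in significant and parent in significant:
            neighbours[child].add(parent)
            neighbours[parent].add(child)

    components: List[Set[str]] = []
    seen: Set[str] = set()
    for term in sorted(significant):
        if term in seen:
            continue
        component: Set[str] = set()
        stack = [term]
        while stack:  # depth-first walk over the undirected links
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(neighbours[current] - component)
        seen |= component
        components.append(component)
    return components


if __name__ == "__main__":
    go_edges = [("T2", "T1"), ("T3", "T1"), ("T5", "T4"), ("T6", "T9")]
    print(connected_components({"T1", "T2", "T3", "T4", "T5", "T6"}, go_edges))
```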

All significantly enriched terms are displayed with a coloured border that corresponds to their enrichment p-values. Grey-bordered terms connect the enriched terms to the root of the domain. Additionally, highlighted terms are shown with a yellow background. For each term, g:Profiler provides the term ID, description, and enrichment p-value. The initial structure of the Gene Ontology (GO) graph is simplified, but the main idea of more detailed terms being located at the bottom of the tree is retained.

The next step is to detect the non-redundant terms from each of these components. We rely on the idea that every component has its leading gene sets that give rise to other significant functions in the neighbourhood. That is, if a child term is significantly enriched, the parent term might appear significant due to the fact that it also includes all the genes from the child term. To detect these leading sets we developed a simple algorithm that greedily starts from the term that has the smallest adjusted p-value and keeps it as a leading term. At the same time, the overlapping genes between the term and the input gene list are remembered as marked genes.

As the GO annotations are propagated up the graph, we now exclude all the child and ancestor terms of the selected term from further searches. We find these by recursively traversing the graph structure of GO. Then the search continues with the next smallest p-value from the remaining functions in the component, but now we implement the idea from the elim method (Alexa et al., 2006) and measure the significance of the term by excluding the marked genes from the term and recalculating the cumulative hypergeometric p-value with the new parameters. With this, we identify whether there are other relevant genes in the query that give rise to the terms representing different biological functions. If this p-value remains significant, that is, it is below the originally set significance threshold, then we keep the term as another leading term and add the overlapping genes to the list of marked genes. The recalculated p-value is used as an indicator in the algorithm and is not reported to the user. Again, we exclude the child and ancestor terms and continue these steps until there are no terms left to check. Since the components are independent and do not share any terms, the greedy search can be applied to the connected components in parallel. As a result, at least one function from every connected component is presented for further interpretation. Though it may seem appealing to simply select only the term with the highest significance level for every component, there might be several leading terms in a component.
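The greedy search within one connected component can be sketched as follows. This is an illustration under stated assumptions, not the production code: all function and parameter names are hypothetical, "excluding the marked genes" is read as shrinking the term's gene set and its overlap while the query size and background size stay unchanged, all ancestors and descendants of a kept term are excluded, and relatives of a rejected term remain candidates.

```python
from math import comb

def hypergeom_tail(k, N, K, n):
    """P(X >= k) when n genes are drawn from N, of which K are annotated."""
    if k <= 0:
        return 1.0
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return hits / comb(N, n)

def relatives(term, parents):
    """All ancestors and descendants of `term`; `parents` maps term -> parent set."""
    children = {}
    for child, ps in parents.items():
        for p in ps:
            children.setdefault(p, set()).add(child)

    def walk(step):
        found, stack = set(), [term]
        while stack:
            for nxt in step.get(stack.pop(), ()):
                if nxt not in found:
                    found.add(nxt)
                    stack.append(nxt)
        return found

    return walk(parents) | walk(children)

def leading_terms(adjusted_p, term_genes, parents, query, background_size,
                  threshold=0.05):
    """Greedy selection of leading terms in one connected component."""
    query = set(query)
    remaining = set(adjusted_p)
    marked, leading = set(), []
    while remaining:
        # Next candidate: smallest adjusted p-value (term ID breaks ties).
        term = min(remaining, key=lambda t: (adjusted_p[t], t))
        remaining.discard(term)
        genes = term_genes[term] - marked      # drop already explained genes
        overlap = genes & query
        if leading:
            # Recalculated p-value: used only as an indicator, never reported.
            p = hypergeom_tail(len(overlap), background_size, len(genes), len(query))
            if p >= threshold:
                continue
        leading.append(term)
        marked |= overlap
        remaining -= relatives(term, parents)
    return leading
```

Because the components share no terms, a function like `leading_terms` could be called on each component independently, for example in parallel.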

Before releasing it to g:Profiler users, we conducted a straightforward evaluation to assess the performance of our Gene Ontology (GO) filtering algorithm, primarily comparing our in-house developed methods with various parameter settings of the SUMER R package (A. Alexa, J. Rahnenführer and T. Lengauer, Improved scoring of functional groups from gene expression data by decorrelating GO graph structure, Bioinformatics 22, 1600 (2006)). The primary criteria under examination were: a) the reduction of reported GO terms and b) the preservation of biological relevance within the query. While criterion a) is straightforward to evaluate and quantify, criterion b) is considerably more elusive.

For this investigation, we selected the GO: Biological Process (BP) ontology and designed queries for the experiment. Each query comprised a proportion of genes associated with a random GO:BP term and a proportion of unrelated, random genes. We randomized both the proportions and query size within a reasonable range, allowing us to gauge the impact of "noise" on filtering and evaluate the retention of "signal" in the query both with and without filtering.
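A query of this kind could be generated along the following lines. This is only a sketch of the described design: the function name, the argument names and the rounding rule are assumptions, and the actual ranges used in the evaluation are not given here.

```python
import random

def make_query(term_genes, all_genes, size, signal_fraction, rng):
    """Mix genes from one GO:BP term ("signal") with unrelated genes ("noise")."""
    # Assumption: the signal share is rounded and capped by the term size.
    n_signal = min(round(size * signal_fraction), len(term_genes))
    signal = rng.sample(sorted(term_genes), n_signal)
    # Noise is drawn only from genes not annotated to the chosen term.
    unrelated = sorted(set(all_genes) - set(term_genes))
    noise = rng.sample(unrelated, size - n_signal)
    return signal, noise
```

Drawing `size` and `signal_fraction` at random for each query, for instance with `rng.randint` and `rng.uniform`, would reproduce the randomised proportions and query sizes described above.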

Comparison of different filtering strategies on retaining "signal" genes and reducing reported terms in GO queries. The background dot plot indicates all performed queries, while the foreground bigger dots represent the combined average per filtering schema. The x-axis represents the fraction of terms retained after filtering, while the y-axis represents the fraction of "signal" genes with and without filtering. The legend denotes the different filtering methods, including in-house filtering schemes (prefixed with 'filt_'), the currently implemented method in g:Profiler ('passed_filter'), and SUMER methods ('sumer_N', where N denotes the maximum number of terms to retain). The SUMER 'gradient' intuitively visualises the trade-off between signal retention and term reduction. Our 'passed_filter' method, indicated by an arrow, demonstrates the best performance in retaining relevant signal while significantly reducing the number of reported terms.

Our study analyzed the performance of five in-house developed filtering strategies and the SUMER R package with different parameter settings. The SUMER strategies require parameters specifying the maximum number of terms to retain, whereas our in-house methods are parameter-free. Ultimately, the strategy currently implemented in the g:Profiler tool outperformed all other contenders, including the SUMER strategies with various parameters. Although the SUMER strategies exhibited commendable performance, their parameters lacked a suitable default and did not generalize well across different queries.

Furthermore, we conducted the same evaluation using a more complex query featuring signals from two independent GO:BP terms. In this scenario, our implemented filtering strategy demonstrated superior performance, while the limitations of SUMER's parameter-based approach became evident.

Examples

Query 1 : nine core cell cycle transcription factors (TF) in yeast (plain, unordered query).
Query 2 : human INHBA, a member of the TGF-beta pathway, and 29 microarray probesets with highest correlation in gene expression (ordered query).
Query 3 : Same as above, but electronic annotations [IEA] excluded (no IEA electronic GO annotations).
Query 4 : Recurrent mutations in the PI3K/AKT signalling pathway from the TCGA pancancer dataset (5+ SNVs).
Query 5 : Fully numeric EntrezGene IDs and appropriate prefixes added automatically (query with numeric IDs).
Query 6 : Multiquery example with comparative analysis of KEGG pathways - Parkinson disease and Insulin resistance pathways.

g:Convert

g:Convert is a gene ID mapping tool that allows conversion of genes, proteins, microarray probes, common names, various database identifiers, etc. A mixture of IDs of different types may be submitted to g:Convert. The user needs to select a target database to which all input IDs will be converted. The default target database is Ensembl. g:Convert also supports major public databases and naming conventions like Uniprot, EMBL, RefSeq, Entrez, HUGO, IPI, and organism-specific ID schemas; the corresponding mappings are retrieved from Ensembl. In addition to genomic databases, a large variety of microarray platforms are made available for conversion. Many common Affymetrix mappings for different organisms are present; a few others such as Celera, Agilent, and Illumina are available for a selection of organisms. Microarray mappings are retrieved from Ensembl.

Input IDs that have no corresponding entry in target database will be displayed as N/A.

g:Convert is based on the Ensembl database. All input IDs are automatically translated through an internal mapping based on ENSG genes. g:Convert is well integrated with the other modules in g:Profiler and serves as a name mapping service for them.
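The two-stage translation through ENSG genes can be pictured with a small sketch. The lookup tables below are toy data and the function name is hypothetical; the sketch only illustrates the idea of mapping every input ID to ENSG first and then to the target database, with N/A for IDs that have no match.

```python
def convert(ids, to_ensg, ensg_to_target):
    """Map each input ID to ENSG genes, then to the target namespace."""
    rows = []
    for identifier in ids:
        ensgs = to_ensg.get(identifier, [])
        targets = [t for e in ensgs for t in ensg_to_target.get(e, [])]
        # IDs without a corresponding target entry are reported as N/A.
        rows.append((identifier, targets or ["N/A"]))
    return rows

# Toy tables for illustration only (a mixed list of input ID types).
to_ensg = {"TP53": ["ENSG00000141510"], "P04637": ["ENSG00000141510"]}
ensg_to_entrez = {"ENSG00000141510": ["7157"]}

result = convert(["TP53", "P04637", "UNKNOWN1"], to_ensg, ensg_to_entrez)
```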

Using g:Convert

Query

Syntactically identical to g:GOSt query field

The default input of g:Convert is a list of genes/proteins. g:Convert accepts simple whitespace-separated gene lists that can consist of mixed types of gene IDs (proteins, transcripts, microarray IDs, etc), SNP IDs, chromosomal intervals or term IDs.

In case of chromosomal regions, ENSG genes from these regions are retrieved automatically. Genes need not fit the region fully, and hence one may even study single nucleotides (SNPs). g:Profiler uses a chromosome:start:end format for chromosomal regions (e.g. X:1:2000000).
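The region format and the "genes need not fit the region fully" rule can be illustrated with a short sketch. The names are hypothetical and the coordinates are assumed to be inclusive; the gene coordinates in the usage lines are made up.

```python
from typing import NamedTuple

class Region(NamedTuple):
    chromosome: str
    start: int
    end: int

def parse_region(text):
    """Parse a chromosome:start:end string such as X:1:2000000."""
    chromosome, start, end = text.split(":")
    start, end = int(start), int(end)
    if start > end:
        raise ValueError(f"region start exceeds end: {text}")
    return Region(chromosome, start, end)

def overlaps(region, gene_chromosome, gene_start, gene_end):
    """True if the gene shares at least one position with the region."""
    return (region.chromosome == gene_chromosome
            and gene_start <= region.end
            and gene_end >= region.start)

region = parse_region("X:1:2000000")
partly_inside = overlaps(region, "X", 1_999_000, 2_005_000)          # True
single_base = overlaps(parse_region("X:150:150"), "X", 100, 200)     # True
```

A partial overlap is enough for a gene to be retrieved, which is why a one-base region behaves like a single nucleotide query.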

In case of term ID, g:Convert retrieves all genes of a given organism associated to the given term. For example, when queried for GO:0007507 (heart development) with organism H. sapiens, g:Convert retrieves about a hundred human genes associated to heart development.

Fully numeric IDs need to be prefixed prior to gene list submission. All encountered numeric IDs will be prefixed by g:Profiler automatically, using the prefix determined by the Numeric IDs treated as dropdown menu.
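The automatic prefixing amounts to something like the sketch below. The function name is hypothetical, and both the default prefix and the colon separator are assumptions chosen for illustration; the actual prefix is whatever is selected in the dropdown menu.

```python
def prefix_numeric_ids(ids, prefix="ENTREZGENE_ACC"):
    """Add a namespace prefix to fully numeric IDs, leaving other IDs untouched."""
    return [
        f"{prefix}:{identifier}"
        if identifier.isascii() and identifier.isdigit()
        else identifier
        for identifier in ids
    ]

prefixed = prefix_numeric_ids(["1017", "TP53", "ENSG00000141510"])
```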

Technically, the g:GOSt, g:Convert and g:Orth tools can handle identical queries, except for multi-queries. g:Orth and g:Convert ignore the query separators, treating the entire query as a single gene list.

Examples

Query 1 : Affymetrix probesets (HG U133 2.0 PLUS) for the three core stem cell transcription factors in human (POU5F1, SOX2, NANOG)
Query 2 : Genes located in human mitochondrial genome shown as HGNC standard nomenclature (query with chromosomal regions).

For most species, g:Convert uses conversion tables from Ensembl Biomart. Some conversions cause ambiguity, and we filter them out. For example, the label ENSG00000015568 is annotated through HPA and Uniprot to genes ENSG00000015568 and ENSG00000183054. To avoid confusion and hard-to-find bugs, we remove the ENSG00000015568-ENSG00000183054 annotation from the g:Profiler database, so it does not appear in g:Convert results.

g:Orth

g:Orth is a tool for mapping orthologous genes between related organisms based on data collected into the Ensembl database. Orthologous genes are similar in sequence and are likely conserved through evolution since a common ancestor. Orthologous genes may also carry out similar functions and are therefore relevant in functional analysis.

The input of g:Orth can be a mixed list of IDs for genes or other biomolecules of the organism of interest. The user also needs to select a target organism. All genes from the input are then mapped to the orthologous genes of the target organism. Input IDs that have no corresponding entry in either the given organism or the target organism will be displayed as N/A. g:Orth ortholog mappings are based on Ensembl alignments.

Using g:Orth

Query

Syntactically identical to g:GOSt query field

The default input of g:Orth is a list of genes/proteins. g:Orth accepts simple whitespace-separated gene lists that can consist of mixed types of gene IDs (proteins, transcripts, microarray IDs, etc), SNP IDs, chromosomal intervals or term IDs.

In case of chromosomal regions, ENSG genes from these regions are retrieved automatically. Genes need not fit the region fully, and hence one may even study single nucleotides (SNPs). g:Profiler uses a chromosome:start:end format for chromosomal regions (e.g. X:1:2000000).

In case of term ID, g:Orth retrieves all genes of a given organism associated to the given term. For example, when queried for GO:0007507 (heart development) with organism H. sapiens, g:Orth retrieves about a hundred human genes associated to heart development.

Fully numeric IDs need to be prefixed prior to gene list submission. All encountered numeric IDs will be prefixed by g:Profiler automatically, using the prefix determined by the Numeric IDs treated as dropdown menu.

Technically, the g:GOSt, g:Convert and g:Orth tools can handle identical queries, except for multi-queries. g:Orth and g:Convert ignore the query separators, treating the entire query as a single gene list.

Examples

Query 1 : List of budding yeast transcription factors and corresponding genes in human.
Query 2 : Rapidly developing regions from the 1st chromosome of the human genome (according to Neanderthal genome analysis) and corresponding orthologs in G. gorilla (query with chromosomal regions).

g:SNPense

g:SNPense allows the user to easily map a list of human SNP rs-codes (e.g. rs7961894) to gene names and receive chromosomal coordinates and predicted variant effects. Mapping is available for variants that overlap with at least one protein-coding Ensembl transcript. The genome variant information is retrieved from the Ensembl Variation data. The variant effects are described with a colour-coded set of variant consequence terms, defined by the Sequence Ontology. These terms convey information about the effects that each allele of the variant may have on each transcript.
