J. Reimand, M. Kull, H. Peterson, J. Hansen, J. Vilo: g:Profiler -- a web-based toolset for functional profiling of gene lists from large-scale experiments. Nucleic Acids Research 2007; 35: W193-W200
J. Reimand, T. Arak, J. Vilo: g:Profiler -- a web server for functional interpretation of gene lists (2011 update). Nucleic Acids Research 2011; doi: 10.1093/nar/gkr378
2016-02-02 -- g:Profiler has novel annotations and a new service. This release adds Human Protein Atlas data on human protein expression in 44 tissue slices; Online Mendelian Inheritance in Man (OMIM), which covers human genes and their relationships with Mendelian disorders and other genetic phenotypes; and updated TRANSFAC transcription factor binding site predictions. We also introduce a new service, g:SNPense, that converts SNP identifiers (rs codes) to chromosomal locations and gene IDs. Finally, an in-house Python API is now available. Note that, after deliberation, we have reversed the more stringent FDR multiple testing correction introduced in the 2015-11-20 release. Details on the algorithm as implemented in g:Profiler are available here.
2016-01-28 -- A bugfix release of g:Profiler. Occasionally, large ordered queries caused memory corruption in the analysis backend, so that no results were output. This has been corrected.
2015-12-28 -- g:Profiler was updated to Ensembl 83 and Ensembl Genomes 30. This release includes two important updates on ordered gene lists. First, a bug in the previous release caused overly restrictive FDR and Bonferroni correction on ordered gene lists (non-ordered gene lists and default g:SCS analyses were not affected). We recommend repeating any analyses of ordered gene lists performed in the past month. Second, the ordered list analysis was updated to better detect optimal sublists with the strongest p-values of pathway enrichment. This update will result in more discovered pathways in ordered analyses of longer gene lists.
Welcome to g:GOSt!
First time? See our welcome note.
g:GOSt performs functional profiling of gene lists using various kinds of biological evidence. The tool performs statistical enrichment analysis to find over-representation of information like Gene Ontology terms, biological pathways, regulatory DNA elements, human disease gene annotations, and protein-protein interaction networks. Its output is a tabular graphic where genes are shown in columns, functions in rows, and coloured table cells show functional associations. The basic input of g:GOSt is a list of genes.
- Query 1: nine core cell cycle transcription factors (TF) in yeast (plain, unordered query).
- Query 2: same as above, but considering only the set of all yeast TFs as background (custom statistical background).
- Query 3: human INHBA, a member of the TGF-beta pathway, and 29 microarray probesets with highest correlation in gene expression (ordered query).
- Query 4: Same as above, but electronic annotations [IEA] excluded (no IEA).
- Query 5: Recurrent mutations in the PI3K/AKT signalling pathway from the TCGA pancancer dataset (5+ SNVs).
- Query 6: Same as above, but output stored as Excel file (alternative output type).
- Query 7: Textual list of all known annotations of human gene PAX6 (query with single gene, alternative output type, all significant and insignificant annotations).
- Query 8: Fully numeric EntrezGene IDs and appropriate prefixes added automatically (query with numeric IDs).
The default output of g:GOSt is a PNG graphic. See its detailed description, which uses INHBA and co-expressed genes as an example.
Three alternative options are available in the output type dropdown menu: textual, spreadsheet (XLS) and minimal. The minimal option shows output without HTML header and is useful for programmatic access. All these options present enrichment information in a common format (see help item [?] at the dropdown).
The basic form of g:GOSt input is a list of genes. g:GOSt accepts simple whitespace-separated gene lists that may mix different types of gene IDs (proteins, transcripts, microarray IDs, etc.). Single genes, ordered gene lists, IDs of GO terms, pathways or other functional information, and chromosomal regions may also be given as input. See help items [?] for further information.
Ordered gene lists
g:Profiler gene lists may be interpreted as ordered lists where elements (genes, proteins, probesets) are in order of decreasing importance. The ordered query option is useful when the genes are placed in some biologically meaningful order, for instance according to differential expression in a given microarray experiment. g:Profiler then performs incremental enrichment analysis with increasingly larger numbers of genes from the top of the list. This optimisation procedure identifies specific functional terms that are associated with the most dramatic changes in gene expression, as well as broader terms that characterise the gene set as a whole.
g:Sorter is a convenient method for producing examples of sorted lists from microarray co-expression searches.
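The incremental analysis of ordered lists described above can be illustrated with a minimal sketch. It assumes a one-sided hypergeometric test, a common choice for over-representation analysis, and scans every prefix of the ordered list for the strongest p-value of a single term. The function names and the toy data are hypothetical, and the sketch does not reproduce g:Profiler's own implementation or its multiple testing correction.

```python
from math import comb

def hypergeom_pvalue(k, n, K, N):
    """P(X >= k): chance of at least k annotated genes among n genes
    drawn from a background of N genes, K of which carry the term."""
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)

def best_prefix(ordered_genes, term_genes, background_size):
    """Scan growing prefixes of the ordered list and return
    (p_value, prefix_length) for the prefix with the smallest p-value."""
    term_genes = set(term_genes)
    best = (1.0, 0)
    hits = 0
    for n, gene in enumerate(ordered_genes, start=1):
        if gene in term_genes:
            hits += 1
        p = hypergeom_pvalue(hits, n, len(term_genes), background_size)
        if p < best[0]:
            best = (p, n)
    return best

# Toy example: the term is concentrated at the top of the list.
ordered = ["a", "b", "c", "d", "e", "f"]
p_value, length = best_prefix(ordered, {"a", "b", "x"}, background_size=20)
print(length, p_value)
```

In this toy case the best prefix is the top two genes: extending the list further only dilutes the signal, which is the behaviour that makes the ordered option useful for lists ranked by effect size.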
Electronic annotations [IEA]
A significant proportion of functional annotations of Gene Ontology are assigned using in silico curation methods and have the IEA evidence code (Inferred from Electronic Annotation). While IEA annotations are an invaluable resource in mapping gene functions, manually curated annotations from experimental and computational studies are generally of higher confidence. It is therefore sometimes advisable to exclude electronically inferred annotations from enrichment analysis and focus on annotations with stronger evidence. Excluding IEA annotations may also help reduce bias towards abundant and ubiquitous housekeeping genes such as ribosomal genes.
Our IEA filter is enabled via a single checkbox, "no electronic GO annotations", and the corresponding enrichment analyses account for the altered structure of GO annotations.
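As a rough illustration of what excluding IEA annotations does to the underlying gene sets, the sketch below groups toy annotation records by term with and without the IEA evidence code. The record layout, the function name and all gene and term names are made up for this example; the actual filtering inside g:Profiler is not described here.

```python
from collections import defaultdict

# Toy annotation records: (gene, term, evidence code). All names are hypothetical.
ANNOTATIONS = [
    ("geneA", "TERM_1", "IDA"),
    ("geneB", "TERM_1", "IEA"),
    ("geneC", "TERM_1", "IEA"),
    ("geneA", "TERM_2", "TAS"),
    ("geneD", "TERM_2", "IEA"),
]

def term_gene_sets(annotations, exclude_iea=False):
    """Group genes by term, optionally skipping records with evidence code IEA."""
    sets = defaultdict(set)
    for gene, term, evidence in annotations:
        if exclude_iea and evidence == "IEA":
            continue
        sets[term].add(gene)
    return dict(sets)

with_iea = term_gene_sets(ANNOTATIONS)
without_iea = term_gene_sets(ANNOTATIONS, exclude_iea=True)

for term in with_iea:
    print(term, len(with_iea[term]), "->", len(without_iea.get(term, set())))
```

Term sizes shrink after filtering, and some genes lose all their annotations, so an enrichment test run on the filtered data would need to use the filtered term sizes rather than the original ones.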
Besides various gene names, symbols and accessions, queries may be constructed from collections of chromosomal regions. ENSG genes from these regions are retrieved automatically. Genes need not lie entirely within a region; partial overlap is sufficient, and hence one may even study single nucleotides (SNPs). g:Profiler uses a chromosome:start:end format for chromosomal regions, e.g. X:1:2000000.
To activate chromosomal queries, check the "chromosome ranges" checkbox. Note that genes and chromosome regions cannot be mixed in a single query.
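The chromosome:start:end format and the partial-overlap rule can be sketched as follows. This is a minimal illustration that assumes inclusive coordinates; the `Region` type, the function names and the gene coordinates are hypothetical and are not taken from g:Profiler.

```python
from typing import NamedTuple

class Region(NamedTuple):
    chrom: str
    start: int
    end: int

def parse_region(text):
    """Parse a 'chromosome:start:end' string such as 'X:1:2000000'."""
    chrom, start, end = text.strip().split(":")
    start, end = int(start), int(end)
    if start > end:
        raise ValueError(f"start exceeds end in region {text!r}")
    return Region(chrom, start, end)

def overlaps(region, gene):
    """True if the gene overlaps the region at least partially
    (coordinates are assumed to be inclusive)."""
    return (
        region.chrom == gene.chrom
        and gene.start <= region.end
        and gene.end >= region.start
    )

query = parse_region("X:1:2000000")
# Hypothetical gene that extends past the end of the query region.
gene = Region("X", 1999000, 2005000)
print(overlaps(query, gene))

# A single-nucleotide query is a region whose start equals its end.
snp = parse_region("X:1500:1500")
print(overlaps(snp, Region("X", 1000, 2000)))
```

Because only partial overlap is required, a region of length one behaves like a SNP lookup: it matches any gene whose span contains that position.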
Multiple testing correction
The multiple testing problem is a statistical concept that relates to the increased chance of getting significant-looking false positive results when evaluating a large number of alternative hypotheses simultaneously. It is an important issue in functional enrichment analysis, since each input query is compared against hundreds of Gene Ontology terms, pathways, regulatory motifs, etc. Multiple testing correction systematically reduces the significance of detected p-values to discard false positives.
g:Profiler uses multiple testing correction by default and applies our tailor-made algorithm g:SCS for reducing significance scores. Alternatively, one may select Bonferroni correction (BC) or Benjamini-Hochberg FDR (False Discovery Rate); these two standard solutions to the multiple testing problem are available under Advanced options. In comparison to the latter two, our algorithm takes into account the unevenly distributed structure of functionally annotated gene sets. Our simulations (Reimand et al., Nucleic Acids Research, 2007) show that g:SCS provides a better threshold between significant and non-significant results than FDR or BC.
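For reference, the two standard corrections mentioned above can be written in a few lines. The sketch below shows textbook Bonferroni and Benjamini-Hochberg adjustments of a list of p-values; it is not the g:Profiler code, and g:SCS is not reproduced because its details are not given here. The function names and example p-values are illustrative only.

```python
def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

raw = [0.01, 0.04, 0.03, 0.005]
print(bonferroni(raw))
print(benjamini_hochberg(raw))
```

With a 0.05 threshold, Bonferroni keeps two of these four toy p-values while Benjamini-Hochberg keeps all four, which shows why the choice of correction changes how many terms are reported.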
- Check out gProfiler Beta for most recent developments and data updates.
- Previous versions of g:Profiler software and data are available in the archives.
- The gProfileR R package is available on CRAN.
- The gprofiler-official Python module is available on PyPI.
- g:Profiler can be inserted into a Galaxy workflow.
- g:Profiler web service API documentation is available here.