funcExplorer [1] is a web-based tool for hierarchical clustering of gene/protein activity data from RNA-seq, microarray, ProtoArray or any other experiment that can be brought into the form of a data matrix, followed by automatic functional enrichment analysis of the derived clusters. The row identifiers must be recognizable by g:Profiler. The unique feature of funcExplorer is the global enrichment analysis of every possible cluster for shared biological function, together with a compact global visualization that highlights major gene clusters that are co-expressed and statistically significantly enriched in biological terms. funcExplorer utilises Gene Ontology, well-curated pathway databases (KEGG, Reactome), regulatory motifs of Transfac, microRNA target sites of miRBase, CORUM protein complexes, the Human Protein Atlas (HPA) and the Human Phenotype Ontology (HPO) to provide information on the shared regulatory mechanisms of the given genes. All this information is presented in an easily comprehensible bird's-eye view that outlines interesting groups of genes in color-coded form. The idea of enrichment-driven pruning of the hierarchical dendrogram is based on the VisHiC web tool published in 2009 [2]; funcExplorer is a complete rewrite of that tool.
Datasets pass several stages of analysis:
A key feature of funcExplorer is the search for and visualization of clusters that are significantly enriched with biological terms. See the About input section for more details.
funcExplorer supports gene/protein activity datasets from all major organisms, provided in a standardized tab-separated form. Users can upload input datasets through a simple web upload form, and the results will be linked either to the given e-mail address or just to the dataset itself. We also provide a variety of public datasets. funcExplorer supports hundreds of types of gene identifiers, so you can supply your data with your preferred gene names or database IDs. The supported identifiers coincide with those available in the public web server g:Profiler.
The source code of the funcExplorer web tool is freely available in a GitLab repository and on Zenodo under the GNU GPLv3 licence.
[1] funcExplorer was published in 2018 by Kolberg et al. (PDF)
[2] VisHiC was published in 2009 by Krushevskaya et al. (PDF)
In this section we cover performing the very first analysis and visualization with funcExplorer. The easiest way to learn how to use funcExplorer is to analyse one of the publicly available datasets. This procedure requires only two easy stages of input:
To make things even easier, we provide a sample query right on the welcome page of the application. Just click on it and press the Start the analysis button below. You will be redirected to the analysis page.
Take your time to explore the result page: hover over different parts of the images to see interactive information and links. To learn more about the results and options, take a look at the About output and About input sections. If you have any problems or questions, we will be more than happy to assist you.
To analyse a dataset with funcExplorer, the user must provide an input dataset and indicate the correct organism during the upload process. funcExplorer takes care of the rest.
funcExplorer starts by preprocessing the dataset. During this process the dataset is hierarchically clustered using Hybrid Hierarchical Clustering, with the Pearson similarity measure used to measure the similarity between elements. As a next step, each cluster from the resulting hierarchy is annotated using g:Profiler. Annotations are computed so that the user can later select a multiple testing correction to reduce the number of false positives resulting from numerous enrichment tests. A special correction that takes into account the hierarchical structure of GO (defined as the g:SCS method in g:Profiler) is selected by default, but it is also possible to apply standard methods such as the Bonferroni correction and the Benjamini-Hochberg FDR correction.
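To illustrate the two standard corrections mentioned above, here is a minimal sketch in plain Python. It is not funcExplorer's or g:Profiler's code, the function names and the example p-values are made up for illustration, and the default g:SCS method is not reproduced here.

```python
def bonferroni(p_values):
    """Bonferroni: multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    # Visit p-values from largest to smallest so that a running minimum
    # enforces monotonicity of the adjusted values.
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, i in enumerate(order):
        rank = m - position  # 1-based rank of p_values[i] in ascending order
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Illustrative p-values only (not taken from any real dataset).
raw = [0.001, 0.01, 0.03, 0.04]
print(bonferroni(raw))
print(benjamini_hochberg(raw))
```

Bonferroni controls the family-wise error rate and is the stricter of the two; Benjamini-Hochberg controls the false discovery rate and typically keeps more annotations at the same threshold.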
The second stage of analysis is performed when the user chooses to analyse and visualize a particular dataset.
The user has three options for cutting the hierarchical tree.
The list of interesting clusters is defined based on the best annotations of the clusters. In other words, each cluster is characterized by a single annotation, chosen according to its p-value. The resulting clusters are medium-sized.
The best annotation cutting strategy is performed in two stages:
The F1 cutting strategy is defined based on the F1 measure. This strategy results in the smallest clusters, representing more specific functional features in the data. Intuitively, it looks for clusters that are "complete": the subtree contains most genes of a functional category, and a large proportion of the subtree belongs to that category.
The first annotations cutting strategy is defined based on the clustering hierarchy. This strategy results in the largest clusters, representing a broad overview of functional features in the data.
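The intuition behind the F1 measure can be sketched as follows. This uses the standard definition of F1 as the harmonic mean of precision (the share of the cluster that belongs to the category) and recall (the share of the category that falls into the cluster); how exactly funcExplorer applies it when cutting the tree is not specified here, and the function name and gene labels are hypothetical.

```python
def f1_score(cluster_genes, term_genes):
    """Standard F1 between a cluster and the gene set of one annotation term."""
    cluster, term = set(cluster_genes), set(term_genes)
    overlap = len(cluster & term)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cluster)  # proportion of the cluster in the term
    recall = overlap / len(term)        # proportion of the term in the cluster
    return 2 * precision * recall / (precision + recall)


# Hypothetical gene labels: 3 of 4 cluster genes belong to a 5-gene term.
cluster = ["g1", "g2", "g3", "g4"]
term = ["g1", "g2", "g3", "g5", "g6"]
print(round(f1_score(cluster, term), 3))
```

A cluster scores close to 1 only when it both covers the category and contains little else, which is why this strategy favours small, specific clusters.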
In addition to the cutting strategy, it is also possible to set:
All these pages contain links for downloading the resulting images as PNG files and the tables as PNG or CSV files.
This page contains three main parts [1]: an interactive view that represents the dataset; a list of interesting clusters with each cluster's best annotations by domain and p-value (summary page); and a list of unique statistically significant annotations (unique annotations page). Unique annotations are annotations that are present in only one cluster of the given dataset. In addition, we provide the list of genes in the dataset with descriptions.
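The definition of unique annotations (terms present in only one cluster) can be expressed in a few lines. The sketch below is only an illustration of that definition; the function name, cluster IDs and term labels are placeholders and not part of funcExplorer.

```python
from collections import Counter


def unique_annotations(cluster_terms):
    """Keep, for each cluster, only the terms that occur in no other cluster."""
    # Count in how many clusters each term appears (once per cluster).
    counts = Counter(
        term for terms in cluster_terms.values() for term in set(terms)
    )
    return {
        cluster_id: sorted(t for t in set(terms) if counts[t] == 1)
        for cluster_id, terms in cluster_terms.items()
    }


# Placeholder cluster IDs and term labels.
annotations = {
    "cluster_1": ["term_A", "term_B"],
    "cluster_2": ["term_B", "term_C"],
    "cluster_3": ["term_D"],
}
print(unique_annotations(annotations))
```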
The interactive view has two parts: the hierarchical clustering tree (dendrogram) [2] and the expression profiles (heatmap) [3].
Expression profiles are visualized using a blue-white-red color gradient [4]. It is eye-friendly and intuitive: blue denotes genes with low expression values and red denotes highly expressed genes. Rows of the heatmap represent genes and columns represent biological conditions (samples). If present, the sample annotations are shown above the heatmap columns [5].
The tree depicts the hierarchical clustering of the gene expression data, with individual elements at one end and a single cluster containing all elements at the other. Each node of the tree represents a cluster, and the distance represents the similarity of the elements in the cluster: the smaller the distance, the more similar the elements. The scale of the distance is shown on the x-axis [6]. The similarity is measured using Pearson correlation distance and is scaled to the range [0,1].
However, the output differs slightly from a usual dendrogram. While cutting the tree, funcExplorer searches for interesting clusters: clusters that meet the requirements of the cutting strategy and contain a statistically significant annotation. These clusters are denoted by colored rectangles. The size of a rectangle communicates the size of the underlying cluster, i.e. the number of genes in it. The colors of the cluster rectangle encode the annotations found for the cluster. The sizes of the inner rectangles reflect the proportional distribution of cluster annotations by domain [7].
Next to the cluster you can see a grey bar that denotes the proportion of genes in the cluster that are annotated with any of the statistically significant annotations of that cluster [8].
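As a rough illustration of a Pearson correlation distance scaled to [0,1], the sketch below uses d = (1 - r) / 2, which maps r = 1 to 0 and r = -1 to 1. This particular scaling is an assumption made for the example; the exact formula used by funcExplorer is not stated here, and the function names and profiles are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    # Note: a constant profile has zero variance and raises ZeroDivisionError.
    return cov / (norm_x * norm_y)


def scaled_pearson_distance(x, y):
    """Assumed scaling: (1 - r) / 2, so the distance lies in [0, 1]."""
    return (1.0 - pearson_r(x, y)) / 2.0


# Hypothetical expression profiles over three samples.
gene_a = [1.0, 2.0, 3.0]
gene_b = [2.0, 4.0, 6.0]   # perfectly correlated with gene_a
gene_c = [3.0, 2.0, 1.0]   # perfectly anti-correlated with gene_a
print(scaled_pearson_distance(gene_a, gene_b))
print(scaled_pearson_distance(gene_a, gene_c))
```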
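The quantity behind the grey bar can be sketched as the share of cluster genes covered by at least one significant annotation. The helper below is a hypothetical illustration of that proportion, not funcExplorer's implementation; gene and term labels are placeholders.

```python
def annotated_proportion(cluster_genes, term_to_genes):
    """Fraction of cluster genes covered by at least one significant term."""
    cluster = set(cluster_genes)
    if not cluster:
        return 0.0
    annotated = set()
    for genes in term_to_genes.values():
        # Only genes that are actually in the cluster count towards the bar.
        annotated |= cluster & set(genes)
    return len(annotated) / len(cluster)


# Placeholder data: three of the four cluster genes carry an annotation.
cluster = ["g1", "g2", "g3", "g4"]
significant_terms = {"term_A": ["g1", "g2"], "term_B": ["g2", "g3", "g9"]}
print(annotated_proportion(cluster, significant_terms))
```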
The user can also choose to omit the sparse clusters from the final output. In this case the location of sparse clusters is presented with an empty branch.
The search form lets you locate a gene or annotation term among the interesting clusters if the corresponding filter is selected. To search for multiple genes or terms, separate the queries with a semicolon and a space ("; "). The form also suggests keywords while you type, but feel free to search for your own gene/term IDs. The resulting clusters are highlighted in the dendrogram and reported next to the search field.
Short link generation [10] makes it easy to share the clustering result with colleagues without distributing an excessively long link.
The summary table shows a complete report of the clustering result.
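For example, a multi-item query in this format could be split as shown below. The parser is a hypothetical helper written for illustration (it is deliberately tolerant of missing or extra spaces around the semicolon) and does not reflect how the funcExplorer search form is implemented.

```python
def parse_search_query(query):
    """Split a '; '-separated search string into individual gene/term queries."""
    # Split on ';' and strip whitespace so that 'A; B' and 'A;B' both work.
    return [part.strip() for part in query.split(";") if part.strip()]


print(parse_search_query("TP53; BRCA1; GO:0006915"))
```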
The table contains:
Selection [15] lets you keep only a few selected clusters in the output and download their report as a PNG file, so that only the interesting clusters are reported. The data shown in the table can also be downloaded as a CSV file.
The cluster view page contains information about one particular selected cluster. The result is also interactive.
The main parts of the page:
In this section we describe how to analyse and visualize your own dataset.
You can upload your data in the Data upload section. Fill in the fields of the form and follow the instructions. Note that the name of your uploaded file is used as the name of the dataset in further analysis. It is very important to explicitly define the organism of the dataset.
If you provide an e-mail address, the results will be linked to it. You will then be able to see all your preprocessed datasets on one page, and you will get a notification when your data is available for browsing.
For custom delimited files (not SOFT files) we accept sample annotation tracks that will be shown on top of the heatmap. The format of the input file is shown in the image below. The first row of the data is considered to be the header of the file, i.e. the sample/condition identifiers. The sample annotations should lie under the column labels and above the numeric gene expression values. We try to automatically detect the border between sample annotations and numeric data (shown with a green line in the figure below). Nevertheless, after data validation you have the possibility to correct the selection. Annotations are optional; datasets without annotations can be uploaded as well.
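One simple way to picture the border detection is to look for the first row below the header whose value cells are all numeric. The sketch below assumes a tab-separated file whose first column holds row labels; the heuristic, the function names and the sample data are illustrative assumptions and not funcExplorer's actual detection logic.

```python
import csv
import io


def is_number(cell):
    """Return True if the cell can be parsed as a floating point number."""
    try:
        float(cell)
        return True
    except ValueError:
        return False


def split_annotations(text):
    """Split tab-separated text into header, annotation rows and data rows."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    header, body = rows[0], rows[1:]
    border = len(body)  # default: no numeric row found
    for i, row in enumerate(body):
        # The first column is assumed to be the row label, so it is skipped.
        if len(row) > 1 and all(is_number(cell) for cell in row[1:]):
            border = i
            break
    return header, body[:border], body[border:]


# Made-up example: two annotation rows followed by two gene rows.
example = (
    "gene\ts1\ts2\n"
    "tissue\tliver\tbrain\n"
    "sex\tF\tM\n"
    "G1\t1.5\t2.0\n"
    "G2\t0.3\t-1.2\n"
)
header, annotations, data = split_annotations(example)
print(header, len(annotations), len(data))
```

A heuristic like this would misplace the border when an annotation row is itself purely numeric (for example, age), which is one reason a manual correction step after validation is useful.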
To make a dataset available for analysis we need to preprocess it. This step is required because of the size of experimental datasets: clustering and annotation can take a while (for example, a dataset of 15 conditions and 32 000 genes takes approximately 1 hour). During the preprocessing stage, funcExplorer computes the hierarchical clustering and annotates all the clusters of the dataset (there can be as many as 2n-1 clusters, where n is the number of elements in the dataset). We take care of preprocessing your data ourselves, and when it is finished you will get an e-mail with an access link to the results. Preprocessing is performed in such a way that you can later apply the same parameters to your dataset as to our public datasets.
When the uploading and preprocessing of the dataset is done, a link to your results is sent to the contact e-mail address given in the upload form. Through this link you will also see all your previously uploaded and successfully preprocessed datasets.
The speed of the analysis depends heavily on the size of the dataset. By applying the additional threshold or selecting fewer annotation types you reduce the number of computations and speed up the process. You can also increase the minimum size of the potential interesting clusters and decrease the maximum.
You can apply an additional threshold for annotations, which guarantees that all annotations considered when cutting the tree with the best annotation strategy are highly statistically significant. Also allow smaller clusters (from 5 to 100) to be found.
An additional threshold for the annotations found will most probably reduce the size of the picture. Selecting fewer annotation types has a similar effect.
To share your results with colleagues, you can generate a compact link on the view page and share it. The link redirects to the page it was generated on. However, sharing a link to a dataset uploaded to your account (identified by e-mail) also gives the link holder access to your other datasets. To prevent this, you can upload the dataset without specifying an e-mail address.
However, if you have several datasets in a project that you would like to upload under a single account without mixing them up with your other datasets, you can use your Gmail address as the e-mail in the upload process of every dataset in the following way: firstname.lastname+my_project_name@gmail.com. This gives you a funcExplorer link to all the datasets in your project without revealing your other datasets.