Welcome to funcExplorer!

Introduction


funcExplorer [1] is a web-based tool for hierarchical clustering of gene/protein activity data from RNA-seq, microarray, ProtoArray or any other experiment that can be brought into the form of a data matrix, followed by automatic functional enrichment analysis of the derived clusters. The identifiers in the rows must be recognizable by g:Profiler. The unique feature of funcExplorer is the global enrichment analysis of every possible cluster for shared biological function, together with a compact global visualization that highlights the major gene clusters that are co-expressed and statistically significantly enriched in biological terms. funcExplorer uses Gene Ontology, well-curated pathway databases (KEGG, Reactome), regulatory motifs from Transfac, microRNA target sites from miRBase, CORUM protein complexes, Human Protein Atlas (HPA) and Human Phenotype Ontology (HPO) to provide information on the shared regulatory mechanisms of the given genes. All this information is presented in an easily comprehensible bird's-eye view that outlines interesting groups of genes in color-coded form. The idea of enrichment-driven pruning of the hierarchical dendrogram is based on the VisHiC web tool published in 2009 [2]. funcExplorer is a complete rewrite of that tool.

Datasets pass several stages of analysis:

  • Hierarchical clustering
  • Cluster annotation
  • Versatile result visualization

The key feature of funcExplorer is the search for and visualization of clusters that are significantly enriched with biological terms. See the About input section for more details.

funcExplorer supports gene/protein activity datasets from all major organisms in a standardized tab-separated form. Users can upload input datasets through a simple web upload form, and the results are linked either to the given e-mail address or to the dataset alone. We also provide a variety of public datasets. funcExplorer supports hundreds of gene identifier types, so users can submit data with their preferred gene names or database IDs. The supported identifiers are the same as those available in the public web server g:Profiler.


The source code of the funcExplorer web tool is freely available at a GitLab repository and Zenodo under the GNU GPLv3 licence.


[1] funcExplorer was published in 2018 by Kolberg et al. (PDF)

[2] VisHiC was published in 2009 by Krushevskaya et al. (PDF)



Quick start


Getting started

In this section we cover performing the very first analysis and visualization with funcExplorer. The easiest way to learn how to use funcExplorer is to analyse one of the publicly available datasets. This procedure requires only two easy stages of input:

  • First step - dataset selection
    On the home page of funcExplorer, in the Select dataset section, pick one of the available datasets.
  • Second step - selection of analysis and visualization parameters
    Here the user also has several options. For a quick start we suggest the Best annotation option: select it from the Select strategy drop-down and leave the cluster size limits (5 for minimum and 1000 for maximum) and the term type selections as they are. The meaning of the cutting strategies and the influence of the additional parameters are discussed in the About input section.
  • Press the Start the analysis button to proceed with the analysis.

To make things even easier, we provide a sample query right on the welcome page of the application. Click on it and press the Start the analysis button below. You will be redirected to the analysis page.

Take your time to explore the result page; mouse over different parts of the images to see interactive information and links. To learn more about the results and options, see the About output and About input sections. If you have any problems or questions, we will be more than happy to assist you.



About input


To analyse a dataset with funcExplorer, the user must provide an input dataset and indicate the correct organism during the upload process. funcExplorer takes care of the rest.

funcExplorer starts by preprocessing the dataset. During this process the dataset is hierarchically clustered using Hybrid Hierarchical Clustering, with Pearson correlation as the similarity measure between elements. Next, each cluster in the resulting hierarchy is annotated using g:Profiler. Annotations are computed so that the user can later select a multiple testing correction to reduce the number of false positives resulting from numerous enrichment tests. A special correction that takes into account the hierarchical structure of GO is selected by default (the g:SCS method in g:Profiler), but standard methods such as Bonferroni correction and Benjamini-Hochberg FDR correction can also be applied.
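To make the two standard corrections mentioned above concrete, here is a minimal Python sketch of Bonferroni and Benjamini-Hochberg adjustment. It is an illustration only: the function names and the example p-values are made up, it is not funcExplorer's or g:Profiler's code, and the default g:SCS method is not reproduced here.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values from five hypothetical enrichment tests.
example_p = [0.001, 0.01, 0.03, 0.04, 0.5]
bonf = bonferroni(example_p)
bh = benjamini_hochberg(example_p)
```

Bonferroni is the more conservative of the two; Benjamini-Hochberg controls the false discovery rate and typically leaves more annotations significant.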

The second stage of the analysis is performed when the user chooses to analyse and visualize a particular dataset.

The user has three options for cutting the hierarchical tree.

  • Best annotation

    The list of interesting clusters is defined based on the best annotations of the clusters. In other words, each cluster is characterized by the single annotation with the smallest p-value. The resulting clusters are medium-sized.

    The best annotation cutting strategy is performed in two stages:

    • Search of clusters: non-overlapping clusters with a significant annotation are searched for, starting from the smallest p-value. The resulting clusters are highlighted using color-coded rectangles.
    • Clusters that don't contain any dense or interesting clusters are collapsed and marked as having no significant annotation that satisfies the input requirements.
  • F1 annotation

    The F1 cutting strategy is defined based on the F1 measure. This strategy results in the smallest clusters, which represent more specific functional features in the data. Intuitively, it looks for clusters that are "complete": most genes of a functional category fall in the subtree, and a large proportion of the subtree belongs to that category.

    • Search of clusters: non-overlapping clusters with a significant annotation are searched for, starting from the largest F1 score. The resulting clusters are highlighted using color-coded rectangles.
    • Clusters that don't contain any annotations are collapsed and marked as having no significant annotation that satisfies the input requirements.
  • First annotation

    The first annotation cutting strategy is defined based on the clustering hierarchy. This strategy results in the largest clusters, which give a broad overview of the functional features in the data.

    • Search of clusters: starting from the root, the first cluster in each branch that has any significant annotation is presented as a dense cluster. The resulting clusters are highlighted using color-coded rectangles.
    • Clusters that don't contain any annotations are collapsed and marked as having no significant annotation that satisfies the input requirements.
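The search step of the Best annotation and F1 strategies can be sketched as a greedy selection of non-overlapping clusters. The Python below is a simplified illustration, not funcExplorer's implementation: the cluster records, the `select_non_overlapping` helper and the toy p-values are all hypothetical, and the F1 measure is taken to be the usual harmonic mean of precision and recall.

```python
def f1_score(cluster_genes, term_genes):
    """Harmonic mean of precision and recall of a cluster against a term."""
    overlap = len(cluster_genes & term_genes)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cluster_genes)  # share of the cluster in the term
    recall = overlap / len(term_genes)        # share of the term in the cluster
    return 2 * precision * recall / (precision + recall)


def select_non_overlapping(candidates, key, reverse=False):
    """Greedily keep the best-ranked clusters that share no genes."""
    chosen, covered = [], set()
    for cluster in sorted(candidates, key=key, reverse=reverse):
        if covered.isdisjoint(cluster["genes"]):
            chosen.append(cluster)
            covered |= cluster["genes"]
    return chosen


# Toy hierarchy (hypothetical): root > {left > leftleft, right}.
term = {"a", "b"}  # genes of one made-up functional term
clusters = [
    {"id": "root", "genes": {"a", "b", "c", "d", "e", "f"}, "p": 1e-4},
    {"id": "left", "genes": {"a", "b", "c"}, "p": 1e-6},
    {"id": "leftleft", "genes": {"a", "b"}, "p": 1e-5},
    {"id": "right", "genes": {"d", "e", "f"}, "p": 1e-3},
]
for c in clusters:
    c["f1"] = f1_score(c["genes"], term)

# Best annotation: start from the smallest p-value.
best = select_non_overlapping(clusters, key=lambda c: c["p"])
# F1 annotation: start from the largest F1 score, among clusters that hit the term.
by_f1 = select_non_overlapping(
    [c for c in clusters if c["f1"] > 0], key=lambda c: c["f1"], reverse=True
)
```

In this toy example the p-value ordering keeps the clusters "left" and "right", while the F1 ordering keeps only the smaller cluster "leftleft", which matches the term exactly.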

In addition to the cutting strategy, it is also possible to set:

  • Minimum cluster size
    - the minimum size of a cluster that can be detected. Try to keep it fairly large: if you set it to, say, 5 or 10, you will end up with a large number of tiny clusters and a huge picture that cannot be grasped at once.
  • Maximum cluster size
    - the maximum size of the cluster that can be detected.
  • Additional threshold
    - an additional threshold for cluster annotations; annotations whose p-value is above the threshold are not used during the tree cutting process.
  • Minimum term size
    - the minimum size for functional terms that can be used to calculate strategy scores.
  • Maximum term size
    - the maximum size for functional terms that can be used to calculate strategy scores. For example, if set to 700, then terms that are known to be related to more than 700 genes are excluded. In case of GO this means that less than 5% of GO terms are excluded.
  • Select term types
    - select term types for cluster annotations; annotations of types that are not selected are not used during the tree cutting process.
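The combined effect of these parameters can be pictured as a filter applied to each cluster's annotations before the tree is cut. The following Python sketch is an assumption about how such a filter could look, not the tool's code: the `Annotation` record and `usable_annotations` function are hypothetical, the cluster size limits 5 and 1000 and the term size 700 come from the text above, and the p-value threshold and minimum term size defaults are arbitrary placeholders.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    term_id: str
    term_type: str  # abbreviation of the domain, e.g. "BP", "keg", "rea"
    term_size: int  # number of genes known to be related to the term
    p_value: float


def usable_annotations(annotations, cluster_size, *, min_cluster=5,
                       max_cluster=1000, p_threshold=0.05, min_term=5,
                       max_term=700, term_types=("BP", "CC", "MF")):
    """Return the annotations allowed to take part in tree cutting."""
    # Clusters outside the size limits cannot be detected at all.
    if not min_cluster <= cluster_size <= max_cluster:
        return []
    return [
        a for a in annotations
        if a.p_value <= p_threshold
        and min_term <= a.term_size <= max_term
        and a.term_type in term_types
    ]


# Made-up annotations of one hypothetical cluster.
example = [
    Annotation("term_A", "BP", 120, 1e-8),   # kept
    Annotation("term_B", "BP", 5000, 1e-9),  # term too large
    Annotation("term_C", "keg", 80, 1e-6),   # term type not selected
    Annotation("term_D", "MF", 60, 0.2),     # p-value above the threshold
]
kept = usable_annotations(example, cluster_size=50)
```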


About output


funcExplorer output consists of several pages:
  • Main hierarchical clustering visualization of dataset
    • Explore the tree: shows dataset dendrogram and heatmap, is interactive and contains links to cluster view
    • Summary: shows tabular summary of clusters
    • Unique annotations: shows functions that come out only in one cluster and therefore are unique in this dataset
    • Genes: lists all the genes together with names, descriptions and corresponding cluster ID
  • Cluster view
    • Explore the cluster: dendrogram and heatmap of the cluster; the view is interactive and contains links to inner clusters for zooming in. Expression profiles and functional topic wordclouds are also shown.
    • Annotations: shows all the significant functions in the cluster
    • Genes: shows the list of genes in the cluster

All these pages contain links for downloading the resulting images as PNG files and tables as PNG or CSV files.


Main hierarchical clustering visualization of dataset

This page contains three main parts [1]: an interactive view that represents the dataset, a list of interesting clusters with each cluster's best annotation per domain and its p-value (summary page), and a list of unique statistically significant annotations (unique annotations page). Unique annotations are annotations that are present in only one cluster of the given dataset. In addition, we provide the list of genes in the dataset with descriptions.

The interactive view has two parts: the hierarchical clustering tree (dendrogram) [2] and the expression profiles (heatmap) [3].

Expression profiles or heatmap

Expression profiles are visualized using a blue-white-red color gradient [4]. It is eye-friendly and intuitive: blue denotes genes with low expression values and red denotes highly expressed genes. Rows of the heatmap represent genes and columns represent biological conditions (samples). If present, the sample annotations are shown above the heatmap columns [5].

Hierarchical clustering tree or dendrogram

The tree depicts the hierarchical clustering of gene expression data, with individual elements at one end and a single cluster containing all elements at the other. Each node of the tree represents a cluster, and the distance represents the similarity of the elements in the cluster: the smaller the distance, the more similar the elements are. The scale of the distance is shown on the x-axis [6]. The similarity is measured using Pearson correlation distance and is scaled to the range [0,1].
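As an illustration of a Pearson correlation distance on a [0,1] scale, the Python sketch below maps the correlation r to (1 - r) / 2. This particular scaling is an assumption made for the example; the text above only states that the distance is scaled to [0,1], and the function name is hypothetical.

```python
import math


def pearson_distance(x, y):
    """Pearson correlation distance mapped to [0, 1] as (1 - r) / 2."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    r = cov / (sd_x * sd_y)
    # r = 1 gives distance 0, r = 0 gives 0.5, r = -1 gives 1.
    return (1 - r) / 2


same_shape = pearson_distance([1, 2, 3], [2, 4, 6])
opposite = pearson_distance([1, 2, 3], [3, 2, 1])
unrelated = pearson_distance([1, 2, 3, 4], [1, -1, -1, 1])
```

Under this assumed scaling, two genes with proportional profiles are at distance 0, uncorrelated genes at 0.5 and perfectly anti-correlated genes at 1.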

However, the output is slightly different from the usual one. While cutting the tree, funcExplorer searches for interesting clusters: clusters that meet the requirements of the cutting strategy and contain a statistically significant annotation. These clusters are denoted by colored rectangles. The size of the rectangle communicates the size of the underlying cluster, i.e. the number of genes in it. The colors of the cluster rectangle encode the annotations found for the cluster. The sizes of the inner rectangles reflect the proportional distribution of cluster annotations by domain [7].

Next to the cluster you can see a grey bar that denotes the proportion of genes in the cluster that are annotated with any of its statistically significant annotations [8].
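The quantity behind the grey bar can be sketched in a few lines of Python. The function name and the gene and term identifiers below are hypothetical; the sketch simply counts cluster genes covered by at least one significant term.

```python
def annotated_fraction(cluster_genes, significant_terms):
    """Share of cluster genes covered by at least one significant term."""
    annotated = set()
    for term_genes in significant_terms.values():
        annotated |= cluster_genes & term_genes
    return len(annotated) / len(cluster_genes)


# Made-up cluster and term memberships.
cluster = {"g1", "g2", "g3", "g4"}
terms = {"term_A": {"g1", "g2", "g9"}, "term_B": {"g2", "g3"}}
fraction = annotated_fraction(cluster, terms)
```

Here three of the four genes belong to at least one term, so the bar would cover 75% of its length.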

Color codes:
  • GO: Biological Process, also abbreviated as BP
  • GO: Cellular Component, also abbreviated as CC
  • GO: Molecular Function, also abbreviated as MF
  • KEGG, also abbreviated as keg
  • Reactome, also abbreviated as rea
  • Transfac, also abbreviated as tf
  • miRNA, also abbreviated as mi
  • CORUM, also abbreviated as cor
  • Human Phenotype Ontology, also abbreviated as hp
  • Human Protein Atlas, also abbreviated as hpa

  • Grey stands for clusters with no significant annotation found (sparse clusters) [11]

The user can also choose to omit the sparse clusters from the final output. In this case the location of each sparse cluster is shown as an empty branch.

The output is interactive:
  • Mouse over any of the rectangles to see additional information about the cluster and the annotations found for it [9]
  • Click on the cluster itself to go to the cluster view page
  • Mouse over any of the grey bars to see how many genes are annotated in that cluster
  • Mouse over the heatmap to see the names of the corresponding gene and sample together with the expression value
  • Mouse over the sample annotations above the heatmap to see their values









Searching

The search form finds the location of a gene or annotation term among the interesting clusters if the corresponding filter is selected. Multiple genes/terms can be searched by separating the queries with a semicolon and a space ("; "). The form also suggests keywords while you type, but feel free to search for your own gene/term IDs. The resulting clusters are highlighted in the dendrogram and reported next to the search field.
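A multi-query search of this kind can be pictured as follows. This Python sketch is hypothetical: the data layout and function names are invented for illustration, and it splits on the semicolon alone and trims whitespace so that it also tolerates a missing space.

```python
def parse_queries(text):
    """Split a search string such as 'TP53; GO:0006281' into single queries."""
    return [q.strip() for q in text.split(";") if q.strip()]


def find_clusters(text, clusters):
    """Map each query to the sorted IDs of clusters containing it."""
    hits = {}
    for query in parse_queries(text):
        hits[query] = sorted(
            cid for cid, c in clusters.items()
            if query in c["genes"] or query in c["terms"]
        )
    return hits


# Illustrative clusters; the memberships are made up.
example_clusters = {
    "c1": {"genes": {"TP53", "MDM2"}, "terms": {"GO:0006915"}},
    "c2": {"genes": {"BRCA1"}, "terms": {"GO:0006281"}},
}
result = find_clusters("TP53; GO:0006281", example_clusters)
```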


Compact link generation

Short link generation [10] makes it easy to share the clustering result with colleagues without distributing an excessively long link.


Summary

The summary table shows a complete report of the clustering result.

The table contains:

  • Domain Best Annotations table [12] shows the top functional terms from each of the analysed domains together with the p-value, identifier, description, size of the overlap between the given cluster and the function, and size of the function.
  • Topic [13] column shows the top functions enriched in the corresponding cluster and thereby represents the functional topic of the cluster. The words are scaled according to -log10(p-value) across all the clustering results. Color denotes the domain of the function. Hovering over the words shows the full description, p-value and term id.
  • Eigengene profile [14] shows the expression profile that represents the cluster. This is achieved by calculating the first principal component for every cluster. Hovering over the bullets shows the corresponding sample name. The eigengene profiles can be downloaded in a CSV file.
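The eigengene idea can be sketched in plain Python with power iteration. This is an illustration under stated assumptions, not funcExplorer's code: each gene is centered on its own mean, the profile is the leading component across samples, and its sign is chosen to follow the average centered profile. The tool's actual preprocessing and sign convention are not described here, and the function name is hypothetical.

```python
import math


def eigengene(matrix, iterations=200):
    """First principal component across samples for a genes x samples matrix."""
    # Center each gene (row) so the component reflects variation over samples.
    centered = [[v - sum(row) / len(row) for v in row] for row in matrix]
    n_samples = len(centered[0])
    # Non-constant start vector; a constant one is annihilated by centering.
    v = [float(j + 1) for j in range(n_samples)]
    for _ in range(iterations):
        # One power-iteration step with X^T X, computed as X^T (X v).
        scores = [sum(r[j] * v[j] for j in range(n_samples)) for r in centered]
        w = [sum(r[j] * s for r, s in zip(centered, scores))
             for j in range(n_samples)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            raise ValueError("no variation left after centering")
        v = [x / norm for x in w]
    # Resolve the sign ambiguity: follow the average centered profile.
    mean_profile = [sum(r[j] for r in centered) / len(centered)
                    for j in range(n_samples)]
    if sum(a * b for a, b in zip(v, mean_profile)) < 0:
        v = [-x for x in v]
    return v


# Three made-up genes that all rise across three samples.
profile = eigengene([[1, 2, 3], [2, 4, 6], [0, 1, 2]])
```

For these three genes the resulting profile increases from the first sample to the last, mirroring the shared trend of the cluster.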

Selection [15] lets you keep only a few selected clusters in the output and download their report as PNG, so that only the interesting clusters are reported. The data shown in the table can also be downloaded as a CSV file.


Cluster view

The cluster view page contains information about one selected cluster. This view is also interactive.

The main parts of the page:

  • Gene expression profiles [16] show the variability inside the gene cluster. Mouse over a line to see the gene it corresponds to.
  • Functional topic [17] of the cluster is shown in the wordcloud. The words are scaled according to -log10(p-value) and the colors represent the corresponding domain. Hovering over a word shows the full name and ID of the functional term together with its p-value in this cluster.
  • Interactive heatmap and dendrogram for the cluster. Significant annotations for all subclusters are shown using previously introduced color-coded rectangles. Mouse over the rectangle and see the breadth of the cluster shown by highlighting it in the tree. Clicking on the rectangle redirects to the cluster view of the selected subcluster (inner node).[18]
  • List of cluster best annotations regarding p-value. The list is downloadable as CSV (comma separated) or PDF file.
  • Complete list of genes from the cluster. The list is downloadable as CSV or PDF file.
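The -log10(p-value) scaling used for the wordcloud can be illustrated with a short Python sketch. The linear mapping to a font size range and the sizes 10 to 40 are assumptions made for this example, and the function name is hypothetical.

```python
import math


def word_sizes(term_p_values, min_size=10, max_size=40):
    """Map each term to a font size that grows with -log10(p-value)."""
    scores = {t: -math.log10(p) for t, p in term_p_values.items()}
    low, high = min(scores.values()), max(scores.values())
    span = high - low
    sizes = {}
    for term, score in scores.items():
        # With a single distinct score every word gets the maximum size.
        fraction = (score - low) / span if span else 1.0
        sizes[term] = min_size + fraction * (max_size - min_size)
    return sizes


# Made-up p-values for three hypothetical terms.
sizes = word_sizes({"term_A": 1e-10, "term_B": 1e-2, "term_C": 1e-6})
```

The logarithm keeps very small p-values from dwarfing all other words. A p-value of exactly 0 would need to be clipped before calling the function.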




Using private dataset


In this section we describe how to analyse and visualize your own dataset.

Upload your dataset

You can upload your data in the Data upload section. Fill in the fields of the form and follow the instructions. Note that the name of your uploaded file is used as the name of the dataset in further analysis. It is very important to explicitly define the organism of the dataset.

If an e-mail address is given, the results will be linked to it: you will be able to see all your preprocessed datasets on one page, and you will be notified when your data is available for browsing.

Please note that by default we keep the results for 365 days, but on request we can keep the data for an extended period of time. For example, if the dataset link accompanies a scientific publication, we are ready to keep the dataset permanently.


Sample annotations

For custom delimited files (not SOFT files) we accept sample annotation tracks that will be shown on top of the heatmap. The format of the input file is shown in the image below. The first row of the data is considered to be the header of the file, i.e. the sample/condition identifiers. The sample annotations should lie under the column labels and above the numeric gene expression values. We try to automatically detect the border between the sample annotations and the numeric data (shown with a green line in the figure below). After data validation, you can still correct the selection. Annotations are optional; datasets without annotations can be uploaded as well.
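One simple way to detect the border between annotation rows and numeric rows is sketched below in Python. This heuristic is an assumption for illustration and not necessarily what funcExplorer does: it treats the first row whose values all parse as numbers as the start of the expression data. The function name and the example file content are made up.

```python
def split_annotations(text):
    """Split tab-separated input into header, annotation rows and data rows."""
    rows = [line.split("\t") for line in text.splitlines() if line.strip()]
    header, body = rows[0], rows[1:]

    def is_numeric_row(row):
        # The first column holds the row label, the rest must be numbers.
        try:
            for value in row[1:]:
                float(value)
        except ValueError:
            return False
        return True

    # Border = index of the first fully numeric row (or the end of the file).
    border = next(
        (i for i, row in enumerate(body) if is_numeric_row(row)), len(body)
    )
    return header, body[:border], body[border:]


example_file = (
    "id\ts1\ts2\n"
    "tissue\tliver\tbrain\n"
    "sex\tF\tM\n"
    "geneA\t1.5\t2.0\n"
    "geneB\t0.1\t-0.3\n"
)
header, annotations, data = split_annotations(example_file)
```

A purely numeric annotation row (for example an age track) would be mistaken for expression data by this heuristic, which is one reason a manual correction step after validation is useful.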


Preprocessing

To make a dataset available for analysis we need to preprocess it. This step is required because of the size of experimental datasets: clustering and annotation can take a while (for example, a dataset of 15 conditions and 32 000 genes takes approximately 1 hour). During the preprocessing stage, funcExplorer calculates the hierarchical clustering and annotates all the clusters of the dataset (a binary clustering tree over n elements contains n-1 clusters with more than one element). We take care of the preprocessing of your data ourselves, and when it is finished you will get an e-mail with an access link to the results. Preprocessing is performed in such a way that you can later apply the same parameters to your dataset as to our public datasets.


Getting the results

When the uploading and preprocessing of the dataset is done, you will get an e-mail with a link to your results at the contact e-mail address given in the upload form. Via this link you will also see all your previously uploaded and successfully preprocessed datasets.


Now you can go to the page and visualize your results.



Tips&Tricks


To speed up the analysis

The speed of the analysis depends strongly on the size of the dataset. By applying the additional threshold or selecting fewer annotation types you reduce the number of computations and speed up the process. You can also increase the minimum size of the potential interesting clusters and decrease the maximum size.


To get more precise results

You can apply an additional threshold for annotations, which guarantees that all the annotations considered when cutting the tree with the best annotation strategy are highly statistically significant. Allow smaller clusters (from 5 to 100) to be found.


Playing with the size of the output

An additional threshold for the annotations will most probably reduce the size of the picture. Selecting fewer annotation types has a similar effect.


Sharing the results

To share your results with colleagues, you can generate a compact link on the view page and share it. The link redirects to the page it was generated on. However, sharing a link to a dataset uploaded to your account (identified by e-mail) also gives the link holder access to your other datasets. To prevent this, you can upload the dataset without specifying an e-mail address.

If you have several datasets in a project that you would like to upload under a single account without mixing them up with your other datasets, you can use your Gmail address in the upload form of every dataset in the following way: firstname.lastname+my_project_name@gmail.com. This gives you a funcExplorer link to all the datasets in your project without revealing your other datasets.

