Welcome to VisHiC 2.0!

Introduction


VisHiC [1] is a web-based tool for hierarchical clustering of gene expression data followed by automatic functional enrichment analysis of the resulting clusters. The unique feature of VisHiC is the global enrichment analysis of every possible cluster for shared biological function, together with a compact global visualization that highlights the major gene clusters that are co-expressed and statistically significantly enriched in biological terms. VisHiC uses Gene Ontology, well-curated pathway databases (KEGG, Reactome), regulatory motifs from Transfac, microRNA target sites from miRBase, CORUM protein complexes, the Human Protein Atlas (HPA), the Human Phenotype Ontology (HPO) and Online Mendelian Inheritance in Man (OMIM) to provide information on the shared regulatory mechanisms of the given genes. All this information is presented in an easily comprehensible bird's-eye view that outlines interesting groups of genes in color-coded form.

Datasets pass several stages of analysis:

  • Hierarchical clustering
  • Cluster annotation
  • Versatile result visualization

A key feature of VisHiC is the search for and visualization of clusters that are significantly enriched in biological terms. See the About input section for more details.

VisHiC supports gene expression datasets from all major organisms, provided they come in standardized tab-separated form. Users can upload gene expression datasets through a simple web upload form, and the results will be linked to the given e-mail address. We also provide a variety of public datasets. VisHiC supports hundreds of types of gene identifiers, so users can submit data with their favourite gene names or database IDs. The supported identifiers are the same as those available in the public web server g:Profiler.
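The exact column layout of the tab-separated input is not spelled out here, so the following is only a minimal sketch of reading such a file before upload. It assumes that the first row holds sample names, the first column holds gene identifiers and all other cells are numeric; the function name and the example values are hypothetical.

```python
import csv
import io


def read_expression_tsv(handle):
    """Read a tab-separated expression matrix.

    Assumed layout (not confirmed by the VisHiC documentation): first row =
    sample names, first column = gene identifiers, other cells = numbers.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]
    profiles = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        gene_id, values = row[0], row[1:]
        if len(values) != len(samples):
            raise ValueError(
                f"row for {gene_id!r} has {len(values)} values, "
                f"expected {len(samples)}"
            )
        profiles[gene_id] = [float(v) for v in values]
    return samples, profiles


# Hypothetical two-gene, two-sample dataset.
example = "gene\tsample1\tsample2\nTP53\t1.5\t-0.2\nBRCA1\t0.3\t2.1\n"
samples, profiles = read_expression_tsv(io.StringIO(example))
```

A check like this catches ragged rows and non-numeric cells locally, before the dataset is sent for preprocessing.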


[1] VisHiC was first published in 2009 by Krushevskaya et al. (PDF)



Quick start


Getting started

This section covers running your first analysis and visualization with VisHiC. The easiest way to learn VisHiC is to analyse one of the publicly available datasets. This requires only two easy steps of input:

  • First step - dataset selection
    On the VisHiC home page, in the Select dataset section, pick one of the available datasets.
  • Second step - selection of analysis and visualization parameters
    Here the user has several options. For a quick start we suggest the Best annotation option: select it from the select strategy drop-down and leave the cluster size limits (5 for minimum and 1000 for maximum) and the term type selections as they are. The meaning of the cutting strategies and the influence of the additional parameters are discussed in the About input section.
  • Press the Start the analysis button to proceed with the analysis.

To make things even easier, we provide a sample query right on the welcome page of the application. Just click on it and press the Start the analysis button below. You will be redirected to the analysis page.

Take your time exploring the result page: mouse over different parts of the images to see interactive information and links. To learn more about the results and options, see the About output and About input sections. If you have any problems or questions, we will be more than happy to assist you.



About input


To analyse a dataset with VisHiC, the user must provide a gene expression dataset and indicate the correct organism during the upload process. VisHiC takes care of the rest.

VisHiC starts by preprocessing the dataset. During this process the dataset is hierarchically clustered using Hybrid Hierarchical Clustering. Both the Pearson correlation and the Euclidean distance are used to measure the similarity between elements, so the user can select the measure they prefer. Next, each cluster of the resulting hierarchy is annotated using g:Profiler. Annotations are computed so that the user can later select a multiple testing correction to reduce the number of false positives arising from the numerous enrichment tests. A special correction that takes into account the hierarchical structure of GO (the g:SCS method in g:Profiler) is selected by default, but it is also possible to apply standard methods such as the Bonferroni correction and the Benjamini-Hochberg FDR correction.
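To make the two standard corrections concrete, here is a minimal sketch of the textbook Bonferroni and Benjamini-Hochberg adjustments. It is not VisHiC or g:Profiler code, and it does not cover g:SCS, whose procedure is specific to g:Profiler; the function names and the example p-values are hypothetical.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw enrichment p-values for five terms.
raw = [0.001, 0.01, 0.02, 0.04, 0.5]
bonf = bonferroni(raw)
bh = benjamini_hochberg(raw)
```

On the same input, Bonferroni is the stricter of the two: every Benjamini-Hochberg adjusted value is less than or equal to the corresponding Bonferroni value.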

The second stage of analysis is performed when the user chooses to analyse and visualize a particular dataset.

The user has three options for cutting the hierarchical tree:

  • Best annotation
    The list of interesting clusters is defined by the best annotation of each cluster. In other words, each cluster is characterized by its single best annotation, the one with the smallest p-value.

    The best annotation cutting strategy is performed in two stages:

    • Search for dense clusters: non-overlapping clusters with a significant annotation are searched for, starting from the smallest p-value. The resulting clusters are highlighted using color-coded rectangles.
    • Clusters that do not contain any dense or interesting clusters are collapsed and marked as having no significant annotation that satisfies the input requirements.

  • Annotation score
    The list of interesting clusters is based on the cumulative scores of the clusters. For each cluster, a score representing the average quality of its annotations is computed.

    The annotation score cutting strategy is also performed in two stages:

    • Search for dense clusters: non-overlapping clusters with a list of significant annotations are searched for, starting from the largest annotation score. The resulting clusters are highlighted using color-coded rectangles.
    • Clusters that do not contain any dense or interesting clusters are collapsed and marked as having no significant annotation that satisfies the input requirements.
  • First annotation
    The first annotation cutting strategy is defined by the clustering hierarchy.

    • Search for dense clusters: starting from the root, the first cluster in each branch that has any significant annotation is presented as a dense cluster. The resulting clusters are highlighted using color-coded rectangles.
    • Clusters that do not contain any annotations are collapsed and marked as having no significant annotation that satisfies the input requirements.
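The three cutting strategies can be illustrated with a small sketch. This is not the VisHiC implementation: the Cluster class is a hypothetical stand-in for a node of the clustering tree, the greedy selection is one plausible reading of "non-overlapping clusters searched starting from the best", and the annotation score (mean of -log10 of the p-values) is an assumption, since the actual formula is not given here.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical node of the clustering tree (not VisHiC's own data model)."""
    name: str
    genes: frozenset
    p_values: list = field(default_factory=list)  # significant annotations only
    children: list = field(default_factory=list)


def all_clusters(root):
    """Yield every cluster in the hierarchy."""
    stack = [root]
    while stack:
        node = stack.pop()
        yield node
        stack.extend(node.children)


def greedy_cut(root, rank_key):
    """Take annotated clusters best-first, skipping any that overlap a chosen one."""
    candidates = sorted((c for c in all_clusters(root) if c.p_values), key=rank_key)
    chosen, covered = [], set()
    for cluster in candidates:
        if covered.isdisjoint(cluster.genes):
            chosen.append(cluster)
            covered |= cluster.genes
    return chosen


def best_annotation_cut(root):
    # Rank each cluster by its single smallest p-value.
    return greedy_cut(root, rank_key=lambda c: min(c.p_values))


def annotation_score(cluster):
    # Assumed score: mean of -log10(p); the real formula is not given in the text.
    return sum(-math.log10(p) for p in cluster.p_values) / len(cluster.p_values)


def annotation_score_cut(root):
    # Largest score first, hence the negation.
    return greedy_cut(root, rank_key=lambda c: -annotation_score(c))


def first_annotation_cut(root):
    """From the root down, keep the first annotated cluster on each branch."""
    if root.p_values:
        return [root]
    chosen = []
    for child in root.children:
        chosen.extend(first_annotation_cut(child))
    return chosen


# Toy hierarchy: the root {a, b, c, d} splits into {a, b} and {c, d}.
ab = Cluster("ab", frozenset("ab"), [1e-6, 1e-2])
cd = Cluster("cd", frozenset("cd"), [1e-3])
root = Cluster("root", frozenset("abcd"), [1e-5], [ab, cd])

best = [c.name for c in best_annotation_cut(root)]
scored = [c.name for c in annotation_score_cut(root)]
first = [c.name for c in first_annotation_cut(root)]
```

In this toy tree the best annotation strategy selects the two small clusters, because "ab" holds the smallest single p-value, whereas the assumed score and the first annotation strategy both select the root.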

In addition to the cutting strategy, it is also possible to set:

  • Minimum size
    - the minimum size of a cluster that can be marked as dense. Try to keep it fairly large: if you set it to, say, 5 or 10, you will end up with a large number of tiny clusters and a huge picture that cannot be grasped at once.
  • Maximum size
    - the maximum size of a cluster that can be marked as dense.
  • Additional threshold
    - an additional threshold for cluster annotations: annotations whose p-value is above the threshold are not used during the tree cutting process.
  • Select term types
    - the term types used for cluster annotations: annotations of types that are not selected are not used during the tree cutting process.
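The effect of these four parameters can be sketched as a pair of filters. The dictionary layout, the function names and the term labels below are hypothetical; only the rules themselves (size limits, p-value threshold, selected term types) come from the description above, with the size defaults taken from the quick start.

```python
def usable_annotations(annotations, p_threshold, term_types):
    """Drop annotations above the p-value threshold or of an unselected type."""
    return [
        a for a in annotations
        if a["p_value"] <= p_threshold and a["type"] in term_types
    ]


def can_be_dense(cluster_size, annotations, min_size=5, max_size=1000):
    """A cluster qualifies only if its size is in range and an annotation remains."""
    return min_size <= cluster_size <= max_size and bool(annotations)


# Hypothetical annotations of one cluster.
annotations = [
    {"term": "term A", "type": "BP", "p_value": 1e-8},
    {"term": "term B", "type": "keg", "p_value": 1e-3},
    {"term": "term C", "type": "CC", "p_value": 0.04},
]

kept = usable_annotations(annotations, p_threshold=0.01, term_types={"BP", "CC"})
dense_ok = can_be_dense(cluster_size=40, annotations=kept)
too_small = can_be_dense(cluster_size=3, annotations=kept)
```

Here "term B" is dropped because its type is not selected and "term C" because its p-value exceeds the threshold, so the cluster is judged on "term A" alone.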


About output


VisHiC output consists of several pages:

  • Main hierarchical clustering visualization of dataset
    • Explore the tree: shows dataset dendrogram and heatmap, is interactive and contains links to cluster view
    • Summary
    • Unique annotations
    • Genes
  • Cluster view
    • Explore the cluster: dendrogram and heatmap of cluster, is interactive and contains links to inner clusters
    • Annotations
    • Genes

Main hierarchical clustering visualization of dataset

This page contains three main parts: an interactive view that represents the dataset; a list of interesting clusters with each cluster's best annotations, their domain and p-value (the summary page); and a list of unique statistically significant annotations. Unique annotations are annotations that are present in only one dense cluster of the given dataset. In addition, we provide the list of genes in the dataset with descriptions.

There are two parts in the interactive view - expression profiles (heatmap) [1] and a hierarchical clustering tree (dendrogram) [2].

Expression profiles or heatmap

Expression profiles are visualized using a blue-white-red color gradient [9]. It is easy on the eyes and intuitive: blue denotes genes with low expression values and red denotes highly expressed genes. Rows of the heatmap represent genes and columns represent biological conditions (samples). If present, the sample annotations are shown above the heatmap columns [3].
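As an illustration of such a gradient, the sketch below maps an expression value to an RGB color by linear interpolation through white at the midpoint. The interpolation scheme, the clamping of outliers and the function name are assumptions; the exact palette used by VisHiC is not described here.

```python
def blue_white_red(value, low, high):
    """Map a value in [low, high] to an (r, g, b) tuple; requires low < high."""
    mid = (low + high) / 2
    # Clamp so that outliers saturate at pure blue or pure red.
    value = max(low, min(high, value))
    if value <= mid:
        t = (value - low) / (mid - low)  # 0 at low, 1 at the midpoint
        return (round(255 * t), round(255 * t), 255)
    t = (high - value) / (high - mid)  # 1 at the midpoint, 0 at high
    return (255, round(255 * t), round(255 * t))


lowest = blue_white_red(-2.0, low=-2.0, high=2.0)
middle = blue_white_red(0.0, low=-2.0, high=2.0)
highest = blue_white_red(2.0, low=-2.0, high=2.0)
```

With a symmetric range such as [-2, 2], unchanged expression maps to white, which keeps the eye on the strongly blue and strongly red cells.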

Hierarchical clustering tree or dendrogram

The tree depicts the hierarchical clustering of the gene expression data, with individual elements at one end and a single cluster containing all elements at the other. Each node of the tree represents a cluster, and the distance represents the similarity of the elements in the cluster: the smaller the distance, the more similar the elements. The scale of the distance is shown on the x-axis [4]. The similarity is measured using either the Pearson correlation or the Euclidean distance and is scaled to the range [0,1].
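How the two measures are brought to the range [0,1] is not specified here, so the sketch below shows one common choice for each and should be read as an assumption: the Pearson correlation r is turned into the distance (1 - r) / 2, and Euclidean distances are divided by the largest pairwise distance. The function names and the example profiles are hypothetical.

```python
import math
from itertools import combinations


def pearson_distance(x, y):
    """(1 - r) / 2: 0 for perfectly correlated profiles, 1 for anti-correlated."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    r = cov / (norm_x * norm_y)
    return (1 - r) / 2


def scaled_euclidean_distances(profiles):
    """Pairwise Euclidean distances divided by the largest pairwise distance."""
    raw = {
        (g1, g2): math.dist(profiles[g1], profiles[g2])
        for g1, g2 in combinations(profiles, 2)
    }
    largest = max(raw.values())  # needs at least two profiles
    if largest == 0:
        return {pair: 0.0 for pair in raw}
    return {pair: d / largest for pair, d in raw.items()}


same_shape = pearson_distance([1, 2, 3], [2, 4, 6])
opposite = pearson_distance([1, 2, 3], [3, 2, 1])
scaled = scaled_euclidean_distances({"a": [0, 0], "b": [3, 4], "c": [6, 8]})
```

The example also shows the practical difference between the two measures: the profiles [1, 2, 3] and [2, 4, 6] are identical in shape for Pearson, although they are far from identical in Euclidean terms.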

However, the picture differs slightly from the usual dendrogram. While cutting the tree, VisHiC searches for interesting clusters: clusters that meet the requirements of the cutting strategy and contain a statistically significant annotation. These clusters are denoted by colored rectangles. The size of a rectangle reflects the size of the underlying cluster, that is, the number of genes in it. The colors of the rectangle encode the annotations found for the cluster, and the sizes of the inner rectangles reflect the proportional distribution of the cluster annotations by domain [10].

Next to each cluster rectangle is a grey bar that shows the proportion of genes in the cluster that are annotated with any of its statistically significant annotations [5].
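The quantity behind the grey bar can be sketched as the size of the union of the annotated gene sets divided by the cluster size. The data layout, function name and gene labels are hypothetical.

```python
def annotated_fraction(cluster_genes, term_to_genes):
    """Share of cluster genes covered by at least one significant annotation."""
    cluster_genes = set(cluster_genes)
    if not cluster_genes:
        raise ValueError("cluster is empty")
    # A gene counts once, however many terms annotate it.
    annotated = set().union(*term_to_genes.values()) & cluster_genes
    return len(annotated) / len(cluster_genes)


# Hypothetical cluster of four genes with two significant terms.
fraction = annotated_fraction(
    ["g1", "g2", "g3", "g4"],
    {"term A": {"g1", "g2"}, "term B": {"g2", "g3"}},
)
```

Gene g2 carries both terms but is counted only once, so three of the four genes are covered.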

Color codes:
  • GO: Biological Process also abbreviated as BP
  • GO: Cellular Component also abbreviated as CC
  • GO: Molecular Function also abbreviated as MF
  • KEGG also abbreviated as keg
  • Reactome also abbreviated as rea
  • Transfac also abbreviated as tf
  • miRNA also abbreviated as mi
  • CORUM also abbreviated as cor
  • OMIM also abbreviated as omi
  • Human Phenotype Ontology also abbreviated as hp
  • Human Protein Atlas also abbreviated as hpa

  • The grey color stands for the clusters with no significant annotation found (sparse clusters) [6]

The user can also choose to omit the sparse clusters from the final output. In this case the location of sparse clusters is presented with an empty branch [7].

The picture is interactive:
  • Mouse over any of the rectangles and you will see additional information concerning the cluster and annotations found for the cluster [8]
  • Click on the cluster itself and you will be taken to the cluster view page
  • Mouse over any of the grey bars and you will see how many genes are annotated in that cluster [5]
  • Mouse over the heatmap and you will see the names of the corresponding gene and sample together with the expression value
  • Mouse over the sample annotations above the heatmap and you will see their values





Searching

The search form lets you locate a gene or annotation term among the interesting clusters, provided the corresponding filter is selected. To search for multiple genes or terms, separate the queries with a semicolon and a space ("; "). The form suggests keywords while you type, but feel free to search for your own gene or term IDs. The matching clusters are highlighted in the dendrogram.
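The query format can be sketched as follows. Only the "; " separator comes from the description above; the function names, the exact case-sensitive matching and the rule that a cluster matches when it contains any of the queried items are assumptions.

```python
def parse_search_query(query):
    """Split a query on "; " and drop empty parts."""
    return [part.strip() for part in query.split("; ") if part.strip()]


def matching_clusters(query, cluster_members):
    """Names of clusters that contain at least one queried gene or term."""
    wanted = set(parse_search_query(query))
    return [
        name for name, members in cluster_members.items()
        if wanted & set(members)
    ]


# Hypothetical interesting clusters and their genes.
clusters = {
    "cluster 1": {"TP53", "MDM2"},
    "cluster 2": {"BRCA1"},
    "cluster 3": {"GAPDH"},
}

terms = parse_search_query("TP53; BRCA1")
hits = matching_clusters("TP53; BRCA1", clusters)
```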


Cluster view

The cluster view page contains information about one selected cluster. This page is also interactive.

The main parts of the page:

  • Gene expression profiles show the variability inside the gene cluster. Mouse over a line to see the gene it corresponds to [11].
  • Interactive heatmap and dendrogram for the cluster. Significant annotations for all subclusters are shown using the previously introduced color-coded rectangles. Mouse over a rectangle to see the extent of the cluster highlighted in the tree. Clicking on a rectangle takes you to the cluster view of the selected subcluster (inner node) [12].
  • List of the cluster's best annotations ranked by p-value. The list can be downloaded as a CSV (comma-separated) or PDF file.
  • Complete list of genes in the cluster. The list can be downloaded as a CSV or PDF file.




Using private dataset


In this section we describe how to analyse and visualize your own dataset.

Upload your dataset

You can upload your data in the Data upload section. Fill in the fields of the form and follow the instructions. Note that the name of your uploaded file is used as the name of the dataset in further analysis. It is very important to explicitly define the organism of the dataset.

If you provide an e-mail address, the results will be linked to it: you will be able to see all your preprocessed datasets on one page, and you will be notified when your data is available for browsing.


Preprocessing

To make a dataset available for analysis we need to preprocess it. This step is required because of the size of experimental datasets: clustering and annotation can take a while (for example, a dataset of 15 conditions and 32 000 genes takes approximately 1 hour). During the preprocessing stage, VisHiC computes the hierarchical clustering and annotates all the clusters of the dataset (a binary hierarchy over n elements contains n-1 inner clusters, or 2n-1 clusters when the single elements are counted as well). We take care of the preprocessing of your data ourselves, and when it is finished you will receive an e-mail with an access link to the results. Preprocessing is performed in such a way that you can later apply the same parameters to your dataset as to our public datasets.


Getting the results

Once the dataset has been uploaded and preprocessed, a link to your results is sent to the contact e-mail address given in the upload form. The linked page also lists all your previously uploaded and successfully preprocessed datasets.


Now you can go to the page and visualize your results.



Tips&Tricks


To speed up the analysis

The speed of the analysis depends heavily on the size of the dataset. By applying the additional threshold or selecting fewer annotation types you reduce the number of computations and speed up the process. You can also increase the minimum size of potential interesting clusters and decrease the maximum size.


To get more precise results

You can apply an additional threshold for annotations, which guarantees that all the annotations considered when cutting the tree with the best annotation strategy are highly statistically significant. You can also allow smaller clusters (from 5 to 100) to be found.


Playing with the size of the output

An additional threshold for the annotations will most probably reduce the size of the picture. Selecting fewer annotation types has a similar effect.

