MEM - Multi Experiment Matrix help
P. Adler, R. Kolde, M. Kull, A. Tkatšenko, H. Peterson, J. Reimand and J. Vilo: Mining for coexpression across hundreds of datasets using novel rank aggregation and visualisation methods (2009) Genome Biology
R. Kolde, S. Laur, P. Adler and J. Vilo: Robust rank aggregation for gene list integration and meta-analysis (2011) Bioinformatics
MEM input explained

Gene ID input
Gene IDs are converted using g:Convert (part of the g:Profiler tool), which is built on Ensembl BioMart and thus supports all major gene identifiers, such as Ensembl gene IDs, and converts them into platform-specific probeset ID(s). If the query maps to multiple IDs, the user is required to select one. If the initial query is already a valid probeset ID, it is selected automatically.

Database selection
All experiments have been downloaded from the ArrayExpress database, and the MEM database is updated regularly to include new datasets from ArrayExpress. For the sake of scientific consistency, all past updates are preserved as dated versions in the MEM database (Current is always the latest version; Publication denotes the state of the database at the time the MEM interface article was published).

Collection selection
Experiments are divided into collections based on their microchip platform. The platforms are ordered in the list by the number of experiments in the collection. Some platforms are marked with '*'; this means that we cannot yet provide a convenient way to map Affymetrix IDs on those platforms, since they are not yet present in Ensembl BioMart. The '*' will be removed once the platform is mapped through BioMart.
Platforms can also be filtered by organism ID for more convenient usage.

Options

Similarity

Output

Gene filters

Dataset filters

MEM output explained
tutorial image
  • 1 The list of resulting genes. One gene can occur several times since there can be more than one probe per gene on an array.
    • By hovering over the gene name, one can see the full description of the gene.
  • 2 Heatmap representation of the rank matrix. Each column represents the similarity rankings of the resulting gene list in a particular dataset. The color scale runs from dark red to dark blue, with red meaning smaller ranks and blue meaning larger ranks (smaller is better).
    • By hovering over a square, one can see the rank of the given gene in the given dataset.
  • 3 P-values indicating the strength of co-expression. The p-values are obtained by comparing the observed number of small ranks to the number of small ranks expected by random draw.
  • 4 The resulting list in terms of probes on the array.
  • 5 The experiment names. By clicking on an experiment name, the user can select/deselect the dataset for the next query.
    • By hovering over the experiment name one can see the description of the dataset.
  • 6 Hierarchical clustering of the shown rank matrix.
    • By clicking on the nodes of the tree, the user can select/deselect the corresponding datasets for the next query.
  • 7 Indicator for manual dataset selection. A tick means the dataset will definitely be included in the next query; a cross means it will be excluded. A gray dot is the default state, in which filters can apply. Ticks are also used to select datasets for the ExpressView.
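
The idea behind the p-values in item 3 can be illustrated with a small R sketch. This is a deliberately simplified stand-in, not the MEM algorithm: it assumes that ranks are uniformly distributed under random draw, picks an arbitrary cut-off for what counts as a "small" rank, and uses a binomial tail probability. The function name small_rank_pvalue, the threshold of 0.05 and all rank values are hypothetical; the method actually used by MEM is described in the articles cited at the top of this page.

```r
# Simplified illustration only (assumption: binomial model with a fixed cut-off).
# Not the statistic implemented in MEM.
small_rank_pvalue <- function(ranks, n_genes, threshold = 0.05) {
  norm_ranks <- ranks / n_genes            # scale ranks to (0, 1]
  k <- sum(norm_ranks <= threshold)        # observed number of small ranks
  # Probability of seeing k or more small ranks among length(ranks) random draws,
  # where each draw is "small" with probability equal to the threshold
  pbinom(k - 1, size = length(ranks), prob = threshold, lower.tail = FALSE)
}

n_genes <- 22282                           # platform size from the NetCDF example
# Made-up ranks of one gene across ten datasets
coexpressed <- c(3, 15, 120, 40, 9000, 7, 250, 18000, 60, 11)
random_like <- c(4500, 15000, 120, 9800, 21000, 7300, 18800, 2600, 13000, 11900)

p_coexpressed <- small_rank_pvalue(coexpressed, n_genes)
p_random <- small_rank_pvalue(random_like, n_genes)
print(c(coexpressed = p_coexpressed, random_like = p_random))
```

A gene with many small ranks receives a very small p-value, while a gene whose ranks look like a random draw does not.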

NetCDF output format explained
MEM allows results to be downloaded in NetCDF format. To work with NetCDF files, netcdf-bin, libnetcdf-dev and their dependencies should be installed on the operating system. The latter is required to install the ncdf package in R. Example of the MEM NetCDF structure:

netcdf memcpp_res_8 .. 0 {
dimensions:
        focus = 100 ;                                     #number of datasets used in query
        nonfocus = 545 ;                                  #number of datasets left out due to filters (e.g. st-dev)
        strlen_datasets = 64 ;                            #can be ignored / nr of characters reserved for dataset names /
        strlen_genes = 28 ;                               #can be ignored / nr of characters reserved for gene names /
        genes = 22282 ;                                   #number of features/probes/genes on platform
variables:
        char reference(strlen_genes) ;                    #query feature/probe/gene
        char focus_name(focus, strlen_datasets) ;         #dataset names for focus
        double focus_stdev(focus) ;                       #query feature/probe/gene-s st-dev for focus
        char nonfocus_name(nonfocus, strlen_datasets) ;   #not so interesting dataset names
        double nonfocus_stdev(nonfocus) ;                 #not so interesting st-dev values
        char gene_name(genes, strlen_genes) ;             #feature/probe/gene names for rest of the platform (i.e. not reference)
        double score(genes) ;                             #MEM similarity score for all genes (except reference)
        int support(genes) ;                              #number of datasets that contributed in score (boxed in matrix view)
        int score_rank(genes) ;                           #rank value at score
        int focus_rank(genes, focus) ;                    #rank matrix. Ranks for all features/probes/genes in every dataset
        int nonfocus_rank(genes, nonfocus) ;              #rank matrix for non-focus datasets
}

Depending on the view options used, some NetCDF variables may be missing from the list above; this does not mean that the file is broken. If they are needed, use "Display all datasets .." from the Output tab and try again.
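
A quick way to see which variables a downloaded file lacks is to compare its variable names against the listing above. The R sketch below does this for a made-up set of names; the helper missing_mem_vars is hypothetical. With a real file, the present names would have to come from the opened NetCDF object (assumed here to be available as names(nc$var) in the ncdf package).

```r
# Variable names copied from the structure listing above
expected_vars <- c("reference", "focus_name", "focus_stdev",
                   "nonfocus_name", "nonfocus_stdev", "gene_name",
                   "score", "support", "score_rank",
                   "focus_rank", "nonfocus_rank")

# Hypothetical helper: which of the expected variables are absent?
missing_mem_vars <- function(present_vars) {
  setdiff(expected_vars, present_vars)
}

# Made-up example of a file saved without the non-focus datasets
present <- c("reference", "focus_name", "focus_stdev", "gene_name",
             "score", "support", "score_rank", "focus_rank")
missing_vars <- missing_mem_vars(present)
print(missing_vars)
```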

Intro to R
The following code opens the .nc file and reads the two most important variables into R: scores, the named MEM scores (without the reference), and focus_ranks, the rank matrix used to compute the MEM scores.

library(ncdf)                                                   #load R NetCDF library
nc <- open.ncdf("memcpp_res_8 .. 0.nc")                         #open *.nc file / open.ncdf("<path/to/nc/file.nc>") /
  #read some variables from the file into R
reference <- get.var.ncdf(nc, "reference")                      #reference/query feature/probe/gene name
gene_names <- get.var.ncdf(nc,"gene_name")                      #feature/probe/gene names (except reference)
focus_ranks <- get.var.ncdf(nc,"focus_rank")                    #rank matrix (except reference)
scores <- get.var.ncdf(nc, "score")                             #mem scores (except reference)
names(scores) <- gene_names                                     #add names to scores array (w/o reference)

str(scores)       #mem scores with feature/probe/gene names (w/o reference)

ds_names <- get.var.ncdf(nc,"focus_name")                       #dataset names
ds_names <- sub(".*/","",sub("\\.nc","",ds_names), perl=TRUE)   #make dataset names more comfortable: drop the literal ".nc" and the directory path
focus_ranks <- t(focus_ranks)                                   #transpose so that datasets would be in columns (not needed if only one dataset!)
focus_ranks <- rbind(rep(1,length(ds_names)), focus_ranks)      #add reference also to the rank matrix (has always rank 1, i.e. most similar to it self)
rownames(focus_ranks) <- c(reference,gene_names)                #define row names for the rank matrix
colnames(focus_ranks) <- ds_names                               #define column names for the rank matrix

str(focus_ranks) #focus rank matrix
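
Once the rank matrix has genes in rows and datasets in columns, ordinary matrix operations apply. The sketch below uses a small made-up matrix (toy_ranks) in place of focus_ranks so that it runs without a NetCDF file. The median is used only as a simple stand-in for summarising ranks across datasets; it is not the MEM score.

```r
# Toy stand-in for focus_ranks: genes in rows, datasets in columns,
# the query in the first row with rank 1 everywhere (all values made up)
toy_ranks <- matrix(
  c(1, 1, 1,
    2, 3, 2,
    4, 2, 3,
    3, 5, 5,
    5, 4, 4),
  ncol = 3, byrow = TRUE,
  dimnames = list(c("query", "geneA", "geneB", "geneC", "geneD"),
                  c("ds1", "ds2", "ds3"))
)

# Three best-ranked rows in a single dataset
top_in_ds1 <- names(sort(toy_ranks[, "ds1"]))[1:3]

# Median rank of every gene across datasets (illustrative summary only)
median_rank <- apply(toy_ranks, 1, median)
ordered_genes <- names(sort(median_rank))

print(top_in_ds1)
print(ordered_genes)
```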

If R is not the preferred analysis environment, it is possible to convert NetCDF files to flat files with TabCDF.

Word clouds

About the annotations: the MetaMap tool was applied to every term in the dataset descriptions in order to annotate them. Then, in addition to the basic word cloud built from the original descriptions, an annotation word cloud was constructed. Significant terms that also have significant annotations in the annotation word cloud were replaced with their annotations. Terms from the annotation word cloud that did not appear in the basic word cloud were also added to the final word cloud.

One-way view
The word cloud consists of the terms that are most significantly overrepresented across the descriptions of the datasets that contributed to the coexpression of the query gene with the selected gene. Since the contributing datasets are ordered, each term was tested for overrepresentation in ds_1; in ds_1 and ds_2; and so on, and the minimum of all these p-values was chosen as the final p-value for the term. Bonferroni correction was applied to these p-values to account for multiple comparisons. The word clouds were made more thorough by annotating them with the MetaMap tool.
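
The prefix-wise testing described above can be sketched in R as follows. Two choices here are assumptions, since this page does not specify them: the overrepresentation test is taken to be a hypergeometric test, and the Bonferroni correction is applied across the tested terms. The function prefix_min_pvalue and all counts are made up for illustration.

```r
# Sketch under assumptions: hypergeometric test per prefix, minimum over prefixes
prefix_min_pvalue <- function(has_term_ordered, n_all, n_all_with_term) {
  # has_term_ordered: for each contributing dataset, in order, does its
  # description contain the term?
  p <- sapply(seq_along(has_term_ordered), function(i) {
    k <- sum(has_term_ordered[1:i])   # datasets with the term among the first i
    # P(X >= k) when i datasets are drawn at random from n_all datasets,
    # n_all_with_term of which mention the term
    phyper(k - 1, n_all_with_term, n_all - n_all_with_term, i,
           lower.tail = FALSE)
  })
  min(p)
}

n_all <- 100                                  # made-up collection size
term_in_all <- c(brain = 10, cancer = 40)     # made-up term frequencies
contributing <- list(
  brain  = c(TRUE, TRUE, TRUE, FALSE, FALSE),
  cancer = c(FALSE, TRUE, FALSE, TRUE, FALSE)
)

raw_p <- sapply(names(contributing), function(term) {
  prefix_min_pvalue(contributing[[term]], n_all, term_in_all[[term]])
})
adj_p <- p.adjust(raw_p, method = "bonferroni")   # assumed: correction across terms
print(adj_p)
```

In this toy example the term that is concentrated in the top-ranked datasets stays significant after correction, while the common term does not.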

Two-way view
The word cloud consists of the terms that are most significantly overrepresented across the descriptions of the datasets that contributed both to the coexpression of the query gene with the selected gene and vice versa. Since the contributing datasets are not ordered in this case, each term was tested for overrepresentation only in the whole set of datasets at once. The word clouds were made more thorough by annotating them with the MetaMap tool.

Multi-Experiment-Matrix © 2008-2009 | Priit Adler & Jaak Vilo @ Biit Group, Institute of Computer Science, University of Tartu