MEM help

MEM input explained

Gene ID input
Gene IDs are converted using g:Convert (part of the g:Profiler tool), which is built on Ensembl BioMart and thus supports all major gene identifiers. The query is first mapped to an Ensembl gene ID, which is then converted into platform-specific probeset ID(s). If several IDs match, the user is required to select one. If the initial query is already a valid probeset ID, it is selected automatically.

Database selection
All experiments have been downloaded from the ArrayExpress database, and the MEM database is updated regularly to include new datasets from ArrayExpress. For the sake of scientific consistency, all past updates are preserved as dated versions in the MEM database. "Current" is always the latest version; "Publication" denotes the state of the database at the time the MEM interface article was published.

Collection selection
Experiments are divided into collections based on their microarray platform. The platforms are ordered in the list by the number of experiments in the collection. Some platforms are marked with '*'. This means that gene IDs cannot yet be mapped conveniently to Affymetrix IDs on those platforms, because the platforms are not yet present in Ensembl BioMart. The '*' will be removed once the platform is mapped through BioMart.
Platforms can also be filtered by organism ID for more convenient use.

Options

Similarity

Distance measures - distance is measured between the expression profiles of genes. An expression profile can be viewed as a vector of numbers over different samples. For the correlation-based measures, distance equals 1 - correlation.
- Pearson correlation - the Pearson correlation coefficient (ρ) measures the strength and direction of a linear relationship between two variables X and Y (here, the expression vectors a and b).
- absolute Pearson - absolute value of the Pearson correlation coefficient
- Euclidean distance - is the "ordinary" distance between two points that one would measure with a ruler, which can be computed by repeated application of the Pythagorean theorem.
- Manhattan distance - is a metric in which the distance between two points is the sum of the (absolute) differences of their coordinates.
- chord - Euclidean distance measured after the normalization of the two vectors.
- centchord - Euclidean distance measured after centering and normalizing the vectors.
- Spearman - the Spearman's rank correlation coefficient (ρ) is a special case of the Pearson product-moment coefficient in which two sets of data X_i and Y_i are converted to rankings x_i and y_i before calculating the coefficient.
- absolute Spearman - absolute value of Spearman's rank correlation coefficient
- information - information distance: one minus [mutual information divided by entropy]. The number of bins is calculated as sqrt(number of columns in the matrix), but is never less than two.
- Minkowski - Minkowski distance (with parameter p=<positive-real>): the p-th root of the sum of the p-th powers of the absolute element-wise differences of the vectors.
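
The measures listed above can be sketched in a few lines of base R. This is only an illustration, not MEM's own implementation: the two profiles a and b are made-up numbers, the helper names unit and minkowski are hypothetical, and "normalization" is assumed here to mean scaling a vector to unit Euclidean length.

```r
# Two made-up expression profiles over six samples (illustrative values only)
a <- c(2.1, 3.4, 1.0, 5.2, 4.8, 2.9)
b <- c(1.9, 3.9, 0.7, 4.6, 5.5, 3.1)

# Hypothetical helpers: scale to unit length (assumed meaning of "normalization")
unit <- function(v) v / sqrt(sum(v^2))
# p-th root of the sum of p-th powers of absolute element-wise differences
minkowski <- function(x, y, p) sum(abs(x - y)^p)^(1 / p)

pearson_dist  <- 1 - cor(a, b, method = "pearson")
abs_pearson   <- 1 - abs(cor(a, b, method = "pearson"))
spearman_dist <- 1 - cor(a, b, method = "spearman")
abs_spearman  <- 1 - abs(cor(a, b, method = "spearman"))

euclidean <- minkowski(a, b, 2)   # Minkowski with p = 2
manhattan <- minkowski(a, b, 1)   # Minkowski with p = 1

chord     <- minkowski(unit(a), unit(b), 2)
centchord <- minkowski(unit(a - mean(a)), unit(b - mean(b)), 2)
```

Under the unit-length assumption, the squared centchord distance equals 2 * (1 - Pearson correlation), so centchord and Pearson distance order genes in the same way.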

Rank aggregation method - three strategies are provided for rank aggregation. By default BetaMEM is used.
- BetaMEM - a statistical strategy for rank aggregation that compares the distribution of the gene's ranks across datasets with the distribution expected when the rankings are random. The output of the algorithm is a p-value, which can be used to decide the significance of the result after it has been multiplied by the number of genes and experiments in the query.
- mean of selected ranks - a simple rank aggregation method in which the score is the average of the n smallest ranks for the given gene. By default n is 20, but it can be changed by the user (the field becomes visible when the method is selected).
- geometric mean of selected ranks - same as above, but the geometric mean is used.
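
The two simple strategies can be written down directly, as in the sketch below. The ranks, n and n_genes are made-up values. The last part is only a rough illustration of the idea behind comparing ranks with a random expectation: it assumes that, under random rankings, normalized ranks behave like uniform values, so the k-th smallest of m follows a Beta(k, m - k + 1) distribution. MEM's actual BetaMEM computation may differ.

```r
# Made-up ranks of one gene in eight datasets
ranks   <- c(5, 120, 3, 48, 9, 300, 1, 77)
n_genes <- 1000   # hypothetical number of genes on the platform
n       <- 3      # number of smallest ranks to aggregate (MEM's default is 20)

smallest   <- sort(ranks)[seq_len(n)]
mean_score <- mean(smallest)             # mean of selected ranks
geo_score  <- exp(mean(log(smallest)))   # geometric mean of selected ranks

# Assumption: order-statistic view of random rankings (not confirmed as MEM's method)
r   <- sort(ranks / n_genes)             # normalized ranks in (0, 1]
m   <- length(r)
k   <- seq_len(m)
p_k <- pbeta(r, k, m - k + 1)            # P(k-th smallest uniform <= observed value)
beta_score <- min(p_k)                   # most surprising prefix of datasets
```

Smaller values mean stronger similarity for all three scores.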

Score threshold - This is a fairly advanced option, since its meaning depends on the selected rank aggregation method. If the threshold is set, only genes with a smaller score / p-value are returned. This option works in parallel with the output limit; whichever is satisfied first takes priority. By default the threshold is set to the highest possible value, i.e. it is not used.
Invert the order of results - by default MEM displays the genes most similar to the query. This order can be inverted by checking the corresponding checkbox in the similarity submenu. For example, if Pearson correlation is used, then without the checkbox the most similar genes (ρ ~ 1) appear at the top; with the checkbox checked, genes with ρ ~ -1 appear at the top instead.
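
The interplay between the score threshold and the output limit can be pictured as follows. This is a minimal sketch with made-up gene names and scores, not MEM's code: the list is cut by whichever condition is reached first.

```r
# Made-up aggregated scores (smaller = more similar to the query)
scores <- c(geneA = 1e-8, geneB = 3e-5, geneC = 2e-3, geneD = 0.04, geneE = 0.3)

threshold <- 0.01   # hypothetical score threshold
limit     <- 2      # hypothetical output limit

ordered <- sort(scores)                  # best scores first
passing <- ordered[ordered < threshold]  # threshold applied
result  <- head(passing, limit)          # output limit applied

names(result)             # limit is reached first: geneA, geneB
names(head(passing, 10))  # threshold is reached first: geneA, geneB, geneC
```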

Output

Output limit - Defines how many of the most similar / distant genes to output. This option works in parallel with the score threshold; whichever is satisfied first takes priority.
Display option - The user can select whether to display both the graphical output and the text output, or only the latter. Graphical output is explained here. Additionally, it is possible to cluster rows and columns by their similarity. By default data sets (columns) are clustered; genes (rows) are not clustered and are ordered by score.
Cell parameters - The user can adjust the image drawing parameters. Cell width, height and spacing can be changed to achieve the most suitable level of overview. The option "do not show contributing datasets" controls whether black rectangles are drawn around contributing datasets.

Gene filters

Remove unknown and ambiguous genes - pretty much as it says. All probe sets that have no known annotation, or that have multiple annotations at the same time, are discarded from the output.
Select genes to output - The user can specify a set of genes to be shown in the output instead of the most similar / distant genes. For example, the list of genes might comprise a biological pathway.

Dataset filters

StDev threshold for query - hypothetically, data sets in which the query gene has a high standard deviation (StDev) over samples give biologically more reliable results when expression profiles are compared. We have implemented a StDev filter, which allows the user to query only those data sets in which the query has a higher StDev than the threshold set by the user.
StDev as dataset weight - It is possible to weight the contribution of each dataset in proportion to the variance of the query gene in that dataset.
Number of most variant data sets in respect to query - Instead of setting a fixed threshold on the standard deviation, the user can specify how many data sets to include (the top datasets in which the query has the highest standard deviation). Alternatively, set the value "all".
Use beta scaling - depending on the number of datasets in the query, the power (or scale) of the similarity score may differ. It is also affected by the relative similarity of the datasets within the query. Beta scaling can significantly reduce the effect of variable collection size.
Search datasets with keywords - All metadata (dataset ID, title, description and sample annotations) is searched against the entered keywords. Perl regular expressions are used for matching.
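
The dataset filters above amount to simple selections on per-dataset metadata. The sketch below illustrates them on a made-up table; the column names, dataset IDs, titles and StDev values are hypothetical, and case-insensitive matching is an assumption rather than documented MEM behaviour.

```r
# Made-up dataset table: query_sd is the StDev of the query gene in each dataset
datasets <- data.frame(
  id       = c("ds_1", "ds_2", "ds_3", "ds_4"),
  title    = c("Breast cancer cell lines", "Mouse liver time course",
               "Leukemia patient samples", "Breast tumour biopsies"),
  query_sd = c(0.85, 0.12, 0.40, 0.61),
  stringsAsFactors = FALSE
)

# StDev threshold for query: keep datasets where the query varies enough
sd_threshold <- 0.3
by_threshold <- datasets[datasets$query_sd > sd_threshold, ]

# Number of most variant data sets: keep the top n by StDev instead
top_n    <- 2
by_top_n <- head(datasets[order(datasets$query_sd, decreasing = TRUE), ], top_n)

# Keyword search with a Perl-compatible regular expression
pattern    <- "breast|leuk(a)?emia"
by_keyword <- datasets[grepl(pattern, datasets$title, perl = TRUE, ignore.case = TRUE), ]
```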

NetCDF output format explained
MEM allows results to be downloaded in NetCDF format. To work with NetCDF files, netcdf-bin, libnetcdf-dev and their dependencies should be present in the OS. The latter is required to install the ncdf package in R. Example of the MEM NetCDF structure:

netcdf memcpp_res_8 .. 0 {
dimensions:
        focus = 100 ;                                     #number of datasets used in query
        nonfocus = 545 ;                                  #number of datasets left out due to filter (i.e. st-dev, etc)
        strlen_datasets = 64 ;                            #can be ignored / nr of characters reserved for dataset names /
        strlen_genes = 28 ;                               #can be ignored / nr of characters reserved for gene names /
        genes = 22282 ;                                   #number of features/probes/genes on platform
variables:
        char reference(strlen_genes) ;                    #query feature/probe/gene
        char focus_name(focus, strlen_datasets) ;         #dataset names for focus
        double focus_stdev(focus) ;                       #query feature/probe/gene-s st-dev for focus
        char nonfocus_name(nonfocus, strlen_datasets) ;   #not so interesting dataset names
        double nonfocus_stdev(nonfocus) ;                 #not so interesting st-dev values
        char gene_name(genes, strlen_genes) ;             #feature/probe/gene names for rest of the platform (i.e. not reference)
        double score(genes) ;                             #MEM similarity score for all genes (except reference)
        int support(genes) ;                              #number of datasets that contributed in score (boxed in matrix view)
        int score_rank(genes) ;                           #rank value at score
        int focus_rank(genes, focus) ;                    #rank matrix. Ranks for all features/probes/genes in every dataset
        int nonfocus_rank(genes, nonfocus) ;              #rank matrix for non-focus datasets
}

Depending on the view options used, some NetCDF variables may be missing from the list above; this does not mean the file is broken. Use "Display all datasets .." from the Output tab and try again.

Intro to R
The following code opens the .nc file and reads the two most important variables into R: scores, the named MEM scores (w/o reference), and focus_ranks, the rank matrix used to compute the MEM scores.

library(ncdf)                                                   #load R NetCDF library
nc <- open.ncdf("memcpp_res_8 .. 0.nc")                         #open *.nc file / open.ncdf("<path/to/nc/file.nc>") /
  #read some variables from the file into R
reference <- get.var.ncdf(nc, "reference")                      #reference/query feature/probe/gene name
gene_names <- get.var.ncdf(nc,"gene_name")                      #feature/probe/gene names (except reference)
focus_ranks <- get.var.ncdf(nc,"focus_rank")                    #rank matrix (except reference)
scores <- get.var.ncdf(nc, "score")                             #mem scores (except reference)
names(scores) <- gene_names                                     #add names to scores array (w/o reference)

str(scores)       #mem scores with feature/probe/gene names (w/o reference)

ds_names <- get.var.ncdf(nc,"focus_name")                       #dataset names
ds_names <- sub(".*/", "", sub("\\.nc$", "", ds_names))         #strip directory path and the literal ".nc" suffix from dataset names
focus_ranks <- t(focus_ranks)                                   #transpose so that datasets would be in columns (not needed if only one dataset!)
focus_ranks <- rbind(rep(1,length(ds_names)), focus_ranks)      #add reference also to the rank matrix (has always rank 1, i.e. most similar to it self)
rownames(focus_ranks) <- c(reference,gene_names)                #define row names for the rank matrix
colnames(focus_ranks) <- ds_names                               #define column names for the rank matrix

str(focus_ranks) #focus rank matrix

close.ncdf(nc)   #close the file once all variables have been read
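
Once scores and focus_ranks have been built as above, the best-scoring genes and their per-dataset ranks can be pulled out with ordinary indexing. The sketch below uses small made-up stand-ins for both objects so that it runs without a .nc file; the probe and dataset names are hypothetical.

```r
# Made-up stand-ins for the objects read from the NetCDF file
scores <- c(probe_a = 2e-6, probe_b = 0.4, probe_c = 1e-9, probe_d = 0.03)
focus_ranks <- rbind(
  query   = c(1, 1, 1),      # the reference always has rank 1
  probe_a = c(3, 2, 40),
  probe_b = c(500, 810, 77),
  probe_c = c(2, 3, 2),
  probe_d = c(15, 90, 8)
)
colnames(focus_ranks) <- c("ds_1", "ds_2", "ds_3")

# Names of the two best-scoring genes (smallest score first)
top <- names(sort(scores))[1:2]

# Their ranks in every dataset, with the query row kept on top
top_ranks <- focus_ranks[c("query", top), ]
top_ranks
```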

If R is not the preferred analysis environment, NetCDF files can be converted to flat files with TabCDF.

Word clouds

About the annotations: the MetaMap tool was applied to every term in the dataset descriptions in order to annotate them. In addition to the basic word cloud built from the original descriptions, an annotation word cloud was then constructed. Significant terms that had significant annotations in the annotation word cloud were replaced with their annotations. Terms from the annotation word cloud that did not appear in the basic word cloud were also added to the final word cloud.

One-way view
The word cloud consists of the terms that are most significantly overrepresented across the descriptions of the datasets that contributed to the coexpression of the query gene with the selected gene. Since the contributing datasets are ordered, each term was tested for overrepresentation in ds_1; in ds_1 and ds_2; etc., and the minimum of all these p-values was chosen as the final p-value for the term. Bonferroni correction was applied to these p-values because of the multiple comparisons issue. The word clouds were made more thorough by applying annotations to them using the MetaMap tool.
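
The prefix-wise testing described above can be sketched as follows for a single term. The help text does not name the statistical test or the exact correction factor, so this sketch assumes a one-sided hypergeometric test and multiplies the minimal p-value by the number of prefixes tested; the data are made up.

```r
# Made-up data: 12 datasets; TRUE marks descriptions that contain the term
has_term <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE,
              FALSE, TRUE, FALSE, FALSE, FALSE, FALSE)
# Contributing datasets in their given order (indices into has_term)
contributing <- c(1, 2, 4, 3, 8)

N <- length(has_term)   # all datasets
K <- sum(has_term)      # datasets whose description contains the term

# Test the term for ds_1; ds_1 and ds_2; ... (assumed hypergeometric test)
prefix_p <- sapply(seq_along(contributing), function(i) {
  k <- sum(has_term[contributing[1:i]])           # hits within the first i datasets
  phyper(k - 1, K, N - K, i, lower.tail = FALSE)  # P(at least k hits in i draws)
})

p_min <- min(prefix_p)                      # final p-value for the term
p_adj <- min(1, p_min * length(prefix_p))   # Bonferroni-style correction (assumed factor)
```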

Two-way view
The word cloud consists of the terms that are most significantly overrepresented across the descriptions of the datasets that contributed both to the coexpression of the query gene with the selected gene and vice versa. Since the contributing datasets are not ordered in this case, each term was tested for overrepresentation only in the whole set of datasets at once. The word clouds were made more thorough by applying annotations to them using the MetaMap tool.