## Blogs

## TiddlyWiki - personal web-based notebook

I use TiddlyWiki as my electronic notebook, and I suggest you try it out. It is very easy, but to make your life even easier, I have prepared versions with a tutorial included and several useful plugins (search, tags and LaTeX) already installed. All you have to do is follow these 5 steps:

## K-means & correlation distance

Here's a useful observation related to the use of K-means together with the Pearson correlation distance (© Alex).

The standard K-means update step, where you update the cluster centers by taking the means of the corresponding points, is technically not very appropriate in the case of the correlation distance $d(x, y) = 1 - \mathrm{corr}(x, y)$. The proper step would be to take the sum of the *normalized* points as the new cluster center:

$$c = \sum_{i} \frac{x_{i}}{|x_{i}|}$$
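Here is a minimal sketch of that update step in Python with NumPy. It assumes that "normalized" means each point is first centered to zero mean and then scaled to unit norm (with that convention the Pearson correlation of two points is just the dot product of their normalized versions). The function names and the toy data are made up for illustration.

```python
import numpy as np

def normalize_rows(X):
    """Center each row to zero mean and scale it to unit Euclidean norm.

    Assumption: no row is constant, otherwise its norm after centering is 0.
    """
    Xc = X - X.mean(axis=1, keepdims=True)
    return Xc / np.linalg.norm(Xc, axis=1, keepdims=True)

def correlation_center(X):
    """Cluster center as the sum of the normalized points."""
    return normalize_rows(X).sum(axis=0)

def mean_correlation(center, X):
    """Average Pearson correlation between the center and each point."""
    return float(np.mean([np.corrcoef(center, x)[0, 1] for x in X]))

# Toy cluster: three points on very different scales (hypothetical data).
points = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [100.0, 300.0, 200.0, 400.0],
    [0.1, 0.0, 0.3, 0.2],
])

center = correlation_center(points)   # sum of normalized points
plain_mean = points.mean(axis=0)      # standard K-means update

print("sum of normalized points:", mean_correlation(center, points))
print("plain mean:              ", mean_correlation(plain_mean, points))
```

The plain mean is dominated by the point with the largest scale, while the sum of normalized points gives every point the same weight, so its average correlation with the cluster members is never lower.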

## Float precision

*Float precision* is a very subtle issue. It crops up so rarely that many people (me included) forget about it completely before it shows itself somewhere once again.

Indeed, most of the time it's not a problem at all that floating-point computations are not perfectly precise, and no one cares about the small additive noise they produce, as long as you remember to avoid exact comparisons between floats.
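As a small illustration of why exact comparisons are best avoided, here is a sketch in Python. The tolerance used by `math.isclose` is the library default, which may or may not suit a particular application.

```python
import math

total = 0.1 + 0.2

# Exact comparison fails because of the tiny rounding noise in `total`.
exact_equal = (total == 0.3)

# Comparison with a tolerance (default rel_tol=1e-09) behaves as expected.
close_enough = math.isclose(total, 0.3)

print(exact_equal, close_enough)  # prints: False True
```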

## A Brief Introduction to Matrix Algebra

I made up a short guide on the basics of matrix algebra for the recent pattern analysis course to kinda help the catch-uppers.

Despite the fact that it did not seem to help a lot (it's probably too compressed for a beginner and too obvious for someone who already knows the material), I still decided to make it into a more-or-less complete document, which may someday end up being useful.

The good point about it is that it summarizes pretty much **all** of the linear algebra that I, personally, know and find useful.

The PDF is available here.

## Scilab + PVM

I've recently found out that you can use PVM within Scilab to parallelize programs easily. I liked the experience so much that I couldn't resist sharing it with you.

## Latent Process Decomposition

Latent Process Decomposition might be an interesting unsupervised analysis approach to try on the FunGenES data. It is something like clustering, but with a model that is slightly more sophisticated than the traditional "mixture". The authors kindly provide the code and some impressive examples of successful application of the method in their paper, so although the conceptual part of the algorithm is heavily mathematical, it might be possible to just try running it on the data with reasonably small effort.

## Apriori Revisited

A paper by T. de Bie et al. describes one very stylish application of the Apriori algorithm for detection of transcription regulatory modules. The idea is in the smart statement of the problem, which is the following:

Find the *maximal* sets of genes that all share *at least r* common regulators, *at least m* common motifs, and have pairwise correlation of *at least c*.

It turns out that the sets of genes of interest naturally satisfy the same properties as the frequent sets in the Apriori algorithm, so it's rather easy to adapt the algorithm for this context.
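The property in question is that every subset of a valid gene set is itself valid, which is exactly what lets Apriori grow candidates level by level and prune early. Below is a rough sketch of such a level-wise search in Python. It is only an illustration of the idea, not the algorithm from the paper: the data layout, the function names and the toy values are all assumptions made up for this example.

```python
from itertools import combinations

def is_valid(genes, regulators, motifs, corr, r, m, c):
    """Check the three constraints for one gene set."""
    genes = list(genes)
    shared_regs = set.intersection(*(regulators[g] for g in genes))
    shared_motifs = set.intersection(*(motifs[g] for g in genes))
    if len(shared_regs) < r or len(shared_motifs) < m:
        return False
    return all(corr[frozenset(p)] >= c for p in combinations(genes, 2))

def maximal_gene_sets(all_genes, regulators, motifs, corr, r, m, c):
    """Apriori-style level-wise search, returning only the maximal valid sets."""
    args = (regulators, motifs, corr, r, m, c)
    level = [frozenset([g]) for g in all_genes if is_valid([g], *args)]
    valid = set(level)
    while level:
        prev = set(level)
        # Join step: merge sets from the previous level that differ in one gene.
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == len(a) + 1}
        # Prune step: keep a candidate only if all its one-smaller subsets
        # were valid, then check the constraints themselves.
        level = [s for s in candidates
                 if all(s - {g} in prev for g in s) and is_valid(s, *args)]
        valid.update(level)
    return [s for s in valid if not any(s < t for t in valid)]

# Hypothetical toy data: regulators and motifs per gene, pairwise correlations.
regulators = {"g1": {"R1", "R2"}, "g2": {"R1", "R2"}, "g3": {"R1"}, "g4": {"R3"}}
motifs = {"g1": {"M1"}, "g2": {"M1"}, "g3": {"M1"}, "g4": {"M2"}}
corr = {
    frozenset(["g1", "g2"]): 0.9, frozenset(["g1", "g3"]): 0.8,
    frozenset(["g2", "g3"]): 0.2, frozenset(["g1", "g4"]): 0.1,
    frozenset(["g2", "g4"]): 0.1, frozenset(["g3", "g4"]): 0.1,
}

result = maximal_gene_sets(sorted(regulators), regulators, motifs, corr,
                           r=1, m=1, c=0.7)
print(sorted(sorted(s) for s in result))
```

In this toy example the triple `g1, g2, g3` is never even tested against the constraints, because its subset `g2, g3` has already failed the correlation threshold.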

## An Improved Map of Conserved Regulatory Sites

You might still remember this paper by Harbison et al. that reported some high-quality *S. cerevisiae* TF binding sites. Well, now there's a follow-up by MacIsaac et al. This time the authors used phylogenetic conservation-based algorithms (PhyloCon and Converge) to search for binding sites, and reportedly got even better results than before.

Moreover, the authors provide a nice Python package TAMO for performing basic PWM-matching tasks.

## ROC Area-Under-Curve Explained

Some things may take years to figure out. This happens when someone shows you a definition of some "basic" mathematical object, but does not say why it is defined this way and how it should be interpreted. Moreover, you won't find the answer to your "why" and "how" questions so easily, either because they are "so simple" that no one cares to tell, or simply because no one cares. Some time passes, and you forget your desire to find out the meaning and just get used to the definition.

For example, it took me some months after I first heard the definition of matrix multiplication to understand why it was defined precisely like that. Same with the notion of a "determinant". Same with pretty much any other first-year university mathematical object. The problem is probably that many of our math courses are "definition-based", not "intuition-based", but anyway, that's not the subject of this post.

## The True Value of P-value

Last year, at the end of October, we attended the school *Analysis of Patterns* in Sicily. There were several interesting new things that I personally took home with me from this event, but the thing that probably provided the most food for thought for me was the notion of "multiple hypothesis testing". I've long wanted to write something down on this topic, but it's only now that I've found the time and desire to do it. Note that I haven't yet taken the trouble of reading more in-depth literature about this, so what follows is not necessarily flawless, but here we go.