kt's blog

Oct 09 02:03

K-means & correlation distance

Here's a useful observation related to the use of K-means together with the Pearson correlation distance (© Alex).

The standard K-means update step, where you update the cluster centers by taking the means of the corresponding points, is technically not very appropriate in the case of the correlation distance d(x, y) = 1 - corr(x, y). The proper step would be to take the sum of the normalized points as the new cluster center:

c = Σ_i (x_i / |x_i|)
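Here is a minimal NumPy sketch of that update step. It assumes that "normalized" means each point is first centered to zero mean and then scaled to unit length, which is what makes the dot product of two normalized points equal to their Pearson correlation. The function names are made up for illustration.

```python
import numpy as np

def normalize_rows(X):
    """Center each row to zero mean and scale it to unit Euclidean norm."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(centered, axis=1, keepdims=True)
    return centered / norms  # assumes no row is constant (norm > 0)

def correlation_center(points):
    """New cluster center: the sum of the normalized points."""
    return normalize_rows(points).sum(axis=0)

def total_correlation_distance(points, center):
    """Sum of d(x, center) = 1 - corr(x, center) over all points."""
    return sum(1.0 - np.corrcoef(x, center)[0, 1] for x in points)

# Toy cluster: three points with very different scales.
points = np.array([[1.0, 2.0, 3.0],
                   [100.0, 300.0, 200.0],
                   [0.0, 0.0, 1.0]])

center = correlation_center(points)
plain_mean = points.mean(axis=0)

dist_center = total_correlation_distance(points, center)
dist_mean = total_correlation_distance(points, plain_mean)
```

On this toy cluster the plain mean is dominated by the large-scale second point, while the sum of normalized points weighs all three equally, so `dist_center` should not exceed `dist_mean`.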

Oct 08 21:37

Float precision

"Float precision" is a very subtle issue. It crops up so rarely that many people (me included) forget about it completely before it shows up somewhere once again.
Indeed, most of the time it is not a problem at all that floating-point computations are not perfectly precise, and no one cares about the small additive noise they produce, as long as you remember to avoid exact comparisons between floats.
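A small Python illustration of why exact comparisons are best avoided. This is just a sketch of the general point, using the standard library's `math.isclose` with its default tolerance:

```python
import math

a = 0.1 + 0.2
b = 0.3

# Exact comparison fails because 0.1, 0.2 and 0.3 have no exact
# binary representation, so the rounding errors do not cancel out.
exact_equal = (a == b)

# Comparing within a relative tolerance (default rel_tol=1e-09) works.
close_enough = math.isclose(a, b)

print(exact_equal, close_enough)
```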

Nov 23 01:12

A Brief Introduction to Matrix Algebra

I made up a short guide on the basics of matrix algebra for the recent pattern analysis course to kinda help the catch-uppers.
Despite the fact that it did not seem to help a lot (it's probably too compressed for a beginner and too obvious for someone who already knows the material), I still decided to make it into a more-or-less complete document, which may someday end up being useful.

The good point about it is that it summarizes pretty much all of the linear algebra that I, personally, know and find useful.

The PDF is available here.

Oct 10 22:52

Scilab + PVM

I've recently found out that you can use PVM within Scilab to parallelize programs easily. I liked the experience so much that I couldn't resist sharing it with you.

Enjoy.

Jul 04 20:11

Latent Process Decomposition

Latent Process Decomposition might be one interesting unsupervised analysis approach to try on the FunGenES data. The thing is something like clustering with a model which is slightly more sophisticated than the traditional "mixture". The authors kindly provide the code and some impressive examples of successful application of the method in their paper, so although the conceptual part of the algorithm is heavily mathematical, it might be possible to just try running it on the data with a reasonably small effort.

Jun 20 18:16

Apriori Revisited

A paper by T. de Bie et al. describes one very stylish application of the Apriori algorithm for detection of transcription regulatory modules. The idea is in the smart statement of the problem, which is the following:

Find the maximal sets of genes that all share at least r common regulators, at least m common motifs, and have pairwise correlation of at least c.

It turns out that the sets of genes of interest naturally satisfy the same properties as the frequent sets in the Apriori algorithm, so it's rather easy to adapt the algorithm for this context.
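To make the connection concrete, here is a rough Apriori-style sketch in Python. It is not the authors' implementation: the data layout (dicts of sets, correlations keyed by gene pairs), the function names and the toy data are all assumptions for illustration, and missing correlation entries are treated as 0. The key property is that if a gene set satisfies all three constraints, so does every subset of it, which is exactly what lets the level-wise search prune candidates.

```python
from itertools import combinations

def satisfies(genes, regulators, motifs, corr, r, m, c):
    """Check the three constraints for a candidate gene set."""
    common_regs = set.intersection(*(regulators[g] for g in genes))
    common_motifs = set.intersection(*(motifs[g] for g in genes))
    if len(common_regs) < r or len(common_motifs) < m:
        return False
    # Assumption: a pair with no recorded correlation counts as 0.
    return all(corr.get(frozenset(pair), 0.0) >= c
               for pair in combinations(genes, 2))

def find_modules(regulators, motifs, corr, r, m, c):
    """Level-wise search; returns the maximal valid gene sets."""
    level = {frozenset([g]) for g in regulators
             if satisfies([g], regulators, motifs, corr, r, m, c)}
    valid = set(level)
    while level:
        candidates = set()
        for a, b in combinations(level, 2):
            union = a | b
            # Join two sets that differ in one gene, then prune: every
            # subset one gene smaller must itself be valid.
            if len(union) == len(a) + 1 and all(
                    union - {g} in level for g in union):
                candidates.add(union)
        level = {s for s in candidates
                 if satisfies(s, regulators, motifs, corr, r, m, c)}
        valid |= level
    # Keep only sets that are not strictly contained in another valid set.
    return [s for s in valid if not any(s < t for t in valid)]

# Hypothetical toy data: g4 has too few regulators to join any module.
regulators = {"g1": {"A", "B"}, "g2": {"A", "B"},
              "g3": {"A", "B"}, "g4": {"C"}}
motifs = {"g1": {"m1"}, "g2": {"m1"}, "g3": {"m1"}, "g4": {"m1"}}
corr = {frozenset(("g1", "g2")): 0.9,
        frozenset(("g1", "g3")): 0.8,
        frozenset(("g2", "g3")): 0.85,
        frozenset(("g1", "g4")): 0.9}

modules = find_modules(regulators, motifs, corr, r=2, m=1, c=0.7)
```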

Jun 18 14:56

An Improved Map of Conserved Regulatory Sites

You might still remember this paper by Harbison et al. that reported some high-quality S. cerevisiae TF binding sites. Well, now there's a follow-up by MacIsaac et al. This time the authors used phylogenetic-conservation-based algorithms (PhyloCon and Converge) to search for binding sites, and reportedly got even better results than before.

Moreover, the authors provide a nice Python package TAMO for performing basic PWM-matching tasks.

May 13 13:11

ROC Area-Under-Curve Explained

Some things may take years to figure out. It happens when someone shows you a definition of some "basic" mathematical object, but does not say why it is defined this way and how it should be interpreted. Moreover, you won't find the answer to your "why" and "how" questions easily, either because they are "so simple" that no one bothers to tell, or simply because no one cares. Some time passes and you forget your desire to find out the meaning and just get used to the definition.

For example, it took me some months after I first heard the definition of matrix multiplication to understand why it was defined precisely like that. Same with the notion of a "determinant". Same with pretty much any other first-year university mathematical object. The problem is probably that many of our math courses are "definition-based", not "intuition-based", but anyway, that's not the subject of this post.

Feb 10 16:36

The True Value of P-value

Last year, at the end of October, we attended the school Analysis of patterns in Sicily. There were several interesting new things that I personally took home with me from this event, but the thing that probably provided the most food for thought for me was the notion of "multiple hypothesis testing". I've long been meaning to write something down on this topic, but it's only now that I've found the time and desire to do it. Note that I haven't yet taken the trouble of reading more in-depth literature about this, so all that follows is not necessarily flawless, but here we go.