The Alternate Transcript Diversity (ATD) project aims to characterise the alternative transcripts (ATs) of mRNA present in the human genome. Alternative transcripts arise for three different reasons: variation in transcription start site, splicing and polyadenylation. The expression of alternative transcripts has been observed to be specific to tissue type as well as developmental stage. Disruptions in the expression of ATs corresponding to a particular cell condition can cause serious illnesses (e.g. cancer, multiple sclerosis, heart failure and neurodegenerative disorders). Among other things, the ATD project aims to characterise AT variants specific to tissue types as well as detecting ..............
The ultimate goal of the project is to develop prototypes for novel diagnostic tools and to identify new drug targets.

ATD global website

Tasks of the BIIT group

WP5: Identify disease-specific ATs and AT regulatory signals.

WP5 takes raw expression data from WP4 and converts it into usable expression information distributed to all other partners. This expression information is then further exploited to determine tissue-specific ATs and regulatory sequence/structure motifs. The workpackage is organized into a development phase (up to about month 18), in which CO1 and CR7 work together on software tools, and a data acquisition/analysis phase, in which CO1, CR2 and CR7 collaborate with WP4 participants to exploit the expression data analysis platform and produce value-added information (see Figure 6.8).

In the development phase, unsupervised classification tools will be extended to bring together genes that show concomitant switches in transcript expression, leading to the identification of coregulated AT events. We will also develop methods for the identification of differentially expressed genes. CO1 and CR7 will address the need for integration and performance improvement of these methods. CR7 will develop new similarity metrics suitable for combining different data sets using two main approaches: first, by combining individual metrics into a combined metric, and second, by using time-series-specific measures such as time warping.

In parallel, CO1 and CR7 will develop tools for motif identification. There are basically three types of motifs in mRNA sequences that act as regulatory elements: conserved sequence motifs with possible variations (e.g. a polyA signal or TATA box), secondary structure motifs (e.g. IRE or SECIS) and “fuzzy” motifs such as GU-rich or U-rich regions in UTRs. Programs for detecting these three types of patterns are already being developed by CO1 and CR7. CR7 will further develop the approach for position-specific weight matrix construction that is based on rapid exhaustive enumeration followed by a refinement step, which enables more rapid discovery of regulatory signals. They will also develop a novel method for pattern discovery based entirely on the suffix tree approach, so that the resulting algorithms are even faster and can accommodate both discrete patterns and probabilistic weight matrix profiles. CO1 will adapt their existing RNA structure detection tools to incorporate phylogenetic information. The latter is necessary for the identification of conserved secondary structures, which has not been seriously addressed in previous regulatory motif search algorithms. The risk that we will be unable to provide enhanced expression/sequence analysis tools is low given our past record of developing such programs.

In the data acquisition/analysis phase, the above tools are applied to the analysis of expression data from WP2 and WP4. WP2 will provide us with EST counts for each specific form, as a measure of transcript expression levels, as well as feature-specific datasets (sets of forms observed in association with any given condition). CO1 and CR2 have already used EST counts in this way in prior publications. EST counts and feature-specific datasets will allow us to start benchmarking computational methods early in the project, and to obtain a preliminary set of tissue-specific and coregulated ATs.
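The two approaches to similarity metrics mentioned for the development phase (a combined metric and a time-warping measure) can be illustrated with a minimal sketch. This is not the project's software: the function names, the choice of Euclidean distance as the second metric and the weights are all assumptions made purely for illustration.

```python
import math


def dtw_distance(a, b):
    """Classic dynamic time warping distance between two numeric series.

    Uses absolute difference as the local cost and allows the usual
    three moves (match, insertion, deletion). Illustrative only.
    """
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # stretch series b
                cost[i][j - 1],      # stretch series a
                cost[i - 1][j - 1],  # advance both
            )
    return cost[n][m]


def euclidean_distance(a, b):
    """Plain Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def combined_distance(profiles_a, profiles_b, metrics, weights):
    """Weighted sum of per-data-set distances (hypothetical combination rule).

    profiles_a[k] and profiles_b[k] are the profiles of two genes in
    data set k; metrics[k] is the distance used for that data set.
    """
    return sum(
        w * metric(pa, pb)
        for pa, pb, metric, w in zip(profiles_a, profiles_b, metrics, weights)
    )
```

For example, `dtw_distance([1, 2, 3], [1, 2, 2, 3])` is zero because the second series is only a time-stretched copy of the first, whereas a point-by-point metric cannot even compare series of different lengths. A weighted sum is only one possible way of combining metrics; the weights would need to be chosen or learned for real data.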

Once microarray data is available, it will be converted into the existing standard for microarray experiments, MAGE-ML (CO1). Standardized microarray data along with clustering tools will be made available to ATD members through CO1, and expression patterns of each specific isoform will be sent back to CR2/WP2 for input into the ATD database. This processed expression data is not directly publishable and should not be made public as such. Our goal is to valorize it with other ATD members, especially through deliverables 5.3 and 5.4, and possibly to seek intellectual protection (see below). At the end of the 3-year period, expression data will finally be made public through the ArrayExpress database at EBI (WP2).

The “Analysis” part of WP5 (Deliverables 3 and 4) relies in part on the delivery of raw hybridization data from WP4 (Milestone 2). In case of delays in the microarray design or hybridization stage, we will focus on EST-based expression data as explained above. These EST-based data will be subjected to the same analyses as microarray data in all scenarios. The identification of tissue-specific and coregulated isoforms (Del. 5.3) will be straightforward with the tools developed in Del. 5.1 and can be performed by WP5 or WP4 partners. Given the number of genes on our DNA chip and the variety of samples screened in WP4, it would be surprising if no specific isoform or set of coregulated isoforms were identified (smaller-scale experiments have identified many tissue-specific genes). However, in the unlikely case that no interesting bias appears, we can decide to go back to WP4 to change our gene selection.

Finally, CR1 and CR7 will screen coregulated forms for the presence of regulatory motifs using the programs developed in 5.1. These partners will also be interested in the concomitant expression of potential regulatory genes (e.g. with RNA-binding domains) detected on genome-wide chips (see WP4) and the appearance of certain isoforms. Such events would suggest new regulatory pathways.
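As a rough illustration of how EST counts per form can hint at tissue specificity, the sketch below flags an isoform when most of its ESTs come from a single tissue. It is a toy filter under stated assumptions, not the analysis used in the project: the function name, the input layout, the thresholds and the example identifiers are all hypothetical, and a real analysis would need library-size normalisation and a statistical test.

```python
def tissue_specific_isoforms(est_counts, min_total=5, min_fraction=0.8):
    """Return isoforms whose ESTs come mostly from one tissue.

    est_counts maps an isoform identifier to a dict of
    {tissue: EST count}. An isoform is reported when it has at least
    `min_total` ESTs overall and at least `min_fraction` of them come
    from a single tissue. Thresholds are arbitrary illustrative values.
    """
    hits = {}
    for isoform, by_tissue in est_counts.items():
        total = sum(by_tissue.values())
        if total < min_total:
            continue  # too few ESTs to say anything
        tissue, count = max(by_tissue.items(), key=lambda kv: kv[1])
        fraction = count / total
        if fraction >= min_fraction:
            hits[isoform] = (tissue, fraction)
    return hits


# Toy input with made-up identifiers and counts.
example_counts = {
    "geneA.form1": {"brain": 9, "liver": 1},
    "geneA.form2": {"brain": 3, "liver": 4},
    "geneB.form1": {"heart": 2},
}
example_hits = tissue_specific_isoforms(example_counts)
```

With the toy input, only `geneA.form1` is reported: `geneA.form2` is spread across tissues and `geneB.form1` has too few ESTs to pass the minimum-count filter.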

The duration of WP5 is 36 months, with emphasis on development in the first half and on data analysis / IP in the second half.

D5.1 Improved tools for clustering of expression profiles according to AT events and for identification and visualization of common sequences or structures in a set of genes
D5.2 MAGE-ML-formatted expression experiments made available to all ATD partners along with existing profiling tools
D5.3 AT expression profiles and tissue-specific ATs
D5.4 List of sequence and secondary structure motifs associated with specific sets of coregulated isoforms