Proposal for cs267 final project

Bonachea, Chapman, Putnam




The combined availability of a rapidly growing database of genomic sequences and the technology to create oligonucleotide arrays provides novel opportunities for genome-wide experimental measurements of biological systems. One challenge in exploiting this technology is the selection of oligonucleotide probe sequences for making arrays.

Because genomic sequence databases keep growing, determining an optimal set of probe sequences is computationally expensive. In this project we will implement a parallel application for selecting optimal oligonucleotide sequences from a database of genomic sequences.

Oligonucleotide design for microarray construction

Typically, the starting point for designing a microarray is a database of gene sequences or partial gene sequences. These are the genes whose expression levels one would like to measure with the microarray. For each of these genes, a subsequence is selected for deposition on the microarray.
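As a minimal sketch of this starting point, the candidate subsequences for one gene can be enumerated with a sliding window. The function name `candidate_probes`, the window length, and the toy sequences are illustrative assumptions, not part of the project code.

```python
def candidate_probes(gene_seq, probe_len=25):
    """Yield (offset, subsequence) for every window of probe_len bases.

    probe_len is an assumed parameter; the proposal does not fix a length.
    """
    for start in range(len(gene_seq) - probe_len + 1):
        yield start, gene_seq[start:start + probe_len]


# Toy database of gene sequences (made-up data, for illustration only).
genes = {
    "geneA": "ATGGCGTACGTTAGC",
    "geneB": "ATGCCCGGGTTTAAA",
}

# Every 10-base window of geneA is a candidate for deposition on the array.
windows = list(candidate_probes(genes["geneA"], probe_len=10))
```

Each candidate would then be scored against the constraints on probe choice before one is kept per gene.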

The choice of these sequences is constrained by three considerations:

    1. Uniqueness of the sequence
       The subsequence chosen should hybridize only with a single target gene. Subsequences which (even approximately) match subsequences of other genes should be eliminated. (One possible metric for subsequence similarity can be found at )

    2. Melting temperature
       The hybridization reactions happen in parallel, on the same array, and must all occur within a narrow window of temperature. All the probes must therefore have approximately the same melting temperature. (For more information on oligonucleotide thermodynamic calculations, see )

    3. Secondary structure
       Probes which form stable intramolecular helical structures must be eliminated, because such structures interfere with hybridization with target molecules. (See for more info on secondary structure prediction.)
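The three considerations above can be sketched as simple filters on a candidate probe. This is a rough illustration under stated assumptions, not the method we will necessarily use: uniqueness is approximated by exact shared k-mers (approximate matches would need a real similarity metric), melting temperature uses the Wallace rule as a stand-in for a full thermodynamic calculation (it is a rule of thumb for short oligos), and secondary structure is reduced to a check for a perfect hairpin stem. All function names and default parameters are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def is_unique(probe, other_genes, k=8):
    """Crude uniqueness proxy: no exact k-mer shared with any other gene."""
    probe_kmers = kmers(probe, k)
    return all(probe_kmers.isdisjoint(kmers(gene, k)) for gene in other_genes)


def wallace_tm(probe):
    """Wallace rule estimate in degrees C: 2*(A+T) + 4*(G+C)."""
    at = probe.count("A") + probe.count("T")
    gc = probe.count("G") + probe.count("C")
    return 2 * at + 4 * gc


def has_hairpin(probe, stem=4, min_loop=3):
    """True if a stem-length segment can pair with a downstream segment
    that is separated from it by at least min_loop bases."""
    for i in range(len(probe) - 2 * stem - min_loop + 1):
        partner = reverse_complement(probe[i:i + stem])
        if probe.find(partner, i + stem + min_loop) != -1:
            return True
    return False


def acceptable(probe, other_genes, tm_target, tm_window=2):
    """Combine the three filters; tm_target and tm_window are assumed inputs."""
    return (is_unique(probe, other_genes)
            and abs(wallace_tm(probe) - tm_target) <= tm_window
            and not has_hairpin(probe))
```

In a real run the melting temperature target would be shared by all probes on the array, so that every hybridization falls within the same narrow temperature window.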


Because each oligo sequence is chosen independently of the others, a workable 'embarrassingly' parallel implementation, in which the database is replicated on each processing element, is possible. However, for large sequence databases (6 Gbp for human), partitioning the sequence database among processing elements may allow higher performance. Our initial goal is to produce a set of optimal oligos for the yeast genome (6 Mbp), with an implementation that can be scaled up to large-scale calculations.
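The replicated-database variant can be sketched as follows: the genes are divided among workers, every worker sees the full database, and the per-gene results are merged at the end. This is only an illustration of the decomposition. Python threads stand in for processing elements, and `pick_probe` is a hypothetical placeholder for the real selection criteria (here, the first window that does not occur verbatim in any other gene).

```python
from concurrent.futures import ThreadPoolExecutor


def pick_probe(name, database, probe_len=8):
    """Placeholder selection: first window absent from every other gene."""
    seq = database[name]
    others = [s for other_name, s in database.items() if other_name != name]
    for start in range(len(seq) - probe_len + 1):
        window = seq[start:start + probe_len]
        if all(window not in other for other in others):
            return window
    return None  # no acceptable probe found for this gene


def worker(gene_names, database):
    """Process one worker's share of genes against the full database."""
    return {name: pick_probe(name, database) for name in gene_names}


def select_probes(database, n_workers=2):
    """Deal genes out round-robin, run the workers, and merge the results."""
    names = sorted(database)
    shares = [names[i::n_workers] for i in range(n_workers)]
    probes = {}
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for partial in pool.map(worker, shares, [database] * n_workers):
            probes.update(partial)
    return probes


# Toy database (made-up sequences, for illustration only).
database = {
    "geneA": "ATGGCGTACGTTAGC",
    "geneB": "ATGGCGTACCCGGGT",
    "geneC": "TTTTAAAACCCCGGGG",
}
probes = select_probes(database)
```

No communication is needed between workers because each probe choice depends only on the read-only database. The partitioned-database variant would instead give each worker a slice of the database, at the cost of exchanging candidates or partial match results between workers.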

Progress as of April 24

Documentation for Titanium classes developed so far