Bonachea, Chapman, Putnam
MICROARRAY OPTIMAL OLIGO SELECTION
The combined availability of a rapidly growing database of genomic sequences and the technology to create oligonucleotide arrays provides novel opportunities for genome-wide experimental measurements of biological systems. One challenge in exploiting this technology is the selection of oligonucleotide probe sequences for making arrays.
Because of the increasing size of genomic sequence databases, determining an optimal set of probe sequences is computationally expensive. In this project we will implement a parallel application for selecting optimal oligonucleotide sequences from a database of genomic sequences.
Oligonucleotide design for microarrays
Typically, the starting point for designing a microarray is a database of gene sequences, or partial gene sequences. These are the genes whose expression levels one would like to be able to measure with the microarray. For each of these genes, a subsequence is selected for deposition on the microarray.
The choice of these sequences is constrained by three considerations:
The subsequence chosen should hybridize only with a single target gene. Subsequences that match subsequences of other genes, even approximately, should be eliminated. (For one possible metric of subsequence similarity, see http://www.ncbi.nlm.nih.gov/BLAST/)
The hybridization reactions happen in parallel on the same array, and must all occur within a narrow window of temperature. All the probes must therefore have nearly the same melting temperature. (For more information on oligonucleotide thermodynamic calculations, see http://www.basic.nwu.edu/biotools/oligocalc.html)
Probes that form stable intramolecular helical structures must be eliminated, because such structures interfere with hybridization to target molecules. (See http://mfold.wustl.edu/~folder/dna/form1.cgi for more information on secondary structure prediction.)
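The three constraints above can be sketched as a simple filter over candidate subsequences. The Python below is a minimal illustration under strong assumptions, not the project's implementation: exact k-mer matching stands in for an approximate similarity search such as BLAST, the Wallace rule (2 degrees C per A/T, 4 degrees C per G/C) stands in for a full thermodynamic melting temperature calculation, and a naive stem search stands in for secondary structure prediction. All function names and parameters (select_probes, k, tm_target, tm_tol, stem, min_loop) are hypothetical.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def kmers(seq, k):
    """All length-k subsequences of seq, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def wallace_tm(seq):
    """Rough melting temperature (degrees C) by the Wallace rule.

    Assumption: this rule of thumb is only reasonable for short oligos;
    a real design would use nearest-neighbor thermodynamics.
    """
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


def has_hairpin(seq, stem=4, min_loop=3):
    """True if a stem-length window can pair with a downstream window.

    Assumption: a perfect stem of `stem` bases separated by at least
    `min_loop` bases is treated as a stable intramolecular structure.
    """
    for i in range(len(seq) - 2 * stem - min_loop + 1):
        partner = reverse_complement(seq[i:i + stem])
        if seq.find(partner, i + stem + min_loop) != -1:
            return True
    return False


def select_probes(genes, k, tm_target, tm_tol):
    """Pick the first acceptable k-mer for each gene.

    genes: dict mapping gene name -> sequence (hypothetical input format).
    """
    # Constraint 1: count how many genes contain each k-mer (exact match only).
    gene_counts = Counter()
    for seq in genes.values():
        gene_counts.update(set(kmers(seq, k)))

    probes = {}
    for name, seq in genes.items():
        for cand in kmers(seq, k):
            if gene_counts[cand] != 1:
                continue  # also occurs in another gene
            if abs(wallace_tm(cand) - tm_target) > tm_tol:
                continue  # Constraint 2: outside the temperature window
            if has_hairpin(cand):
                continue  # Constraint 3: likely secondary structure
            probes[name] = cand
            break
    return probes
```

A gene for which no candidate passes all three filters is simply absent from the returned dictionary; a real application would need a policy for such genes, for example relaxing tm_tol.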
Because each oligo sequence is chosen independently of the others, a workable 'embarrassingly' parallel implementation is possible in which the database is replicated on each processing element. However, for large sequence databases (6 Gbp for human), partitioning the sequence database among processing elements may allow higher performance. Our initial goal is to produce a set of optimal oligos for the yeast genome (6 Mbp), with an implementation that can be scaled up to large-scale calculations.
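One way the partitioned variant might look is sketched below in Python, assuming that specificity is approximated by exact k-mer matching. Each worker counts k-mers in its own shard of the database, and the partial counts are merged to find k-mers that occur in exactly one gene. The names partition, count_shard, and unique_kmers_partitioned are hypothetical, and a thread pool is used only to keep the sketch runnable anywhere; in CPython, real speedup for this workload would need separate processes or a message-passing library.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(genes, n_parts):
    """Deal genes round-robin into n_parts shards (dicts of name -> sequence)."""
    shards = [{} for _ in range(n_parts)]
    for i, (name, seq) in enumerate(sorted(genes.items())):
        shards[i % n_parts][name] = seq
    return shards


def count_shard(args):
    """Count, for one shard, how many of its genes contain each k-mer."""
    shard, k = args
    counts = Counter()
    for seq in shard.values():
        # A set, so a k-mer repeated within one gene is counted once.
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return counts


def unique_kmers_partitioned(genes, k, n_parts=4):
    """Return the k-mers that occur in exactly one gene of the database."""
    shards = partition(genes, n_parts)
    # Map step: each worker sees only its own shard of the database.
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(count_shard, [(s, k) for s in shards]))
    # Reduce step: merge the per-shard counts into global counts.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return {kmer for kmer, n in total.items() if n == 1}
```

The design trade-off is that no worker needs the whole database in memory, at the cost of a merge step whose size grows with the number of distinct k-mers.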