BI508
Algorithms in Computational Molecular Biology
Tu Th 1:30-3:00 in Higgins 425


Course description | Text Book(s) | Tentative Syllabus | Lectures, Class Notes, Data and Homework | Source code of programs | Grading policy


Course description

A good understanding of basic algorithms in computational molecular biology is of paramount importance to bioinformatics researchers, especially those who intend to work at the cutting edge of research. Experimental work in the life sciences is driven in part by the introduction of new experimental procedures -- hence, existing web servers and bioinformatics software packages may not be capable of analyzing data generated by new experimental techniques. For this reason, bioinformatics researchers who have skills in novel algorithm development will have a competitive edge in both industry and academia. As a case in point, consider the many next-generation software packages capable of aligning short (Solexa) reads to the genome -- the authors of such software as Bowtie, SeqAn, Eland, RMAP, SOAP, SHRiMP, Maq, Mosaik, etc. were all intimately familiar with the details of the Smith-Waterman and BLAST algorithms when developing their extensions.

In this course, we will cover basic algorithms used in the various areas of bioinformatics (computational genomics, structural biology, systems biology). Topics may include: pairwise and multiple alignment, wraparound alignment (tandem repeat search), genomic rearrangements, Boltzmannian Monte Carlo (Metropolis-Hastings) and non-Boltzmannian Monte Carlo (Wang-Landau), hidden Markov models (HMMs), phylogenetic tree construction, RNA and protein secondary structure prediction, machine learning applications (neural networks, support vector machines), clustering, microarray data analysis, transcription factor binding site detection, etc.

Many practical bioinformatics courses focus on topics such as how to use the hidden Markov model software HMMER, GeneMark or GenScan, how to use PHYLIP software to construct phylogenetic trees, etc. In contrast, this course focuses on the methods underlying HMMER, GeneMark, etc., so that you will be able to develop the next generation of bioinformatics software.

Course work will include implementation of the algorithms we cover and, in some cases, development of applications using publicly available code (e.g. the LIBSVM support vector machine library). Programming projects will often be done with a partner, to build teamwork skills. The programming language is up to you (e.g. Python, Perl, C/C++, Java, MATLAB, etc.).



Intended audience, prerequisites, and course work

The course is intended for advanced undergraduate students and graduate students in the physical and life sciences, as well as in mathematics and computer science. It is essential to be able to program well in some language.

The course is based on Computational Molecular Biology: An Introduction, by P. Clote and R. Backofen, Wiley & Sons, Inc. (August 2000), as well as material from the literature.

The course grade will depend on homework (mostly implementation of algorithms), a midterm and a final examination. Computational biology differs from most biology courses, which require memorization and familiarity with lab techniques. In contrast, computational biology concerns the mathematical design of bioinformatics algorithms as well as the implementation details needed to develop efficient code. For this reason, it may be more beneficial for the midterm and final to be take-home examinations consisting of substantial programming projects. This will be decided during the semester.



What is Computational Biology?

In the past, biologists generally grouped living organisms into two distinct life forms or domains: Prokarya and Eukarya.

Methanococcus jannaschii is a methanogenic archaebacterium, first collected in 1982 by the Woods Hole submersible Alvin near white smokers at a hot spot on the floor of the Pacific Ocean, at a depth of 2600 meters. In August 1996, the 1.66 megabase pair genome of M. jannaschii was published by Bult et al. in Science, where it was asserted that more than 56% of its 1738 genes were completely new, unlike any genes in existing databases. The DNA sequence consists of over 1.6 million characters; a small initial portion is given as follows.

TACATTAGTGTTTATTACATTGAGAAACTTTATAATTAAAAAAGATTCATGTAAATTTCT
TATTTGTTTATTTAGAGGTTTTAAATTTAATTTCTAAGGGTTTGCTGGTTTGATTGTTTA
GAATATTTAACTTAATCAAATTATTTGAATTTTTGAAAATTAGGATTAATTAGGTAAGTA
AATAAAATTTCTCTAACAAATAAGTTAAATTTTTAAATTTAAGGAGATAAAAATACTCTG
TTTTATTATGGAAAGAAAGATTTAAATACTAAAGGGTTTATATTATGAAGTAGTTACTTA
CCCTTAGAAAAATATGGTATAGAAAAGCTTAAATATTAAGAGTGATGAAGTATATTATGT

Analysis of the DNA sequence of M. jannaschii provided solid evidence for a startling hypothesis advanced two decades earlier by Carl Woese: there is a third domain of life called Archaea, which is distinct from Prokarya and Eukarya.

How can one determine the (hypothetical) genes of M. jannaschii from its 1.66 megabase pair genome? Obviously this must be done by a computer program, but if the majority of the (hypothetical) genes in this new life form have no homology to known genes, then how does the program work?

The TIGR group of Bult et al. used the commercial software GeneMark, which implements a fifth-order Markov model. We'll study Markov chains and important machine learning algorithms for recognizing (inexact) patterns. In particular, we'll study the theory of hidden Markov models (HMMs) and then implement them; HMMs are currently used to determine genes, intron/exon splice sites, parts of the genome wrapped around nucleosomes, etc.
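
To give a feel for what a k-th order Markov model of DNA involves, here is a minimal Python sketch. It is only an illustration under simplifying assumptions, not the method used by the gene-finding software named above: it fits a single homogeneous chain (ignoring codon position), smooths with an assumed pseudocount of 1, and the function names train_markov and log_likelihood are hypothetical.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"

def train_markov(sequences, k, pseudocount=1.0):
    """Estimate P(next base | previous k bases) from training sequences.

    Returns a function prob(context, base). The pseudocount is an
    assumed smoothing choice so that unseen k-mers get nonzero probability.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for i in range(k, len(seq)):
            counts[seq[i - k:i]][seq[i]] += 1.0

    def prob(context, base):
        ctx = counts.get(context, {})
        total = sum(ctx.values()) + pseudocount * len(ALPHABET)
        return (ctx.get(base, 0.0) + pseudocount) / total

    return prob

def log_likelihood(seq, prob, k):
    """Log-probability of seq under the model, conditioned on its first k bases."""
    return sum(math.log(prob(seq[i - k:i], seq[i])) for i in range(k, len(seq)))

if __name__ == "__main__":
    # Toy training data; a real model would be trained on known genes.
    model = train_markov(["ACACACACAC"], k=1)
    print(log_likelihood("ACAC", model, k=1))
    print(log_likelihood("AAAA", model, k=1))
```

Setting k=5 gives a fifth-order chain. A simple classifier could train one such model on coding sequence and another on non-coding sequence, then compare the two log-likelihoods over a window of the genome.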

Sequence similarity between the new genes of M. jannaschii and those in existing databases was determined by programs. How do these programs work?
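
One classical approach is local alignment by dynamic programming, as in the Smith-Waterman algorithm mentioned in the course description. The following Python sketch computes only the optimal local alignment score; the scoring values (match +2, mismatch -1, linear gap penalty -2) are assumptions chosen for illustration, and the function name is hypothetical.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Uses a linear gap penalty and keeps only two rows of the
    dynamic programming table, so memory use is O(len(b)).
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0: a local alignment may start anywhere.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

if __name__ == "__main__":
    print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

Recovering the alignment itself, rather than just its score, requires keeping the full table (or traceback pointers) and walking back from the highest-scoring cell until a zero is reached.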

Topics to be covered will be among the following, as well as some new results from the literature.



Required Texts

Suggested Texts

Do not purchase; this list is provided for future reading if you are interested.

Additional references if biology background is needed

For your reference; do not purchase.
  1. All you need to know about DNA, Genes and Genetic Engineering, A Concise, Comprehensive Outline, by Gordon R. Carter and Stephen M. Boyle, Charles C. Thomas Publisher, Ltd., Springfield, Illinois, 1998
  2. Molecular Biology of the Gene, J.D. Watson et al., 3rd edition, Benjamin/Cummings Publishing Co, 1987
  3. Introduction to Computational Molecular Biology, J. Setubal and J. Meidanis, PWS Publishing Co, 1997
  4. Introduction to Protein Structure, C. Branden and J. Tooze, Garland Pub, NY, 1991
  5. Molecular Evolutionary Genetics, M. Nei, Columbia University Press, 1987
  6. Genes V, B. Lewin



Grading Policy

Homework, class participation 30%
Midterm 30%
Final Exam 40%

The grading policy is subject to change. Any change will be announced clearly and well in advance.

