A course on SEQUENCE ANALYSIS. Roderic Guigó

SEQUENCE SIMILARITY.

Sequence analysis has become such an important field because the sequence of a biomolecule ---DNA or protein---carries a lot of information about the biomolecule's function and history. In particular, we usually assume that similar sequences hold a similar function---structure---, and they have a similar evolutive history. There is very strong evidence supporting such an assumption. It is not surprising, thus, that among the first tools developed in the field of sequence analysis were those aimed to determine the degree of similarity between two sequences. In this lesson, we will address the issue of sequence similarity. In the first part of the lesson, we will start introducing the concept of sequence alignment, on which the concept of sequence similarity (distance) depends. An alignment is simply a correspondence between the sequences, in which each symbol in a sequences is assigned no more than one (maybe none) of the symbols in the other sequence, and in which the order of the symbols in the sequence is maintained. Given a matrix of similarities (distances) between all possible pairs of symbols on which the sequences are built, the similarity (distance) of the alignment is the sum of the similarities (distances) between the aligned symbols. The problem is to find the alignment between two sequences having the maximum similarity (minimum distance). We will see that exhautive exploration of all possible alignments to determine the optimal one is computationally impractical, and we will introduce a dynamic programming algorithm much more efficient---the first version of it developed in the early 1970s by Needelman and Wunsch. With the dynamic programming algorithm only the optimal alignments between all initial segments of the sequences to compare need to be determined in order to find the optimal global alignment between such two sequences. Since the aligment depends strongly on the similarity (distance) matrix used, we will next describe the way in which the two most commonly used sets of matrices---PAM and BLOSUM---are derived. We will finish the first part of this lesson by addressing briefly the issue of the statistical significance of the similarity between aligned sequences. The second part of the lesson is devoted to the problem of finding local alignments between sequences. Often, a global alignment between two sequences has little biological meaning---for instance, when aligning two genomic regions containing a gene. In such cases, it may be more relevant to align those regions in the sequence showing high similarity than attempting to align the whole sequences. We will describe the modification of the global alignment dynamic programming algorithm that allows to find such local alignments ---the modification is due to Smith and Waterman in the early 1980s---. Finally, in the third part of the lesson we will address the problem of building alignments between multiple sequences.

SEQUENCE ALIGNMENT
DYNAMIC PROGRAMMING GLOBAL ALIGNMENT

Needleman, SB, Wunsch, CD, A general method applicable to the search for similarities in the amino acid sequence of of two proteins. J.Mol.Biol. 48:443-453 (1970)

SUBSTITUTION MATRICES

PAM
BLOSUM
Henikoff S, Henikoff JG, Amino acid substitution matrices from protein blocks. PNAS 89:10915-10919 (1992)

STATISTICAL SIGNIFICANCE OF SEQUENCE ALIGNMENTS

Waterman, M. Estimating statistical of sequence alignments, Phil.Trans.R.Soc.Lond B 344:383-390 (1994)

DYNAMIC PROGRAMMING LOCAL ALIGNMENT

Smith, TF, Waterman, M. Identification of common molecular subsequences, J.Mol.Biol. 147:195-197 (1981).

MULTIPLE ALIGNMENT

Last modified: Wed Jun 3 21:42:11 MDT 1998

SEQUENCE SIMILARITY.

HANDOUTS

RECOMENDED INTERNET DOCUMENTS

PRACTICAL