Sequence alignment using Longest Common Subsequence algorithm

January 03, 2018

Introduction

In molecular biology, DNAs and proteins can be represented as a sequence of alphabets. DNA sequences consist of A, T, G, C representing nucleobases adenine, thymine, guanine and cytosine. Proteins consist of 20 different letters indicating 20 different amino acids.

Comparison of two sequences, known as sequence comparison, either from the same organism or from different organism is an important task in molecular biology. It is helpful in providing solutions to many biological questions, for example:

predicting structure and function of proteins
inferring evolutionary history and relatedness of species
locating common subsequences in genes / proteins to identify common motifs,
as a sub-problem in genome assembly for DNA sequencing

In order to perform sequence comparison, we first perform sequence alignment.

What is a Sequence Alignment?

A sequence alignment is a way of placing one sequence above the other in order to identify the correspondence between similar characters or substrings. It can be performed on Deoxyribonucleic acid (DNA), Ribonucleic acid (RNA) or protein sequences.

Sequences from different organisms may be of different sizes. An alignment requires the insertion of spaces in arbitrary locations along the sequence so that both will have the same size. ’Spaces’ or ’gaps’ are inserted either in the beginning or in the end of the sequences.

Let’s consider an example. Say we have the following two DNA sequences GACGGATTAG and GATCGGAATAG. One possible alignment between these two sequence is shown below

GA-CGGATTAG
|| |||| |||
GATCGGAATAG

Note the differences in the two sequences as per this alignment. An extra T in the second sequence, and a substitution from T to A. A ‘space’ is added to the first sequence at a suitable location to align the sequences.

Types of alignment

There are two types of alignment based on what we are looking for:

Global alignment: Aligns a query sequence with the target sequence along the entire length of the sequence. Global alignment is a global optimization process that spans the entire length of two query sequences.
Local alignment: Aligns a substring of the query sequence to a substring of the target sequence, i.e. finding local regions of high similarity between two sequences.

There are also two types of alignment based on the number of sequences being aligned

Pairwise alignment: Involves two sequences, the query and the target with which it is aligned.
Multiple alignment: Involves alignment with more than two sequences.

Software tools for sequence alignment

Locating regions of similarity of the query sequence with the database sequences is a challenging task in bioinformatics. Tools like BLAST and FASTA helps in detecting regions of similarity among organisms. It involves the use of local pairwise alignment, exhaustive heuristic algorithms and dynamic programming approaches like Smith-Waterman algorithm to detect regions of similarity of the query sequence with the database sequence.

Longest common subsequence (LCS) problem

One way of detecting the similarity of two or more sequences is to find their longest common subsequence. Longest common subsequence (LCS) of two sequences is a subsequence, of maximum possible length, which is common to both the sequences.

Illustration

Let’s consider an example. Suppose we have the following two DNA sequences: TAGTCACG and AGACTGTC. The LCS of the two sequences is AGACG, which can be obtained from the following alignment.

TAGTCAC-G--
 ||  || |  
-AG--ACTGTC

There are other possible common sequences of shorter length, such as AGCG or AGTC. Although multiple LCS are possible in general, there is only one LCS for this particular example, i.e. there is no other common subsequence of length 5 for these two sequences.

The LCS also helps in computing how similar the two sequences are: the longer the LCS, the higher the similarity.

LCS algorithm

Let us suppose we have two sequences S1 and S2 of lengths m and n respectively, where S1 = a₁ a₂ … a_m and S2 = b₁ b₂ … b_n.

We will construct a matrix A where A_i,j denotes the length of the longest common subsequence of a₁ a₂ … a_i and b₁ b₂ … b_j.

Example matrix for sequences S1 = TAGTCACG and S2 = AGACTGTC.

Sequence S1 is written vertically, and S2 horizontally. Each letter representing the rows is compared with each letter representing the columns. We’ll proceed row by row, column by column.

If a_i = b_j, then we have found a match. We get a score of 1 for the current match, and A_i-1,j-1 from the rest of the LCS we already obtained from substrings a₁ … a_i-1 and b₁ … b_j-1.
If a_i ≠ b_j, then we have a mismatch. In this case, we need to consider two possibilities. The LCS a₁ … a_i and b₁ … b_j-1 , and the LCS of a₁ … a_i-1 and b₁ … b_j .

Hence, we have

$$ A_{i,j} = \begin{cases} A_{i-1,j-1} + 1 &\text{if } a_i = b_j \\ max(A_{i-1,j}, A_{i,j-1}) &\text{if } a_i \neq b_j \end{cases} $$

and A_0,0 = A_0,j = A_i,0 = 0 for 1 ≤ i ≤ m, 1≤ j ≤ n.

It takes O(nm) time to fill in the m by n matrix A. This approach is called dynamic programming.

Step by step example

The matrix is filled row by row, column by column.

The first row R1 and column C1 are filled with zero.
(mismatch example) The first letter in the row R2 is compared with the column C2. The cells in this case contains ‘T’ and ‘A’. Since it is a mismatch, the score is taken from left or above, i.e. (R1, C2) and (R2, C1). As both values are ‘0’, the total score becomes max(0, 0) = 0 in this case.
We continue filling rest of R2.
(match example) When R3 is compared with C2, both the cell contain ‘A’. Since it is a match, the score is given by 1 + the score at diagonally up and to the left, i.e. (R2, C1), which is ‘0’ in this case. So the total score becomes 1.
We continue filling the rest of R3.
(match example) In a similar way, when R4 is compared with C3, it is again a match. This time, the score is given by 1 + score from (R3, C2) which is 1. Hence, total score becomes 2.
The entire matrix is filled in this way. The final score is given by A_m,n = 5.

Example matrix for sequences S1 = TAGTCACG and S2 = AGACTGTC.

Traceback

The actual subsequence is deduced by performing a traceback, i.e. tracing backwards from the right bottom to the top left. When the length decreases is all directions, the sequences must have had a match. Several paths are possible when two arrows are possible in a cell (given in the box below). In such cases either path can be chosen.

The locations of the diagonal edges contains all the required information for constructing the LCS, as well as knowing the locations in S1 and S2 which the LCS corresponds to.

The LCS for the above example is AGACG of length 5. S1 and S2 with locations of diagonal edges underlines are as follows: TAGTCACG and AGACTGTC.

Scoring matrices

In the above example, we said we get a 0 for a mismatch and a 1 for a match. This can be generalized to different scores for deletion, insertion or match of each of the alphabets, and different scores for each possible pair of mismatched alphabets.

Examples of scoring matrices for DNA alignment. The matrices include scores for matches and mismatches. Identity matrix is equivalent to the LCS algorithm discussed above.

The score for the matches and mismatches are based on evolutionary significance of bases (in case of DNA) and amino acid (in case of proteins). For example, the amino acids are given weightage based on their conserved nature of occurrence observed empirically.

BLOSUM (BLOcks SUbstitution Matrix). Commonly used scoring matrix for sequence alignment of proteins. Each row, column is a amino acid, and a protein is a sequence of amino acids.