Sequence Motifs, Consensus Sequences and The Motif Finding Problem

February 27, 2018

Sequence Motifs and their Biological Significance

Sequence motifs are nucleic acid sequences that are widespread across or within a genomes and have or are speculated to have certain regulatory or structural biological functions.

Motifs that are found in different parts of the genomes like exons, introns and junk, have different functions. Motifs present in the exons ( coding part of the genome) decide the structure of the protein or label proteins to be sent to certain parts of the cell for processes like phosphorylation. Motifs that are present in introns (which makes up the non coding part of genome) are usually the regulatory sequences which determine the amount of gene expression and binding sites of proteins. Satellite DNA, which is the main component of centromeres and heterochromatin, is an example of motif found in junk parts of the genome.

Sequence Motifs and Consensus Sequences

Different occurrences of a sequence motif may differ from each other even if they perform the same function. As such, we define what is known as the consensus sequence for a set of sequences.

Given a set of sequences, a consensus sequence (also called canonical sequence) is the sequence obtained by taking the most frequent residues of nucleic acids / amino acids at each position.

For example, the consensus sequence of AGAT, TGAC and ACAC is AGAC. Writing the sequences one below the other makes it easier to see this - A is the most common residue in position 1, G is most common in position 2, and so on.

AGAT := Sequence 1
TGAC := Sequence 2
ACAC := Sequence 3
AGAC := Consensus sequence

Representation of Motifs and Consensus Sequences

Given that motifs are substring patterns which are found at multiple places each of which may or may not have mutations, there are specific ways of representing these sequences.

Text representation: We will use the following example of a nucleic acid sequence to introduce various aspects of the representation. Example,

T [GA] N Y {T} [CT] R

In this representation, A, T, G and C, denote the four possible nucleotide bases adenine, thymine, cytosine, and guanine. N denotes any nitrogen base, Y and R denote pyrimidine (T or C) and purine (A or G) respectively. [GA] means that, at that position, either G will be present or A. {T} signifies that, at this position any other base will be present except T.

Graphical representation: There is also a graphical method of displaying consensus sequences, i.e consensus logo method. A consensus logo conveys information about the conservation of each position of a sequence motif. The consensus logo depicts the degree of conservation of each position using the height of the consensus character at that position (note that degree of conservation is different from the frequency of each nucleotide at each position).

Positions where there is lesser conservation have shorter total height (position 3, 5, 19, 21, etc). Positions with high conservation have highest total height (usually taken by a single character), for example, positions (6, 10, 14, 18, etc). The exact height of each character comes from the entropy of each position, which is described in detail later in the article.

Motif Finding Problem

Motif finding is described as the problem of discovering motifs without any prior knowledge of what the motifs look like.

Computationally, the motif finding problem can be defined as: Given a set of T sequences each of length N, find the best pattern of length L that appears in each of the T sequences. Different types of scores exist to help us score a given pattern and choose which one is the best.

For the examples below, we’ll consider T = 5, N = 100, and L = 8. That is, we are given 5 sequences of length 100, and we want to find one sequence motif of length 8 from each sequence such that the motifs are similar to each other.

Scoring a set of Motif candidates

First, we’ll see how to score a given set of motifs. That is, we’ll assume that we are given some sequences of length L from each of the T sequences which are our motif candidates. Given this, we’ll see how to score these motif candidates. In the next section, we’ll see how we can find the motifs in the original sequences.

Suppose we have the following set of motif candidates obtained from a set of sequences (5 motif candidates, each of length 8).

5 motifs each of length 8. Red letters mark the deviation from consensus.

From the given motifs, we first construct a profile matrix, which is simply the frequency of each nucleotide base in each position. Hence for our above example, following is how our profile matrix will look like,

The most common letter in each column taken together gives us the consensus string. Hence in this example, our consensus sequence would have A at the first position because it appears maximum times in column 1, similarly we will have C at second position and so on to get the following consensus sequence: ACGTACGT.

One possible score for the set of motif candidates is the total number of mutations (in the above example, that is 2 + 1 + 1 + 0 + 2 + 1 + 2 + 1 = 10). Our objective is to minimize this score.

A more appropriate score is the entropy of the profile matrix. Entropy is a measure of how conserved each position is. Higher entropy implies low conservation, and low entropy implies high conservation. Let prob(R, l) be the probability that residue R appears at location l, i.e. prob(R, l) = count(R, l) / T. For example, for given the profile matrix above, prob(C, 1) = 2/5. We’ll measure the entropy for each location independently. It is defined as follows:

$$ - \sum_{\text{location l}} \ \sum_{\text{residue }R} \ {\text{prob}(R, l)} \ \log{\text{prob}(R, l)} $$

That is, we sum prob(R, l) log(prob(R, l)) over all residues over all locations.

Finding the Motifs (Randomized Algorithm)

We’ll use a randomized iterative algorithm for finding the motifs. We use a randomized iterative algorithm since trying all possible values for the profile matrix is not feasible. There are faster algorithms when the scoring function uses only the consensus sequence instead of the profile matrix, but the entropy function is a much better scoring function than scoring functions based purely on the consensus sequence, which doesn’t take into account how conserved each position of the consensus sequence is.

The algorithm is as follows.

Choose the initial motifs randomly in each of the given sequences. This gives us a set of motif candidates M.
Calculate the profile matrix P for motif candidates M (as described in the previous section).
For each sequence S, given the profile matrix P, find the best motif (described in more detail below). This gives us a new set of motif candidates M.
If the new set of motif candidates is different from the previous one, go back to 2. Else stop.

For step 3, we can score each L-mer given the profile matrix P. The score is given as follows

$$ - \sum_{\text{location }l} \ \ \log{\text{prob}(R, l)} $$

where the sum is over the motif’s residues, and prob is defined by the profile matrix. For example, for the sequence ACAGACAT, the score would be -[log(prob(A, 1)) + log(prob(C, 2)) + log(prob(A, 3)) + … + log(prob(T, 8))].

The randomized algorithm converges really fast. Since our final results depend on the random initialization done in the beginning, we often run the entire process multiple times and choose the motif set with the lowest entropy from all the runs.

Gibb’s Sampling Algorithm

A slight revision to the above randomized algorithm makes the algorithm perform even better. This version is known as Gibb’s Sampler algorithm.

Procedure to perform Gibb’s sampling (major changes from previous algorithm in bold):

Randomly choose starting positions and for the motif L-mers for each of the T given sequences.
Randomly choose one of the T sequences.
Create the profile matrix for remaining T-1 sequences.
For each position in the removed sequence, calculate the score for each L-mer as per the profile matrix.
Sample the new L-mer to use from this sequence. The sampling weight for each L-mer is e^score. (we use e^score since score is the log probability).
Keep repeating from step 2 till score for profile matrix stops improving.

The Gibb’s sampling algorithm is better than the previous one because it much more stable. It takes longer to converge, but we don’t have to run it many many times and it gives the same answer almost all the time.