Conserved Sequences: Their Biological Significance, and the K-mer Finding problem

February 24, 2018

In this article, we’ll introduce the concept of conserved sequences and also describe their biological significance. Then, we’ll see how we can reduce the problem of finding conserved sequences to the problem of finding most common K-mer in a given sequence and further revise the problem to handle mismatches, in-order to make our problem more biologically plausible. Finally, we’ll see a simple algorithm to solve the K-mer problem with mismatches.

Conserved Sequences

In evolutionary biology and genetics, conserved sequences refer to identical or similar sequences of DNA or RNA or amino acids (proteins) that occur in different or same species over generations. These sequences show very minimal changes in their composition or sometimes no changes at all over generations.

The following example shows what conservation of sequences across species actually looks like:

In this image we are looking at the amino acid sequence of mammalian histone proteins and their conserved regions. Those indicated in grey are conserved in all the species while we see the gaps in white which tend to change over different species.

Common examples of conserved sequences include,

translation and transcription related sequences which are found conserved in the genome at multiple places
certain RNA components in ribosomes are found to be highly conserved over various species
tmRNA is found to be conserved in multiple bacteria species
other examples like TATA (repetitive regions) and homeoboxes (involved in regulating embryonic development in a wide range of species).

Conserved Domains database available at NCBI has extensive resources on conserved sequences in different organisms and genomes. It uses protein specific scoring matrices to identify conserved sequences.

Types of Conserved Sequences

Conserved sequences can be categorized into two major categories, orthologous and paralogous. A conserved sequence is called orthologous when identical sequences are found across species and it is called paralogous when identical sequences are found within the same genome over generations.

Example of orthologous conserved sequences: A research on genomes of vertebrate, worm, insect and a yeast genome (PubMed Central PMCID:PMC1182216) found conserved elements across genome wide alignments of 5 species of vertebrates (human, rat, mouse, chicken and Fugu rubripes), 2 species of Caenorhabditis and 7 species of Saccharomyces. The conserved element that was analyzed was found to be 3%–8% of the human genome and substantially higher fractions of the more compact Drosophila melanogaster (37%–53%), Caenorhabditis elegans(18%–37%), and Saccharaomyces cerevisiae (47%–68%) genomes.

Example of paralogous converved sequences: Sequences of DNA in haemoglobin gene in humans is found to be identical at multiple places on the genome and myoglobin gene sequence in chimpanzees.

We often see cases of extreme conservation of nucleic acid or amino acid sequences, these are called ultra conserved sequences. For example, certain sequences in vertebrates have been found in widely different taxas varying drastically. In another case we have universally conserved sequences that comprise of almost all organisms, examples of such sequences are GTP binding elongation factor, ribosomal RNA’s and tRNA’s, etc.

Significance of Conserved Sequences

Biological Significance

Conserved sequences found in different genomes can be either coding sequences or non coding sequences. As coding sequences, amino acids and nucleic acids are often conserved to retain the structure and function of a certain protein. These sequences undergo minimal changes. When changes happen, they usually replace an amino acid or nucleic acid with one which is biochemically similar. Similarly, other mRNA related nucleic acid sequences are often conserved. Non coding sequences, like ribosomes sites, transcriptional factors, binding site, etc, are also conserved sequences.

Computational Significance

Conserved sequences help us find homology (similarity) among different organisms and species. Phylogenetic relationships and trees could be developed and effective ancestry could be found using the data on conserved sequences. A common example is the conserved sequence “16S RNA” which is used to reconstruct phylogenetic relationship among various bacterial phyla.

Conserved sequence can also be used to mark the origination of genetic disorders and mutations. By comparing genomes which have a certain conserved sequence common to them we can easily identify anomalies, any exist.

Finding Conversed Sequences with K-mers

In this section, we will see how given a section of a single DNA, how we can find short conserved sequences. The conserved sequences we are looking for are called regulatory motifs. Regulatory motifs are short DNA segments (say 15-30 nucleic acids) which control the expression of genes, i.e. how many times a gene is transcribed, and hence how much of the corresponding protein is produced.

K-mers are substrings of length k that are found in the input string. In case of computational genomics, the input string represents a sequence of amino acids or nucleic acids. For example 5-mers refer to substrings of length 5, and 7-mers refer to substrings of length 7.

Most Frequent K-mers Problem

We frame the problem of finding short conserved sequences as follows. Given the input sequence of amino acids or nucleic acids, find the K-mer that occurs most frequently. Let’s take an example

We have following data,

INPUT:

Sequence: ACGTTGCATGTCGCATGATGCATGAGAGCT

k = 4

EXPECTED RESULT:

Most frequently occurring 4-mer from the input sequence.

EXAMPLE:

We can use sliding window technique to find all the K-mers. Let us note down all the k-mers,

ACGT : CGTT : GTTG : TTGC : TGCA : GCAT : CATG : ATGT : TGTC : GTCG : TCGC : CGCA : GCAT : CATG : ATGA : TGAT : GATG : ATGC : TGCA : GCAT : CATG : ATGA : TGAG : GAGA : AGAG : GAGC : AGCT

In this example, we see that 4-mers CATG and GCAT are the most frequently occurring 4-mers, as they appear 3 times each.

Allowing Mismatches in K-mers

However, from experiments in biology, we’ve found out that it is possible for conserved sequences to undergo minor changes. As such, we need to expend the above problem to handle mismatches.

For example, ATCCGAT and ATCGGAA have 2 mismatches, one at positions 4 and another at position 7. Let’s see how we can define the problem of finding most frequent k-mers with allowance for mismatches.

INPUT:

Sequence: ACGTTGCATGTCGCATGATGCATGAGAGCT

K = 4, d = 1

EXPECTED RESULT:

Most frequent 4-mers with allowance for 1 mismatch per K-mer.

EXAMPLE:

We have taken the same example sequence as in the previous problem. Hence, the list for all possible 4-mers is unchanged. They were as follows:

ACGT : CGTT : GTTG : TTGC : TGCA : GCAT : CATG : ATGT : TGTC : GTCG : TCGC : CGCA : GCAT : CATG : ATGA : TGAT : GATG : ATGC : TGCA : GCAT : CATG : ATGA : TGAG : GAGA : AGAG : GAGC : AGCT

However, our final result will change as now we have to take into account all possibilities with 1 mismatch allowed.

If we take GATG for example, there are 5 k-mers in the above sequence which match GATG allowing for one mismatch, i.e. GTTG, CATG, CATG, GATG and CATG. Similarly there are 5 matching k-mers for ATGC and ATGT as well. Hence, our result for the most frequent k-mers with allowance for 1 mismatch are GATG, ATGC (matches TTGC, ATGT, ATGA, ATGC, ATGA) and ATGT (matches ACGT, ATGT, ATGA, ATGC, ATGA).

Step-by-Step Algorithm for the K-mers problem

The following is a simple procedure for solving the above problem:-

Create list L of all K-mers in the original string
For every K-mer X in the original string
Consider every K-mer Y in the original string
Count the number of mismatches m between X and Y
If m <= d, then increase score of X by 1
Result = K-mer X with highest score

Computational Efficiency: If the original length of the string is L, then the algorithm does about L²K calculations. Note that L can sometimes be quite large, say 10s of millions or even billions (human DNA has comprises of about 3-4 billion nucleic acids).

Correctness: The above algorithm works only if the K-mer appears correctly (without any mismatches) at-least once in the DNA sequence. Although this is not necessary, in practice this is usually the case. This is the case for many algorithms in bioinformatics, whereby an algorithm is not proved to give optimal results all the time, but in practice, it works quite well.