Sequence Motifs, Consensus Sequences and The Motif Finding Problem
Sequence Motifs and their Biological Significance
Sequence motifs are nucleic acid sequences that are widespread within a genome or across genomes and that have, or are speculated to have, specific regulatory or structural biological functions.
Motifs found in different parts of the genome, such as exons, introns and junk DNA, have different functions. Motifs present in exons (the coding part of the genome) determine the structure of the protein or label proteins to be sent to certain parts of the cell for processes like phosphorylation. Motifs present in introns (which are part of the non-coding genome) are usually regulatory sequences that determine the level of gene expression and the binding sites of proteins. Satellite DNA, the main component of centromeres and heterochromatin, is an example of a motif found in the junk parts of the genome.
Conserved Sequences: Their Biological Significance, and the K-mer Finding problem
In this article, we'll introduce the concept of conserved sequences and describe their biological significance. Then, we'll see how the problem of finding conserved sequences can be reduced to the problem of finding the most common K-mer in a given sequence, and we'll revise that problem to handle mismatches in order to make it more biologically plausible. Finally, we'll see a simple algorithm to solve the K-mer problem with mismatches.
In evolutionary biology and genetics, conserved sequences are identical or similar DNA, RNA or amino acid (protein) sequences that occur in the same or different species over generations. These sequences show minimal changes in their composition, or sometimes no changes at all, over generations.
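As a rough illustration of the K-mer problem with mismatches, here is a minimal Python sketch. It is one simple approach and not necessarily the algorithm the article goes on to present. The function names (`hamming`, `neighbors`, `frequent_kmers_with_mismatches`), the parameter `d` for the number of allowed mismatches, and the sample string are all assumptions made for this example.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def neighbors(kmer, d, alphabet="ACGT"):
    """All strings of the same length within Hamming distance d of kmer."""
    if d == 0:
        return {kmer}
    if len(kmer) == 1:
        return set(alphabet)
    result = set()
    for suffix in neighbors(kmer[1:], d, alphabet):
        if hamming(kmer[1:], suffix) < d:
            # A mismatch is still available, so any first letter is allowed.
            result.update(c + suffix for c in alphabet)
        else:
            # All d mismatches are used up; the first letter must match.
            result.add(kmer[0] + suffix)
    return result


def frequent_kmers_with_mismatches(text, k, d):
    """Return (patterns, count): the k-mers with the most approximate
    occurrences in text, allowing up to d mismatches per occurrence."""
    counts = Counter()
    for i in range(len(text) - k + 1):
        # Every neighbor of this window counts the window as an occurrence.
        for pattern in neighbors(text[i:i + k], d):
            counts[pattern] += 1
    if not counts:
        return [], 0
    best = max(counts.values())
    return sorted(p for p, c in counts.items() if c == best), best


# Hypothetical sample input, chosen only for illustration.
patterns, count = frequent_kmers_with_mismatches(
    "ACGTTGCATGTCGCATGATGCATGAGAGCT", k=4, d=1
)
print(patterns, count)
```

Note that a winning pattern does not have to appear exactly in the text; it only has to be close to many windows of the text, which is what makes the mismatch version more tolerant of mutations.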
The following example shows what conservation of sequences across species actually looks like:
Synteny Blocks, Genetic Rearrangements and Synteny Block Construction
What are Synteny Blocks?
Synteny blocks are conserved regions within two sets of chromosomes. In other words, they are identical stretches of nucleotides on two different chromosomes.
Let's take the example of mouse and human chromosomes. The genomic similarity between human and mouse chromosomes is unexpectedly high, at about 85%. This high similarity implies that, at the molecular level, many of the functions performed in human and mouse cells are the same, even though a human looks very different from a mouse on the outside.
If we look specifically at the X chromosome, the similarity is even higher, at 95%. There are 11 stretches of nucleotides (synteny blocks) that occur in both the human and the mouse X chromosome, though at drastically different locations on the chromosome.
In this article, we'll talk about a method for constructing evolutionary trees, known as character based evolutionary tree construction. It was initially designed to infer evolutionary relationships based on morphological and physiological characters.
In character based tree construction, we are given a DNA segment for multiple species coming from the same part of the genome (for example, the same gene). Given these DNA sequences, we would like to construct the evolutionary tree, i.e. predict which species are closely related and share a recent common ancestor, and which species are not closely related and diverged earlier.
The character based tree construction method is based on the principle of Occam’s razor, which states “when several hypotheses with different degrees of complexity are proposed to explain the same phenomenon, one should choose the simplest hypothesis”. In terms of tree buil...
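One common way to apply Occam's razor to trees is to prefer the tree that needs the fewest character changes (maximum parsimony). Assuming that reading, the sketch below scores a fixed rooted binary tree with Fitch's counting rule, one site at a time. The tree encoding (nested tuples of species names), the function names and the four toy sequences are invented for this example and do not come from the article.

```python
def fitch_site(node, seqs, site):
    """Return (candidate states, changes) for one site of the subtree at node.

    A leaf is a species name (str); an internal node is a pair (left, right).
    """
    if isinstance(node, str):
        return {seqs[node][site]}, 0
    left_states, left_cost = fitch_site(node[0], seqs, site)
    right_states, right_cost = fitch_site(node[1], seqs, site)
    common = left_states & right_states
    if common:
        # The children can agree on a state, so no extra change is needed.
        return common, left_cost + right_cost
    # The children disagree: at least one change occurs below this node.
    return left_states | right_states, left_cost + right_cost + 1


def parsimony_score(tree, seqs):
    """Minimum number of changes needed to explain seqs on the given tree."""
    length = len(next(iter(seqs.values())))
    return sum(fitch_site(tree, seqs, site)[1] for site in range(length))


# Toy aligned sequences (hypothetical data, equal length, no gaps).
seqs = {"human": "ACGT", "chimp": "ACGA", "mouse": "TCGA", "fish": "TGGA"}

tree_a = (("human", "chimp"), ("mouse", "fish"))
tree_b = (("human", "mouse"), ("chimp", "fish"))

score_a = parsimony_score(tree_a, seqs)
score_b = parsimony_score(tree_b, seqs)
print(score_a, score_b)
```

On this toy data the first tree needs 3 changes and the second needs 4, so the parsimony criterion would prefer the first. This only scores given trees; searching over all possible trees is a separate and much harder step that the sketch leaves out.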
Evolutionary Tree Construction: Neighbor-Joining Algorithm
Evolutionary Tree Construction
The problem of evolutionary tree construction is inferring the topology and the branch lengths of the evolutionary tree that may have produced the given gene sequence data. The number of leaf nodes in the inferred tree should be equal to the number of gene sequences in the given data.
The Neighbor-Joining algorithm
The Neighbor-Joining (NJ) tree inference method was originally proposed by Saitou and Nei in 1987. It belongs to a class of distance-based methods used to build evolutionary trees. The NJ method takes a matrix of pairwise evolutionary distances between the given sequences and uses it to build the evolutionary tree.
The pairwise distances are typically obtained from s...
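To make the input and output of NJ concrete, here is a minimal Python sketch of the standard textbook formulation. It assumes a symmetric distance matrix with at least two sequences. The function name, the `nodeN` labels for internal nodes, the edge-list output format and the five-sequence example matrix are assumptions for illustration, not details taken from the article.

```python
def neighbor_joining(names, matrix):
    """Build an unrooted tree from a symmetric distance matrix.

    Returns a list of (node, node, branch_length) edges. Internal nodes
    get hypothetical labels "node0", "node1", ...
    """
    dist = {
        frozenset((names[i], names[j])): float(matrix[i][j])
        for i in range(len(names))
        for j in range(i + 1, len(names))
    }

    def d(a, b):
        return 0.0 if a == b else dist[frozenset((a, b))]

    nodes = list(names)
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        total = {a: sum(d(a, b) for b in nodes) for a in nodes}
        # Join the pair minimizing Q(a, b) = (n - 2) d(a, b) - total[a] - total[b].
        a, b = min(
            ((x, y) for i, x in enumerate(nodes) for y in nodes[i + 1:]),
            key=lambda p: (n - 2) * d(*p) - total[p[0]] - total[p[1]],
        )
        new = f"node{next_id}"
        next_id += 1
        # Branch lengths from the joined pair to the new internal node.
        len_a = 0.5 * d(a, b) + (total[a] - total[b]) / (2 * (n - 2))
        len_b = d(a, b) - len_a
        edges += [(new, a, len_a), (new, b, len_b)]
        # Distances from the new node to every remaining node.
        for c in nodes:
            if c not in (a, b):
                dist[frozenset((new, c))] = 0.5 * (d(a, c) + d(b, c) - d(a, b))
        nodes = [c for c in nodes if c not in (a, b)] + [new]
    a, b = nodes
    edges.append((a, b, d(a, b)))
    return edges


# Hypothetical distances between five sequences a..e.
names = ["a", "b", "c", "d", "e"]
matrix = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
edges = neighbor_joining(names, matrix)
for u, v, length in edges:
    print(u, v, length)
```

The number of leaves in the result equals the number of input sequences, and each join adds one internal node, so five sequences give seven edges. When several pairs tie for the smallest Q value, this sketch simply takes the first one it encounters.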
Evidence from morphological, biochemical, and gene sequence data suggests that all organisms on Earth are genetically related, and that the genealogical relationships of living things can be represented by a vast evolutionary tree, the Tree of Life. An evolutionary tree is a graph in which the sequences under study are represented as leaf nodes, with internal nodes and branches depicting the evolutionary relationships between the sequences. In the majority of cases, the DNA sequences are gene sequences from different organisms and may represent the actual evolution of the organisms.
Consider 4 gene sequences, Human1, Chimpanzee1, Mouse1 and Fish1, from the Human, Chimpanzee, Mouse and Fish species, respectively. We will also assume that these are homologous, or equivalent, genes that convert glucose to energy in their respective species. The hypothetical evolutionary tree of the 4 genes is shown in the following figure.