Conserved Sequences: Their Biological Significance, and the K-mer Finding problem
In this article, we'll introduce the concept of conserved sequences and describe their biological significance. Then we'll see how to reduce the problem of finding conserved sequences to the problem of finding the most common K-mer in a given sequence, and revise that problem to handle mismatches in order to make it more biologically plausible. Finally, we'll see a simple algorithm for solving the K-mer problem with mismatches.
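To make the K-mer problem with mismatches concrete, here is a minimal brute-force sketch in Python. It is not necessarily the algorithm the article goes on to describe. It assumes a DNA alphabet of A, C, G, T, counts a window as an occurrence of a candidate K-mer when the two differ in at most `d` positions (Hamming distance), and tries every possible K-mer, so it is only practical for small `k`. The function names and the parameter `d` are illustrative choices.

```python
from itertools import product


def hamming_distance(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def most_frequent_kmers_with_mismatches(sequence, k, d):
    """Return all k-mers that occur most often in `sequence`,
    allowing up to `d` mismatches per occurrence.

    Brute force: every one of the 4**k possible k-mers is compared
    against every length-k window of the sequence.
    """
    windows = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    counts = {}
    for candidate in map("".join, product("ACGT", repeat=k)):
        counts[candidate] = sum(
            hamming_distance(candidate, window) <= d for window in windows
        )
    best = max(counts.values())
    return sorted(kmer for kmer, count in counts.items() if count == best)
```

With `d = 0` this reduces to the plain most-common K-mer problem. Note that a winning K-mer does not have to appear exactly anywhere in the sequence; it only has to be within `d` mismatches of many windows.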
In evolutionary biology and genetics, conserved sequences are identical or similar sequences of DNA, RNA or amino acids (proteins) that occur across different species, or within the same species, over generations. These sequences show minimal changes in their composition, or sometimes no changes at all, over generations.
The following example shows what conservation of sequences across species actually looks like:
Sequence alignment using Longest Common Subsequence algorithm
In molecular biology, DNA and proteins can be represented as sequences of letters. DNA sequences consist of A, T, G and C, representing the nucleobases adenine, thymine, guanine and cytosine. Protein sequences use 20 different letters, one for each of the 20 amino acids.
Comparing two sequences, either from the same organism or from different organisms, is an important task in molecular biology known as sequence comparison. It helps answer many biological questions, for example:
predicting structure and function of proteins
inferring evolutionary history and relatedness of species
locating common subsequences in genes / proteins to identify common motifs
serving as a sub-problem in genome assembly for DNA sequencing
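As a rough illustration of the Longest Common Subsequence idea named in the title, the following Python sketch uses the standard dynamic-programming table and then traces back through it to recover one longest common subsequence. The function name is an illustrative choice, and when several subsequences share the maximum length, this sketch simply returns one of them.

```python
def longest_common_subsequence(a, b):
    """Return one longest common subsequence of strings `a` and `b`."""
    m, n = len(a), len(b)
    # table[i][j] = length of the LCS of a[:i] and b[:j]
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Trace back from the bottom-right corner to rebuild the subsequence.
    result = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            result.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(result))
```

For two sequences of lengths m and n, this takes time and memory proportional to m × n, which is fine for short genes or proteins but becomes costly for whole genomes.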
In this article, we'll talk about a method for constructing evolutionary trees known as character-based evolutionary tree construction. It was initially designed to infer evolutionary relationships from morphological and physiological characters.
In character-based tree construction, we are given a DNA segment for each of several species, all coming from the same part of the genome (for example, the same gene). Given these DNA sequences, we would like to construct the evolutionary tree, i.e. predict which species are closely related and share a recent common ancestor, and which species are not closely related and diverged earlier.
The character-based tree construction method rests on the principle of Occam’s razor, which states “when several hypotheses with different degrees of complexity are proposed to explain the same phenomenon, one should choose the simplest hypothesis”. In terms of tree buil...
Genes encode proteins, and the process by which a protein is synthesized from a gene is known as gene expression. In higher organisms like humans, thousands of genes are expressed together, at different levels depending on factors such as the type of cell (for example, a nerve cell or a heart cell), the environment and disease conditions. For example, different types of cancer produce different gene expression patterns in humans. Gene expression patterns under different conditions can be studied using Microarray technology.
Microarrays and Gene Expression profiling
Data from a Microarray can be imagined as a rectangular matrix or grid, with each cell in the matrix corresponding to a gene expression value under a particular condition. As shown in the figur...