Detecting Mutations with Read Mapping and Suffix Trees
What are Reads? What is Read Mapping?
Reads, in a sequencing experiment, are short, overlapping sections of the genome that are “read” and reported by the sequencing machine. These are short “ATGC” strings, each of which comes from a particular location on the genome. Read mapping is the task of finding the locations of these reads on the genome. In the case of the human genome, this means finding the locations of millions of short ATGC substrings (reads) on a string about 3 billion characters long (the human genome). It is therefore essentially a multiple pattern matching problem: we compare each read against the substrings of the genome to find where the read occurs.
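To make the pattern matching view concrete, here is a minimal sketch that maps a handful of reads onto a toy genome by direct scanning. It is deliberately the naive approach, not a suffix tree, and it only handles exact matches. The function name `map_reads` and the toy sequences are made up for illustration.

```python
def map_reads(genome, reads):
    """Return {read: [0-based start positions]} for exact matches.

    Naive scan: each read is searched for independently, so the cost
    grows with (number of reads) x (genome length). Index structures
    such as suffix trees exist to avoid exactly this cost.
    """
    positions = {}
    for read in reads:
        hits = []
        start = genome.find(read)
        while start != -1:
            hits.append(start)
            # Continue one character further so overlapping hits are found.
            start = genome.find(read, start + 1)
        positions[read] = hits
    return positions


# Hypothetical toy data; a real genome is billions of characters long.
genome = "ATGCGTACGTTAGC"
reads = ["CGTA", "GTTA", "ACGT", "TTTT"]
mapping = map_reads(genome, reads)

for read, hits in mapping.items():
    print(read, hits)
```

A read that does not occur in the genome, such as `TTTT` here, maps to an empty list. With millions of reads, rescanning the genome for every read quickly becomes impractical, which is the motivation for building an index over the genome once and querying it per read.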
Protein structure prediction using homology modeling
What are proteins?
Proteins are large biomolecules that are responsible for performing most of the functions within an organism's cells, including responding to stimuli, acting as catalysts for other reactions, transporting molecules from one place to another and performing cell signaling. Just like DNA sequences, protein sequences are strings of molecules, but unlike DNA sequences, protein sequences are made up of 20 different molecules called amino acids.
Every 1D protein sequence string folds into a 3D structure. This 3D structure determines how a protein responds to various environments and which other molecules it interacts with, and hence is critical to the protein's ability to perform its functions. The 3D structure of a protein is described by providing the coo...
Genes encode proteins and can be used to synthesize them; this process is known as gene expression. In higher organisms like humans, thousands of genes are expressed together, in different amounts, depending on factors such as the type of cell (nerve cell or heart cell), the environment and disease conditions. For example, different types of cancer produce different gene expression patterns in humans. These gene expression patterns under different conditions can be studied using Microarray technology.
Microarrays and Gene Expression profiling
Data from a Microarray can be imagined as a rectangular matrix or grid, with each cell in the matrix corresponding to a gene expression value under a particular condition. As shown in the figur...
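The matrix view can be sketched in a few lines of code. The layout below (rows are genes, columns are conditions) is an assumption made for illustration, and the gene names, condition names and values are invented rather than taken from any experiment.

```python
# Hypothetical labels and values, for illustration only.
genes = ["GENE_A", "GENE_B", "GENE_C"]
conditions = ["healthy", "tumor"]

# expression[i][j] is the expression value of genes[i] under conditions[j].
expression = [
    [2.0, 8.0],
    [5.0, 5.5],
    [7.0, 1.0],
]


def expression_value(gene, condition):
    """Look up one cell of the matrix by gene and condition name."""
    return expression[genes.index(gene)][conditions.index(condition)]


def condition_profile(condition):
    """Return the whole column for one condition as {gene: value}."""
    j = conditions.index(condition)
    return {gene: row[j] for gene, row in zip(genes, expression)}


print(expression_value("GENE_A", "tumor"))
print(condition_profile("healthy"))
```

Comparing two columns of such a matrix is what lets expression patterns under different conditions be contrasted gene by gene.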
Evolutionary Tree Construction: Neighbor-Joining Algorithm
Evolutionary Tree Construction
The problem of evolutionary tree construction is to infer the topology and the branch lengths of the evolutionary tree that may have produced the given gene sequence data. The number of leaf nodes in the inferred tree should be equal to the number of gene sequences in the given data.
The Neighbor-Joining algorithm
The Neighbor-Joining (NJ) tree inference method was introduced by Saitou and Nei in 1987. It belongs to the class of distance-based methods used to build evolutionary trees. The NJ method takes as input a matrix of pairwise evolutionary distances between the given sequences and builds the evolutionary tree from it.
The pairwise distances are typically obtained from s...
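As a rough illustration of how NJ uses the distance matrix, the sketch below performs a single joining step in the commonly published formulation: it selects the pair of nodes that minimizes the Q criterion, computes the two branch lengths to the new internal node, and builds the reduced distance matrix. The function names and the 5x5 distance matrix are illustrative assumptions, not values from real sequences, and a full implementation would repeat this step until only two nodes remain.

```python
def nj_pick_pair(dist):
    """Pick the pair (i, j) minimizing Q and return their branch lengths.

    Q(i, j) = (n - 2) * d(i, j) - sum_k d(i, k) - sum_k d(j, k)
    Requires n > 2 nodes and a symmetric matrix with a zero diagonal.
    """
    n = len(dist)
    totals = [sum(row) for row in dist]
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * dist[i][j] - totals[i] - totals[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    _, i, j = best
    # Branch lengths from i and j to the new internal node u.
    limb_i = dist[i][j] / 2 + (totals[i] - totals[j]) / (2 * (n - 2))
    limb_j = dist[i][j] - limb_i
    return i, j, limb_i, limb_j


def nj_join(dist, i, j):
    """Return the reduced matrix with i and j replaced by a new node u.

    The new node u becomes the last row and column, with
    d(u, k) = (d(i, k) + d(j, k) - d(i, j)) / 2.
    """
    keep = [k for k in range(len(dist)) if k not in (i, j)]
    reduced = [[dist[a][b] for b in keep] for a in keep]
    to_u = [(dist[i][k] + dist[j][k] - dist[i][j]) / 2 for k in keep]
    for row, d in zip(reduced, to_u):
        row.append(d)
    reduced.append(to_u + [0.0])
    return reduced


# Hypothetical pairwise distances between five sequences (indices 0..4).
dist = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]

i, j, limb_i, limb_j = nj_pick_pair(dist)
reduced = nj_join(dist, i, j)
print(i, j, limb_i, limb_j)
print(reduced)
```

Note that the pair chosen is not simply the one with the smallest distance: sequences 3 and 4 are closest in this matrix, yet the Q criterion joins sequences 0 and 1 first because it also accounts for how far each candidate is from all the other nodes.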
Evidence from morphological, biochemical, and gene sequence data suggests that all organisms on Earth are genetically related, and that the genealogical relationships of living things can be represented by a vast evolutionary tree, the Tree of Life. An evolutionary tree is a graph in which the sequences under study are represented as leaf nodes, with internal nodes and branches depicting the evolutionary relationships between the sequences. In the majority of cases, the DNA sequences are gene sequences from different organisms, and the tree may represent the actual evolution of the organisms.
Consider 4 gene sequences, Human1, Chimpanzee1, Mouse1 and Fish1, from the Human, Chimpanzee, Mouse and Fish species, respectively. We will also assume that these are homologous, or equivalent, genes that convert glucose to energy in their respective species. The hypothetical evolutionary tree of the 4 genes is shown in the following figure.
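One simple way to write such a tree down in code is as nested tuples, where strings are leaf nodes and each tuple is an internal node. The topology used here (Human1 grouped with Chimpanzee1, then Mouse1, then Fish1) is an assumption based on the commonly accepted relationships among these species, and branch lengths are omitted. The helper names are made up for this sketch.

```python
# Assumed topology: strings are leaves, tuples are internal nodes.
tree = ((("Human1", "Chimpanzee1"), "Mouse1"), "Fish1")


def leaves(node):
    """Collect the leaf labels of a nested-tuple tree, left to right."""
    if isinstance(node, str):
        return [node]
    result = []
    for child in node:
        result.extend(leaves(child))
    return result


def to_newick(node):
    """Render the topology in Newick notation, without branch lengths."""
    if isinstance(node, str):
        return node
    return "(" + ",".join(to_newick(child) for child in node) + ")"


print(leaves(tree))
print(to_newick(tree) + ";")
```

The number of leaves equals the number of input sequences, four in this case, while the internal nodes stand for inferred common ancestors.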
All living organisms contain a long string of molecules called DNA (Deoxyribonucleic Acid). DNA strings are made up of 4 types of building-block molecules called bases, namely Adenine, Guanine, Cytosine and Thymine, represented by the 4 letters A, G, C and T respectively.
DNA sequencing is the process of inferring the DNA string, i.e. the ordering of the 4 bases, for any given organism. Complete DNA sequences, called genomes, are millions or even billions of bases long, especially for higher organisms like humans. Current sequencing technologies can read sequences that are at most hundreds of bases long and cannot be used to infer the entire DNA sequence at once. Therefore, short overlapping stretches of these long DNA sequences (called reads) are first read using these technologies, and the reads are then assembled to reveal the entire DNA sequence of the organism. This process of assembling the short fragments of genomes to reveal the ent...