- DNA and DNA sequencing
- Genome Assembly
- The Hamiltonian Path approach
- The Eulerian Path Approach and de Bruijn graphs
All living organisms contain a long string of molecules called the DNA (Deoxyribose Nucleic Acid). The DNA strings are made up of 4 types building block molecules called bases viz. Adenine, Guanine, Cytosine and Thymine represented by the 4 letters A, G, C and T respectively.
DNA sequencing is the process of inferring the DNA string, i.e. the ordering of the 4 bases for any given organism. Complete DNA sequences called genomes are millions or even billions of bases long, especially for higher organisms like humans. Current sequencing technologies can read sequences that are at most 100s of bases long and cannot be used to infer the entire DNA sequence at once. Therefore, short overlapping stretches of these long DNA sequences are first read (called as reads) using these technologies and then the reads are assembled together to reveal the entire DNA sequence of organisms. This process of assembling the short fragments of genomes to reveal the entire genome sequence is called the genome assembly.
Total number of reads obtained from a sequencing experiment mainly depends on 3 variables. The length of reads, length of the genome to be assembled and depth of sequencing. The length of reads depends on the sequencing technology. The depth of sequencing is the average number of times each base is sequenced. Therefore, a 30x sequencing depth signifies each base in the genome is sequenced on an average 30 times. Approximate number of reads can be calculated using the equation N = (D)(G)/L, where D is the sequencing depth, G is the genome length and L is the read length. For example, a sequencing experiment to assemble human genome of 3 billion bases performed at 3x depth with sequencing technology that produces reads of length 100 will produce 90 million reads.
Genome assembly can be thought of as solving a 1-dimensional jigsaw puzzle with the reads as individual pieces of the puzzle. This 1-dimensional genome puzzle can be solved by detecting overlaps between adjacent reads to reveal the whole genome sequence. All the popular genome assembly tools work on this same basic principle of detecting overlaps between the reads, to assemble genomes.
In the following sections we will see the two prominent graph theory based approaches used for assembling genomes from large number of reads.
This approach attempts to find the Shortest Common Superstring (SCS) from the set given set of reads. Since the reads are nothing but short and overlapping substrings of the entire genome, the shortest superstring that contains all the substrings is the original genome sequence.
To solve this problem, we first construct a graph with each of the reads as nodes. If two reads overlap, we add a directed edges between the corresponding nodes. The direction of the edge is such that the suffix of the origin node matches the prefix of the destination node. Weights are assigned to the edges that signify the length of the overlap between the reads.
Then a directed path through the graph is found such that visits each node only once and which also maximizes the sum of edge weights. Tracing this path through the graph will reveal the shortest common string that contains all the reads as subsequences. This path through the graph that visits each node exactly once is known as Hamiltonian path.
However, inferring the SCS using the Hamiltonian has some major computational limitations.
The Hamiltonian approach is used by a class of genome assemblers called the Overlap - Layout - Consensus (OLC). CELERA, PHRAP, CAP3 and TIGR are some of the popular OLC assembly methods. See references at the bottom of the article for links for further reading about these specific methods.
Eulerian path through a graph is the one which visits all the edges exactly once. As compared to the Hamiltonian problem, Eulerian paths through a graph can be found in linear time. For application of Eulerian path approach for genome assembly we need to obtain a peculiar type of graph using the reads, called the de Bruijn graph.
A de Bruijn graph is a graph with nodes as k-mers from the reads and directed edges between the nodes. We will take a closer look at de Bruijn graph based assembly with the help of following example.
Consider the following reads (colored sequences) from the original sequence (black).
All the unique k-mers with k = 3 from are extracted from all the reads by sliding an overlapping window of length 3, one base at a time, as shown in the figure below. A k-mer is word or subsequence of length k.
A graph with nodes as k-mers is constructed and the total number of nodes is equal to the total number of unique k-mers. The nodes are connected with directed edges if the k-1 suffix of the origin node is same as the k-1 prefix of the destination node. Finally when the resulting graph is traversed using the Eulerian path, first letter of every node will gives us the assembled sequence.
One of the biggest advantages of this approach over the Hamiltonian approach is that we have completely eliminated the all-against-all read alignments step since we don't have to detect read overlaps. The Eulerian approach is also known to handle repeats better than the Hamiltonian approach.
A more realistic version of the de Bruijn graph used for assembling real genomes can be seen from the figure below.
EULER was the first assembly method to implement de Bruijn graphs for highly repetitive genomes. Currently, Velvet is one of the most popular de Bruijn graph based assembly methods. Again, see references below for links for further reading about these specific methods.