# Introduction to Genome Assembly

December 20, 2017

# DNA and DNA sequencing

All living organisms contain a long string of molecules called the *DNA* (Deoxyribose Nucleic Acid). The DNA strings are made up of 4 types building block molecules called *bases* viz. Adenine, Guanine, Cytosine and Thymine represented by the 4 letters A, G, C and T respectively.

*DNA sequencing* is the process of inferring the DNA string, i.e. the ordering of the 4 bases for any given organism. Complete DNA sequences called *genomes* are millions or even billions of bases long, especially for higher organisms like humans. Current sequencing technologies can read sequences that are at most 100s of bases long and cannot be used to infer the entire DNA sequence at once. Therefore, short overlapping stretches of these long DNA sequences are first read (called as *reads*) using these technologies and then the reads are assembled together to reveal the entire DNA sequence of organisms. This process of assembling the short fragments of genomes to reveal the entire genome sequence is called the **genome assembly**.

# Genome Assembly

Total number of reads obtained from a sequencing experiment mainly depends on 3 variables. The length of reads, length of the genome to be assembled and depth of sequencing. The length of reads depends on the sequencing technology. The depth of sequencing is the average number of times each base is sequenced. Therefore, a 30x sequencing depth signifies each base in the genome is sequenced on an average 30 times. Approximate number of reads can be calculated using the equation N = (D)(G)/L, where D is the sequencing depth, G is the genome length and L is the read length. For example, a sequencing experiment to assemble human genome of 3 billion bases performed at 3x depth with sequencing technology that produces reads of length 100 will produce 90 million reads.

Genome assembly can be thought of as solving a 1-dimensional jigsaw puzzle with the reads as individual pieces of the puzzle. This 1-dimensional genome puzzle can be solved by detecting *overlaps* between adjacent reads to reveal the whole genome sequence. All the popular genome assembly tools work on this same basic principle of detecting overlaps between the reads, to assemble genomes.

In the following sections we will see the two prominent graph theory based approaches used for assembling genomes from large number of reads.

# The Hamiltonian Path approach

This approach attempts to find the *Shortest Common Superstring* (SCS) from the set given set of reads. Since the reads are nothing but short and overlapping substrings of the entire genome, the shortest superstring that contains all the substrings is the original genome sequence.

To solve this problem, we first construct a *graph* with each of the reads as nodes. If two reads overlap, we add a directed edges between the corresponding nodes. The direction of the edge is such that the suffix of the origin node matches the prefix of the destination node. Weights are assigned to the edges that signify the length of the overlap between the reads.

Then a directed path through the graph is found such that visits each node only once and which also maximizes the sum of edge weights. Tracing this path through the graph will reveal the shortest common string that contains all the reads as subsequences. This path through the graph that visits each node exactly once is known as *Hamiltonian path*.

However, inferring the SCS using the Hamiltonian has some major computational limitations.

- Finding a Hamiltonian path comes under a class of problems in Computer Science called
*NP-hard*which means they are difficult to solve. More specifically, the time required to compute the solution is*exponential*in terms of the size of the graph*.* - Moreover to construct the graph, overlaps need to be detected between all pairs of reads which means all-against-all alignments of reads need to be carried out. Therefore, this approach becomes infeasible in case of large genomes where large number of reads are needed to cover the entire genomes (as discussed in the previous section, in practical use cases we often have 100 million reads, which implies 10
^{16}read pairs). - This situation is complicated further by reads from
*repetitive regions*of the genomes and*sequencing errors*in the reads. Reads from repetitive regions are like puzzle pieces that can fit at multiple regions in the genomes. Ideally, every read should come from one single location in the genome. Therefore nodes representing repetitive reads are connected to many other nodes as opposed to being connected to at most 2 nodes.

The Hamiltonian approach is used by a class of genome assemblers called the *Overlap - Layout - Consensus (OLC)*. CELERA, PHRAP, CAP3 and TIGR are some of the popular OLC assembly methods. See references at the bottom of the article for links for further reading about these specific methods.

# The Eulerian Path Approach and de Bruijn graphs

*Eulerian path* through a graph is the one which visits all the edges exactly once. As compared to the Hamiltonian problem, Eulerian paths through a graph can be found in *linear time*. For application of Eulerian path approach for genome assembly we need to obtain a peculiar type of graph using the reads, called the *de Bruijn* graph.

A de Bruijn graph is a graph with nodes as *k-mers* from the reads and directed edges between the nodes. We will take a closer look at de Bruijn graph based assembly with the help of following example.

Consider the following reads (colored sequences) from the original sequence (black).

All the unique k-mers with k = 3 from are extracted from all the reads by sliding an overlapping window of length 3, one base at a time, as shown in the figure below. A k-mer is word or subsequence of length k.

A graph with nodes as k-mers is constructed and the total number of nodes is equal to the total number of unique k-mers. The nodes are connected with directed edges if the k-1 suffix of the origin node is same as the k-1 prefix of the destination node. Finally when the resulting graph is traversed using the Eulerian path, first letter of every node will gives us the assembled sequence.

One of the biggest advantages of this approach over the Hamiltonian approach is that we have completely eliminated the all-against-all read alignments step since we don’t have to detect read overlaps. The Eulerian approach is also known to handle repeats better than the Hamiltonian approach.

A more realistic version of the de Bruijn graph used for assembling real genomes can be seen from the figure below.

EULER was the first assembly method to implement de Bruijn graphs for highly repetitive genomes. Currently, Velvet is one of the most popular de Bruijn graph based assembly methods. Again, see references below for links for further reading about these specific methods.

# References

- Celera - Myers, Eugene W., et al. “A whole-genome assembly of Drosophila.”
*Science*287.5461 (2000): 2196-2204. - TIGR - Sutton, Granger G., et al. “TIGR Assembler: A new tool for assembling large shotgun sequencing projects.”
*Genome Science and Technology*1.1 (1995): 9-19. - Phrap - Green P. PHRAP documentation, 2005
__http://www.phrap.org__ - CAP3 - Huang, Xiaoqiu, and Anup Madan. “CAP3: A DNA sequence assembly program.”
*Genome research*9.9 (1999): 868-877. - EULER - Pevzner, Pavel A., Haixu Tang, and Michael S. Waterman. “An Eulerian path approach to DNA fragment assembly.”
*Proceedings of the National Academy of Sciences*98.17 (2001): 9748-9753. - VELVET - Zerbino, Daniel R., and Ewan Birney. “Velvet: algorithms for de novo short read assembly using de Bruijn graphs.”
*Genome research*18.5 (2008): 821-829.