Reads, in a sequencing experiment, are small overlapping sections of genome that are “read” and revealed by the sequencing machine. These are short “ATGC” strings that have unique locations on the genome. Read mapping involves finding the locations of these reads on the genome. In case of human genome, this involves finding the location of millions of short ATGC substrings (reads) on a 3-billion character long string (human genome). Therefore, this is essentially a multiple pattern matching problem, where we compare the reads to all the sub-strings in the genome to the find the location of the reads.
Read mapping helps in finding the genomic differences between different individuals of the same species. For example, 99% of human genome is identical for all humans and the remaining 1% is responsible for all the diversity among humans on planet Earth. Read mapping allows us to explore this diversity by comparing the genome sequences of different individuals. One of the most important uses for read mapping is locating the differences or mutations in the genomes that could be attributed to certain diseases. Considering the genome sequence of a healthy human as the reference, the reads from a diseased individual can be mapped to the reference and differences in characters between the reference DNA string and the mapped read sub-strings can be attributed to the disease and are called as disease causing mutations.
Computationally, the problem of read mapping is a multiple pattern matching problem where several short patterns (the reads) need to be matched to a very long pattern (the genome) to find the exact locations of reads in the genome. A brute-force approach would simply attempt to match every read to the large genome by sliding down the read one position at time to find the position in the reference genome that matches the read. But looking at the scale of problem, this approach is bound to run into memory and time-complexity issues. We will look at some sophisticated algorithms and data structures used for reading mapping.
The suffix tree data structure, as the name suggest, stores all the suffixes that occur in the reference genome in tree format. We will take a closer look at the construction of suffix tree using a following hypothetical string “panamabananas” as a simple example of reference genome. We will add a special character “$” at the end of the string to signify the end of the string “panamabananas$”
All the suffixes from the given string can be easily found by starting from any position in the string and running till the end. For example, the largest suffix starting from the first position is “panamabananas$” which is the string itself. The next suffix for the string is “anamabananas$” starting from the second position and running till the end. Similarly, the remaining suffixes “namabananas$”, “amabananas$” … and so on, can be easily found by moving the starting positions until the last character is reached (excluding the $)
All the suffixes can be represented in Trie data structure as show below
The trie data structure consists of nodes and directed edges and where each node represents a character from the string. Each path in the trie, from the root node to the leaf node denoted by the $ character, represents a suffix in the original string. The number at the end of each path gives the position at which the suffix occurs in the original string. The suffixes that have unique prefixes have paths directly starting from the root whereas suffixes which share prefixes share paths from the root and branch-out from each other later.
This trie structure is converted to a suffix tree data structure to reduce the number of edges, which in-turn leads to reduction in the memory requirements. To reduce the number of edges, all the non-branching paths are combined into one single node as follows:
The suffix tree can be easily used to map small reads (substrings) to the reference string without using the brute-force approach. (these reads only need to be substrings of the reference string, they need not be suffixes).
Let us see this with an example. Suppose we have a short read “ana” which we want to map to the reference string. We start at the root of the suffix tree, and we go down the path matching the strings we have. The path that "ana" takes us through the tree is marked in read. If we look at all the positions under the node representing "ana", we see the nodes are labelled with positions 1, 7 and 9. Therefore, the short read “ana” occurs at positions 1, 7 and 9 in the reference string.
In this way, large number of reads can be mapped to reference genome by first converting the reference genome (large string of characters ATGC) into a suffix tree data structure and then finding the positions of short reads, obtained from sequencing experiments, in the reference genome using the suffix tree. This way, finding each short read takes a small amount of computation (proportional to the length of the short read), as compared to the brute-force approach (which required time proportional to the length of the entire genome).
The primary objective of mapping reads is to find differences between the reference genome and the reads obtained from the sequencing experiment. Therefore, rather than looking for exact matches between the reads and reference we need to look for approximate matches. One of popular methods for inexact matching is seeding. Consider the following read (pattern) matching to the genome with one mismatch:
We can divide the pattern and the matching portion of genome into 2 parts such that one of the parts (or sub-string) match exactly.
Similarly, if we would like to perform read matching with reference genome with allowance for ‘d’ mismatches then we can divide the read into ‘d+1’ equal parts. At least one of these parts must match exactly for the total number of mismatches to be less than or equal to d. (If none of the parts match exactly, then there are at-least d+1 mismatches, one for each part).
These parts are called seeds. Each seed exact match gives us a possible candidate for read match with less than or equal to d mismatches. Hence, we first find all exact matches for all seeds, which gives us a list of candidate positions. Then, we check whether each candidate position actually matches our short read with less than or equal to d mismatches. If it doesn't, we simply discard the candidate and move on. The process is guaranteed to find all existing matches with less than or equal to d mismatches.