Character Based Evolutionary Tree Construction

February 16, 2018

In this article, we’ll talk about a method for constructing evolutionary trees, known as character based evolutionary tree construction. It was initially designed to infer evolutionary relationships based on morphological and physiological characters.

In character based tree construction, we are given a DNA segment for multiple species coming from the same part of the genome (for example, the same gene). Given these DNA sequences, we could like to construct the evolutionary tree, i.e. predict which species are more closely related and have a recent common ancestor, vs species that are not closely related and diverged earlier.

Parsimony score

Character based tree construction method is based on Occam’s razor principle which states “when several hypotheses with different degrees of complexity are proposed to explain the same phenomenon, one should choose the simplest hypothesis”. In terms of tree building, all other things being equal, the best hypothesis is the one that requires the fewest evolutionary changes.

Parsimony score is defined as total cost of all mutations in a tree that are required to explain the observed differences. The parsimony score depends on the structure of the tree, the locations assigned to each of the species, as well as the inferred DNA sequences for each of the common ancestors.

Example

Let’s say we have we have 4 species and the nucleotides (DNA sequence) from each of the specifies. They are as follows:

As per the table, species A and B have same nucleotide at 1^st position and C and D have same nucleotides at 1^st position.

Based on the above information, we might predict a evolutionary tree that looks as follows (figure 1). Note that the prediction also includes the positions of species A, B, C and D as well as the sequences for the common ancestors.

Figure 1

The dashes on the edges label the number of differences between a node and its ancestor. For example, species A (GGG) differs from its parent (GTG) at one location (G vs T, in location 2).

Let us consider another possible evolutionary tree. As per this tree, species A and C are more closely related. Again, dashes on the edges label the number of differences.

Figure 2

We have drawn two trees considering the nucleotide positions of species. But how we know which tree is better or more parsimonious tree?

We can see that the parsimony score of the first tree is 4 and for the second tree it is 5. Since we are trying to minimize the number of mutations, the first tree is better than the second tree.

Weighted parsimony score

The above definition can be extended to include cases different nucleotide mutations are given different scores. This is described below with some examples:

Unweighted Parsimony Score = 5

Weighted Parsimony Score = 20

If we have multiple characters per sequence, we simply compute the above score for each position, and add them up.

Parsimony Problems

Our broad strategy for finding the evolutionary tree with the minimum parsimony score will be as follows:

Small Parsimony problem: We are given a tree, but need to determine labeling of the tree that provides the lowest parsimony score. That is, we need to find the nucleotide sequences of the ancestors.

Large Parsimony problem: Given the solution to the above problem, we’ll see how to find the best tree structure.

Small Parsimony Problem: Sankoff’s Algorithm

Given a tree, we need to determine a labeling for the tree that provides the lowest parsimony score. That is, we need to find the nucleotide sequences of the ancestors.

Assumption: We assume that the characters in the string are independent. Hence, the Small Parsimony problem can be solved independently for each character. Therefore, to devise an algorithm, we can assume that every leaf is labeled by a single character rather than by a string of m characters.

Algorithm

We will use Sankoff’s algorithm to solve the above problem.

Let S_j(i) be the smallest (weighted) number of steps for the subtree at or below node j, given that node j has nucleotide i. Also, let c_ik be the cost of changing nucleotide i to nucleotide k.

Step 1: We will calculate S_j(i) from the leafs to the root. Initially, at the leafs (say leaf j), we have

$$ S_j(i) = \begin{cases} 0 &\text{if node j has nucleotide i} \\ \infty &\text{if node j has any other nucleotide} \end{cases} $$

Step 2: Then proceeding up the tree for node whose immediate descendants are l and r, we have

$$ S_j(i) = min_k [c_{ik} + S_l(k) ] + min_k [c_{ik} + S_r( k) ] $$

That is, when computing the parsimony score of subtree at node j given it has nucleotide i, we consider all possible nucleotide values for the children of node j.

Step 3: The minimum number of (weighted) steps for the tree is found by computing at the root node (say 0) the S₀(i) and taking the smallest of these.

Example

Let’s take an example and proceed stepwise to understand better.

Step 1: Begin at leaves:

If leaf has the character in question, score is 0
Else score is ∞

Step 2: Values for internal notes (calculated bottom to top). Lets calculate the score for nucleotide A of the first parent node.

Scoring matrix c_ij (given)

The scoring matrix c_ij is given above. S_l(k) and S_r(k) are already calculated (tree above).

We calculate S_l(i), C_iA, and Sum = S_l(i) + C_iA.

We calculate S_r(i), C_iA, and Sum = S_r(i) + C_iA.

Min(left sum) + Min(right sum) = 0 + 1 = 1. Therefore, sum of the scores we get for nucleotide A = 1.

Other scores are calculated using same approach for all the nodes including root node. After calculating all the scores, tree will look like this:

Step 3: Smallest score at root node is minimum parsimony score. Here the smallest score is 6. Hence, root node assigned as “A”

Step 4: After the root scores are calculated, we can travel down the tree and assign each vertex with optimal character.

Score 6 came from 1 + 5 of A nucleotide from both the descendent nodes (refer figure above). Hence, A will be assigned to both the descendent nodes (refer figure below). And the tree is thus labeled as:

Figure 10

Large Parsimony Problem

Given the solution to the small parsimony problem, we’ll see how to find the best tree structure.

In general, finding the optimal tree is a NP complete problem, i.e. there is no fast algorithm for this problem. Hence, we use an algorithm which won’t always find the optimal tree, but works quite well in practice.

Algorithm

Set current tree equal to arbitrary binary tree structure
Go through all edges and perform nearest neighbor interchanges (we’ll see an example of nearest neighbor interchange below).
Solve small parsimony problem on each tree
If any tree has lower parsimony score, then update the current tree and go to step 2, otherwise stop.

This method is called the Greedy Heuristic method.

Figure 11

Example

Let’s take an unrooted binary tree shown in Figure 11.
Nearest neighbor interchange: We will delete the two nodes “c” and “d” connected by an internal edge. Rearranging the subtrees (a, b, e and f) gives us the following possible tree structure are (shown in figure below).

Figure 12

For every possible tree using small parsimony algorithm lowest score is calculated.
Lowest score tree is returned.

Conclusion

In this article, we saw a character based evolutionary tree construction method based on parsimony score. First, we solved the small parsimony problem, which finds the parsimony score given a tree structure and leaves. Then, we looked at the large parsimony problem which finds a good tree structure starting from a tree structure chosen at random by performing nearest neighbor interchanges.