CommonLounge Archive

Evolutionary Tree Construction: Neighbor-Joining Algorithm

December 27, 2017

Evolutionary Tree Construction

The problem of evolutionary tree construction is inferring the topology and the branch lengths of the evolutionary tree that may have produced the given gene sequence data. The number of leaf nodes in the inferred tree should be equal to the number of gene sequences in the given data.

The Neighbor-Joining algorithm

Neighbor-Joining (NJ) tree inference method was originally written by Saitou and Nei in 1987. It belongs to a class of distance-based methods used to build evolutionary trees. NJ method takes a matrix of pairwise evolutionary distances between the given sequences to build the evolutionary tree.

The pairwise distances are typically obtained from sequence alignment algorithms like Smith-Waterman and BLAST that aligns each gene sequence to every other gene sequence. The alignment scores can be used as an estimate of evolutionary distances between the sequences. Programs available to calculate distance include: DNADIST for DNA MSA’s and PROTDIST for Protein MSA’s. These programs are part of PHYLIP package.

The output we get is a tree along with the branch lengths.

Genetic distance map of 18 human groups constructed using the neighbor-joining method based on 23 kinds of genetic information. Made by Saitou Naruya, professor at the National Institute for Genetics, Japan in 2002.

In each stage, the two nearest nodes of the tree are chosen and defined as neighbors in our tree. Neighbors are defined as a pair of OTU’s who have one node connecting them, where OTU = Operational taxonomic units, or in other words – nodes of the tree. The number of operations is proportional to n3, where n is the number of sequences.

Step by step example

In this section, we’ll construct an evolutionary tree using Neighbor Joining Algorithm. We will use hypothetical distance matrix of n = 6 taxas.


Step 1: Calculate the net divergence r for each taxa from all other taxa.

r(i) = total distance of taxa i from all other taxa's 
     = d(i, 1) + d(i, 2) + ... d(i, n)

Example for calculating r(D) = 7 + 10 + 7 + 5 + 9 = 41

r(A) = 5 + 4 + 7 + 6 + 8 = 30
r(B) = 5 + 7 + 10 + 9 + 11 = 42
r(C) = 4 + 7 + 7 + 6 + 8 = 32
r(D) = 7 + 10 + 7 + 5 + 9 = 41 
r(E) = 6 + 9 + 6 + 5 + 8 = 34
r(F) = 8 + 11 + 8 + 9 + 8 = 44 

Step 2: Calculate the new distance matrix (M) using the following formula for each pair of taxa:

M(i,j) = d(i,j) – (r(i) + r(j))/(n - 2)

Example calculation for M(A, B)

M(A,B) = d(A,B) – (r(A) + r(B))/(n - 2)
       = 5 - (30 + 42)/(6-2)
       = -13

Similarly all distances are calculated and shown below:


Step 3: Using this new matrix find the closest pair of taxa i, j. Consider the lowest distance and assign u as the connecting node for that pair. Branch length is then calculated using formula:

S(i,u) = d(i,j)/2 + (r(i)-r(j))/2(n-2)
S(j,u) = d(i,j) - S(i,u)

Example:

As per the matrix M, the closest taxa pair is: AB = -13. The distance from U to A, and U to B is calculated as:

S(A,U) = d(A,B)/2 + (r(A)-r(B))/2(n-2)
       = 5/2 + (30-42)/2(6-2)
       = 1
S(B,U) = d(A,B) - S(A,U)
       = 5 - 1
       = 4
where, d(A,B) = 5, r(A) = 30, r(B) = 42 and n = 6.

Step 4: Calculate the new distances from U to all other taxas. The distance d(u, k) between u and taxa k is given by:

d(u,k) = [d(i,k) + d(j,k) - d(i,j)]/2

Hence, for our continued example, we have

d(U,C) = [d(A,C) + d(B,C) - d(A,B)]/2
       = [4+7-5]/2 = 3
d(U,D) = [d(A,D]+d(B,D) - d(A,B)]/2
       = [4+7-5]/2 = 6
d(U,E) = [d(A, E) + d(B,E) - d(A,B)]/2
       = [6+9-5]/2 = 5
d(U,F) = [(d(A,F) + d(B,F) - d(A,B)]/2
       = [8+11-5]/2 = 7 

Other distances remain as is. Hence, the new matrix distance matrix will be:


Step 5: Repeat steps 1 to 4 using the new matrix of distances in every round. After each step is done recursively we will end with the following resultant tree:

Advantages and disadvantages

Advantages:

  • It is computationally fast, hence can be used for large data sets.
  • It does not assume all lineages evolve at the same rate, i.e. molecular clock assumption.
  • It does not require nor assume that the distance data is ultrametric or additive.

Disadvantages:

  • It often assigns negative length to some of the branches.
  • Lack of a clear optimality criterion although some implementations attempt a ‘minimal evolution’ approximation.

Implementations available

References and resources


© 2016-2022. All rights reserved.