Distance-based approach to phylogenetic analysis

The distance between every pair of sequences is calculated, and the resulting distance matrix is used for tree reconstruction.
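As a minimal sketch of this first step, the snippet below computes the simplest distance, the proportion of differing sites (p-distance), for every pair of aligned sequences. The function names, the choice to skip gap columns, and the toy sequences are assumptions made for illustration; real analyses usually apply a substitution model on top of this raw proportion.

```python
from itertools import combinations

def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences (gap columns skipped)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    diffs = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue  # assumption: ignore any column with a gap in either sequence
        compared += 1
        if a != b:
            diffs += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return diffs / compared

def distance_matrix(seqs):
    """Return a symmetric nested dict: matrix[name_a][name_b] -> p-distance."""
    names = list(seqs)
    matrix = {name: {name: 0.0} for name in names}
    for a, b in combinations(names, 2):
        d = p_distance(seqs[a], seqs[b])
        matrix[a][b] = d
        matrix[b][a] = d
    return matrix

# Hypothetical toy alignment, not real data.
toy = {"taxonA": "ACGTACGT", "taxonB": "ACGTACGA", "taxonC": "ACGAACTA"}
dm = distance_matrix(toy)
```

Here `dm["taxonA"]["taxonB"]` is 1 difference out of 8 sites, and the matrix is symmetric with zeros on the diagonal.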

The clustering procedure for distance-based analysis is given below.

Calculate the distances between all pairs of taxa and store them in a matrix.
Join the two closest clusters.
Recalculate the distances between all the clusters.
Repeat the previous two steps until all species are connected in a single cluster.

The distance-based approach is also known as the algorithmic or clustering method. Using a distance matrix and starting with the most closely related sequence pairs, clustering algorithms build a phylogenetic tree. Models of nucleotide substitution are used to calculate the distances (for example, for NJ trees).

Distance-based analysis methods

UPGMA (Unweighted Pair Group Method with Arithmetic mean)
NJ (Neighbor Joining)
Least squares method
Minimum Evolution (ME)

UPGMA (Unweighted Pair Group Method with Arithmetic mean)

UPGMA is the simplest method for constructing trees. It is a clustering algorithm that computes the distance between two clusters as the average pairwise distance between their members.

This algorithm assumes an equal rate of evolution on all lineages, so the branch lengths from any node to all of its descendant tips are equal.
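The clustering procedure with average linkage can be sketched as follows. This is an illustrative implementation under simple assumptions (distances supplied as a dict keyed by `frozenset` pairs, ties broken by label order, output as a Newick string), not the code used by any particular package.

```python
from itertools import combinations

def upgma(names, dist):
    """Build an ultrametric tree (Newick string) by average linkage.

    names: list of leaf labels.
    dist:  dict mapping frozenset({a, b}) to the distance between leaves a and b.
    """
    d = dict(dist)
    # cluster label -> (newick text, number of leaves, node height)
    clusters = {n: (n, 1, 0.0) for n in names}
    while len(clusters) > 1:
        # Join the two closest clusters.
        a, b = min(combinations(sorted(clusters), 2), key=lambda p: d[frozenset(p)])
        (text_a, size_a, h_a), (text_b, size_b, h_b) = clusters.pop(a), clusters.pop(b)
        height = d[frozenset((a, b))] / 2  # the new node sits halfway between the clusters
        merged = f"({a}+{b})"
        text = f"({text_a}:{height - h_a:.3f},{text_b}:{height - h_b:.3f})"
        # Recalculate distances: the size-weighted mean equals the average
        # over all leaf pairs between the merged cluster and cluster c.
        for c in clusters:
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[merged] = (text, size_a + size_b, height)
    (text, _, _), = clusters.values()
    return text + ";"

# Hypothetical toy distances, not real data.
toy = {
    frozenset(("A", "B")): 2.0,
    frozenset(("A", "C")): 4.0,
    frozenset(("B", "C")): 4.0,
}
tree = upgma(["A", "B", "C"], toy)
```

In the resulting tree every tip is at the same distance (2.0) from the root, which is the ultrametric property that follows from the equal-rate assumption.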

Assumptions of UPGMA

The same rate of evolution on all lineages (the mutation rate is constant over time).
A constant molecular clock.
Clustered sequences are equidistant from their common ancestor (ultrametric tree).

Advantages of UPGMA

Fast method
It does not require an explicit model of evolution.
Easy to interpret – it provides a clear representation of the phylogenetic relationships between taxa.

Disadvantages of UPGMA

The constant molecular clock assumption fails when the rate of evolution is not constant across lineages.
Produces only ultrametric trees.
Gives only one possible tree.

Nucleotide substitution

Many kinds of nucleotide substitution occur along evolutionary lineages: single substitutions, multiple substitutions, coincidental substitutions, parallel substitutions, convergent substitutions, and back substitutions.

Nucleotide substitutions are grouped hierarchically

Simple substitutions
General base substitutions
Transitions and transversions
Purine-to-purine and pyrimidine-to-pyrimidine transitions
AC/GT and AT/CG transversions
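The transition/transversion distinction can be made concrete with a short sketch. The function names and the toy sequences are illustrative assumptions; the rule itself (purine to purine or pyrimidine to pyrimidine is a transition, anything else is a transversion) is the one listed above.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES

def classify_substitution(a, b):
    """Label a pair of aligned bases as identical, transition, or transversion."""
    a, b = a.upper(), b.upper()
    if a not in BASES or b not in BASES:
        raise ValueError("expected one of A, C, G, T")
    if a == b:
        return "identical"
    pair = {a, b}
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"      # A<->G or C<->T
    return "transversion"        # A<->C, A<->T, C<->G, G<->T

def count_substitutions(seq_a, seq_b):
    """Count transitions and transversions between two aligned sequences, skipping gaps."""
    counts = {"transition": 0, "transversion": 0}
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue
        kind = classify_substitution(a, b)
        if kind != "identical":
            counts[kind] += 1
    return counts

# Hypothetical toy sequences, not real data.
counts = count_substitutions("ACGTACGT", "GCGAACTT")
```

For these toy sequences the A/G difference is counted as a transition, while the T/A and G/T differences are counted as transversions.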

Models of Nucleotide Substitutions

Nucleotide substitution models are used to minimize the discrepancy between the observed sequence differences and the expected sequence differences.

Some models of nucleotide substitution

Markov chain model – predicts the probability of a sequence of events based only on the most recent event, not on its history or context. This method is useful for simulating sequence evolution, estimating the rate of evolution, and correcting for multiple substitutions.

Jukes-Cantor model – every nucleotide has an equal probability of changing to any other nucleotide.
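Under the Jukes-Cantor model, the observed proportion of differing sites p can be corrected for multiple substitutions with the standard formula d = -(3/4) ln(1 - (4/3) p). The sketch below is a direct transcription of that formula; the function name and the input value are illustrative.

```python
import math

def jc69_distance(p):
    """Jukes-Cantor corrected distance from the observed proportion of differing sites p."""
    if not 0 <= p < 0.75:
        # The logarithm is undefined once p reaches the saturation value 3/4.
        raise ValueError("p must be in [0, 0.75) for the correction to be defined")
    return -0.75 * math.log(1 - 4 * p / 3)

corrected = jc69_distance(0.1)  # slightly larger than the raw 0.1
```

The corrected distance is always at least as large as p, and the gap widens as sequences become more divergent.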

Kimura two-parameter distance – equal base frequencies and different probabilities for transitions and transversions; transitions are typically more frequent than transversions.
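The Kimura two-parameter distance is usually computed from the observed proportions of transitional differences P and transversional differences Q as d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q). A minimal sketch, with illustrative parameter names and input values:

```python
import math

def k2p_distance(p_transitions, q_transversions):
    """Kimura two-parameter distance from observed transition and transversion proportions."""
    a = 1 - 2 * p_transitions - q_transversions
    b = 1 - 2 * q_transversions
    if a <= 0 or b <= 0:
        raise ValueError("sequences are too divergent for the K2P correction")
    return -0.5 * math.log(a) - 0.25 * math.log(b)

# Hypothetical proportions: 8% of sites show transitions, 2% show transversions.
d = k2p_distance(0.08, 0.02)
```

As with Jukes-Cantor, the corrected distance exceeds the raw proportion of differing sites (P + Q).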

Hasegawa-Kishino-Yano model – variable base frequencies, one transition rate and one transversion rate (the model assumes that all transitions share one rate and all transversions share another), so transitions and transversions have different probabilities.

General Time Reversible (GTR) –

This model describes how DNA sequences change over time.
The relative frequency of each nucleotide does not change (after any amount of time, the frequency of each nucleotide is the same as before).
This model assumes that substitutions can happen in either direction; there is no assumption about a preferred direction of substitution.
In addition to the nucleotide frequencies, the model requires six substitution rate parameters, because six types of nucleotide substitution are possible: A ↔ C, A ↔ G, A ↔ T, C ↔ G, C ↔ T, G ↔ T
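One common way to write the GTR model is as a rate matrix whose off-diagonal entry from base i to base j is the exchange rate for that pair multiplied by the frequency of j. The sketch below builds such a matrix in plain Python; the rate and frequency values are made-up placeholders, and no normalization of the overall rate is applied.

```python
BASES = "ACGT"

def gtr_rate_matrix(rates, freqs):
    """Build a GTR rate matrix as a nested dict q[i][j].

    rates: dict mapping frozenset({i, j}) to the exchange rate for that pair (six entries).
    freqs: dict mapping each base to its equilibrium frequency (should sum to 1).
    """
    q = {i: {} for i in BASES}
    for i in BASES:
        for j in BASES:
            if i != j:
                q[i][j] = rates[frozenset((i, j))] * freqs[j]
        # Each row sums to zero: the diagonal is minus the total rate of leaving i.
        q[i][i] = -sum(q[i][j] for j in BASES if j != i)
    return q

# Hypothetical parameter values, chosen only for illustration.
rates = {
    frozenset("AC"): 1.0, frozenset("AG"): 4.0, frozenset("AT"): 1.0,
    frozenset("CG"): 1.0, frozenset("CT"): 4.0, frozenset("GT"): 1.0,
}
freqs = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
q = gtr_rate_matrix(rates, freqs)
```

Time reversibility shows up as the identity `freqs[i] * q[i][j] == freqs[j] * q[j][i]` for every pair of bases, which holds here because each pair shares a single exchange rate.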

NJ (Neighbor joining)

The most widely used distance-based method.
Applies a clustering algorithm to the distance matrix to construct a fully resolved phylogeny.
It starts with a star tree and then pairs taxa based on their distances; the taxa to join are chosen so as to minimize an estimate of the total tree length.
NJ allows unequal rates of evolution, so branch lengths are proportional to the amount of evolutionary change (it does not produce an ultrametric tree).
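A single joining step of NJ can be sketched as follows. The pair to join is the one minimizing the usual criterion Q(i, j) = (n - 2) d(i, j) - r_i - r_j, where r_i is the sum of distances from taxon i to all other taxa. The function name, the dict-based input format, and the five-taxon distances are assumptions for illustration; a full implementation would repeat this step after replacing the joined pair with a new node.

```python
from itertools import combinations

def nj_pick_pair(names, d):
    """Pick the pair of taxa to join and the branch lengths to their new parent node.

    names: list of taxon labels (at least three).
    d:     dict mapping frozenset({i, j}) to the distance between taxa i and j.
    """
    n = len(names)
    totals = {i: sum(d[frozenset((i, k))] for k in names if k != i) for i in names}

    def q(pair):
        i, j = pair
        return (n - 2) * d[frozenset(pair)] - totals[i] - totals[j]

    i, j = min(combinations(names, 2), key=q)
    dij = d[frozenset((i, j))]
    # Unequal branch lengths are allowed, so the tree need not be ultrametric.
    limb_i = dij / 2 + (totals[i] - totals[j]) / (2 * (n - 2))
    return (i, j), (limb_i, dij - limb_i)

# Hypothetical toy distances for five taxa, not real data.
pairs = {
    "ab": 5, "ac": 9, "ad": 9, "ae": 8, "bc": 10,
    "bd": 10, "be": 9, "cd": 8, "ce": 7, "de": 3,
}
dist = {frozenset(k): float(v) for k, v in pairs.items()}
pair, limbs = nj_pick_pair(list("abcde"), dist)
```

Note that the pair chosen here is a and b, even though d and e have the smallest raw distance: NJ corrects each distance by how far the two taxa are from everything else, which is what lets it cope with unequal rates.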

Advantages of NJ

Fast and suitable for large datasets and bootstrap analysis.
Permits lineages with very different branch lengths.
Does not assume that taxa are equidistant from the connecting node (does not generate ultrametric trees).
Branch lengths are proportional to the amount of evolutionary change.

Disadvantages of NJ

Gives only one possible tree.
The output tree depends on the evolutionary model used to calculate the distances.

Least squares method

Minimizes the differences between the observed distances in the distance matrix and the expected distances on the tree. In other words, the sum of (observed – expected)² over all pairs of taxa should be minimized.
The tree whose branch lengths give the smallest sum of squared differences is therefore preferred.
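The least squares criterion can be illustrated with a small scoring function. This sketch only evaluates the criterion for tree distances that are already given; it does not search over topologies or fit branch lengths, and the two sets of tree distances are hypothetical values chosen for illustration.

```python
def least_squares_score(observed, expected):
    """Sum of squared differences between observed and tree-implied distances."""
    return sum((observed[pair] - expected[pair]) ** 2 for pair in observed)

# Observed distances from the distance matrix (toy values).
observed = {("A", "B"): 3, ("A", "C"): 5, ("B", "C"): 6}

# Hypothetical path-length distances implied by two candidate sets of branch lengths.
candidate_1 = {("A", "B"): 3, ("A", "C"): 5, ("B", "C"): 7}
candidate_2 = {("A", "B"): 4, ("A", "C"): 4, ("B", "C"): 6}

scores = {
    "candidate_1": least_squares_score(observed, candidate_1),
    "candidate_2": least_squares_score(observed, candidate_2),
}
best = min(scores, key=scores.get)
```

The candidate with the lower score fits the distance matrix better under this criterion.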

Minimum Evolution (ME)

This method finds the best branch lengths by least squares and assumes that the tree with the smallest total branch length is most likely the true tree.
In other words, among the candidate trees, the one with the shortest total length is preferred.

Strengths of distance-based methods

High computational efficiency – especially NJ
Good for large data sets with low levels of sequence divergence
Realistic substitution models can be used to calculate the pairwise distances (NJ, least squares method, Minimum Evolution (ME))

Weaknesses of distance-based methods

Distance-based methods perform poorly for very divergent sequences because large distances have large sampling errors.
Clustering methods such as UPGMA and NJ have no explicit optimization criterion and produce only a single tree.
They are sensitive to gaps in the sequence alignment.

Computational Analysis of Distance-based Phylogenetic Trees

Obtain the nucleotide sequences
Align the sequences (using MUSCLE or ClustalW)
Build the distance matrix using a suitable substitution model
Construct the phylogenetic tree (using UPGMA or NJ)

Software for distance-based phylogenetic analysis – MEGA (Molecular Evolutionary Genetics Analysis), PHYLIP

Download MEGA

Download PHYLIP
