Distance-based approach to phylogenetic analysis

The distance between every pair of sequences is calculated, and the resulting distance matrix is used for tree reconstruction.
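
As a minimal sketch of this step, the snippet below computes the simplest distance, the proportion of differing sites (p-distance), for every pair of aligned sequences. The function names, the choice to skip gapped sites, and the toy sequences are illustrative assumptions and do not come from any particular program.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Proportion of compared sites at which two aligned sequences differ.

    Assumption: sites where either sequence has a gap ('-') are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            mismatches += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return mismatches / compared


def distance_matrix(alignment):
    """Build a symmetric matrix {name: {name: distance}} for all taxa."""
    names = list(alignment)
    matrix = {name: {name: 0.0} for name in names}
    for a, b in combinations(names, 2):
        d = p_distance(alignment[a], alignment[b])
        matrix[a][b] = d
        matrix[b][a] = d
    return matrix


# Hypothetical toy alignment, used only for illustration.
toy_alignment = {
    "taxon_A": "ACGTACGTAC",
    "taxon_B": "ACGTACGTTC",
    "taxon_C": "ACGAACGTTG",
}
matrix = distance_matrix(toy_alignment)
for name, row in matrix.items():
    print(name, row)
```

In practice the p-distance is usually corrected with a substitution model before the tree is built.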

The clustering procedure of distance-based analysis is given below.

  1. Calculate the distances between all pairs of taxa and arrange them in a matrix.
  2. Join the two closest clusters.
  3. Recalculate the distances between all the clusters.
  4. Repeat steps 2 and 3 until all species are connected into a single cluster.

The distance-based approach is also known as the algorithmic or clustering method. Starting with the most closely related sequence pairs, clustering algorithms build a phylogenetic tree from the distance matrix. The distances themselves are calculated using models of nucleotide substitution (for example, for NJ trees).

Distance-based analysis methods

  1. UPGMA (Unweighted Pair Group Method with Arithmetic mean)
  2. NJ (Neighbor Joining)
  3. Least squares method
  4. Minimum Evolution (ME)

UPGMA (Unweighted Pair Group Method with Arithmetic mean)

UPGMA is the simplest method for constructing trees. It is a clustering algorithm that computes the distance between two clusters as the average pairwise distance between their members.

The algorithm assumes an equal rate of evolution on all lineages; therefore, all tips of the resulting tree are at the same distance from the root.
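
The following is a minimal sketch of UPGMA under stated assumptions: the input is a dictionary of pairwise distances keyed by taxon-name pairs, the tree is returned as nested tuples, and ties are broken by whichever pair is found first. The function name and the four-taxon distances are hypothetical.

```python
def upgma(names, distances):
    """Cluster taxa with UPGMA.

    distances: {(name_a, name_b): distance} for every unordered pair.
    Returns (tree, root_height), where tree is a nested tuple of names.
    """
    # cluster id -> (subtree, number of leaves, node height)
    clusters = {i: (name, 1, 0.0) for i, name in enumerate(names)}
    index = {name: i for i, name in enumerate(names)}
    d = {frozenset((index[a], index[b])): v for (a, b), v in distances.items()}
    next_id = len(names)

    while len(clusters) > 1:
        # Join the two closest clusters.
        pair = min(d, key=d.get)
        i, j = sorted(pair)
        tree_i, size_i, _ = clusters.pop(i)
        tree_j, size_j, _ = clusters.pop(j)
        # Equal rates: the new node sits halfway between the two clusters.
        height = d[pair] / 2.0

        # Distance to every remaining cluster is the size-weighted average,
        # which equals the average over all leaf pairs.
        new_d = {}
        for k in clusters:
            d_ik = d[frozenset((i, k))]
            d_jk = d[frozenset((j, k))]
            new_d[frozenset((next_id, k))] = (
                size_i * d_ik + size_j * d_jk
            ) / (size_i + size_j)

        # Drop distances that involve the two merged clusters.
        d = {p: v for p, v in d.items() if i not in p and j not in p}
        d.update(new_d)
        clusters[next_id] = ((tree_i, tree_j), size_i + size_j, height)
        next_id += 1

    tree, _, height = next(iter(clusters.values()))
    return tree, height


# Hypothetical clock-like distances for four taxa.
names = ["A", "B", "C", "D"]
distances = {
    ("A", "B"): 2.0, ("A", "C"): 6.0, ("A", "D"): 10.0,
    ("B", "C"): 6.0, ("B", "D"): 10.0, ("C", "D"): 10.0,
}
tree, height = upgma(names, distances)
print(tree, height)
```

Each node height is half the distance between the clusters it joins, which is what makes every tip equidistant from the root.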

Assumptions of the UPGMA

  • Same evolutionary speed on all lineages (the rate of mutation is constant over time).
  • Constant molecular clock.
  • Clustered sequences are equidistant from their common ancestor (ultrametric tree).
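
One way to see whether a distance matrix fits these assumptions is the three-point condition: distances are ultrametric when, for every three taxa, the two largest of the three pairwise distances are equal. The sketch below checks this; the function name, the tolerance and the example values are illustrative assumptions.

```python
from itertools import combinations


def is_ultrametric(names, dist, tol=1e-9):
    """Check the three-point condition on all triples of taxa.

    dist: {frozenset({a, b}): distance} for every unordered pair.
    """
    for a, b, c in combinations(names, 3):
        trio = sorted(
            [
                dist[frozenset((a, b))],
                dist[frozenset((a, c))],
                dist[frozenset((b, c))],
            ]
        )
        # The two largest distances must match (within tolerance).
        if abs(trio[2] - trio[1]) > tol:
            return False
    return True


names = ["A", "B", "C"]

# Hypothetical clock-like data: C is equally far from A and B.
clock_like = {
    frozenset(("A", "B")): 2.0,
    frozenset(("A", "C")): 6.0,
    frozenset(("B", "C")): 6.0,
}

# Hypothetical data where lineage B evolved faster than lineage A.
rate_varies = {
    frozenset(("A", "B")): 2.0,
    frozenset(("A", "C")): 6.0,
    frozenset(("B", "C")): 7.0,
}

clock_ok = is_ultrametric(names, clock_like)
varies_ok = is_ultrametric(names, rate_varies)
print(clock_ok, varies_ok)
```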

Advantages of UPGMA

  • Fast method
  • It does not require the assumptions of an evolutionary (substitution) model.
  • Easy to interpret: it provides a clear representation of the phylogenetic relationships between taxa.

Disadvantages of UPGMA

  • The constant molecular clock assumption fails when the rate of evolution varies among lineages or over time.
  • Always generates ultrametric trees, even when the data do not fit one.
  • Gives only one possible tree.

Nucleotide substitution

Many kinds of nucleotide substitution occur along evolutionary lineages: single substitution, multiple substitution, coincidental substitution, parallel substitution, convergent substitution, and back substitution.

Nucleotide substitutions are grouped hierarchically

  • Simple substitutions
  • General base substitutions
  • Transitions and transversions
  • Purine to purine & pyrimidine to pyrimidine transitions
  • AC/GT & AT/CG transversions.
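
The transition/transversion split can be sketched in a few lines. The helper names below are hypothetical, and sites containing anything other than A, C, G or T are simply skipped, which is an assumption of this example.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def classify_substitution(a, b):
    """Return 'none', 'transition' or 'transversion' for two bases."""
    a, b = a.upper(), b.upper()
    if a not in BASES or b not in BASES:
        raise ValueError("expected one of A, C, G, T")
    if a == b:
        return "none"
    # Purine <-> purine or pyrimidine <-> pyrimidine is a transition.
    if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
        return "transition"
    # Purine <-> pyrimidine is a transversion.
    return "transversion"


def count_ts_tv(seq_a, seq_b):
    """Count (transitions, transversions) between two aligned sequences."""
    transitions = 0
    transversions = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in BASES or b not in BASES:
            continue  # skip gaps and ambiguous characters (assumption)
        kind = classify_substitution(a, b)
        if kind == "transition":
            transitions += 1
        elif kind == "transversion":
            transversions += 1
    return transitions, transversions


# Hypothetical aligned pair: A>G and C>T are transitions, T>A is a transversion.
counts = count_ts_tv("AACGT", "GATGA")
print(counts)
```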

Models of Nucleotide Substitutions

  • Nucleotide substitution models are used to minimize the discrepancy between the observed sequence differences and the expected sequence differences, for example by correcting for multiple substitutions at the same site.

Some Models of nucleotide substitution

  • Markov chain model: predicts the probability of a sequence of events based only on the most recent event, not on its history or context. It is useful for simulating sequence evolution, estimating the rate of evolution, and correcting for multiple substitutions.
  • Jukes-Cantor model: equal probability of change between any two nucleotides.
  • Kimura two-parameter model: equal base frequencies, with different probabilities for transitions and transversions (transitions are more frequent than transversions).
  • Hasegawa-Kishino-Yano (HKY) model: variable base frequencies, with one transition rate and one transversion rate. That is, the probability of a transition is the same for every type of transition, the probability of a transversion is the same for every type of transversion, and the two probabilities differ.
  • General Time Reversible (GTR) model:
      • Describes how DNA sequences change over time.
      • The relative frequency of each nucleotide does not change (after any amount of time has passed, the frequency of each nucleotide is the same as before).
      • Substitutions can happen in either direction; there is no assumption of a preferred direction of substitution.
      • It uses six substitution rate parameters, in addition to the nucleotide frequencies, because 6 types of nucleotide substitution are possible: A ↔ C, A ↔ G, A ↔ T, C ↔ G, C ↔ T, G ↔ T
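
To make the correction for multiple substitutions concrete, the sketch below applies the standard Jukes-Cantor and Kimura two-parameter distance formulas to observed proportions of differing sites. The function names and the example proportions are assumptions chosen for illustration.

```python
import math


def jc69_distance(p):
    """Jukes-Cantor distance from the observed proportion of differences p.

    d = -(3/4) * ln(1 - 4p/3); undefined for p >= 0.75 (saturation).
    """
    if not 0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


def k2p_distance(ts, tv):
    """Kimura two-parameter distance.

    ts: observed proportion of sites showing a transition
    tv: observed proportion of sites showing a transversion
    d = -(1/2) * ln(1 - 2*ts - tv) - (1/4) * ln(1 - 2*tv)
    """
    a = 1 - 2 * ts - tv
    b = 1 - 2 * tv
    if a <= 0 or b <= 0:
        raise ValueError("distance undefined: sequences too divergent")
    return -0.5 * math.log(a) - 0.25 * math.log(b)


# Hypothetical observation: 30% of sites differ, 20% transitions, 10% transversions.
observed = 0.30
jc = jc69_distance(observed)
k2p = k2p_distance(0.20, 0.10)
print(observed, jc, k2p)
```

Both corrected distances come out larger than the observed proportion, because some substitutions are hidden by later changes at the same site.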

NJ (Neighbor joining)

  • The most widely used distance-based method.
  • Applies a clustering algorithm to the distance matrix to construct a fully resolved phylogeny.
  • It starts with a star tree and then pairs taxa based on their distances; the taxa to join are chosen so as to minimize an estimate of the tree length.
  • NJ allows unequal rates of evolution; therefore, branch length is proportional to the amount of evolutionary change (it does not produce an ultrametric tree).
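
A compact sketch of the neighbor-joining procedure is shown below. It assumes a dictionary of pairwise distances keyed by taxon-name pairs, returns the unrooted tree as a list of edges, and names internal nodes node1, node2 and so on (so taxa must not use those names). The function name and the five-taxon matrix are illustrative.

```python
from itertools import combinations


def neighbor_joining(names, distances):
    """Build an unrooted tree by neighbor joining.

    distances: {(name_a, name_b): distance} for every unordered pair.
    Returns a list of (node, node, branch_length) edges.
    """
    d = {frozenset(pair): float(v) for pair, v in distances.items()}
    nodes = list(names)
    edges = []
    next_internal = 1

    while len(nodes) > 2:
        n = len(nodes)
        # Net divergence of each node from all the others.
        r = {i: sum(d[frozenset((i, k))] for k in nodes if k != i) for i in nodes}
        # Choose the pair that minimizes Q(i, j) = (n - 2) d(i, j) - r_i - r_j.
        i, j = min(
            combinations(nodes, 2),
            key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]],
        )
        d_ij = d[frozenset((i, j))]
        # Branch lengths from i and j to the new internal node may differ.
        len_i = d_ij / 2 + (r[i] - r[j]) / (2 * (n - 2))
        len_j = d_ij - len_i
        u = f"node{next_internal}"
        next_internal += 1
        edges.append((i, u, len_i))
        edges.append((j, u, len_j))
        # Distances from the new node to every remaining node.
        for k in nodes:
            if k not in (i, j):
                d[frozenset((u, k))] = (
                    d[frozenset((i, k))] + d[frozenset((j, k))] - d_ij
                ) / 2
        nodes = [k for k in nodes if k not in (i, j)] + [u]

    a, b = nodes
    edges.append((a, b, d[frozenset((a, b))]))
    return edges


# Hypothetical distances for five taxa.
names = ["a", "b", "c", "d", "e"]
distances = {
    ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
    ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
    ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3,
}
edges = neighbor_joining(names, distances)
for edge in edges:
    print(edge)
```

Unlike UPGMA, the two branches leading to a joined pair are allowed to have different lengths, so the result is not forced to be ultrametric.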

Advantages of NJ

  • Fast and suitable for large datasets and bootstrap analysis.
  • Permits lineages with very different branch lengths.
  • Does not assume that taxa are equidistant from the connecting node (does not generate ultrametric trees).
  • Branch length is proportional to the amount of evolutionary change.

Disadvantages of NJ

  • Gives only one possible tree.
  • Output tree depends on the evolutionary model

Least squares method

  • Minimizes the differences between the observed distances in the distance matrix and the expected distances on the tree. In other words, the sum of (observed - expected)² over all pairs of taxa should be minimized.
  • Therefore, the tree whose branch lengths reproduce the distance matrix most closely is preferred.
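
The criterion can be sketched as follows: for a candidate tree, sum the branch lengths along the path between each pair of taxa (the expected distance) and compare it with the observed distance. The tree representation (a list of edges), the function names and all numbers are hypothetical; the sketch uses the unweighted sum of squares and only scores trees whose branch lengths are already given.

```python
def patristic_distances(edges, leaves):
    """Path lengths between all pairs of leaves in a tree given as edges."""
    adjacency = {}
    for a, b, length in edges:
        adjacency.setdefault(a, []).append((b, length))
        adjacency.setdefault(b, []).append((a, length))

    def distances_from(start):
        # Depth-first walk accumulating branch lengths from `start`.
        dist = {start: 0.0}
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbor, length in adjacency[node]:
                if neighbor not in dist:
                    dist[neighbor] = dist[node] + length
                    stack.append(neighbor)
        return dist

    result = {}
    for a in leaves:
        reach = distances_from(a)
        for b in leaves:
            if a < b:
                result[(a, b)] = reach[b]
    return result


def least_squares_score(observed, edges, leaves):
    """Sum of (observed - expected)^2 over all pairs of leaves."""
    expected = patristic_distances(edges, leaves)
    return sum((observed[pair] - expected[pair]) ** 2 for pair in expected)


leaves = ["A", "B", "C", "D"]
# Hypothetical observed distances (keys are alphabetically ordered pairs).
observed = {
    ("A", "B"): 3.0, ("A", "C"): 5.0, ("A", "D"): 4.0,
    ("B", "C"): 6.0, ("B", "D"): 4.0, ("C", "D"): 3.5,
}
# Two hypothetical candidate trees with given branch lengths.
tree_1 = [("A", "x", 1.0), ("B", "x", 2.0), ("x", "y", 1.0),
          ("C", "y", 3.0), ("D", "y", 1.0)]
tree_2 = [("A", "x", 1.0), ("C", "x", 3.0), ("x", "y", 1.0),
          ("B", "y", 2.0), ("D", "y", 1.0)]
score_1 = least_squares_score(observed, tree_1, leaves)
score_2 = least_squares_score(observed, tree_2, leaves)
print(score_1, score_2)
```

The candidate with the lower score fits the distance matrix better.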

Minimum Evolution (ME)

  • This method finds the best branch lengths for each candidate tree by least squares, and assumes that the tree with the smallest total branch length is most likely to be the true tree.
  • A tree with a shorter total length is therefore preferred over one with a longer total length.
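
The selection step can be sketched as below, assuming the branch lengths of each candidate tree have already been estimated by least squares. The candidate names and branch lengths are hypothetical values, not fitted results.

```python
def tree_length(edges):
    """Total tree length: the sum of all branch lengths."""
    return sum(length for _, _, length in edges)


# Hypothetical candidate trees; branch lengths are assumed to have been
# fitted by least squares beforehand.
candidates = {
    "tree_1": [("A", "x", 1.0), ("B", "x", 2.0), ("x", "y", 1.0),
               ("C", "y", 3.0), ("D", "y", 1.0)],
    "tree_2": [("A", "x", 1.5), ("C", "x", 3.0), ("x", "y", 0.5),
               ("B", "y", 2.5), ("D", "y", 1.5)],
}

lengths = {name: tree_length(edges) for name, edges in candidates.items()}
# Minimum evolution keeps the candidate with the smallest total length.
best = min(lengths, key=lengths.get)
print(lengths, best)
```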

Strengths of the distance-based methods

  • High computational efficiency, especially NJ.
  • Good for large data sets with low levels of sequence divergence.
  • Can use a realistic substitution model to calculate the pairwise distances (NJ, least squares method, Minimum Evolution (ME)).

Weaknesses of the distance-based methods

  • Distance-based methods perform poorly for very divergent sequences because large distances carry large sampling errors.
  • Clustering methods such as UPGMA and NJ have no explicit optimization criterion; only a single tree is produced.
  • They are sensitive to gaps in the sequence alignment.

Computational Analysis of Distance-Based Phylogenetic Trees

  1. Obtain the nucleotide sequences.
  2. Align the sequences (using MUSCLE or ClustalW).
  3. Build the distance matrix using a suitable substitution model.
  4. Construct the phylogenetic tree (using UPGMA or NJ).

Software for distance-based phylogenetic analysis: MEGA (Molecular Evolutionary Genetics Analysis), PHYLIP

