Notes on Phyolgenetic Trees, NeedlemanWunsch, And SingleLinkage Clustering
Chris Tralie
These notes go over the background topics for homework 5, which bring together minimum spanning tree algorithms and dynamic programming an cool application to biology and the "tree of life."
 Amino Acids
 The NeedlemanWunsch Algorithm
 NeedlemanWunsch Interactive Applet
 BLOSUM Amino Acid Substitution Costs
 Phylogenetic Trees
 SingleLinkage Clustering
 A 1 penalty for deleting a
 A 2 penalty for deleting b
 A 3 penalty for swapping an a for a b or a b for an a
 A +2 score for matching an a to an a
 A +3 score for matching a b to a b
 Start off with each species in its own cluster, which is represented by a leaf node in the tree.

Merge the two clusters with the highest similarity, according to the NeedlemanWunsch distance. Similarity is defined as the maximum similarity between all pairs of items in one and the other cluster. Create a new internal node in the tree whose children are the root nodes of each cluster that's being merged.
As an example, suppose that Fugu rubipres and Goldfish are paired together, and Eastern newt and Fire salamander are paired together. Then the BLOSUM62based NeedlemanWunsch scores are as follows:
Eastern newt Fire salamander Fugu rubipres 1319 1294 Goldfish 1341 1300  Repeat step 2 as long as there is more than one cluster.
 The final node that's created is the root of the tree.
 The tree that's being built
 A union find data structure to keep track of which leaf nodes (corresponding to species) are part of the same connected component
 An array or dictionary that keeps track of the node of the tree corresponding to a root in the union find data structure
 Create a node for each species, and create a union find structure where each node starts off as its own component. Set the roots of each of these components to be the leaf nodes in the tree in a separate array.
 Compute all pairwise similarities between nodes based on NeedlemanWunsch.
 Sort all pairs of nodes in decreasing order of similarity to ensure that the most similar pairs get grouped together first. This step incurs the dominant cost of O(N^{2} log N) since there are O(N^{2}) pairs.
 Now we need to merge the nodes together from the bottom up according to our sort. In particular, for each pair of nodes in the above order, check to see if they are part of different clusters by comparing their roots in union find. If they have different roots, merge them together in the union find data structure, create a new internal node for them, and set the new root to point to this internal node. Otherwise, if they are already part of the same cluster, they have the same root (adding an edge would form a cycle), and you can simply skip this pair.
 The last node to be added after repeating step 4 for all pairs is the root of the tree.
Table of Contents
Amino Acids
A DNA sequence can be thought of as a string. Roughly, DNA encodes building blocks known as amino acids, which, when decoded and strung together in sequence, form proteins. The sequence has a large impact on the way the protein structure folds and ultimately forms a 3D shape, leading towards a particular biomolecular role.
Click here to see more information about each type of amino acid. In assignment 5, we will be considering an alphabet of 23 amino acids, including all of the standard 20, as well as the B,Z,X special ones highlighted in red in the aforementioned link.
The NeedlemanWunsch Algorithm
Since DNA is a string, we can compare two DNA sequences with string comparison methods, but when comparing the sequences across organisms, the technique needs to account for mutations that have occurred over evolutionary histories, including additions, deletions, and substitutions of individual amino acids. We've discussed dynamic programing techniques for computing the string edit distance, which accounts for such mutations. However, it models a unit cost for an addition, deletion, and substitution alike, and there are biological reasons that we may want to have costs that are more less expensive for certain amino acid edits, as explained below.
To address the need for variable costs, there is a variant of edit distance known as NeedlemanWunsch, in which the costs can change depending on what characters are involved. By convention, we actually switch from a "minimizing cost" mindset to a maximizing score mindset. Given a string of length M and a string of length N, the optimal NeedlemanWunsch score can be computed in O(MN) time using a similar dynamic programming algorithm to edit distance and dynamic time warping. In particular, to match a string of length M to a string of length N, the vanilla version proceeds by filling in an (M+1) x (N+1) table S^{*} storing all solutions to subproblems, using the following recurrence relation, where d_{i} and d_{j} refer to the cost of deleting the i^{th} and j^{th} characters of the first and second strings, respectively, and c_{ij} refers to the cost of substituting the i^{th} character of the first string for the j^{th} character of the second string (and everything is 0indexed):
\[ S_{ij} = \left\{ \begin{array}{cc} S_{i1, 0} + d_{i1} & j = 0 \\ S_{0, j1} + d_{j1} & i = 0 \\ \max \left\{ \begin{array}{c} S_{i1, j1} + c_{i1, j1} \\ S_{i, j1} + d_{j1} \\ S_{i1, j} + d_{i1} \end{array} \right\} & i > 0, j > 0 \end{array} \right\} \]
(*) Note that there is a version that uses only Θ(N) memory, which uses a similar trick to a recent algorithm I developed with an Ursinus student for dynamic time warping
To help you explore examples of NeedlemanWunsch scoring, I have provided a live demo app below. In the default example, we have an alphabet of two letters a and b, and the scores are as follows
{"ab":3}
means that it costs 3 both to swap a in for b and to swap b in for a, and {"a":1}
refers to a penalty of 1 for deleting a.
To generate examples, when you hit "match strings" in the live demo below, this program will run the NeedlemanWunsch algorithm to fill out the dynamic programming table. It will also perform a backtracing to find one of the optimalcost sequence of edit operations, though this is not something you have to do in assignment 5. Play around with different examples and explore the results until you feel comfortable.
NeedlemanWunsch Interactive Applet
String 1  
String 2  
Alphabet  
Pairwise  
Optimal Matching Score 
BLOSUM Amino Acid Substitution Costs
Now that we have an alphabet for DNA strings and an algorithm to compare them, we can define meaningful costs for matchings and substitutions of amino acids. We will be using tables obtained from an experimental, datadriven technique that constructs a "Blocks Substitution Matrix," or "BLOSUM" for short, as described in this paper. In a nutshell, the technique computes statistical likelihoods that substitutions take place by examining many wellaligned sequences.
There are different similarity thresholds at which BLOSUM tables can be constructed, and higher numbers mean that we're more conservative in which symbols we allow to be aligned. In assignment 5, we'll be considering BLOSUM50 and BLOSUM62, as obtained from ftp://ftp.ncbi.nih.gov/repository/blocks/unix/blosum/blosum.tar.Z. Click the "show/hide" buttons to view each of these below. You'll notice that we get a positive score when matching an amino acid to itself, and a negative score when swapping amino acids and when deleting them (matching to a *), so this is compatible with NeedlemanWunsch. You'll also notice that this matrix is symmetric (e.g. A to D and D to A are both 2), so we will only be storing the nonredundant parts in a lookup dictionary.
BLOSUM 62  BLOSUM 50 



NOTE: Interestingly, as shown in this paper, the original authors had a normalization error when computing the BLOSUM matrices according to their procedure, but the matrices somehow yield better search results. So we will be using the original matrices in assignment 5.
Phylogenetic Trees
Now that we have ways of measuring similarity between amino acid sequences, we can use these similarity measures as a proxy for overall similarity between species. Amazingly, if we choose the right genes to analyze, we can use NeedlemanWunsch similarity to build a "tree of life" from the bottom up. In this context, the tree will show branching in evolutionary history, and it is known as a Phylogenetic tree. To build it, we'll merge nodes together first that have the highest similarity score, and they will end up towards the bottom of the tree. Internal nodes can then be thought of as common ancestors, and eventually we will end up at a root which can be thought of as the origin of life.
When inferring the evolution of species, good genes to examine come from mitochondrial DNA (DNA describing the energy factory in cells), as the DNA mutates quickly and is only passed on from the mother. Using https://www.ncbi.nlm.nih.gov/nuccore/, I've provided amino acid sequences data for 71 species in the file organisms.json from the COX3 mitochondrial gene on homework 5. The figure below shows the phylogenetic tree of these species obtained using NeedlemanWunsch with BLOSUM62 costs. We will walk through how to construct this in the next section.
SingleLinkage Clustering
We will now describe an algorithm, known as "single linkage clustering with the min rule," that can be used to construct phylogenetic trees. In this context, a phylogenetic tree is a specific example of a more general construct known as a dendrogram.
Algorithm 1: O(N^{3}) Naive SingleLinkage Clustering
We will first describe an algorithm that will help to explain conceptually how the tree is built, but this algorithm is inefficient, so we will then explain a better algorithm that computes the same information, and you will implement the latter algorithm.
Algorithm 2: Kruskal's Algorithm for O(N^{2}logN) Single Linkage Clustering
The algorithm we've described so far totally works, but the problem is that a straightforward implementation from this description takes O(N^{3}) time for N items. That's OK for only 71 species in our example, but these sorts of trees are often built on images, where every pixel starts off in its own node. Even a 100x100 image has 10,000 pixels, so N^{3} is a trillion at that point. So even though we can get away with a naive algorithm in assignment 5, for the purposes of the class, we will be implementing a more efficient version based on an algorithm known as Kruskal's algorithm, using union find to help us. The algorithm uses union find to keep track of clusters as clusters are merged together from most to least similar until everything is in one tree. The algorithm maintains three data structures in tandem as it goes along
This algorithm is tricky enough to warrant an example showing the dynamics of the three data structures. For simplicity of illustration, we'll show building a dendrogram corresponding to a minimal spanning tree on a set of points in the plane, though the algorithm is very similar for phylogenetic trees on Needleman Wunsch; we basically just switch sorting in ascending order of length to sorting in descending order of similarity. The example is shown below. The left plot shows the tree being built on the original point cloud, as well as the parents pointers of the union find. The right plot shows the dendrogram being constructed as new internal nodes are being added, and it also shows the array that keeps track of which union find roots correspond to which dendrogram roots.
NOTE ALSO: The maximum rule algorithm is not necessarily the best algorithm for constructing phylogenetic trees, because it tends to produce subtrees whose most similar nodes are quite similar, but whose least similar nodes can be quite different. A better scheme is the "average rule," in which nodes are merged together based on average similarity of pairs between them. But this is beyond the scope of assignment 5.