In this lab you will implement an efficient data structure for the disjoint set data structure known as union find. You will implement and test several accelerations to union find that make the "amortized cost" (average cost) of an operation near constant time(!)
Click here to download the starter code for this lab, which provides a reference implementation of the slow disjoint set data structure, as well as code to compare this to your implementations in this lab in the file
- Convert a problem into a format required by an abstract data type
- Implement union find with size-based merging and path compression in python
- Experimentally measure and plot the amortized cost of operations as the input size scales
Review the pseudocode and example picture below of the basic union find algorithm. If needed, click here to rewatch the video I made that explains this
Your Task: Following the pseudocode in the above picture, create a file called
unionfind.py which contains a class called
UnionFind with methods for
union. For simplicity in this lab, you can assume that you start off with individual bubbles for the numbers 0, 1, ..., N-1, and you can pass
N as a parameter to the constructor, just as in the naive implementation you explored in the module.
Create some simple tests to convince yourself that this is working before you proceed to the next section.
One way that we can make union find much faster is if we do something called path compression. At a high level, we want to try to avoid very long branches in our forest, because this will make it slower to find the root. So what we can do is some bookkeeping to make future calls of root are more efficient. It's easiest to see this with an example. Suppose we start with the forest below:
Then, let's suppose we called
root(9). All of the highlighted nodes (9, 6, 1, 7) below will be visited on the path on the way to 9's root:
This is a long path relative to the number of elements here, but we can cut it down for future iterations by backing up and creating a shortcut from all of the highlighted nodes to the root that was found. The image below shows this, with parent pointers bolded to indicate what was changed
Notice how much shorter the tree is now! This is one of the ways we get towards near constant time performance on future queries
Your Task: Create a copy of your file
unionfindopt.py. In this new file, modify your
root method to implement path compression, and make sure your union find still works properly
There is one more optimization we can do to speed up union find. Let's suppose we have the following forest below. In addition to storing the parent references, we'll store how many nodes are under each node in a list called
weights. Actually, we only need to remember the weights of the roots for what we're about to do, so I'll omit all of the other weights
Now let's suppose we call
union(4, 1). This will merge the roots 0 and 3. One possibility would be make 0's parent be 3, as shown below:
The other option is to make 3's parent be 0, as shown below:
This is definitely the better option! The reason is that it only increases the distance of two nodes to the root: 3 and 4. By contrast, making 0 the parent of 3 increases the distance of nodes 0, 1, 2, 5, 6, 7, 8, 9 all by 1.
Your task: Modify your
union method in
unionfindopt.py so that it implements weight-based merging, in addition to path compression. That is, keep track of the weights of root nodes, and make the parent the one with the larger size.
Note that the weights of these roots will be unchanged by path compression, so we can combine this trick with path compression.
Now we will get a sense for the efficiency of the tree-based union find compared to one of our original ideas empirically by testing it out and timing it on inputs of varying size. Path compression may take a long time on some calls to
root, but we're banking on it saving us time in future operations. Hence, we are interested not in the cost of an individual operation, but in the amortized cost, or the average cost of unions and finds over many operations. Furthermore, whenever we're thinking about the performance of algorithms, we want to know how the algorithms scale in performance as the input size scales up.
Your task: Modify both of your union find classes to have two local variables:
_operations: A running count of how many times a parent arrow was followed in
rootand how many times a parent was reassigned in
_calls: The total number of calls made to union and find.
I've done a similar thing in
djsetslow.py. Then, at the end of many operations, the amortized cost can be computed as
Assuming you've done all of this properly, I've created a program in
test.py to call your code and the slow disjoint set code on many random tests of different sizes, and to plot the results. The program will also let you know if your find operations don't agree with those in
djsetslow.py, which you can consider a reference answer.
Question: What does the amortized cost of the slow implementation appear to be in big-O notation, as a function of N? What does the amortized cost of the optimized union find appear to be as a function of N?
For the bored...
Testing only once per input size is not reliable, because the operations chosen are random, and we may have gotten lucky or unlucky. To fix this, you should run multiple different random tests for a single input size and average their results. This will smooth things out a bit
Extra Credit (Up to +2)
Some students in the class have come up with other cool ideas for data structures for disjoint set. Submit a separate file with an implementation of a different idea, and add a plot for it in the experiments to see how it stacks up. We'll share these with the class when they're done.