Ursinus CS 271: Data Structures And Algorithms, Fall 2022

Lab 2: Union Find (8 Points)

Due Monday 9/19/2022

In this lab you will implement an efficient data structure for the disjoint set data structure known as union find. You will implement and test several accelerations to union find that make the "amortized cost" (average cost) of an operation near constant time(!)

Click here to download the starter code for this lab, which provides a reference implementation of the slow disjoint set data structure, as well as code to compare this to your implementations in this lab in the file test.py

Learning Objectives

Convert a problem into a format required by an abstract data type
Implement union find with size-based merging and path compression in python
Experimentally measure and plot the amortized cost of operations as the input size scales

What To Submit

When you are finished, submit your files unionfind.py and unionfindopt.py to canvas, as well as your amortized cost plot and an answer to the complexity question as a comment on canvas.

Background

Review the pseudocode and example picture below of the basic union find algorithm. If needed, click here to rewatch the video I made that explains this

Basic Union Find Implementation (2 Points)

Your Task: Following the pseudocode in the above picture, create a file called unionfind.py which contains a class called UnionFind with methods for root, find, and union. For simplicity in this lab, you can assume that you start off with individual bubbles for the numbers 0, 1, ..., N-1, and you can pass N as a parameter to the constructor, just as in the naive implementation you explored in the module.

Create some simple tests to convince yourself that this is working before you proceed to the next section.

Path Compression (2 Points)

One way that we can make union find much faster is if we do something called path compression. At a high level, we want to try to avoid very long branches in our forest, because this will make it slower to find the root. So what we can do is some bookkeeping to make future calls of root are more efficient. It's easiest to see this with an example. Suppose we start with the forest below:

Item	0	1	2	3	4	5	6	7	8	9
parent	0	7	0	3	4	9	1	0	1	6

Then, let's suppose we called root(9). All of the highlighted nodes (9, 6, 1, 7) below will be visited on the path on the way to 9's root:

This is a long path relative to the number of elements here, but we can cut it down for future iterations by backing up and creating a shortcut from all of the highlighted nodes to the root that was found. The image below shows this, with parent pointers bolded to indicate what was changed

Item	0	1	2	3	4	5	6	7	8	9
parent	0	0	0	3	4	9	0	0	1	0

Notice how much shorter the tree is now! This is one of the ways we get towards near constant time performance on future queries

Your Task: Create a copy of your file unionfind.py called unionfindopt.py. In this new file, modify your root method to implement path compression, and make sure your union find still works properly

Size-Based Merging (2 Points)

There is one more optimization we can do to speed up union find. Let's suppose we have the following forest below. In addition to storing the parent references, we'll store how many nodes are under each node in a list called weights. Actually, we only need to remember the weights of the roots for what we're about to do, so I'll omit all of the other weights

Item	0	1	2	3	4	5	6	7	8	9
parent	0	7	0	3	3	9	1	0	1	6
weights	8			2

Now let's suppose we call union(4, 1). This will merge the roots 0 and 3. One possibility would be make 0's parent be 3, as shown below:

Item	0	1	2	3	4	5	6	7	8	9
parent	0	7	0	3	3	9	1	0	1	6
weights				10

The other option is to make 3's parent be 0, as shown below:

Item	0	1	2	3	4	5	6	7	8	9
parent	0	7	0	3	3	9	1	0	1	6
weights	10

This is definitely the better option! The reason is that it only increases the distance of two nodes to the root: 3 and 4. By contrast, making 0 the parent of 3 increases the distance of nodes 0, 1, 2, 5, 6, 7, 8, 9 all by 1.

Your task: Modify your union method in unionfindopt.py so that it implements weight-based merging, in addition to path compression. That is, keep track of the weights of root nodes, and make the parent the one with the larger size.

Note that the weights of these roots will be unchanged by path compression, so we can combine this trick with path compression.

Amortized Cost Plots (2 Points)

Now we will get a sense for the efficiency of the tree-based union find compared to one of our original ideas empirically by testing it out and timing it on inputs of varying size. Path compression may take a long time on some calls to root, but we're banking on it saving us time in future operations. Hence, we are interested not in the cost of an individual operation, but in the amortized cost, or the average cost of unions and finds over many operations. Furthermore, whenever we're thinking about the performance of algorithms, we want to know how the algorithms scale in performance as the input size scales up.

Your task: Modify both of your union find classes to have two local variables:

_operations: A running count of how many times a parent arrow was followed in root and how many times a parent was reassigned in union
_calls: The total number of calls made to union and find.

I've done a similar thing in djsetslow.py. Then, at the end of many operations, the amortized cost can be computed as _operations/_calls.

Assuming you've done all of this properly, I've created a program in test.py to call your code and the slow disjoint set code on many random tests of different sizes, and to plot the results. The program will also let you know if your find operations don't agree with those in djsetslow.py, which you can consider a reference answer.

Question: What does the amortized cost of the slow implementation appear to be in big-O notation, as a function of N? What does the amortized cost of the optimized union find appear to be as a function of N?

For the bored...

Testing only once per input size is not reliable, because the operations chosen are random, and we may have gotten lucky or unlucky. To fix this, you should run multiple different random tests for a single input size and average their results. This will smooth things out a bit

Extra Credit (Up to +2)

Some students in the class have come up with other cool ideas for data structures for disjoint set. Submit a separate file with an implementation of a different idea, and add a plot for it in the experiments to see how it stacks up. We'll share these with the class when they're done.