Discovering Disjoint Set Data Structures

Chris Tralie

Getting Started

Click here to get the starter code for today.

Background

There is a very important abstract data type in algorithms known as the Disjoint Sets data type. You can think of it as starting off with a whole bunch of object-containing bubbles (sets) that merge together over time and never break apart. After so many bubbles have merged, we can ask the question: are two objects in the same bubble?

To keep things simple, all of the objects will simply be integers numbered from 0 up to some total number of bubbles minus 1. In our abstract data type, there are two operations we can do to integers:

  • union(i, j): Merge the bubbles (sets) containing i and j.
  • find(i, j): At this point in the disjoint set, are i and j part of the same set?

Example

Below is an example of a run of union and find on 10 numbers, which each individually start out in their own bubble/set

Exercise 1: Designing A Data Structure

A data structure is an implementation of an abstract data type. In C++ terms, the abstract data type is like a header file, and the data structure is the cpp file. The end user does not need to know the implementation details in order to use methods in the header file, but the person designing the code should come up with an efficient implementation so that the end user has a good experience.

In this first exercise, you should discuss with your group how to design a data structure to accomplish this task. You can do this just using lists, but you may also want to use a set. You should not write any code at this stage. Questions to ask yourself as you go:

  • What local variables should I create in the object, and what are their roles?
  • What algorithms should I implement that work with the variables I've defined?
  • How do I know my algorithms are correct?
You should write a few examples out by hand to make sure this works like you think it does (and perhaps verify with the above example). I will check in with all of the groups after a few minutes to see what people come up with.

Exercise 2: Implementing and testing the data structure

Now that you have a data structure in mind, it's time to implement it by turning your above ideas into code. So you should take a moment to do that as a group. Even if you think it works, however, you need to test it. I have provided a unit test for the above example in test.py, so if you first run

!python -m pytest test.py

In the spyder terminal, you should see all tests pass, and that's a good sign. However, you should make a different test of your own by filling in test_my(). Any time you want to test to make sure something is True, put a boolean expression in an assert statement. If that expression evaluates to False, then it will crash the code, and that test will fail.

Exercise 3: Corner Case Testing

An important part of data structure implementation is to ensure that your code works not just on "ordinary" inputs, but on all possible inputs. Think about what strange inputs could occur. Then design a corner case test for each one of them. In order for python's unit tester to run it, make sure you create a method that starts with the word test

Exercise 4: Stress Testing

Another way to test the robustness of a data structure implementation is to put a very large number of items in it and to make sure the operations still work as expected. Think about how you could create a very large random test. How many items will you have in your disjoint set, and how will you systematically (but randomly) merge them together and check to see what has been merged? Below is a code snippet that may help. If we fail the test, we want to be able to repeat it, so it's important to actually use "pseudorandomness" instead of randomness. So we use a seed