We discussed in class the problem of finding a needle in a haystack. More specifically, the abstract data type (the data structures version of an API) specifies three operations we need to perform on a collection of objects: find, add, and remove.

Officially, this is called the Set ADT. To keep things simple, we will assume our data structure has exactly one copy of each element. So if we say add(5), add(5), remove(5), then there is no longer a 5 (adding the second time had no effect). Also, we'll assume that removing an element that isn't there does nothing.
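These rules happen to match Python's built-in `set`, so a quick sketch with it illustrates the behavior we are assuming. Using `discard` here is a choice made for this illustration: unlike `set.remove`, it does nothing when the element is absent.

```python
s = set()
s.add(5)
s.add(5)        # adding a second time has no effect
s.discard(5)    # removes the one and only copy of 5
s.discard(7)    # 7 was never added, so this does nothing

has_five = 5 in s
print(has_five, len(s))
```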

As far as implementations go, we discussed four different options

We came up with the following table of the worst case number of operations one would have to do to perform our operations for a collection that currently has $N$ items:

| | Find | Add | Remove |
|---|---|---|---|
| Ordinary Array | $\approx N$ | $\approx 2N$ | $\approx 3N$ |
| Sorted Array | $\approx \log_2(N)$ | $\approx 2N$ | $\approx 2N + \log_2(N)$ |
| Linked List | $\approx N$ | $\approx 1$ | $\approx N$ |

Below are some explanations

Ordinary Array

Sorted Array

This is similar to the above, except searching is more efficient because we can repeatedly halve our search range until there's only one element left (recall the definition we gave of log). We'll explore this algorithm more in the first lab. It speeds up the find step, and with it the remove step, since finding the element is the first thing remove has to do.
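The halving idea can be sketched as a standard binary search. This is a minimal version that assumes the input is a Python list already sorted in increasing order; the function name `binary_search` is just a label chosen for this example.

```python
def binary_search(arr, target):
    """Return an index of target in the sorted list arr, or -1 if it is absent."""
    low, high = 0, len(arr) - 1
    while low <= high:
        mid = (low + high) // 2      # middle of the current search range
        if arr[mid] == target:
            return mid
        if arr[mid] < target:
            low = mid + 1            # target can only be in the right half
        else:
            high = mid - 1           # target can only be in the left half
    return -1                        # range became empty: not found
```

Each pass through the loop discards about half of the remaining range, which is where the $\log_2(N)$ comes from.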

Linked List

Searching/finding is just a matter of following the links, starting at the head. In the worst case, the element is at the end, or it isn't there at all, in which case we have to walk through all $N$ link nodes. Let's break down adding and removing in a bit more detail:

Linked List Adding

We just make the element we want to add the new head.

Specifically, the steps are:

  1. new.next = head
  2. head = new
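The two steps above, together with the head-to-tail search described earlier, can be sketched in Python as follows. The class and method names (`Node`, `LinkedList`, `add_front`, `find`) are placeholders chosen for this sketch, not names from the course code.

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.next = None


class LinkedList:
    def __init__(self):
        self.head = None

    def add_front(self, value):
        # Note: no duplicate check here; this only shows the two pointer steps
        new = Node(value)
        new.next = self.head   # step 1: new.next = head
        self.head = new        # step 2: head = new

    def find(self, value):
        node = self.head
        while node is not None:      # follow the links from the head
            if node.value == value:
                return True
            node = node.next
        return False                 # walked off the end: not there
```

Adding touches only two references no matter how long the list is, while `find` may have to visit every node.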

Linked List Removing

This is a little bit trickier, but the basics are the same. We have to find the element and then reassign pointers to bypass the element we're trying to remove (assuming there's a garbage collector around to clean up the orphaned node). Let's assume that it's actually there (otherwise, we just get to the end and don't do anything). Let's also assume that it's somewhere in the middle (you'd have to handle it being at the beginning as a special case). Then the idea is to walk the list until we reach the node just before the one we want to remove, and point that node's next link at the node just after it.
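Here is a minimal Python sketch of that pointer reassignment. Unlike the simplified description, it also handles the head and not-found cases; the names `Node`, `remove`, and `to_list` are assumptions made for this example.

```python
class Node:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next


def remove(head, value):
    """Unlink the first node holding value; return the (possibly new) head."""
    if head is None:
        return None
    if head.value == value:              # special case: removing the head
        return head.next
    prev = head
    while prev.next is not None:
        if prev.next.value == value:
            prev.next = prev.next.next   # bypass the node being removed
            return head
        prev = prev.next
    return head                          # reached the end: nothing to do


def to_list(head):
    """Collect the values in order, just for inspecting the result."""
    values = []
    while head is not None:
        values.append(head.value)
        head = head.next
    return values
```

For example, calling `remove` on the list 1 → 2 → 3 with value 2 leaves 1 → 3, and the node holding 2 is left for the garbage collector.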

Summary of Basic Implementations

We've made improvements with the sorted array and linked list, but it seems like we're playing a game of whack-a-mole; we made adding more efficient with a linked list, but searching got slow again. How can we improve this?

Intro To Hashing

The solution to a more efficient Set ADT is actually incredibly simple. We just have to combine arrays and linked lists in a clever way. What we will do is split up our data into a bunch of buckets. Each bucket will hold one small chunk of our data. When we want to find an element, we will be able to jump directly to the bucket that would contain it if it were in our set (and likewise for removing and adding).

Students had some neat ideas in class about creating buckets. For instance, if we knew the range of our numbers was from $a$ to $b$ and we wanted N buckets, then we could set up the buckets so that each one covered an equal-width slice of that range.

For instance, if we wanted 4 buckets for numbers between 0 and 99, we could do buckets with ranges $[0, 24], [25, 49], [50, 74], [75, 99]$
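One way to compute which equal-width bucket a number falls into is sketched below. The function `range_bucket` and its integer formula are assumptions made for this illustration (they reproduce the four ranges in the example above), not necessarily the formula used in class.

```python
def range_bucket(x, a, b, num_buckets):
    """Index of the equal-width bucket holding integer x, where a <= x <= b."""
    if not a <= x <= b:
        # This is exactly the weakness of range-based buckets
        raise ValueError("x is outside the range [a, b]")
    # Scale the offset (x - a) from a range of size (b - a + 1) down to num_buckets
    return (x - a) * num_buckets // (b - a + 1)


# With 4 buckets over [0, 99], the boundaries land at 24/25, 49/50, 74/75
indices = [range_bucket(x, 0, 99, 4) for x in (0, 24, 25, 49, 50, 74, 75, 99)]
print(indices)
```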

The code for this would be simple enough, but things would break as soon as we added a number that fell outside the range.

Actually, there's a solution even simpler than this that works in a more general case where we don't have to assume the range! Given $b$ buckets, we simply map a number $x$ to $x \% b$! For example, if we chose $b=2$, all even numbers would go into one bucket, and all odd numbers go into another bucket.

Let's look at a few different ways to implement this with numbers in Python. We'll cheat and use a random access list for each bucket for now, but technically each bucket contains a linked list, since these are more efficient than unsorted arrays by our above discussion. You'll implement the linked list version in your first homework.
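A minimal sketch of the modulus idea, assuming integer elements and using plain Python lists as the buckets (the "cheat" mentioned above). The class name `HashSet` and its method names are placeholders for this example.

```python
class HashSet:
    def __init__(self, num_buckets):
        # One initially empty bucket per possible remainder
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, x):
        # Jump straight to the only bucket that could contain x
        return self.buckets[x % len(self.buckets)]

    def find(self, x):
        return x in self._bucket(x)

    def add(self, x):
        bucket = self._bucket(x)
        if x not in bucket:      # keep exactly one copy of each element
            bucket.append(x)

    def remove(self, x):
        bucket = self._bucket(x)
        if x in bucket:          # removing a missing element does nothing
            bucket.remove(x)
```

With `HashSet(2)`, for instance, even numbers land in bucket 0 and odd numbers in bucket 1, and each operation only ever scans one bucket.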

Here's an even more compact solution using numpy

One thing we discussed in class was the space/time tradeoff in choosing the number of buckets. If we use more buckets, fewer elements "collide" (i.e. occupy the same bucket), so it doesn't take as much time to find what we're looking for or to add/remove something. However, it takes more memory to store the buckets. But actually, if we use a number of buckets on the order of the number of elements we expect to have, this isn't terrible; we're just roughly doubling the amount of memory we need, and in exchange our operations become more efficient. Let's do this on the example below

By contrast, here's only using 3 buckets
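To make the comparison concrete, here is a small sketch that fills buckets with the modulus rule and reports the longest bucket for two bucket counts. The helper `fill_buckets` and the ten sample values are made up for this illustration.

```python
def fill_buckets(values, num_buckets):
    """Distribute integers into num_buckets lists using x % num_buckets."""
    buckets = [[] for _ in range(num_buckets)]
    for x in values:
        bucket = buckets[x % num_buckets]
        if x not in bucket:
            bucket.append(x)
    return buckets


values = [17, 4, 29, 8, 33, 12, 50, 21, 46, 27]   # hypothetical sample data

many = fill_buckets(values, len(values))   # as many buckets as elements
few = fill_buckets(values, 3)              # only 3 buckets

# The longest bucket is the worst case we would have to scan
longest_many = max(len(b) for b in many)
longest_few = max(len(b) for b in few)
print(longest_many, longest_few)
```

With these particular values, the ten-bucket table has a single collision, while the three-bucket table piles several elements into each bucket.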

Hash codes

In order to extend what we did above to objects, we need to define a consistent way of turning our objects into numbers. This is known as a hash code. A hash code should be deterministic; that is, the same object should always produce the same number. For example, a hash code for a person could be the month of the year in which they were born. Below is an example of what a hash table would look like for Harry Potter characters based on this hash function (using the birthdays of the actors who played them in the movies). Notice how the linked lists are set up in each bucket.
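As a rough sketch of the birth-month idea, the snippet below gives each object a `hash_code` method and uses it to pick a bucket. The `Person` class, the placeholder names, and the months are all hypothetical (they are not the actors' real birthdays), and plain lists stand in for the linked lists in each bucket.

```python
class Person:
    def __init__(self, name, birth_month):
        self.name = name
        self.birth_month = birth_month   # 1 (January) through 12 (December)

    def hash_code(self):
        # Deterministic: the same person always yields the same number
        return self.birth_month


def build_table(people, num_buckets):
    table = [[] for _ in range(num_buckets)]
    for p in people:
        table[p.hash_code() % num_buckets].append(p.name)
    return table


# Hypothetical people with made-up birth months
people = [Person("A", 7), Person("B", 4), Person("C", 7), Person("D", 12)]
table = build_table(people, 12)
```

Here "A" and "C" share a birth month, so they end up chained together in the same bucket.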

You can examine such hash codes live for Harry Potter characters at this link.

We started to discuss some of the theoretical properties of hash functions, starting with the pigeonhole principle. In this context, if we have more objects than we do buckets, then we are guaranteed to have a collision.
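The pigeonhole principle can be checked on small cases with a sketch like the following; `has_collision` is a helper name invented for this example, and the hash codes are assumed to be integers mapped to buckets with the modulus rule.

```python
from collections import Counter


def has_collision(hash_codes, num_buckets):
    """True if at least two hash codes land in the same bucket."""
    counts = Counter(code % num_buckets for code in hash_codes)
    return any(count > 1 for count in counts.values())


# 12 distinct codes can spread perfectly over 12 buckets...
no_clash = has_collision(range(12), 12)
# ...but with 13 codes, some bucket must receive at least two of them
clash = has_collision(range(13), 12)
print(no_clash, clash)
```

No matter which 13 codes we pick, the second call returns `True`, since 13 objects cannot fit into 12 buckets one per bucket.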