Hash Functions

The purpose of this class is to familiarize yourself with hash tables, and particularly with hash functions and choosing a good hash function.

Lecture/Live-Code

Why Hash Tables?

So far, we’ve been concerned primarily with storing data that we can access with a few operations:

  • Searching for data
  • Adding data
  • Removing data

What if there was a way to do all of these in constant time?

Introducing: hash tables!

How do Hash Tables Work?

Hash tables take some input and use a hash function to map them to an index in an array.

That’s really all they do. The complexity comes from trying to choose a good hash function; that is, one that performs this mapping in a way that allows us to have constant-time search, addition, and removal.

“Good” vs. “Bad” Hashes

All a hash function must do is consistently map the same input to the same output. But not all hash functions are equal.

A “good” hash function is one that avoids collisions in the data. It should also be fast at doing its hashing; why?

Think about which of these might be “good” hash functions and which might be “bad” hash functions if we are trying to, for example, create hashes for all of the students at Lawrence University?

  • The student’s height in centimeters
  • The student’s age in years
  • The student’s age in seconds
  • The student’s birthday in MMDDYYYY format
  • The number of J’s in the student’s name
  • The number of letters in the student’s name

A Good Hash Function

Come up with a decent hash function for the names of all the students in this class. Remember that your hash function should:

  • Be fast
  • Be reliable
  • Avoid collisions when possible

Hash Collisions

There will always be collisions in hash tables, so we should have some strategies for dealing with them. A few of these strategies are:

  • Chaining: Linked lists or trees that hold data at a particular location
  • Linear probing: If your current spot is taken, find the next spot in line
  • Double-hashing: If your current spot is taken, find the spot plus some randomly generated hashed value

How do each of these affect the worst-case runtime? The average case?