ℹ️ Skipped - page is already crawled
| Filter | Status | Condition | Details |
|---|---|---|---|
| HTTP status | PASS | download_http_code = 200 | HTTP 200 |
| Age cutoff | PASS | download_stamp > now() - 6 MONTH | 2.2 months ago |
| History drop | PASS | isNull(history_drop_reason) | No drop reason |
| Spam/ban | PASS | fh_dont_index != 1 AND ml_spam_score = 0 | ml_spam_score=0 |
| Canonical | PASS | meta_canonical IS NULL OR = '' OR = src_unparsed | Not set |
| Property | Value |
|---|---|
| URL | https://www.cs.siue.edu/~marmcke/docs/cs340/hashing.html |
| Last Crawled | 2026-02-09 15:06:19 (2 months ago) |
| First Indexed | 2025-01-18 07:09:59 (1 year ago) |
| HTTP Status Code | 200 |
| Meta Title | Hashing and Hash Tables — CS 340: Algorithms and Data Structures 1.0 documentation |
| Meta Description | null |
| Meta Canonical | null |
| Markdown | 
# Hashing and Hash Tables
Our discussions on trees centered around a data structure that stored items efficiently, but getting balanced-height trees made things tough to implement. Instead of focusing so much on the structure, hashing takes the approach that the structure should be rather simple. The downside is that iterating over elements does not come for free, as it does in trees, but it is possible with a few tricks.
So, for hashing we are looking at simple structures, usually arrays. We will manage the size of the array to be not too much bigger than the amount of data stored (like C++ vectors), to preserve iteration in linear time.
## Hash Tables
The basic question is “Why not just use an array as a table?” It’s a good question...
Let’s think about a table containing products that a store wants to keep track of. Here is an example. There are serious problems with this basic approach. What are they?
[](https://www.cs.siue.edu/~marmcke/docs/cs340/_images/hSimple.png)
Real World Data
Here are some examples of real world data we may want to store in a table. If we are simply using the numbers as array indexes, then:
1. How big of an array do we need?
2. How much of the array will actually get used?
- Students: Student ID (9 digits)
- People: SSN (9 digits)
- ZIP code: 5 digits, 9 digits
- ISBN: 10 digits
- UPC: 12 digits
- Others: Character strings
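To put rough numbers on the two questions above, here is a minimal sketch. The helper name `direct_table_stats` and the figure of 15,000 students are made up for illustration; they are not from the notes.

```python
def direct_table_stats(key_digits: int, items_stored: int) -> tuple[int, float]:
    """Return (slots needed, fraction of slots used) when the key itself is the index."""
    slots = 10 ** key_digits            # one slot for every possible key value
    return slots, items_stored / slots


# Assumed example: 15,000 students, each with a 9-digit student ID.
slots, used = direct_table_stats(key_digits=9, items_stored=15_000)
print(f"{slots:,} slots needed, {used:.6%} of them used")
```

Even for a modest number of records, nearly all of a directly indexed array would sit empty, which is the motivation for mapping keys into a much smaller table.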
### Basic Hash Tables
A **Hash Table** consists of 2 parts:
1. a **table** (an array), and
2. a **hash function** that converts key values to array indices (used for insert/delete/search).
A hash function can really be anything, but there are some recipes for reliably good ones. Here are a couple of examples that might work out in specific cases:
- **Use certain digits from a long number.** ex: last 4 digits of student ID. Will this work at our university?
- **Folding.** Use some function to get a smaller range of values. ex: add the digits of student ID. Will this work at our university?
A Basic Hash Table Example
size: 5
Hash function: Add first and last digits, then mod the result by the table size.
Here is the table:
[](https://www.cs.siue.edu/~marmcke/docs/cs340/_images/hEmptyTable.png)
Insert the following:
1. 349587 
2. 98745 
3. 84743 
Now find the same numbers in the hash table. (just apply the formula and look for them)
Now insert:
- 24544. **Collision with 84743!**
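The example can be replayed in a few lines. This is only a sketch of the toy hash function described above; the names `basic_hash` and `table` are chosen here for illustration.

```python
TABLE_SIZE = 5


def basic_hash(key: int, table_size: int = TABLE_SIZE) -> int:
    """Add the first and last decimal digits, then mod by the table size."""
    digits = str(key)
    return (int(digits[0]) + int(digits[-1])) % table_size


table = [None] * TABLE_SIZE
for key in (349587, 98745, 84743, 24544):
    slot = basic_hash(key)
    if table[slot] is None:
        table[slot] = key               # free slot: store the key
    else:
        # No collision handling yet: just report what happened.
        print(f"collision: {key} and {table[slot]} both map to slot {slot}")
```

The first three keys land in different slots; the fourth maps to a slot that is already taken, which is the collision the notes point out.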
Collisions are a problem, but there are various ways to handle them:
Open addressing collision handling methods:
1. Linear probing – look for next open spot
2. Quadratic probing
3. Double Hashing
4. Increase the table size
Multiple-Item Storage collision handling methods
1. Buckets
2. Chaining
### Open Addressing: Linear Probing
If there is a collision, just look for the next open slot and insert the item there.
[](https://www.cs.siue.edu/~marmcke/docs/cs340/_images/hLinearProbe.png)
**Deleting**. Deleting is problematic, since removing an item leaves an empty slot that can cut off the probe sequence for items stored after it. Instead of actually deleting items, mark them as deleted (lazy delete!).
What happens when the hash table fills up?
One problem with linear probing is that it can lead to a degenerate situation where items that map to the same portion of the hash table overflow into neighboring parts of the table, causing a cascading series of probes for lots of items (this is known as primary clustering).
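Here is a minimal sketch of linear probing with lazy deletion. It assumes integer keys and the simplest possible hash function (key mod table size); the class name `LinearProbeTable` and the `EMPTY`/`DELETED` markers are illustrative choices, not part of the notes.

```python
EMPTY = object()      # slot has never held a key
DELETED = object()    # tombstone: slot held a key that was lazily deleted


class LinearProbeTable:
    def __init__(self, size: int):
        self.slots = [EMPTY] * size

    def _probe(self, key: int):
        """Yield slot indices from the home slot onward, wrapping around once."""
        home = key % len(self.slots)    # assumed hash function: key mod size
        for i in range(len(self.slots)):
            yield (home + i) % len(self.slots)

    def contains(self, key: int) -> bool:
        for idx in self._probe(key):
            if self.slots[idx] is EMPTY:
                return False            # a never-used slot ends the search
            if self.slots[idx] is not DELETED and self.slots[idx] == key:
                return True
        return False

    def insert(self, key: int) -> bool:
        if self.contains(key):
            return True
        for idx in self._probe(key):
            if self.slots[idx] is EMPTY or self.slots[idx] is DELETED:
                self.slots[idx] = key   # tombstones can be reused
                return True
        return False                    # every slot is occupied

    def delete(self, key: int) -> None:
        for idx in self._probe(key):
            if self.slots[idx] is EMPTY:
                return
            if self.slots[idx] is not DELETED and self.slots[idx] == key:
                self.slots[idx] = DELETED   # mark the slot, do not empty it
                return


t = LinearProbeTable(5)
t.insert(11)          # home slot 1
t.insert(21)          # home slot 1 is taken, so it goes to slot 2
t.delete(11)          # slot 1 becomes a tombstone
print(t.contains(21)) # the tombstone keeps the probe going past slot 1
```

If `delete` reset the slot to `EMPTY` instead, the search for 21 would stop at slot 1 and wrongly report that 21 is missing.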
### Open Addressing: Quadratic Probing
Instead of just looking at the next slot for an opening, follow a quadratic sequence of offsets from the original slot (1, 4, 9, 16, ...).
[](https://www.cs.siue.edu/~marmcke/docs/cs340/_images/hQuadProbe.png)
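This resolves some of the clustering problems of linear probing, but insertion can fail even when the hash table is not full. Choosing the table size carefully helps: with a prime table size and a table that is at most half full, quadratic probing is guaranteed to find an open slot.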
What do we do when insertion fails?
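To see why insertion can fail on a non-full table, it helps to list which slots a probe sequence can ever reach. The sketch below assumes the probe offsets are i² for i = 0, 1, 2, ... (other quadratic sequences are also in use); the function name is illustrative.

```python
def quadratic_probe_slots(home: int, table_size: int) -> list[int]:
    """Distinct slots reached by probing home + i*i (mod table_size)."""
    seen = []
    for i in range(table_size):
        slot = (home + i * i) % table_size
        if slot not in seen:
            seen.append(slot)
    return seen


print(quadratic_probe_slots(0, 8))   # table size 8
print(quadratic_probe_slots(0, 7))   # table size 7 (prime)
```

With 8 slots the probe only ever reaches 3 of them, so an insert can fail while most of the table is empty. With 7 slots it reaches 4, just over half the table.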
### Open Addressing: Double Hashing (or triple, or quadruple for that matter)
If a collision occurs on one hash function, simply use another one.
What do we do if a collision occurs on all hash functions?
The hash functions can be tried in parallel! However, hash functions are typically cheap to compute, so the gain is usually small.
### Open Addressing: Increase the Table Size
Make a new array with more room. **How much more room?**
Insert each item into the new array. **Do we reuse the same hash functions?**
Delete the old array.
When should this occur? When insertion fails? When the table is full? When the table is almost full? When a certain percentage of probes have occurred?
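A minimal sketch of the rebuild step, under several assumptions that are not from the notes: integer keys, `None` for empty slots, key mod table size as the hash function, and linear probing for collisions. The function name `grow` is illustrative.

```python
def grow(old_slots: list, new_size: int) -> list:
    """Rebuild an open-addressing table in a larger array.

    Every key is hashed again, because its slot depends on the table size.
    new_size must be larger than the number of stored keys.
    """
    new_slots = [None] * new_size
    for key in old_slots:
        if key is None:
            continue
        idx = key % new_size                # assumed hash: key mod table size
        while new_slots[idx] is not None:   # linear probing on collision
            idx = (idx + 1) % new_size
        new_slots[idx] = key
    return new_slots


old = [10, 21, 5, 13, None]   # size 5; key 5 was pushed from slot 0 to slot 2
new = grow(old, 11)           # assumed policy: roughly double, then pick a prime
print(new)
```

Note that the items cannot simply be copied across slot by slot: key 13 sits in slot 3 of the old table but belongs in slot 2 of the new one.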
### Multi-item Storage: Buckets
The idea is simple: keep room for more than one item at each table location. Pick a fixed number of items per location, and possibly use an additional open addressing strategy if the bucket fills.
### Multi-item Storage: Chaining
Use a linked list (or other ADT) at each table location.
You might need to consider increasing the table size if a list (or lists) gets too long.
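A minimal chaining sketch follows. It uses a Python list at each location in place of a linked list, assumes integer keys, and uses key mod table size as the hash function; the class and method names are illustrative.

```python
class ChainedTable:
    def __init__(self, size: int):
        self.buckets = [[] for _ in range(size)]    # one chain per table location

    def _chain(self, key: int) -> list:
        return self.buckets[key % len(self.buckets)]   # assumed hash: key mod size

    def insert(self, key: int) -> None:
        chain = self._chain(key)
        if key not in chain:
            chain.append(key)           # collisions just extend the chain

    def contains(self, key: int) -> bool:
        return key in self._chain(key)

    def longest_chain(self) -> int:
        """A simple signal for deciding when the table should grow."""
        return max(len(chain) for chain in self.buckets)


t = ChainedTable(5)
for key in (10, 15, 20, 3):
    t.insert(key)
print(t.longest_chain())    # 10, 15 and 20 all share location 0
```

Insertion never fails here, but a long chain turns a lookup into a linear scan, which is why chain length is worth watching.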
### String Hash Functions
String hash functions are a little tougher. Here are some examples (using a = 1, b = 2, ..., z = 26):
- Add the numeric values of the first five characters:
  1. burner = 2+21+18+14+5 = 60
  2. scanner = 19+3+1+14+14 = 51
  3. camera = 3+1+13+5+18 = 40
  4. tablet = 20+1+2+12+5 = 40
  Values range from 1 to 130
- Concatenate the positional values of the first five characters:
  1. burner = 2 21 18 14 5 = 22,118,145
  2. scanner = 19 3 1 14 14 = 19,311,414
  3. camera = 3 1 13 5 18 = 3,113,518
  4. tablet = 20 1 2 12 5 = 2,012,125
  Values range from 1,048,576 to 28,142,426
These simple schemes collide easily (camera and tablet both sum to 40). Instead, look into production hash functions: MD5, SHA, etc.
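The two toy schemes can be sketched as follows, assuming lowercase ASCII words and the letter numbering a = 1 through z = 26. The function names are made up for this sketch.

```python
def letter_value(ch: str) -> int:
    """Position of a letter in the alphabet: a = 1, ..., z = 26."""
    return ord(ch.lower()) - ord("a") + 1


def sum_hash(word: str) -> int:
    """Add the letter values of the first five characters."""
    return sum(letter_value(c) for c in word[:5])


def concat_hash(word: str) -> int:
    """Write the letter values of the first five characters side by side."""
    return int("".join(str(letter_value(c)) for c in word[:5]))


for word in ("burner", "scanner", "camera", "tablet"):
    print(word, sum_hash(word), concat_hash(word))
```

Running this shows "camera" and "tablet" getting the same value under `sum_hash`, since addition ignores the order of the letters.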
### Good Hash Functions
Good hash functions are easy to compute, and distribute values evenly throughout the table.
Here is a recipe:
$$h(x) = \left(g(x) \bmod p\right) \bmod m$$
Where:
- $m$ is the size of the table.
- $p$ is a prime number larger than any number that will be hashed. Some useful values: 4,294,967,291 is the largest **unsigned** 32-bit prime; 2,147,483,647 is the largest **signed** 32-bit prime; 18,446,744,073,709,551,557 is the largest **unsigned** 64-bit prime; 9,223,372,036,854,775,783 is the largest **signed** 64-bit prime. **NOTE:** you MUST BE CAREFUL with integer overflows when performing these calculations!
- $a, b$ are positive constants, both less than $p$ (in particular, $a \neq 0$).
- $g(x)$ is some interaction of $a, b$ and $x$. A good basic choice is $g(x) = ax + b$.
Multiple hash functions can be easily generated by choosing random values of $a$ and $b$.
One implementation note: with a 32-bit $p$, intermediate values need to be 64-bit to prevent overflow!
This type of hash function belongs to a *2-universal family* of hash functions: two distinct items collide with probability at most $1/m$ (just make sure $x < p$ for all $x$).
Let’s try this out in class.
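A minimal sketch of the recipe, assuming it takes the common form h(x) = ((a·x + b) mod p) mod m. The factory name `make_hash` is illustrative. Python integers do not overflow, so the 64-bit caveat applies when porting this to a fixed-width language such as C++.

```python
import random

P = 4_294_967_291    # largest prime below 2**32, from the list above


def make_hash(m: int, rng: random.Random):
    """Return one randomly chosen hash function mapping keys x < P into range(m)."""
    a = rng.randrange(1, P)      # positive constant less than p
    b = rng.randrange(1, P)      # positive constant less than p

    def h(x: int) -> int:
        # In a fixed-width language, a * x needs a 64-bit intermediate.
        return ((a * x + b) % P) % m

    return h


rng = random.Random(340)         # fixed seed so the sketch is repeatable
h1 = make_hash(11, rng)
h2 = make_hash(11, rng)          # a second, independently chosen function
print([h1(x) for x in range(5)], [h2(x) for x in range(5)])
```

Each call to `make_hash` draws fresh values of `a` and `b`, which is all it takes to get another hash function from the same family.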
### Load Factor of a Hash Table
The load factor for a hash table is $\alpha = n / m$, the number of stored items $n$ divided by the table size $m$. For open addressing this is between 0 and 1. A high load factor indicates the hash table is almost full, and you might want to think about resizing it.
### A final note
Make sure the hash table size is ODD. Prime numbers help too.
No matter what, a hash table doesn’t store data in sorted order!
## Bloom Filters
Bloom filters are one of the coolest data structures around, but they are not used too often.
The idea behind a Bloom filter is that we don’t always need to store every item; sometimes we just need to record the fact that the item exists. This is known as **membership testing**.
Question
What are some example applications that might use membership testing?
Bloom filters don’t store the actual data; they keep track of a bit string. Also, they do not use a single hash function, but a group of $k$ hash functions. Each hash function addresses a single bit in the array.
Example
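- Hash\_function1( int ): sum the first pair of digits, then mod by the size of the bit string
- Hash\_function2( int ): sum the 2nd pair of digits, then mod by the size of the bit string
- Hash\_function3( int ): sum the 3rd pair of digits, then mod by the size of the bit string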
Now insert: 937789, 932243, 106616
[](https://www.cs.siue.edu/~marmcke/docs/cs340/_images/hBloomFilt.png)
Does this number exist in our set? 134898
Not so fast! The thing about Bloom filters is that a “yes” is not always right, but a “no” is never wrong. Bloom filters may produce **False Positives**: a number is reported as being in the filter even though it was never explicitly put into the filter! This occurs by chance, when other numbers happen to set those same bits to 1. It becomes more likely as the Bloom filter begins to fill up!
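The example can be replayed with a short sketch. The bit-string length of 20 is an assumption, since the size used in the original figure is not stated in the text, and the helper names are made up here.

```python
M = 20    # assumed number of bits; the figure's actual size is not given in the text


def bit_positions(key: int, m: int = M) -> list[int]:
    """Three toy hash functions: the digit sum of each pair of digits, mod m."""
    d = str(key)                        # expects a 6-digit key
    return [(int(d[i]) + int(d[i + 1])) % m for i in (0, 2, 4)]


bits = [0] * M
for key in (937789, 932243, 106616):
    for pos in bit_positions(key):
        bits[pos] = 1                   # inserting only sets bits


def might_contain(key: int) -> bool:
    """True means 'possibly present'; False means 'definitely absent'."""
    return all(bits[pos] == 1 for pos in bit_positions(key))


print(bit_positions(134898), might_contain(134898))
```

134898 was never inserted, yet each of its three bits was set by one of the other numbers, so the filter answers "possibly present". This happens for any choice of `M`, because equal digit sums stay equal after the mod.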
So what do we do about false positives? CONTROL THEM!!
We just make the probability of false positives sufficiently small.
From <http://en.wikipedia.org/wiki/Bloom_filter> :
[](https://www.cs.siue.edu/~marmcke/docs/cs340/_images/hBloomFP1.png)
[](https://www.cs.siue.edu/~marmcke/docs/cs340/_images/hBloomFP2.png)
[](https://www.cs.siue.edu/~marmcke/docs/cs340/_images/hBloomFP3.png)
- We can determine $n$, the number of items we need to store
- Then choose an appropriate $m$ (the number of bits), compute the optimal $k$ (the number of hash functions), and see if the false positive probability is within acceptable bounds
- If not, try new values for $m$ or $k$
Or just look at the table:
[](https://www.cs.siue.edu/~marmcke/docs/cs340/_images/hBloomFP4.png)
Question
What are some example applications that might use membership testing now that we have to deal with false positives?
Example
Let’s say I want to do existence testing for a set of 1,000,000 items.
- If an item is a 4-byte integer, that means we would need to store 4,000,000 bytes ~= 3.8 MB of data
- We would need roughly twice that to prevent saturation in a hash table ~= 7.6 MB
- What we are storing may be much larger than an integer!
In the case of the Bloom Filter...
- Let’s say we want a false positive rate of 0.001
- According to the table, we need m/n = 15 and k = 7
- So, we need 15,000,000 bits ~= 1.8 MB. That’s a quarter of the storage of the hash table, and no matter how large the items are, it will always take 1.8 MB
- We just need to make sure that we can come up with 7 independent hash functions! But that’s easy with our hash function recipe above!
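The arithmetic in this example can be checked with a short sketch. It assumes the standard approximation for the false positive probability, p ≈ (1 − e^(−kn/m))^k, and the standard optimum k = (m/n)·ln 2, both as given in the Wikipedia article linked above. The function names are illustrative.

```python
import math


def false_positive_rate(m_over_n: float, k: int) -> float:
    """Approximate false positive probability for m/n bits per item and k hash functions."""
    return (1.0 - math.exp(-k / m_over_n)) ** k


def optimal_k(m_over_n: float) -> float:
    """Number of hash functions that minimizes the false positive probability."""
    return m_over_n * math.log(2)


n = 1_000_000                            # items to track
m = 15 * n                               # bits in the filter (m/n = 15)

rate = false_positive_rate(m / n, k=7)
bloom_mb = m / 8 / 2**20                 # bits -> bytes -> MB (2**20 bytes)
ints_mb = 4 * n / 2**20                  # raw 4-byte integers, no table overhead

print(f"false positive rate ~ {rate:.5f}")
print(f"Bloom filter: {bloom_mb:.1f} MB, raw integers: {ints_mb:.1f} MB")
print(f"optimal k for m/n = 15: {optimal_k(15):.1f}")
```

With m/n = 15, k = 7 already reaches a rate of about 0.001. The optimum of roughly 10 hash functions would lower the rate further, at the cost of more hashing per operation.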
Finally, some properties of Bloom Filters:
- A BF can represent an entire universe of elements, whereas a hash table runs out of space. Also, a hash table must explicitly store all elements
- Union and intersection of Bloom filters are possible using bitwise operations on the bit strings (super useful for combining tables in databases without having to rebuild indexes)
- If the filter becomes saturated, we have to change the $m/n$ ratio. There are types of BFs that grow like vectors!
- There are tons of variants for all types of applications: Bloomier filters, counting filters, stable Bloom filters, scalable filters
- Deletion is a problem! What happens if we try to delete something?
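The union property is easy to check with a sketch that stores each filter's bit string in a single Python integer. It assumes both filters use the same length and the same hash functions; the length of 20 and the digit-pair hash functions are illustrative choices.

```python
M = 20    # assumed bit-string length, shared by both filters


def bit_positions(key: int) -> list[int]:
    """Assumed toy hash functions: digit-pair sums of a 6-digit key, mod M."""
    d = str(key)
    return [(int(d[i]) + int(d[i + 1])) % M for i in (0, 2, 4)]


def build(keys) -> int:
    """Build a filter as an integer whose binary digits are the bit string."""
    bits = 0
    for key in keys:
        for pos in bit_positions(key):
            bits |= 1 << pos
    return bits


a = build([937789, 932243])
b = build([106616])
union = a | b                            # bitwise OR merges the two filters
print(union == build([937789, 932243, 106616]))
```

The OR of the two bit strings is exactly the filter you would get by inserting all the items into one filter. Clearing bits to delete an item is not safe in the same way, because a bit may be shared with other items.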
Let’s practice a little...
© Copyright 2014, M McKenney. Created using [Sphinx](http://sphinx-doc.org/) 1.3. |
| Readable Markdown | null |
| Shard | 50 (laksa) |
| Root Hash | 11131842051709627250 |
| Unparsed URL | edu,siue!cs,www,/~marmcke/docs/cs340/hashing.html s443 |