ℹ️ Skipped - page is already crawled
| Filter | Status | Condition | Details |
|---|---|---|---|
| HTTP status | PASS | download_http_code = 200 | HTTP 200 |
| Age cutoff | PASS | download_stamp > now() - 6 MONTH | 0.2 months ago |
| History drop | PASS | isNull(history_drop_reason) | No drop reason |
| Spam/ban | PASS | fh_dont_index != 1 AND ml_spam_score = 0 | ml_spam_score=0 |
| Canonical | PASS | meta_canonical IS NULL OR = '' OR = src_unparsed | Not set |
| Property | Value |
|---|---|
| URL | https://blog.codinghorror.com/hashtables-pigeonholes-and-birthdays/ |
| Last Crawled | 2026-04-13 05:13:08 (5 days ago) |
| First Indexed | 2016-06-04 02:18:02 (9 years ago) |
| HTTP Status Code | 200 |
| Meta Title | Hashtables, Pigeonholes, and Birthdays |
| Meta Description | null |
| Meta Canonical | null |
| Boilerpipe Text | One of the most beloved of all data structures in computer science is the hash table.
A hash table is a data structure that associates keys with values. The primary operation it supports efficiently is a lookup: given a key (e.g. a person's name), find the corresponding value (e.g. that person's telephone number). It works by transforming the key using a hash function into a hash, a number that is used to index into an array to locate the desired location ("bucket") where the values should be.
Key-value pairs are quite common in real world data, and hashtables are both reasonably efficient in storage and quite fast at lookups, offering O(1) performance in most cases. That's why hashtables are the go-to data structure for many programmers. It may not be the optimal choice, but unlike so many things in computer science, it's rarely a bad choice.
But hash tables do have one crucial weakness: they are only as good as the hash function driving them. As we add each new item to the hash table, we compute a hash value from the key for that item, and drop the item in the bucket represented by that hash value. So how many buckets do we need? Let's consider the extremes:
If we had one giant bucket, everything would get piled in together. We'd have to look at each and every item in our one bucket to find the one we want, which reduces us to worst-case performance: an O(n) linear search.
If we had exactly the same number of buckets as items, each item is placed in its own unique, individual bucket. We know each bucket will contain one, and only one, item. That's a perfect hash function, delivering best-case performance: an O(1) lookup.
Reality, of course, lies somewhere in between these two extremes. The choice of hash function is critical, so you don't end up with a bucket shortage. As you place more and more items in each bucket (i.e., "collisions") you edge closer to the slow O(n) end of the performance spectrum.
There's something magical about these hash functions that drive the hash table. The idea of the hash as a unique digital fingerprint for every chunk of data in the entire world is a fascinating one. It's a fingerprint that cleverly fits into a mere 32 bits of storage, yet is somehow able to uniquely identify any set of data ever created.
Of course, this is a lie, for several reasons. Let's start with the most obvious one. Consider all possible values of a 32-bit hash function:
2^32 ~= 4.3 billion
The current population of the earth is about 6.6 billion people. If we were to apply a perfect 32-bit hash function to the DNA of every man, woman, and child on the planet, we could not guarantee uniqueness: we simply don't have enough possible hash values to represent them all!
This is known as the pigeonhole principle. It's not complicated. If you try to put 6 pigeons in 5 holes, one will inevitably be left out in the cold.
You'll definitely want to use a large enough hash value so you can avoid the pigeonhole principle. How much you care about this depends on how many things you're planning to store in your hashtable, naturally.
The other reason hashes can fail as digital fingerprints is because collisions are a lot more likely than most people realize. The birthday paradox illustrates how quickly you can run into collision problems for small hash values. I distinctly remember the birthday paradox from my college calculus class, and I'll pose you the same question our TA asked us:
In a typical classroom of 30 students, what are the odds that two of the students will have the same birthday?
Don't read any further until you've taken a guess. What's your answer?
Everyone has completely unique DNA, but shares one of 365* possible birthdays with the rest of us. Birthdays are effectively a tiny 365 value hash function. Using such a small hash value, there's a 50% chance of two people sharing the same birthday after a mere 23 people. With the 30 students in our hypothetical classroom, the odds of two students having a shared birthday rise to 70%. The statistics don't lie: when the question was posed in that classroom so many years ago, there were in fact two students who shared the same birthday.
A rule of thumb for estimating the number of values you need to enter in a hashtable before you have a 50 percent chance of an existing collision is to take the square root of 1.4 times the number of possible hash values.
SQRT(1.4 * 365) = 23
SQRT(1.4 * 2^32) = 77,543
When using a 32-bit hash value, we have a 50% chance that a collision exists after about 77 thousand entries, a pretty far cry from the 4 billion possible values we could store in that 32-bit value. This is not a big deal for a hashtable; so what if a few of our buckets have more than one item? But it's a huge problem if you're relying on the hash as a unique digital fingerprint.
The hashing functions behind our precious hashtables may be a lie. But they're a convenient lie. They work. Just keep the pigeonhole principle and the birthday paradox in mind as you're using them, and you'll do fine.
*No, let's forget leap years for now. And other variables like birth patterns. Yes, I know this is how programmers think. Imagine how much it would suck to have one birthday every four years, though. Ouch. |
| Markdown |
- [Archive](https://blog.codinghorror.com/page/2/)
- [Discourse](https://www.discourse.org/)
- [Stack](https://www.stackexchange.com/)
- [RGMII](https://rgmii.org/)
- [Reading](https://blog.codinghorror.com/recommended-reading-for-developers/)
- [About](https://blog.codinghorror.com/about-me/)
- [Shop](https://blog.codinghorror.com/own-a-coding-horror/)
# Hashtables, Pigeonholes, and Birthdays
#### [Jeff Atwood](https://blog.codinghorror.com/author/jeff-atwood/)
06 Dec 2007
4 min read · [Comments](https://blog.codinghorror.com/hashtables-pigeonholes-and-birthdays/#discourse-comments)
One of the most beloved of all data structures in computer science is the [hash table](http://en.wikipedia.org/wiki/Hash_table).
> A hash table is a data structure that associates keys with values. The primary operation it supports efficiently is a lookup: given a key (e.g. a person's name), find the corresponding value (e.g. that person's telephone number). It works by transforming the key using a hash function into a hash, a number that is used to index into an array to locate the desired location ("bucket") where the values should be.
Key-value pairs are quite common in real world data, and hashtables are both reasonably efficient in storage and quite fast at lookups, offering O(1) performance in most cases. That's why hashtables are the go-to data structure for many programmers. It may not be the optimal choice, but unlike so many things in computer science, it's rarely a *bad* choice.
But hash tables do have one crucial weakness: **they are only as good as the hash function driving them**. As we add each new item to the hash table, we compute a hash value from the key for that item, and drop the item in the bucket represented by that hash value. So how many buckets do we need? Let's consider the extremes:
- If we had **one giant bucket**, everything would get piled in together. We'd have to look at each and every item in our one bucket to find the one we want, which reduces us to worst-case performance: an O(n) linear search.
- If we had **exactly the same number of buckets as items**, each item is placed in its own unique, individual bucket. We know each bucket will contain one, and *only* one, item. That's a perfect hash function, delivering best-case performance: an O(1) lookup.
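The two extremes above can be sketched as a toy separate-chaining table; this is a minimal illustration of the idea, not code from the post, and the class name, bucket counts, and sample phone-book entries are all made up:

```python
# A toy hash table using separate chaining: each bucket holds a list of
# (key, value) pairs, so colliding keys pile up in the same bucket.
class TinyHashTable:
    def __init__(self, num_buckets):
        self.buckets = [[] for _ in range(num_buckets)]

    def _index(self, key):
        # The hash function maps the key to one of the buckets.
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                # existing key: overwrite in place
                bucket[i] = (key, value)
                return
        bucket.append((key, value))     # new key: chain it onto the bucket

    def get(self, key):
        # Lookup scans only one bucket: O(1) while chains stay short,
        # degrading toward O(n) as collisions pile everything together.
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        raise KeyError(key)

# One giant bucket: every lookup is a linear scan (the O(n) extreme).
worst = TinyHashTable(1)
# Plenty of buckets: most chains hold a single item (the O(1) extreme).
better = TinyHashTable(64)
for table in (worst, better):
    table.put("alice", "555-1234")
    table.put("bob", "555-5678")
```

Both tables answer `get("alice")` correctly; the difference is that `worst` must walk a chain containing every item ever inserted.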
Reality, of course, lies somewhere in between these two extremes. The choice of hash function is critical, so you don't end up with a bucket shortage. As you place more and more items in each bucket (i.e., "collisions") you edge closer to the slow O(n) end of the performance spectrum.
There's something magical about these hash functions that drive the hash table. The idea of the hash as a [unique digital fingerprint](http://haacked.com/archive/2007/01/22/Identicons_as_Visual_Fingerprints.aspx) for every chunk of data in the entire world is a fascinating one. It's a fingerprint that cleverly fits into a mere 32 bits of storage, yet is somehow able to uniquely identify any set of data ever created.
Of course, **this is a lie**, for several reasons. Let's start with the most obvious one. Consider all possible values of a 32-bit hash function:
2^32 ~= 4.3 billion
The current population of the earth is about 6.6 billion people. If we were to apply a *perfect* 32-bit hash function to the DNA of every man, woman, and child on the planet, we could not guarantee uniqueness: **we simply don't have enough possible hash values to represent them all!**
This is known as the [pigeonhole principle](http://en.wikipedia.org/wiki/Pigeonhole_principle). It's not complicated. If you try to put 6 pigeons in 5 holes, one will inevitably be left out in the cold.
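The principle can be sanity-checked in a few lines: hash any six distinct keys into five buckets and some bucket must end up holding at least two, no matter how the hash function behaves (the keys and bucket count here are just example values):

```python
# Pigeonhole principle: 6 items into 5 buckets forces a collision.
def bucket_counts(keys, num_buckets):
    """Count how many keys land in each of num_buckets buckets."""
    counts = [0] * num_buckets
    for key in keys:
        counts[hash(key) % num_buckets] += 1
    return counts

counts = bucket_counts(["p1", "p2", "p3", "p4", "p5", "p6"], 5)
# With 6 keys and only 5 buckets, some bucket must contain >= 2 keys,
# regardless of which bucket each key hashed to.
assert max(counts) >= 2
```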

You'll definitely want to **use a large enough hash value** so you can avoid the pigeonhole principle. How much you care about this depends on how many things you're planning to store in your hashtable, naturally.
The other reason hashes can fail as digital fingerprints is because **collisions are a lot more likely than most people realize**. The [birthday paradox](http://en.wikipedia.org/wiki/Birthday_paradox) illustrates how quickly you can run into collision problems for small hash values. I distinctly remember the birthday paradox from [my college](http://www.virginia.edu/) calculus class, and I'll pose you the same question our TA asked us:
> In a typical classroom of 30 students, what are the odds that two of the students will have the same birthday?
Don't read any further until you've taken a guess. What's your answer?

Everyone has completely unique DNA, but shares one of 365\* possible birthdays with the rest of us. **Birthdays are effectively a tiny 365 value hash function.** Using such a small hash value, there's a 50% chance of two people sharing the same birthday after a mere *23 people*. With the 30 students in our hypothetical classroom, the odds of two students having a shared birthday rise to 70%. The statistics don't lie: when the question was posed in that classroom so many years ago, there were in fact two students who shared the same birthday.
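The 50% and 70% figures can be checked directly: the chance that n people all have distinct birthdays is the product of (1 - k/365) for k from 0 to n-1, and the collision probability is its complement. A short sketch (my code, not the post's; it assumes 365 equally likely birthdays):

```python
# Exact birthday-collision probability: one minus the chance that all
# n birthdays are distinct, assuming 365 equally likely days.
def p_shared_birthday(n, days=365):
    p_all_distinct = 1.0
    for k in range(n):
        p_all_distinct *= (days - k) / days
    return 1 - p_all_distinct

print(p_shared_birthday(23))  # first n where the odds cross 50%
print(p_shared_birthday(30))  # roughly 70% for a class of 30
```

Running this shows the probability first exceeding 0.5 at exactly 23 people, and sitting near 0.71 at 30.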
A rule of thumb for estimating the number of values you need to enter in a hashtable before you have a 50 percent chance of an existing collision is to take the square root of 1.4 times the number of possible hash values.
```
SQRT(1.4 * 365) = 23
SQRT(1.4 * 2^32) = 77,543
```
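The rule of thumb is easy to verify in code; the 1.4 is a rounding of 2·ln(2) ≈ 1.386, the constant that falls out of the standard birthday-problem approximation:

```python
import math

# Rule of thumb: ~50% collision odds after sqrt(1.4 * d) entries, where
# d is the number of possible hash values. The 1.4 rounds 2*ln(2) ~= 1.386
# from the birthday-problem approximation.
def fifty_percent_threshold(d):
    return math.sqrt(1.4 * d)

print(round(fifty_percent_threshold(365)))      # ~23 shared-birthday people
print(round(fifty_percent_threshold(2 ** 32)))  # ~77,543 for a 32-bit hash
```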
When using a 32-bit hash value, we have a 50% chance that a collision exists after about 77 thousand entries, a pretty far cry from the 4 billion possible values we could store in that 32-bit value. This is not a big deal for a hashtable; so what if a few of our buckets have more than one item? But it's a huge problem if you're relying on the hash as a unique digital fingerprint.
The hashing functions behind our precious hashtables may be a lie. **But they're a *convenient* lie.** They work. Just keep the pigeonhole principle and the birthday paradox in mind as you're using them, and you'll do fine.
\*No, let's forget leap years for now. And other variables like birth patterns. Yes, I know this is how programmers think. Imagine how much it would suck to have one birthday every four years, though. Ouch.
[hash function](https://blog.codinghorror.com/tag/hash-function/) [efficiency](https://blog.codinghorror.com/tag/efficiency/) [performance](https://blog.codinghorror.com/tag/performance/)
#### Written by Jeff Atwood
Indoor enthusiast. Co-founder of [Stack Overflow](https://stackoverflow.com/), [Discourse](https://www.discourse.org/), and [RGMII](https://rgmii.org/). Disclaimer: I have no idea what I'm talking about. Let's be kind to each other. Find me <https://infosec.exchange/@codinghorror>
[← Previous Post: Sharing The Customer's Pain](https://blog.codinghorror.com/sharing-the-customers-pain/)
[Next Post: The Danger of Naïveté →](https://blog.codinghorror.com/the-danger-of-naivete/)
Powered by [Ghost](https://ghost.org/) · Themed by [Obox](https://oboxthemes.com/) |
| Shard | 41 (laksa) |
| Root Hash | 6205400369951510841 |
| Unparsed URL | com,codinghorror!blog,/hashtables-pigeonholes-and-birthdays/ s443 |