🕷️ Crawler Inspector

URL Lookup

Direct Parameter Lookup

Raw Queries and Responses

1. Shard Calculation

Query:
Response:
Calculated Shard: 50 (from laksa169)

2. Crawled Status Check

Query:
Response:

3. Robots.txt Check

Query:
Response:

4. Spam/Ban Check

Query:
Response:

5. Seen Status Check

ℹ️ Skipped - page is already crawled

📄
INDEXABLE
✅
CRAWLED
2 months ago
🤖
ROBOTS ALLOWED

Page Info Filters

Filter | Status | Condition | Details
HTTP status | PASS | download_http_code = 200 | HTTP 200
Age cutoff | PASS | download_stamp > now() - 6 MONTH | 2.2 months ago
History drop | PASS | isNull(history_drop_reason) | No drop reason
Spam/ban | PASS | fh_dont_index != 1 AND ml_spam_score = 0 | ml_spam_score=0
Canonical | PASS | meta_canonical IS NULL OR = '' OR = src_unparsed | Not set

Page Details

Property | Value
URL | https://www.cs.siue.edu/~marmcke/docs/cs340/hashing.html
Last Crawled | 2026-02-09 15:06:19 (2 months ago)
First Indexed | 2025-01-18 07:09:59 (1 year ago)
HTTP Status Code | 200
Meta Title | Hashing and Hash Tables — CS 340: Algorithms and Data Structures 1.0 documentation
Meta Description | null
Meta Canonical | null
Boilerpipe Text
Our discussions on trees centered around a data structure that stored items efficiently, but getting balanced-height trees made the implementation tough. Instead of focusing so much on the structure, hashing takes the approach that the structure should be rather simple. The downside is that iterating over elements does not come for free, as it does in trees, but it is possible with a few tricks. So, for hashing we are looking at simple structures, usually arrays. We will keep the array not much bigger than the amount of data stored (like C++ vectors), to preserve iteration in linear time.

Hash Tables
The basic question is “Why not just use an array as a table?” It’s a good question. Let’s think about a table containing products that a store wants to keep track of. Here is an example. There are serious problems with this basic approach; what are they?

Real World Data
Here are some examples of real-world data we may want to store in a table. If we simply use the numbers as array indexes, then: How big an array do we need? How much of the array will actually get used? Students: Student ID (9 digits). People: SSN (9 digits). ZIP code: 5 digits, 9 digits. ISBN: 10 digits. UPC: 12 digits. Others: Character strings.

Basic Hash Tables
A Hash Table consists of 2 parts: a table (an array), and a hash function that converts key values to array indices (used for insert/delete/search). A hash function can really be anything, but there are some recipes for reliably good ones. Here are a couple of examples that might work out in specific cases: Use certain digits from a long number. ex: last 4 digits of student ID. Will this work at our university? Folding. Use some function to get a smaller range of values. ex: add the digits of student ID. Will this work at our university?

A Basic Hash Table Example
size: 5. Hash function: Add first and last digits, then mod the result by the table size. Here is the table: Insert the following: 349587, 98745, 84743. Now find the same numbers in the hash table (just apply the formula and look for them). Now insert: 24544 (2 + 4 = 6, and 6 mod 5 = 1). Collision with 84743! Collisions are a problem, but there are various ways to handle them. Open addressing collision handling methods: Linear probing – look for next open spot; Quadratic probing; Double Hashing; Increase the table size. Multiple-item storage collision handling methods: Buckets; Chaining.

Open Addressing: Linear Probing
If there is a collision, just look for the next open slot and insert the item there. Deleting. Deleting is problematic, since removing an item might break the probe sequence for items stored after it. Instead of actually deleting items, mark them as deleted (lazy delete!). What happens when the hash table fills up? One problem with linear probing is that it can lead to a degenerate situation where items that map to the same portion of a hash table overflow into other parts of the hash table, causing a cascading series of probes for lots of items.

Open Addressing: Quadratic Probing
Instead of just looking at the next slot for an opening, follow a quadratic sequence of offsets from the original index (1, 4, 9, 16, ...). Resolves some of the clustering problems of linear probing. Can fail even when the hash table is not full (an odd, preferably prime, table size helps). What do we do when insertion fails?

Open Addressing: Double Hashing (or triple, or quadruple for that matter)
If a collision occurs on one hash function, simply use another one. What do we do if a collision occurs on all hash functions? The hash functions can be tried in parallel, although hash functions are typically cheap to compute.

Open Addressing: Increase the Table Size
Make a new array with more room. How much more room? Insert each item into the new array. Do we reuse the same hash functions? Delete the old array. When should this occur? When insertion fails? When the table is full? When the table is almost full? When a certain percentage of probes have occurred?

Multi-item Storage: Buckets
The idea is simple: just keep room for more than 1 item at each table location. Pick a fixed number, and possibly use an additional open addressing strategy if the bucket fills.

Multi-item Storage: Chaining
Use a linked list (or other ADT) at each table location. Might need to consider increasing the table size if a list (or lists) gets too long.

String Hash Functions
String hash functions are a little tougher. Here are some examples (letters are numbered a = 1 through z = 26). Add the numeric values of the first five characters: burner = 2+21+18+14+5 = 60; scanner = 19+3+1+14+14 = 51; camera = 3+1+13+5+18 = 40; tablet = 20+1+2+12+5 = 40. Values range from 5 to 130. Concatenate the positional values of the first five characters: burner = 2 21 18 14 5 = 22,118,145; scanner = 19 3 1 14 14 = 19,311,414; camera = 3 1 13 5 18 = 3,113,518; tablet = 20 1 2 12 5 = 2,012,125. Values range from 11,111 to 2,626,262,626. Instead, look into production hash functions: MD5, SHA, etc.

Good Hash Functions
Good hash functions are easy to compute and distribute values evenly throughout the table. Here is a recipe: h(k) = ((f(a,k) + b) mod p) mod S. Where: S is the size of the table. p is a prime number larger than any number that will be hashed (4,294,967,291 is the largest unsigned 32-bit prime. 2,147,483,647 is the largest signed 32-bit prime. 18,446,744,073,709,551,557 is the largest unsigned 64-bit prime. 9,223,372,036,854,775,783 is the largest signed 64-bit prime. NOTE: you MUST BE CAREFUL with integer overflows when performing these calculations!) a and b are positive constants both less than p. Make sure a ≠ 0. f(a,k) is some interaction of a and k. A good basic choice is f(a,k) = a × k. Multiple hash functions can be easily generated by choosing random values of a and b. One implementation note: with a 32-bit p, intermediate values need to be 64-bit to prevent overflow! This type of hash function belongs to the 2-universal family of hash functions, with a probability of two items colliding ≤ 1/S (just make sure k < p for all k). Let’s try this out in class.

Load Factor of a Hash Table
The load factor for a hash table is numInsertedItems / numLocations. With open addressing this is between 0 and 1 (with chaining it can exceed 1). A high load factor indicates the hash table is almost full, and you might want to think about resizing it.

A final note
Make sure the hash table size is ODD. Prime numbers help too. No matter what, a hash table doesn’t store data in sorted order!

Bloom Filters
Bloom filters are one of the coolest data structures around, but they are not used too often. The idea behind a Bloom filter is that we don’t always need to store every item; sometimes we just need to record the fact that the item exists. This is known as membership testing. Question: What are some example applications that might use membership testing? Bloom filters don’t store the actual data; they keep track of a bit string. Also, they do not use a single hash function, but a group of k hash functions. Each hash function addresses a single bit in the array.

Example
Hash_function1( int ): mod the sum of the first two digits. Hash_function2( int ): mod sum of 2nd pair of digits. Hash_function3( int ): mod sum of 3rd pair of digits. Now insert: 937789, 932243, 106616. Does this number exist in our set? 134898. Not so fast! A Bloom filter’s answers are not always right: a “no” is always correct, but a “yes” may be wrong. Bloom filters may have False Positives: the filter reports that a number is present even though it was never explicitly put into the filter! This occurs by chance, when other numbers simply cause those bits to be 1. It becomes more likely as the Bloom filter begins to fill up! So what do we do about false positives? CONTROL THEM! We just make the probability of false positives sufficiently small. From http://en.wikipedia.org/wiki/Bloom_filter : (formulas shown as images on the original page). Or just look at the table: (table shown as an image on the original page). Question: What are some example applications that might use membership testing now that we have to deal with false positives?

Example
Let’s say I want to do existence testing for a set of 1,000,000 items. If an item is an integer, that means we would need to store 4,000,000 bytes ~= 3.8 MB of data. We would need roughly twice that to prevent saturation in a hash table ~= 7.6 MB. What we are storing may be much larger than an integer! In the case of the Bloom Filter... Let’s say we want a false positive rate of 0.001. According to the table, we need m/n = 15 and k = 7. So, we need 15,000,000 bits ~= 1.8 MB. That’s a quarter of the storage of the hash table, and, no matter how large the items are, it will always take 1.8 MB. We just need to make sure that we can come up with 7 independent hash functions! But that’s easy with our hash function recipe above!

Finally, some properties of Bloom Filters: A BF can represent an entire universe of elements, whereas a hash table runs out of space. Also, a hash table must explicitly store all elements. Union and intersection of Bloom filters are possible using bitwise operations on the bit strings (super useful for combining tables in databases without having to rebuild indexes). If the filter becomes saturated, we have to change the m/n ratio. There are types of BFs that grow like vectors! There are tons of variants for all types of applications: Bloomier filters, counting filters, stable Bloom filters, scalable filters. Deletion is a problem! What happens if we try to delete something? Let’s practice a little...
Markdown
![](https://www.cs.siue.edu/~marmcke/docs/cs340/_static/title.png) ### Navigation - [index](https://www.cs.siue.edu/~marmcke/docs/cs340/genindex.html "General Index") - [next](https://www.cs.siue.edu/~marmcke/docs/cs340/heaps.html "Heaps and Priority Queues") \| - [previous](https://www.cs.siue.edu/~marmcke/docs/cs340/reviewMultiThread.html "Review of Multithreaded Programming") \| - [CS 340: Algorithms and Data Structures 1.0 documentation](https://www.cs.siue.edu/~marmcke/docs/cs340/index.html) » ### [Table Of Contents](https://www.cs.siue.edu/~marmcke/docs/cs340/index.html) - [Hashing and Hash Tables](https://www.cs.siue.edu/~marmcke/docs/cs340/hashing.html) - [Hash Tables](https://www.cs.siue.edu/~marmcke/docs/cs340/hashing.html#hash-tables) - [Basic Hash Tables](https://www.cs.siue.edu/~marmcke/docs/cs340/hashing.html#basic-hash-tables) - [Open Addressing: Linear Probing](https://www.cs.siue.edu/~marmcke/docs/cs340/hashing.html#open-addressing-linear-probing) - [Open Addressing: Quadratic Probing](https://www.cs.siue.edu/~marmcke/docs/cs340/hashing.html#open-addressing-quadratic-probing) - [Open Addressing: Double Hashing (or triple, or quadruple for that matter)](https://www.cs.siue.edu/~marmcke/docs/cs340/hashing.html#open-addressing-double-hashing-or-triple-or-quadruple-for-that-matter) - [Open Addressing: Increase the Table Size](https://www.cs.siue.edu/~marmcke/docs/cs340/hashing.html#open-addressing-increase-the-table-size) - [Multi-item Storage: Buckets](https://www.cs.siue.edu/~marmcke/docs/cs340/hashing.html#multi-item-storage-buckets) - [Multi-item Storage: Chaining](https://www.cs.siue.edu/~marmcke/docs/cs340/hashing.html#multi-item-storage-chaining) - [String Hash Functions](https://www.cs.siue.edu/~marmcke/docs/cs340/hashing.html#string-hash-functions) - [Good Hash Functions](https://www.cs.siue.edu/~marmcke/docs/cs340/hashing.html#good-hash-functions) - [Load Factor of a Hash 
Table](https://www.cs.siue.edu/~marmcke/docs/cs340/hashing.html#load-factor-of-a-hash-table) - [A final note](https://www.cs.siue.edu/~marmcke/docs/cs340/hashing.html#a-final-note) - [Bloom Filters](https://www.cs.siue.edu/~marmcke/docs/cs340/hashing.html#bloom-filters)

# Hashing and Hash Tables

Our discussions on trees centered around a data structure that stored items efficiently, but getting balanced-height trees made the implementation tough. Instead of focusing so much on the structure, hashing takes the approach that the structure should be rather simple. The downside is that iterating over elements does not come for free, as it does in trees, but it is possible with a few tricks. So, for hashing we are looking at simple structures, usually arrays. We will keep the array not much bigger than the amount of data stored (like C++ vectors), to preserve iteration in linear time.

## Hash Tables

The basic question is “Why not just use an array as a table?” It’s a good question. Let’s think about a table containing products that a store wants to keep track of. Here is an example. There are serious problems with this basic approach; what are they?
[![\_images/hSimple.png](https://www.cs.siue.edu/~marmcke/docs/cs340/_images/hSimple.png)](https://www.cs.siue.edu/~marmcke/docs/cs340/_images/hSimple.png)

Real World Data

Here are some examples of real-world data we may want to store in a table. If we simply use the numbers as array indexes, then:

1. How big an array do we need?
2. How much of the array will actually get used?

- Students: Student ID (9 digits)
- People: SSN (9 digits)
- ZIP code: 5 digits, 9 digits
- ISBN: 10 digits
- UPC: 12 digits
- Others: Character strings

### Basic Hash Tables

A **Hash Table** consists of 2 parts:

1. a **table** (an array), and
2. a **hash function** that converts key values to array indices (used for insert/delete/search).

A hash function can really be anything, but there are some recipes for reliably good ones. Here are a couple of examples that might work out in specific cases:

- **Use certain digits from a long number.** ex: last 4 digits of student ID. Will this work at our university?
- **Folding.** Use some function to get a smaller range of values. ex: add the digits of student ID. Will this work at our university?

A Basic Hash Table Example

size: 5

Hash function: Add first and last digits, then mod the result by the table size.

Here is the table:

[![\_images/hEmptyTable.png](https://www.cs.siue.edu/~marmcke/docs/cs340/_images/hEmptyTable.png)](https://www.cs.siue.edu/~marmcke/docs/cs340/_images/hEmptyTable.png)

Insert the following:

1. 349587 $\rightarrow 10 \bmod 5 = 0$
2. 98745 $\rightarrow 14 \bmod 5 = 4$
3. 84743 $\rightarrow 11 \bmod 5 = 1$

Now find the same numbers in the hash table (just apply the formula and look for them).

Now insert:

- 24544 $\rightarrow 6 \bmod 5 = 1$. **Collision with 84743!**

Collisions are a problem, but there are various ways to handle them:

Open addressing collision handling methods:

1. Linear probing – look for next open spot
2. Quadratic probing
3. Double Hashing
4. Increase the table size

Multiple-item storage collision handling methods:

1. Buckets
2. Chaining

### Open Addressing: Linear Probing

If there is a collision, just look for the next open slot and insert the item there.

[![\_images/hLinearProbe.png](https://www.cs.siue.edu/~marmcke/docs/cs340/_images/hLinearProbe.png)](https://www.cs.siue.edu/~marmcke/docs/cs340/_images/hLinearProbe.png)

**Deleting**. Deleting is problematic, since removing an item might break the probe sequence for items stored after it. Instead of actually deleting items, mark them as deleted (lazy delete!).

What happens when the hash table fills up?

One problem with linear probing is that it can lead to a degenerate situation where items that map to the same portion of a hash table overflow into other parts of the hash table, causing a cascading series of probes for lots of items.

### Open Addressing: Quadratic Probing

Instead of just looking at the next slot for an opening, follow a quadratic sequence of offsets from the original index (1, 4, 9, 16, ...).
[![\_images/hQuadProbe.png](https://www.cs.siue.edu/~marmcke/docs/cs340/_images/hQuadProbe.png)](https://www.cs.siue.edu/~marmcke/docs/cs340/_images/hQuadProbe.png)

Resolves some of the clustering problems of linear probing.

Can fail even when the hash table is not full (an odd, preferably prime, table size helps).

What do we do when insertion fails?

### Open Addressing: Double Hashing (or triple, or quadruple for that matter)

If a collision occurs on one hash function, simply use another one.

What do we do if a collision occurs on all hash functions?

The hash functions can be tried in parallel, although hash functions are typically cheap to compute.

### Open Addressing: Increase the Table Size

Make a new array with more room. **How much more room?**

Insert each item into the new array. **Do we reuse the same hash functions?**

Delete the old array.

When should this occur? When insertion fails? When the table is full? When the table is almost full? When a certain percentage of probes have occurred?

### Multi-item Storage: Buckets

The idea is simple: just keep room for more than 1 item at each table location. Pick a fixed number, and possibly use an additional open addressing strategy if the bucket fills.

### Multi-item Storage: Chaining

Use a linked list (or other ADT) at each table location. Might need to consider increasing the table size if a list (or lists) gets too long.
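
As a recap of the open-addressing discussion, here is a minimal Python sketch of linear probing with lazy deletion, using the toy hash function from the basic example (first digit plus last digit, mod the table size). The names `toy_hash`, `LinearProbingTable`, `EMPTY`, and `DELETED` are illustrative assumptions, not part of the course material, and the sketch makes no attempt to resize a full table.

```python
EMPTY = object()    # marks a slot that has never held a key
DELETED = object()  # tombstone left behind by a lazy delete


def toy_hash(key: int, size: int) -> int:
    """Add the first and last decimal digits of key, then mod by the table size."""
    digits = str(key)
    return (int(digits[0]) + int(digits[-1])) % size


class LinearProbingTable:
    """Open-addressing hash table (hypothetical helper class for illustration)."""

    def __init__(self, size: int = 5) -> None:
        self.size = size
        self.slots = [EMPTY] * size

    def _probe(self, key: int):
        """Yield slot indices starting at the home slot and wrapping around once."""
        start = toy_hash(key, self.size)
        for step in range(self.size):
            yield (start + step) % self.size

    def contains(self, key: int) -> bool:
        for i in self._probe(key):
            if self.slots[i] is EMPTY:
                return False          # a never-used slot ends the search
            if self.slots[i] is not DELETED and self.slots[i] == key:
                return True           # tombstones are skipped, not treated as empty
        return False

    def insert(self, key: int) -> bool:
        if self.contains(key):
            return True
        for i in self._probe(key):
            if self.slots[i] is EMPTY or self.slots[i] is DELETED:
                self.slots[i] = key   # tombstones can be reused on insert
                return True
        return False                  # every slot is occupied

    def delete(self, key: int) -> None:
        for i in self._probe(key):
            if self.slots[i] is EMPTY:
                return
            if self.slots[i] is not DELETED and self.slots[i] == key:
                self.slots[i] = DELETED
                return


table = LinearProbingTable(size=5)
for key in (349587, 98745, 84743, 24544):
    table.insert(key)
# 24544 hashes to slot 1, which 84743 already holds, so it moves on to slot 2.
table.delete(84743)           # leaves a tombstone in slot 1
print(table.contains(24544))  # the search steps over the tombstone
```

If `delete` simply reset the slot to `EMPTY`, the search for 24544 would stop at slot 1 and wrongly report the key as missing, which is the breakage the lazy delete avoids.
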
### String Hash Functions

String hash functions are a little tougher. Here are some examples (letters are numbered a = 1 through z = 26):

- Add the numeric values of the first five characters:

  1. burner = 2+21+18+14+5 = 60
  2. scanner = 19+3+1+14+14 = 51
  3. camera = 3+1+13+5+18 = 40
  4. tablet = 20+1+2+12+5 = 40

  Values range from 5 to 130.

- Concatenate the positional values of the first five characters:

  1. burner = 2 21 18 14 5 = 22,118,145
  2. scanner = 19 3 1 14 14 = 19,311,414
  3. camera = 3 1 13 5 18 = 3,113,518
  4. tablet = 20 1 2 12 5 = 2,012,125

  Values range from 11,111 to 2,626,262,626.

Instead, look into production hash functions: MD5, SHA, etc.

### Good Hash Functions

Good hash functions are easy to compute and distribute values evenly throughout the table. Here is a recipe:

$$h(k) = ((f(a,k) + b) \bmod p) \bmod S$$

Where:

- $S$ is the size of the table.
- $p$ is a prime number larger than any number that will be hashed (4,294,967,291 is the largest **unsigned** 32-bit prime. 2,147,483,647 is the largest **signed** 32-bit prime. 18,446,744,073,709,551,557 is the largest **unsigned** 64-bit prime. 9,223,372,036,854,775,783 is the largest **signed** 64-bit prime. **NOTE**: you MUST BE CAREFUL with integer overflows when performing these calculations!)
- $a, b$ are positive constants both less than $p$. Make sure $a \neq 0$.
- $f(a,k)$ is some interaction of $a$ and $k$. A good basic choice is $f(a,k) = a \times k$.

Multiple hash functions can be easily generated by choosing random values of $a$ and $b$.

One implementation note: with a 32-bit $p$, intermediate values need to be 64-bit to prevent overflow!

This type of hash function belongs to the *2-universal* family of hash functions, with a probability of two items colliding $\leq \frac{1}{S}$ (just make sure $k < p$ for all $k$).

Let’s try this out in class.
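
The recipe above can be sketched in Python as follows, taking the basic choice f(a, k) = a × k. The function names `universal_hash` and `make_hash` and the seed value are assumptions made for illustration. Python integers never overflow, so the 64-bit caveat does not bite here; in C or C++ the product `a * key` would need a 64-bit intermediate when p is a 32-bit prime.

```python
import random

P = 4_294_967_291  # largest prime below 2**32, as quoted in the notes


def universal_hash(key: int, a: int, b: int, p: int, size: int) -> int:
    """h(k) = ((a*k + b) mod p) mod S, with f(a, k) = a * k."""
    return ((a * key + b) % p) % size


def make_hash(table_size: int, rng: random.Random):
    """Pick random a and b (1 <= a, b < p) and return the resulting hash function."""
    a = rng.randrange(1, P)  # a must not be 0
    b = rng.randrange(1, P)

    def h(key: int) -> int:
        if not 0 <= key < P:
            raise ValueError("key must satisfy 0 <= key < p")
        return universal_hash(key, a, b, P, table_size)

    return h


# Small worked case: a=3, b=7, p=101, S=5, k=10 gives (30 + 7) % 101 % 5 = 37 % 5 = 2.
print(universal_hash(10, 3, 7, 101, 5))

# Drawing fresh (a, b) pairs from one generator yields several hash functions.
rng = random.Random(340)  # arbitrary seed, only for reproducibility
h1, h2 = make_hash(11, rng), make_hash(11, rng)
print(h1(123456789), h2(123456789))
```
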
### Load Factor of a Hash Table

The load factor for a hash table is $\frac{numInsertedItems}{numLocations}$. With open addressing this is between 0 and 1 (with chaining it can exceed 1). A high load factor indicates the hash table is almost full, and you might want to think about resizing it.

### A final note

Make sure the hash table size is ODD. Prime numbers help too.

No matter what, a hash table doesn’t store data in sorted order!

## Bloom Filters

Bloom filters are one of the coolest data structures around, but they are not used too often. The idea behind a Bloom filter is that we don’t always need to store every item; sometimes we just need to record the fact that the item exists. This is known as **membership testing**.

Question

What are some example applications that might use membership testing?

Bloom filters don’t store the actual data; they keep track of a bit string. Also, they do not use a single hash function, but a group of $k$ hash functions. Each hash function addresses a single bit in the array.

Example

Hash\_function1( int ): mod the sum of the first two digits

Hash\_function2( int ): mod sum of 2nd pair of digits

Hash\_function3( int ): mod sum of 3rd pair of digits

Now insert: 937789, 932243, 106616

[![\_images/hBloomFilt.png](https://www.cs.siue.edu/~marmcke/docs/cs340/_images/hBloomFilt.png)](https://www.cs.siue.edu/~marmcke/docs/cs340/_images/hBloomFilt.png)

Does this number exist in our set? 134898

Not so fast! A Bloom filter’s answers are not always right: a “no” is always correct, but a “yes” may be wrong. Bloom filters may have **False Positives**: the filter reports that a number is present even though it was never explicitly put into the filter! This occurs by chance, when other numbers simply cause those bits to be 1. It becomes more likely as the Bloom filter begins to fill up!

So what do we do about false positives? CONTROL THEM! We just make the probability of false positives sufficiently small. From <http://en.wikipedia.org/wiki/Bloom_filter>:

[![\_images/hBloomFP1.png](https://www.cs.siue.edu/~marmcke/docs/cs340/_images/hBloomFP1.png)](https://www.cs.siue.edu/~marmcke/docs/cs340/_images/hBloomFP1.png)

[![\_images/hBloomFP2.png](https://www.cs.siue.edu/~marmcke/docs/cs340/_images/hBloomFP2.png)](https://www.cs.siue.edu/~marmcke/docs/cs340/_images/hBloomFP2.png)

[![\_images/hBloomFP3.png](https://www.cs.siue.edu/~marmcke/docs/cs340/_images/hBloomFP3.png)](https://www.cs.siue.edu/~marmcke/docs/cs340/_images/hBloomFP3.png)

- We can determine $n$, the number of items we need to store
- Then choose an appropriate $m$ (the number of bits in the filter), compute the optimal $k$, and see if the false positive probability is within acceptable bounds
- If not, try new values for $m$ or $n$

Or just look at the table:

[![\_images/hBloomFP4.png](https://www.cs.siue.edu/~marmcke/docs/cs340/_images/hBloomFP4.png)](https://www.cs.siue.edu/~marmcke/docs/cs340/_images/hBloomFP4.png)

Question

What are some example applications that might use membership testing now that we have to deal with false positives?

Example

Let’s say I want to do existence testing for a set of 1,000,000 items.

- If an item is an integer, that means we would need to store 4,000,000 bytes ~= 3.8 MB of data
- We would need roughly twice that to prevent saturation in a hash table ~= 7.6 MB
- What we are storing may be much larger than an integer!

In the case of the Bloom Filter...

- Let’s say we want a false positive rate of 0.001
- According to the table, we need m/n = 15 and k = 7
- So, we need 15,000,000 bits ~= 1.8 MB. That’s a quarter of the storage of the hash table, and, no matter how large the items are, it will always take 1.8 MB
- We just need to make sure that we can come up with 7 independent hash functions! But that’s easy with our hash function recipe above!

Finally, some properties of Bloom Filters:

- A BF can represent an entire universe of elements, whereas a hash table runs out of space. Also, a hash table must explicitly store all elements
- Union and intersection of Bloom filters are possible using bitwise operations on the bit strings (super useful for combining tables in databases without having to rebuild indexes)
- If the filter becomes saturated, we have to change the $m/n$ ratio. There are types of BFs that grow like vectors!
- There are tons of variants for all types of applications: Bloomier filters, counting filters, stable Bloom filters, scalable filters
- Deletion is a problem! What happens if we try to delete something?

Let’s practice a little...
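
To make the sizing example concrete, here is a minimal Python sketch of a Bloom filter that draws its k hash functions from the (a·k + b) mod p mod m recipe given earlier. The class name `BloomFilter`, the method names, the seeds, and the scaled-down n = 1,000 (instead of 1,000,000) are assumptions chosen so the example runs quickly; the m/n = 15 and k = 7 settings are the ones quoted from the table.

```python
import random

P = 4_294_967_291  # largest prime below 2**32; every key must be smaller than this


class BloomFilter:
    """Bit-string membership filter (hypothetical illustration, integer keys only)."""

    def __init__(self, num_bits: int, num_hashes: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.num_bits = num_bits
        self.bits = bytearray((num_bits + 7) // 8)  # packed bit string, all zeros
        # One random (a, b) pair per hash function, following the recipe.
        self.params = [(rng.randrange(1, P), rng.randrange(1, P))
                       for _ in range(num_hashes)]

    def _positions(self, key: int):
        for a, b in self.params:
            yield ((a * key + b) % P) % self.num_bits

    def add(self, key: int) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: int) -> bool:
        """False means definitely absent; True means present or a false positive."""
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


n = 1000                                     # scaled down from 1,000,000 items
bloom = BloomFilter(num_bits=15 * n, num_hashes=7, seed=340)

rng = random.Random(2014)
sample = rng.sample(range(1_000_000), 2 * n)  # distinct keys, all below P
members, others = sample[:n], sample[n:]
for key in members:
    bloom.add(key)

false_positives = sum(bloom.might_contain(key) for key in others)
print("false positive rate:", false_positives / len(others))
```

Every inserted key is always reported as present, since `add` sets exactly the bits that `might_contain` checks. The observed false positive rate depends on the seeds, so the sketch prints it rather than promising a specific value.
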
© Copyright 2014, M McKenney. Created using [Sphinx](http://sphinx-doc.org/) 1.3.
Readable Markdown | null
Shard | 50 (laksa)
Root Hash | 11131842051709627250
Unparsed URL | edu,siue!cs,www,/~marmcke/docs/cs340/hashing.html s443