🕷️ Crawler Inspector

URL Lookup

Raw Queries and Responses

1. Shard Calculation

Query:

Response:

Calculated Shard: 50 (from laksa169)

2. Crawled Status Check

Query:

curl -X POST \
  'http://laksa050.int.ahrefs:8124/' \
  -H 'Content-Type: text/plain' \
  -H 'X-ClickHouse-Database: crawler3' \
  -H 'Authorization: Basic YXBpOg==' \
  -d 'SELECT getAhrefsURLFromUnparsed(src_unparsed) AS found_url, ifNull(toUnixTimestamp(download_stamp), 0) AS crawl_time, ifNull(toUnixTimestamp(props_url_first_seen), 0) AS first_indexed_time, download_http_code AS http_code, src_unparsed AS src_unparsed, src_root_hash AS src_root_hash, history_drop_reason AS history_drop_reason, meta_title AS meta_title, meta_descriptions AS meta_descriptions, attrs_boilerpipe_text AS attrs_boilerpipe_text, attrs_markdown AS attrs_markdown, attrs_readable_markdown AS attrs_readable_markdown, meta_canonical AS meta_canonical FROM crawler3.page_info_local FINAL PREWHERE (src_root_hash, src_unparsed) IN ((getAhrefsRootHashFromUnparsed(getAhrefsUnparsedNoserviceFromURL(\'https://www.cs.siue.edu/~marmcke/docs/cs340/hashing.html\')), getAhrefsUnparsedNoserviceFromURL(\'https://www.cs.siue.edu/~marmcke/docs/cs340/hashing.html\'))) FORMAT JSONEachRow'
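The same request can be sketched in Python, which makes it easier to swap in a different URL without hand-editing the quoted SQL. This is a minimal sketch, not the inspector's own code: the endpoint, headers, table name and `getAhrefs*` functions are copied from the curl command above, while `sql_quote`, `build_status_query` and `build_request` are hypothetical helper names, and the column list passed in is up to the caller. Nothing is sent over the network here; the request object is only constructed.

```python
import urllib.request

# Endpoint and headers are copied verbatim from the curl command above.
ENDPOINT = "http://laksa050.int.ahrefs:8124/"
HEADERS = {
    "Content-Type": "text/plain",
    "X-ClickHouse-Database": "crawler3",
    "Authorization": "Basic YXBpOg==",
}


def sql_quote(value: str) -> str:
    """Quote a Python string as a backslash-escaped ClickHouse string literal."""
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


def build_status_query(url: str, columns: list) -> str:
    """Build the crawled-status SELECT for one URL (hypothetical helper)."""
    unparsed = f"getAhrefsUnparsedNoserviceFromURL({sql_quote(url)})"
    return (
        f"SELECT {', '.join(columns)} "
        "FROM crawler3.page_info_local FINAL "
        "PREWHERE (src_root_hash, src_unparsed) IN "
        f"((getAhrefsRootHashFromUnparsed({unparsed}), {unparsed})) "
        "FORMAT JSONEachRow"
    )


def build_request(url: str, columns: list) -> urllib.request.Request:
    """Wrap the query in a POST request object without sending it."""
    body = build_status_query(url, columns).encode("utf-8")
    return urllib.request.Request(ENDPOINT, data=body, headers=HEADERS, method="POST")
```

Passing the request to `urllib.request.urlopen` would execute it, assuming the host is reachable from where the script runs.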

Response:

{"found_url":"https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/hashing.html","crawl_time":1770649579,"first_indexed_time":1737184199,"http_code":200,"src_unparsed":"edu,siue!cs,www,\/~marmcke\/docs\/cs340\/hashing.html s443","src_root_hash":"11131842051709627250","history_drop_reason":null,"meta_title":"Hashing and Hash Tables — CS 340: Algorithms and Data Structures 1.0 documentation","meta_descriptions":[],"attrs_boilerpipe_text":"Our discussions on trees centered around a data structure that stored items efficiently, but to get the balanced height trees, things got tough to implement.  I nstead of focusing so much on the structure, hashing takes the approach that the structure should be rather simple.  The downside is that iterating over elements does not come for free, as in trees, but is possible with a few tricks.\nSo, for hashing we are looking at simple structures, usually arrays.  We will manage the size of the array to be not too much bigger than the amount of data stored (like C++ vectors), to preserve iteration in linear time.\nHash Tables\n¶\nThe basic question is “Why not just use an array as a table?”.  Its a good question...\nLets think about a table containing products that a store wants to keep track of.  Here is an example.  There are serious problems with this basic approach, what are they?\nReal World Data\nHere are some examples of real world data we may want to store in a table.  If we are simply using the numbers as array indexes, then:\nHow big of an array do we need?\nHow much of the array will actually get used?\nStudents: Student ID (9 digits)\nPeople: SSN (9 digits)\nZIP code: 5 digits, 9 digits\nISBN: 10 digits\nUPC: 12 digits\nOthers: Character strings\nBasic Hash Tables\n¶\nA\nHash Table\nwill consist of 2 parts:\na\ntable\n(an array), and\na\nhash function\nthat will convert key values to array indices. (used for insert\/delete\/search)\nA hash function can really be anything, but there are some recipes for reliably good ones.  
Here are a couple examples of some that might work out in specific cases:\nUse certain digits form a long number.\nex: last 4 digits of student ID.  Will this work at our university?\nFolding.\nUse some function to get a smaller range of values.  ex: add the digits of student ID.  Will this work at our university?\nA Basic Hash Table Example\nsize: 5\nHash function: Add first and last digits, then mod the result by the table size.\nHere is the table:\nInsert the following:\n349587\n98745\n84743\nNow find the same numbers in the hash table.  (just apply the formula and look for them)\nNow insert:\n24544\n.\nCollision with 84743!\nCollisions are a problem, but there are various ways to handle them:\nOpen addressing collision handling methods:\nLinear probing – look for next open spot\nQuadratic probing\nDouble Hashing\nIncrease the table size\nMultiple-Item Storage collision handling methods\nBuckets\nChaining\nOpen Addressing: Linear Probing\n¶\nIf there is a collision, just look for the next open slot and insert the item there.\nDeleting\n.  Deleting is problematic, since removing an item might break the linear probe.  Instead of actually deleting items, mark them as being deleted (lazy delete!)\nWhat happens when the hash table fills up?\nOne problem with linear probing is that it can lead to a degenerate situation where items that map to the same portion of a hash table overflow into other parts of the hash table, causing a cascading series of probes for lots of items.\nOpen Addressing: Quadratic Probing\n¶\nInstead of just looking at the next slot for an opening, follow a quadratic sequence of indices (1,2,4,8,16,...)\nResolves some of the clustering problems of linear probing.  
Can fail with a non-full hash table (but we can make an odd sized table).\nWhat do we do when insertion fails?\nOpen Addressing: Double Hashing (or triple, or quadruple for that matter)\n¶\nIf a collision occurs on one hash function, simply use another one.\nWhat do we do if a collision occurs on all hash functions?\nThe hash functions can be tried in parallel! but hash functions are typically pretty computationally cheap.\nOpen Addressing: Increase the Table Size\n¶\nMake a new array with more room.\nHow much more room?\nInsert each item into the new array.\nDo we reuse the same hash functions?\nDelete the old array.\nWhen should this occur?  When insertion fails, when the table is full? when the table is almost full? when a certain percent of probes have occurred?\nMulti-item Storage: Buckets\n¶\nThe idea is simple, just keep room for more than 1 item at each table location.  Pick a fixed number, and possibly use an additional open addressing strategy if the bucket fills.\nMulti-item Storage: Chaining\n¶\nUse a linked list (or other ADT) at each table location.\nMight need to consider increasing the table size if a list (or lists) get too long.\nString Hash Functions\n¶\nString hash functions are little tougher.  
Here are some examples:\nAdd the numeric values of first five characters:\nburner    =  2+21+18+14+5 = 60\nscanner = 19+1+14+14+5  = 53\ncamera  = 3+1+13+5+18    = 40\ntablet     = 20+1+2+12+5    = 40\nValues range from 1 to 130\nConcatenate positional values of first five char\nburner    =  2 21 18 14 5     = 22,118,145\nscanner = 19 3 1 14 14 5   = 19,314,145\ncamera  = 3 1 13 5 18        =   3,113,518\ntablet     = 20 1 2 12 5        =   2,012,125\nValues range from 1,048,576 to 28,142,426\nInstead, look into production hash functions: MD5, SHA, etc.\nGood Hash Functions\n¶\nGood hash functions are easy to compute, and distribute values evenly throughout the table.\nHere is a recipe:\nWhere:\nis the size of the table.\nis a prime number larger than any number that will be hashed (4,294,967,291 is the largest\nunsigned\n32 bit prime integer.  2,147,483,647 is the largest\nsigned\n32 bit prime number.  18,446,    744,073,709,551,557 is the larges\nunsigned\n64 bit number. 9,223,372,036,854,775,783 is the largest\nsigned\nprime integer.\nNOTE\nyou MUST BE CAREFUL with integer overflows, when perfoming these calculations!!!)\nare positive constants both less than\n. Make sure\n.\nis some interaction of\nand\n.  A good basic choice is\nMultiple hash functions can be easily generated by choosing random values of\nand\n.\nOne implementation note, intermediate values need to be 64 bit to prevent overflow!\nThis type of hash function falls in the family of\n2-universal family\nhash functions, with a probability of items colliding\n. (just make sure\nfor all\n)\nLets try this out in class.\nLoad Factor of a Hash Table\n¶\nThe load factor for a hash table is:\n.  This is between 1 and 0.  A high load factor indicates the hash table is almost full, and you might want to think about resizing it.\nA final note\n¶\nMake sure the hash table size is ODD.  
Prime numbers help too\nNo matter what, a hash table doesn’t store data in sorted order!\nBloom Filters\n¶\nBloom filters are one of the coolest data structures around, but they are not used too often.\nThe idea behind a bloom filter is that we don’t always need to store every item, sometimnes we just need to record the fact that the item exists.  This is known as\nmembership testing\n.\nQuestion\nWhat are some example applications that might use membership testing?\nBloom filters don’t store the actual data, they keep track of a bit string. Also, they do not use a single hash function, but use a group of\nhash functions. Each hash function will address a single bit in the array.\nExample\nHash_function1( int ): mod the sum of the first two digits\nHash_function2( int ): mod sum of 2nd pair of digits\nHash_function3( int ): mod sum of 3rd pair of digits\nNow insert: 937789, 932243, 106616\nDoes this number exist in our set?  134898\nNot so fast!  The thing about bloom filters is that they are not always right, but they are never wrong.  Bloom filters have the property that they may have\nFalse Positives\n, a number is recorded in the filter but was never explicitly put into the filter!  This occurs by chance, when other numbers simply cause those bits to be 1.  It can also happen when the bloom filter begins to fill up!\nSo what do we do about false positives?  
CONTROL THEM!!\nWe just make the probability of false positives sufficiently small.\nFrom\nhttp:\/\/en.wikipedia.org\/wiki\/Bloom_filter\n:\n.\n.\nOr just look at the table:\nQuestion\nWhat are some example applications that might use membership testing now that we have to deal with false positives?\nExample\nLets say I want to do existence testing for a set of 1,000,000 items.\nIf an item is an integer, that means we would need to store 4,000,000 bytes ~= 3.8 MB of data\nWe would need roughly twice that to prevent saturation in a hash table ~= 7.6MB\nWhat we are storing may be much larger than an integer!\nIn the case of the Bloom Filter...\nLets say we want a false positive rate of 0.001\nAccording to the table, we need m\/n = 15 and k = 7\nSo, we need 15,000,000 bits ~= 1.8MB.  That’s a quarter of the storage of the hash table, and, no matter how large the items are, it will always take 1.8MB\nWe just need to make sure that we can come up with 7 independent hash functions! But thats easy with our hash function recipe above!\nFinally, some properties of Bloom Filters:\nA BF can represent an entire universe of elements, wheras hash table runs out of space.  Also, a hash table must explicitly store all elements\nUnion and Intersection of bloom filters is possible using bitwise operations on the bit strings (super useful for combining tables in databases without having to rebuild indexes)\nIf the filter becomes satuated, we have to change the\nratio.  There are types of BFs that grow like vectors!\nthere are tons of variants for all types of applications: Bloomier filters, counting filters, stable bloom filters, scalable filters\nDeleteion is a problem!  
What happens if we try to delete something?\nLets practice a little...","attrs_markdown":"![](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/_static\/title.png)\n\n### Navigation\n- [index](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/genindex.html \"General Index\")\n- [next](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/heaps.html \"Heaps and Priority Queues\") \\|\n- [previous](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/reviewMultiThread.html \"Review of Multithreaded Programming\") \\|\n- [CS 340: Algorithms and Data Structures 1.0 documentation](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/index.html) »\n\n### [Table Of Contents](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/index.html)\n- [Hashing and Hash Tables](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/hashing.html)\n  - [Hash Tables](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/hashing.html#hash-tables)\n    - [Basic Hash Tables](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/hashing.html#basic-hash-tables)\n    - [Open Addressing: Linear Probing](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/hashing.html#open-addressing-linear-probing)\n    - [Open Addressing: Quadratic Probing](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/hashing.html#open-addressing-quadratic-probing)\n    - [Open Addressing: Double Hashing (or triple, or quadruple for that matter)](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/hashing.html#open-addressing-double-hashing-or-triple-or-quadruple-for-that-matter)\n    - [Open Addressing: Increase the Table Size](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/hashing.html#open-addressing-increase-the-table-size)\n    - [Multi-item Storage: Buckets](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/hashing.html#multi-item-storage-buckets)\n    - [Multi-item Storage: Chaining](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/hashing.html#multi-item-storage-chaining)\n    - [String Hash 
Functions](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/hashing.html#string-hash-functions)\n    - [Good Hash Functions](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/hashing.html#good-hash-functions)\n    - [Load Factor of a Hash Table](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/hashing.html#load-factor-of-a-hash-table)\n    - [A final note](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/hashing.html#a-final-note)\n  - [Bloom Filters](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/hashing.html#bloom-filters)\n#### Previous topic\n[Review of Multithreaded Programming](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/reviewMultiThread.html \"previous chapter\")\n\n#### Next topic\n[Heaps and Priority Queues](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/heaps.html \"next chapter\")\n\n### This Page\n- [Show Source](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/_sources\/hashing.txt)\n\n### Quick search\nEnter search terms or a module, class or function name.\n\n# Hashing and Hash Tables[¶](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/hashing.html#hashing-and-hash-tables \"Permalink to this headline\")\nOur discussions on trees centered around a data structure that stored items efficiently, but to get the balanced height trees, things got tough to implement. I nstead of focusing so much on the structure, hashing takes the approach that the structure should be rather simple. The downside is that iterating over elements does not come for free, as in trees, but is possible with a few tricks.\n\nSo, for hashing we are looking at simple structures, usually arrays. We will manage the size of the array to be not too much bigger than the amount of data stored (like C++ vectors), to preserve iteration in linear time.\n\n## Hash Tables[¶](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/hashing.html#hash-tables \"Permalink to this headline\")\nThe basic question is “Why not just use an array as a table?”. 
Its a good question...\n\nLets think about a table containing products that a store wants to keep track of. Here is an example. There are serious problems with this basic approach, what are they?\n\n[![\\_images\/hSimple.png](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/_images\/hSimple.png)](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/_images\/hSimple.png)\n\nReal World Data\n\nHere are some examples of real world data we may want to store in a table. If we are simply using the numbers as array indexes, then:\n\n1. How big of an array do we need?\n2. How much of the array will actually get used?\n\n- Students: Student ID (9 digits)\n- People: SSN (9 digits)\n- ZIP code: 5 digits, 9 digits\n- ISBN: 10 digits\n- UPC: 12 digits\n- Others: Character strings\n\n### Basic Hash Tables[¶](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/hashing.html#basic-hash-tables \"Permalink to this headline\")\nA **Hash Table** will consist of 2 parts:\n\n1. a **table** (an array), and\n2. a **hash function** that will convert key values to array indices. (used for insert\/delete\/search)\n\nA hash function can really be anything, but there are some recipes for reliably good ones. Here are a couple examples of some that might work out in specific cases:\n\n- **Use certain digits form a long number.** ex: last 4 digits of student ID. Will this work at our university?\n- **Folding.** Use some function to get a smaller range of values. ex: add the digits of student ID. Will this work at our university?\n\nA Basic Hash Table Example\n\nsize: 5\n\nHash function: Add first and last digits, then mod the result by the table size.\n\nHere is the table:\n\n[![\\_images\/hEmptyTable.png](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/_images\/hEmptyTable.png)](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/_images\/hEmptyTable.png)\n\nInsert the following:\n\n1. 
349587 ![\\\\rightarrow 10 \\\\% 5 = 0](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/_images\/math\/b45e0882b31c0c8766af61c11a604d3feb4b7559.png)\n2. 98745 ![\\\\rightarrow 14 \\\\% 5 = 4](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/_images\/math\/dac6d3978a59b5d6e21e4014e6757f0856e42467.png)\n3. 84743 ![\\\\rightarrow 11 \\\\% 5 = 1](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/_images\/math\/2965539df530f40fa892a11144a8c7a93cd6541b.png)\n\nNow find the same numbers in the hash table. (just apply the formula and look for them)\n\nNow insert:\n\n- 24544 ![\\\\rightarrow 7 \\\\% 5 = 1](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/_images\/math\/896827ce7d061d8fcfd3d4b6e95633b55e7235d6.png). **Collision with 84743\\!**\n\nCollisions are a problem, but there are various ways to handle them:\n\nOpen addressing collision handling methods:\n\n1. Linear probing – look for next open spot\n2. Quadratic probing\n3. Double Hashing\n4. Increase the table size\n\nMultiple-Item Storage collision handling methods\n\n1. Buckets\n2. Chaining\n\n### Open Addressing: Linear Probing[¶](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/hashing.html#open-addressing-linear-probing \"Permalink to this headline\")\nIf there is a collision, just look for the next open slot and insert the item there.\n\n[![\\_images\/hLinearProbe.png](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/_images\/hLinearProbe.png)](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/_images\/hLinearProbe.png)\n\n**Deleting**. Deleting is problematic, since removing an item might break the linear probe. 
Instead of actually deleting items, mark them as being deleted (lazy delete!)\n\nWhat happens when the hash table fills up?\n\nOne problem with linear probing is that it can lead to a degenerate situation where items that map to the same portion of a hash table overflow into other parts of the hash table, causing a cascading series of probes for lots of items.\n\n### Open Addressing: Quadratic Probing[¶](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/hashing.html#open-addressing-quadratic-probing \"Permalink to this headline\")\nInstead of just looking at the next slot for an opening, follow a quadratic sequence of indices (1,2,4,8,16,...)\n\n[![\\_images\/hQuadProbe.png](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/_images\/hQuadProbe.png)](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/_images\/hQuadProbe.png)\n\nResolves some of the clustering problems of linear probing. Can fail with a non-full hash table (but we can make an odd sized table).\n\nWhat do we do when insertion fails?\n\n### Open Addressing: Double Hashing (or triple, or quadruple for that matter)[¶](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/hashing.html#open-addressing-double-hashing-or-triple-or-quadruple-for-that-matter \"Permalink to this headline\")\nIf a collision occurs on one hash function, simply use another one.\n\nWhat do we do if a collision occurs on all hash functions?\n\nThe hash functions can be tried in parallel! but hash functions are typically pretty computationally cheap.\n\n### Open Addressing: Increase the Table Size[¶](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/hashing.html#open-addressing-increase-the-table-size \"Permalink to this headline\")\nMake a new array with more room. **How much more room?**\n\nInsert each item into the new array. **Do we reuse the same hash functions?**\n\nDelete the old array.\n\nWhen should this occur? When insertion fails, when the table is full? when the table is almost full? 
when a certain percent of probes have occurred?\n\n### Multi-item Storage: Buckets[¶](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/hashing.html#multi-item-storage-buckets \"Permalink to this headline\")\nThe idea is simple, just keep room for more than 1 item at each table location. Pick a fixed number, and possibly use an additional open addressing strategy if the bucket fills.\n\n### Multi-item Storage: Chaining[¶](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/hashing.html#multi-item-storage-chaining \"Permalink to this headline\")\nUse a linked list (or other ADT) at each table location.\n\nMight need to consider increasing the table size if a list (or lists) get too long.\n\n### String Hash Functions[¶](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/hashing.html#string-hash-functions \"Permalink to this headline\")\nString hash functions are little tougher. Here are some examples:\n\n- Add the numeric values of first five characters:\n\n1. burner = 2+21+18+14+5 = 60\n2. scanner = 19+1+14+14+5 = 53\n3. camera = 3+1+13+5+18 = 40\n4. tablet = 20+1+2+12+5 = 40\n\nValues range from 1 to 130\n\n- Concatenate positional values of first five char\n\n1. burner = 2 21 18 14 5 = 22,118,145\n2. scanner = 19 3 1 14 14 5 = 19,314,145\n3. camera = 3 1 13 5 18 = 3,113,518\n4. 
tablet = 20 1 2 12 5 = 2,012,125\n\nValues range from 1,048,576 to 28,142,426\n\nInstead, look into production hash functions: MD5, SHA, etc.\n\n### Good Hash Functions[¶](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/hashing.html#good-hash-functions \"Permalink to this headline\")\nGood hash functions are easy to compute, and distribute values evenly throughout the table.\n\nHere is a recipe:\n\n![h(k) = ( f(a,k)+b) \\\\% p \\\\% S](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/_images\/math\/bad12d496a27c4739dee26bca876ede6bbe8da68.png)\n\nWhere:\n\n- ![S](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/_images\/math\/11a85f3c69ae6702cb1d99d3de451913b8f84c04.png) is the size of the table.\n- ![p](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/_images\/math\/3eca8557203e86160952e1c0f735f7417f3285b1.png) is a prime number larger than any number that will be hashed (4,294,967,291 is the largest **unsigned** 32 bit prime integer. 2,147,483,647 is the largest **signed** 32 bit prime number. 18,446, 744,073,709,551,557 is the larges **unsigned** 64 bit number. 9,223,372,036,854,775,783 is the largest **signed** prime integer. **NOTE** you MUST BE CAREFUL with integer overflows, when perfoming these calculations!!!)\n- ![a,b](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/_images\/math\/ed3ee7bf0f52d1ad30ec7b003588cab83b4b108f.png) are positive constants both less than ![p](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/_images\/math\/3eca8557203e86160952e1c0f735f7417f3285b1.png). 
Make sure ![a \\\\neq 0](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/_images\/math\/8e5662effc36bbbf17fecdcbcc8d20a6fd1e55be.png).\n- ![f(a,k)](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/_images\/math\/5954345d9914ef5b002630ad1085284ceef1b401.png) is some interaction of ![a](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/_images\/math\/7dd2a5ea01fbd72ad2a58dd1f3d6ecbfde6208a1.png) and ![k](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/_images\/math\/e9203da50e1059455123460d4e716c9c7f440cc3.png). A good basic choice is ![f(a,k)=a \\\\times k](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/_images\/math\/8f85c0c559c464b6877d162ee5e626047ad2da08.png)\n\nMultiple hash functions can be easily generated by choosing random values of ![a](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/_images\/math\/7dd2a5ea01fbd72ad2a58dd1f3d6ecbfde6208a1.png) and ![b](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/_images\/math\/5e87bf41a96deddf6cb485ff530f153f2590e9cc.png).\n\nOne implementation note, intermediate values need to be 64 bit to prevent overflow\\!\n\nThis type of hash function falls in the family of *2-universal family* hash functions, with a probability of items colliding ![\\\\leq \\\\frac{1}{S}](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/_images\/math\/90c56dddc5333614933c27e5311587b58357ba8e.png). (just make sure ![k \\< p](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/_images\/math\/e874e7a826f8424eaa1821d1a819f5ac56bb7329.png) for all ![k](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/_images\/math\/e9203da50e1059455123460d4e716c9c7f440cc3.png))\n\nLets try this out in class.\n\n### Load Factor of a Hash Table[¶](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/hashing.html#load-factor-of-a-hash-table \"Permalink to this headline\")\nThe load factor for a hash table is: ![\\\\frac{numInsertedItems}{numLocations}](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/_images\/math\/a5a212447644ac048e04644a3044dc23b71be88d.png). 
This is between 1 and 0. A high load factor indicates the hash table is almost full, and you might want to think about resizing it.\n\n### A final note[¶](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/hashing.html#a-final-note \"Permalink to this headline\")\nMake sure the hash table size is ODD. Prime numbers help too\n\nNo matter what, a hash table doesn’t store data in sorted order\\!\n\n## Bloom Filters[¶](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/hashing.html#bloom-filters \"Permalink to this headline\")\nBloom filters are one of the coolest data structures around, but they are not used too often.\n\nThe idea behind a bloom filter is that we don’t always need to store every item, sometimnes we just need to record the fact that the item exists. This is known as **membership testing**.\n\nQuestion\n\nWhat are some example applications that might use membership testing?\n\nBloom filters don’t store the actual data, they keep track of a bit string. Also, they do not use a single hash function, but use a group of ![k](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/_images\/math\/e9203da50e1059455123460d4e716c9c7f440cc3.png) hash functions. Each hash function will address a single bit in the array.\n\nExample\n\nHash\\_function1( int ): mod the sum of the first two digits\n\nHash\\_function2( int ): mod sum of 2nd pair of digits\n\nHash\\_function3( int ): mod sum of 3rd pair of digits\n\nNow insert: 937789, 932243, 106616\n\n[![\\_images\/hBloomFilt.png](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/_images\/hBloomFilt.png)](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/_images\/hBloomFilt.png)\n\nDoes this number exist in our set? 134898\n\nNot so fast! The thing about bloom filters is that they are not always right, but they are never wrong. Bloom filters have the property that they may have **False Positives**, a number is recorded in the filter but was never explicitly put into the filter! 
This occurs by chance, when other numbers simply cause those bits to be 1. It can also happen when the bloom filter begins to fill up\\!\n\nSo what do we do about false positives? CONTROL THEM!\\!\n\nWe just make the probability of false positives sufficiently small.\n\nFrom <http:\/\/en.wikipedia.org\/wiki\/Bloom_filter> :\n\n[![\\_images\/hBloomFP1.png](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/_images\/hBloomFP1.png)](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/_images\/hBloomFP1.png)\n\n.\n\n[![\\_images\/hBloomFP2.png](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/_images\/hBloomFP2.png)](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/_images\/hBloomFP2.png)\n\n.\n\n[![\\_images\/hBloomFP3.png](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/_images\/hBloomFP3.png)](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/_images\/hBloomFP3.png)\n\n- We can determine ![n](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/_images\/math\/413f8a8e40062a9090d9d50b88bc7b551b314c26.png), the number of items we need to store\n- Then choose an appropriate ![m](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/_images\/math\/c4bb40dd65eae6c11b325989b14e0b8d35e4e3ef.png), compute the optimal ![k](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/_images\/math\/e9203da50e1059455123460d4e716c9c7f440cc3.png), and see if the false positive probability is within acceptable bounds\n- If not, try new values for ![m](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/_images\/math\/c4bb40dd65eae6c11b325989b14e0b8d35e4e3ef.png) or ![n](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/_images\/math\/413f8a8e40062a9090d9d50b88bc7b551b314c26.png)\n\nOr just look at the table:\n\n[![\\_images\/hBloomFP4.png](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/_images\/hBloomFP4.png)](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/_images\/hBloomFP4.png)\n\nQuestion\n\nWhat are some example applications that might use membership testing now that we have to deal with 
false positives?\n\nExample\n\nLets say I want to do existence testing for a set of 1,000,000 items.\n\n- If an item is an integer, that means we would need to store 4,000,000 bytes ~= 3.8 MB of data\n- We would need roughly twice that to prevent saturation in a hash table ~= 7.6MB\n- What we are storing may be much larger than an integer\\!\n\nIn the case of the Bloom Filter...\n\n- Lets say we want a false positive rate of 0.001\n- According to the table, we need m\/n = 15 and k = 7\n- So, we need 15,000,000 bits ~= 1.8MB. That’s a quarter of the storage of the hash table, and, no matter how large the items are, it will always take 1.8MB\n- We just need to make sure that we can come up with 7 independent hash functions! But thats easy with our hash function recipe above\\!\n\nFinally, some properties of Bloom Filters:\n\n- A BF can represent an entire universe of elements, wheras hash table runs out of space. Also, a hash table must explicitly store all elements\n- Union and Intersection of bloom filters is possible using bitwise operations on the bit strings (super useful for combining tables in databases without having to rebuild indexes)\n- If the filter becomes satuated, we have to change the ![m\/n](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/_images\/math\/fda12c7bb431a2212b50f0a5def335f88be3923a.png) ratio. There are types of BFs that grow like vectors\\!\n- there are tons of variants for all types of applications: Bloomier filters, counting filters, stable bloom filters, scalable filters\n- Deleteion is a problem! 
What happens if we try to delete something?\n\nLets practice a little...\n\n### Navigation\n- [index](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/genindex.html \"General Index\")\n- [next](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/heaps.html \"Heaps and Priority Queues\") \\|\n- [previous](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/reviewMultiThread.html \"Review of Multithreaded Programming\") \\|\n- [CS 340: Algorithms and Data Structures 1.0 documentation](https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/index.html) »\n\n© Copyright 2014, M McKenney. Created using [Sphinx](http:\/\/sphinx-doc.org\/) 1.3.","attrs_readable_markdown":null,"meta_canonical":null}
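Because the query ends in `FORMAT JSONEachRow`, the response body is one JSON object per line. The sketch below parses such a body and turns the two Unix timestamps into readable strings. It assumes UTC, which is consistent with the values shown under Page Details, and it treats `0` as "not set" because the query wraps both timestamps in `ifNull(..., 0)`. The helper names and the shortened sample row are illustrative only.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def parse_json_each_row(body: str) -> list:
    """Parse a JSONEachRow body: one JSON object per non-empty line."""
    return [json.loads(line) for line in body.split("\n") if line.strip()]


def format_utc(ts: int) -> Optional[str]:
    """Render a Unix timestamp as a UTC string; 0 stands for a NULL column."""
    if ts == 0:
        return None
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")


# Shortened sample: only four of the fields from the response above.
sample = (
    r'{"found_url":"https:\/\/www.cs.siue.edu\/~marmcke\/docs\/cs340\/hashing.html",'
    r'"crawl_time":1770649579,"first_indexed_time":1737184199,"http_code":200}'
)

rows = parse_json_each_row(sample + "\n")
for row in rows:
    print(row["found_url"], format_utc(row["crawl_time"]), format_utc(row["first_indexed_time"]))
```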

3. Robots.txt Check

Query:

Response:

4. Spam/Ban Check

Query:

Response:

5. Seen Status Check

ℹ️ Skipped - page is already crawled

📄 INDEXABLE

✅ CRAWLED (2 months ago)

🤖 ROBOTS ALLOWED

Page Info Filters

Filter	Status	Condition	Details
HTTP status	PASS	`download_http_code = 200`	HTTP 200
Age cutoff	PASS	`download_stamp > now() - 6 MONTH`	2.2 months ago
History drop	PASS	`isNull(history_drop_reason)`	No drop reason
Spam/ban	PASS	`fh_dont_index != 1 AND ml_spam_score = 0`	ml_spam_score=0
Canonical	PASS	`meta_canonical IS NULL OR = '' OR = src_unparsed`	Not set
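
The conditions in the filter table can be read as five independent boolean checks on the page row. The sketch below is only an illustration of that reading, not the inspector's actual implementation: the function name `page_info_filters` is hypothetical, the row is a plain dict, "6 MONTH" is approximated as 183 days, and `fh_dont_index` is assumed to be 0 because the table does not show its value.

```python
from datetime import datetime, timedelta, timezone


def page_info_filters(row, now):
    """Evaluate the five filter conditions from the table on one page row.

    Hypothetical helper: `row` is a dict keyed by the column names that
    appear in the filter conditions; `now` is the evaluation time.
    """
    cutoff = now - timedelta(days=183)  # rough stand-in for "now() - 6 MONTH"
    canonical = row.get("meta_canonical")
    return {
        "HTTP status": row["download_http_code"] == 200,
        "Age cutoff": row["download_stamp"] > cutoff,
        "History drop": row.get("history_drop_reason") is None,
        "Spam/ban": row.get("fh_dont_index") != 1 and row.get("ml_spam_score") == 0,
        "Canonical": canonical is None or canonical == "" or canonical == row["src_unparsed"],
    }


# Values taken from the record shown on this page; fh_dont_index is assumed.
crawl_time = datetime.fromtimestamp(1770649579, tz=timezone.utc)
row = {
    "download_http_code": 200,
    "download_stamp": crawl_time,
    "history_drop_reason": None,
    "fh_dont_index": 0,  # assumption: not shown in the table
    "ml_spam_score": 0,
    "meta_canonical": None,
    "src_unparsed": "edu,siue!cs,www,/~marmcke/docs/cs340/hashing.html s443",
}
result = page_info_filters(row, now=crawl_time + timedelta(days=66))
```

With the crawl roughly 2.2 months (about 66 days) before the evaluation time, every check evaluates to true, matching the PASS column.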

Page Details

Property	Value
URL	https://www.cs.siue.edu/~marmcke/docs/cs340/hashing.html
Last Crawled	2026-02-09 15:06:19 (2 months ago)
First Indexed	2025-01-18 07:09:59 (1 year ago)
HTTP Status Code	200
Meta Title	Hashing and Hash Tables — CS 340: Algorithms and Data Structures 1.0 documentation
Meta Description	null
Meta Canonical	null
Boilerpipe Text	Our discussions on trees centered on a data structure that stored items efficiently, but keeping the trees height-balanced was tough to implement. Instead of focusing so much on the structure, hashing takes the approach that the structure should be rather simple. The downside is that iterating over elements does not come for free, as it does in trees, but it is possible with a few tricks. So, for hashing we are looking at simple structures, usually arrays. We will manage the size of the array so that it is not too much bigger than the amount of data stored (like C++ vectors), to preserve iteration in linear time. Hash Tables The basic question is “Why not just use an array as a table?” It's a good question... Let's think about a table containing products that a store wants to keep track of. Here is an example. There are serious problems with this basic approach; what are they? Real World Data Here are some examples of real-world data we may want to store in a table. If we are simply using the numbers as array indexes, then: How big an array do we need? How much of the array will actually get used? Students: Student ID (9 digits) People: SSN (9 digits) ZIP code: 5 digits, 9 digits ISBN: 10 digits UPC: 12 digits Others: Character strings Basic Hash Tables A Hash Table consists of 2 parts: a table (an array), and a hash function that converts key values to array indices (used for insert/delete/search). A hash function can really be anything, but there are some recipes for reliably good ones. Here are a couple of examples that might work out in specific cases: Use certain digits from a long number, e.g. the last 4 digits of the student ID. Will this work at our university? Folding: use some function to get a smaller range of values, e.g. add the digits of the student ID. Will this work at our university? A Basic Hash Table Example Size: 5 Hash function: Add the first and last digits, then mod the result by the table size. 
Here is the table: Insert the following: 349587 (10 % 5 = 0) 98745 (14 % 5 = 4) 84743 (11 % 5 = 1) Now find the same numbers in the hash table (just apply the formula and look for them). Now insert: 24544 (6 % 5 = 1). Collision with 84743! Collisions are a problem, but there are various ways to handle them: Open addressing collision handling methods: Linear probing (look for the next open spot) Quadratic probing Double hashing Increase the table size Multiple-item storage collision handling methods: Buckets Chaining Open Addressing: Linear Probing If there is a collision, just look for the next open slot and insert the item there. Deleting is problematic, since removing an item might break the probe sequence for items that were inserted after it. Instead of actually deleting items, mark them as deleted (lazy delete!). What happens when the hash table fills up? One problem with linear probing is that it can lead to a degenerate situation where items that map to the same portion of a hash table overflow into other parts of the hash table, causing a cascading series of probes for lots of items. Open Addressing: Quadratic Probing Instead of just looking at the next slot for an opening, follow a quadratic sequence of offsets from the original index (1, 4, 9, 16, ...). This resolves some of the clustering problems of linear probing. It can fail with a non-full hash table (but we can make an odd-sized table). What do we do when insertion fails? Open Addressing: Double Hashing (or triple, or quadruple for that matter) If a collision occurs on one hash function, simply use another one. What do we do if a collision occurs on all hash functions? The hash functions can be tried in parallel, although hash functions are typically computationally cheap anyway. Open Addressing: Increase the Table Size Make a new array with more room. How much more room? Insert each item into the new array. Do we reuse the same hash functions? Delete the old array. When should this occur? When insertion fails? When the table is full? When the table is almost full? 
When a certain percentage of probes have occurred? Multi-item Storage: Buckets The idea is simple: just keep room for more than 1 item at each table location. Pick a fixed number, and possibly use an additional open addressing strategy if the bucket fills. Multi-item Storage: Chaining Use a linked list (or other ADT) at each table location. We might need to consider increasing the table size if a list (or lists) gets too long. String Hash Functions String hash functions are a little tougher. Here are some examples: Add the numeric values of the first five characters: burner = 2+21+18+14+5 = 60 scanner = 19+3+1+14+14 = 51 camera = 3+1+13+5+18 = 40 tablet = 20+1+2+12+5 = 40 Values range from 1 to 130 Concatenate the positional values of the first five characters: burner = 2 21 18 14 5 = 22,118,145 scanner = 19 3 1 14 14 = 1,931,414 camera = 3 1 13 5 18 = 3,113,518 tablet = 20 1 2 12 5 = 2,012,125 Values range from 1,048,576 to 28,142,426 Instead, look into production hash functions: MD5, SHA, etc. Good Hash Functions Good hash functions are easy to compute and distribute values evenly throughout the table. Here is a recipe: h(k) = ((f(a,k) + b) % p) % S Where: S is the size of the table. p is a prime number larger than any number that will be hashed (4,294,967,291 is the largest unsigned 32-bit prime. 2,147,483,647 is the largest signed 32-bit prime. 18,446,744,073,709,551,557 is the largest unsigned 64-bit prime. 9,223,372,036,854,775,783 is the largest signed 64-bit prime. NOTE: you MUST BE CAREFUL with integer overflow when performing these calculations!) a and b are positive constants, both less than p. Make sure a ≠ 0. f(a,k) is some interaction of a and k. A good basic choice is f(a,k) = a × k. Multiple hash functions can be easily generated by choosing random values of a and b. One implementation note: intermediate values need to be 64-bit to prevent overflow! This type of hash function belongs to a 2-universal family of hash functions, with a probability of two items colliding ≤ 1/S. 
(just make sure k < p for all k) Let's try this out in class. Load Factor of a Hash Table The load factor for a hash table is numInsertedItems / numLocations. With open addressing, this is between 0 and 1. A high load factor indicates the hash table is almost full, and you might want to think about resizing it. A final note Make sure the hash table size is ODD. Prime numbers help too. No matter what, a hash table doesn’t store data in sorted order! Bloom Filters Bloom filters are one of the coolest data structures around, but they are not used too often. The idea behind a Bloom filter is that we don’t always need to store every item; sometimes we just need to record the fact that the item exists. This is known as membership testing. Question: What are some example applications that might use membership testing? Bloom filters don’t store the actual data; they keep track of a bit string. Also, they do not use a single hash function, but a group of k hash functions. Each hash function addresses a single bit in the array. Example Hash_function1( int ): mod the sum of the first two digits Hash_function2( int ): mod the sum of the 2nd pair of digits Hash_function3( int ): mod the sum of the 3rd pair of digits Now insert: 937789, 932243, 106616 Does this number exist in our set? 134898 Not so fast! The thing about Bloom filters is that a "yes" is not always right, but a "no" is never wrong. Bloom filters may have False Positives: a number appears to be recorded in the filter but was never explicitly put into the filter! This occurs by chance, when other numbers simply cause those bits to be 1. It becomes more likely as the Bloom filter begins to fill up! So what do we do about false positives? CONTROL THEM! We just make the probability of false positives sufficiently small. From http://en.wikipedia.org/wiki/Bloom_filter : [formula images not captured in the extracted text] Or just look at the table: [table image not captured in the extracted text] Question: What are some example applications that might use membership testing now that we have to deal with false positives? 
Example Let's say I want to do existence testing for a set of 1,000,000 items. If an item is a 4-byte integer, that means we would need to store 4,000,000 bytes ~= 3.8 MB of data. We would need roughly twice that to prevent saturation in a hash table, ~= 7.6 MB. What we are storing may be much larger than an integer! In the case of the Bloom filter... Let's say we want a false positive rate of 0.001. According to the table, we need m/n = 15 and k = 7. So, we need 15,000,000 bits ~= 1.8 MB. That’s a quarter of the storage of the hash table and, no matter how large the items are, it will always take 1.8 MB. We just need to make sure that we can come up with 7 independent hash functions! But that's easy with our hash function recipe above! Finally, some properties of Bloom filters: A BF can represent an entire universe of elements, whereas a hash table runs out of space. Also, a hash table must explicitly store all elements. Union and intersection of Bloom filters are possible using bitwise operations on the bit strings (super useful for combining tables in databases without having to rebuild indexes). If the filter becomes saturated, we have to change the m/n ratio. There are types of BFs that grow like vectors! There are tons of variants for all types of applications: Bloomier filters, counting filters, stable Bloom filters, scalable filters. Deletion is a problem! What happens if we try to delete something? Let's practice a little...
Markdown	# Hashing and Hash Tables Our discussions on trees centered on a data structure that stored items efficiently, but keeping the trees height-balanced was tough to implement. Instead of focusing so much on the structure, hashing takes the approach that the structure should be rather simple. The downside is that iterating over elements does not come for free, as it does in trees, but it is possible with a few tricks. So, for hashing we are looking at simple structures, usually arrays. We will manage the size of the array so that it is not too much bigger than the amount of data stored (like C++ vectors), to preserve iteration in linear time. ## Hash Tables The basic question is “Why not just use an array as a table?” It's a good question... Let's think about a table containing products that a store wants to keep track of. Here is an example. There are serious problems with this basic approach; what are they? 
![hSimple.png](https://www.cs.siue.edu/~marmcke/docs/cs340/_images/hSimple.png) Real World Data Here are some examples of real-world data we may want to store in a table. If we are simply using the numbers as array indexes, then: 1. How big an array do we need? 2. How much of the array will actually get used? - Students: Student ID (9 digits) - People: SSN (9 digits) - ZIP code: 5 digits, 9 digits - ISBN: 10 digits - UPC: 12 digits - Others: Character strings ### Basic Hash Tables A Hash Table consists of 2 parts: 1. a table (an array), and 2. a hash function that converts key values to array indices (used for insert/delete/search). A hash function can really be anything, but there are some recipes for reliably good ones. Here are a couple of examples that might work out in specific cases: - Use certain digits from a long number, e.g. the last 4 digits of the student ID. Will this work at our university? - Folding: use some function to get a smaller range of values, e.g. add the digits of the student ID. Will this work at our university? A Basic Hash Table Example Size: 5 Hash function: Add the first and last digits, then mod the result by the table size. Here is the table: ![hEmptyTable.png](https://www.cs.siue.edu/~marmcke/docs/cs340/_images/hEmptyTable.png) Insert the following: 1. 349587 $\rightarrow 10 \% 5 = 0$ 2. 98745 $\rightarrow 14 \% 5 = 4$ 3. 84743 $\rightarrow 11 \% 5 = 1$ Now find the same numbers in the hash table (just apply the formula and look for them). Now insert: - 24544 $\rightarrow 6 \% 5 = 1$. Collision with 84743! Collisions are a problem, but there are various ways to handle them: Open addressing collision handling methods: 1. Linear probing (look for the next open spot) 2. Quadratic probing 3. Double hashing 4. Increase the table size Multiple-item storage collision handling methods: 1. Buckets 2. Chaining ### Open Addressing: Linear Probing If there is a collision, just look for the next open slot and insert the item there. ![hLinearProbe.png](https://www.cs.siue.edu/~marmcke/docs/cs340/_images/hLinearProbe.png) Deleting is problematic, since removing an item might break the probe sequence for items that were inserted after it. Instead of actually deleting items, mark them as deleted (lazy delete!). What happens when the hash table fills up? One problem with linear probing is that it can lead to a degenerate situation where items that map to the same portion of a hash table overflow into other parts of the hash table, causing a cascading series of probes for lots of items. ### Open Addressing: Quadratic Probing Instead of just looking at the next slot for an opening, follow a quadratic sequence of offsets from the original index (1, 4, 9, 16, ...) 
![hQuadProbe.png](https://www.cs.siue.edu/~marmcke/docs/cs340/_images/hQuadProbe.png) This resolves some of the clustering problems of linear probing. It can fail with a non-full hash table (but we can make an odd-sized table). What do we do when insertion fails? ### Open Addressing: Double Hashing (or triple, or quadruple for that matter) If a collision occurs on one hash function, simply use another one. What do we do if a collision occurs on all hash functions? The hash functions can be tried in parallel, although hash functions are typically computationally cheap anyway. ### Open Addressing: Increase the Table Size Make a new array with more room. How much more room? Insert each item into the new array. Do we reuse the same hash functions? Delete the old array. When should this occur? When insertion fails? When the table is full? When the table is almost full? When a certain percentage of probes have occurred? ### Multi-item Storage: Buckets The idea is simple: just keep room for more than 1 item at each table location. Pick a fixed number, and possibly use an additional open addressing strategy if the bucket fills. ### Multi-item Storage: Chaining Use a linked list (or other ADT) at each table location. We might need to consider increasing the table size if a list (or lists) gets too long. 
### String Hash Functions String hash functions are a little tougher. Here are some examples: - Add the numeric values of the first five characters: 1. burner = 2+21+18+14+5 = 60 2. scanner = 19+3+1+14+14 = 51 3. camera = 3+1+13+5+18 = 40 4. tablet = 20+1+2+12+5 = 40 Values range from 1 to 130 - Concatenate the positional values of the first five characters: 1. burner = 2 21 18 14 5 = 22,118,145 2. scanner = 19 3 1 14 14 = 1,931,414 3. camera = 3 1 13 5 18 = 3,113,518 4. tablet = 20 1 2 12 5 = 2,012,125 Values range from 1,048,576 to 28,142,426 Instead, look into production hash functions: MD5, SHA, etc. ### Good Hash Functions Good hash functions are easy to compute and distribute values evenly throughout the table. Here is a recipe: $h(k) = ((f(a,k) + b) \% p) \% S$ Where: - $S$ is the size of the table. - $p$ is a prime number larger than any number that will be hashed (4,294,967,291 is the largest unsigned 32-bit prime. 2,147,483,647 is the largest signed 32-bit prime. 18,446,744,073,709,551,557 is the largest unsigned 64-bit prime. 9,223,372,036,854,775,783 is the largest signed 64-bit prime. NOTE: you MUST BE CAREFUL with integer overflow when performing these calculations!) - $a, b$ are positive constants, both less than $p$. 
Make sure $a \neq 0$. - $f(a,k)$ is some interaction of $a$ and $k$. A good basic choice is $f(a,k) = a \times k$. Multiple hash functions can be easily generated by choosing random values of $a$ and $b$. One implementation note: intermediate values need to be 64-bit to prevent overflow! This type of hash function belongs to a 2-universal family of hash functions, with a probability of two items colliding $\leq \frac{1}{S}$ (just make sure $k < p$ for all $k$). Let's try this out in class. ### Load Factor of a Hash Table The load factor for a hash table is: $\frac{numInsertedItems}{numLocations}$. With open addressing, this is between 0 and 1. 
A high load factor indicates the hash table is almost full, and you might want to think about resizing it. ### A final note Make sure the hash table size is ODD. Prime numbers help too. No matter what, a hash table doesn’t store data in sorted order! ## Bloom Filters Bloom filters are one of the coolest data structures around, but they are not used too often. The idea behind a Bloom filter is that we don’t always need to store every item; sometimes we just need to record the fact that the item exists. This is known as membership testing. Question: What are some example applications that might use membership testing? Bloom filters don’t store the actual data; they keep track of a bit string. Also, they do not use a single hash function, but a group of $k$ hash functions. Each hash function addresses a single bit in the array. Example Hash_function1( int ): mod the sum of the first two digits Hash_function2( int ): mod the sum of the 2nd pair of digits Hash_function3( int ): mod the sum of the 3rd pair of digits Now insert: 937789, 932243, 106616 ![hBloomFilt.png](https://www.cs.siue.edu/~marmcke/docs/cs340/_images/hBloomFilt.png) Does this number exist in our set? 134898 Not so fast! The thing about Bloom filters is that a "yes" is not always right, but a "no" is never wrong. Bloom filters may have False Positives: a number appears to be recorded in the filter but was never explicitly put into the filter! This occurs by chance, when other numbers simply cause those bits to be 1. It becomes more likely as the Bloom filter begins to fill up! 
So what do we do about false positives? CONTROL THEM! We just make the probability of false positives sufficiently small. From <http://en.wikipedia.org/wiki/Bloom_filter>: ![hBloomFP1.png](https://www.cs.siue.edu/~marmcke/docs/cs340/_images/hBloomFP1.png) ![hBloomFP2.png](https://www.cs.siue.edu/~marmcke/docs/cs340/_images/hBloomFP2.png) ![hBloomFP3.png](https://www.cs.siue.edu/~marmcke/docs/cs340/_images/hBloomFP3.png) - We can determine $n$, the number of items we need to store - Then choose an appropriate $m$, compute the optimal $k$, and see if the false positive probability is within acceptable bounds - If not, try new values for $m$ or $n$ Or just look at the table: ![hBloomFP4.png](https://www.cs.siue.edu/~marmcke/docs/cs340/_images/hBloomFP4.png) Question: What are some example applications that might use membership testing now that we have to deal with false positives? Example Let's say I want to do existence testing for a set of 1,000,000 items. 
- If an item is a 4-byte integer, that means we would need to store 4,000,000 bytes ~= 3.8 MB of data - We would need roughly twice that to prevent saturation in a hash table, ~= 7.6 MB - What we are storing may be much larger than an integer! In the case of the Bloom filter... - Let's say we want a false positive rate of 0.001 - According to the table, we need m/n = 15 and k = 7 - So, we need 15,000,000 bits ~= 1.8 MB. That’s a quarter of the storage of the hash table and, no matter how large the items are, it will always take 1.8 MB - We just need to make sure that we can come up with 7 independent hash functions! But that's easy with our hash function recipe above! Finally, some properties of Bloom filters: - A BF can represent an entire universe of elements, whereas a hash table runs out of space. Also, a hash table must explicitly store all elements - Union and intersection of Bloom filters are possible using bitwise operations on the bit strings (super useful for combining tables in databases without having to rebuild indexes) - If the filter becomes saturated, we have to change the $m/n$ ratio. There are types of BFs that grow like vectors! - There are tons of variants for all types of applications: Bloomier filters, counting filters, stable Bloom filters, scalable filters - Deletion is a problem! What happens if we try to delete something? Let's practice a little... © Copyright 2014, M McKenney. 
Created using [Sphinx](http://sphinx-doc.org/) 1.3.
Readable Markdown	null
Shard	50 (laksa)
Root Hash	11131842051709627250
Unparsed URL	edu,siue!cs,www,/~marmcke/docs/cs340/hashing.html s443
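
For readers checking the captured page text, here is a minimal Python sketch of the hash recipe h(k) = ((a·k + b) % p) % S and of a Bloom filter built from k such functions, sized with the m/n = 15, k = 7 figures quoted in the text. The names `make_hash`, `BloomFilter`, and `bloom_bytes` are hypothetical and do not come from the crawled page; keys are assumed to be non-negative integers smaller than p, and Python's arbitrary-precision integers sidestep the 64-bit overflow concern the text raises.

```python
import random

P = 4_294_967_291  # largest prime below 2**32, as quoted in the page text


def make_hash(size, rng):
    """Return h(k) = ((a*k + b) % P) % size for random a, b in [1, P)."""
    a = rng.randrange(1, P)  # a must be non-zero
    b = rng.randrange(1, P)
    return lambda key: ((a * key + b) % P) % size


class BloomFilter:
    """Illustrative Bloom filter over non-negative integer keys < P."""

    def __init__(self, n_items, bits_per_item=15, k=7, seed=0):
        rng = random.Random(seed)
        self.m = n_items * bits_per_item          # total number of bits
        self.bits = bytearray((self.m + 7) // 8)  # bit string, all zeros
        self.hashes = [make_hash(self.m, rng) for _ in range(k)]

    def add(self, key):
        # Set one bit per hash function.
        for h in self.hashes:
            i = h(key)
            self.bits[i // 8] |= 1 << (i % 8)

    def __contains__(self, key):
        # "Maybe present" only if every addressed bit is set.
        for h in self.hashes:
            i = h(key)
            if not self.bits[i // 8] & (1 << (i % 8)):
                return False
        return True


def bloom_bytes(n_items, bits_per_item=15):
    """Storage in bytes for n_items at the given m/n ratio."""
    return n_items * bits_per_item // 8


bf = BloomFilter(n_items=1000)
for key in (937789, 932243, 106616):
    bf.add(key)
```

Inserted keys always test as present, since a Bloom filter has no false negatives; a key that was never inserted, such as 134898, usually tests as absent but may occasionally be a false positive. `bloom_bytes(1_000_000)` gives 1,875,000 bytes, which is the roughly 1.8 MB figure in the example.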