| Property | Value |
|---|---|
| URL | https://ksvi.mff.cuni.cz/~dingle/2021-2/algs/notes_11.html |
| Last Crawled | 2025-08-06 10:49:45 (8 months ago) |
| First Indexed | 2021-12-11 20:09:11 (4 years ago) |
| HTTP Status Code | 200 |
# Introduction to Algorithms, 2021-2

## Week 11: Notes

Some of this week's topics are covered in *Problem Solving with Algorithms* and in *Introduction to Algorithms*. Here are some additional notes.

### hash functions

A hash function maps values of some type T to integers in a
fixed range. Often we will want a hash function that produces
values in the range 0 .. (N – 1), where N is a power of two.
Hash functions are very useful, and are a common building
block in programming and also in theoretical computer
science. In
general, there may be many more possible values of T than integers in
the output range. This means that hash functions will inevitably map
some distinct input values to the same output value; this is called a
*hash collision*.
A good hash function will produce relatively few collisions in practice: even if two input values are similar to each other, they should be unlikely to have the same hash value. An
ideal hash function
will produce hash collisions in practice no more often than would be
expected if it were producing random outputs. As a
first example, let's think about how to construct a hash function
that maps an integer i of any size to a hash value in the range 0 .. (N – 1), for some fixed value of N such as N = 2^32.
There is an obvious choice for the function: we can take i mod N. As a next
example, let's build a hash function that maps a pair
of integers (i,
j) to a hash value in the range 0 .. (N – 1). We might first
consider the hash function (i + j) mod N. However, this is a poor
choice. For example, (0, 2), (1, 1), and (2, 0) will all have the
same hash value, and it's
easy to imagine that we will encounter all of these input values in
practice. A
better choice of hash function would be (Ki + j) mod N for some
constant K. However, some values of K will be better than others. For
example, suppose that N = 2^32, and we choose the constant K = 2^30. Then (0, 0), (4, 0), and (8, 0) will all have the same hash value 0, since

(4 · 2^30) mod 2^32 = 2^32 mod 2^32 = 0

and

(8 · 2^30) mod 2^32 = 2^33 mod 2^32 = 0.

In fact, with this hash function any pair (i, 0) will always have a hash value of the form (i mod 4) · 2^30, i.e. one of 0, 2^30, 2^31, or 3 · 2^30. So all such pairs land on only 4 out of the 2^32 possible output values of the hash function!

In general, we will get the best results if K and N are *relatively prime*, i.e. they have no prime factors in common. Furthermore, we probably don't want K to be too small, to avoid hash collisions for small values of i and j. As one example, the prime number 1,000,003 could be a reasonable choice for K.

### hashing strings

In practice, we will very often want a hash
function that takes strings as
input. Suppose that we want
a hash function that takes strings of characters with ordinal values
from 0 to 255 (i.e. with no fancy Unicode characters) and
produces 32-bit hash values in the range 0 ≤ v < 2^32.
As one idea, we could add the ordinal values of all the characters in the input string:

```python
def hash(s):
    return sum([ord(c) for c in s])
```
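For instance, any two anagrams necessarily collide under this function, since addition is order-independent. A quick check, using the same definition as above:

```python
def hash(s):
    return sum([ord(c) for c in s])

print(hash('listen'))   # 655
print(hash('silent'))   # 655 (same multiset of characters, same sum)
```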
This is a poor hash function. If the input strings are short, then
the output values will always be small integers. Furthermore, two
input strings that contain the same set of characters (a very common
occurrence) will hash to the same number.

Here is one way to construct a better hash function, called *modular hashing*. Given any string s, consider it as a series of digits forming one large number H. For example, if our characters have ordinal values in the range from 0 to 255, we can imagine them to be digits in base 256. Then we can compute H mod N for some constant N, producing a hash value in the range from 0 to N – 1. Here is Python code that implements this idea:

```python
# Generate a hash code in the range 0 .. 2^32 – 1
# DO NOT USE – this hash function is poor (see text below)
def my_hash(s):
    h = 0
    for c in s:
        d = ord(c)   # 0 .. 255
        h = 256 * h + d
    return h % (2 ** 32)
```
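Before looking at its weaknesses, we can sanity-check the digit-combining step on a short string, where the base-256 number is easy to compute by hand (a small illustrative check, reusing the definition above):

```python
def my_hash(s):
    h = 0
    for c in s:
        h = 256 * h + ord(c)
    return h % (2 ** 32)

# 'AB' is the two-digit base-256 number 65 * 256 + 66:
print(my_hash('AB'))   # 16706
```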
As you can see, this code is using the algorithm for combining
digits in any base that we learned in one of the very first
lectures of this class. Unfortunately, this hash function is still poor. We can see that it often maps similar strings to the same hash value:

```python
>>> my_hash('bright')
1768384628
>>> my_hash('light')
1768384628
>>> my_hash('night')
1768384628
```
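These collisions are no accident: 'bright', 'light', and 'night' share the suffix 'ight', and each value above is just the hash of that suffix alone (a quick check, reusing the definition above):

```python
def my_hash(s):
    h = 0
    for c in s:
        h = 256 * h + ord(c)
    return h % (2 ** 32)

print(my_hash('ight'))       # 1768384628, the value shared above
print(my_hash('midnight'))   # 1768384628 again: any string ending in 'ight' collides
```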
The problem is that if we have a number H in base 256, then H mod 2^32 is exactly the last four digits of the number, because 2^32 = (2^8)^4 = 256^4. If that's not obvious, consider the same phenomenon in base 10: 2276345 mod 10000 = 6345, because 10000 = 10^4. And so this hash function only depends on the last four characters in the string.

More generally, if B is the base of our digits (e.g. B = 256 here) and N is the size of the output range (e.g. N = 2^32 here), then we will probably get the best hash behavior if B and N are *relatively prime*.
So
if we want a better hash function, we must change B or N. Assuming
that we want to produce values in a certain given range, we'd like to
keep N as it is. So let's change B. In fact it will probably be best
if B is not too close to a power of two (for number-theoretic reasons
that we won't go into here). A good choice for B might be the prime
number 1,000,003 that we saw above. Let's modify our hash function to
use it:

```python
# Generate a hash code in the range 0 .. 2^32 - 1
def my_hash(s):
    h = 0
    for c in s:
        d = ord(c)   # 0 .. 255
        h = 1_000_003 * h + d
    return h % (2 ** 32)
```
Now we get distinct values for the strings we saw above:

```python
>>> my_hash('bright')
2969542966
>>> my_hash('light')
1569733066
>>> my_hash('night')
326295660
```
To be clear, we are now considering the input string to be a series
of digits in base 1,000,003! This also means that our
input string can now reasonably contain any Unicode characters that
we like. A
disadvantage of the function above is that it computes an integer h
that will be huge if the input string is large, since it encodes all
of the characters in the string. That may be inefficient, and in fact
many programming languages don't support large integers of this sort. However,
we can make a tiny change to the code so that it will compute the
same output values, but be far more efficient. Rather than taking the result mod 2^32 at the end, we can perform it at every step of the calculation:

```python
# Generate a hash code in the range 0 .. 2^32 - 1
def my_hash(s):
    h = 0
    for c in s:
        d = ord(c)   # 0 .. 255
        h = (1_000_003 * h + d) % (2 ** 32)
    return h
```
This function computes the same hash values that the previous version
did:

```python
>>> my_hash('bright')
2969542966
>>> my_hash('light')
1569733066
>>> my_hash('night')
326295660
```
Why does this trick work? A useful mathematical fact is that if
you're performing a series of additions and multiplications and you
want the result mod N, you can actually perform a (mod N) operation
at any step along the way, and you will still get the same result.
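We can spot-check this fact numerically (the specific values below are arbitrary examples):

```python
# Reducing mod N at intermediate steps gives the same result as
# reducing only once at the end of an add/multiply computation.
N = 2 ** 32
a, b, c = 123_456_789_123, 987_654_321, 55_555

at_end = (a * b + c) % N                  # one reduction at the end
stepwise = ((a % N) * (b % N) + c) % N    # reduce at every step

print(at_end == stepwise)   # True
```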
I won't prove this statement here, but ultimately it is true because of this fact, which you might see in a number theory course: if a ≡ b (mod N) and c ≡ d (mod N), then a + c ≡ b + d (mod N) and ac ≡ bd (mod N). In any case, this is
especially useful in lower-level languages such as C that have
fixed-size integers, because arithmetic operations in those languages automatically compute the result mod N for some fixed N (where typically N = 2^32 or N = 2^64).

### hash tables

A hash table is a data structure used to implement either a set of *keys*, or a map from *keys* to *values*.
A hash table is often more efficient than a binary tree (which can also be used to implement a set or map, as we saw
recently). Hash tables are simple, and do not require complex code to
stay balanced as is the case for binary trees. For this reason, hash
tables are very widely used, probably even more so than binary trees
for storing arbitrary maps from keys to values.

The most common method for implementing a hash table is *chaining*, in which the table contains an array of *buckets*. Each bucket contains a *hash chain*, which is a linked list of keys (or key/value pairs in the case of a map). For example, here is a picture of a small hash table with 4 buckets that stores a set of string keys:

*(figure not reproduced)*

In some hash table implementations, the array of
buckets has a fixed size. In others, it can expand dynamically. For
the moment, we will assume that the number of buckets is a constant B.

A hash table requires a hash function h(k) that can map each key to a bucket number. Typically we choose h(k) = h₁(k) mod B, where h₁ is a preexisting hash function that maps keys to larger integers (such as the my_hash function we wrote above). If a key k is present in a hash table, it is always stored in the hash chain in bucket h(k). In other words, the hash function tells us which bucket a key belongs in.

With
this structure, it is straightforward to implement the set operations contains(), add(), and remove().

contains(x) walks down the hash chain in bucket h(x), looking for a node with value x. If it finds such a node in the chain, it returns True, otherwise False.

add(x) will first call contains(x) to check whether the value to be added is already present in the table. If it is not, it will prepend a new node with value x to the hash chain in bucket h(x).

remove(x) will look for a node with value x in the hash chain in bucket h(x). If it finds such a node, it will delete it from the linked list.

Here is a partial implementation in Python of a hash table representing a set of objects. The remove() method is left as an exercise for you.

```python
class Node:
    def __init__(self, key, next):
        self.key = key
        self.next = next

class HashSet:
    def __init__(self, numBuckets):
        # each array element is the head of a linked list of Nodes
        self.a = numBuckets * [None]

    def contains(self, x):
        i = hash(x) % len(self.a)   # hash bucket index
        p = self.a[i]
        while p != None:
            if p.key == x:
                return True
            p = p.next
        return False

    def add(self, x):
        if not self.contains(x):
            i = hash(x) % len(self.a)
            self.a[i] = Node(x, self.a[i])   # prepend to hash chain
```

It is straightforward to extend this hash set implementation to store a map from keys to values: all we have to do is store a key/value pair in each node, just like when we use a binary search tree to store a map.

We'd now like to consider the running time of hash
table operations. Suppose that a hash table has N nodes
in B buckets. Then
its load factor α is defined as α = N / B. This is the
average number of nodes
per bucket, i.e. the average length of each hash chain. Suppose that our hash function distributes keys
evenly among buckets. Then any lookup in a hash table that misses
(e.g. a contains() request for a key that is absent) will effectively be choosing a bucket at random. So it will examine α nodes on average as it walks the bucket's hash chain to look for the key. This shows that
such lookups run in time O(α) on average, independent of N. The analysis is a bit trickier for lookups that
hit, since these are more likely to search a bucket that has a longer
hash chain. Nevertheless it can also be shown that these run in time
O(α) on average if the hash function distributes keys evenly. In
fact all of our hash set
operations (add, contains, remove) will run in O(α) on
average. So we can make hash table operations
arbitrarily fast (on average) by keeping α small, i.e. by using as
many hash buckets as needed. Of course, this supposes that we know in
advance how many items we will be storing in a hash table, so that we
can preallocate an appropriate number of buckets. However, even if that number is not known, we can
dynamically expand a hash table whenever its load factor grows above some fixed limit α₀.
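A sketch of such an expansion, written as a standalone helper for the HashSet class above (the function name is my own; the doubling-and-rehashing procedure it implements is described in the next paragraph):

```python
def grow(table):
    # Allocate a bucket array twice the size of the old one, then move
    # every node into it, recomputing each key's bucket index with
    # respect to the new, larger number of buckets.
    new_a = (2 * len(table.a)) * [None]
    for head in table.a:
        p = head
        while p != None:
            nxt = p.next
            i = hash(p.key) % len(new_a)
            p.next = new_a[i]
            new_a[i] = p          # prepend the node to its new chain
            p = nxt
    table.a = new_a
```

A full implementation would call this from add() whenever the load factor exceeds the chosen limit α₀.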
To grow the table, we allocate a new bucket array, typically twice
the size of the old one. Then we loop over all the nodes in the
buckets in the old array, and insert them into the new array. We must
recompute each key's hash value to find its position in the new
array, which may not be the same as in the previous, smaller array. Suppose that we start with an empty hash table and
insert N values into it, doubling the number of buckets whenever the
load factor exceeds some fixed value α₀. Then how long will it take to insert the N values, as a function of N?

If we exclude the time for the doubling operations, then each insertion operation will run in O(1). That's because each insertion will run in O(α) (since it must traverse an existing hash chain to check whether the value being inserted is already present), and α will always be less than the constant α₀.

Now let's consider the time spent growing the
table. The time to perform each doubling operation is O(M), where M
is the number of elements in the hash table at the moment we perform
the doubling. That's because we must rehash M elements, and each
rehash operation takes O(1) since we can compute a hash value and
prepend an element to a linked list in constant time. If there are
initially k buckets in the hash table, then the first doubling
operation will double the size to 2k. Considering the end of the process, the last doubling operation will double the size of the hash table to N. The second-to-last operation will double it to N / 2, the operation before that will double it to N / 4, and so on. The total doubling time will be

O(k) + O(2k) + … + O(N / 4) + O(N / 2) + O(N) ≤ O(1 + 2 + 4 + … + N / 4 + N / 2 + N) = O(N)

So we can insert N elements in O(N), which means
that insertion takes O(1) *on average*, even as the hash table grows arbitrarily large.

### priority queues

In recent lectures we learned about stacks and
queues, which are abstract data types that we can implement in
various ways, such as using an array or a linked list. A priority queue is another abstract data
type. At a minimum, a priority queue might provide the following methods:

- q.add(value): Add a value to a priority queue.
- q.is_empty(): Return true if the queue is empty.
- q.remove_smallest(): Remove the smallest value from a priority queue and return it.
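To make this interface concrete, here is a deliberately naive min-queue backed by an unsorted Python list (the class name is my own; remove_smallest scans the whole list, so it is far less efficient than the structures mentioned at the end of this section):

```python
class ListPriorityQueue:
    def __init__(self):
        self.items = []

    def add(self, value):
        self.items.append(value)       # O(1)

    def is_empty(self):
        return len(self.items) == 0

    def remove_smallest(self):
        smallest = min(self.items)     # O(N) scan
        self.items.remove(smallest)
        return smallest

q = ListPriorityQueue()
for x in [5, 1, 4]:
    q.add(x)
print(q.remove_smallest())   # 1
print(q.remove_smallest())   # 4
```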
A priority queue differs from a stack and an ordinary queue in the
order in which elements are removed. A stack is last in first out:
the pop function removes the element that was added most
recently. An ordinary queue is first in first out: the dequeue
function removes the element that was added least recently. In a
priority queue, the remove_smallest method removes the
element with the smallest value.

The interface above describes a *min-queue*, in which we can efficiently remove the smallest value. Alternatively we can build a *max-queue*, which has a remove_largest method that removes the largest value; this is more convenient for some applications. There is no fundamental difference between these: any data structure that implements a min-queue can be trivially modified to produce a max-queue, by changing the direction of element comparisons.

In
theory we could implement a priority queue using a binary search
tree. If we did so and the tree was balanced, then add and
remove_smallest would run in time O(log N), where N is the number of elements in the queue. But there are more efficient data structures for implementing priority queues, such as *binary heaps*, which we will discuss in the next lecture.