Overview
A hash table stores key → value pairs and supports expected O(1) search, insert, and delete. It computes an array index from the key using a hash function h(k), then resolves collisions (two keys mapping to the same index) using either separate chaining (lists per bucket) or open addressing (probing within the array). Periodic resizing keeps the load factor small so constant-time performance holds.
Structure Definition
A hash table consists of:
- A bucket array T[0..m-1] (capacity m).
- A hash function h: Key → {0..m-1}.
- A collision policy:
  - Separate chaining: T[i] holds a (possibly empty) list/array of entries with hash i.
  - Open addressing: one entry per slot; on collision, probe other indices in a deterministic sequence.
- Load factor: α = n/m, where n is the number of stored keys. For chaining, α is the expected average chain length. For open addressing, α must be < 1; performance degrades as α → 1.
Core Operations
Separate chaining (with small vectors/lists per bucket)
function SEARCH(T, k):
    i = h(k) mod m
    for (key, val) in T[i]:
        if key == k: return val
    return NOT_FOUND

function INSERT(T, k, v):
    if LOAD_FACTOR(T) > MAX_ALPHA: RESIZE(T, growth_factor)
    i = h(k) mod m
    for (key, _) in T[i]:
        if key == k:               // update existing
            set value; return
    append (k, v) to T[i]

function DELETE(T, k):
    i = h(k) mod m
    remove first entry with key == k in T[i] (if any)

Open addressing (linear/quadratic/double hashing)
We maintain slots with states: EMPTY, FILLED, TOMBSTONE (deleted).
function PROBE_INDEX(h1, h2, i, m, scheme):
    if scheme == LINEAR:    return (h1 + i) mod m
    if scheme == QUADRATIC: return (h1 + c1*i + c2*i*i) mod m
    if scheme == DOUBLE:    return (h1 + i*h2) mod m    // h2 ≠ 0, coprime to m
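Translated to runnable Python, the three schemes fit in one function (the `c1`/`c2` defaults are illustrative choices, not prescribed by the pseudocode):

```python
def probe_index(h1, h2, i, m, scheme, c1=1, c2=3):
    """Return the i-th probe index for the given scheme.
    For "double", h2 must be nonzero and coprime to m."""
    if scheme == "linear":
        return (h1 + i) % m
    if scheme == "quadratic":
        return (h1 + c1 * i + c2 * i * i) % m
    if scheme == "double":
        return (h1 + i * h2) % m
    raise ValueError("unknown scheme: " + scheme)
```

When h2 is coprime to m, the double-hashing sequence visits every slot exactly once over m probes, which is what guarantees an insert can always find a free slot while α < 1.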
function SEARCH(T, k):
    h1 = h(k) mod m
    h2 = h2_fn(k, m)                 // used by DOUBLE only; otherwise ignored
    for i in 0..m-1:
        j = PROBE_INDEX(h1, h2, i, m, scheme)
        if T[j] is EMPTY: return NOT_FOUND
        if T[j] is FILLED and T[j].key == k: return T[j].val
    return NOT_FOUND
function INSERT(T, k, v):
    if LOAD_FACTOR(T) > MAX_ALPHA: RESIZE_REHASH(T, growth_factor)
    h1 = h(k) mod m; h2 = h2_fn(k, m)
    first_tomb = NONE
    for i in 0..m-1:
        j = PROBE_INDEX(h1, h2, i, m, scheme)
        if T[j] is FILLED:
            if T[j].key == k: T[j].val = v; return       // update existing
        else if T[j] is TOMBSTONE:
            if first_tomb is NONE: first_tomb = j        // remember first reusable slot
        else:                                            // EMPTY
            place (k, v) at (first_tomb if set else j); return
    if first_tomb is not NONE: place (k, v) at first_tomb; return   // no EMPTY slot left
    error "table full"                                   // cannot happen after resize
function DELETE(T, k):
    h1 = h(k) mod m; h2 = h2_fn(k, m)
    for i in 0..m-1:
        j = PROBE_INDEX(h1, h2, i, m, scheme)
        if T[j] is EMPTY: return                         // not found
        if T[j] is FILLED and T[j].key == k:
            mark T[j] = TOMBSTONE; return

Tip
For open addressing, do not shift elements after a deletion—use a tombstone so later probes don’t stop early. Tombstones are reclaimed during periodic rehashing.
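The full open-addressing life cycle, tombstones included, can be sketched in Python. This is an illustrative fixed-capacity table with linear probing and no resizing (the class and sentinel names are ours); integer keys keep the example deterministic under Python's built-in hash:

```python
EMPTY = object()      # never-used slot
TOMBSTONE = object()  # deleted slot; probes must continue past it

class OpenTable:
    """Minimal open-addressing map: linear probing, tombstone deletes,
    fixed capacity, no resizing. A sketch, not production code."""

    def __init__(self, m=8):
        self.m = m
        self.slots = [EMPTY] * m

    def _probe(self, key):
        h1 = hash(key) % self.m
        for i in range(self.m):
            yield (h1 + i) % self.m

    def get(self, key, default=None):
        for j in self._probe(key):
            s = self.slots[j]
            if s is EMPTY:
                return default               # chain ends: key absent
            if s is not TOMBSTONE and s[0] == key:
                return s[1]
        return default

    def put(self, key, val):
        first_tomb = None
        for j in self._probe(key):
            s = self.slots[j]
            if s is EMPTY:                   # reuse a tombstone if we saw one
                self.slots[first_tomb if first_tomb is not None else j] = (key, val)
                return
            if s is TOMBSTONE:
                if first_tomb is None:
                    first_tomb = j
            elif s[0] == key:
                self.slots[j] = (key, val)   # update existing key
                return
        if first_tomb is not None:           # no EMPTY slot left, only tombstones
            self.slots[first_tomb] = (key, val)
            return
        raise RuntimeError("table full")

    def delete(self, key):
        for j in self._probe(key):
            s = self.slots[j]
            if s is EMPTY:
                return                       # key absent
            if s is not TOMBSTONE and s[0] == key:
                self.slots[j] = TOMBSTONE    # keep probe chains intact
                return
```

Deleting a key leaves a tombstone, so a later search for a key further along the same probe chain still succeeds; a subsequent insert reuses the tombstone's slot.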
Example (Stepwise)
Chaining. Let m = 5 and insert keys {17, 42, 9, 14} with h(k) = k mod 5. Since 17 mod 5 = 2, 42 mod 5 = 2, 9 mod 5 = 4, and 14 mod 5 = 4, the buckets after all inserts are B[0]={}, B[1]={}, B[2]={17, 42}, B[3]={}, B[4]={9, 14}. Searching for 42 hashes to index 2 and scans B[2]'s chain. The load factor (average chain length) is α = n/m = 4/5 = 0.8.
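The bucket layout can be reproduced in a few lines of Python (the function name is ours):

```python
def bucket_layout(keys, m):
    """Distribute keys into chaining buckets using h(k) = k mod m."""
    buckets = [[] for _ in range(m)]
    for k in keys:
        buckets[k % m].append(k)
    return buckets

B = bucket_layout([17, 42, 9, 14], 5)
# B[2] holds 17 and 42; B[4] holds 9 and 14
```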
Open addressing (linear probing, m=7). Insert keys {10, 24, 31, 18} with h(k)=k mod 7.
- 10 → 3: slot 3 is empty; place at 3.
- 24 → 3: collides with 10; probe to 4 (place).
- 31 → 3: collides; probe 4 (full), then 5 (place).
- 18 → 4: collides; probe 5 (full), then 6 (place).

Notice the cluster forming at indices 3..6.
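The same walkthrough as runnable Python (the function name is ours):

```python
def linear_probe_insert(keys, m):
    """Insert keys into an open-addressed table with h(k) = k mod m and
    linear probing; assumes a free slot always exists."""
    table = [None] * m
    for k in keys:
        j = k % m
        while table[j] is not None:   # probe forward until an empty slot
            j = (j + 1) % m
        table[j] = k
    return table

t = linear_probe_insert([10, 24, 31, 18], 7)
# slots 3..6 end up as 10, 24, 31, 18: the cluster described above
```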
Complexity and Performance
Assume a good hash function and simple uniform hashing.
- Expected time: search, insert, and delete are O(1) expected for both chaining and open addressing when the load factor α is bounded (α < 1 for open addressing; α modest for chaining).
- Worst case: O(n) if all keys land in one bucket or if probing scans the entire table (adversarial inputs or a poor h).
- Space: Θ(m + n) for chaining (extra node overhead); Θ(m) for open addressing (no pointers, but must keep α < 1).
Clustering effects (open addressing).
- Linear probing suffers primary clustering: contiguous runs of filled slots get longer.
- Quadratic probing mitigates primary clustering but can suffer secondary clustering (same h1 → same probe pattern).
- Double hashing reduces both by using a second, key-dependent step size.
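A small simulation makes the clustering difference visible (illustrative code, our names; m is a power of two so an odd h2 step is coprime to m). At high load, linear probing typically needs several times more slot inspections per insert than double hashing over the same keys:

```python
import random

def avg_insert_probes(keys, m, scheme):
    """Insert keys into an open-addressed table and return the mean
    number of slot inspections per insert (illustrative benchmark)."""
    table = [None] * m
    total = 0
    for k in keys:
        h1 = k % m
        h2 = (k // m) | 1          # odd, nonzero step for double hashing
        for i in range(m):
            j = (h1 + i) % m if scheme == "linear" else (h1 + i * h2) % m
            if table[j] is None:
                table[j] = k
                total += i + 1
                break
    return total / len(keys)

random.seed(1)
m = 256
keys = random.sample(range(10**6), 240)   # load factor ~0.94
lin = avg_insert_probes(keys, m, "linear")
dbl = avg_insert_probes(keys, m, "double")
```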
Optimizations or Variants
- Resizing policies. Grow m when α > MAX_ALPHA (e.g., 0.75 for open addressing; 1–2+ for chaining). For chaining, you can also shrink when α falls below a threshold to save memory.
- Cache-friendly chains. Use small fixed arrays ("stash") or a flat vector per bucket instead of linked lists to improve locality.
- Robin Hood hashing (open addressing). On insert, if the new key's probe distance exceeds the resident's, swap them; this evens out cluster depth and reduces the variance of search time.
- Cuckoo hashing. Maintain two (or more) hash functions and move items between two tables on collisions; gives O(1) lookups and high load, but requires relocation logic and a stash for cycles.
- Hopscotch hashing. Keeps items near their ideal bucket using bitmaps; excellent locality on CPUs.
- Instrumentation. Track average probe length or chain length; trigger a rehash before tail latency spikes.
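Robin Hood insertion is compact enough to sketch. Below, slots are either None or (key, value, probe_distance) triples, with linear probing and no resizing; this is an illustration of the swap rule, not a complete table:

```python
def rh_insert(table, key, val):
    """Robin Hood insert: when the incoming entry has probed farther than a
    slot's resident, the "poorer" entry takes the slot and the evicted
    resident continues probing. Caller must keep the table below capacity."""
    j = hash(key) % len(table)
    d = 0                                   # incoming entry's distance from home
    while True:
        slot = table[j]
        if slot is None:
            table[j] = (key, val, d)
            return
        if slot[0] == key:
            table[j] = (key, val, slot[2])  # update in place, keep distance
            return
        if slot[2] < d:                     # resident is "richer" (closer to home)
            table[j] = (key, val, d)        # poorer incoming entry takes the slot
            key, val, d = slot              # evicted resident keeps probing
        j = (j + 1) % len(table)
        d += 1
```

Because every displacement moves a richer entry, probe distances stay tightly grouped, which is what flattens the tail of search times.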
Applications
- Dictionaries/maps/sets in language runtimes (Python dict, Java HashMap, C++ unordered_map).
- Memoization tables in algorithms (e.g., dynamic programming caches).
- Symbol tables in compilers/interpreters.
- Deduplication (tracking seen IDs), frequency counts, and joins in databases (hash join).
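In Python, two of these applications are one-liners over the built-in hash-backed dict and set types:

```python
from collections import Counter  # a dict subclass: hash-table-backed multiset

words = "to be or not to be".split()

freq = Counter(words)  # frequency counts: key -> number of occurrences
seen = set()           # hash-based deduplication, preserving first-seen order
unique = [w for w in words if not (w in seen or seen.add(w))]
```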
Common Pitfalls or Edge Cases
Warning
Poor hash functions. A simple h(k) = k mod m for string-like or patterned keys causes clustering. Prefer mixing hash functions with good avalanche properties (e.g., MurmurHash3, xxHash), or SipHash for adversarial inputs.
Warning
Forgetting equality semantics. If two keys compare equal, their hashes must be equal (consistency with ==). Also ensure the hash depends on the same fields as equality.
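In Python, for example, this means defining __eq__ and __hash__ over the same fields (the class is illustrative):

```python
class Point:
    """Key type whose hash and equality agree on the same fields (x, y)."""
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __eq__(self, other):
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        # Hash exactly the fields __eq__ compares, so equal points hash equal.
        return hash((self.x, self.y))

d = {Point(1, 2): "home"}
```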
Warning
Changing keys in place. In open addressing, mutating a key after insertion corrupts probe invariants; never modify keys stored in the table.
Warning
Deletion without tombstones. In open addressing, clearing a slot to EMPTY breaks probe chains; use TOMBSTONE and compact on rehash.
Warning
Load factor too high. As α → 1 in open addressing, probe lengths blow up. Resize earlier (e.g., target α ≤ 0.7–0.8) to keep performance flat.
Implementation Notes or Trade-offs
Choosing m and hash functions
- Pick m as a power of two (index with a bitmask) or a prime near a power of two.
- For power-of-two m, mix high bits into the low bits (e.g., a final XOR/rotate) so the low bits aren't constant across many keys.
- For double hashing, ensure h2(k) is nonzero and odd (hence coprime to m for power-of-two m) so the probe sequence visits all slots.
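A quick check of the coprimality requirement (the function name is ours): a probe sequence with step size s over a table of size m reaches exactly m/gcd(s, m) distinct slots, so an even step on a power-of-two table can never visit them all.

```python
def probe_coverage(h1, step, m):
    """Distinct slots visited by (h1 + i*step) mod m over a full cycle."""
    return {(h1 + i * step) % m for i in range(m)}
```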
Resizing and rehashing
function RESIZE_REHASH(T, growth_factor):
    old = T
    m2 = choose_capacity(ceil(growth_factor * old.m))
    T = new_table(m2)
    for slot in old:
        if slot is FILLED:
            INSERT(T, slot.key, slot.val)   // rehash with the new m
    return T

For chaining, rehash all entries bucket by bucket. For open addressing, avoid copying raw slots; reinsert each FILLED entry to rebuild proper probe sequences and drop tombstones.
Memory and locality
- Chaining: pointer-based lists hurt locality; prefer small vectors or intrusive nodes stored in arenas.
- Open addressing: excellent locality (a single array), but deletions and high α are trickier.
- Strings/large keys: store a short fingerprint (e.g., 8–16 bits) in the slot to speed disambiguation before full key comparisons.
Concurrency
- Coarse locks (one lock per table) are simple but contend.
- Striped locks (one per bucket range) enable parallelism in chaining.
- Lock-free tables exist (specialized), but are complex; start with coarse or striped locking and measure.
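A striped-lock chaining map can be sketched as follows (class name, stripe count, and structure are illustrative; Python's GIL limits real parallelism here, so treat this as a shape for other runtimes):

```python
import threading

class StripedMap:
    """Chaining hash map with one lock per stripe of buckets."""
    def __init__(self, m=64, stripes=8):
        self.m = m
        self.buckets = [[] for _ in range(m)]
        self.locks = [threading.Lock() for _ in range(stripes)]

    def _lock_for(self, i):
        # Bucket i is guarded by lock (i mod stripes); buckets in other
        # stripes can be updated concurrently.
        return self.locks[i % len(self.locks)]

    def put(self, key, val):
        i = hash(key) % self.m
        with self._lock_for(i):
            bucket = self.buckets[i]
            for n, (k, _) in enumerate(bucket):
                if k == key:
                    bucket[n] = (key, val)   # update existing key
                    return
            bucket.append((key, val))

    def get(self, key, default=None):
        i = hash(key) % self.m
        with self._lock_for(i):
            for k, v in self.buckets[i]:
                if k == key:
                    return v
            return default
```

Two threads touching buckets in different stripes never contend; only same-stripe operations serialize.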
Summary
Hash tables give near-constant-time operations by mapping keys to array indices and controlling collisions. Choose between chaining (simpler deletes, tolerant of high α) and open addressing (better locality, needs careful deletion and lower α). A good hash function, a sensible resize policy, and attention to probe/chain lengths keep performance predictable. For heavy-duty workloads, consider modern variants like Robin Hood or cuckoo hashing to tame tail latencies and improve cache behavior.