Basic Concept of HashMap
Prerequisite Knowledge
Before reading this article, you should first learn the basics of arrays and linked lists.
Firstly, I need to clarify a common misconception among beginners.
Are hash tables and what we commonly refer to as Maps (key-value mappings) the same thing? No, they are not.
This can be clearly explained using Java. `Map` is a Java interface that only declares several methods without providing concrete implementations:
interface Map<K, V> {
    V get(K key);
    void put(K key, V value);
    V remove(K key);
    // ...
}
The `Map` interface itself only defines a set of key-value mapping operations. Data structures like `HashMap` implement these operations according to their own characteristics. Other data structures such as `TreeMap` and `LinkedHashMap` also implement this interface.
In other words, you can say that the `get`, `put`, and `remove` methods of `HashMap` have O(1) complexity, but you cannot say the same for the `Map` interface. If you switch to another implementation class, such as `TreeMap`, which uses a binary tree structure underneath, these methods have O(logN) complexity.
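To make this concrete, here is a minimal Java example (a sketch for illustration): the code talking to the `Map` interface is identical for both implementations; only the underlying complexity differs.

import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class MapDemo {
    public static void main(String[] args) {
        // HashMap: hash table underneath, O(1) get/put/remove on average
        Map<String, Integer> hashMap = new HashMap<>();
        // TreeMap: red-black tree underneath, O(logN) get/put/remove,
        // but it keeps keys in sorted order
        Map<String, Integer> treeMap = new TreeMap<>();

        hashMap.put("apple", 1);
        treeMap.put("apple", 1);

        // The calls look identical; only the complexity differs
        System.out.println(hashMap.get("apple")); // 1
        System.out.println(treeMap.get("apple")); // 1
    }
}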
Why am I emphasizing this? Mainly for readers using non-Java languages.
Other programming languages might not have interface definitions as clear as Java's, which can easily lead readers to confuse hash tables with key-value maps. Hearing about key-value operations, one might incorrectly assume that their complexity is always O(1). This is not necessarily true; it depends on how the underlying data structure implements the key-value operations.
In this section, I will guide you through implementing a hash table, explore why hash tables can achieve O(1) complexity for insertion, deletion, and lookup, and discuss two methods for resolving hash collisions.
Basic Principles of Hash Tables
Easily Explained in a Few Sentences
A hash table can be understood as an enhanced version of an array.
An array can find the element at a given index in O(1) time, where the index is a non-negative integer.
A hash table is similar; it can locate the `value` corresponding to a `key` in O(1) time. The `key` can be of various types, such as numbers or strings.
How is this done? It's quite simple. The underlying implementation of a hash table is an array (call it `table`). A hash function (call it `hash`) first converts the `key` into an array index; after that, insertion, deletion, and lookup work just like they do on arrays:
// Pseudocode logic for a hash table
class MyHashMap<K, V> {

    private Object[] table;

    // Insert/Update, complexity O(1)
    public void put(K key, V value) {
        int index = hash(key);
        table[index] = value;
    }

    // Retrieve, complexity O(1)
    public V get(K key) {
        int index = hash(key);
        return (V) table[index];
    }

    // Delete, complexity O(1)
    public void remove(K key) {
        int index = hash(key);
        table[index] = null;
    }

    // Hash function, converting key into a valid index in the table.
    // Its time complexity must be O(1) to ensure
    // that the above methods all have O(1) complexity
    private int hash(K key) {
        // ...
    }
}
// Pseudocode logic for a hash table
class MyHashMap {
 private:
    vector<void*> table;

 public:
    // Insert/Update, complexity O(1)
    void put(auto key, auto value) {
        int index = hash(key);
        table[index] = value;
    }

    // Search, complexity O(1)
    auto get(auto key) {
        int index = hash(key);
        return table[index];
    }

    // Delete, complexity O(1)
    void remove(auto key) {
        int index = hash(key);
        table[index] = nullptr;
    }

 private:
    // Hash function, convert key to a valid index in the table.
    // Its time complexity must be O(1) to ensure the above methods have O(1) complexity
    int hash(auto key) {
        // ...
    }
};
# Hash table pseudocode logic
class MyHashMap:
    def __init__(self):
        self.table = [None] * 1000

    # add/update, complexity O(1)
    def put(self, key, value):
        index = self.hash(key)
        self.table[index] = value

    # get, complexity O(1)
    def get(self, key):
        index = self.hash(key)
        return self.table[index]

    # remove, complexity O(1)
    def remove(self, key):
        index = self.hash(key)
        self.table[index] = None

    # hash function, convert key into a valid index in the table;
    # its time complexity must be O(1) to ensure the above methods have O(1) complexity
    def hash(self, key):
        # ...
        pass
// Pseudocode logic for a hash table
type MyHashMap struct {
    table []interface{}
}

func Constructor() MyHashMap {
    return MyHashMap{
        table: make([]interface{}, 10000),
    }
}

// Add/Update, complexity O(1)
func (h *MyHashMap) Put(key, value interface{}) {
    index := h.hash(key)
    h.table[index] = value
}

// Get, complexity O(1)
func (h *MyHashMap) Get(key interface{}) interface{} {
    index := h.hash(key)
    return h.table[index]
}

// Remove, complexity O(1)
func (h *MyHashMap) Remove(key interface{}) {
    index := h.hash(key)
    h.table[index] = nil
}

// Hash function, convert key to a valid index in the table.
// Its time complexity must be O(1) to ensure the above methods are O(1)
func (h *MyHashMap) hash(key interface{}) int {
    // ...
}
// Hash table pseudocode logic
class MyHashMap {
    constructor() {
        this.table = new Array(1000).fill(null);
    }

    // Add/Update, complexity O(1)
    put(key, value) {
        const index = this.hash(key);
        this.table[index] = value;
    }

    // Get, complexity O(1)
    get(key) {
        const index = this.hash(key);
        return this.table[index];
    }

    // Remove, complexity O(1)
    remove(key) {
        const index = this.hash(key);
        this.table[index] = null;
    }

    // Hash function, convert key to a valid index in the table.
    // Its time complexity must be O(1) to ensure the above methods are O(1)
    hash(key) {
        // ...
    }
}
There are many details to handle in the actual implementation, such as the design of the hash function and the handling of hash collisions. But if you understand the core principles mentioned above, you're already halfway there. The rest is just writing the code, which shouldn't be too hard, right?
Next, let's specifically introduce some key concepts and potential issues in the add, delete, search, and update processes mentioned above.
Key Concepts and Principles
`key` Is Unique, `value` Can Be Duplicated
In a hash table, there cannot be two identical `key`s, but `value`s can be duplicated.
Understanding this principle is straightforward if you compare it to an array:
Each index in an array is unique; you cannot have an array with two index 0s. As for what elements are stored in the array, it is up to you, and no one cares.
Similarly, in a hash table, `key`s cannot be duplicated, while the `value` can be anything.
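For example, in Java (a quick demo with the standard HashMap):

Map<String, Integer> map = new HashMap<>();
map.put("a", 1);
map.put("b", 1);   // duplicate values are fine
map.put("a", 100); // same key again: the old value is overwritten, not duplicated
System.out.println(map.get("a")); // 100
System.out.println(map.size());   // 2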
Hash Function
The purpose of a hash function is to convert an input of any length (key) into a fixed-length output (index).
As you can see, insertion, deletion, lookup, and update all rely on the hash function to compute an index. If the hash function you design has O(N) complexity, the performance of all these operations degrades to O(N). Therefore, the performance of this function is crucial.
Another requirement for this function is that it must produce the same output for the same input `key` to ensure the correctness of the hash table. You cannot compute `hash("123") = 5` now and `hash("123") = 6` later, as this would render the hash table useless.
How does a hash function convert a non-integer `key` into an integer index? And how does it ensure that this index is valid?
How to Convert `key` into an Integer
There are various ways to address this problem, as different hash function designs may use different methods. I will describe a simple approach using the Java language. Similar methods can be applied in other programming languages by consulting relevant standard library documentation.
Every Java object has an `int hashCode()` method. If you do not override this method in a custom class, the default return value can be considered the memory address of the object, which is a globally unique integer.
Therefore, by calling the `hashCode()` method of a `key`, you effectively convert the `key` into an integer, and this integer is globally unique.
Of course, this method has some issues, which will be explained later, but for now, we have at least found a way to convert any object into an integer.
How to Ensure Valid Indexing
The `hashCode` method returns an int, but the issue here is that this int value may be negative, whereas array indices must be non-negative.
You might think of writing code to convert this value into a non-negative number:
int h = key.hashCode();
if (h < 0) h = -h;
However, this approach has a problem. The smallest value an int can represent is -2^31, and the largest is 2^31 - 1. So if h = -2^31, then -h = 2^31 exceeds the maximum value an int can hold, causing integer overflow. This is not a compile error; in Java the negation silently wraps around, and the result is still negative, which is an invalid index.
Why is the minimum value of an int -2^31 but the maximum only 2^31 - 1? This comes from two's complement encoding in computers. Simply put, an int has 32 binary bits, where the highest (leftmost) bit is the sign bit: 0 for positive numbers and 1 for negative numbers.
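You can verify the overflow behavior yourself; in Java the negation silently wraps around instead of raising any error:

int h = Integer.MIN_VALUE;       // -2^31
System.out.println(-h);          // still -2147483648, the negation overflowed
System.out.println(Math.abs(h)); // also -2147483648, for the same reason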
The challenge is to make `h` non-negative without directly negating it. A straightforward way is to exploit two's complement encoding and simply set the highest (sign) bit to 0, which guarantees `h` is non-negative.
int h = key.hashCode();
// Bitwise operation, remove the highest bit (sign bit)
// Additionally, bitwise operations are generally faster than arithmetic operations
// Therefore, you will see that the standard library
// source code prefers bitwise operations where possible
h = h & 0x7fffffff;
// The binary representation of 0x7fffffff is 0111 1111 ... 1111
// That is, all bits are 1 except the highest bit (sign bit) which is 0
// By performing a bitwise AND with 0x7fffffff, the highest
// bit (sign bit) is cleared, ensuring h is non-negative
I won't go into the principles of two's complement encoding in detail here. If you're interested, you can look it up and study it on your own.
Alright, the issue of `hashCode` potentially being negative is resolved. But there's another problem: this hash value is usually very large, and we need to map it to a valid index in the `table` array.
This shouldn't be difficult for you. Previously, in Circular Array Principles and Implementation, we used the `%` modulus operation to keep an index within the valid range of the array. We can use the same `%` trick here to ensure the index is valid. The complete `hash` function implementation is as follows:
int hash(K key) {
    int h = key.hashCode();
    // ensure the number is non-negative
    h = h & 0x7fffffff;
    // map it to a valid index of the table array
    return h % table.length;
}
Of course, using `%` directly has its own issue: the modulus operation is relatively expensive. Standard libraries that prioritize runtime efficiency usually avoid `%` in favor of bitwise operations to improve performance.
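For instance, if the capacity of `table` is always kept at a power of 2, the modulus can be replaced with a bitwise AND. This is the kind of trick Java's HashMap uses; below is a minimal sketch assuming `table.length` is a power of 2 (not the actual library source):

// assuming table.length is a power of 2, e.g., 16 (binary 10000),
// then table.length - 1 is 15 (binary 01111), a mask of low-order 1s;
// h & (table.length - 1) keeps only the low bits of h,
// which is equivalent to h % table.length but faster
int index = h & (table.length - 1);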
However, the main purpose of this chapter is to help you understand how to implement a simple hash table, so we will ignore these optimization details. If you're interested, you can check out the source code of Java's HashMap to see how it implements the `hash` function.
Hash Collision
Above, we discussed the implementation of the `hash` function. Naturally, you might wonder: what happens if two different `key`s produce the same index through the hash function? This situation is known as a "hash collision."
Can Hash Collisions Be Avoided?
Hash collisions cannot be avoided; they can only be handled properly at the algorithm level.
Hash collisions are inevitable because the `hash` function essentially maps an infinite input space onto a finite index space, so different `key`s are bound to map to the same index sooner or later.
It's similar to how a three-dimensional object casts a two-dimensional shadow. This lossy compression necessarily results in information loss, and lossy information cannot correspond one-to-one with the original information.
How do we resolve hash collisions? There are two common solutions: chaining and linear probing (also often called open addressing).
The names might sound fancy, but they essentially boil down to two approaches: vertical extension and horizontal extension.
In chaining, the underlying array of the hash table doesn't store `value` types directly but stores linked lists. When multiple different `key`s map to the same index, their `key -> value` pairs are all stored in the linked list at that index, thereby resolving hash collisions.
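As a preview, the underlying storage for chaining might look like the sketch below; the `KVNode` name is just for illustration, and the full implementation comes in a later chapter:

// Sketch: each table slot stores the head of a singly linked list
class KVNode<K, V> {
    K key;
    V value;
    // next pair whose key hashed to the same index
    KVNode<K, V> next;

    KVNode(K key, V value) {
        this.key = key;
        this.value = value;
    }
}

class ChainingHashMapSketch<K, V> {
    // table[i] holds all key-value pairs whose keys hash to index i;
    // get(key) walks the list at table[hash(key)], comparing keys
    private KVNode<K, V>[] table;
}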
In linear probing, if a `key` finds that its computed `index` is already occupied by another `key`, it checks the position at `index + 1`. If that spot is also occupied, it keeps checking subsequent positions until it finds an empty slot.
For example, in the image above, the keys are inserted in the order `k2, k4, k5, k3, k1`, resulting in the hash table layout shown.
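As a preview, the probing loop for insertion might look like this sketch, reusing the hypothetical `KVNode` type from above and assuming the table always keeps at least one empty slot:

// Sketch of linear-probing insertion; assumes the table is never full
void put(K key, V value) {
    int index = hash(key);
    // move right (wrapping around) until we find an empty slot
    // or the slot that already holds this key
    while (table[index] != null && !table[index].key.equals(key)) {
        index = (index + 1) % table.length;
    }
    table[index] = new KVNode<>(key, value);
}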
Here, I'll explain the principles first. In the following chapters, I will walk you through the implementation of these two methods to resolve hash collisions.
Capacity Expansion and Load Factor
You might have heard the term "load factor" before. Now that you understand the issue of hash collisions, you can grasp the significance of the load factor.
While chaining and linear probing methods can resolve hash collisions, they may lead to performance degradation.
For instance, with chaining, you compute `index = hash(key)` and find a linked list at that index. You then have to traverse this list to locate the desired `value`. That traversal takes O(K) time, where K is the length of the linked list.
Linear probing is similar. You compute `index = hash(key)` and check that slot, but if the stored key isn't the one you're looking for, you can't conclude that the key doesn't exist; you must keep probing until you hit an empty slot or the key itself. This process also takes O(K) time, where K is the number of consecutive probes.
Therefore, frequent hash collisions increase the value of K, causing a significant drop in hash table performance. This is something we aim to avoid.
Why do hash collisions occur frequently? There are two reasons:

1. A poorly designed hash function produces an uneven distribution of `key` hash values, causing many `key`s to map to the same index.
2. The hash table is too full; collisions become inevitable even with a perfect hash function.
There's little to say about the first issue; language standard library developers have already designed the hash functions for you. You just need to use them.
The second issue is controllable: avoid overfilling the hash table. This is where the concept of the "load factor" comes in.
Load Factor
The load factor measures how full a hash table is. Generally, a higher load factor means more `key-value` pairs relative to capacity, which increases the likelihood of hash collisions and degrades hash table performance.
The formula for the load factor is straightforward: `size / table.length`, where `size` is the number of `key-value` pairs in the hash table and `table.length` is the capacity of the underlying array.
You'll find that a hash table implemented with chaining can have an infinitely large load factor because linked lists can extend indefinitely; with linear probing, the load factor cannot exceed 1.
In Java's HashMap, you can set the load factor when creating a hash table. If not set, it defaults to 0.75, an empirical value that generally works fine.
When the load factor of a hash table reaches its threshold, the hash table expands. This is similar to the explanation in Dynamic Array Implementation: the capacity of the underlying `table` array is increased, and the data is moved into the new, larger array. The `size` stays the same, `table.length` increases, and the load factor decreases.
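In code, the trigger might look like the sketch below, where `resize` is a hypothetical helper that allocates a larger array and re-inserts every pair:

// Sketch: check the load factor after each insertion
void putAndMaybeResize(K key, V value) {
    put(key, value);
    size++;
    // 0.75 mirrors the default load factor of Java's HashMap
    if ((double) size / table.length >= 0.75) {
        // double the capacity; every pair must be rehashed,
        // because its index depends on the new table.length
        resize(table.length * 2);
    }
}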
Why You Shouldn't Rely on Hash Table Traversal Order
You may have heard the programming rule that the order of keys in a hash table is unordered, and you shouldn't rely on traversal order when writing programs. But why is that?
Traversing a hash table essentially means traversing the underlying `table` array:
// pseudocode logic for iterating all keys
// the underlying table array of the hash table
KVNode[] table = new KVNode[1000];

// get all keys in the hash table;
// we cannot rely on the order of this key list
List<KeyType> keys = new ArrayList<>();
for (int i = 0; i < table.length; i++) {
    KVNode node = table[i];
    if (node != null) {
        keys.add(node.key);
    }
}
If you understand the previous content, you should be able to grasp this problem.
First, because the `hash` function scatters your `key`s, the distribution of `key`s in the underlying `table` array is essentially random, unlike arrays or linked lists, which have a well-defined element order.
Second, what happens when a hash table reaches its load-factor threshold? It resizes, right? This means `table.length` changes and elements are moved around.
During this rehashing process, the `hash` function recomputes the index for each `key` and places it into the new `table` array.
The index computed by the `hash` function depends on `table.length` (recall `h % table.length`). So after a hash table automatically resizes, the index of the same `key` may change, meaning the slot where a `key-value` pair is stored in `table` may change too, and the traversal order becomes different from before.
The phenomenon you observe is that the first key in one traversal is `key1`, but after adding or removing a few elements, `key1` might end up at the end.
So, there's no need to memorize these details. Once you understand the principle, you can figure it out with a bit of reasoning.
Why It Is Not Recommended to Add/Delete Hash Table Keys in a For Loop
Note that I say it is not recommended, not that it is absolutely forbidden. Different programming languages implement hash tables in their standard libraries differently. Some languages have optimizations for this situation, so whether it is feasible depends on the documentation.
From the perspective of hash table principles, adding or deleting `key`s in a for loop can easily cause problems, for essentially the reason given above: indices change when the table resizes.
Iterating over the `key`s of a hash table essentially means traversing the underlying `table` array. If you add or remove elements while iterating, and an insertion or deletion triggers resizing halfway through, the entire `table` array changes. What should the behavior be then? Moreover, should elements inserted or deleted during the iteration be visited by the traversal?
The change of `key` order caused by resizing is a behavior unique to hash tables. However, even setting that aside, it is generally not recommended to add or delete elements in any data structure while iterating over it, as this often leads to unexpected behavior.
If you must do so, make sure to consult the relevant documentation to understand the behavior of this operation clearly.
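In Java, for example, HashMap's iterators are fail-fast: a structural modification during iteration is detected (on a best-effort basis) and throws an exception. A small demo:

Map<String, Integer> map = new HashMap<>();
map.put("a", 1);
map.put("b", 2);

for (String key : map.keySet()) {
    // structural modification while iterating:
    // the fail-fast iterator throws ConcurrentModificationException
    // on the next step of the loop
    map.put("c", 3);
}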
Keys Must Be Immutable
Only immutable types should be used as `key`s in a hash table. This is very important.
An immutable type means that once an object is created, its value cannot be changed. For example, in Java, `String` and `Integer` are immutable: once these objects are created, you can only read their values but cannot modify them.
In contrast, objects like Java's `ArrayList` and `LinkedList` can have elements freely added or removed after creation, making them mutable types.
Thus, you can use `String` objects as `key`s in a hash table, but you should not use `ArrayList` objects as `key`s:
// Immutable types can be used as keys
Map<String, AnyOtherType> map1 = new HashMap<>();
Map<Integer, AnyOtherType> map2 = new HashMap<>();
// Mutable types should not be used as keys
// Note, this won't cause a syntax error, but the code is very likely to have bugs
Map<ArrayList<Integer>, AnyOtherType> map3 = new HashMap<>();
Why is it not recommended to use mutable types as `key`s? Take `ArrayList` as an example. The logic of its `hashCode` method is roughly as follows:
public int hashCode() {
    // simplified sketch: combine the hashCode of every element
    int h = 1;
    for (int i = 0; i < elementData.length; i++) {
        h = 31 * h + elementData[i].hashCode();
    }
    return h;
}
The first issue is efficiency. Every call to `hashCode` has to traverse the entire array, which takes O(N) time. This can degrade the complexity of insertion, deletion, lookup, and update in the hash table to O(N).
A more severe issue is that the `hashCode` of an `ArrayList` is computed from its elements. If you add or remove elements from the `ArrayList`, or if the `hashCode` of any element changes, the `hashCode` of the `ArrayList` changes as well.
For instance, suppose you use an `ArrayList` variable `arr` as a key in a hash table to store a corresponding `value`. If some other part of the program then modifies an element of `arr`, the `hashCode` of `arr` changes. If you now use `arr` to query the hash table, you will find no corresponding value.
In other words, the `key-value` pair you stored in the hash table is unexpectedly lost. This is a very serious bug and can also lead to memory leaks.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

public class Test {
    public static void main(String[] args) {
        // Incorrect example:
        // using a mutable type as the key of a HashMap
        Map<ArrayList<Integer>, Integer> map = new HashMap<>();

        ArrayList<Integer> arr = new ArrayList<>();
        arr.add(1);
        arr.add(2);

        map.put(arr, 999);
        System.out.println(map.containsKey(arr)); // true
        System.out.println(map.get(arr)); // 999

        arr.add(3);

        // Serious bug: the key-value pair is lost
        System.out.println(map.containsKey(arr)); // false
        System.out.println(map.get(arr)); // null

        // At this point, the key-value pair for arr still exists
        // in the underlying table array of the map,
        // but because the hashCode of arr has changed, it can no longer be found.
        // This also causes a memory leak: the arr object is still
        // referenced by the map and can never be garbage collected.
    }
}
The above is a simple example of the error. You might say: if you remove the element 3, doesn't the key-value pair `arr -> 999` reappear? Or, by directly iterating over the underlying `table` array of the hash table, shouldn't you still be able to see this key-value pair?
Come on 🙏🏻, are you writing code or a ghost story? Now you see it, now you don't—is your hash table haunted?
Just joking. The truth is, mutable types inherently bring uncertainty. In a large, complex codebase, how do you know where this `arr` gets modified?
The correct approach is to use immutable types as hash table `key`s, such as `String`. In Java, once a `String` object is created, its value cannot be changed, so you won't encounter the problem mentioned above.
The `hashCode` method of `String` also needs to iterate over all characters, but thanks to immutability, the value only has to be computed once and can then be cached, eliminating repeated calculations. Therefore, the average time complexity remains O(1).
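Conceptually, the caching works something like this simplified sketch (illustrative only, not the actual JDK source):

// Simplified sketch of how an immutable string can cache its hash
public final class MyString {
    private final char[] value; // contents never change after construction
    private int hash;           // cached hash code; 0 means "not computed yet"

    public MyString(char[] value) {
        this.value = value.clone();
    }

    @Override
    public int hashCode() {
        int h = hash;
        if (h == 0 && value.length > 0) {
            for (char c : value) {
                h = 31 * h + c;
            }
            hash = h; // safe to cache because value is immutable
        }
        return h;
    }
}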
Here, I used Java as an example, but it's similar in other languages. You should consult the relevant documentation to understand how the standard library's hash table calculates object hash values to avoid similar issues.
Summary
The explanation above should have connected all the underlying principles of hash tables. Let's conclude the content of this article by simulating several interview questions:
1. Why do we often say that adding, deleting, searching, and updating in a hash table take O(1) time?
Because the underlying structure of a hash table is an array, and the main extra cost lies in computing an index with the hash function and resolving hash collisions. As long as the hash function runs in O(1) time and hash collisions are resolved reasonably, add, delete, search, and update all take O(1) time.
2. Why does the traversal order of a hash table change?
Because when a hash table reaches a certain load factor, it expands, which changes the capacity of the underlying array. This also alters the indices calculated by the hash function, thus changing the traversal order.
3. Is the efficiency of adding, deleting, searching, and updating in a hash table always O(1)?
Not necessarily. As analyzed earlier, only when the hash function has O(1) complexity and hash collisions are handled properly can these operations be guaranteed to be O(1). Collisions can be resolved with the standard solutions; the key issue is the computational cost of the hash function. If an inappropriate `key` type is used, such as an `ArrayList` as a `key`, the complexity of the hash table can degrade to O(N).
4. Why must immutable types be used as hash table `key`s?
Because the main operations of a hash table depend on the index computed by the hash function. If the hash value of a `key` changes, key-value pairs can be silently lost, leading to severe bugs.
Having a basic understanding of the source code in the standard library of your programming language is essential to writing efficient code.
Next, I will guide you step-by-step to implement a simple hash table using the chaining method and linear probing method to deepen your understanding of hash tables.