Home >Blog

Hashing basics

Hashing is a technique that is used to uniquely identify a specific object from a collection of similar objects. Some examples of how hashing is used in our lives include:

In schools, each student is assigned a unique roll number that can be used to retrieve data about them.
In libraries, each book is assigned a unique book identifier that can be used to determine information about the book, such as its exact position in the library or the users it has been issued to etc.

In both of these examples the students and books were hashed to a unique number, an identifier.

Imagine that we have an object and we want to assign a key to it to make searching easy. To store the key/value pair, you can use a simple array like a data structure where keys (integers) can be used directly as an index to store values. However, in cases where the keys are large and cannot be used directly as an index, you should use hashing.

In hashing, large keys are converted into small keys by using hash functions. The values are then stored in a data structure called a hash table. The idea of hashing is to distribute entries (key/value pairs) uniformly across an array. Each element is assigned a key (converted key). By using that key you can access the element in O(1) time. Using the key, the algorithm (hash function) computes an index that suggests where an entry can be found or inserted.

Hashing is implemented in two steps:

An element is converted into an integer by using a hash function. This element can be used as an index to store the original element, which falls into the hash table.
The element is stored in the hash table where it can be quickly retrieved using hashed key.
hash = hashfunc(key)
index = hash % array_size

In this method, the hash is independent of the array size and it is then reduced to an index (a number between 0 and array_size − 1) by using the modulo operator (%).

Hash function

A hash function is any function that can be used to map a data set of an arbitrary size to a data set of a fixed size, which falls into the hash table. The values returned by a hash function are called hash values, hash codes, hash sums, or simply hashes.

To achieve a good hashing mechanism, it is important to have a good hash function with the following basic requirements:

Easy to compute: It should be easy to compute and must not become an algorithm in itself.
Uniform distribution: It should provide a uniform distribution across the hash table and should not result in clustering.
Less collisions: Collisions occur when pairs of elements are mapped to the same hash value. These should be avoided.

Note: Irrespective of how good a hash function is, collisions are bound to occur. Therefore, to maintain the performance of a hash table, it is important to manage collisions through various collision resolution techniques.

Need for a good hash function

Let us understand the need for a good hash function. Assume that you have to store strings in the hash table by using the hashing technique {“abcdef”, “bcdefa”, “cdefab” , “defabc” }.

To compute the index for storing the strings, use a hash function that states the following:

The index for a specific string will be equal to the sum of the ASCII values of the characters modulo 599.

As 599 is a prime number, it will reduce the possibility of indexing different strings (collisions). It is recommended that you use prime numbers in case of modulo. The ASCII values of a, b, c, d, e, and f are 97, 98, 99, 100, 101, and 102 respectively. Since all the strings contain the same characters with different permutations, the sum will be 599.

The hash function will compute the same index for all the strings and the strings will be stored in the hash table in the following format. As the index of all the strings is the same, you can create a list on that index and insert all the strings in that list.

Here, it will take O(n) time (where n is the number of strings) to access a specific string. This shows that the hash function is not a good hash function.

Let’s try a different hash function. The index for a specific string will be equal to the sum of the ASCII values of the characters multiplied by their respective order in the string, after which it is modulo with 2069 (another prime number).

String Hash function Index

abcdef (971 + 982 + 993 + 1004 + 1015 + 1026)%2069 38

bcdefa (981 + 992 + 1003 + 1014 + 1025 + 976)%2069 23

cdefab (991 + 1002 + 1013 + 1024 + 975 + 986)%2069 14

defabc (1001 + 1012 + 1023 + 974 + 985 + 996)%2069 11

Double hashing

Double hashing is similar to linear probing and the only difference is the interval between successive probes. Here, the interval between probes is computed by using two hash functions.

Let us say that the hashed index for an entry record is an index that is computed by one hashing function and the slot at that index is already occupied. You must start traversing in a specific probing sequence to look for an unoccupied slot. The probing sequence will be:

index = (index + 1 * indexH) % hashTableSize;

index = (index + 2 * indexH) % hashTableSize;

and so on…

Here, indexH is the hash value that is computed by another hash function.

Implementation of hash table with double hashing

Assumption

There are no more than 20 elements in the data set
Hash functions will return an integer from 0 to 19
Data set must have unique elements

Applications

Associative arrays: Hash tables are commonly used to implement many types of in-memory tables. They are used to implement associative arrays (arrays whose indices are arbitrary strings or other complicated objects).
Database indexing: Hash tables may also be used as disk-based data structures and database indices (such as in dbm).
Caches: Hash tables can be used to implement caches i.e. auxiliary data tables that are used to speed up the access to data, which is primarily stored in slower media.
Object representation: Several dynamic languages, such as Perl, Python, JavaScript, and Ruby use hash tables to implement objects.
Hash Functions are used in various algorithms to make their computing faster.