Introduction to Rolling Hash - Data Structures and Algorithms
Last Updated: 23 Jul, 2025
A rolling hash is a hash function that is used to efficiently compute a hash value for a sliding window of data. It is commonly used in computer science and computational biology, where it can be used to detect approximate string matches, find repeated substrings, and perform other operations on sequences of data.
The idea behind a rolling hash is to compute the hash value for a fixed-size window of data once, and then "roll" the window one position at a time, updating the hash value instead of recomputing it. Recomputing the hash from scratch at every position costs time proportional to the window size, which adds up quickly for large data sets. A rolling hash only needs to account for the element that leaves the window and the element that enters it.
Examples:
Example 1:
- Let's say we have a string "hello world" and we want to search for the substring "world" within it. We can use the Rabin-Karp hash function to compute the hash value of each possible substring of length 5 (the length of the substring "world") and compare it to the hash value of the target substring "world".
- To compute the hash value of a substring, the Rabin-Karp algorithm uses a rolling hash function that maintains a running hash value of the current substring and updates the hash value as the sliding window moves across the input string.
- For instance, the Rabin-Karp rolling hash function might use the following formula to compute the hash value of a substring: hash(substring) = (c1 * a^(k-1) + c2 * a^(k-2) + ... + ck * a^0) mod m, where c1, c2, ..., ck are the character codes (e.g., ASCII values) of the characters in the substring, a is the base (often a small prime such as 31), k is the length of the substring, and m is the modulus, usually a large prime (e.g., 10^9 + 7).
- Using this rolling hash function, we can compute the hash value of each possible substring of length 5 in the string "hello world", and compare it to the hash value of the target substring "world" to search for a match. Because two different substrings can share a hash value, a hash match is confirmed by comparing the characters directly. If the match is confirmed, we can return the index of the matching substring in the input string.
- Rolling hashes are useful for many applications where data needs to be efficiently compressed, searched, or compared in a continuous and incremental manner.
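The search in Example 1 can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name rabin_karp_search and the defaults base = 31 and mod = 10^9 + 7 are assumptions taken from the example values above.

```python
def rabin_karp_search(text: str, pattern: str,
                      base: int = 31, mod: int = 10**9 + 7) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    n, k = len(text), len(pattern)
    if k == 0:
        return 0
    if k > n:
        return -1

    # Weight of the leading character in a window: a^(k-1) mod m
    high = pow(base, k - 1, mod)

    # Hash the pattern and the first window with Horner's rule
    pattern_hash = 0
    window_hash = 0
    for i in range(k):
        pattern_hash = (pattern_hash * base + ord(pattern[i])) % mod
        window_hash = (window_hash * base + ord(text[i])) % mod

    for start in range(n - k + 1):
        # Equal hashes only suggest a match, so confirm it character by character
        if window_hash == pattern_hash and text[start:start + k] == pattern:
            return start
        if start + k < n:
            # Drop text[start], shift the window, and append text[start + k]
            window_hash = (window_hash - ord(text[start]) * high) % mod
            window_hash = (window_hash * base + ord(text[start + k])) % mod
    return -1


print(rabin_karp_search("hello world", "world"))  # prints 6
```

The direct comparison on a hash match is what keeps the result correct even when two different substrings happen to collide.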
Example 2:
- Suppose we have a large file (e.g., several gigabytes in size) that we want to deduplicate by identifying and removing duplicate blocks of data. To do this efficiently, we can use a rolling hash to compute the hash value of each block of data and compare it to the hash values of previously seen blocks.
- One common rolling hash used for this purpose is the Adler-32 hash function, which is a checksum algorithm that is fast and simple to compute. The Adler-32 hash function works by combining two 16-bit values (s1 and s2) that are updated for each byte in the input data.
- To compute the Adler-32 hash of a block of data, we can use the following formula: hash(data) = s2 * 65536 + s1, where s1 is 1 plus the sum of all bytes in the data, modulo 65521, and s2 is the sum of the successive s1 values (one per byte), modulo 65521.
- Using this rolling hash function, we can compute the hash value of each block of data in the file as we read it, and compare it to the hash values of previously seen blocks stored in a hash table or a bloom filter. If a match is found, we can skip the duplicate block and save space by only storing a reference to the previously seen block.
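The checksum in Example 2 can be rolled across a buffer as sketched below. The function name rolling_adler32 and the sample data are hypothetical. The update rule is derived from the definitions of s1 and s2, and each result is cross-checked against zlib.adler32 from the Python standard library, which computes the same checksum from scratch.

```python
import zlib

MOD_ADLER = 65521  # largest prime below 2**16


def rolling_adler32(data: bytes, block_size: int):
    """Yield (offset, checksum) for every block_size-byte window of data."""
    n = len(data)
    if block_size <= 0 or block_size > n:
        return

    # Checksum of the first window, computed byte by byte
    s1, s2 = 1, 0
    for byte in data[:block_size]:
        s1 = (s1 + byte) % MOD_ADLER
        s2 = (s2 + s1) % MOD_ADLER
    yield 0, (s2 << 16) | s1

    for start in range(1, n - block_size + 1):
        outgoing = data[start - 1]
        incoming = data[start + block_size - 1]
        # s1 loses the outgoing byte and gains the incoming one
        s1 = (s1 - outgoing + incoming) % MOD_ADLER
        # s2 loses block_size copies of the outgoing byte and gains
        # the new s1 without its initial 1
        s2 = (s2 - block_size * outgoing + s1 - 1) % MOD_ADLER
        yield start, (s2 << 16) | s1


data = b"the quick brown fox jumps over the lazy dog"
for offset, checksum in rolling_adler32(data, 8):
    # The rolled value should equal a from-scratch computation
    assert checksum == zlib.adler32(data[offset:offset + 8])
```

In a deduplication setting, a weak checksum like this would typically serve only as a first filter, with a stronger hash or a byte-wise comparison confirming that two blocks are really identical.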
The best-known use of a rolling hash is the Rabin-Karp algorithm, which uses a polynomial hash function to compute the hash value for the window of data. The polynomial hash function treats the elements of the window as the coefficients of a polynomial, and the hash value is computed by evaluating that polynomial at a fixed point (the base) modulo a prime m. The hash value can be efficiently updated by removing the leading coefficient from the polynomial, multiplying by the base, and adding the next element from the sliding window, which can be done in constant time.
The implementation given below works as follows:
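A single constant-time update can be traced with small numbers. The sketch below assumes base 26 and modulus 10^9 + 7 and slides a window of length 3 over the hypothetical input "abca", from "abc" to "bca"; the helper direct_hash is illustrative only.

```python
BASE, MOD = 26, 10**9 + 7


def direct_hash(s: str) -> int:
    """Evaluate the polynomial hash of s from scratch (Horner's rule)."""
    h = 0
    for ch in s:
        h = (h * BASE + ord(ch)) % MOD
    return h


text, k = "abca", 3
old = direct_hash(text[:k])        # hash of "abc"
high = pow(BASE, k - 1, MOD)       # 26^2 = 676, weight of the leading character

# Step 1: remove the leading character 'a' (code 97)
without_first = (old - ord(text[0]) * high) % MOD
# Step 2: shift by one position and append the next character 'a'
rolled = (without_first * BASE + ord(text[k])) % MOD

print(old, without_first, rolled)  # prints 68219 2647 68919
assert rolled == direct_hash(text[1:k + 1])  # same as hashing "bca" directly
```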
- This function takes in a string s, a window size window_size, and two optional arguments base and mod. The base argument is used as the base for the polynomial hash function, and the mod is the modulus used to reduce the hash values.
- The function first initializes some variables, including a list power that stores the powers of the base modulo the mod, a list hash_values that stores the hash values of all substrings of length window_size in s, and a variable current_hash that stores the hash value of the current window.
- The function then uses a for loop to precompute the powers of the base modulo the mod. It then computes the hash value of the first window using another for loop that iterates over the characters in the first window.
- Finally, the function uses another for loop to compute the hash values of the rest of the substrings. It first removes the contribution of the first character in the current window, shifts the window by one character, and adds the new character to the hash. It then stores the hash value of the current window in the hash_values list.
- The function returns the hash_values list, which contains the hash values of all substrings of length window_size in s.
Below is the implementation for the above approach:
C++
#include <iostream>
#include <string>
#include <vector>
using namespace std;

vector<int> rolling_hash(const string& s, int window_size,
                         int base = 26,
                         int mod = 1000000007)
{
    int n = s.length();
    // No window fits: nothing to hash
    if (window_size <= 0 || window_size > n)
        return {};

    // long long keeps the intermediate products from overflowing
    vector<long long> power(n + 1, 1);
    vector<int> hash_values(n - window_size + 1);

    // Precompute the powers of the base modulo the mod
    for (int i = 1; i <= n; i++) {
        power[i] = (power[i - 1] * base) % mod;
    }

    // Compute the hash value of the first window
    long long current_hash = 0;
    for (int i = 0; i < window_size; i++) {
        current_hash
            = (current_hash * base + (unsigned char)s[i]) % mod;
    }
    hash_values[0] = static_cast<int>(current_hash);

    // Compute the hash values of the rest of the substrings
    for (int i = 1; i <= n - window_size; i++) {
        // Remove the contribution of the first character in
        // the window; adding mod keeps the result non-negative
        long long leading
            = power[window_size - 1] * (unsigned char)s[i - 1] % mod;
        current_hash = (current_hash - leading + mod) % mod;

        // Shift the window by one character and add the new
        // character to the hash
        current_hash
            = (current_hash * base
               + (unsigned char)s[i + window_size - 1])
              % mod;
        hash_values[i] = static_cast<int>(current_hash);
    }
    return hash_values;
}

// Driver code
int main()
{
    // Input string
    string s = "abcabcabc";
    // Window size
    int window_size = 3;
    // Calculate rolling hash values
    vector<int> hash_values = rolling_hash(s, window_size);
    // Print the hash values
    for (auto& val : hash_values) {
        cout << val << " ";
    }
    cout << endl;
    return 0;
}
// this code is contributed by Prajwal kandekar
Java
import java.util.*;

public class Main {
    public static ArrayList<Integer>
    rolling_hash(String s, int window_size, int base, int mod)
    {
        int n = s.length();
        ArrayList<Integer> hash_values = new ArrayList<Integer>();
        // No window fits: nothing to hash
        if (window_size <= 0 || window_size > n) {
            return hash_values;
        }

        // long keeps the intermediate products from overflowing
        long[] power = new long[n + 1];

        // Precompute the powers of the base modulo the mod
        power[0] = 1;
        for (int i = 1; i <= n; i++) {
            power[i] = (power[i - 1] * base) % mod;
        }

        // Compute the hash value of the first window
        long current_hash = 0;
        for (int i = 0; i < window_size; i++) {
            current_hash
                = (current_hash * base + s.charAt(i)) % mod;
        }
        hash_values.add((int)current_hash);

        // Compute the hash values of the rest of the
        // substrings
        for (int i = 1; i <= n - window_size; i++) {
            // Remove the contribution of the first
            // character in the window; adding mod keeps
            // the result non-negative
            long leading
                = power[window_size - 1] * s.charAt(i - 1) % mod;
            current_hash = (current_hash - leading + mod) % mod;

            // Shift the window by one character and add the
            // new character to the hash
            current_hash = (current_hash * base
                            + s.charAt(i + window_size - 1))
                           % mod;
            hash_values.add((int)current_hash);
        }
        return hash_values;
    }

    // Driver code
    public static void main(String[] args)
    {
        // Input string
        String s = "abcabcabc";
        // Window size
        int window_size = 3;
        // Calculate rolling hash values
        ArrayList<Integer> hash_values
            = rolling_hash(s, window_size, 26, 1000000007);
        // Print the hash values
        for (int val : hash_values) {
            System.out.print(val + " ");
        }
        System.out.println();
    }
}
Python3
# Python Implementation
from typing import List


def rolling_hash(s: str, window_size: int, base: int = 26, mod: int = 10**9 + 7) -> List[int]:
    """
    Calculates the rolling hash values of all substrings of length window_size in string s.
    Uses the polynomial rolling hash algorithm with base and mod as constants.

    :param s: the input string
    :param window_size: the size of the rolling window
    :param base: the base for the polynomial hash function
    :param mod: the modulus for the polynomial hash function
    :return: a list of hash values of all substrings of length window_size in s
    """
    n = len(s)
    # No window fits: nothing to hash
    if window_size <= 0 or window_size > n:
        return []

    power = [1] * (n + 1)
    hash_values = [0] * (n - window_size + 1)

    # Precompute the powers of the
    # base modulo the mod
    for i in range(1, n + 1):
        power[i] = (power[i - 1] * base) % mod

    # Compute the hash value of
    # the first window
    current_hash = 0
    for i in range(window_size):
        current_hash = (current_hash * base + ord(s[i])) % mod
    hash_values[0] = current_hash

    # Compute the hash values of the
    # rest of the substrings
    for i in range(1, n - window_size + 1):
        # Remove the contribution of the
        # first character in the window
        # (Python's % always returns a non-negative result here)
        current_hash = (
            current_hash - power[window_size - 1] * ord(s[i - 1])) % mod

        # Shift the window by one character
        # and add the new character
        # to the hash
        current_hash = (current_hash * base + ord(s[i + window_size - 1])) % mod
        hash_values[i] = current_hash

    return hash_values


# Driver code
def main():
    # Input string
    s = "abcabcabc"
    # Window size
    window_size = 3
    # Calculate rolling hash values
    hash_values = rolling_hash(s, window_size)
    # Print the hash values
    print(hash_values)


if __name__ == "__main__":
    main()
C#
// C# Implementation
using System;
using System.Collections.Generic;

class Program {
    static List<int> rolling_hash(string s, int window_size,
                                  int base_, int mod)
    {
        int n = s.Length;
        List<int> hash_values = new List<int>();
        // No window fits: nothing to hash
        if (window_size <= 0 || window_size > n) {
            return hash_values;
        }

        // long keeps the intermediate products from overflowing
        long[] power = new long[n + 1];

        // Precompute the powers of the base modulo the mod
        power[0] = 1;
        for (int i = 1; i <= n; i++) {
            power[i] = (power[i - 1] * base_) % mod;
        }

        // Compute the hash value of the first window
        long current_hash = 0;
        for (int i = 0; i < window_size; i++) {
            current_hash
                = (current_hash * base_ + s[i]) % mod;
        }
        hash_values.Add((int)current_hash);

        // Compute the hash values of the rest of the
        // substrings
        for (int i = 1; i <= n - window_size; i++) {
            // Remove the contribution of the first
            // character in the window; adding mod keeps
            // the result non-negative
            long leading
                = power[window_size - 1] * s[i - 1] % mod;
            current_hash = (current_hash - leading + mod) % mod;

            // Shift the window by one character and add the
            // new character to the hash
            current_hash = (current_hash * base_
                            + s[i + window_size - 1])
                           % mod;
            hash_values.Add((int)current_hash);
        }
        return hash_values;
    }

    // Driver code
    static void Main(string[] args)
    {
        // Input string
        string s = "abcabcabc";
        // Window size
        int window_size = 3;
        // Calculate rolling hash values
        List<int> hash_values
            = rolling_hash(s, window_size, 26, 1000000007);
        // Print the hash values
        foreach(int val in hash_values)
        {
            Console.Write(val + " ");
        }
        Console.WriteLine();
    }
}
// This code is contributed by Tapesh(tapeshdua420)
JavaScript
function rollingHash(s, windowSize, base = 26, mod = 1000000007) {
    const n = s.length;
    // No window fits: nothing to hash
    if (windowSize <= 0 || windowSize > n) {
        return [];
    }

    const power = new Array(n + 1).fill(1);
    const hashValues = new Array(n - windowSize + 1);

    // Precompute the powers of the base modulo the mod
    for (let i = 1; i <= n; i++) {
        power[i] = (power[i - 1] * base) % mod;
    }

    // Compute the hash value of the first window
    let currentHash = 0;
    for (let i = 0; i < windowSize; i++) {
        currentHash = (currentHash * base + s.charCodeAt(i)) % mod;
    }
    hashValues[0] = currentHash;

    // Compute the hash values of the rest of the substrings
    for (let i = 1; i <= n - windowSize; i++) {
        // Remove the contribution of the first character in the window;
        // adding mod keeps the result non-negative
        currentHash =
            ((currentHash - power[windowSize - 1] * s.charCodeAt(i - 1)) % mod + mod) %
            mod;

        // Shift the window by one character and add the new character to the hash
        currentHash =
            (currentHash * base + s.charCodeAt(i + windowSize - 1)) % mod;
        hashValues[i] = currentHash;
    }
    return hashValues;
}

// Driver code
const s = "abcabcabc";
// Window size
const windowSize = 3;
// Calculate rolling hash values
const hashValues = rollingHash(s, windowSize);
// Print the hash values
console.log(hashValues.join(" "));
//This code is contributed by Tushar Rokade
Output: [68219, 68919, 69544, 68219, 68919, 69544, 68219]
Time Complexity: O(n)
Auxiliary Space: O(n)
Applications of the rolling hash algorithm:
- Approximate string matching: The rolling hash can be used to quickly identify regions of two strings that are likely to be similar, which can be useful for detecting plagiarism or identifying similar documents.
- Repeated substring detection: The rolling hash can be used to identify repeated substrings in a long sequence of data, which can be useful for compressing the data or finding patterns in the data.
- Data stream processing: The rolling hash can be used to efficiently process large streams of data, such as network packets or sensor data, by computing a hash value for each window of data and using the hash value to identify patterns or anomalies in the data.
- File synchronization: Rolling hashes can be used to efficiently detect changes in large files, such as multimedia files or database backups. By computing rolling hashes for fixed-size chunks of the file, it is possible to quickly identify regions of the file that have been modified or deleted, which can be used to synchronize the file with a remote backup or a different version of the file.
- Data deduplication: Rolling hashes can be used to detect duplicate blocks of data in a large data set, such as a file system or a backup archive. By computing rolling hashes for fixed-size chunks of the data, it is possible to quickly identify blocks of data that have already been stored, which can be used to save storage space and reduce backup times.
- Network intrusion detection: Rolling hashes can be used to identify network traffic that matches known patterns of attack or malicious behavior, such as denial-of-service attacks or malware infections. By computing rolling hashes for packets of network traffic, it is possible to quickly identify patterns of behavior that are indicative of an attack, which can be used to trigger an alarm or block the traffic.
- Genome analysis: Rolling hashes are widely used in computational biology to analyze large sequences of genetic data, such as DNA or RNA sequences. By computing rolling hashes for fixed-size segments of the sequence, it is possible to quickly identify regions of the sequence that are likely to be functionally important or evolutionarily conserved, which can be used to guide experimental design or annotate the sequence.
- Video processing: Rolling hashes can be used to efficiently detect duplicate or near-duplicate frames in a video stream, which can be used for video compression or to identify video content that has been illegally copied or distributed. By computing rolling hashes for fixed-size frames of the video, it is possible to quickly identify regions of the video that are likely to be similar, which can be used to group the frames into clusters and identify the most representative frames.
- Cloud storage: Rolling hashes can be used to efficiently detect changes in large data sets that are stored in the cloud, such as cloud backups or cloud-based file systems. By computing rolling hashes for fixed-size chunks of the data, it is possible to quickly identify regions of the data that have been modified, which can be used to synchronize the data with a remote backup or a different version of the data.
- Machine learning: Rolling hashes can be used to preprocess large data sets for use in machine learning models. For example, hashing the overlapping fixed-size chunks (n-grams) of a text into a fixed number of buckets bounds the dimensionality of the feature space, at the cost of occasional collisions, which can help speed up training.
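As an illustration of the repeated substring detection application listed above, the sketch below groups window start positions by rolling hash and then confirms each group by comparing the actual text. The function name find_repeated_windows and the defaults base = 256 and mod = 2^61 - 1 are assumptions made for this example.

```python
from collections import defaultdict


def find_repeated_windows(s: str, k: int,
                          base: int = 256, mod: int = (1 << 61) - 1) -> dict:
    """Map each length-k substring that occurs more than once to its start indices."""
    n = len(s)
    if k <= 0 or k > n:
        return {}

    high = pow(base, k - 1, mod)  # weight of the leading character

    # Hash of the first window
    h = 0
    for ch in s[:k]:
        h = (h * base + ord(ch)) % mod

    buckets = defaultdict(list)   # hash value -> start indices
    buckets[h].append(0)
    for start in range(1, n - k + 1):
        h = (h - ord(s[start - 1]) * high) % mod
        h = (h * base + ord(s[start + k - 1])) % mod
        buckets[h].append(start)

    repeats = defaultdict(list)
    for starts in buckets.values():
        if len(starts) < 2:
            continue
        # Group by the actual text so that colliding hashes are not reported
        for st in starts:
            repeats[s[st:st + k]].append(st)
    return {sub: idx for sub, idx in repeats.items() if len(idx) > 1}


print(find_repeated_windows("abcabcabc", 3))
# prints {'abc': [0, 3, 6], 'bca': [1, 4], 'cab': [2, 5]}
```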
Advantages of the rolling hash algorithm:
- Speed: After the first window is hashed, the hash of each subsequent window is obtained in constant time. This is especially useful for large strings, where rehashing every window of length k from scratch would cost O(n * k) time instead of O(n).
- Memory usage: The rolling update itself needs very little memory: the hash value of the current window, the power of the base for the leading position, and access to the characters entering and leaving the window. The implementation above uses O(n) auxiliary space only because it stores every power and every hash value.
- Partial matching: Two substrings of the same length can be compared through their hash values in constant time. Different hash values prove that the substrings differ, while equal hash values indicate a probable match that can be confirmed by a direct comparison. This is useful for string searching and pattern-matching algorithms.
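The memory point can be illustrated with a streaming variant that keeps only the current window and a single precomputed power. This is a sketch under assumptions: the name stream_hashes and the default base 257 are hypothetical, and the input is any iterable of integer symbols (for example a bytes object).

```python
from collections import deque
from typing import Iterable, Iterator


def stream_hashes(symbols: Iterable[int], k: int,
                  base: int = 257, mod: int = 10**9 + 7) -> Iterator[int]:
    """Yield the polynomial hash of each length-k window of a symbol stream."""
    if k <= 0:
        raise ValueError("k must be positive")

    high = pow(base, k - 1, mod)  # weight of the oldest symbol in a full window
    window = deque()              # holds at most k symbols
    h = 0
    for sym in symbols:
        if len(window) == k:
            # Window is full: remove the oldest symbol's contribution
            h = (h - window.popleft() * high) % mod
        window.append(sym)
        h = (h * base + sym) % mod
        if len(window) == k:
            yield h


# With base 26 this reproduces the values from the implementation above
print(list(stream_hashes(b"abcabcabc", 3, base=26)))
# prints [68219, 68919, 69544, 68219, 68919, 69544, 68219]
```

Because the generator never stores previous hash values, its extra memory is proportional to the window size k rather than to the length of the input.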
Disadvantages of the rolling hash algorithm:
- Hash collisions: Rolling hash can produce hash collisions, which means that two different substrings can have the same hash value. This can lead to false matches and incorrect results unless every hash match is verified by comparing the underlying data.
- Window size: Rolling hash requires a fixed window size that must be chosen in advance, which means that it may not be suitable for all types of strings. If the window size is too small, many unrelated positions match and the matches carry little information. If the window size is too large, shorter repeats are missed. Handling several lengths requires a separate pass for each one.
- Limited hash space: Rolling hash has a limited hash space: with a modulus m, only m distinct hash values can be generated, so collisions are unavoidable once the number of distinct windows exceeds m. A larger modulus, or two independent hash functions used together, lowers the collision probability but does not eliminate it.
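The collision risk is easy to observe when the modulus is small. The sketch below deliberately uses a tiny modulus (101) so that a collision among two-letter strings is guaranteed by the pigeonhole principle; the helper names poly_hash and first_collision are hypothetical.

```python
from itertools import product
from string import ascii_lowercase


def poly_hash(s: str, base: int = 31, mod: int = 101) -> int:
    """Polynomial hash with a deliberately tiny modulus."""
    h = 0
    for ch in s:
        h = (h * base + ord(ch)) % mod
    return h


def first_collision(length: int = 2, base: int = 31, mod: int = 101):
    """Return the first pair of distinct lowercase strings sharing a hash, or None."""
    seen = {}  # hash value -> first word that produced it
    for chars in product(ascii_lowercase, repeat=length):
        word = "".join(chars)
        h = poly_hash(word, base, mod)
        if h in seen:
            return seen[h], word
        seen[h] = word
    return None


a, b = first_collision()
print(a, b, poly_hash(a), poly_hash(b))
# The strings differ but the hashes agree, so a hash match alone is not proof
assert a != b and poly_hash(a) == poly_hash(b)
```

With a modulus near 10^9 collisions become rare for typical inputs, but the same argument applies, which is why a match is normally confirmed by comparing the substrings themselves.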