
A Text Document Search Engine is a system that efficiently searches for words or phrases in a collection of documents. In Data Structures and Algorithms (DSA), two of the most effective approaches to implementing a text search engine are:

1. Trie (Prefix Tree)

2. Hash Table (Hashing with Indexing)

Let's explore both in detail.

1. Trie-Based Search Engine

Concept

A Trie (prefix tree) is a specialized tree data structure used for fast retrieval
of keys, particularly useful for dictionary and text search applications. It
stores words by breaking them into characters and organizing them
hierarchically.

Structure of a Trie

• Each node represents a character.
• The root node is empty.
• Each edge represents a transition between characters.
• Words are stored by linking nodes together.
• A special flag (isEndOfWord) indicates the end of a word.

Operations

Insertion (O(n))

• Insert words character by character.
• Create new nodes when necessary.
• Mark the last character node as the end of the word.

Search (O(n))

• Start from the root and check whether the characters of the query exist in sequence.
• If all characters match and isEndOfWord is true, the word exists.

Prefix Matching (Auto-complete) (O(n))

• Traverse the trie using the prefix.
• Retrieve all words starting with the given prefix.

Advantages

• Fast word lookups.
• Efficient for prefix-based searches (autocomplete).
• Supports wildcard searches.

Disadvantages

• High memory usage (each character node requires pointers).
• Not suitable for searching substrings within words.

Use Cases

• Autocomplete suggestions.
• Spell checking.
• Dictionary applications.

2. Hash Table-Based Search Engine

Concept

A Hash Table is a key-value data structure that provides fast lookups using
hash functions. In a document search engine, we use hashing to index
words from documents.

Structure of Hash Table Indexing

• Key: Words from the document.
• Value: A list of document IDs or positions where the word appears.


Operations

Insertion (O(1))

• Extract words from documents.
• Compute a hash value for each word.
• Store the word in the hash table with references to the document and position.

Search (O(1))

• Compute the hash value of the search word.
• Look up the word in the hash table.
• Retrieve the document IDs where the word appears.

Wildcard and Substring Search

• Not directly supported in a basic hash table.
• Needs additional preprocessing, such as n-gram indexing (see the sketch below).
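
As a minimal sketch of the n-gram idea mentioned above: each indexed word is broken into overlapping character trigrams, and a substring query is answered by intersecting the word sets of its trigrams. The class name NGramIndex and the default trigram size are illustrative assumptions, not part of the original implementation.

from collections import defaultdict

class NGramIndex:
    """Illustrative character n-gram index for substring search over indexed words."""

    def __init__(self, n=3):
        self.n = n                      # size of each character n-gram (trigrams by default)
        self.index = defaultdict(set)   # n-gram -> set of words containing it
        self.words = set()

    def add_word(self, word):
        word = word.lower()
        self.words.add(word)
        for i in range(len(word) - self.n + 1):
            self.index[word[i:i + self.n]].add(word)

    def substring_search(self, query):
        query = query.lower()
        if len(query) < self.n:
            # Query too short for n-gram lookup: fall back to a linear scan.
            return [w for w in self.words if query in w]
        grams = [query[i:i + self.n] for i in range(len(query) - self.n + 1)]
        candidates = set.intersection(*(self.index[g] for g in grams))
        # Verify each candidate, since sharing n-grams does not guarantee a contiguous match.
        return [w for w in candidates if query in w]

# Example usage
ngram_index = NGramIndex()
for w in ["banana", "bandana", "cabana"]:
    ngram_index.add_word(w)
print(ngram_index.substring_search("anan"))   # ['banana']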

Advantages

• Extremely fast lookups.
• Space-efficient compared to Tries.
• Suitable for large datasets.

Disadvantages

• No prefix-based searching.
• Poor performance for substring searches.
• Hash collisions may require additional handling.

Use Cases

• Keyword-based search in large document repositories.
• Database indexing.
• Log file search.


Comparison: Trie vs. Hash Table for Text Search

Feature            Trie                      Hash Table
Lookup Time        O(n) (word length)        O(1) (average case)
Space Usage        High (due to pointers)    Low (hash storage)
Prefix Search      Fast (O(n))               Not supported
Substring Search   Not efficient             Needs additional processing
Hash Collisions    No                        Yes (needs handling)
Application        Dictionary, autocomplete  Keyword-based search, database indexing

Conclusion

• Use a Trie if you need prefix searches, autocomplete, or spell checking.
• Use a Hash Table if you need fast keyword lookups across large documents.

For an efficient text document search engine, a combination of a Trie for prefix-based searches and a Hash Table for exact keyword searches is often used. Many advanced systems also use an Inverted Index (the core structure behind search engines like Google), which maps each word to the documents and positions where it appears and is typically built on top of hash tables or sorted term dictionaries.

Below is a Python implementation of both a Trie and a Hash Table for text document search.

1. Trie Implementation for Text Search

This implementation allows inserting words into a Trie and searching for full
words or words with a given prefix.

Python Code (Trie)

class TrieNode:
    def __init__(self):
        self.children = {}          # char -> TrieNode
        self.isEndOfWord = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for char in word:
            if char not in node.children:
                node.children[char] = TrieNode()
            node = node.children[char]
        node.isEndOfWord = True

    def search(self, word):
        node = self.root
        for char in word:
            if char not in node.children:
                return False
            node = node.children[char]
        return node.isEndOfWord

    def starts_with(self, prefix):
        node = self.root
        for char in prefix:
            if char not in node.children:
                return []
            node = node.children[char]
        words = []
        self._dfs(node, prefix, words)
        return words

    def _dfs(self, node, prefix, words):
        if node.isEndOfWord:
            words.append(prefix)
        for char, child in node.children.items():
            self._dfs(child, prefix + char, words)

# Example Usage
trie = Trie()
words = ["apple", "app", "apex", "bat", "ball"]
for word in words:
    trie.insert(word)

print(trie.search("apple"))      # True
print(trie.search("bat"))        # True
print(trie.search("batman"))     # False
print(trie.starts_with("ap"))    # ['app', 'apple', 'apex']

2. Hash Table Implementation for Text Search

Here, we use a hash table (dictionary) to store words as keys and lists of
document IDs as values.

Python Code (Hash Table)

from collections import defaultdict

class HashTableSearch:
    def __init__(self):
        self.index = defaultdict(list)   # word -> list of document IDs

    def add_document(self, doc_id, text):
        words = text.lower().split()
        for word in words:
            self.index[word].append(doc_id)

    def search(self, word):
        return self.index.get(word.lower(), [])

# Example Usage
search_engine = HashTableSearch()

documents = {
    1: "apple banana mango",
    2: "banana orange grape",
    3: "apple orange banana"
}

for doc_id, text in documents.items():
    search_engine.add_document(doc_id, text)

print(search_engine.search("apple"))    # [1, 3]
print(search_engine.search("banana"))   # [1, 2, 3]
print(search_engine.search("grape"))    # [2]
print(search_engine.search("mango"))    # [1]

Key Takeaways

• Trie is great for prefix-based searches and auto-complete.
• Hash Table is efficient for exact keyword lookups across documents.
• A combination of both is used in real-world search engines (a minimal sketch of such a combination follows below).
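
As a rough illustration of that combination, the sketch below simply wires together the Trie and HashTableSearch classes defined above: the Trie answers prefix/autocomplete queries while the hash-table index answers exact keyword lookups. The class name CombinedSearchEngine is an illustrative assumption, not part of the original text.

class CombinedSearchEngine:
    def __init__(self):
        self.trie = Trie()                       # serves prefix / autocomplete queries
        self.keyword_index = HashTableSearch()   # serves exact keyword -> document ID lookups

    def add_document(self, doc_id, text):
        self.keyword_index.add_document(doc_id, text)
        for word in text.lower().split():
            self.trie.insert(word)

    def search_keyword(self, word):
        return self.keyword_index.search(word)

    def autocomplete(self, prefix):
        return self.trie.starts_with(prefix)

# Example usage (relies on the Trie and HashTableSearch classes above)
engine = CombinedSearchEngine()
engine.add_document(1, "apple banana mango")
engine.add_document(2, "banana orange grape")
print(engine.search_keyword("banana"))   # [1, 2]
print(engine.autocomplete("ap"))         # ['apple']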


Here are some optimizations and improvements for both Trie and Hash
Table implementations to make them more efficient and scalable.

1. Optimized Trie for Text Search

Improvements

✅ Memory Efficiency: Use a defaultdict to reduce explicit node creation.
✅ Faster Insert & Search: Optimize character traversal.
✅ Autocomplete with Sorting: Sort words in the Trie to return results in alphabetical order.

Optimized Trie Code

from collections import defaultdict

class TrieNode:
    def __init__(self):
        self.children = defaultdict(TrieNode)   # missing child nodes are created on demand
        self.isEndOfWord = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for char in word:
            node = node.children[char]
        node.isEndOfWord = True

    def search(self, word):
        node = self.root
        for char in word:
            if char not in node.children:
                return False
            node = node.children[char]
        return node.isEndOfWord

    def starts_with(self, prefix):
        node = self.root
        for char in prefix:
            if char not in node.children:
                return []
            node = node.children[char]
        words = []
        self._dfs(node, prefix, words)
        return sorted(words)   # Returns sorted results

    def _dfs(self, node, prefix, words):
        if node.isEndOfWord:
            words.append(prefix)
        for char in sorted(node.children.keys()):   # Alphabetically ordered traversal
            self._dfs(node.children[char], prefix + char, words)

# Example Usage
trie = Trie()
words = ["apple", "app", "apex", "bat", "ball", "banana"]
for word in words:
    trie.insert(word)

print(trie.search("apple"))      # True
print(trie.search("bat"))        # True
print(trie.search("batman"))     # False
print(trie.starts_with("ap"))    # ['apex', 'app', 'apple']

2. Optimized Hash Table for Text Search

Improvements

✅ Uses Inverted Indexing: Instead of simple key-value storage, it maps words to multiple document positions.
✅ Handles Large Datasets: Efficiently processes and stores large text files.
✅ Tokenization & Normalization: Converts text to lowercase, removes punctuation, and splits efficiently.

Optimized Hash Table Code

import re
from collections import defaultdict

class HashTableSearch:
    def __init__(self):
        self.index = defaultdict(list)   # word -> [(doc_id, position), ...]

    def add_document(self, doc_id, text):
        words = re.findall(r'\b\w+\b', text.lower())   # Tokenization (ignores punctuation)
        for position, word in enumerate(words):
            self.index[word].append((doc_id, position))   # Store word position in each document

    def search(self, word):
        return self.index.get(word.lower(), [])

# Example Usage
search_engine = HashTableSearch()

documents = {
    1: "Apple is a fruit. Banana is also a fruit.",
    2: "Orange and Banana are my favorite fruits.",
    3: "I love eating an apple with peanut butter."
}

for doc_id, text in documents.items():
    search_engine.add_document(doc_id, text)

print(search_engine.search("apple"))    # [(1, 0), (3, 4)]
print(search_engine.search("banana"))   # [(1, 4), (2, 2)]
print(search_engine.search("fruits"))   # [(2, 6)]
print(search_engine.search("peanut"))   # [(3, 6)]


Further Optimizations

✅ For Trie

1. Compress the Trie using a Ternary Search Tree (reduces memory usage).
2. Use a DAWG (Directed Acyclic Word Graph) instead of a Trie for compact storage.
3. Use lazy deletion to optimize delete operations.

✅ For Hash Table

1. Use Bloom Filters to reduce unnecessary lookups (a minimal sketch follows below).
2. Implement n-gram indexing to support substring searches.
3. Use an Inverted Index with BM25 ranking (used in search engines like Elasticsearch).
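
As a minimal sketch of the Bloom-filter idea from item 1 above: the filter answers "definitely not indexed" or "possibly indexed" before the main index is consulted, avoiding pointless lookups. The BloomFilter class, its size and num_hashes defaults, and the md5/sha1 double-hashing scheme are illustrative assumptions, not part of the original implementation.

import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: a probabilistic set with no false negatives."""

    def __init__(self, size=10_000, num_hashes=4):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bytearray(size)   # one byte per bit position, for simplicity

    def _positions(self, word):
        # Derive several bit positions from two independent digests (double hashing).
        h1 = int.from_bytes(hashlib.md5(word.encode()).digest()[:8], "big")
        h2 = int.from_bytes(hashlib.sha1(word.encode()).digest()[:8], "big")
        return [(h1 + i * h2) % self.size for i in range(self.num_hashes)]

    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos] = 1

    def might_contain(self, word):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(word))

# Example usage: consult the filter before querying the main index.
bloom = BloomFilter()
for w in ["apple", "banana", "orange"]:
    bloom.add(w)
print(bloom.might_contain("apple"))    # True
print(bloom.might_contain("durian"))   # False (with high probability)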

Let's implement two advanced optimizations:

1. Compressed Trie using a Ternary Search Tree (TST) – more memory-efficient than a standard Trie.
2. Inverted Index with BM25 Ranking – used in real-world search engines like Elasticsearch and Google for ranked search results.

1. Optimized Ternary Search Tree (TST) for Memory Efficiency

A Ternary Search Tree (TST) is a Trie variation that:

✅ Reduces memory usage by using only three pointers per node (left, middle, right).
✅ Provides fast lookup and autocomplete like a normal Trie.
✅ Efficiently handles large dictionaries and word lists.

TST Implementation

class TSTNode:
    def __init__(self, char):
        self.char = char
        self.left = self.middle = self.right = None
        self.isEndOfWord = False

class TernarySearchTree:
    def __init__(self):
        self.root = None

    def insert(self, word):
        self.root = self._insert(self.root, word, 0)

    def _insert(self, node, word, index):
        if index >= len(word):
            return node
        char = word[index]
        if node is None:
            node = TSTNode(char)
        if char < node.char:
            node.left = self._insert(node.left, word, index)
        elif char > node.char:
            node.right = self._insert(node.right, word, index)
        else:
            if index == len(word) - 1:
                node.isEndOfWord = True
            else:
                node.middle = self._insert(node.middle, word, index + 1)
        return node

    def search(self, word):
        return self._search(self.root, word, 0)

    def _search(self, node, word, index):
        if not node or index >= len(word):
            return False
        char = word[index]
        if char < node.char:
            return self._search(node.left, word, index)
        elif char > node.char:
            return self._search(node.right, word, index)
        else:
            if index == len(word) - 1:
                return node.isEndOfWord
            return self._search(node.middle, word, index + 1)

    def autocomplete(self, prefix):
        node = self._search_prefix(self.root, prefix, 0)
        words = []
        if node:
            if node.isEndOfWord:   # the prefix itself may be a complete word
                words.append(prefix)
            self._dfs(node.middle, prefix, words)
        return words

    def _search_prefix(self, node, prefix, index):
        if not node or index >= len(prefix):
            return None
        char = prefix[index]
        if char < node.char:
            return self._search_prefix(node.left, prefix, index)
        elif char > node.char:
            return self._search_prefix(node.right, prefix, index)
        else:
            if index == len(prefix) - 1:
                return node
            return self._search_prefix(node.middle, prefix, index + 1)

    def _dfs(self, node, prefix, words):
        if node is None:
            return
        if node.isEndOfWord:
            words.append(prefix + node.char)
        self._dfs(node.left, prefix, words)
        self._dfs(node.middle, prefix + node.char, words)
        self._dfs(node.right, prefix, words)

# Example Usage
tst = TernarySearchTree()
words = ["cat", "car", "cart", "care", "bat", "bar", "bark"]
for word in words:
    tst.insert(word)

print(tst.search("car"))         # True
print(tst.search("cart"))        # True
print(tst.search("batman"))      # False
print(tst.autocomplete("ca"))    # ['cat', 'car', 'cart', 'care']

2. Inverted Index with BM25 Ranking

What is an Inverted Index?

An inverted index is a mapping from words to the documents where they appear. It enables fast retrieval by storing:

✅ Word → Document IDs + Positions
✅ Efficient multi-word search.
✅ Works well for large text datasets.

BM25 Ranking for Search Accuracy

BM25 is an advanced ranking algorithm that:

✔ Scores documents based on keyword frequency and relevance.
✔ Prioritizes documents where keywords appear more frequently.
✔ Adjusts scores based on document length.
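
For reference, the score computed by the implementation below is the standard BM25 formula, written here with the same parameters k1 and b that appear in the code:

score(D, Q) = \sum_{t \in Q} \mathrm{IDF}(t) \cdot \frac{f(t, D) \cdot (k_1 + 1)}{f(t, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}, \qquad \mathrm{IDF}(t) = \ln\left(\frac{N - n_t + 0.5}{n_t + 0.5} + 1\right)

where f(t, D) is the frequency of term t in document D, |D| is the length of D in words, avgdl is the average document length, N is the total number of documents, and n_t is the number of documents containing t.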

Implementation

import math
import re
from collections import Counter, defaultdict

class InvertedIndexBM25:
    def __init__(self, k1=1.5, b=0.75):
        self.index = defaultdict(list)   # word -> [(doc_id, position), ...]
        self.doc_lengths = {}            # doc_id -> document length
        self.total_docs = 0
        self.k1 = k1                     # BM25 term frequency saturation parameter
        self.b = b                       # Length normalization parameter

    def add_document(self, doc_id, text):
        words = re.findall(r'\b\w+\b', text.lower())
        self.doc_lengths[doc_id] = len(words)
        for pos, word in enumerate(words):
            self.index[word].append((doc_id, pos))
        self.total_docs += 1

    def bm25_score(self, query):
        avg_doc_length = sum(self.doc_lengths.values()) / self.total_docs
        doc_scores = defaultdict(float)
        query_words = re.findall(r'\b\w+\b', query.lower())
        for word in query_words:
            postings = self.index.get(word, [])
            term_freqs = Counter(doc_id for doc_id, _ in postings)   # occurrences per document
            doc_freq = len(term_freqs)                               # number of unique documents containing the word
            if doc_freq == 0:
                continue
            idf = math.log((self.total_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1)   # Inverse document frequency
            for doc_id, term_freq in term_freqs.items():   # score each document once per query word
                doc_length = self.doc_lengths[doc_id]
                # BM25 formula
                numerator = term_freq * (self.k1 + 1)
                denominator = term_freq + self.k1 * (1 - self.b + self.b * (doc_length / avg_doc_length))
                doc_scores[doc_id] += idf * (numerator / denominator)
        return sorted(doc_scores.items(), key=lambda x: x[1], reverse=True)   # Sort by score, highest first

# Example Usage
search_engine = InvertedIndexBM25()

documents = {
    1: "Apple is a popular fruit. It is red or green in color.",
    2: "Banana is a yellow fruit. It is sweet and healthy.",
    3: "Apple and banana are both fruits. Apple is often eaten raw.",
    4: "Orange is a citrus fruit. It is rich in vitamin C."
}

for doc_id, text in documents.items():
    search_engine.add_document(doc_id, text)

print(search_engine.bm25_score("apple fruit"))
# Output: ranked list of (doc_id, score) pairs, most relevant document first

Final Thoughts

✅ Which Approach Should You Use?


Feature            Trie / TST                  Inverted Index (BM25)
Search Type        Prefix-based search         Full-text search
Memory Usage       High (TST reduces it)       Lower than Trie
Ranking Results    No ranking                  BM25 ranks results
Wildcard Support   Limited                     Can match phrases easily
Best For           Autocomplete, spell check   Large text search engines

If you need autocomplete or prefix matching, use a TST (Trie variant).
If you need Google-like ranked search, use a BM25-based Inverted Index.
