
A Text Document Search Engine is a system that efficiently searches for words or phrases in a collection of documents. In Data Structures and Algorithms (DSA), two of the most effective approaches to implementing a text search engine are:

1. Trie (Prefix Tree)

2. Hash Table (Hashing with Indexing)

Let's explore both in detail.

1. Trie-Based Search Engine

Concept

A Trie (prefix tree) is a specialized tree data structure used for fast retrieval
of keys, particularly useful for dictionary and text search applications. It
stores words by breaking them into characters and organizing them
hierarchically.

Structure of a Trie

• Each node represents a character.
• The root node is empty.
• Each edge represents a transition between characters.
• Words are stored by linking nodes together.
• A special flag (isEndOfWord) indicates the end of a word.

Operations

Insertion (O(n))

• Insert words character by character.
• Create new nodes when necessary.
• Mark the last character node as the end of the word.

Search (O(n))

• Start from the root and check whether the characters of the query exist in sequence.
• If all characters match and isEndOfWord is true, the word exists.

Prefix Matching (Auto-complete) (O(n))

• Traverse the trie using the prefix.
• Retrieve all words starting with the given prefix.

Advantages

• Fast word lookups.
• Efficient for prefix-based searches (autocomplete).
• Supports wildcard searches.

Disadvantages

• High memory usage (each character node requires pointers).
• Not suitable for searching substrings within words.

Use Cases

• Autocomplete suggestions.
• Spell checking.
• Dictionary applications.

2. Hash Table-Based Search Engine

Concept

A Hash Table is a key-value data structure that provides fast lookups using
hash functions. In a document search engine, we use hashing to index
words from documents.

Structure of Hash Table Indexing

• Key: Words from the document.
• Value: A list of document IDs or positions where the word appears.


Operations

Insertion (O(1))

• Extract words from documents.
• Compute a hash value for each word.
• Store the word in the hash table with references to the document and position.

Search (O(1))

• Compute the hash value of the search word.
• Look up the word in the hash table.
• Retrieve the document IDs where the word appears.

Wildcard and Substring Search

• Not directly supported in a basic hash table.
• Needs additional preprocessing, such as n-gram indexing (see the sketch below).
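
As a minimal sketch of the n-gram idea mentioned above: each indexed word is broken into overlapping character trigrams, and a substring query is answered by intersecting the word sets of its trigrams. The class name NGramIndex and the default trigram size are illustrative assumptions, not part of the original implementation.

from collections import defaultdict

class NGramIndex:
    """Illustrative character n-gram index for substring search over indexed words."""

    def __init__(self, n=3):
        self.n = n                      # size of each character n-gram (trigrams by default)
        self.index = defaultdict(set)   # n-gram -> set of words containing it
        self.words = set()

    def add_word(self, word):
        word = word.lower()
        self.words.add(word)
        for i in range(len(word) - self.n + 1):
            self.index[word[i:i + self.n]].add(word)

    def substring_search(self, query):
        query = query.lower()
        if len(query) < self.n:
            # Query too short for n-gram lookup: fall back to a linear scan.
            return [w for w in self.words if query in w]
        grams = [query[i:i + self.n] for i in range(len(query) - self.n + 1)]
        candidates = set.intersection(*(self.index[g] for g in grams))
        # Verify each candidate, since sharing n-grams does not guarantee a contiguous match.
        return [w for w in candidates if query in w]

# Example usage
ngram_index = NGramIndex()
for w in ["banana", "bandana", "cabana"]:
    ngram_index.add_word(w)
print(ngram_index.substring_search("anan"))   # ['banana']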

Advantages

• Extremely fast lookups.
• Space-efficient compared to Tries.
• Suitable for large datasets.

Disadvantages

• No prefix-based searching.
• Poor performance for substring searches.
• Hash collisions may require additional handling.

Use Cases

• Keyword-based search in large document repositories.
• Database indexing.
• Log file search.


Comparison: Trie vs. Hash Table for Text Search

Feature            Trie                      Hash Table
Lookup Time        O(n) (word length)        O(1) (average case)
Space Usage        High (due to pointers)    Low (hash storage)
Prefix Search      Fast (O(n))               Not supported
Substring Search   Not efficient             Needs additional processing
Hash Collisions    No                        Yes (needs handling)
Application        Dictionary, autocomplete  Keyword-based search, database indexing

Conclusion

• Use a Trie if you need prefix searches, autocomplete, or spell checking.
• Use a Hash Table if you need fast keyword lookups across large documents.

For an efficient text document search engine, a combination of a Trie for prefix-based searches and a Hash Table for exact keyword searches is often used. Many advanced systems also use an Inverted Index (the core structure behind search engines like Google), which maps each word to the documents and positions where it appears and is typically built on top of hash tables or sorted term dictionaries.

Below is a Python implementation of both a Trie and a Hash Table for text document search.

1. Trie Implementation for Text Search

This implementation allows inserting words into a Trie and searching for full
words or words with a given prefix.

Python Code (Trie)

class TrieNode:
    def __init__(self):
        self.children = {}          # char -> TrieNode
        self.isEndOfWord = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for char in word:
            if char not in node.children:
                node.children[char] = TrieNode()
            node = node.children[char]
        node.isEndOfWord = True

    def search(self, word):
        node = self.root
        for char in word:
            if char not in node.children:
                return False
            node = node.children[char]
        return node.isEndOfWord

    def starts_with(self, prefix):
        node = self.root
        for char in prefix:
            if char not in node.children:
                return []
            node = node.children[char]
        words = []
        self._dfs(node, prefix, words)
        return words

    def _dfs(self, node, prefix, words):
        if node.isEndOfWord:
            words.append(prefix)
        for char, child in node.children.items():
            self._dfs(child, prefix + char, words)

# Example Usage
trie = Trie()
words = ["apple", "app", "apex", "bat", "ball"]
for word in words:
    trie.insert(word)

print(trie.search("apple"))      # True
print(trie.search("bat"))        # True
print(trie.search("batman"))     # False
print(trie.starts_with("ap"))    # ['app', 'apple', 'apex']

2. Hash Table Implementation for Text Search

Here, we use a hash table (dictionary) to store words as keys and lists of
document IDs as values.

Python Code (Hash Table)

from collections import defaultdict

class HashTableSearch:
    def __init__(self):
        self.index = defaultdict(list)   # word -> list of document IDs

    def add_document(self, doc_id, text):
        words = text.lower().split()
        for word in words:
            self.index[word].append(doc_id)

    def search(self, word):
        return self.index.get(word.lower(), [])

# Example Usage
search_engine = HashTableSearch()

documents = {
    1: "apple banana mango",
    2: "banana orange grape",
    3: "apple orange banana"
}

for doc_id, text in documents.items():
    search_engine.add_document(doc_id, text)

print(search_engine.search("apple"))    # [1, 3]
print(search_engine.search("banana"))   # [1, 2, 3]
print(search_engine.search("grape"))    # [2]
print(search_engine.search("mango"))    # [1]

Key Takeaways

• Trie is great for prefix-based searches and auto-complete.
• Hash Table is efficient for exact keyword lookups across documents.
• A combination of both is used in real-world search engines (a minimal sketch of such a combination follows below).
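
As a rough illustration of that combination, the sketch below simply wires together the Trie and HashTableSearch classes defined above: the Trie answers prefix/autocomplete queries while the hash-table index answers exact keyword lookups. The class name CombinedSearchEngine is an illustrative assumption, not part of the original text.

class CombinedSearchEngine:
    def __init__(self):
        self.trie = Trie()                       # serves prefix / autocomplete queries
        self.keyword_index = HashTableSearch()   # serves exact keyword -> document ID lookups

    def add_document(self, doc_id, text):
        self.keyword_index.add_document(doc_id, text)
        for word in text.lower().split():
            self.trie.insert(word)

    def search_keyword(self, word):
        return self.keyword_index.search(word)

    def autocomplete(self, prefix):
        return self.trie.starts_with(prefix)

# Example usage (relies on the Trie and HashTableSearch classes above)
engine = CombinedSearchEngine()
engine.add_document(1, "apple banana mango")
engine.add_document(2, "banana orange grape")
print(engine.search_keyword("banana"))   # [1, 2]
print(engine.autocomplete("ap"))         # ['apple']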


Here are some optimizations and improvements for both Trie and Hash
Table implementations to make them more efficient and scalable.

1. Optimized Trie for Text Search

Improvements

✅ Memory Efficiency: Use a defaultdict to reduce explicit node creation.
✅ Faster Insert & Search: Optimize character traversal.
✅ Autocomplete with Sorting: Sort words in the Trie to return results in alphabetical order.

Optimized Trie Code

from collections import defaultdict

class TrieNode:
    def __init__(self):
        self.children = defaultdict(TrieNode)   # missing child nodes are created on demand
        self.isEndOfWord = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for char in word:
            node = node.children[char]
        node.isEndOfWord = True

    def search(self, word):
        node = self.root
        for char in word:
            if char not in node.children:
                return False
            node = node.children[char]
        return node.isEndOfWord

    def starts_with(self, prefix):
        node = self.root
        for char in prefix:
            if char not in node.children:
                return []
            node = node.children[char]
        words = []
        self._dfs(node, prefix, words)
        return sorted(words)   # Returns sorted results

    def _dfs(self, node, prefix, words):
        if node.isEndOfWord:
            words.append(prefix)
        for char in sorted(node.children.keys()):   # Alphabetically ordered traversal
            self._dfs(node.children[char], prefix + char, words)

# Example Usage
trie = Trie()
words = ["apple", "app", "apex", "bat", "ball", "banana"]
for word in words:
    trie.insert(word)

print(trie.search("apple"))      # True
print(trie.search("bat"))        # True
print(trie.search("batman"))     # False
print(trie.starts_with("ap"))    # ['apex', 'app', 'apple']

2. Optimized Hash Table for Text Search

Improvements

✅ Uses Inverted Indexing: Instead of simple key-value storage, it maps words to multiple document positions.
✅ Handles Large Datasets: Efficiently processes and stores large text files.
✅ Tokenization & Normalization: Converts text to lowercase, removes punctuation, and splits efficiently.

Optimized Hash Table Code

import re
from collections import defaultdict

class HashTableSearch:
    def __init__(self):
        self.index = defaultdict(list)   # word -> [(doc_id, position), ...]

    def add_document(self, doc_id, text):
        words = re.findall(r'\b\w+\b', text.lower())   # Tokenization (ignores punctuation)
        for position, word in enumerate(words):
            self.index[word].append((doc_id, position))   # Store word position in each document

    def search(self, word):
        return self.index.get(word.lower(), [])

# Example Usage
search_engine = HashTableSearch()

documents = {
    1: "Apple is a fruit. Banana is also a fruit.",
    2: "Orange and Banana are my favorite fruits.",
    3: "I love eating an apple with peanut butter."
}

for doc_id, text in documents.items():
    search_engine.add_document(doc_id, text)

print(search_engine.search("apple"))    # [(1, 0), (3, 4)]
print(search_engine.search("banana"))   # [(1, 4), (2, 2)]
print(search_engine.search("fruits"))   # [(2, 6)]
print(search_engine.search("peanut"))   # [(3, 6)]


Further Optimizations

✅ For Trie

1. Compress the Trie using a Ternary Search Tree (reduces memory usage).
2. Use a DAWG (Directed Acyclic Word Graph) instead of a Trie for compact storage.
3. Use lazy deletion to optimize delete operations.

✅ For Hash Table

1. Use Bloom Filters to reduce unnecessary lookups (a minimal sketch follows below).
2. Implement n-gram indexing to support substring searches.
3. Use an Inverted Index with BM25 ranking (used in search engines like Elasticsearch).
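
As a minimal sketch of the Bloom-filter idea from item 1 above: the filter answers "definitely not indexed" or "possibly indexed" before the main index is consulted, avoiding pointless lookups. The BloomFilter class, its size and num_hashes defaults, and the md5/sha1 double-hashing scheme are illustrative assumptions, not part of the original implementation.

import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: a probabilistic set with no false negatives."""

    def __init__(self, size=10_000, num_hashes=4):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bytearray(size)   # one byte per bit position, for simplicity

    def _positions(self, word):
        # Derive several bit positions from two independent digests (double hashing).
        h1 = int.from_bytes(hashlib.md5(word.encode()).digest()[:8], "big")
        h2 = int.from_bytes(hashlib.sha1(word.encode()).digest()[:8], "big")
        return [(h1 + i * h2) % self.size for i in range(self.num_hashes)]

    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos] = 1

    def might_contain(self, word):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(word))

# Example usage: consult the filter before querying the main index.
bloom = BloomFilter()
for w in ["apple", "banana", "orange"]:
    bloom.add(w)
print(bloom.might_contain("apple"))    # True
print(bloom.might_contain("durian"))   # False (with high probability)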

Let's implement two advanced optimizations:

1. Compressed Trie using a Ternary Search Tree (TST) – more memory-efficient than a standard Trie.
2. Inverted Index with BM25 Ranking – used in real-world search engines like Elasticsearch and Google for ranked search results.

1. Optimized Ternary Search Tree (TST) for Memory Efficiency

A Ternary Search Tree (TST) is a Trie variation that:

✅ Reduces memory usage by using only three pointers per node (left, middle, right).
✅ Provides fast lookup and autocomplete like a normal Trie.
✅ Efficiently handles large dictionaries and word lists.

TST Implementation

class TSTNode:
    def __init__(self, char):
        self.char = char
        self.left = self.middle = self.right = None
        self.isEndOfWord = False

class TernarySearchTree:
    def __init__(self):
        self.root = None

    def insert(self, word):
        self.root = self._insert(self.root, word, 0)

    def _insert(self, node, word, index):
        if index >= len(word):
            return node
        char = word[index]
        if node is None:
            node = TSTNode(char)
        if char < node.char:
            node.left = self._insert(node.left, word, index)
        elif char > node.char:
            node.right = self._insert(node.right, word, index)
        else:
            if index == len(word) - 1:
                node.isEndOfWord = True
            else:
                node.middle = self._insert(node.middle, word, index + 1)
        return node

    def search(self, word):
        return self._search(self.root, word, 0)

    def _search(self, node, word, index):
        if not node or index >= len(word):
            return False
        char = word[index]
        if char < node.char:
            return self._search(node.left, word, index)
        elif char > node.char:
            return self._search(node.right, word, index)
        else:
            if index == len(word) - 1:
                return node.isEndOfWord
            return self._search(node.middle, word, index + 1)

    def autocomplete(self, prefix):
        node = self._search_prefix(self.root, prefix, 0)
        words = []
        if node:
            if node.isEndOfWord:   # the prefix itself may be a complete word
                words.append(prefix)
            self._dfs(node.middle, prefix, words)
        return words

    def _search_prefix(self, node, prefix, index):
        if not node or index >= len(prefix):
            return None
        char = prefix[index]
        if char < node.char:
            return self._search_prefix(node.left, prefix, index)
        elif char > node.char:
            return self._search_prefix(node.right, prefix, index)
        else:
            if index == len(prefix) - 1:
                return node
            return self._search_prefix(node.middle, prefix, index + 1)

    def _dfs(self, node, prefix, words):
        if node is None:
            return
        if node.isEndOfWord:
            words.append(prefix + node.char)
        self._dfs(node.left, prefix, words)
        self._dfs(node.middle, prefix + node.char, words)
        self._dfs(node.right, prefix, words)

# Example Usage
tst = TernarySearchTree()
words = ["cat", "car", "cart", "care", "bat", "bar", "bark"]
for word in words:
    tst.insert(word)

print(tst.search("car"))         # True
print(tst.search("cart"))        # True
print(tst.search("batman"))      # False
print(tst.autocomplete("ca"))    # ['cat', 'car', 'cart', 'care']

2. Inverted Index with BM25 Ranking

What is an Inverted Index?

An inverted index is a mapping from words to the documents where they appear. It enables fast retrieval by storing:

✅ Word → Document IDs + Positions
✅ Efficient multi-word search.
✅ Works well for large text datasets.

BM25 Ranking for Search Accuracy

BM25 is an advanced ranking algorithm that:

✔ Scores documents based on keyword frequency and relevance.
✔ Prioritizes documents where keywords appear more frequently.
✔ Adjusts scores based on document length.
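
For reference, the score computed by the implementation below is the standard BM25 formula, written here with the same parameters k1 and b that appear in the code:

score(D, Q) = \sum_{t \in Q} \mathrm{IDF}(t) \cdot \frac{f(t, D) \cdot (k_1 + 1)}{f(t, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}, \qquad \mathrm{IDF}(t) = \ln\left(\frac{N - n_t + 0.5}{n_t + 0.5} + 1\right)

where f(t, D) is the frequency of term t in document D, |D| is the length of D in words, avgdl is the average document length, N is the total number of documents, and n_t is the number of documents containing t.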

Implementation

import math
import re
from collections import Counter, defaultdict

class InvertedIndexBM25:
    def __init__(self, k1=1.5, b=0.75):
        self.index = defaultdict(list)   # word -> [(doc_id, position), ...]
        self.doc_lengths = {}            # doc_id -> document length
        self.total_docs = 0
        self.k1 = k1                     # BM25 term frequency saturation parameter
        self.b = b                       # Length normalization parameter

    def add_document(self, doc_id, text):
        words = re.findall(r'\b\w+\b', text.lower())
        self.doc_lengths[doc_id] = len(words)
        for pos, word in enumerate(words):
            self.index[word].append((doc_id, pos))
        self.total_docs += 1

    def bm25_score(self, query):
        avg_doc_length = sum(self.doc_lengths.values()) / self.total_docs
        doc_scores = defaultdict(float)
        query_words = re.findall(r'\b\w+\b', query.lower())
        for word in query_words:
            postings = self.index.get(word, [])
            term_freqs = Counter(doc_id for doc_id, _ in postings)   # occurrences per document
            doc_freq = len(term_freqs)                               # number of unique documents containing the word
            if doc_freq == 0:
                continue
            idf = math.log((self.total_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1)   # Inverse document frequency
            for doc_id, term_freq in term_freqs.items():   # score each document once per query word
                doc_length = self.doc_lengths[doc_id]
                # BM25 formula
                numerator = term_freq * (self.k1 + 1)
                denominator = term_freq + self.k1 * (1 - self.b + self.b * (doc_length / avg_doc_length))
                doc_scores[doc_id] += idf * (numerator / denominator)
        return sorted(doc_scores.items(), key=lambda x: x[1], reverse=True)   # Sort by score, highest first

# Example Usage
search_engine = InvertedIndexBM25()

documents = {
    1: "Apple is a popular fruit. It is red or green in color.",
    2: "Banana is a yellow fruit. It is sweet and healthy.",
    3: "Apple and banana are both fruits. Apple is often eaten raw.",
    4: "Orange is a citrus fruit. It is rich in vitamin C."
}

for doc_id, text in documents.items():
    search_engine.add_document(doc_id, text)

print(search_engine.bm25_score("apple fruit"))
# Output: ranked list of (doc_id, score) pairs, most relevant document first

Final Thoughts

✅ Which Approach Should You Use?


Feature            Trie / TST                  Inverted Index (BM25)
Search Type        Prefix-based search         Full-text search
Memory Usage       High (TST reduces it)       Lower than Trie
Ranking Results    No ranking                  BM25 ranks results
Wildcard Support   Limited                     Can match phrases easily
Best For           Autocomplete, spell check   Large text search engines

If you need autocomplete or prefix matching, use a TST (Trie variant).
If you need Google-like ranked search, use a BM25-based Inverted Index.
