Zihad Project
Concept
A Trie (prefix tree) is a specialized tree data structure used for fast retrieval
of keys, particularly useful for dictionary and text search applications. It
stores words by breaking them into characters and organizing them
hierarchically.
Structure of a Trie
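As a rough illustration of the hierarchy described above, the sketch below builds a trie from nested dictionaries, one level per character. The `END` marker key and the `build_trie` helper are names invented for this example and are not part of the implementation later in this document.

```python
# Hypothetical marker key that flags the end of a stored word.
END = "$"

def build_trie(words):
    """Build a trie as nested dicts: each key is a character, each value a sub-trie."""
    root = {}
    for word in words:
        node = root
        for char in word:
            # Reuse the existing branch when this prefix is already stored.
            node = node.setdefault(char, {})
        node[END] = True
    return root

trie = build_trie(["app", "apple", "bat"])
print(sorted(trie.keys()))         # ['a', 'b']
print(END in trie["a"]["p"]["p"])  # True: "app" is a stored word
print(END in trie["a"]["p"])       # False: "ap" is only a prefix
```

Because "app" and "apple" share the path a, p, p, the common prefix is stored once, which is the property that makes prefix queries cheap.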
Operations
Insertion (O(n))
Search (O(n))
Start from the root and check if the characters of the query exist in
sequence.
If all characters match and isEndOfWord is true at the final node, the word exists.
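The two search steps above can be sketched as follows. This is a minimal illustration, not the full implementation given later; `Node`, `insert`, and `contains` are hypothetical names chosen for this example.

```python
class Node:
    """Minimal trie node for this sketch (hypothetical name)."""
    def __init__(self):
        self.children = {}
        self.isEndOfWord = False

def insert(root, word):
    node = root
    for char in word:
        node = node.children.setdefault(char, Node())
    node.isEndOfWord = True

def contains(root, query):
    node = root
    for char in query:
        node = node.children.get(char)
        if node is None:
            return False  # a character is missing, so the query is not stored
    return node.isEndOfWord  # the path exists; it counts only if a word ends here

root = Node()
insert(root, "apple")
print(contains(root, "apple"))  # True
print(contains(root, "app"))    # False: the path exists but no word ends there
print(contains(root, "apply"))  # False: "y" is not a child of "l"
```

The "app" case shows why the end-of-word flag is needed: matching every character is not enough on its own.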
Prefix Matching (Auto-complete) (O(n))
Advantages
Disadvantages
Use Cases
Autocomplete suggestions.
Spell checking.
Dictionary applications.
Concept
A Hash Table is a key-value data structure that provides fast lookups using
hash functions. In a document search engine, we use hashing to index
words from documents.
Insertion (O(1))
Store the word in the hash table with references to the document and
position.
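A minimal sketch of this insertion step, assuming whitespace tokenization and a plain dict as the hash table. The `index_document` helper and the sample texts are made up for illustration.

```python
def index_document(index, doc_id, text):
    """Record a (doc_id, position) pair for every word in text."""
    for position, word in enumerate(text.lower().split()):
        # setdefault creates the posting list the first time a word is seen.
        index.setdefault(word, []).append((doc_id, position))

index = {}
index_document(index, 1, "apple banana apple")
index_document(index, 2, "banana cherry")
print(index["apple"])   # [(1, 0), (1, 2)]
print(index["banana"])  # [(1, 1), (2, 0)]
```

Each append is an average-case O(1) hash operation, so indexing a document costs time proportional to its word count.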
Search (O(1))
Advantages
Disadvantages
No prefix-based searching.
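To make this limitation concrete, the sketch below contrasts an exact lookup with a prefix lookup on a plain dict. The sample index and the `keys_with_prefix` helper are hypothetical.

```python
index = {"apple": [1, 3], "app": [2], "banana": [1, 2]}

# Exact lookup: a single hash probe.
print(index.get("apple", []))  # [1, 3]

def keys_with_prefix(index, prefix):
    """Hashing keeps no ordering between keys, so every key has to be checked."""
    return sorted(key for key in index if key.startswith(prefix))

print(keys_with_prefix(index, "app"))  # ['app', 'apple']
```

The prefix query works, but its cost grows with the number of keys in the table rather than with the length of the prefix.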
Use Cases
Database indexing.
Feature            Trie            Hash Table
Substring Search   Not efficient   Needs additional processing
Conclusion
Use a Hash Table if you need fast keyword lookups across large
documents.
This implementation allows inserting words into a Trie and searching for full
words or words with a given prefix.
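class TrieNode:
    def __init__(self):
        self.children = {}         # maps a character to the child TrieNode
        self.isEndOfWord = False   # True when a stored word ends at this node

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for char in word:
            if char not in node.children:
                node.children[char] = TrieNode()
            node = node.children[char]
        node.isEndOfWord = True

    def search(self, word):
        node = self.root
        for char in word:
            if char not in node.children:
                return False
            node = node.children[char]
        return node.isEndOfWord

    # The names starts_with and _collect are assumed; the originals are missing from the source.
    def starts_with(self, prefix):
        node = self.root
        for char in prefix:
            if char not in node.children:
                return []
            node = node.children[char]
        words = []
        self._collect(node, prefix, words)
        return words

    def _collect(self, node, prefix, words):
        if node.isEndOfWord:
            words.append(prefix)
        for char, child in node.children.items():
            self._collect(child, prefix + char, words)

# Example Usage
trie = Trie()
words = ["apple", "app", "apex", "bat", "ball", "banana"]
for word in words:
    trie.insert(word)
print(trie.search("apple")) # True
print(trie.search("bat")) # True
print(trie.search("batman")) # False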
Here, we use a hash table (dictionary) to store words as keys and lists of
document IDs as values.
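from collections import defaultdict

class HashTableSearch:
    def __init__(self):
        self.index = defaultdict(list)  # word -> list of document IDs
    def add_document(self, doc_id, text):
        words = text.lower().split()
        for word in words:
            self.index[word].append(doc_id)
    def search(self, word):
        return self.index.get(word.lower(), [])  # .get avoids creating empty entries

search_engine = HashTableSearch()
documents = {  # illustrative texts chosen to match the outputs below; the originals are missing
    1: "apple banana mango",
    2: "banana grape orange",
    3: "apple banana cherry",
}
for doc_id, text in documents.items():
    search_engine.add_document(doc_id, text)
print(search_engine.search("apple")) # [1, 3]
print(search_engine.search("banana")) # [1, 2, 3]
print(search_engine.search("grape")) # [2]
print(search_engine.search("mango")) # [1]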
Key Takeaways
Improvements
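from collections import defaultdict

class TrieNode:
    def __init__(self):
        self.children = defaultdict(TrieNode)  # missing children are created on access
        self.isEndOfWord = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for char in word:
            node = node.children[char]
        node.isEndOfWord = True

    def search(self, word):
        node = self.root
        for char in word:
            if char not in node.children:  # membership test does not create a node
                return False
            node = node.children[char]
        return node.isEndOfWord

    # The names starts_with and _collect are assumed; the originals are missing from the source.
    def starts_with(self, prefix):
        node = self.root
        for char in prefix:
            if char not in node.children:
                return []
            node = node.children[char]
        words = []
        self._collect(node, prefix, words)
        return words

    def _collect(self, node, prefix, words):
        if node.isEndOfWord:
            words.append(prefix)
        for char, child in node.children.items():
            self._collect(child, prefix + char, words)

# Example Usage
trie = Trie()
words = ["apple", "app", "apex", "bat", "ball", "banana"]
for word in words:
    trie.insert(word)
print(trie.search("apple")) # True
print(trie.search("bat")) # True
print(trie.search("batman")) # False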
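Improvements
import re
from collections import defaultdict

class HashTableSearch:
    def __init__(self):
        self.index = defaultdict(list)
    def add_document(self, doc_id, text):  # body assumed: regex tokens stored with positions
        for pos, word in enumerate(re.findall(r"\w+", text.lower())):
            self.index[word].append((doc_id, pos))
    def search(self, word):
        return self.index.get(word.lower(), [])

# Example Usage
search_engine = HashTableSearch()
documents = {}  # document texts are missing from the source
for doc_id, text in documents.items():
    search_engine.add_document(doc_id, text)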
✅ For Trie
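TST Implementation
class TSTNode:
    def __init__(self, char):
        self.char = char
        self.left = self.middle = self.right = None
        self.isEndOfWord = False

class TernarySearchTree:  # helper and prefix-search method names are assumed; the originals are missing
    def __init__(self):
        self.root = None
    def insert(self, word):  # assumes a non-empty word
        self.root = self._insert(self.root, word, 0)
    def _insert(self, node, word, index):
        char = word[index]
        if node is None:
            node = TSTNode(char)
        if char < node.char:
            node.left = self._insert(node.left, word, index)
        elif char > node.char:
            node.right = self._insert(node.right, word, index)
        elif index == len(word) - 1:
            node.isEndOfWord = True
        else:
            node.middle = self._insert(node.middle, word, index + 1)
        return node
    def search(self, word):
        return bool(word) and self._search(self.root, word, 0)
    def _search(self, node, word, index):
        if node is None:
            return False
        char = word[index]
        if char < node.char:
            return self._search(node.left, word, index)
        if char > node.char:
            return self._search(node.right, word, index)
        if index == len(word) - 1:
            return node.isEndOfWord
        return self._search(node.middle, word, index + 1)
    def starts_with(self, prefix):
        words = []
        node = self._find_node(self.root, prefix, 0) if prefix else None
        if node:
            if node.isEndOfWord:
                words.append(prefix)
            self._collect(node.middle, prefix, words)
        return words
    def _find_node(self, node, prefix, index):
        if node is None:
            return None
        char = prefix[index]
        if char < node.char:
            return self._find_node(node.left, prefix, index)
        if char > node.char:
            return self._find_node(node.right, prefix, index)
        if index == len(prefix) - 1:
            return node
        return self._find_node(node.middle, prefix, index + 1)
    def _collect(self, node, prefix, words):
        if node is None:
            return
        self._collect(node.left, prefix, words)  # siblings share the same prefix
        if node.isEndOfWord:
            words.append(prefix + node.char)
        self._collect(node.middle, prefix + node.char, words)
        self._collect(node.right, prefix, words)

# Example Usage
tst = TernarySearchTree()
words = ["car", "cart", "cat", "bat"]  # illustrative list; the original is missing from the source
for word in words:
    tst.insert(word)
print(tst.search("car")) # True
print(tst.search("cart")) # True
print(tst.search("batman")) # False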
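Implementation
import math
import re
from collections import Counter, defaultdict
class InvertedIndexBM25:
    def __init__(self):
        self.index = defaultdict(list)  # word -> [(doc_id, position), ...]
        self.doc_lengths = {}           # doc_id -> number of tokens
        self.total_docs = 0
    def add_document(self, doc_id, text):
        words = re.findall(r"\w+", text.lower())
        self.doc_lengths[doc_id] = len(words)
        for pos, word in enumerate(words):
            self.index[word].append((doc_id, pos))
        self.total_docs += 1
    def bm25_score(self, query, k1=1.5, b=0.75):  # k1 and b are assumed common defaults
        doc_scores = defaultdict(float)
        avg_len = sum(self.doc_lengths.values()) / max(self.total_docs, 1)
        for term in re.findall(r"\w+", query.lower()):
            tf = Counter(doc_id for doc_id, _ in self.index.get(term, []))
            idf = math.log((self.total_docs - len(tf) + 0.5) / (len(tf) + 0.5) + 1)
            for doc_id, freq in tf.items():
                doc_length = self.doc_lengths[doc_id]
                # BM25 Formula (standard Okapi form, assumed; the original lines are missing)
                doc_scores[doc_id] += idf * freq * (k1 + 1) / (freq + k1 * (1 - b + b * doc_length / avg_len))
        return sorted(doc_scores.items(), key=lambda item: item[1], reverse=True)
# Example Usage
search_engine = InvertedIndexBM25()
documents = {  # the other documents are missing from the source
    3: "Apple and banana are both fruits. Apple is often eaten raw.",
}
for doc_id, text in documents.items():
    search_engine.add_document(doc_id, text)
print(search_engine.bm25_score("apple fruit"))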
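For reference, the step marked "BM25 Formula" in the code above corresponds to the standard Okapi BM25 score, which is commonly written as below. Here N plays the role of total_docs and |D| of doc_length; f(q, D) is the number of times term q occurs in document D, n(q) is the number of documents containing q, and avgdl is the average document length. The tuning parameters k_1 and b (values near 1.2 to 2.0 and 0.75 are common) and the idf variant with +1 inside the logarithm are assumptions, since the source does not show which form was used.

```latex
\mathrm{score}(D, Q) = \sum_{q \in Q} \mathrm{IDF}(q) \cdot
  \frac{f(q, D)\,(k_1 + 1)}
       {f(q, D) + k_1 \left(1 - b + b \, \frac{|D|}{\mathrm{avgdl}}\right)},
\qquad
\mathrm{IDF}(q) = \ln\!\left(\frac{N - n(q) + 0.5}{n(q) + 0.5} + 1\right)
```

The b term normalizes for document length, so a long document needs more occurrences of a term than a short one to reach the same score.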
Final Thoughts
Feature           Trie                    Hash Table
Memory Usage      High (TST reduces it)   Lower than Trie
Ranking Results   No ranking              BM25 ranks results