
Developing a Text Search Engine Using Whoosh Library in Python
Whoosh is a Python library of classes and functions for indexing text and then searching the index. Suppose you are building an application that needs to go through many documents and find similarities, extract data that meets a few predefined conditions, or count how many times a project title is mentioned in a research paper. The search engine built in this tutorial will come in handy for such tasks.
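To make "indexing text and then searching the index" more concrete, here is a minimal sketch of an inverted index in plain Python. It only illustrates the general idea and is not how Whoosh is implemented internally; the names build_index and search, and the second sample document, are hypothetical and chosen for this example.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical sample documents, keyed by a path-like id.
docs = {
    "/a": "Py doc hello big world",
    "/b": "hello again",
}
idx = build_index(docs)
print(search(idx, "hello world"))
print(sorted(search(idx, "hello")))
```

The lookup touches only the words of the query instead of scanning every document, which is the main reason to build an index before searching.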
Getting Started
For building our text search engine, we will be working with the whoosh library.
This library does not come pre-packaged with Python, so we'll download and install it using the pip package manager.
To install the whoosh library, run the following command.
pip install whoosh
Now we can import it into our script using the following lines.
from whoosh.fields import Schema, TEXT, ID
from whoosh import index
Building a Text Search Engine using Python
First, let us define a folder where we will be saving the indexed files when needed.
import os

# Create the index folder; exist_ok avoids an error if it already exists.
os.makedirs("dir", exist_ok=True)
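Next, let us define a schema. A schema specifies the fields of the documents in an index. We then create the index in the folder, add one document to it, and commit the change.
# Every field is stored, so its value is returned with each search hit.
schema = Schema(title=TEXT(stored=True), path=ID(stored=True), content=TEXT(stored=True))

# Create a new index in the "dir" folder using the schema.
ind = index.create_in("dir", schema)

# Add a single document and commit it to the index.
writer = ind.writer()
writer.add_document(title=u"doc", content=u"Py doc hello big world", path=u"/a")
writer.commit()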
Now that we've indexed the document, we can search it.
from whoosh.qparser import QueryParser

with ind.searcher() as searcher:
    # Parse the query string against the "content" field.
    query = QueryParser("content", ind.schema).parse("hello world")
    # terms=True records which query terms matched.
    results = searcher.search(query, terms=True)
    for r in results:
        print(r, r.score)
    # Print the matched terms once, after listing the hits.
    if results.has_matched_terms():
        print(results.matched_terms())
Output
It will produce the following output:
<Hit {'path': '/a', 'title': 'doc', 'content': 'Py doc hello big world'}> 1.7906976744186047
{('content', b'hello'), ('content', b'world')}
Example
Here is the complete code:
import os

from whoosh import index
from whoosh.fields import Schema, TEXT, ID
from whoosh.qparser import QueryParser

# Create the index folder; exist_ok avoids an error if it already exists.
os.makedirs("dir", exist_ok=True)

# Define the fields of each document and create the index.
schema = Schema(title=TEXT(stored=True), path=ID(stored=True), content=TEXT(stored=True))
ind = index.create_in("dir", schema)

# Add a single document and commit it to the index.
writer = ind.writer()
writer.add_document(title=u"doc", content=u"Py doc hello big world", path=u"/a")
writer.commit()

# Search the "content" field and report the hits and matched terms.
with ind.searcher() as searcher:
    query = QueryParser("content", ind.schema).parse("hello world")
    results = searcher.search(query, terms=True)
    for r in results:
        print(r, r.score)
    if results.has_matched_terms():
        print(results.matched_terms())
Conclusion
You have now learned how to build a basic text search engine in Python. The same steps (define a schema, index the documents, then parse and run a query) let you search through many documents and pull out the content you need. You've also had a first look at what the Whoosh library can do.