MongoDB Full Text Search with Sphinx

This document describes MongoDB full text search using Sphinx. It covers storing course documents in MongoDB and using xmlpipe2 to stream them to Sphinx as XML. It highlights two pitfalls: document IDs must be unique integers, and data that is not properly UTF-8 encoded causes errors. It also covers mongos3, a tool that backs up MongoDB documents as S3 objects.

Uploaded by

postfix
Copyright
© Attribution Non-Commercial (BY-NC)
www.ocwsearch.com

MongoDB Full Text Search with Sphinx
Pierre Far, PhD
Twitter: @ocwsearch
Web: www.ocwsearch.com
Email: [email protected]
About www.ocwsearch.com

A search engine for the full text of OpenCourseWare course materials.
2600+ courses, 10 universities, 11 OCW collections
Courses in English, Japanese, Spanish, Dutch
Why MongoDB?

Very helpful community

Document DB

Schemaless
Technology Stack

[Architecture diagram. Labels: Website (HTML), API (JSON); Query; Index; mongos3; xmlpipe2; Amazon S3; Adaptor Scripts]
xmlpipe2

An XML document input format for Sphinx

Accepts any XML source, so...

Read courses from MongoDB and stream them as XML

sphinxsearch.com/wiki/doku.php?id=sphinx_xmlpipe2_tutorial
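
The streaming step might look roughly like the sketch below. It is not the author's script: the field names (title, body) and the sphinx_id key are assumptions made for illustration, and the courses are passed in as plain arrays so the example does not depend on a particular MongoDB driver.

```php
<?php
// Sketch only: 'title', 'body' and 'sphinx_id' are hypothetical names,
// not taken from the original slides.

/** Escape text for use inside XML element content. */
function xmlEscape(string $s): string
{
    // ENT_SUBSTITUTE replaces invalid UTF-8 sequences instead of returning ''.
    return htmlspecialchars($s, ENT_XML1 | ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

/** Yield an xmlpipe2 docset chunk by chunk, one document per course. */
function streamDocset(iterable $courses): Generator
{
    yield '<?xml version="1.0" encoding="utf-8"?>' . "\n";
    yield "<sphinx:docset>\n";
    // Declare the full-text fields that each document carries.
    yield "<sphinx:schema>\n"
        . "  <sphinx:field name=\"title\"/>\n"
        . "  <sphinx:field name=\"body\"/>\n"
        . "</sphinx:schema>\n";
    foreach ($courses as $course) {
        yield sprintf(
            "<sphinx:document id=\"%d\">\n  <title>%s</title>\n  <body>%s</body>\n</sphinx:document>\n",
            $course['sphinx_id'],
            xmlEscape($course['title']),
            xmlEscape($course['body'])
        );
    }
    yield "</sphinx:docset>\n";
}
```

A script named in the source's xmlpipe_command setting could then loop over the generator and echo each chunk to standard output, passing a MongoDB cursor in place of the array.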
Pitfall 1: Document ID

“ALL DOCUMENT IDS MUST BE UNIQUE UNSIGNED NON-ZERO INTEGER NUMBERS”

Generate a unique 10-digit numeric ID for each course.

Must be deterministic.
Put a unique index on the ID field.
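
One way to get a deterministic 10-digit ID is to hash a stable key, such as the course URL. The slides do not say how the IDs were actually generated, so the function below is an assumption: its name, the choice of MD5, and the range mapping are all illustrative. It assumes 64-bit PHP and keeps the result within the unsigned 32-bit range.

```php
<?php
// Sketch only: courseSphinxId() is a hypothetical helper, not the
// author's method. Assumes 64-bit PHP so hexdec() returns an int.

/** Map a stable string key to a 10-digit ID in [1000000000, 4294967295]. */
function courseSphinxId(string $key): int
{
    // First 32 bits of the MD5 digest, as an unsigned integer.
    $hash = hexdec(substr(md5($key), 0, 8));
    // Fold into the 10-digit part of the unsigned 32-bit range (never zero).
    return 1000000000 + ($hash % 3294967296);
}
```

Hash-based IDs can collide, which is where the unique index helps: a colliding insert fails loudly instead of silently producing two documents with the same Sphinx ID.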
Pitfall 2: UTF-8

“Fatal error: Uncaught exception 'MongoException' with message 'non-utf8 string'”

Encoding: it’s a lie.

mb_detect_encoding() is unreliable.

2-part solution
1. $HTML = @mb_convert_encoding($HTML, 'HTML-ENTITIES', 'utf-8');
2. $Text = FixEncoding($Text);
FixEncoding()

A set of real encoding detection functions

http://lachy.id.au/dev/2005/11/encoding-functions-source

FixEncoding() is a wrapper for these functions
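
The source of FixEncoding() is not shown in the slides. As a much simpler stand-in, the sketch below validates a string as UTF-8 and converts it from an assumed fallback encoding only when validation fails. The function name and the Windows-1252 default are assumptions; the functions linked above do real detection, which this does not.

```php
<?php
// Sketch only: ensureUtf8() is a hypothetical stand-in, not FixEncoding().
// Requires the mbstring extension.

/** Return $text as valid UTF-8, converting from $fallback if it is not. */
function ensureUtf8(string $text, string $fallback = 'Windows-1252'): string
{
    // Validate rather than detect: a well-formed UTF-8 string passes through.
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    // Otherwise assume the fallback encoding and convert.
    return mb_convert_encoding($text, 'UTF-8', $fallback);
}
```

Validating first avoids double-encoding text that is already UTF-8, which is a common side effect of converting blindly.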


UTF-8 in Sphinx

In sphinx.conf:
charset_type = utf-8
ngram_chars
charset_table

sphinxsearch.com/wiki/doku.php?id=charset_tables
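
For illustration, an index section using these three settings might look like the fragment below. The index name and the specific values are assumptions, not the author's configuration: the charset_table shown covers only ASCII letters and digits, and the ngram settings are the kind used for CJK text such as the Japanese courses. The fragment omits the source and path settings that a complete index needs.

```
# Fragment only; "courses" is a hypothetical index name.
index courses
{
    charset_type  = utf-8
    # Fold uppercase ASCII to lowercase; extend with ranges from the wiki page.
    charset_table = 0..9, A..Z->a..z, _, a..z
    # Index CJK characters as 1-grams so unsegmented text is searchable.
    ngram_len     = 1
    ngram_chars   = U+3000..U+2FA1F
}
```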
mongos3

MongoDB document = S3 object

Backup tool for MongoDB

$Contents = gzencode(json_encode($Course), 9);
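
The line above is the only part of mongos3 shown in the slides. A round trip around it could be sketched as follows; the function names are hypothetical, and the S3 upload itself is left out because the slides do not show which client was used. Requires the zlib extension.

```php
<?php
// Sketch only: packCourse() and unpackCourse() are hypothetical names.

/** Serialize one course document to gzip-compressed JSON, as on the slide. */
function packCourse(array $course): string
{
    $json = json_encode($course);
    if ($json === false) {
        // json_encode() fails on non-UTF-8 strings, the same data problem as Pitfall 2.
        throw new RuntimeException('Could not encode course: ' . json_last_error_msg());
    }
    return gzencode($json, 9);
}

/** Restore a course document from the stored object body. */
function unpackCourse(string $contents): array
{
    return json_decode(gzdecode($contents), true);
}
```

Each packed string would become the body of one S3 object, matching the "MongoDB document = S3 object" mapping.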
Thanks!
Any questions?
Twitter: @ocwsearch
Web: www.ocwsearch.com
Email: [email protected]