How To Offload MySQL Server
with Sphinx
Vladimir Fedorkov, Sphinx Technologies
Percona Live, MySQL UC, Santa Clara 2012
About me
• For users
• For search engines
An application is not a solid rock
• Lots of layers
• Apache/Nginx/Lighttpd/Tomcat/You name it
• Perl/PHP/Python/Ruby/Java/.NET/C++/Haskell/…
• Percona Server/MariaDB/Drizzle/MySQL/…
• PostgreSQL/MSSQL/Oracle/DB2/Firebird...
• Memcache/MongoDB/CouchDB…
• Sphinx/Lucene/SOLR/Elastic/IndexDen/…
• Third party libraries and frameworks
• Your own code
Which one to use?
What do we need?
Data layer. The basement.
• http://sphinxsearch.com/downloads/
• http://sphinxsearch.googlecode.com/svn/
• configure && make && make install
Where to look for the data?
• MySQL
• PostgreSQL
• MSSQL
• ODBC source
• XML pipe
MySQL source
source data_source
{
…
sql_query = \
SELECT id, channel_id, ts, title, content \
FROM mytable
sql_attr_uint = channel_id
sql_attr_timestamp = ts
…
}
A complete version
source data_source
{
type = mysql
sql_host = localhost
sql_user = my_user
sql_pass = my******
sql_db = test
sql_query = \
SELECT id, channel_id, ts, title, content \
FROM mytable
sql_attr_uint = channel_id
sql_attr_timestamp = ts
}
index my_sphinx_index
{
# index name and path are examples; text-processing settings
# (html_strip, morphology, stopwords, charset_type) are index-level
source = data_source
path = /var/lib/sphinx/my_sphinx_index
html_strip = 1
morphology = stem_en
stopwords = stopwords.txt
charset_type = utf-8
}
Indexer configuration
indexer
{
mem_limit = 512M
max_iops = 40
max_iosize = 1048576
}
Configuring searchd
searchd
{
listen = localhost:9312
listen = localhost:9306:mysql41
query_log = query.log
query_log_format = sphinxql
pid_file = searchd.pid
}
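With source, index, indexer and searchd sections in place, building and starting is a two-step run. A minimal sketch; the config path is an example, adjust it for your install:
$ indexer --all --config /usr/local/etc/sphinx.conf
$ searchd --config /usr/local/etc/sphinx.conf
For later rebuilds, indexer --all --rotate builds new index files while searchd keeps serving the old ones.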
Integration
Just like MySQL
$ mysql -h 0 -P 9306
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 1
Server version: 2.1.0-id64-dev (r3028)
Type 'help;' or '\h' for help. Type '\c' to clear the current
input statement.
mysql>
But not quite!
mysql> SELECT *
-> FROM lj1m
-> WHERE MATCH('I love Sphinx')
-> LIMIT 5
-> OPTION field_weights=(title=100, content=1);
+---------+--------+------------+------------+
| id | weight | channel_id | ts |
+---------+--------+------------+------------+
| 7637682 | 101652 | 358842 | 1112905663 |
| 6598265 | 101612 | 454928 | 1102858275 |
| 6941386 | 101612 | 424983 | 1076253605 |
| 6913297 | 101584 | 419235 | 1087685912 |
| 7139957 | 1667 | 403287 | 1078242789 |
+---------+--------+------------+------------+
5 rows in set (0.00 sec)
What's different?
• Meta fields @weight, @groupby, @count
• No full-text fields in output
• So far
• Requires an additional lookup to fetch the full rows
• The MySQL query becomes a primary key lookup (see the sketch after this list)
• WHERE id IN (33, 9, 12, …, 17, 5)
• Good for caching
• Adding nodes is transparent to the application
• zero downtime or less ;-)
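A minimal sketch of that two-step flow, using the index and table names from the examples above (the id list is simply what the first query returned):
Step 1, against Sphinx (SphinxQL, port 9306):
SELECT id FROM lj1m WHERE MATCH('I love Sphinx') LIMIT 5;
Step 2, against MySQL, by primary key:
SELECT id, title, content FROM mytable WHERE id IN (7637682, 6598265, 6941386, 6913297, 7139957);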
SQL & SphinxQL
• API
• PHP, Python, Java, Ruby, and C clients are included in the distro
• .NET and Rails (via Thinking Sphinx) clients are available
• SphinxSE
• Prebuilt into MariaDB
Sphinx API
<?php
require ( "sphinxapi.php" ); // from the Sphinx distro
$cl = new SphinxClient();
$cl->SetServer ( $host, $port );
$res = $cl->Query ( "my first query", "my_sphinx_index" );
var_dump ( $res );
?>
• More in api/test.php
3. Facets? You bet!
mysql> SELECT *, YEAR(ts) as yr
-> FROM lj1m
-> WHERE MATCH('I love Sphinx')
-> GROUP BY yr
-> ORDER BY yr DESC
-> LIMIT 5
-> OPTION field_weights=(title=100, content=1);
+---------+--------+------------+------------+------+----------+--------+
| id | weight | channel_id | ts | yr | @groupby | @count |
+---------+--------+------------+------------+------+----------+--------+
| 7637682 | 101652 | 358842 | 1112905663 | 2005 | 2005 | 14 |
| 6598265 | 101612 | 454928 | 1102858275 | 2004 | 2004 | 27 |
| 7139960 | 1642 | 403287 | 1070220903 | 2003 | 2003 | 8 |
| 5340114 | 1612 | 537694 | 1020213442 | 2002 | 2002 | 1 |
| 5744405 | 1588 | 507895 | 995415111 | 2001 | 2001 | 1 |
+---------+--------+------------+------------+------+----------+--------+
5 rows in set (0.00 sec)
4. Real Time engine
index rt
{
type = rt
rt_mem_limit = 512M
rt_field = title
rt_field = content
rt_attr_uint = channel_id
rt_attr_timestamp = ts
}
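RT indexes are filled over SphinxQL (INSERT/REPLACE) instead of by indexer. A minimal sketch against the rt index above; the document id and values are made up for illustration:
INSERT INTO rt (id, title, content, channel_id, ts)
VALUES (1, 'hello sphinx', 'realtime body text', 358842, 1112905663);
SELECT * FROM rt WHERE MATCH('hello');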
RT — Memory Utilization
• Profile
• Scale
• Optimize
• Compact
Remove high-frequency words
• Stopwords (see stopwords.txt above) keep the most frequent words out of the index
Use simpler ranking
• SPH_RANK_NONE
• Fastest, implements boolean search
• SPH_RANK_WORDCOUNT
• SPH_RANK_PROXIMITY
Use custom ranking
• SPH_RANK_SPH04
• Actually slower, but more relevant in some cases
• SPH_RANK_EXPR
• Lets you build your own ranker (see the sketch after the factor list)
Available ranking factors
• Document-level
• bm25
• max_lcs, field_mask, doc_word_count
• Field-level
• LCS (Longest Common Subsequence)
• hit_count, word_count, tf_idf
• More :)
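With SPH_RANK_EXPR those factors combine into your own formula right in the query. A sketch in SphinxQL against the index used earlier; the expression is the one the Sphinx docs give as the equivalent of the default proximity_bm25 ranker:
SELECT * FROM lj1m WHERE MATCH('I love Sphinx')
OPTION ranker=expr('sum(lcs*user_weight)*1000+bm25');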
Extended search syntax
• Phrase search
• "hello world"
• Proximity search
• "hello world"~10
• Distance support
• hello NEAR/10 world
• Quorum matching
• "the world is a wonderful place"/3
Even more
SELECT *,
GEODIST(docs_lat, doc_long, %d1, %d2) as dist
FROM sphinx_index
ORDER BY dist DESC
LIMIT 0, 20
Segments and Ranges
• Grouping results by
• Price ranges (items, offers)
• Date range (blog posts and news articles)
• Ratings (product reviews)
• INTERVAL(field, x0, x1, …, xN)
SELECT
INTERVAL(item_price, 0, 20, 50, 90) as range, @count
FROM my_sphinx_products GROUP BY range
ORDER BY range ASC;
Segments: Results example
+-------+--------+-------+--------+
| id | weight | range | @count |
+-------+--------+-------+--------+
| 34545 | 1 | 1 | 654 |
| 75836 | 1 | 2 | 379 |
| 94862 | 1 | 3 | 14 |
+-------+--------+-------+--------+
3 rows in set (0.00 sec)
Performance tricks: sql_file_field
• sql_file_field
• Keeps huge text collections out of the database.
• sql_file_field = path_to_text_file
• max_file_field_buffer needs to be set properly
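A minimal sketch of how the pieces fit together; the column name content_file and the buffer value are assumptions, not taken from the deck:
source data_source
{
…
# the column holds a path; indexer reads that file and indexes its contents
sql_file_field = content_file
}
indexer
{
# cap on the file size indexer will load, in bytes (example: 8 MB)
max_file_field_buffer = 8388608
}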
Scaling: single-box configuration
index my_distributed_index1
{
type = distributed
local = ondisk_index1
local = ondisk_index2
local = ondisk_index3
local = ondisk_index4
}
searchd
{
…
dist_threads = 4
…
}
Scaling: multi-box configuration
index my_distributed_index2
{
type = distributed
agent = 192.168.100.51:9312:ondisk_index1
agent = 192.168.100.52:9312:ondisk_index2
agent = 192.168.100.53:9312:rt_index
}
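The application queries a distributed index exactly like a local one, which is why adding or moving agents needs no application changes:
SELECT * FROM my_distributed_index2 WHERE MATCH('I love Sphinx') LIMIT 5;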
Know your queries