php下利用scws+sphinx+mysql打造千万级数据全文搜索
Sphinx是由俄罗斯人Andrew Aksyonoff开发的一个全文检索引擎。意图为其他应用提供高速、低空间占用、高结果 相关度的全文搜索功能。Sphinx可以非常容易的与SQL数据库和脚本语言集成。当前系统内置MySQL和PostgreSQL 数据库数据源的支持,也支持从标准输入读取特定格式 的XML数据。
Sphinx创建索引的速度为:创建100万条记录的索引只需3~4分钟,创建1000万条记录的索引可以在50分钟内完成,而只包含最新10万条记录的增量索引,重建一次只需几十秒。
Sphinx的特性如下:
a) 高速的建立索引(在当代CPU上,峰值性能可达到10 MB/秒);
b) 高性能的搜索(在2 – 4GB 的文本数据上,平均每次检索响应时间小于0.1秒);
c) 可处理海量数据(目前已知可以处理超过100 GB的文本数据, 在单一CPU的系统上可处理100 M 文档);
d) 提供了优秀的相关度算法,基于短语相似度和统计(BM25)的复合Ranking方法;
e) 支持分布式搜索;
f) 支持短语搜索
g) 提供文档摘要生成
h) 可作为MySQL的存储引擎提供搜索服务;
i) 支持布尔、短语、词语相似度等多种检索模式;
j) 文档支持多个全文检索字段(最大不超过32个);
k) 文档支持多个额外的属性信息(例如:分组信息,时间戳等);
l) 支持断词;
虽然mysql的MYISAM提供全文索引,但是性能却不敢让人恭维
开始搭建
系统环境:centos6.5+php5.6+apache+mysql
1、安装依赖包
[php]view plaincopy
yum-yinstallmakegccg++gcc-c++libtoolautoconfautomakeimakephp-develmysql-devellibxml2-develexpat-devel
2、安装Sphinx
[php]view plaincopy
yuminstallexpatexpat-devel
wget-chttps://sphinxsearch.com/files/sphinx-2.0.7-release.tar.gz
tarzxvfsphinx-2.0.7-release.tar.gz
cdsphinx-2.0.7-release
./configure--prefix=/usr/local/sphinx--with-mysql--with-libexpat--enable-id64
make&&makeinstall
3、安装libsphinxclient,php扩展用到
[php]view plaincopy
cdapi/libsphinxclient
./configure--prefix=/usr/local/sphinx/libsphinxclient
make&&makeinstall
4、安装Sphinx的PHP扩展:我的是5.6需装sphinx-1.3.3.tgz,如果是php5.4以下可sphinx-1.3.0.tgz
[php]view plaincopy
wget-chttps://pecl.php.net/get/sphinx-1.3.3.tgz
tarzxvfsphinx-1.3.3.tgz
cdsphinx-1.3.3
phpize
./configure--with-sphinx=/usr/local/sphinx/libsphinxclient/--with-php-config=/usr/bin/php-config
make&&makeinstall
成功后会提示:
Installingsharedextensions:/usr/lib64/php/modules/
echo"[Sphinx]">>/etc/php.ini
echo"extension=sphinx.so">>/etc/php.ini
#重启apache
servicehttpdrestart
5、创建测试数据
[php]view plaincopy
CREATETABLEIFNOTEXISTS`items`(
`id`int(11)NOTNULLAUTO_INCREMENT,
`title`varchar(255)NOTNULL,
`content`textNOTNULL,
`created`datetimeNOTNULL,
PRIMARYKEY(`id`)
)ENGINE=InnoDBDEFAULTCHARSET=utf8COMMENT='全文检索测试的数据表'AUTO_INCREMENT=11;
INSERTINTO`items`(`id`,`title`,`content`,`created`)VALUES
(1,'linuxmysql集群安装','MySQLCluster是MySQL适合于分布式计算环境的高实用、可拓展、高性能、高冗余版本','2016-09-0700:00:00'),
(2,'mysql主从复制','mysql主从备份(复制)的基本原理mysql支持单向、异步复制,复制过程中一个服务器充当主服务器,而一个或多个其它服务器充当从服务器','2016-09-0600:00:00'),
(3,'hello','canyousearchme','2016-09-0500:00:00'),
(4,'mysql','mysqlisthebestdatabase','2016-09-0300:00:00'),
(5,'mysql索引','关于MySQL索引的好处,如果正确合理设计并且使用索引的MySQL是一辆兰博基尼的话,那么没有设计和使用索引的MySQL就是一个人力三轮车','2016-09-0100:00:00'),
(6,'集群','关于MySQL索引的好处,如果正确合理设计并且使用索引的MySQL是一辆兰博基尼的话,那么没有设计和使用索引的MySQL就是一个人力三轮车','0000-00-0000:00:00'),
(9,'复制原理','redis也有复制','0000-00-0000:00:00'),
(10,'redis集群','集群技术是构建高性能网站架构的重要手段,试想在网站承受高并发访问压力的同时,还需要从海量数据中查询出满足条件的数据,并快速响应,我们必然想到的是将数据进行切片,把数据根据某种规则放入多个不同的服务器节点,来降低单节点服务器的压力','0000-00-0000:00:00');
CREATETABLEIFNOTEXISTS`sph_counter`(
`counter_id`int(11)NOTNULL,
`max_doc_id`int(11)NOTNULL,
PRIMARYKEY(`counter_id`)
)ENGINE=MyISAMDEFAULTCHARSET=utf8COMMENT='增量索引标示的计数表';
以下采用"Main + Delta" ("主索引"+"增量索引")的索引策略,使用Sphinx自带的一元分词。
6、Sphinx配置:注意修改数据源配置信息
[php]view plaincopy
vi/usr/local/sphinx/etc/sphinx.conf
sourceitems{
type=mysql
sql_host=localhost
sql_user=root
sql_pass=123456
sql_db=sphinx_items
sql_query_pre=SETNAMESutf8
sql_query_pre=SETSESSIONquery_cache_type=OFF
sql_query_pre=REPLACEINTOsph_counterSELECT1,MAX(id)FROMitems
sql_query_range=SELECTMIN(id),MAX(id)FROMitems\
WHEREid<=(SELECTmax_doc_idFROMsph_counterWHEREcounter_id=1)
sql_range_step=1000
sql_ranged_throttle=1000
sql_query=SELECTid,title,content,created,0asdeletedFROMitems\
WHEREid<=(SELECTmax_doc_idFROMsph_counterWHEREcounter_id=1)\
ANDid>=$startANDid<=$end
sql_attr_timestamp=created
sql_attr_bool=deleted
}
sourceitems_delta:items{
sql_query_pre=SETNAMESutf8
sql_query_range=SELECTMIN(id),MAX(id)FROMitems\
WHEREid>(SELECTmax_doc_idFROMsph_counterWHEREcounter_id=1)
sql_query=SELECTid,title,content,created,0asdeletedFROMitems\
WHEREid>(SELECTmax_doc_idFROMsph_counterWHEREcounter_id=1)\
ANDid>=$startANDid<=$end
sql_query_post_index=set@max_doc_id:=(SELECTmax_doc_idFROMsph_counterWHEREcounter_id=1)
sql_query_post_index=REPLACEINTOsph_counterSELECT2,IF($maxid,$maxid,@max_doc_id)
}
#主索引
indexitems{
source=items
path=/usr/local/sphinx/var/data/items
docinfo=extern
morphology=none
min_word_len=1
min_prefix_len=0
html_strip=1
html_remove_elements=style,script
ngram_len=1
ngram_chars=U+3000..U+2FA1F
charset_type=utf-8
charset_table=0..9,A..Z->a..z,_,a..z,U+410..U+42F->U+430..U+44F,U+430..U+44F
preopen=1
min_infix_len=1
}
#增量索引
indexitems_delta:items{
source=items_delta
path=/usr/local/sphinx/var/data/items-delta
}
#分布式索引
indexmaster{
type=distributed
local=items
local=items_delta
}
indexer{
mem_limit=256M
}
searchd{
listen=9312
listen=9306:mysql41#UsedforSphinxQL
log=/usr/local/sphinx/var/log/searchd.log
query_log=/usr/local/sphinx/var/log/query.log
compat_sphinxql_magics=0
attr_flush_period=600
mva_updates_pool=16M
read_timeout=5
max_children=0
dist_threads=2
pid_file=/usr/local/sphinx/var/log/searchd.pid
max_matches=1000
seamless_rotate=1
preopen_indexes=1
unlink_old=1
workers=threads#forRTtowork
binlog_path=/usr/local/sphinx/var/data
}保存退出
7、Sphinx创建索引
[php]view plaincopy
#第一次需重建索引:
[root@localhostbin]#./indexer-c/usr/local/sphinx/etc/sphinx.conf--all
Sphinx2.0.7-id64-release(r3759)
Copyright(c)2001-2012,AndrewAksyonoff
Copyright(c)2008-2012,SphinxTechnologiesInc(https://sphinxsearch.com)
usingconfigfile'/usr/local/sphinx/etc/sphinx.conf'...
indexingindex'items'...
collected8docs,0.0MB
sorted0.0Mhits,100.0%done
total8docs,1121bytes
total1.017sec,1101bytes/sec,7.86docs/sec
indexingindex'items_delta'...
collected0docs,0.0MB
total0docs,0bytes
total1.007sec,0bytes/sec,0.00docs/sec
skippingnon-plainindex'master'...
total4reads,0.000sec,0.7kb/callavg,0.0msec/callavg
total14writes,0.001sec,0.5kb/callavg,0.1msec/callavg
#启动sphinx
[root@localhostbin]#./searchd-c/usr/local/sphinx/etc/sphinx.conf
Sphinx2.0.7-id64-release(r3759)
Copyright(c)2001-2012,AndrewAksyonoff
Copyright(c)2008-2012,SphinxTechnologiesInc(https://sphinxsearch.com)
usingconfigfile'/usr/local/sphinx/etc/sphinx.conf'...
listeningonallinterfaces,port=9312
listeningonallinterfaces,port=9306
precachingindex'items'
precachingindex'items_delta'
rotatingindex'items_delta':success
precached2indexesin0.012sec
#查看进程
[root@localhostbin]#ps-ef|grepsearchd
root304311023:5900:00:00./searchd-c/usr/local/sphinx/etc/sphinx.conf
root3043230431023:5900:00:00./searchd-c/usr/local/sphinx/etc/sphinx.conf
root304371490023:59pts/000:00:00grepsearchd
#停止Searchd:
./searchd-c/usr/local/sphinx/etc/sphinx.conf--stop
#查看Searchd状态:
./searchd-c/usr/local/sphinx/etc/sphinx.conf--status
索引更新及使用说明
"增量索引"每N分钟更新一次.通常在每天晚上低负载的时进行一次索引合并,同时重新建立"增量索引"。当然"主索引"数据不多的话,也可以直接重新建立"主索引"。
API搜索的时,同时使用"主索引"和"增量索引",这样可以获得准实时的搜索数据.本文的Sphinx配置将"主索引"和"增量索引"放到分布式索引master中,因此只需查询分布式索引"master"即可获得全部匹配数据(包括最新数据)。
索引的更新与合并的操作可以放到cron job完成:
[php]view plaincopy
crontab-e
*/1****/usr/local/sphinx/shell/delta_index_update.sh
03***/usr/local/sphinx/shell/merge_daily_index.sh
crontab-l
cron job所用的shell脚本例子:
delta_index_update.sh:
[php]view plaincopy
#!/bin/bash
/usr/local/sphinx/bin/indexer-c/usr/local/sphinx/etc/sphinx.conf--rotateitems_delta>/dev/null2>&1merge_daily_index.sh:
[php]view plaincopy
#!/bin/bash
indexer=`whichindexer`
mysql=`whichmysql`
QUERY="usesphinx_items;selectmax_doc_idfromsph_counterwherecounter_id=2limit1;"
index_counter=$($mysql-h192.168.1.198-uroot-p123456-sN-e"$QUERY")
#merge"main+delta"indexes
$indexer-c/usr/local/sphinx/etc/sphinx.conf--rotate--mergeitemsitems_delta--merge-dst-rangedeleted00>>/usr/local/sphinx/var/index_merge.log2>&1
if["$"-eq0];then
##updatesphinxcounter
if[!-z$index_counter];then
$mysql-h192.168.1.198-uroot-p123456-Dsphinx_items-e"REPLACEINTOsph_counterVALUES(1,'$index_counter')"
fi
##rebuilddeltaindextoavoidconfusionwithmainindex
$indexer-c/usr/local/sphinx/etc/sphinx.conf--rotateitems_delta>>/usr/local/sphinx/var/rebuild_deltaindex.log2>&1
fi
8、php中文分词scws安装:注意扩展的版本和php的版本
[php]view plaincopy
wget-chttps://www.xunsearch.com/scws/down/scws-1.2.3.tar.bz2
tarjxvfscws-1.2.3.tar.bz2
cdscws-1.2.3
./configure--prefix=/usr/local/scws
make&&makeinstall
9、scws的PHP扩展安装:
[php]view plaincopy
cd./phpext
phpize
./configure
make&&makeinstall
echo"[scws]">>/etc/php.ini
echo"extension=scws.so">>/etc/php.ini
echo"scws.default.charset=utf-8">>/etc/php.ini
echo"scws.default.fpath=/usr/local/scws/etc/">>/etc/php.ini
10、词库安装:[php]view plaincopy
wgethttps://www.xunsearch.com/scws/down/scws-dict-chs-utf8.tar.bz2
tarjxvfscws-dict-chs-utf8.tar.bz2-C/usr/local/scws/etc/
chownwww:www/usr/local/scws/etc/dict.utf8.xdb
11、php使用Sphinx+scws测试例子
在Sphinx源码API中,有好几种语言的API调用.其中有一个是sphinxapi.php。
不过以下的测试使用的是Sphinx的PHP扩展.具体安装见本文开头的Sphinx安装部分。
测试用的搜索类Search.php:注意修改getDBConnection()信息为自己的
[php]view plaincopy
classSearch{
/**
*@varSphinxClient
**/
protected$client;
/**
*@varstring
**/
protected$keywords;
/**
*@varresource
**/
privatestatic$dbconnection=null;
/**
*Constructor
**/
publicfunction__construct($options=array()){
$defaults=array(
'query_mode'=>SPH_MATCH_EXTENDED2,
'sort_mode'=>SPH_SORT_EXTENDED,
'ranking_mode'=>SPH_RANK_PROXIMITY_BM25,
'field_weights'=>array(),
'max_matches'=>1000,
'snippet_enabled'=>true,
'snippet_index'=>'items',
'snippet_fields'=>array(),
);
$this->options=array_merge($defaults,$options);
$this->client=newSphinxClient();
//$this->client->setServer("192.168.1.198",9312);
$this->client->setMatchMode($this->options['query_mode']);
if($this->options['field_weights']!==array()){
$this->client->setFieldWeights($this->options['field_weights']);
}
/*
if(in_array($this->options['query_mode'],[SPH_MATCH_EXTENDED2,SPH_MATCH_EXTENDED])){
$this->client->setRankingMode($this->options['ranking_mode']);
}
*/
}
/**
*Query
*
*@paramstring$keywords
*@paraminteger$offset
*@paraminteger$limit
*@paramstring$index
*@returnarray
**/
publicfunctionquery($keywords,$offset=0,$limit=10,$index='*'){
$this->keywords=$keywords;
$max_matches=$limit>$this->options['max_matches']$limit:$this->options['max_matches'];
$this->client->setLimits($offset,$limit,$max_matches);
$query_results=$this->client->query($keywords,$index);
if($query_results===false){
$this->log('error:'.$this->client->getLastError());
}
$res=[];
if(empty($query_results['matches'])){
return$res;
}
$res['total']=$query_results['total'];
$res['total_found']=$query_results['total_found'];
$res['time']=$query_results['time'];
$doc_ids=array_keys($query_results['matches']);
unset($query_results);
$res['data']=$this->fetch_data($doc_ids);
if($this->options['snippet_enabled']){
$this->buildExcerptRows($res['data']);
}
return$res;
}
/**
*customsorting
*
*@paramstring$sortBy
*@paramint$mode
*@returnbool
**/
publicfunctionsetSortBy($sortBy='',$mode=0){
if($sortBy){
$mode=$mode:$this->options['sort_mode'];
$this->client->setSortMode($mode,$sortBy);
}else{
$this->client->setSortMode(SPH_SORT_RELEVANCE);
}
}
/**
*fetchdatabasedonmatcheddoc_ids
*
*@paramarray$doc_ids
*@returnarray
**/
protectedfunctionfetch_data($doc_ids){
$ids=implode(',',$doc_ids);
$queries=self::getDBConnection()->query("SELECT*FROMitemsWHEREidin($ids)",PDO::FETCH_ASSOC);
returniterator_to_array($queries);
}
/**
*buildexcerptsfordata
*
*@paramarray$rows
*@returnarray
**/
protectedfunctionbuildExcerptRows(&$rows){
$options=array(
'before_match'=>'',
'after_match'=>'',
'chunk_separator'=>'...',
'limit'=>256,
'around'=>3,
'exact_phrase'=>false,
'single_passage'=>true,
'limit_words'=>5,
);
$scount=count($this->options['snippet_fields']);
foreach($rowsas&$row){
foreach($rowas$fk=>$item){
if(!is_string($item)||($scount&&!in_array($fk,$this->options['snippet_fields'])))continue;
$item=preg_replace('/[\r\t\n]+/','',strip_tags($item));
$res=$this->client->buildExcerpts(array($item),$this->options['snippet_index'],$this->keywords,$options);
$row[$fk]=$res===false$item:$res[0];
}
}
return$rows;
}
/**
*databaseconnection
*
*@returnresource
**/
privatestaticfunctiongetDBConnection(){
$dsn='mysql:host=192.168.1.198;dbname=sphinx_items';
$user='root';
$pass='123456';
if(!self::$dbconnection){
try{
self::$dbconnection=newPDO($dsn,$user,$pass);
}catch(PDOException$e){
die('Connectionfailed:'.$e->getMessage());
}
}
returnself::$dbconnection;
}
/**
*Chinesewordssegmentation
*
**/
publicfunctionwordSplit($keywords){
$fpath=ini_get('scws.default.fpath');
$so=scws_new();
$so->set_charset('utf-8');
$so->add_dict($fpath.'/dict.utf8.xdb');
//$so->add_dict($fpath.'/custom_dict.txt',SCWS_XDICT_TXT);
$so->set_rule($fpath.'/rules.utf8.ini');
$so->set_ignore(true);
$so->set_multi(false);
$so->set_duality(false);
$so->send_text($keywords);
$words=[];
$results=$so->get_result();
foreach($resultsas$res){
$words[]='('.$res['word'].')';
}
$words[]='('.$keywords.')';
returnjoin('|',$words);
}
/**
*getcurrentsphinxclient
*
*@returnresource
**/
publicfunctiongetClient(){
return$this->client;
}
/**
*logerror
**/
publicfunctionlog($msg){
//logerrorshere
//echo$msg;
}
/**
*magicmethods
**/
publicfunction__call($method,$args){
$rc=newReflectionClass('SphinxClient');
if(!$rc->hasMethod($method)){
thrownewException('invalidmethod:'.$method);
}
returncall_user_func_array(array($this->client,$method),$args);
}
}
测试文件test.php:
[php]view plaincopy
require('Search.php');
$s=newSearch([
'snippet_fields'=>['title','content'],
'field_weights'=>['title'=>20,'content'=>10],
]);
$s->setSortMode(SPH_SORT_EXTENDED,'createddesc,@weightdesc');
//$s->setSortBy('createddesc,@weightdesc');
$words=$s->wordSplit("mysql集群");//先分词结果:(mysql)|(mysql集群)
//print_r($words);exit;
$res=$s->query($words,0,10,'master');
echo'
';print_r($res);
12、SphinxQL测试
要使用SphinxQL需要在Searchd的配置里面增加相应的监听端口(参考上文配置)。
[php]view plaincopy
[root@localhostbin]#mysql-h127.0.0.1-P9306-uroot-p
Enterpassword:
WelcometotheMySQLmonitor.Commandsendwith;or\g.
YourMySQLconnectionidis1
Serverversion:2.0.7-id64-release(r3759)
Copyright(c)2000,2016,Oracleand/oritsaffiliates.Allrightsreserved.
OracleisaregisteredtrademarkofOracleCorporationand/orits
affiliates.Othernamesmaybetrademarksoftheirrespective
owners.
Type'help;'or'\h'forhelp.Type'\c'toclearthecurrentinputstatement.
mysql>showglobalvariables;
+----------------------+---------+
|Variable_name|Value|
+----------------------+---------+
|autocommit|1|
|collation_connection|libc_ci|
|query_log_format|plain|
|log_level|info|
+----------------------+---------+
4rowsinset(0.00sec)
mysql>descitems;
+---------+-----------+
|Field|Type|
+---------+-----------+
|id|bigint|
|title|field|
|content|field|
|created|timestamp|
|deleted|bool|
+---------+-----------+
5rowsinset(0.00sec)
mysql>select*frommasterwherematch('mysql集群')limit10;
+------+---------+---------+
|id|created|deleted|
+------+---------+---------+
|1|2016|0|
|6|0|0|
+------+---------+---------+
2rowsinset(0.00sec)
mysql>showmeta;
+---------------+-------+
|Variable_name|Value|
+---------------+-------+
|total|2|
|total_found|2|
|time|0.006|
|keyword[0]|mysql|
|docs[0]|5|
|hits[0]|15|
|keyword[1]|集|
|docs[1]|3|
|hits[1]|4|
|keyword[2]|群|
|docs[2]|3|
|hits[2]|4|
+---------------+-------+
12rowsinset(0.00sec)
mysql>
上一篇: 干货:6个内容营销策略应该避免的错误
下一篇: 红苕是什么?它可以做什么美食呢?