欢迎您访问程序员文章站本站旨在为大家提供分享程序员计算机编程知识!
您现在的位置是: 首页  >  IT编程

php下利用scws+sphinx+mysql打造千万级数据全文搜索

程序员文章站 2023-09-29 15:45:03
Sphinx是由俄罗斯人Andrew Aksyonoff开发的一个全文检索引擎。意图为其他应用提供高速、低空间占用、高结果 相关度的全文搜索功能。Sphinx可以非常容易的与SQL...

Sphinx是由俄罗斯人Andrew Aksyonoff开发的一个全文检索引擎。意图为其他应用提供高速、低空间占用、高结果 相关度的全文搜索功能。Sphinx可以非常容易的与SQL数据库和脚本语言集成。当前系统内置MySQL和PostgreSQL 数据库数据源的支持,也支持从标准输入读取特定格式 的XML数据。

Sphinx创建索引的速度为:创建100万条记录的索引只需3~4分钟,创建1000万条记录的索引可以在50分钟内完成,而只包含最新10万条记录的增量索引,重建一次只需几十秒。

Sphinx的特性如下:

a) 高速的建立索引(在当代CPU上,峰值性能可达到10 MB/秒);

b) 高性能的搜索(在2 – 4GB 的文本数据上,平均每次检索响应时间小于0.1秒);

c) 可处理海量数据(目前已知可以处理超过100 GB的文本数据, 在单一CPU的系统上可处理100 M 文档);

d) 提供了优秀的相关度算法,基于短语相似度和统计(BM25)的复合Ranking方法;

e) 支持分布式搜索;

f) 支持短语搜索

g) 提供文档摘要生成

h) 可作为MySQL的存储引擎提供搜索服务;

i) 支持布尔、短语、词语相似度等多种检索模式;

j) 文档支持多个全文检索字段(最大不超过32个);

k) 文档支持多个额外的属性信息(例如:分组信息,时间戳等);

l) 支持断词;

虽然mysql的MYISAM提供全文索引,但是性能却不敢让人恭维

开始搭建

系统环境:centos6.5+php5.6+apache+mysql

1、安装依赖包

[php]view plaincopy

yum-yinstallmakegccg++gcc-c++libtoolautoconfautomakeimakephp-develmysql-devellibxml2-develexpat-devel

2、安装Sphinx

[php]view plaincopy

yuminstallexpatexpat-devel

wget-chttps://sphinxsearch.com/files/sphinx-2.0.7-release.tar.gz

tarzxvfsphinx-2.0.7-release.tar.gz

cdsphinx-2.0.7-release

./configure--prefix=/usr/local/sphinx--with-mysql--with-libexpat--enable-id64

make&&makeinstall

3、安装libsphinxclient,php扩展用到

[php]view plaincopy

cdapi/libsphinxclient

./configure--prefix=/usr/local/sphinx/libsphinxclient

make&&makeinstall

4、安装Sphinx的PHP扩展:我的是5.6需装sphinx-1.3.3.tgz,如果是php5.4以下可sphinx-1.3.0.tgz

[php]view plaincopy

wget-chttps://pecl.php.net/get/sphinx-1.3.3.tgz

tarzxvfsphinx-1.3.3.tgz

cdsphinx-1.3.3

phpize

./configure--with-sphinx=/usr/local/sphinx/libsphinxclient/--with-php-config=/usr/bin/php-config

make&&makeinstall

成功后会提示:

Installingsharedextensions:/usr/lib64/php/modules/

echo"[Sphinx]">>/etc/php.ini

echo"extension=sphinx.so">>/etc/php.ini

#重启apache

servicehttpdrestart

5、创建测试数据

[php]view plaincopy

CREATETABLEIFNOTEXISTS`items`(

`id`int(11)NOTNULLAUTO_INCREMENT,

`title`varchar(255)NOTNULL,

`content`textNOTNULL,

`created`datetimeNOTNULL,

PRIMARYKEY(`id`)

)ENGINE=InnoDBDEFAULTCHARSET=utf8COMMENT='全文检索测试的数据表'AUTO_INCREMENT=11;

INSERTINTO`items`(`id`,`title`,`content`,`created`)VALUES

(1,'linuxmysql集群安装','MySQLCluster是MySQL适合于分布式计算环境的高实用、可拓展、高性能、高冗余版本','2016-09-0700:00:00'),

(2,'mysql主从复制','mysql主从备份(复制)的基本原理mysql支持单向、异步复制,复制过程中一个服务器充当主服务器,而一个或多个其它服务器充当从服务器','2016-09-0600:00:00'),

(3,'hello','canyousearchme','2016-09-0500:00:00'),

(4,'mysql','mysqlisthebestdatabase','2016-09-0300:00:00'),

(5,'mysql索引','关于MySQL索引的好处,如果正确合理设计并且使用索引的MySQL是一辆兰博基尼的话,那么没有设计和使用索引的MySQL就是一个人力三轮车','2016-09-0100:00:00'),

(6,'集群','关于MySQL索引的好处,如果正确合理设计并且使用索引的MySQL是一辆兰博基尼的话,那么没有设计和使用索引的MySQL就是一个人力三轮车','0000-00-0000:00:00'),

(9,'复制原理','redis也有复制','0000-00-0000:00:00'),

(10,'redis集群','集群技术是构建高性能网站架构的重要手段,试想在网站承受高并发访问压力的同时,还需要从海量数据中查询出满足条件的数据,并快速响应,我们必然想到的是将数据进行切片,把数据根据某种规则放入多个不同的服务器节点,来降低单节点服务器的压力','0000-00-0000:00:00');

CREATETABLEIFNOTEXISTS`sph_counter`(

`counter_id`int(11)NOTNULL,

`max_doc_id`int(11)NOTNULL,

PRIMARYKEY(`counter_id`)

)ENGINE=MyISAMDEFAULTCHARSET=utf8COMMENT='增量索引标示的计数表';

以下采用"Main + Delta" ("主索引"+"增量索引")的索引策略,使用Sphinx自带的一元分词。

6、Sphinx配置:注意修改数据源配置信息

[php]view plaincopy

vi/usr/local/sphinx/etc/sphinx.conf

sourceitems{

type=mysql

sql_host=localhost

sql_user=root

sql_pass=123456

sql_db=sphinx_items

sql_query_pre=SETNAMESutf8

sql_query_pre=SETSESSIONquery_cache_type=OFF

sql_query_pre=REPLACEINTOsph_counterSELECT1,MAX(id)FROMitems

sql_query_range=SELECTMIN(id),MAX(id)FROMitems\

WHEREid<=(SELECTmax_doc_idFROMsph_counterWHEREcounter_id=1)

sql_range_step=1000

sql_ranged_throttle=1000

sql_query=SELECTid,title,content,created,0asdeletedFROMitems\

WHEREid<=(SELECTmax_doc_idFROMsph_counterWHEREcounter_id=1)\

ANDid>=$startANDid<=$end

sql_attr_timestamp=created

sql_attr_bool=deleted

}

sourceitems_delta:items{

sql_query_pre=SETNAMESutf8

sql_query_range=SELECTMIN(id),MAX(id)FROMitems\

WHEREid>(SELECTmax_doc_idFROMsph_counterWHEREcounter_id=1)

sql_query=SELECTid,title,content,created,0asdeletedFROMitems\

WHEREid>(SELECTmax_doc_idFROMsph_counterWHEREcounter_id=1)\

ANDid>=$startANDid<=$end

sql_query_post_index=set@max_doc_id:=(SELECTmax_doc_idFROMsph_counterWHEREcounter_id=1)

sql_query_post_index=REPLACEINTOsph_counterSELECT2,IF($maxid,$maxid,@max_doc_id)

}

#主索引

indexitems{

source=items

path=/usr/local/sphinx/var/data/items

docinfo=extern

morphology=none

min_word_len=1

min_prefix_len=0

html_strip=1

html_remove_elements=style,script

ngram_len=1

ngram_chars=U+3000..U+2FA1F

charset_type=utf-8

charset_table=0..9,A..Z->a..z,_,a..z,U+410..U+42F->U+430..U+44F,U+430..U+44F

preopen=1

min_infix_len=1

}

#增量索引

indexitems_delta:items{

source=items_delta

path=/usr/local/sphinx/var/data/items-delta

}

#分布式索引

indexmaster{

type=distributed

local=items

local=items_delta

}

indexer{

mem_limit=256M

}

searchd{

listen=9312

listen=9306:mysql41#UsedforSphinxQL

log=/usr/local/sphinx/var/log/searchd.log

query_log=/usr/local/sphinx/var/log/query.log

compat_sphinxql_magics=0

attr_flush_period=600

mva_updates_pool=16M

read_timeout=5

max_children=0

dist_threads=2

pid_file=/usr/local/sphinx/var/log/searchd.pid

max_matches=1000

seamless_rotate=1

preopen_indexes=1

unlink_old=1

workers=threads#forRTtowork

binlog_path=/usr/local/sphinx/var/data

}保存退出

7、Sphinx创建索引

[php]view plaincopy

#第一次需重建索引:

[root@localhostbin]#./indexer-c/usr/local/sphinx/etc/sphinx.conf--all

Sphinx2.0.7-id64-release(r3759)

Copyright(c)2001-2012,AndrewAksyonoff

Copyright(c)2008-2012,SphinxTechnologiesInc(https://sphinxsearch.com)

usingconfigfile'/usr/local/sphinx/etc/sphinx.conf'...

indexingindex'items'...

collected8docs,0.0MB

sorted0.0Mhits,100.0%done

total8docs,1121bytes

total1.017sec,1101bytes/sec,7.86docs/sec

indexingindex'items_delta'...

collected0docs,0.0MB

total0docs,0bytes

total1.007sec,0bytes/sec,0.00docs/sec

skippingnon-plainindex'master'...

total4reads,0.000sec,0.7kb/callavg,0.0msec/callavg

total14writes,0.001sec,0.5kb/callavg,0.1msec/callavg

#启动sphinx

[root@localhostbin]#./searchd-c/usr/local/sphinx/etc/sphinx.conf

Sphinx2.0.7-id64-release(r3759)

Copyright(c)2001-2012,AndrewAksyonoff

Copyright(c)2008-2012,SphinxTechnologiesInc(https://sphinxsearch.com)

usingconfigfile'/usr/local/sphinx/etc/sphinx.conf'...

listeningonallinterfaces,port=9312

listeningonallinterfaces,port=9306

precachingindex'items'

precachingindex'items_delta'

rotatingindex'items_delta':success

precached2indexesin0.012sec

#查看进程

[root@localhostbin]#ps-ef|grepsearchd

root304311023:5900:00:00./searchd-c/usr/local/sphinx/etc/sphinx.conf

root3043230431023:5900:00:00./searchd-c/usr/local/sphinx/etc/sphinx.conf

root304371490023:59pts/000:00:00grepsearchd

#停止Searchd:

./searchd-c/usr/local/sphinx/etc/sphinx.conf--stop

#查看Searchd状态:

./searchd-c/usr/local/sphinx/etc/sphinx.conf--status

索引更新及使用说明

"增量索引"每N分钟更新一次.通常在每天晚上低负载的时进行一次索引合并,同时重新建立"增量索引"。当然"主索引"数据不多的话,也可以直接重新建立"主索引"。

API搜索的时,同时使用"主索引"和"增量索引",这样可以获得准实时的搜索数据.本文的Sphinx配置将"主索引"和"增量索引"放到分布式索引master中,因此只需查询分布式索引"master"即可获得全部匹配数据(包括最新数据)。

索引的更新与合并的操作可以放到cron job完成:

[php]view plaincopy

crontab-e

*/1****/usr/local/sphinx/shell/delta_index_update.sh

03***/usr/local/sphinx/shell/merge_daily_index.sh

crontab-l

cron job所用的shell脚本例子:

delta_index_update.sh:

[php]view plaincopy

#!/bin/bash

/usr/local/sphinx/bin/indexer-c/usr/local/sphinx/etc/sphinx.conf--rotateitems_delta>/dev/null2>&1merge_daily_index.sh:

[php]view plaincopy

#!/bin/bash

indexer=`whichindexer`

mysql=`whichmysql`

QUERY="usesphinx_items;selectmax_doc_idfromsph_counterwherecounter_id=2limit1;"

index_counter=$($mysql-h192.168.1.198-uroot-p123456-sN-e"$QUERY")

#merge"main+delta"indexes

$indexer-c/usr/local/sphinx/etc/sphinx.conf--rotate--mergeitemsitems_delta--merge-dst-rangedeleted00>>/usr/local/sphinx/var/index_merge.log2>&1

if["$"-eq0];then

##updatesphinxcounter

if[!-z$index_counter];then

$mysql-h192.168.1.198-uroot-p123456-Dsphinx_items-e"REPLACEINTOsph_counterVALUES(1,'$index_counter')"

fi

##rebuilddeltaindextoavoidconfusionwithmainindex

$indexer-c/usr/local/sphinx/etc/sphinx.conf--rotateitems_delta>>/usr/local/sphinx/var/rebuild_deltaindex.log2>&1

fi

8、php中文分词scws安装:注意扩展的版本和php的版本

[php]view plaincopy

wget-chttps://www.xunsearch.com/scws/down/scws-1.2.3.tar.bz2

tarjxvfscws-1.2.3.tar.bz2

cdscws-1.2.3

./configure--prefix=/usr/local/scws

make&&makeinstall

9、scws的PHP扩展安装:

[php]view plaincopy

cd./phpext

phpize

./configure

make&&makeinstall

echo"[scws]">>/etc/php.ini

echo"extension=scws.so">>/etc/php.ini

echo"scws.default.charset=utf-8">>/etc/php.ini

echo"scws.default.fpath=/usr/local/scws/etc/">>/etc/php.ini

10、词库安装:[php]view plaincopy

wgethttps://www.xunsearch.com/scws/down/scws-dict-chs-utf8.tar.bz2

tarjxvfscws-dict-chs-utf8.tar.bz2-C/usr/local/scws/etc/

chownwww:www/usr/local/scws/etc/dict.utf8.xdb

11、php使用Sphinx+scws测试例子

在Sphinx源码API中,有好几种语言的API调用.其中有一个是sphinxapi.php。

不过以下的测试使用的是Sphinx的PHP扩展.具体安装见本文开头的Sphinx安装部分。

测试用的搜索类Search.php:注意修改getDBConnection()信息为自己的

[php]view plaincopy

classSearch{

/**

*@varSphinxClient

**/

protected$client;

/**

*@varstring

**/

protected$keywords;

/**

*@varresource

**/

privatestatic$dbconnection=null;

/**

*Constructor

**/

publicfunction__construct($options=array()){

$defaults=array(

'query_mode'=>SPH_MATCH_EXTENDED2,

'sort_mode'=>SPH_SORT_EXTENDED,

'ranking_mode'=>SPH_RANK_PROXIMITY_BM25,

'field_weights'=>array(),

'max_matches'=>1000,

'snippet_enabled'=>true,

'snippet_index'=>'items',

'snippet_fields'=>array(),

);

$this->options=array_merge($defaults,$options);

$this->client=newSphinxClient();

//$this->client->setServer("192.168.1.198",9312);

$this->client->setMatchMode($this->options['query_mode']);

if($this->options['field_weights']!==array()){

$this->client->setFieldWeights($this->options['field_weights']);

}

/*

if(in_array($this->options['query_mode'],[SPH_MATCH_EXTENDED2,SPH_MATCH_EXTENDED])){

$this->client->setRankingMode($this->options['ranking_mode']);

}

*/

}

/**

*Query

*

*@paramstring$keywords

*@paraminteger$offset

*@paraminteger$limit

*@paramstring$index

*@returnarray

**/

publicfunctionquery($keywords,$offset=0,$limit=10,$index='*'){

$this->keywords=$keywords;

$max_matches=$limit>$this->options['max_matches']$limit:$this->options['max_matches'];

$this->client->setLimits($offset,$limit,$max_matches);

$query_results=$this->client->query($keywords,$index);

if($query_results===false){

$this->log('error:'.$this->client->getLastError());

}

$res=[];

if(empty($query_results['matches'])){

return$res;

}

$res['total']=$query_results['total'];

$res['total_found']=$query_results['total_found'];

$res['time']=$query_results['time'];

$doc_ids=array_keys($query_results['matches']);

unset($query_results);

$res['data']=$this->fetch_data($doc_ids);

if($this->options['snippet_enabled']){

$this->buildExcerptRows($res['data']);

}

return$res;

}

/**

*customsorting

*

*@paramstring$sortBy

*@paramint$mode

*@returnbool

**/

publicfunctionsetSortBy($sortBy='',$mode=0){

if($sortBy){

$mode=$mode:$this->options['sort_mode'];

$this->client->setSortMode($mode,$sortBy);

}else{

$this->client->setSortMode(SPH_SORT_RELEVANCE);

}

}

/**

*fetchdatabasedonmatcheddoc_ids

*

*@paramarray$doc_ids

*@returnarray

**/

protectedfunctionfetch_data($doc_ids){

$ids=implode(',',$doc_ids);

$queries=self::getDBConnection()->query("SELECT*FROMitemsWHEREidin($ids)",PDO::FETCH_ASSOC);

returniterator_to_array($queries);

}

/**

*buildexcerptsfordata

*

*@paramarray$rows

*@returnarray

**/

protectedfunctionbuildExcerptRows(&$rows){

$options=array(

'before_match'=>'',

'after_match'=>'',

'chunk_separator'=>'...',

'limit'=>256,

'around'=>3,

'exact_phrase'=>false,

'single_passage'=>true,

'limit_words'=>5,

);

$scount=count($this->options['snippet_fields']);

foreach($rowsas&$row){

foreach($rowas$fk=>$item){

if(!is_string($item)||($scount&&!in_array($fk,$this->options['snippet_fields'])))continue;

$item=preg_replace('/[\r\t\n]+/','',strip_tags($item));

$res=$this->client->buildExcerpts(array($item),$this->options['snippet_index'],$this->keywords,$options);

$row[$fk]=$res===false$item:$res[0];

}

}

return$rows;

}

/**

*databaseconnection

*

*@returnresource

**/

privatestaticfunctiongetDBConnection(){

$dsn='mysql:host=192.168.1.198;dbname=sphinx_items';

$user='root';

$pass='123456';

if(!self::$dbconnection){

try{

self::$dbconnection=newPDO($dsn,$user,$pass);

}catch(PDOException$e){

die('Connectionfailed:'.$e->getMessage());

}

}

returnself::$dbconnection;

}

/**

*Chinesewordssegmentation

*

**/

publicfunctionwordSplit($keywords){

$fpath=ini_get('scws.default.fpath');

$so=scws_new();

$so->set_charset('utf-8');

$so->add_dict($fpath.'/dict.utf8.xdb');

//$so->add_dict($fpath.'/custom_dict.txt',SCWS_XDICT_TXT);

$so->set_rule($fpath.'/rules.utf8.ini');

$so->set_ignore(true);

$so->set_multi(false);

$so->set_duality(false);

$so->send_text($keywords);

$words=[];

$results=$so->get_result();

foreach($resultsas$res){

$words[]='('.$res['word'].')';

}

$words[]='('.$keywords.')';

returnjoin('|',$words);

}

/**

*getcurrentsphinxclient

*

*@returnresource

**/

publicfunctiongetClient(){

return$this->client;

}

/**

*logerror

**/

publicfunctionlog($msg){

//logerrorshere

//echo$msg;

}

/**

*magicmethods

**/

publicfunction__call($method,$args){

$rc=newReflectionClass('SphinxClient');

if(!$rc->hasMethod($method)){

thrownewException('invalidmethod:'.$method);

}

returncall_user_func_array(array($this->client,$method),$args);

}

}

测试文件test.php:

[php]view plaincopy

require('Search.php');

$s=newSearch([

'snippet_fields'=>['title','content'],

'field_weights'=>['title'=>20,'content'=>10],

]);

$s->setSortMode(SPH_SORT_EXTENDED,'createddesc,@weightdesc');

//$s->setSortBy('createddesc,@weightdesc');

$words=$s->wordSplit("mysql集群");//先分词结果:(mysql)|(mysql集群)

//print_r($words);exit;

$res=$s->query($words,0,10,'master');

echo'

';print_r($res);

12、SphinxQL测试

要使用SphinxQL需要在Searchd的配置里面增加相应的监听端口(参考上文配置)。

[php]view plaincopy

[root@localhostbin]#mysql-h127.0.0.1-P9306-uroot-p

Enterpassword:

WelcometotheMySQLmonitor.Commandsendwith;or\g.

YourMySQLconnectionidis1

Serverversion:2.0.7-id64-release(r3759)

Copyright(c)2000,2016,Oracleand/oritsaffiliates.Allrightsreserved.

OracleisaregisteredtrademarkofOracleCorporationand/orits

affiliates.Othernamesmaybetrademarksoftheirrespective

owners.

Type'help;'or'\h'forhelp.Type'\c'toclearthecurrentinputstatement.

mysql>showglobalvariables;

+----------------------+---------+

|Variable_name|Value|

+----------------------+---------+

|autocommit|1|

|collation_connection|libc_ci|

|query_log_format|plain|

|log_level|info|

+----------------------+---------+

4rowsinset(0.00sec)

mysql>descitems;

+---------+-----------+

|Field|Type|

+---------+-----------+

|id|bigint|

|title|field|

|content|field|

|created|timestamp|

|deleted|bool|

+---------+-----------+

5rowsinset(0.00sec)

mysql>select*frommasterwherematch('mysql集群')limit10;

+------+---------+---------+

|id|created|deleted|

+------+---------+---------+

|1|2016|0|

|6|0|0|

+------+---------+---------+

2rowsinset(0.00sec)

mysql>showmeta;

+---------------+-------+

|Variable_name|Value|

+---------------+-------+

|total|2|

|total_found|2|

|time|0.006|

|keyword[0]|mysql|

|docs[0]|5|

|hits[0]|15|

|keyword[1]|集|

|docs[1]|3|

|hits[1]|4|

|keyword[2]|群|

|docs[2]|3|

|hits[2]|4|

+---------------+-------+

12rowsinset(0.00sec)

mysql>