
A Simple Chinese Word Segmentation System in PHP (1/2)

程序员文章站 2022-04-08 21:54:50

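Structure: a first-character hash table plus trie index tree nodes.
Advantage: during segmentation the length of the word being looked up does not need to be known in advance; matching proceeds character by character along the tree.
Disadvantage: construction and maintenance are relatively complex, and the tree has many branches that hold only a single word, which wastes some space.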
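
To make the structure concrete, here is a minimal sketch of how a few words might be stored as nested arrays, with the root's `children` array playing the role of the first-character hash table. The helpers `utf8_chars`, `trie_insert` and `trie_contains` are hypothetical names and are not part of the original class. Splitting on whole UTF-8 characters with `preg_split('//u', ...)` is also an assumption made for readability; the class itself indexes strings byte by byte.

```php
<?php
// Hypothetical helper: split a UTF-8 string into an array of characters.
function utf8_chars(string $s): array
{
    $chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
    return $chars === false ? array() : $chars; // invalid UTF-8 yields no characters
}

// Insert a word; every node is array('children' => array(...), 'isword' => bool).
function trie_insert(array &$root, string $word): void
{
    $chars = utf8_chars($word);
    if (count($chars) === 0) {
        return; // never mark the root itself as a word
    }
    $node = &$root;
    foreach ($chars as $ch) {
        if (!isset($node['children'][$ch])) {
            $node['children'][$ch] = array('children' => array(), 'isword' => false);
        }
        $node = &$node['children'][$ch]; // descend one level
    }
    $node['isword'] = true; // mark the end of the word, keeping any children
}

// Walk the tree one character at a time; the word length is never needed up front.
function trie_contains(array $root, string $word): bool
{
    $node = $root;
    foreach (utf8_chars($word) as $ch) {
        if (!isset($node['children'][$ch])) {
            return false;
        }
        $node = $node['children'][$ch];
    }
    return $node['isword'];
}

$root = array('children' => array(), 'isword' => false);
foreach (array('中国', '中国人', '中文') as $w) {
    trie_insert($root, $w);
}
var_dump(trie_contains($root, '中国')); // a dictionary word
var_dump(trie_contains($root, '中'));   // only a prefix, not a word
```

All three words share the single first-level entry for `中`, so the lookup cost depends on the length of the word rather than on the size of the dictionary.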
<?php
/**
 * trie dictionary tree
 *
 * @version 0.1
 * @todo Build a general-purpose dictionary algorithm and write a simple word segmenter on top of it
 * @author shjuto@gmail.com
 */

class trie
{
    private $trie;

    function __construct()
    {
        // Root node: its 'children' array acts as the first-character hash table.
        $this->trie = array('children' => array(), 'isword' => false);
    }

    /**
     * Add a word to the dictionary.
     *
     * @param string $word
     */
    function setword($word = '')
    {
        $trienode = &$this->trie;
        $length = strlen($word);
        for ($i = 0; $i < $length; $i++)
        {
            $character = $word[$i];
            if (!isset($trienode['children'][$character]))
            {
                $trienode['children'][$character] = array('children' => array(), 'isword' => false);
            }
            if ($i == $length - 1)
            {
                // Mark the end of a word without discarding existing child nodes.
                $trienode['children'][$character]['isword'] = true;
            }
            $trienode = &$trienode['children'][$character];
        }
    }

    /**
     * Check whether a word is in the dictionary.
     *
     * @param string $word
     * @return bool true/false
     */
    function isword($word)
    {
        $trienode = $this->trie;
        $length = strlen($word);
        for ($i = 0; $i < $length; $i++)
        {
            $character = $word[$i];
            if (!isset($trienode['children'][$character]))
            {
                return false;
            }
            $trienode = $trienode['children'][$character];
        }
        // The word is present only if the last node reached is marked as a word end.
        return $length > 0 && $trienode['isword'] == true;
    }


    /**
     * Find the positions at which dictionary words occur in $text.
     *
     * @param string $text
     * @return array list of array('position' => $position, 'word' => $word); position is a byte offset
     */
    function search($text = "")
    {
        $textlen = strlen($text);
        $trienode = $tree = $this->trie;
        $find = array();
        $wordrootposition = 0; // position where the current candidate word starts
        $prenode = false; // backtracking flag: with "ab" in the dictionary and the text "aab", $i has to step back once
        $word = '';
        for ($i = 0; $i < $textlen; $i++)
        {
            if (isset($trienode['children'][$text[$i]]))
            {
                $word = $word . $text[$i];
                $trienode = $trienode['children'][$text[$i]];
                if ($prenode == false)
                {
                    $wordrootposition = $i;
                }
                $prenode = true;
                if ($trienode['isword'])
                {
                    $find[] = array('position' => $wordrootposition, 'word' => $word);
                }
            }
            else
            {
                // Mismatch: restart from the root and re-examine the current byte.
                $trienode = $tree;
                $word = '';
                if ($prenode)
                {
                    $i = $i - 1;
                    $prenode = false;
                }
            }
        }
        return $find;
    }
}
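
The backtracking case mentioned in the `search` comment (dictionary "ab", text "aab") can be traced with a small standalone sketch. It is written as two free functions so that it runs on its own; `build_trie` and `find_words` are hypothetical names that are meant to mirror the byte-by-byte logic of `setword` and `search`, not to replace them.

```php
<?php
// Hypothetical helper: build a byte-indexed trie from a list of words.
function build_trie(array $words): array
{
    $root = array('children' => array(), 'isword' => false);
    foreach ($words as $word) {
        $node = &$root;
        $n = strlen($word);
        for ($i = 0; $i < $n; $i++) {
            $byte = $word[$i];
            if (!isset($node['children'][$byte])) {
                $node['children'][$byte] = array('children' => array(), 'isword' => false);
            }
            $node = &$node['children'][$byte];
        }
        if ($n > 0) {
            $node['isword'] = true;
        }
        unset($node); // break the reference before handling the next word
    }
    return $root;
}

// Scan $text once, restarting at the root on a mismatch and
// re-examining the failing byte when a partial match was in progress.
function find_words(array $root, string $text): array
{
    $found = array();
    $node = $root;
    $word = '';
    $start = 0;       // byte offset where the current candidate begins
    $inWord = false;  // true while a partial match is in progress
    for ($i = 0, $n = strlen($text); $i < $n; $i++) {
        $byte = $text[$i];
        if (isset($node['children'][$byte])) {
            if (!$inWord) {
                $start = $i;
                $inWord = true;
            }
            $word .= $byte;
            $node = $node['children'][$byte];
            if ($node['isword']) {
                $found[] = array('position' => $start, 'word' => $word);
            }
        } else {
            $node = $root;
            $word = '';
            if ($inWord) {
                $i--; // step back so this byte is retried from the root
                $inWord = false;
            }
        }
    }
    return $found;
}

print_r(find_words(build_trie(array('ab')), 'aab'));
print_r(find_words(build_trie(array('中国')), '我爱中国'));
```

In the first call the second "a" fails to extend the candidate started by the first "a", so the scan steps back once and retries that byte from the root, which lets "ab" be found at offset 1. In the second call the reported position is a byte offset: assuming UTF-8 input, each of the two preceding characters occupies three bytes. Because the scan resumes at the failing byte rather than just after the word root, some overlapping matches can be missed with this strategy.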
