
English words are not split in coreseek's unigram segmentation mode. Blog category: coreseek; sphinx. Tags: database, sphinx, coreseek, search, word segmentation

程序员文章站 2024-03-22 16:36:16
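        Our site search uses coreseek (sphinx) in unigram segmentation mode. Configured as the official documentation describes, however, it does not split English words or digit strings into single characters. For example, with the title 光华路SOHO (Guanghua Road SOHO), searching for any single letter of soho returns nothing, while searching for soho finds the record. When the title contains only a single letter, such as 光华路h, searching for "h" does find it. From this I inferred that English words are not indexed character by character, so I read the documentation closely:
(Documentation: http://www.coreseek.cn/products-install/ngram_len_cjk/ ; only the main parts are listed here)
# Documentation excerpt: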

     ngram_chars = U+4E00..U+9FBF, U+3400..U+4DBF, U+20000..U+2A6DF, U+F900..U+FAFF,\
U+2F800..U+2FA1F, U+2E80..U+2EFF, U+2F00..U+2FDF, U+3100..U+312F, U+31A0..U+31BF,\
U+3040..U+309F, U+30A0..U+30FF, U+31F0..U+31FF, U+AC00..U+D7AF, U+1100..U+11FF,\
U+3130..U+318F, U+A000..U+A48F, U+A490..U+A4CF
     
charset_table = U+FF10..U+FF19->0..9, 0..9, U+FF41..U+FF5A->a..z, U+FF21..U+FF3A->a..z,\
A..Z->a..z, a..z, U+0149, U+017F, U+0138, U+00DF, U+00FF, U+00C0..U+00D6->U+00E0..U+00F6,\ ...... (remainder omitted) ..


# end

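   Here: ngram_chars is the set of characters to which unigram (single-character) splitting is applied;
          charset_table is the set of characters accepted as valid, including case-folding mappings such as A..Z->a..z.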
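
To make the interaction of the two settings concrete, here is a minimal Python sketch of unigram tokenization. It is a simplified model, not coreseek's actual tokenizer: the function names are hypothetical, only two of the CJK ranges from the excerpt are included, and charset_table is reduced to digits and lowercase letters.

```python
# Simplified model of unigram (single-character) tokenization.
# Assumption: this only imitates the behavior described in the post;
# it is not coreseek's implementation.

CJK_RANGES = [(0x4E00, 0x9FBF), (0x3400, 0x4DBF)]  # subset of the ngram_chars ranges
WORD_RANGES = [(0x30, 0x39), (0x61, 0x7A)]         # 0..9 and a..z after case folding


def in_ranges(ch, ranges):
    """Return True if the code point of ch falls in any (low, high) range."""
    return any(low <= ord(ch) <= high for low, high in ranges)


def tokenize(text, ngram_ranges, word_ranges=WORD_RANGES):
    """Emit n-gram characters one by one; group other valid characters into words."""
    tokens, word = [], []
    for ch in text.lower():                # stands in for the A..Z->a..z mapping
        if in_ranges(ch, ngram_ranges):
            if word:                       # an n-gram character ends the current word
                tokens.append("".join(word))
                word = []
            tokens.append(ch)
        elif in_ranges(ch, word_ranges):
            word.append(ch)
        elif word:                         # any other character acts as a separator
            tokens.append("".join(word))
            word = []
    if word:
        tokens.append("".join(word))
    return tokens


soho_tokens = tokenize("光华路SOHO", CJK_RANGES)
single_letter_tokens = tokenize("光华路h", CJK_RANGES)
print(soho_tokens)
print(single_letter_tokens)
```

In this model a run of letters that is not listed in the n-gram ranges stays together as one token, so a single letter is searchable only when it forms a whole word by itself, which matches the behavior observed with 光华路SOHO and 光华路h.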

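    Comparing the beginnings of the two character sets carefully, I found that ngram_chars contains no digits or English letters. That was the cause. I copied the beginning of charset_table, "U+FF10..U+FF19->0..9,0..9,U+FF41..U+FF5A->a..z,U+FF21..U+FF3A->a..z,A..Z->a..z, a..z,", to the front of ngram_chars, as follows: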
    ngram_chars =U+FF10..U+FF19->0..9, 0..9, U+FF41..U+FF5A->a..z, U+FF21..U+FF3A->a..z,\
A..Z->a..z, a..z, U+4E00..U+9FBF, U+3400..U+4DBF, U+20000..U+2A6DF, U+F900..U+FAFF,\
U+2F800..U+2FA1F, U+2E80..U+2EFF, U+2F00..U+2FDF, U+3100..U+312F, U+31A0..U+31BF,\
U+3040..U+309F, U+30A0..U+30FF, U+31F0..U+31FF, U+AC00..U+D7AF, U+1100..U+11FF,\
U+3130..U+318F, U+A000..U+A48F, U+A490..U+A4CF
     
charset_table = U+FF10..U+FF19->0..9, 0..9, U+FF41..U+FF5A->a..z, U+FF21..U+FF3A->a..z,\
A..Z->a..z, a..z, U+0149, U+017F, U+0138, U+00DF, U+00FF, U+00C0..U+00D6->U+00E0..U+00F6,\ ...... (remainder omitted) ..
After rebuilding the index, the problem was solved.
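
Before rebuilding a large index, it can help to check that the edited ngram_chars value really covers the characters you care about. The following Python sketch parses a value in the list syntax shown above and tests individual characters. The helper names are hypothetical, the parsing is deliberately simplified (it looks only at the left-hand side of each entry), and the value is shortened to a few ranges for brevity.

```python
import re

# Shortened copy of the modified ngram_chars value (assumption: only a few
# of the CJK ranges are kept, and line continuations are removed).
NGRAM_CHARS = (
    "U+FF10..U+FF19->0..9, 0..9, U+FF41..U+FF5A->a..z, U+FF21..U+FF3A->a..z, "
    "A..Z->a..z, a..z, U+4E00..U+9FBF, U+3400..U+4DBF"
)
ORIGINAL_NGRAM_CHARS = "U+4E00..U+9FBF, U+3400..U+4DBF"


def code_point(token):
    """Convert 'U+4E00' or a literal character such as 'a' to an integer code point."""
    token = token.strip()
    if re.fullmatch(r"U\+[0-9A-Fa-f]+", token):
        return int(token[2:], 16)
    if len(token) == 1:
        return ord(token)
    raise ValueError(f"unrecognized character spec: {token!r}")


def source_ranges(value):
    """Return (low, high) pairs for the left-hand side of every entry in the list."""
    ranges = []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        source = entry.split("->")[0]          # ignore the mapping target
        low, _, high = source.partition("..")
        ranges.append((code_point(low), code_point(high or low)))
    return ranges


def covers(value, ch):
    """Return True if ch falls inside any source range of the given list."""
    return any(low <= ord(ch) <= high for low, high in source_ranges(value))


for ch in "s5光":
    print(ch, covers(ORIGINAL_NGRAM_CHARS, ch), covers(NGRAM_CHARS, ch))
```

This only confirms that the configuration text lists the expected ranges. Whether single letters are actually searchable still has to be verified against the rebuilt index.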