PanGu Segmentation + Unigram/Bigram Segmentation with Lucene

程序员文章站 2022-06-28 10:36:56


This article draws on: https://blog.csdn.net/mss359681091/article/details/52078147

http://www.cnblogs.com/top5/archive/2011/08/18/2144030.html

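Downloads for all files used in this article, including the project:

Lucene configuration files download

Chinese segmentation configuration files download

Project download (Zip)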

1. Unigram segmentation / 2. Bigram segmentation / 3. PanGu segmentation / 4. Chinese segmentation / 5. Simple search

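Create a Windows Forms application in VS2015. Once the project is created, change its output type to "Console Application" in the project properties. The default output type also works, but a console window makes the Console.Write output of the examples easy to see.

[Screenshot: output type setting in the project properties]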

1. Unigram segmentation

You also need to add a reference to 'Lucene.Net.dll'.

// Requires: using System.IO; using Lucene.Net.Analysis; using Lucene.Net.Analysis.Standard;

/// <summary>
/// Unigram segmentation
/// </summary>
/// <param name="sender"></param>
/// <param name="e"></param>
private void button1_Click(object sender, EventArgs e)
{
    // StandardAnalyzer splits Chinese text into single characters (unigrams)
    Analyzer analyzer = new StandardAnalyzer();
    TokenStream tokenStream = analyzer.TokenStream("", new StringReader("喝奶只喝纯牛奶，这是不可能的——黑夜中的萤火虫"));
    Token token = null;
    while ((token = tokenStream.Next()) != null) // Next() returns null once no tokens remain
    {
        string word = token.TermText(); // text of the current token
        Console.Write(word + "   |  ");
    }
}

Unigram segmentation code
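To make the idea concrete without any Lucene dependency, here is a minimal sketch of what unigram segmentation does to Chinese text: every Chinese character becomes its own token. This is an illustration only, not StandardAnalyzer's implementation. The class name `UnigramSketch` and the simplification that "Chinese character" means the range U+4E00 to U+9FFF are assumptions made for this example.

```csharp
using System;
using System.Collections.Generic;

// Illustrative sketch only; this is NOT how StandardAnalyzer is implemented.
public static class UnigramSketch
{
    // Assumption: treat the CJK Unified Ideographs block as "Chinese characters".
    public static bool IsCjk(char c)
    {
        return c >= '\u4E00' && c <= '\u9FFF';
    }

    // Emit every Chinese character as its own token; skip everything else
    // (punctuation, Latin letters, whitespace).
    public static List<string> Tokenize(string text)
    {
        var tokens = new List<string>();
        foreach (char c in text)
        {
            if (IsCjk(c))
            {
                tokens.Add(c.ToString());
            }
        }
        return tokens;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(" | ", Tokenize("喝奶只喝纯牛奶")));
    }
}
```

For the input 喝奶只喝纯牛奶 the sketch produces seven single-character tokens. Unigrams are simple and never miss a match, but a query for 牛奶 has to be answered by combining the separate postings for 牛 and 奶.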

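2. Bigram segmentation

Building on the previous step, add the two .cs files from the "Analyzers" folder to the project.

[Screenshot: the two .cs files in the "Analyzers" folder]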

/// <summary>
/// Bigram segmentation
/// </summary>
/// <param name="sender"></param>
/// <param name="e"></param>
private void button2_Click(object sender, EventArgs e)
{
    // CJKAnalyzer splits Chinese text into overlapping two-character tokens (bigrams)
    Analyzer analyzer = new CJKAnalyzer();
    TokenStream tokenStream = analyzer.TokenStream("", new StringReader("喝奶只喝纯牛奶，这是不可能的——黑夜中的萤火虫"));
    Token token = null;
    while ((token = tokenStream.Next()) != null) // Next() returns null once no tokens remain
    {
        string word = token.TermText(); // text of the current token
        Console.Write(word + "   |  ");
    }
}

Bigram segmentation code
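The bigram idea can also be sketched in plain C#: within each run of consecutive Chinese characters, emit every overlapping pair. This is a simplified illustration under stated assumptions, not CJKAnalyzer's actual code (the real analyzer also handles Latin text and stop words). The class name `BigramSketch` and the U+4E00 to U+9FFF range check are assumptions made for this example.

```csharp
using System;
using System.Collections.Generic;

// Illustrative sketch only; this is NOT how CJKAnalyzer is implemented.
public static class BigramSketch
{
    // Assumption: treat the CJK Unified Ideographs block as "Chinese characters".
    private static bool IsCjk(char c)
    {
        return c >= '\u4E00' && c <= '\u9FFF';
    }

    // Split the text into runs of Chinese characters and emit bigrams per run.
    public static List<string> Tokenize(string text)
    {
        var tokens = new List<string>();
        int runStart = -1; // start index of the current run, or -1 if none
        for (int i = 0; i <= text.Length; i++)
        {
            bool inRun = i < text.Length && IsCjk(text[i]);
            if (inRun)
            {
                if (runStart < 0) runStart = i;
            }
            else if (runStart >= 0)
            {
                EmitRun(text, runStart, i - runStart, tokens);
                runStart = -1;
            }
        }
        return tokens;
    }

    // A run of length 1 yields the single character; longer runs yield
    // overlapping two-character tokens.
    private static void EmitRun(string text, int start, int length, List<string> tokens)
    {
        if (length == 1)
        {
            tokens.Add(text.Substring(start, 1));
            return;
        }
        for (int i = 0; i + 1 < length; i++)
        {
            tokens.Add(text.Substring(start + i, 2));
        }
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(" | ", Tokenize("纯牛奶，的")));
    }
}
```

In this sketch 纯牛奶 becomes 纯牛 and 牛奶, so a two-character query such as 牛奶 matches a single token directly, at the cost of a larger index than unigrams.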

3. PanGu segmentation

Next, add the two configuration files to the project.

[Screenshot: the two configuration files]

/// <summary>
/// PanGu segmentation
/// </summary>
/// <param name="sender"></param>
/// <param name="e"></param>
private void button3_Click(object sender, EventArgs e)
{
    Analyzer analyzer = new PanGuAnalyzer(); // dictionary-based PanGu segmentation
    TokenStream tokenStream = analyzer.TokenStream("", new StringReader("喝奶只喝纯牛奶，这是不可能的——黑夜中的萤火虫"));
    Token token = null;
    while ((token = tokenStream.Next()) != null) // Next() returns null once no tokens remain
    {
        string word = token.TermText(); // text of the current token
        Console.Write(word + "   |  ");
    }
}

PanGu segmentation code

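Out of the box, the PanGu dictionary does not contain all the words needed here, so the sentence is not split as intended:

[Screenshot: output before editing the dictionary]

To fix this, open the project's Dict.dct file with 'DictManage.exe', add the missing words, and save.

[Screenshot: adding words in DictManage.exe]

Output after the change:

[Screenshot: output after editing the dictionary]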
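Why adding a word to the dictionary changes the output can be shown with a much simpler technique than the one PanGu uses. The sketch below implements forward maximum matching: at each position it takes the longest dictionary word it can find and otherwise falls back to a single character. It is a stand-in for illustration, not PanGu's algorithm, and the names `DictionarySegmenterSketch`, `Segment` and `maxWordLength` are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Illustrative sketch: forward maximum matching, NOT PanGu's algorithm.
public static class DictionarySegmenterSketch
{
    public static List<string> Segment(string text, HashSet<string> dictionary, int maxWordLength)
    {
        var tokens = new List<string>();
        int pos = 0;
        while (pos < text.Length)
        {
            int longest = Math.Min(maxWordLength, text.Length - pos);
            int matched = 1; // fall back to a single character
            // Try the longest candidate first, then shorter ones.
            for (int len = longest; len > 1; len--)
            {
                if (dictionary.Contains(text.Substring(pos, len)))
                {
                    matched = len;
                    break;
                }
            }
            tokens.Add(text.Substring(pos, matched));
            pos += matched;
        }
        return tokens;
    }

    public static void Main()
    {
        var dictionary = new HashSet<string> { "黑夜" };
        // Before adding the word: 萤火虫 falls apart into single characters.
        Console.WriteLine(string.Join(" | ", Segment("黑夜萤火虫", dictionary, 4)));
        // After adding the word: it is kept as one token.
        dictionary.Add("萤火虫");
        Console.WriteLine(string.Join(" | ", Segment("黑夜萤火虫", dictionary, 4)));
    }
}
```

The same effect is what editing Dict.dct achieves: a word the segmenter has never seen cannot be emitted as a unit until the dictionary contains it.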

4. Chinese segmentation algorithm

private void button1_Click(object sender, EventArgs e)
{
    StringBuilder sb = new StringBuilder();
    string t1 = "";
    int i = 0;
    Analyzer analyzer = new Lucene.China.ChineseAnalyzer();
    StringReader sr = new StringReader(richTextBox1.Text);
    TokenStream stream = analyzer.TokenStream(null, sr);

    long begin = System.DateTime.Now.Ticks;
    Token t = stream.Next();
    while (t != null)
    {
        t1 = t.ToString();   // format: (keyword,0,2); keep only the keyword
        t1 = t1.Replace("(", "");
        char[] separator = { ',' };
        t1 = t1.Split(separator)[0];

        sb.Append(i + ":" + t1 + "\r\n");
        t = stream.Next();
        i++;
    }
    // Stop timing before touching the UI so only segmentation is measured
    long end = System.DateTime.Now.Ticks; // one tick = 100 nanoseconds
    int time = (int)((end - begin) / 10000); // 10,000 ticks = 1 ms

    richTextBox2.Text = sb.ToString();
    richTextBox2.Text += "Elapsed: " + time + " ms \r\n=================================\r\n";
}

Code-behind for the Chinese segmentation test
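The listing above measures elapsed time by subtracting two DateTime.Now.Ticks readings. As an alternative sketch, System.Diagnostics.Stopwatch is the usual .NET tool for timing a block of code, since it does not depend on the wall clock. The helper name `TimingSketch.MeasureMilliseconds` is hypothetical, and the loop in `Main` is only a placeholder for the segmentation work.

```csharp
using System;
using System.Diagnostics;

// Illustrative sketch: timing a piece of work with Stopwatch.
public static class TimingSketch
{
    // Runs the given action once and returns the elapsed time in milliseconds.
    public static long MeasureMilliseconds(Action work)
    {
        Stopwatch watch = Stopwatch.StartNew();
        work();
        watch.Stop();
        return watch.ElapsedMilliseconds;
    }

    public static void Main()
    {
        long ms = MeasureMilliseconds(() =>
        {
            // Placeholder for the segmentation loop being timed.
            long sum = 0;
            for (int i = 0; i < 1000000; i++) sum += i;
        });
        Console.WriteLine("Elapsed: " + ms + " ms");
    }
}
```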

5. Simple search

Create the web form SearchWords.aspx.

[Screenshot: the SearchWords.aspx web form]

<%@ Page Language="C#" AutoEventWireup="true" CodeBehind="SearchWords.aspx.cs" Inherits="PanGu_Search.Views.SearchWords" %>

<!DOCTYPE html>

<html xmlns="http://www.w3.org/1999/xhtml">
<head runat="server">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
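    <title>The simplest search engine</title>
    <script>
    // Pressing Enter triggers the Search button rather than the first button in the form.
    // Plain DOM calls are used because the page does not load jQuery.
    document.addEventListener("keydown", function (event) {
        if (event.keyCode === 13) {
            event.preventDefault();
            document.getElementById("btnGetSearchResult").click();
        }
    });
    </script>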
</head>  
<body>  
    <form id="mainForm" runat="server">  
        <div align="center">  
            <asp:Button ID="btnCreateIndex" runat="server" Text="Create Index" OnClick="btnCreateIndex_Click" />  
            <asp:Label ID="lblIndexStatus" runat="server" Visible="false" />  
            <hr />  
            <asp:TextBox ID="txtKeyWords" runat="server" Text="" Width="250"></asp:TextBox>  
            <asp:Button ID="btnGetSearchResult" runat="server" Text="Search" OnClick="btnGetSearchResult_Click" />  
            <hr />  
        </div>  
        <div>  
            <ul>  
                <asp:Repeater ID="rptSearchResult" runat="server">  
                    <ItemTemplate>  
                        <li>Id:<%#Eval("Id") %><br /><%#Eval("Msg") %></li>  
                    </ItemTemplate>  
                </asp:Repeater>  
            </ul>  
        </div>  
    </form>  
</body>  
</html>

Front-end .aspx markup

/// <summary>
/// Create the index
/// </summary>
/// <param name="sender"></param>
/// <param name="e"></param>
protected void btnCreateIndex_Click(object sender, EventArgs e)
{
    string indexPath = Context.Server.MapPath("~/Index"); // where the index files are stored
    FSDirectory directory = FSDirectory.Open(new DirectoryInfo(indexPath), new NativeFSLockFactory());
    bool isUpdate = IndexReader.IndexExists(directory); // does an index already exist?
    if (isUpdate)
    {
        // If the index directory is locked (for example, the program crashed while indexing), unlock it first.
        // Lucene.Net locks the index automatically before writing and unlocks it on Close.
        // This is not safe with concurrent writers; it only handles a lock left behind by accident.
        if (IndexWriter.IsLocked(directory))
        {
            IndexWriter.Unlock(directory);  // force unlock; to be improved
        }
    }
    // Create the writer: IndexWriter(index directory, analyzer (PanGu here), create a new index?, max field length)
    // Note: opening the directory with IndexWriter locks the index files automatically
    IndexWriter writer = new IndexWriter(directory, new PanGuAnalyzer(), !isUpdate,
        IndexWriter.MaxFieldLength.UNLIMITED);

    // Index the sample articles 1.txt and 2.txt
    for (int i = 1; i < 3; i++)
    {
        string txt = File.ReadAllText(Context.Server.MapPath("~/Upload/Articles/") + i + ".txt");
        // One Document corresponds to one record
        Document document = new Document();
        // Each Document has its own fields; field names are user-defined and values are strings.
        // Field.Store.YES stores the original value alongside the index, so no database lookup is needed later.
        document.Add(new Field("id", i.ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
        // Fields that need full-text search use Field.Index.ANALYZED:
        // the content is indexed as segmented terms, which the later term-based search relies on.
        // WITH_POSITIONS_OFFSETS: also store each term's position and offsets, not just the terms.
        document.Add(new Field("msg", txt, Field.Store.YES, Field.Index.ANALYZED,
            Field.TermVector.WITH_POSITIONS_OFFSETS));
        // Avoid duplicates: delete any existing document with this id first
        // (deletes nothing if none exists), like: delete from t where id=i
        writer.DeleteDocuments(new Term("id", i.ToString()));
        // Write the document to the index
        writer.AddDocument(document);
        Console.WriteLine("Index {0} created", i.ToString());
    }

    writer.Close(); // Close releases the lock on the index files
    directory.Close();  // remember to Close, otherwise the indexed data may not be searchable

    lblIndexStatus.Text = "Index created successfully!";
    lblIndexStatus.Visible = true;
    btnCreateIndex.Enabled = false;
}

Index creation method

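/// <summary>
/// Search
/// </summary>
/// <param name="sender"></param>
/// <param name="e"></param>
protected void btnGetSearchResult_Click(object sender, EventArgs e)
{
    string keyword = txtKeyWords.Text;

    string indexPath = Context.Server.MapPath("~/Index"); // where the index files are stored
    FSDirectory directory = FSDirectory.Open(new DirectoryInfo(indexPath), new NoLockFactory());
    IndexReader reader = IndexReader.Open(directory, true); // read-only
    IndexSearcher searcher = new IndexSearcher(reader);
    // Query
    PhraseQuery query = new PhraseQuery();
    // Split the keyword with the same analyzer used for indexing so each term matches an
    // indexed token; the raw keyword as a single term only matches if it is itself one token.
    // Roughly: where contains("msg", each word of the keyword)
    TokenStream keywordStream = new PanGuAnalyzer().TokenStream("msg", new StringReader(keyword));
    Token keywordToken;
    while ((keywordToken = keywordStream.Next()) != null)
    {
        query.Add(new Term("msg", keywordToken.TermText()));
    }
    // Terms more than 100 positions apart (a rule-of-thumb value) are not matched,
    // because words that far apart are unlikely to be related
    query.SetSlop(100);
    // TopScoreDocCollector: container for the search results
    TopScoreDocCollector collector = TopScoreDocCollector.create(1000, true);
    // Run the query and put the results into the collector
    searcher.Search(query, null, collector);
    // Take results m through n from the collector;
    // collector.GetTotalHits() is the total number of hits
    ScoreDoc[] docs = collector.TopDocs(0, collector.GetTotalHits()).scoreDocs;
    // Walk through the results
    IList<SearchResult> resultList = new List<SearchResult>();
    for (int i = 0; i < docs.Length; i++)
    {
        // A hit carries only the document id, because a Document can be large
        // (compare DataSet versus DataReader)
        int docId = docs[i].doc;
        // Fetch the content by id in a second step: what went in as a Document comes out as a Document
        Document doc = searcher.Doc(docId);
        SearchResult result = new SearchResult();
        result.Id = Convert.ToInt32(doc.Get("id"));
        result.Msg = HighlightHelper.HighLight(keyword, doc.Get("msg"));

        resultList.Add(result);
    }
    searcher.Close();
    reader.Close();

    // Bind to the Repeater
    rptSearchResult.DataSource = resultList;
    rptSearchResult.DataBind();
}

Search method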
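The search method relies on two types the article does not show: `SearchResult` and `HighlightHelper`. The sketch below is a hypothetical stand-in so the listing can be read end to end. `SearchResult` is inferred from how it is used (an integer `Id` and a string `Msg`), and `HighlightHelperSketch.HighLight` is a naive replacement that wraps each literal occurrence of the keyword in `<em>` tags. The project's real helper may work quite differently, for example by segmenting the keyword first.

```csharp
using System;

// Hypothetical shape, inferred from result.Id and result.Msg in the search method.
public class SearchResult
{
    public int Id { get; set; }
    public string Msg { get; set; }
}

// Hypothetical stand-in for the article's HighlightHelper (not shown in the source).
public static class HighlightHelperSketch
{
    // Wraps every literal occurrence of the keyword in <em> tags.
    // Assumption: content is trusted text; no HTML encoding is performed here.
    public static string HighLight(string keyword, string content)
    {
        if (string.IsNullOrEmpty(keyword) || string.IsNullOrEmpty(content))
        {
            return content;
        }
        return content.Replace(keyword, "<em>" + keyword + "</em>");
    }

    public static void Main()
    {
        var result = new SearchResult();
        result.Id = 1;
        result.Msg = HighLight("牛奶", "喝奶只喝纯牛奶");
        Console.WriteLine(result.Id + ": " + result.Msg);
    }
}
```

Because the sketch inserts markup without encoding the content, text that comes from users should be HTML-encoded before highlighting in a real page.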

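protected void Page_Load(object sender, EventArgs e)
{
    if (!IsPostBack)
    {
        // Check whether index files have already been generated
        CheckIndexData();
    }
}

/// <summary>
/// Check whether the index has been created
/// </summary>
private void CheckIndexData()
{
    string indexPath = Context.Server.MapPath("~/Index"); // where the index files are stored
    // GetFiles throws if the directory does not exist, so treat that as "no index yet"
    if (!System.IO.Directory.Exists(indexPath))
    {
        return;
    }
    var files = System.IO.Directory.GetFiles(indexPath);
    if (files.Length > 0)
    {
        btnCreateIndex.Visible = false;
        lblIndexStatus.Text = "Simple search";
        lblIndexStatus.Visible = true;
    }
}

Index existence check method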

The running result:

[Screenshot: search page with results]
