C# TrieTree介绍及实现方法

程序员文章站 2023-12-15 18:12:40

在自然语言处理（nlp）研究中，ngram是最基本但也是最有用的一种比对方式，这里的n是需要比对的字符串的长度，而今天我介绍的trietree，正是和ngram密切相关的一...

在自然语言处理（nlp）研究中，ngram是最基本但也是最有用的一种比对方式，这里的n是需要比对的字符串的长度，而今天我介绍的trietree，正是和ngram密切相关的一种数据结构，有人称之为字典树。trietree简单的说是一种多叉树，每个节点保存一个字符，这么做的好处是当我们要做ngram比对时，只需要直接从树的根节点开始沿着某个树叉遍历下去，就能完成比对；如果没找到，停止本次遍历。这话讲得有些抽象，我们来看一个实际的例子。

假设我们现在词库里面有以下一些词：

上海市
上海滩
上海人
上海公司
北京
北斗星
杨柳
杨浦区

C# TrieTree介绍及实现方法

如图所示：挂在根节点上的字有上、北、杨，

如果我们现在对“上海市杨浦区”这个词做3gram就有上海市、海市杨、市杨浦、杨浦区，现在我们要知道哪些词是能够被这个字典识别的，通常我们可以用ngram来做分词。有了这颗树，我们只需要依次取每个字符，从根开始进行比对，比如上海市，我们能够匹配上->海->市，这个路径，所以匹配；比如海市杨，由于没有“海”字挂在根节点上，所以停止；市杨浦也无法匹配；最终匹配杨浦区，得到杨->浦->区这个路径，匹配。

最终我们可以把“上海市杨浦区”切分为上海市|杨浦区。

尽管trietree要比普通字符串数组节省很多时间，但这并不是没有代价的，因为你要先根据字典构建这棵树，这个代价并不低，当然对于某个应用来说一旦trietree构建完成就可以重复使用，所以针对大规模比对来说，性能提升还是很客观的。

下面是trietree的c#实现。

复制代码代码如下:

   public class trietree    
   {        
   trienode _root = null;  
   private trietree()     
   {           
   _root = new trienode(char.maxvalue,0);  
   charcount = 0;   
   }       
   static trietree _instance = null; 
   public static trietree getinstance()   
   {            
   if (_instance == null)        
   {             
   _instance = new trietree();       
   }           
   return _instance;   
   }        
   public trienode root   
   {           
   get { return _root; 
   }   
   }        
   public void addword(char ch) 
   {           
trienode newnode=_root.addchild(ch);   
newnode.increasefrequency();           
newnode.wordended = true;      
}        int charcount;    
public void addword(string word)   
{          
if (word.length == 1)     
{               
addword(word[0]);     
charcount++;       
}         
else    
{                 
char[] chars=word.tochararray();     
trienode node = _root;           
charcount += chars.length;      
for (int i = 0; i < chars.length; i++)  
{                    
trienode newnode=node.addchild(chars[i]);    
newnode.increasefrequency();           
node = newnode;           
}           
node.wordended = true;  
}       
}       
public int getfrequency(char ch)   
{           
trienode matchednode = _root.children.firstordefault(n => n.character == ch);  
if (matchednode == null)      
{               
return 0;        
}           
return matchednode.frequency;  
}       
public int getfrequency(string word)
{        
if (word.length == 1) 
{              
return getfrequency(word[0]); 
}            
else      
{            
char[] chars = word.tochararray(); 
trienode node = _root;        
for (int i = 0; i < chars.length; i++)   
{                 
if (node.children == null)   
return 0;              
trienode matchednode = node.children.firstordefault(n => n.character == chars[i]);
if (matchednode == null)          
{                      
return 0;         
}                  
node = matchednode;    
}              
if (node.wordended == true)        
return node.frequency;       
else                   
return -1;           
}      
}   
}

这里我们使用了单例模式，因为trietree类似缓存，不需要重复创建。下面是treenode的实现：

复制代码代码如下:

   public class trienode    
   {       
   public trienode(char ch,int depth)
   {           
   this.character=ch; 
   this._depth=depth;
   }       
   public char character; 
   int _depth;        
   public int depth   
   {            
   get{return _depth;
   }      
   }     
   trienode _parent=null;
   public trienode parent     
   {          
   get { 
   return _parent;
   }          
   set { _parent = value;
   } 
   }       
   public bool wordended = false;  
   hashset<trienode> _children=null;  
   public hashset<trienode> children 
   {           
   get { 
   return _children;
   }       
   }        
   public trienode getchildnode(char ch) 
   {            
   if (_children != null)    
   return _children.firstordefault(n => n.character == ch);  
   else               
   return null;      
   }       
   public trienode addchild(char ch) 
   {           
   trienode matchednode=null;      
   if (_children != null)      
   {               
   matchednode = _children.firstordefault(n => n.character == ch);  
   }           
   if (matchednode != null)   
   //found the char in the list   
   {                
   //matchednode.increasefrequency();      
   return matchednode;         
   }           
   else          
   {  
   //not found       
   trienode node = new trienode(ch, this.depth + 1);     
   node.parent = this;      
   //node.increasefrequency();            
   if (_children == null)               
   _children = new hashset<trienode>();   
   _children.add(node);              
   return node;          
   }       
   }       
   int _frequency = 0;       
   public int frequency    
   {         
   get { return _frequency;
   }        
   }       
   public void increasefrequency()      
   {          
   _frequency++;   
   }      
   public string getword() 
   {             
   trienode tmp=this;     
   string result = string.empty;  
   while(tmp.parent!=null) //until root node  
   {                
   result = tmp.character + result;   
   tmp = tmp.parent;     
   }            
   return result;     
   }        
   public override string tostring()
   {          
   return convert.tostring(this.character);
   }    
   }

C# TrieTree介绍及实现方法

C#实现发送邮件的三种方法

C#中方括号[]的语法及作用介绍

基于C#后台调用跨域MVC服务及带Cookie验证的实现

C#实现.net页面之间传值传参方法汇总

C#实现鼠标移动到曲线图上显示值的方法

C# websocket及时通信协议的实现方法示例

C#实现去除Strings中空格的方法

C#实现对象XML序列化的方法

C#判断页面中的多个文本框输入值是否有重复的实现方法

C#实现HTTP上传文件的方法