Lua源码中字符串类型的实现

程序员文章站 2022-08-29 21:49:58

概述 lua完全采用8位编码，lua字符串中的字符可以具有任何数值编码，包括数值0。也就是说，可以将任意二进制数据存储到...

概述

lua完全采用8位编码，lua字符串中的字符可以具有任何数值编码，包括数值0。也就是说，可以将任意二进制数据存储到一个字符串中。lua的字符串是不可变的值（immutable values）。如果修改，实质上是新建一个字符串。根据上文《lua中数据类型的源码实现》中知道，在lua中，字符串是自动内存管理机制所管理的对象，并且由联合体tstring来实现存储字符串值的。下面将通过lua 5.2.1的源码来看字符串的实现以及总结了在lua中使用字符串的注意事项。

源码实现

首先来看字符串对应的数据结构tstring，其源码如下（lobject.h）：

410 /* 
411 ** header for string value; string bytes follow the end of this structure 
412 */ 
413 typedef union tstring { 
414  l_umaxalign dummy; /* ensures maximum alignment for strings */ 
415  struct { 
416   commonheader; 
417   lu_byte extra; /* reserved words for short strings; "has hash" for longs */ 
418   unsigned int hash; 
419   size_t len; /* number of characters in string */ 
420  } tsv; 
421 } tstring;

对这个联合体定义，有几点值得说明：

i、联合体tstring中成员l_umaxalign dummy是用来保证与最大长度的c类型进行对齐，其定义如下（llimits.h）：

48 /* type to ensure maximum alignment */ 
49 #if !defined(luai_user_alignment_t) 
50 #define luai_user_alignment_t  union { double u; void *s; long l; } 
51 #endif 
52    
53 typedef luai_user_alignment_t l_umaxalign;

在其他可会回收的对象（比如table）的实现中，也可看到这个联合体成员，这样做的目的是通过内存对齐，加快cpu访问内存的速度。

ii、联合体中成员tsv才是真正用来实现字符串的。其中成员commonheader用于gc，它会以宏的形式在所有的可回收对象中定义，代码如下（lobject.h）：

74 /* 
75 ** common header for all collectable objects (in macro form, to be 
76 ** included in other objects) 
77 */ 
78 #define commonheader  gcobject *next; lu_byte tt; lu_byte marked

这个宏对应的结构体形式如下（lobject.h）：

81 /* 
82 ** common header in struct form 
83 */ 
84 typedef struct gcheader { 
85  commonheader; 
86 } gcheader;

结构体gcheader在通用的可回收对象union gcobject的定义中有用到。

iii、lu_byte extra对于短字符串，用来记录这个字符串是否为保留字，对于长字符串，可以用于惰性求hash值；unsigned int hash成员是字符串对应的hash值（在后面会具体讲lua是怎么计算字符串的hash值的）；size_t len用来表示字符串的长度。

iv、上面的结构体只是描述了一个字符串的结构，真正的字符串数据保存是紧随在结构体后面保存。

在lua5.2.1之前，不区分字符串长和短的字符串，所有的字符串保存在一个全局的hash表中，对于lua虚拟机来说，相同的字符串只有一份数据，从lua5.2.1开始，只是把短的字符串字符串（当前定义是长度小于等于40）放在全局hash表中，而长字符串都是独立生成，同时在计算hash值时，引入一个随机种子，这样做的目的防止hash dos——攻击者构造出非常多相同hash值的不同字符串，从而降低lua从外部压入字符串进入全局的字符串hash表的效率。下面是lua5.2.1中，生成一个新字符串的步骤，其相应的代码都在lstring.c中：

（1）若字符串长度大于luai_maxshortlen（默认值是40），则是长字符串，直接调用创建字符串接口的函数createstrobj（当然字符串的长度需要能保存在成员size_t len中，否则直接返回），代码如下（lstring.c）：

 95 /* 
 96 ** creates a new string object 
 97 */                                                  
 98 static tstring *createstrobj (lua_state *l, const char *str, size_t l,                
 99                int tag, unsigned int h, gcobject **list) {              
100  tstring *ts;                                            
101  size_t totalsize; /* total size of tstring object */                       
102  totalsize = sizeof(tstring) + ((l + 1) * sizeof(char));                      
103  ts = &luac_newobj(l, tag, totalsize, list, 0)->ts; 
104  ts->tsv.len = l; 
105  ts->tsv.hash = h; 
106  ts->tsv.extra = 0;                                         
107  memcpy(ts+1, str, l*sizeof(char)); 
108  ((char *)(ts+1))[l] = '\0'; /* ending 0 */                            
109  return ts; 
110 }

可以看到把传入的字符串具体内存放在紧随结构体tstring内存后面，并且注意108行，字符串以”\0”结束与c语言字符串兼容的。

（2）若字符串是短字符串，首先计算字符串的hash值，找到对应的链表（短字符串的全局hash表，使用的是链接法的方法，即把所有冲突的元素放在同一个链表中），查找当前要创建的字符串是否已经在hash表中，若已经存在，则直接返回这个字符串。否则会调用函数newshrstr，而函数newshrstr会调用上面的createstrobj函数创建新字符串，并把新创建的字符串放到hash表中，代码如下（lstring.c））：

130 /* 
131 ** checks whether short string exists and reuses it or creates a new one 
132 */ 
133 static tstring *internshrstr (lua_state *l, const char *str, size_t l) { 
134  gcobject *o; 
135  global_state *g = g(l); 
136  unsigned int h = luas_hash(str, l, g->seed); 
137  for (o = g->strt.hash[lmod(h, g->strt.size)]; 
138    o != null; 
139    o = gch(o)->next) { 
140   tstring *ts = rawgco2ts(o); 
141   if (h == ts->tsv.hash && 
142     ts->tsv.len == l && 
143     (memcmp(str, getstr(ts), l * sizeof(char)) == 0)) { 
144    if (isdead(g(l), o)) /* string is dead (but was not collected yet)? */ 
145     changewhite(o); /* resurrect it */ 
146    return ts; 
147   } 
148  } 
149  return newshrstr(l, str, l, h); /* not found; create a new string */ 
150 }

全局的字符串hash表是保存在虚拟机全局状态成员strt中的（lstate.h）：

119  stringtable strt; /* hash table for strings */

而类型stringtable是一个结构体，其定义如下（lstate.h）：

59 typedef struct stringtable {                                     
60  gcobject **hash; 
61  lu_int32 nuse; /* number of elements */ 
62  int size; 
63 } stringtable;

其中成员gcobject **hash是一个指针数组，数组中每个成员实质指向tstring（注意tstring中包括宏commonheader，该宏中的next成员可以构造出hash表中开散的链表）；nuse是数组hash中已经被使用的元素个数；size是当前数组hash的大小。

在函数newshrstr插入新的字符串前，都会判断nuse值是否大于size，若大于，表明hash表大小不够需要扩充，则把hash表的大小扩充到原来的2倍，对应的代码如下（lstring.c）:

121  if (tb->nuse >= cast(lu_int32, tb->size) && tb->size <= max_int/2)                 
122   luas_resize(l, tb->size*2); /* too crowded */

在gc的时候，会判断nuse是否比size/2还小（在lua 5.1中是把nuse与size/4比较），如果是的话就重新resize这个stringtable的大小为原来的一半。对应的代码如下（lgc.c）：

783   int hs = g->strt.size / 2; /* half the size of the string table */               
784   if (g->strt.nuse < cast(lu_int32, hs)) /* using less than that half? */            
785    luas_resize(l, hs); /* halve its size */

对于字符串比较，首先比较类型，若是不同类型字符串，则肯定不相同，然后区分短字符串和长字符串，对于短字符串，若两者指针值相等，则相同，否则不相同；对应长字符串，则首先比较指针值，若不同，则比较长度值和内容逐字符比较。

总结

（1）lua中的保留字和元方法名都是做为短字符串的，他们在虚拟机启动的时候就已经放入到全局短的字符串hash表，并且是不回收的。

（2）查找字符是比较高效的，但是修改或插入字符串都是比较低效的，这里面除了计算外，至少要把外面的字符串拷贝到虚拟机中。

（3）对应长字符串的hash值，lua是不会考察每个字符的，因而能避免快速计算长字符串的散列值，其相应的代码如下（lstring.c）：

51 unsigned int luas_hash (const char *str, size_t l, unsigned int seed) {               
52  unsigned int h = seed ^ l;                                     
53  size_t l1;                                             
54  size_t step = (l >> luai_hashlimit) + 1;                              
55  for (l1 = l; l1 >= step; l1 -= step)                                
56   h = h ^ ((h<<5) + (h>>2) + cast_byte(str[l1 - 1]));                       
57  return h;                                             
58 }  
21 /*                                                 
22 ** lua will use at most ~(2^luai_hashlimit) bytes from a string to                 
23 ** compute its hash                                         
24 */                                                  
25 #if !defined(luai_hashlimit)                                     
26 #define luai_hashlimit   5                                    
27 #endif

（4）当有多个字符串连接时，不应该直接用字符串连接运算符”..”，而是用table.concat操作或者是string.format来加快字符串连接的操作。

（5）lua中字符串hash算法用的是jshash，关于字符串的各种hash函数，网络有篇文章对它进行了总结：

以上所述谁就是本文的全部内容了，希望能对大家学习lua有所帮助。

上一篇：免费ip共享库

下一篇： lua+love2d制作的2048游戏

Lua源码中字符串类型的实现

python实现查找两个字符串中相同字符并输出的方法

python实现在字符串中查找子字符串的方法

Python中实现字符串类型与字典类型相互转换的方法

php实现指定字符串中查找子字符串的方法

MySQL数据库中CAST与CONVERT函数实现类型转换的讲解

以php中的比较运算符操作整型,浮点型,字符串型,布尔型和空类型

C#中字符串的加密的源码

Shell脚本实现查找字符串中某字符最后出现的位置

Android中实现多行、水平滚动的分页的Gridview实例源码

Oracle中字符串连接的实现方法