欢迎您访问程序员文章站本站旨在为大家提供分享程序员计算机编程知识!
您现在的位置是: 首页  >  IT编程

字符串相似度处理函数

程序员文章站 2022-03-24 09:03:30
oracle里面查比如存储过程里面与表SALES有关jobs: SELECT * FROM (SELECT a.name,upper(b.what)AS what,SYS.UTL_MATCH.edit_distance_similarity (a.name,upper(b.what)) AS sim ......
oracle里面查比如存储过程里面与表sales有关jobs:
select * from (select a.name,upper(b.what)as what,sys.utl_match.edit_distance_similarity
(a.name,upper(b.what)) as similarity from dba_source a,dba_jobs b
where  sys.utl_match.edit_distance_similarity(a.name,upper(b.what))>80
and upper(a.text)  like '%sales%'
and b.what not like '%dbms_refresh%')
order by 3 desc;

有自带相似度函数 sys.utl_match.edit_distance_similarity 可以直接用。

mysql8里面我只查到了一个搜索相关的文档并不适合我自己用,找了一下大佬的文章,发现了一个能用的自定义函数,用于计算字符串相似度。

delimiter $$
create definer=`root`@`localhost` function `compare_string`( s1 text, s2 text) returns int(11)
    deterministic
begin 
    declare s1_len, s2_len, i, j, c, c_temp, cost int; 
    declare s1_char char; 
    declare cv0, cv1 text; 
    set s1_len = char_length(s1), s2_len = char_length(s2), cv1 = 0x00, j = 1, i = 1, c = 0; 
    if s1 = s2 then 
      return 0;     
    elseif s1_len = 0 then 
      return s2_len; 
    elseif s2_len = 0 then 
      return s1_len; 
    else 
      while j <= s2_len do 
        set cv1 = concat(cv1, unhex(hex(j))), j = j + 1; 
      end while; 
      while i <= s1_len do 
        set s1_char = substring(s1, i, 1), c = i, cv0 = unhex(hex(i)), j = 1; 
        while j <= s2_len do 
          set c = c + 1; 
          if s1_char = substring(s2, j, 1) then  
            set cost = 0; else set cost = 1; 
          end if; 
          set c_temp = conv(hex(substring(cv1, j, 1)), 16, 10) + cost; 
          if c > c_temp then set c = c_temp; end if; 
            set c_temp = conv(hex(substring(cv1, j+1, 1)), 16, 10) + 1; 
            if c > c_temp then  
              set c = c_temp;  
            end if; 
            set cv0 = concat(cv0, unhex(hex(c))), j = j + 1; 
        end while; 
        set cv1 = cv0, i = i + 1; 
      end while; 
    end if; 
    return c; 
  end$$
delimiter ;

delimiter $$
create definer=`root`@`localhost` function `similarity_string`(a text, b text) returns double
begin
return abs(((compare_string(a, b) / length(b)) * 100) - 100);
end$$
delimiter ;

试了一下还挺好用的,一些逻辑可以自己再适当的修改。

而在gp里面,我找了许久,发现一个 fuzzystrmatch  看起来比较高档的函数,这个函数需要安装:

[gpadmin@szwpldb1085 ~]$ psql -d postgres -f $gphome/share/postgresql/contrib/fuzzystrmatch.sql
set
create function
create function
create function
create function
create function
create function
create function
create function
create function
create function

然后会在 postgres 库多出一些函数:

字符串相似度处理函数

用法在