Search Engine Project: Word Segmentation

程序员文章站 2022-03-05 10:05:56


Word segmentation

Learning source: jieba word segmentation

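Steps:

Import the data from the database
Segment in search engine mode (the finest-grained mode), using a custom dictionary, userdict.txt
Remove the stop words listed in stop.txt
Export the data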

Packages: pymysql, jieba

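Format of the txt files
userdict.txt: the words to keep as single tokens, one entry per line, each followed by its frequency and part of speech, separated by spaces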
stop.txt: the words to remove, one per line
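
As a rough illustration of the two file formats, the sketch below parses entries in the layout jieba's user dictionary uses (word, optional frequency, optional part-of-speech tag, separated by spaces) and a stop-word file with one word per line. The sample entries and the helper names `parse_userdict_line` and `parse_stopwords` are assumptions made for this example; they are not the contents of the real files or part of jieba.

```python
def parse_userdict_line(line):
    """Split one userdict line into (word, freq, tag); freq and tag may be absent."""
    parts = line.split()
    word = parts[0]
    freq = None
    tag = None
    for item in parts[1:]:
        if item.isdigit():
            freq = int(item)   # a purely numeric field is treated as the frequency
        else:
            tag = item         # anything else is treated as the part-of-speech tag
    return word, freq, tag


def parse_stopwords(text):
    """Return the set of non-empty lines found in a stop-word file's text."""
    return {line.strip() for line in text.splitlines() if line.strip()}


# Hypothetical sample contents, for illustration only.
sample_userdict = "英国短毛猫 10 n\n布偶猫 n\n"
sample_stop = "英国\n短毛猫\n\n"

entries = [parse_userdict_line(l) for l in sample_userdict.splitlines() if l.strip()]
stopwords = parse_stopwords(sample_stop)
print(entries)
print(sorted(stopwords))
```

Keeping the stop words in a set makes the later `word not in stopwords` check fast, which may matter when many rows are processed.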

A shortcut:

Time was really tight, so I took a shortcut.

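Why segment in search engine mode first, then segment again and remove stop words

After search engine mode segmentation, 英国短毛猫 (British Shorthair) becomes 英国, 短毛猫, and 英国短毛猫 (the one I want)
Whenever 英国短毛猫 appears, it is always split into 英国, 短毛猫, and 英国短毛猫,
so joining the tokens back into a string gives 英国短毛猫英国短毛猫
To remove the duplicated tokens, segment first, then use the stop-word list to drop 英国 and 短毛猫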
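
The duplication described here can be reproduced without jieba by hard-coding the tokens from the example. The sketch below is only an illustration under stated assumptions: the token list is copied from the 英国短毛猫 case, and `filter_tokens` is a hypothetical helper, not a jieba function. It also filters the search-mode tokens directly, whereas the article's code joins them and segments a second time before filtering.

```python
def filter_tokens(tokens, stopwords):
    """Drop stop words and tab characters, keeping the remaining tokens in order."""
    return [t for t in tokens if t not in stopwords and t != "\t"]


# Tokens as listed in the text for search engine mode (assumed, not computed here).
tokens = ["英国", "短毛猫", "英国短毛猫"]
stopwords = {"英国", "短毛猫"}

joined_raw = "".join(tokens)                                  # duplicates remain
joined_filtered = "".join(filter_tokens(tokens, stopwords))   # only the full term remains

print(joined_raw)
print(joined_filtered)
```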
Code

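# Table c: cat sale listings
import jieba
# Connect to the database
import pymysql
conn = pymysql.connect(host="localhost",user="root",password="*******",database="try1",charset="utf8")
# password: the database password; database: the database to operate on
cursor = conn.cursor()
# Fetch every row of table c and store the result in name
# (To avoid mistakes that cannot be undone, the segmented data is written to a newly created table d, so all columns are fetched here)
sql = '''select * from c;'''
cursor.execute(sql)
name = cursor.fetchall()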

# Read the stop words from the txt file into a list
def stopwordslist(filepath):
    # Use a context manager so the file is closed after reading
    with open(filepath, 'r', encoding='utf-8') as f:
        stopwords = [line.strip() for line in f]
    return stopwords

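# Segment with the custom dictionary
# Path to the custom dictionary; loaded once here rather than on every call to cut()
jieba.load_userdict("..../sousuo/fenci/userdict.txt")
def cut(i):
    seg_list = list(jieba.cut_for_search(i))  # search engine mode; store the tokens in a list
    # Join the tokens back into a single string, skipping tabs
    sentence = ""
    for word in seg_list:
        if word != '\t':
            sentence += word
    return sentence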
    
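# Remove stop words
def seg_sentence(sentence):
    # Segment again
    sentence_seged = jieba.cut(sentence.strip())  # strip() removes leading and trailing whitespace; jieba.cut returns a generator
    stopwords = stopwordslist("..../sousuo/fenci/stop.txt")  # path to the stop-word file
    # Drop the stop words and return the remaining text as a string
    outstr = ''
    for word in sentence_seged:
        if word not in stopwords:
            if word != '\t':
                outstr += word
    return outstr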



# SQL statements that create table d
cursor.execute("drop table if EXISTS products")
sql="""create table  products (
            id INT NOT NULL  PRIMARY KEY auto_increment,
            product VARCHAR(60),
            web VARCHAR(100),
            price FLOAT(10)
          );
          """
# Leave INT without a length; writing INT() raises an error
cursor.execute(sql)


for i in name:
    pro = str(i[1])  # i = [id, product, web, price]
    sentence = cut(pro)  # first-pass segmentation; returns a string
    line_seg = seg_sentence(sentence)  # remove stop words
    # Store the result in table d
    try:
        sql1 = "insert into products(product,web,price) values(%s,%s,%s);"
        cursor.execute(sql1,(line_seg,i[2],i[3]))
    except Exception as e:
        print("Error:", e)  # report the failed insert instead of hiding the cause
conn.commit()
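
The storage step can be tried without a MySQL server. The sketch below swaps in Python's built-in sqlite3 module as a stand-in for pymysql (so the placeholder is `?` rather than `%s`), uses made-up rows, and relies on a hypothetical `store_rows` helper. It is meant only to show parameterized inserts with a commit on success and a rollback on failure, not the exact behavior of the MySQL code.

```python
import sqlite3


def store_rows(conn, rows):
    """Insert (product, web, price) rows; roll back the batch if any insert fails."""
    try:
        conn.executemany(
            "INSERT INTO products (product, web, price) VALUES (?, ?, ?)", rows
        )
        conn.commit()
        return True
    except sqlite3.Error as exc:
        conn.rollback()
        print("insert failed:", exc)
        return False


# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "product VARCHAR(60), web VARCHAR(100), price FLOAT)"
)

# Made-up rows standing in for the segmented output.
rows = [
    ("英国短毛猫", "https://example.com/a", 1500.0),
    ("布偶猫", "https://example.com/b", 3000.0),
]
ok = store_rows(conn, rows)
stored = conn.execute(
    "SELECT product, web, price FROM products ORDER BY id"
).fetchall()
print(ok, stored)
```

Passing the values as parameters, as the original `cursor.execute(sql1, (...))` call also does, avoids building SQL strings by hand from product names.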
