Machine Learning, From Beginner to Intermediate (Natural Language Processing)
程序员文章站
2022-03-06 21:24:22
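1. Natural Language Processing
- The problem
Given a collection of short texts (here, restaurant reviews), determine whether each one is negative or positive.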
- The data
Review                                               Liked
Wow... Loved this place.                             1
Crust is not good.                                   0
Not tasty and the texture was just nasty.            0
Stopped by during the late May bank holiday of...    1
...
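To make the data layout concrete, here is a minimal sketch that parses a few of the rows above with Python's standard csv module. The tab separator and header row are assumptions based on the excerpt, and SAMPLE_TSV and load_reviews are hypothetical names used only for illustration.

```python
import csv
import io

# Three rows copied from the excerpt above; the tab separator and the
# header row are assumptions about the file layout.
SAMPLE_TSV = (
    "Review\tLiked\n"
    "Wow... Loved this place.\t1\n"
    "Crust is not good.\t0\n"
    "Not tasty and the texture was just nasty.\t0\n"
)


def load_reviews(text):
    """Return (review, label) pairs from tab-separated text with a header row."""
    # QUOTE_NONE treats quote characters inside a review as ordinary text.
    reader = csv.DictReader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return [(row["Review"], int(row["Liked"])) for row in reader]


reviews = load_reviews(SAMPLE_TSV)
positive = sum(label for _, label in reviews)
print(len(reviews), "reviews,", positive, "liked")  # prints: 3 reviews, 1 liked
```

csv.QUOTE_NONE has the integer value 3, so it is the same setting as passing quoting=3 to a reader.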
- Code
import re

import pandas as pd
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Download the stop word list once if it is not installed yet:
# import nltk
# nltk.download('stopwords')

# quoting=3 (csv.QUOTE_NONE): treat quote characters in the reviews as plain text
dataset = pd.read_csv("Restaurant_Reviews.tsv", delimiter='\t', quoting=3)

ps = PorterStemmer()
stop_words = set(stopwords.words("english"))  # build the set once, not once per word

corpus = []
for text in dataset['Review']:
    review = re.sub('[^a-zA-Z]', ' ', text)  # replace non-letters with spaces
    review = review.lower()                  # convert to lowercase
    review = review.split()
    # Drop stop words and stem the rest, e.g. loved -> love
    review = [ps.stem(word) for word in review if word not in stop_words]
    review = " ".join(review)
    corpus.append(review)
    # print(review)

cv = CountVectorizer(max_features=1500)  # keep the 1500 most frequent words
X = cv.fit_transform(corpus).toarray()   # convert the sparse matrix to a dense array
y = dataset.iloc[:, 1].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

classifier = GaussianNB()
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
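To illustrate what the cleaning and counting steps produce, here is a simplified pure-Python sketch of a bag-of-words matrix. It is not how CountVectorizer is implemented: the stop word list is a tiny hypothetical one, stemming is skipped, and clean and bag_of_words are illustrative names rather than library functions.

```python
import re
from collections import Counter

# Hypothetical, tiny stop word list for illustration only;
# NLTK's English list is much longer.
STOP_WORDS = {"this", "is", "and", "the", "was", "just"}


def clean(text):
    """Replace non-letters with spaces, lowercase, split, and drop stop words."""
    words = re.sub("[^a-zA-Z]", " ", text).lower().split()
    return [w for w in words if w not in STOP_WORDS]


def bag_of_words(docs, max_features):
    """Return (vocabulary, count matrix) keeping the most frequent words."""
    tokenized = [clean(d) for d in docs]
    counts = Counter(w for doc in tokenized for w in doc)
    # Keep the max_features most frequent words; columns are sorted alphabetically.
    vocab = sorted(w for w, _ in counts.most_common(max_features))
    index = {w: j for j, w in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in docs]
    for i, doc in enumerate(tokenized):
        for w in doc:
            if w in index:
                matrix[i][index[w]] += 1
    return vocab, matrix


docs = [
    "Wow... Loved this place.",
    "Crust is not good.",
    "Not tasty and the texture was just nasty.",
]
vocab, X = bag_of_words(docs, max_features=1500)
print(vocab)
for row in X:
    print(row)
```

Each row is one review and each column is one vocabulary word, so a classifier sees only word counts, not word order.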
Output:
Confusion matrix: array([[55, 42], [12, 91]], dtype=int64)
The accuracy is 73% ((55 + 91) / 200 correct predictions), which is not ideal but acceptable.
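The accuracy figure can be checked directly from the confusion matrix. The sketch below assumes the usual scikit-learn layout (rows are true labels, columns are predicted labels, class 1 means "liked"), and also derives precision, recall and F1, which the original post does not report.

```python
# Confusion matrix from the output above.
# Assumed layout: rows = true labels, columns = predicted labels.
cm = [[55, 42],
      [12, 91]]

(tn, fp), (fn, tp) = cm
total = tn + fp + fn + tp

accuracy = (tn + tp) / total   # share of all predictions that are correct
precision = tp / (tp + fp)     # of the reviews predicted "liked", how many were
recall = tp / (tp + fn)        # of the truly "liked" reviews, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
# prints: accuracy=0.73 precision=0.68 recall=0.88 f1=0.77
```

Under that layout, 42 of the 97 negative reviews are misclassified as positive, which accounts for most of the errors.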