Matrix Factorization Recommendation Based on Stochastic Gradient Descent (Python)
SVD is a common matrix factorization method. Its principle: a matrix M can be written as the product of three matrices A, B, and C, and B can be merged into either A or C, leaving two factors M1 and M2 whose product still recovers M.
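For instance, this folding step can be verified with numpy (a minimal sketch, not part of the original post's code):

import numpy as np

m = np.random.rand(6, 4)                          # any matrix M
a, b, ct = np.linalg.svd(m, full_matrices=False)  # M = A * diag(B) * C
m1 = a * b                                        # fold the singular values B into A
m2 = ct                                           # two factors remain
assert np.allclose(m1 @ m2, m)                    # M1 * M2 reconstructs M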
Matrix factorization recommendation builds on this idea: let M1 and M2 be the matrices formed by the latent features of every user and every item, so that their product yields the rating matrix M. From the known data (users' ratings of items), stochastic gradient descent can then estimate the M1 and M2 that best explain the observed ratings, in effect learning each user's and each item's latent attributes. Once learned, the inner product of a user's feature vector with an item's feature vector predicts the score for items that user has never rated.
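Concretely, writing p_u for user u's feature vector and q_i for item i's, each observed rating r_ui drives one SGD step (γ is the learning rate, gama in the code below; λ is the regularization weight, lamda):

    e_ui = r_ui − p_u · q_i
    p_u ← p_u + γ (e_ui · q_i − λ · p_u)
    q_i ← q_i + γ (e_ui · p_u − λ · q_i)

These are exactly the two += updates in getmodel below: λ penalizes large feature values to limit overfitting, and γ decays each epoch (slowrate) so the updates settle down.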
The data used here come from MovieLens, split by hand into a train set and a test set; because the full dataset is large, only part of it is used.
The code is as follows:
# -*- coding: utf-8 -*-
"""
Created on Mon Oct  9 19:33:00 2017

@author: wjw
"""
import pandas as pd
import numpy as np
import os

def difference(left, right, on):  # set difference of two DataFrames
    df = pd.merge(left, right, how='left', on=on)  # 'on' names the columns to join on
    left_columns = left.columns
    col_y = df.columns[-1]  # the last column (right-hand rating, NaN where unmatched)
    df = df[df[col_y].isnull()]  # boolean mask: keep rows that exist only in `left`
    df = df.iloc[:, 0:left_columns.size]  # drop the duplicated right-hand columns
    df.columns = left_columns  # restore the original column names
    return df

def readfile(filepath):  # read the file and build the train and test sets
    pwd = os.getcwd()  # remember the current working directory
    os.chdir(os.path.dirname(filepath))  # os.path.dirname() gives the directory of filepath; chdir() switches to it
    initialdata = pd.read_csv(os.path.basename(filepath))  # basename() gives the file name relative to that directory
    os.chdir(pwd)  # switch back to the previous working directory
    preddata = initialdata.iloc[:, 0:3]  # drop the last (timestamp) column
    newindexdata = preddata.drop_duplicates()
    traindata = newindexdata.sample(axis=0, frac=0.1)  # sample 10% of the data as the training set
    testdata = difference(newindexdata, traindata, ['userId', 'movieId']).sample(axis=0, frac=0.1)
    return traindata, testdata

def getmodel(train):
    slowrate = 0.99
    prermse = 10000000.0
    max_iter = 100
    features = 3
    lamda = 0.2  # regularization weight, guards against overfitting
    gama = 0.01  # learning rate, keeps the SGD updates from overshooting
    user = pd.DataFrame(train.userId.drop_duplicates(), columns=['userId']).reset_index(drop=True)
    # reset_index(drop=True) rebuilds the index and discards the one carried over from the original DataFrame
    movie = pd.DataFrame(train.movieId.drop_duplicates(), columns=['movieId']).reset_index(drop=True)
    usernum = user.count().loc['userId']  # e.g. 671
    movienum = movie.count().loc['movieId']
    userfeatures = np.random.rand(usernum, features)   # random initial feature vectors for users and movies,
    moviefeatures = np.random.rand(movienum, features)  # assuming each user and each movie has 3 features
    userfeaturesframe = user.join(pd.DataFrame(userfeatures, columns=['f1', 'f2', 'f3']))
    moviefeaturesframe = movie.join(pd.DataFrame(moviefeatures, columns=['f1', 'f2', 'f3']))
    userfeaturesframe = userfeaturesframe.set_index('userId')
    moviefeaturesframe = moviefeaturesframe.set_index('movieId')  # index by id so .loc lookups work
    for i in range(max_iter):
        rmse = 0
        n = 0
        for index, row in user.iterrows():
            uid = row.userId
            u_m = train[train['userId'] == uid]  # all ratings this user made in train
            for _, record in u_m.iterrows():
                u_mid = int(record.movieId)
                realrating = record.rating
                userfeature = userfeaturesframe.loc[uid]  # re-read the current user vector (it is updated below)
                moviefeature = moviefeaturesframe.loc[u_mid]
                eui = realrating - np.dot(userfeature, moviefeature)  # prediction error
                rmse += pow(eui, 2)
                n += 1
                # SGD step with L2 regularization
                userfeaturesframe.loc[uid] += gama * (eui * moviefeature - lamda * userfeature)
                moviefeaturesframe.loc[u_mid] += gama * (eui * userfeature - lamda * moviefeature)
        nowrmse = np.sqrt(rmse * 1.0 / n)
        print('step:%d,rmse:%f' % ((i + 1), nowrmse))
        if nowrmse < prermse:
            prermse = nowrmse  # still improving, keep going
        elif nowrmse < 0.5:  # already good enough
            break
        elif nowrmse - prermse <= 0.001:  # improvement has stalled
            break
        gama *= slowrate  # decay the learning rate each epoch
    return userfeaturesframe, moviefeaturesframe

def evaluate(userfeaturesframe, moviefeaturesframe, test):
    test['predictrating'] = np.nan  # add a prediction column
    for index, row in test.iterrows():
        print(index)  # progress indicator
        userid = row.userId
        movieid = row.movieId
        if userid not in userfeaturesframe.index or movieid not in moviefeaturesframe.index:
            continue  # user or movie unseen in training, leave NaN
        userfeature = userfeaturesframe.loc[userid]
        moviefeature = moviefeaturesframe.loc[movieid]
        test.loc[index, 'predictrating'] = np.dot(userfeature, moviefeature)
        # use .loc[index, col] so the assignment writes into test itself

    return test

if __name__ == "__main__":
    filepath = r"e:\学习\研究生\推荐系统\ml-latest-small\ratings.csv"
    train, test = readfile(filepath)
    userfeaturesframe, moviefeaturesframe = getmodel(train)
    result = evaluate(userfeaturesframe, moviefeaturesframe, test)
Running the script produces predictions for the test set; in the resulting predictrating column, NaN marks users or movies that never appeared in the training set, so no prediction could be made for them.
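To reduce the test output to a single number, one can compute the RMSE over the rows that actually received a prediction. This helper is not part of the original script, just an illustrative sketch assuming the result DataFrame from the __main__ block above:

import numpy as np

def test_rmse(result):  # hypothetical helper, not in the original post
    scored = result.dropna(subset=['predictrating'])  # skip rows with no prediction
    errors = scored['rating'] - scored['predictrating']
    return np.sqrt((errors ** 2).mean())

print('test rmse: %f' % test_rmse(result))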
That is all for this article. I hope it helps with your study, and thank you for your support.