[Python] YouTube Trending analysis: which factors are associated with a video becoming popular?
2024-03-16 14:13:22
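1. Problem overview and data source
- YouTube Trends
YouTube Trends is the list of trending videos that YouTube publishes. It is updated daily, but it is not personalized: everyone in a given country sees the same list.
- Dataset
The data comes from Kaggle: the daily trending-video lists for five countries (US, UK, Germany, France, Canada) from 2017/11/14 to 2018/06/14.
- Questions to explore
What makes this dataset interesting to me is that it is multidimensional: time, place (country), different YouTubers and topics, different content categories, and several metrics (views, likes, dislikes, comments). It can be used to explore how views, likes, dislikes, and comments relate to one another; how the popularity of particular YouTubers and topics changes over time (possibly combined with Google Trends, to see how closely YouTube trends match web-wide search trends and whether they lead or lag); and how trending content differs between countries (geographic differences).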
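import json

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Load the CSV files, tag each row with its country, and concatenate them
df = pd.read_csv('CAvideos.csv')
df = df.assign(country='CA')
list_cou = ['DE', 'FR', 'GB', 'US']
for name in list_cou:
    temp = pd.read_csv(name + 'videos.csv')
    temp = temp.assign(country=name)
    # ignore_index gives the combined frame one continuous row index
    df = pd.concat([df, temp], ignore_index=True)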
# Parse the date columns
# trending_date is stored as year.day.month, e.g. 17.14.11
df['trending_date'] = pd.to_datetime(df['trending_date'], format='%y.%d.%m')
df['trending_date'] = df['trending_date'].dt.date
df['publish_time'] = pd.to_datetime(df['publish_time'], format='%Y-%m-%dT%H:%M:%S.%fZ')
# Split publish_time into a date column and a time-of-day column
df = df.assign(publish_date=df['publish_time'].dt.date)
df['publish_time'] = df['publish_time'].dt.time
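The two format strings are easy to mix up, so here is a minimal sketch of what they do, using two made-up rows shaped like the CSV columns (the values are illustrative, not taken from the dataset). The point to notice is that `%y.%d.%m` reads the fields as year.day.month.

```python
import pandas as pd

# Toy values in the same two formats the CSV columns use (made up for illustration).
raw = pd.DataFrame({
    "trending_date": ["17.14.11", "18.01.02"],
    "publish_time": ["2017-11-10T17:00:03.000Z", "2018-01-30T08:15:00.000Z"],
})

# '%y.%d.%m' means year.day.month, so "18.01.02" is 1 February 2018, not 2 January.
trending = pd.to_datetime(raw["trending_date"], format="%y.%d.%m")
# The trailing 'Z' is matched as a literal character, so the result is timezone-naive.
published = pd.to_datetime(raw["publish_time"], format="%Y-%m-%dT%H:%M:%S.%fZ")

print(trending.dt.date.tolist())
print(published.dt.time.tolist())
```

Getting the day and month the wrong way round would either raise an error (there is no month 14) or silently shift dates, so it is worth checking a few rows by hand.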
The category names are stored separately in JSON files. They are read and attached to the DataFrame as follows:
# Load the category names
for name in list_cou:
    id_to_category = {}
    file = name + '_category_id.json'
    with open(file, 'r') as f:
        data = json.load(f)
    for category in data['items']:
        id_to_category[category['id']] = category['snippet']['title']
    print(id_to_category)
# The printed id-to-name dictionaries turn out to be the same for every country,
# so the last one loaded is used for all rows
df['category_id'] = df['category_id'].astype(str)
df.insert(4, 'category', df['category_id'].map(id_to_category))
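Rather than comparing the printed dictionaries by eye, the "same mapping in every country" observation can be checked programmatically. The sketch below is one possible way to do it; `id_to_title` and `mapping_differences` are hypothetical helper names, and the payloads are invented stand-ins shaped like the category JSON files, not real data.

```python
def id_to_title(payload):
    """Build {category id: title} from a parsed *_category_id.json payload."""
    return {item["id"]: item["snippet"]["title"] for item in payload["items"]}


def mapping_differences(mappings):
    """Return {id: {country: title}} for ids whose title differs or is missing somewhere."""
    all_ids = set().union(*(m.keys() for m in mappings.values()))
    diffs = {}
    for cat_id in sorted(all_ids, key=int):
        titles = {country: m.get(cat_id) for country, m in mappings.items()}
        if len(set(titles.values())) > 1:
            diffs[cat_id] = titles
    return diffs


# Hypothetical payloads for two placeholder countries (not real data).
payloads = {
    "XX": {"items": [{"id": "10", "snippet": {"title": "Music"}},
                     {"id": "23", "snippet": {"title": "Comedy"}}]},
    "YY": {"items": [{"id": "10", "snippet": {"title": "Music"}},
                     {"id": "23", "snippet": {"title": "Comedy"}},
                     {"id": "29", "snippet": {"title": "Nonprofits & Activism"}}]},
}
mappings = {country: id_to_title(p) for country, p in payloads.items()}
print(mapping_differences(mappings))
```

An empty result means the mappings agree. A non-empty one shows which ids would be left as NaN by `map` if the wrong country's dictionary were used.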
The DataFrame after cleaning:
df.head()
Out[31]:
video_id trending_date \
0 n1WpP7iowLc 2017-11-14
1 0dBIkQ4Mz1M 2017-11-14
2 5qpjK5DgCt4 2017-11-14
3 d380meD0W0M 2017-11-14
4 2Vv-BfVoq4g 2017-11-14
title channel_title \
0 Eminem - Walk On Water (Audio) ft. Beyoncé EminemVEVO
1 PLUSH - Bad Unboxing Fan Mail iDubbbzTV
2 Racist Superman | Rudy Mancuso, King Bach & Le... Rudy Mancuso
3 I Dare You: GOING BALD!? nigahiga
4 Ed Sheeran - Perfect (Official Music Video) Ed Sheeran
category category_id publish_time \
0 Music 10 17:00:03
1 Comedy 23 17:00:00
2 Comedy 23 19:05:24
3 Entertainment 24 18:01:41
4 Music 10 11:04:14
tags views likes \
0 Eminem|"Walk"|"On"|"Water"|"Aftermath/Shady/In... 17158579 787425
1 plush|"bad unboxing"|"unboxing"|"fan mail"|"id... 1014651 127794
2 racist superman|"rudy"|"mancuso"|"king"|"bach"... 3191434 146035
3 ryan|"higa"|"higatv"|"nigahiga"|"i dare you"|"... 2095828 132239
4 edsheeran|"ed sheeran"|"acoustic"|"live"|"cove... 33523622 1634130
dislikes comment_count thumbnail_link \
0 43420 125882 https://i.ytimg.com/vi/n1WpP7iowLc/default.jpg
1 1688 13030 https://i.ytimg.com/vi/0dBIkQ4Mz1M/default.jpg
2 5339 8181 https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg
3 1989 17518 https://i.ytimg.com/vi/d380meD0W0M/default.jpg
4 21082 85067 https://i.ytimg.com/vi/2Vv-BfVoq4g/default.jpg
comments_disabled ratings_disabled video_error_or_removed \
0 False False False
1 False False False
2 False False False
3 False False False
4 False False False
description country publish_date
0 Eminem's new track Walk on Water ft. Beyoncé i... CA 2017-11-10
1 STill got a lot of packages. Probably will las... CA 2017-11-13
2 WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► http... CA 2017-11-12
3 I know it's been a while since we did this sho... CA 2017-11-12
4 ?: https://ad.gt/yt-perfect\n?: https://atlant... CA 2017-11-09
Each country contributes roughly the same share of the rows:
df.country.value_counts()
Out[9]:
US 40949
CA 40881
DE 40840
FR 40724
GB 38916
Name: country, dtype: int64
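The counts above are rows, and a video that trends for several days contributes several rows. A small sketch, on made-up data, of how one might look at both the row share and the number of distinct videos per country (the column names match the DataFrame above, the values do not come from it):

```python
import pandas as pd

# Toy rows: one row per (video, trending day); values are invented.
toy = pd.DataFrame({
    "video_id": ["a", "a", "b", "c", "c", "c"],
    "country":  ["US", "US", "US", "GB", "GB", "GB"],
})

# Share of rows per country (what value_counts shows, as proportions).
row_share = toy["country"].value_counts(normalize=True)
# Number of distinct videos per country, ignoring repeat appearances.
unique_videos = toy.groupby("country")["video_id"].nunique()

print(row_share)
print(unique_videos)
```

In this toy case both countries have the same row share, yet one has twice as many distinct videos, which is the kind of difference the row counts alone would hide.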
## 3. YouTube Trends EDA ##
**- Popular categories: Music, Entertainment, and People & Blogs are the most popular**
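# Share of trending rows per category; a bar chart of proportions suits
# categorical data better than a numeric histogram
cat_share = df['category'].value_counts(normalize=True)
plt.bar(cat_share.index, cat_share.values, alpha=0.5, color='steelblue', edgecolor='blue')
plt.xticks(rotation=90)
plt.title('popular categories')
plt.ylabel('share of trending videos')
plt.show()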
- Average time from publication to trending, by category: the delay is not proportional to how popular the category is
# Average time between publication and trending for each category
# Convert both columns to datetime so the difference is a timedelta, then take whole days
df = df.assign(days=(pd.to_datetime(df.trending_date) - pd.to_datetime(df.publish_date)).dt.days)
# Average number of days taken to become trending
avg_days = df.groupby(['category'])['days'].mean()
avg_days = avg_days.sort_values()
barlist = plt.bar(avg_days.index, avg_days)
plt.xticks(rotation=90)
plt.ylabel('days taken from publish to trending')
# Highlight the three most popular categories by name rather than by bar position
for bar, name in zip(barlist, avg_days.index):
    if name in ('Music', 'Entertainment', 'People & Blogs'):
        bar.set_color('orange')
plt.show()
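The three most popular categories are highlighted in orange. The time a category needs to reach the trending list is not proportional to how popular the category is.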
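A per-category mean can be pulled far upward by a handful of old videos that trend years after publication, so it may be worth looking at the median next to it. The sketch below uses invented rows to show the effect; whether the real data contains such outliers is not checked here.

```python
import pandas as pd

# Toy rows (invented): one Music video trends eight years after it was published.
toy = pd.DataFrame({
    "category": ["Music", "Music", "Music", "Comedy", "Comedy"],
    "publish_date": ["2018-01-01", "2018-01-02", "2010-01-03", "2018-01-01", "2018-01-05"],
    "trending_date": ["2018-01-02", "2018-01-04", "2018-01-03", "2018-01-03", "2018-01-06"],
})

# Whole days between publication and trending.
days = (pd.to_datetime(toy["trending_date"]) - pd.to_datetime(toy["publish_date"])).dt.days
# Mean and median side by side, per category.
summary = days.groupby(toy["category"]).agg(["mean", "median"])
print(summary)
```

Here the Music mean is dominated by the single old video while the median stays at 2 days, so a ranking of categories by mean delay and one by median delay can look quite different.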
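- Relationships among views, comments, likes, and dislikes
Running this on the full dataset is slow, and there is enough data that sampling is acceptable. Take a 20% sample and use a seaborn pairplot to look at the relationships: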
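# Keep only videos with comments enabled, then draw a 20% random sample
cols = ['views', 'likes', 'dislikes', 'comment_count', 'country']
# random_state fixes the seed so the sample is reproducible
sample = df.loc[df['comments_disabled'] == False, cols].sample(frac=0.2, random_state=0)
p_resp = sns.pairplot(sample, hue='country')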
[Figure: pairplot of views, likes, dislikes, and comment_count, colored by country]
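Discussion: the three panels boxed in the figure carry some interesting information. In the red box (likes vs. views), likes grow roughly in proportion to views. In the black and blue boxes, dislikes mostly grow slowly as views and likes increase, but there are clearly also quite a few cases where dislikes grow comparatively fast, and these two growth patterns are sharply distinct. In the red and black boxes, videos on the UK trending list receive fewer likes and dislikes than US trending videos with the same number of views.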
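The visual impression from a pairplot can be backed up with a correlation matrix. Because counts like these are usually heavily right-skewed, the sketch below computes a Pearson correlation on log-transformed values and a rank-based Spearman correlation. The data is synthetic (generated from a random seed, only to make the example runnable), so the numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic, heavily skewed counts; NOT the real data.
rng = np.random.default_rng(0)
views = rng.lognormal(mean=12, sigma=1.5, size=500)
toy = pd.DataFrame({
    "views": views,
    "likes": views * rng.uniform(0.01, 0.05, size=500),
    "dislikes": views * rng.uniform(0.0005, 0.005, size=500),
    "comment_count": views * rng.uniform(0.001, 0.01, size=500),
})

# Pearson on log(1 + x) reduces the influence of a few huge videos.
pearson_log = np.log1p(toy).corr(method="pearson")
# Spearman only uses ranks, so it is insensitive to the skew.
spearman = toy.corr(method="spearman")

print(pearson_log.round(2))
print(spearman.round(2))
```

On the real sample, the same two calls could be applied to `sample[['views', 'likes', 'dislikes', 'comment_count']]`, optionally per country via `groupby('country')`.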
Let's take a closer look at views, likes, and dislikes:
- For the vast majority of videos, likes are much easier to earn than dislikes
sns.scatterplot(x='likes', y='dislikes', data=sample, hue='country')
# Reference line likes = dislikes; two endpoints are enough to draw it
lim = sample['dislikes'].max()
plt.plot([0, lim], [0, lim], label='likes=dislikes', color='k', linestyle='--')
plt.legend()
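"The vast majority" can be turned into a number by computing the share of videos that sit below the likes = dislikes line. A minimal sketch on invented rows (the column names follow the DataFrame above, the values are made up):

```python
import pandas as pd

# Toy rows (invented values).
toy = pd.DataFrame({
    "country": ["US", "US", "US", "GB", "GB"],
    "likes": [100, 50, 5, 30, 8],
    "dislikes": [10, 5, 20, 3, 2],
})

# True where a video has more likes than dislikes.
more_likes = toy["likes"] > toy["dislikes"]
# The mean of a boolean Series is the fraction of True values.
overall_share = more_likes.mean()
share_by_country = more_likes.groupby(toy["country"]).mean()

print(overall_share)
print(share_by_country)
```

Applied to `sample`, the same three lines would give the fraction of trending videos with more likes than dislikes, overall and per country.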
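- Comparing the per-country linear regressions of likes/dislikes on views: UK trending videos do get the weakest response at a given number of views, clearly lower than the other countries.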
from matplotlib.ticker import FuncFormatter

# One regression line per country
sns.lmplot(x='views', y='dislikes', data=sample, hue='country', scatter_kws={'alpha': 0.3})
plt.xlim(0, 300 * 1e6)
plt.ylim(0, 0.6 * 1e6)

def scient(y, position):
    # Show tick values in millions
    return str(y / 1e6)

formatter = FuncFormatter(scient)
plt.gca().yaxis.set_major_formatter(formatter)
plt.gca().xaxis.set_major_formatter(formatter)
plt.xlabel('views(1e6)')
plt.ylabel('dislikes(1e6)')
plt.title('views vs. dislikes')
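The plot shows the fitted lines but not their slopes. One way to put numbers on the comparison is to fit a straight line per country with `numpy.polyfit`, which is an ordinary least-squares fit of the same kind as the lines drawn above. This is a sketch on invented, perfectly linear data; `slope_per_country` is a hypothetical helper, not part of the original analysis.

```python
import numpy as np
import pandas as pd


def slope_per_country(frame, x="views", y="dislikes"):
    """Least-squares slope of y on x, fitted separately for each country."""
    slopes = {}
    for country, group in frame.groupby("country"):
        # deg=1 fits y = slope * x + intercept
        slope, _intercept = np.polyfit(group[x].to_numpy(float), group[y].to_numpy(float), deg=1)
        slopes[country] = slope
    return pd.Series(slopes).sort_values()


# Toy data (invented): GB gets half as many dislikes per view as US.
toy = pd.DataFrame({
    "country": ["US"] * 4 + ["GB"] * 4,
    "views": [1e5, 2e5, 3e5, 4e5] * 2,
    "dislikes": [200, 400, 600, 800, 100, 200, 300, 400],
})
slopes = slope_per_country(toy)
print(slopes)
```

Calling `slope_per_country(sample)` and `slope_per_country(sample, y='likes')` would give the dislikes-per-view and likes-per-view slopes for each country, which can then be compared directly.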
(To be continued...)