Airbin_0720
程序员文章站
2022-04-10 20:58:22
...
项目背景
数据是从kaggle下载的,Airbnb是一个让大众出租住宿民宿的网站,提供短期出租房屋或房间的服务,并且以其独特性的居住体验发展迅速,这里我们拿到的数据是2019年纽约的民宿数据。
- 这里需要注意一点,有的压缩包解压之后直接使用csv文件会有问题,正确的方式是,1)右击csv文件,打开方式选择txt,然后保存,Encloding选ANSI保存,2)再使用Excel打开csv文件,然后另存为,保存类型选择CSV UTF-8(逗号分隔),这样文件可以解决基本的中文乱码/列错位等等问题;
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
df = pd.read_csv('AB_NYC_2019v.csv',encoding='utf_8_sig')
df.head()
id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 34177117 | Sonder | The Biltmore | 1BR | 219517861 | Sonder (NYC) | Manhattan | Theater District | 40.76118 | -73.98635 | Entire home/apt | 699 | 29 | 0 | NaN | NaN | 327 | 365 |
1 | 34202946 | Sonder | 21 Chelsea | Stunning 1BR + Rooftop | 219517861 | Sonder (NYC) | Manhattan | Chelsea | 40.74285 | -73.99595 | Entire home/apt | 277 | 29 | 0 | NaN | NaN | 327 | 365 |
2 | 35005862 | Sonder | 21 Chelsea | Classic 1BR + Rooftop | 219517861 | Sonder (NYC) | Manhattan | Chelsea | 40.74288 | -73.99438 | Entire home/apt | 275 | 29 | 0 | NaN | NaN | 327 | 365 |
3 | 33528090 | Sonder | 21 Chelsea | Vibrant 1BR + Rooftop | 219517861 | Sonder (NYC) | Manhattan | Chelsea | 40.74250 | -73.99443 | Entire home/apt | 260 | 29 | 0 | NaN | NaN | 327 | 365 |
4 | 34570839 | Sonder | 21 Chelsea | Quaint 1BR + Rooftop | 219517861 | Sonder (NYC) | Manhattan | Chelsea | 40.74305 | -73.99415 | Entire home/apt | 254 | 29 | 0 | NaN | NaN | 327 | 365 |
df.isnull().sum()
id 0
name 16
host_id 0
host_name 21
neighbourhood_group 0
neighbourhood 0
latitude 0
longitude 0
room_type 0
price 0
minimum_nights 0
number_of_reviews 0
last_review 10052
reviews_per_month 10052
calculated_host_listings_count 0
availability_365 0
dtype: int64
df.describe()
id | host_id | latitude | longitude | price | minimum_nights | number_of_reviews | reviews_per_month | calculated_host_listings_count | availability_365 | |
---|---|---|---|---|---|---|---|---|---|---|
count | 4.889500e+04 | 4.889500e+04 | 48895.000000 | 48895.000000 | 48895.000000 | 48895.000000 | 48895.000000 | 38843.000000 | 48895.000000 | 48895.000000 |
mean | 1.901714e+07 | 6.762001e+07 | 40.728949 | -73.952170 | 152.720687 | 7.029962 | 23.274466 | 1.373221 | 7.143982 | 112.781327 |
std | 1.098311e+07 | 7.861097e+07 | 0.054530 | 0.046157 | 240.154170 | 20.510550 | 44.550582 | 1.680442 | 32.952519 | 131.622289 |
min | 2.539000e+03 | 2.438000e+03 | 40.499790 | -74.244420 | 0.000000 | 1.000000 | 0.000000 | 0.010000 | 1.000000 | 0.000000 |
25% | 9.471945e+06 | 7.822033e+06 | 40.690100 | -73.983070 | 69.000000 | 1.000000 | 1.000000 | 0.190000 | 1.000000 | 0.000000 |
50% | 1.967728e+07 | 3.079382e+07 | 40.723070 | -73.955680 | 106.000000 | 3.000000 | 5.000000 | 0.720000 | 1.000000 | 45.000000 |
75% | 2.915218e+07 | 1.074344e+08 | 40.763115 | -73.936275 | 175.000000 | 5.000000 | 24.000000 | 2.020000 | 2.000000 | 227.000000 |
max | 3.648724e+07 | 2.743213e+08 | 40.913060 | -73.712990 | 10000.000000 | 1250.000000 | 629.000000 | 58.500000 | 327.000000 | 365.000000 |
描述性统计分析:
- 一共有约48000+条数据
- 价格[price]的均值为152美元,中位数106美元,数据向右/高偏移;存在极高/低异常值;
- 最少预订天/夜晚[minimum_nights]存在极高异常值;(如果有特殊情况,结合实际处理,这里以常规情况为例)
- 每位房东的房源数量[calculated_host_listings_count]均值是7,中位数是1,说明大部分房东都只有1个房源,存在极高异常值;
- 最近一次浏览日期[last_review]和每月浏览数量均值[reviews_per_month]都有10000+个缺失值,0填充;
1.数据处理
df['last_review'].fillna(0,inplace=True)
df['reviews_per_month'].fillna(0,inplace=True)
df['unit_price'] = (df.price / df.minimum_nights) #增加一列字段 均价
df.unit_price.plot() #查看单价的大概分布
<matplotlib.axes._subplots.AxesSubplot at 0x19f1beffa20>
df.unit_price.describe()
count 48895.000000
mean 70.174247
std 157.620388
min 0.000000
25% 20.000000
50% 44.500000
75% 81.500000
max 8000.000000
Name: unit_price, dtype: float64
df[df['unit_price'] > 500].count()
id 316
name 316
host_id 316
host_name 316
neighbourhood_group 316
neighbourhood 316
latitude 316
longitude 316
room_type 316
price 316
minimum_nights 316
number_of_reviews 316
last_review 316
reviews_per_month 316
calculated_host_listings_count 316
availability_365 316
unit_price 316
dtype: int64
- 有316条房间单价unit_price大于500,占总样本数的0.64%,根据统计学,这部分数据我们做删除处理;
df = df[df.unit_price < 500]
df = df[df.unit_price > 1]
df.describe()
id | host_id | latitude | longitude | price | minimum_nights | number_of_reviews | reviews_per_month | calculated_host_listings_count | availability_365 | unit_price | |
---|---|---|---|---|---|---|---|---|---|---|---|
count | 4.822100e+04 | 4.822100e+04 | 48221.000000 | 48221.000000 | 48221.000000 | 48221.000000 | 48221.000000 | 48221.000000 | 48221.000000 | 48221.000000 | 48221.000000 |
mean | 1.903368e+07 | 6.764915e+07 | 40.728922 | -73.952029 | 143.473466 | 6.173472 | 23.470521 | 1.100040 | 7.190871 | 111.849899 | 61.690299 |
std | 1.097072e+07 | 7.860904e+07 | 0.054550 | 0.046198 | 176.823568 | 10.942889 | 44.688066 | 1.602071 | 33.161159 | 131.099858 | 63.639782 |
min | 2.539000e+03 | 2.438000e+03 | 40.499790 | -74.244420 | 10.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 1.016667 |
25% | 9.488762e+06 | 7.833913e+06 | 40.690040 | -73.982910 | 69.000000 | 1.000000 | 1.000000 | 0.040000 | 1.000000 | 0.000000 | 20.000000 |
50% | 1.969877e+07 | 3.086886e+07 | 40.722960 | -73.955570 | 105.000000 | 3.000000 | 5.000000 | 0.380000 | 1.000000 | 44.000000 | 44.000000 |
75% | 2.914645e+07 | 1.074344e+08 | 40.763170 | -73.936120 | 175.000000 | 5.000000 | 24.000000 | 1.610000 | 2.000000 | 224.000000 | 80.000000 |
max | 3.648724e+07 | 2.743213e+08 | 40.913060 | -73.712990 | 10000.000000 | 365.000000 | 629.000000 | 58.500000 | 327.000000 | 365.000000 | 499.500000 |
df.unit_price.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x19f178b8208>
2.民宿区域分析
- 从neighbourhood_group/neighbourhood两个维度查看不同区域民宿数量分布
df_pie = df.neighbourhood_group.value_counts().reset_index()
df_pie
index | neighbourhood_group | |
---|---|---|
0 | Manhattan | 21287 |
1 | *lyn | 19880 |
2 | Queens | 5614 |
3 | Bronx | 1070 |
4 | Staten Island | 370 |
plt.figure(figsize=(15,9.3))
explode = (0.1,0,0,0,0) #为了使最大占比区域分离
plt.pie(x='neighbourhood_group',labels='index',explode=explode,autopct='%1.1f%%',data=df_pie)
plt.title("The Rate of different Landmark Point")
plt.show()
pivot = df.pivot_table(index='neighbourhood',values='id',aggfunc='count')
df_pie_2 = pivot.sort_values(by='id',ascending=False).reset_index().head(20) #这里只取值前20,否则显示太多
df_pie_2.head()
neighbourhood | id | |
---|---|---|
0 | Williamsburg | 3888 |
1 | Bedford-Stuyvesant | 3666 |
2 | Harlem | 2638 |
3 | Bushwick | 2445 |
4 | Upper West Side | 1936 |
plt.figure(figsize=(15,9.3))
plt.pie(x='id',labels='neighbourhood',autopct='%1.1f%%',data=df_pie_2)
plt.show()
plt.figure(figsize=(10,8))
pivot[pivot.id > 2500].sort_values(by='id',ascending=False).plot.bar()
plt.xticks(rotation=30)
(array([0, 1, 2]), <a list of 3 Text xticklabel objects>)
<Figure size 720x576 with 0 Axes>
- 房间最多的区域前3个区域是Williamsburg 3888,Bedford-Stuyvesant 3666,Harlem 2638
3.对比Top3区域的均价
df1 = df[['neighbourhood','unit_price']].groupby(by=['neighbourhood','unit_price']).count().reset_index()
df1.head()
neighbourhood | unit_price | |
---|---|---|
0 | Allerton | 5.000000 |
1 | Allerton | 12.142857 |
2 | Allerton | 15.750000 |
3 | Allerton | 16.333333 |
4 | Allerton | 17.000000 |
top3 = ['Williamsburg','Bedford-Stuyvesant','Harlem']
for i in top3:
plt.hist(df1[df1.neighbourhood == i].unit_price,bins=20)
plt.xlabel(i)
plt.show()
- 前两个价格区间类似,都集中在20美元以下,最后一个集中在20美元左右,之后都呈递减趋势;
- 值得注意的是,前3个区域的民宿价格根据数量递减,最高价格也呈递减趋势;
- 第一多的Williamsburg最高房价高达500,而第二多的Bedford-Stuyvesant价格最高达400多(小于500),而第三的Harlem则仅需350;
4.最受欢迎的民宿
df.sort_values(by=['number_of_reviews','reviews_per_month'],ascending=False).iloc[:10,:] #注意by后面参数的顺序影响结果
id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 | unit_price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5321 | 9145202 | Room near JFK Queen Bed | 47621202 | Dona | Queens | Jamaica | 40.66730 | -73.76831 | Private room | 47 | 1 | 629 | 2019/7/5 | 14.58 | 2 | 333 | 47.0 |
8515 | 903972 | Great Bedroom in Manhattan | 4734398 | Jj | Manhattan | Harlem | 40.82085 | -73.94025 | Private room | 49 | 1 | 607 | 2019/6/21 | 7.75 | 3 | 293 | 49.0 |
4259 | 903947 | Beautiful Bedroom in Manhattan | 4734398 | Jj | Manhattan | Harlem | 40.82124 | -73.93838 | Private room | 49 | 1 | 597 | 2019/6/23 | 7.72 | 3 | 342 | 49.0 |
4709 | 891117 | Private Bedroom in Manhattan | 4734398 | Jj | Manhattan | Harlem | 40.82264 | -73.94041 | Private room | 49 | 1 | 594 | 2019/6/15 | 7.57 | 3 | 339 | 49.0 |
15330 | 10101135 | Room Near JFK Twin Beds | 47621202 | Dona | Queens | Jamaica | 40.66939 | -73.76975 | Private room | 47 | 1 | 576 | 2019/6/27 | 13.40 | 2 | 173 | 47.0 |
15923 | 8168619 | Steps away from Laguardia airport | 37312959 | Maya | Queens | East Elmhurst | 40.77006 | -73.87683 | Private room | 46 | 1 | 543 | 2019/7/1 | 11.59 | 5 | 163 | 46.0 |
14803 | 834190 | Manhattan Lux Loft.Like.Love.Lots.Look ! | 2369681 | Carol | Manhattan | Lower East Side | 40.71921 | -73.99116 | Private room | 99 | 2 | 540 | 2019/7/6 | 6.95 | 1 | 179 | 49.5 |
4427 | 16276632 | Cozy Room Family Home LGA Airport NO CLEANING FEE | 26432133 | Danielle | Queens | East Elmhurst | 40.76335 | -73.87007 | Private room | 48 | 1 | 510 | 2019/7/6 | 16.22 | 5 | 341 | 48.0 |
10001 | 3474320 | Private brownstone studio *lyn | 12949460 | Asa | *lyn | Park Slope | 40.67926 | -73.97711 | Entire home/apt | 160 | 1 | 488 | 2019/7/1 | 8.14 | 1 | 269 | 160.0 |
44566 | 166172 | LG Private Room/Family Friendly | 792159 | Wanda | *lyn | Bushwick | 40.70283 | -73.92131 | Private room | 60 | 3 | 480 | 2019/7/7 | 6.70 | 1 | 0 | 20.0 |
- 前10个最受欢迎的民宿,其中4家在Manhattan,4家在Queens,2家在*lyn
- 其中前8个民宿均价都在50上下浮动,最末有1个极高值160,有1个极低值20,再看房型,160的房型是整屋出租,其他都是单间
- 前10家民宿的浏览量在480~630之间稳定浮动
- 前10家民宿中,排前5的房子由2个人经营,即一个人开多家民宿,猜测是有专业的运营团队,后5个目前不确定(因为这里只显示高人气房子)
下面我们看下排行后面5个的房东是否只经营1家民宿,从后面的全部数据中筛选
df[df['host_name'] == 'Maya'].count()
id 46
name 46
host_id 46
host_name 46
neighbourhood_group 46
neighbourhood 46
latitude 46
longitude 46
room_type 46
price 46
minimum_nights 46
number_of_reviews 46
last_review 46
reviews_per_month 46
calculated_host_listings_count 46
availability_365 46
unit_price 46
dtype: int64
df['host_name'].isin(['Carol']).value_counts()
False 48176
True 45
Name: host_name, dtype: int64
df['host_name'].isin(['Danielle']).value_counts()
False 48145
True 76
Name: host_name, dtype: int64
df['host_name'].isin(['Asa']).value_counts()
False 48220
True 1
Name: host_name, dtype: int64
df['host_name'].isin(['Wanda']).value_counts()
False 48217
True 4
Name: host_name, dtype: int64
- 这里通过两种方法可以求得后5个人气民宿的房东所拥有的民宿数量,排行前10的民宿中,有9家都是团队运营,只有一家房东Asa是个人运营;
- 可能Airbin更趋于一种商业模式,而非最初的出发点,把自己闲置的房间share给旅行者;
- 有点类似中国的二手房东,只是目标客户不再是本地人,而是旅行者,租赁方式也由长租改为短租;
- 不过Airbin经过这么久的发展,已经是成熟的市场,团队应该比个人更专业些,排行靠前也就不足为奇了;
下面我们算下个人和团体的比例。计算方式:房东名下只有1套房的的视为个体,1次以上的视为团体
df2 = df['host_name'].value_counts()
df2.head()
Michael 400
David 390
Sonder (NYC) 327
John 291
Alex 272
Name: host_name, dtype: int64
df2.describe()
count 11355.000000
mean 4.244826
std 14.076734
min 1.000000
25% 1.000000
50% 1.000000
75% 2.000000
max 400.000000
Name: host_name, dtype: float64
(df2 > 1).sum() #团体经营
4518
(df2 == 1).sum() #个体经营
6837
- 个体经营者约6837人,团体经营者约4518人,个人运营者更多,不过注意前者只有1套房子,后者有多套房子
- 一共有48895条数据,减去6837条个人运营的1个房子,剩下的都是团体运营的房子,团体拥有总房子数量为个人的6倍左右;
5. 受欢迎的房子特点
keyword = df['name'].to_frame().to_csv('hotword.txt', index=False, sep=',', header=None) #方法to_frame()将Series转DataFrame
from wordcloud import WordCloud
text = open('hotword.txt',encoding = 'utf_8_sig').read()
wordcloud = WordCloud(background_color="white",max_font_size=50).generate(text)
plt.figure(figsize=(10,6.18))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')
plt.show()
通过词云呈现出高热点搜索词:
- 搜索词最多的是Private/Apartment/Bedroom等房型,所以一定要标注出房价类型
- 其次出现heart of/Central Park/In *lyn/East Village等描述地理位置的词汇
- 同样出现的有形容词类如Beautiful/Cozy/Bright/Charming/Moden/Lovely/Quiet等
- 房东在添加标题的时候可以考虑加入高频搜索词汇,以吸引更多的人浏览
后记:这里只是粗略的进行分析,如果有面积,实际上更深入的还可以利用统计学/机器学习领域的最小二乘法去分析下定价,或者天气等不同因素的影响,挖掘更深度的东西;
已经开始看机器学习了,不确定是否用得上,不过一口气听了10多集,觉得还行,另外,在学习machine learning之前,建议先打好基础,包括统计学,很多人推荐很多经典书,这里推荐一本深入浅出统计学,我刷了2遍,竟然发现machine learning很多用到统计学的。
最近看的一本书里提到一句话,大概意思是:数据分析只能在现有的格局里找到最优解,但是人类的思想却可以突破当下的困局,从而带来全面的革新。所以数据分析有优势,但也确实有其弊端。
继续学习吧~
推荐阅读