欢迎您访问程序员文章站本站旨在为大家提供分享程序员计算机编程知识!
您现在的位置是: 首页

Airbin_0720

程序员文章站 2022-04-10 20:58:22
...

项目背景

数据是从kaggle下载的,Airbnb是一个让大众出租住宿民宿的网站,提供短期出租房屋或房间的服务,并且以其独特性的居住体验发展迅速,这里我们拿到的数据是2019年纽约的民宿数据。

  • 这里需要注意一点,有的压缩包解压之后直接使用csv文件会有问题,正确的方式是,1)右击csv文件,打开方式选择txt,然后保存,Encloding选ANSI保存,2)再使用Excel打开csv文件,然后另存为,保存类型选择CSV UTF-8(逗号分隔),这样文件可以解决基本的中文乱码/列错位等等问题;
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
df = pd.read_csv('AB_NYC_2019v.csv',encoding='utf_8_sig')
df.head()
id name host_id host_name neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365
0 34177117 Sonder | The Biltmore | 1BR 219517861 Sonder (NYC) Manhattan Theater District 40.76118 -73.98635 Entire home/apt 699 29 0 NaN NaN 327 365
1 34202946 Sonder | 21 Chelsea | Stunning 1BR + Rooftop 219517861 Sonder (NYC) Manhattan Chelsea 40.74285 -73.99595 Entire home/apt 277 29 0 NaN NaN 327 365
2 35005862 Sonder | 21 Chelsea | Classic 1BR + Rooftop 219517861 Sonder (NYC) Manhattan Chelsea 40.74288 -73.99438 Entire home/apt 275 29 0 NaN NaN 327 365
3 33528090 Sonder | 21 Chelsea | Vibrant 1BR + Rooftop 219517861 Sonder (NYC) Manhattan Chelsea 40.74250 -73.99443 Entire home/apt 260 29 0 NaN NaN 327 365
4 34570839 Sonder | 21 Chelsea | Quaint 1BR + Rooftop 219517861 Sonder (NYC) Manhattan Chelsea 40.74305 -73.99415 Entire home/apt 254 29 0 NaN NaN 327 365
df.isnull().sum()
id                                    0
name                                 16
host_id                               0
host_name                            21
neighbourhood_group                   0
neighbourhood                         0
latitude                              0
longitude                             0
room_type                             0
price                                 0
minimum_nights                        0
number_of_reviews                     0
last_review                       10052
reviews_per_month                 10052
calculated_host_listings_count        0
availability_365                      0
dtype: int64
df.describe()
id host_id latitude longitude price minimum_nights number_of_reviews reviews_per_month calculated_host_listings_count availability_365
count 4.889500e+04 4.889500e+04 48895.000000 48895.000000 48895.000000 48895.000000 48895.000000 38843.000000 48895.000000 48895.000000
mean 1.901714e+07 6.762001e+07 40.728949 -73.952170 152.720687 7.029962 23.274466 1.373221 7.143982 112.781327
std 1.098311e+07 7.861097e+07 0.054530 0.046157 240.154170 20.510550 44.550582 1.680442 32.952519 131.622289
min 2.539000e+03 2.438000e+03 40.499790 -74.244420 0.000000 1.000000 0.000000 0.010000 1.000000 0.000000
25% 9.471945e+06 7.822033e+06 40.690100 -73.983070 69.000000 1.000000 1.000000 0.190000 1.000000 0.000000
50% 1.967728e+07 3.079382e+07 40.723070 -73.955680 106.000000 3.000000 5.000000 0.720000 1.000000 45.000000
75% 2.915218e+07 1.074344e+08 40.763115 -73.936275 175.000000 5.000000 24.000000 2.020000 2.000000 227.000000
max 3.648724e+07 2.743213e+08 40.913060 -73.712990 10000.000000 1250.000000 629.000000 58.500000 327.000000 365.000000

描述性统计分析:

  • 一共有约48000+条数据
  • 价格[price]的均值为152美元,中位数106美元,数据向右/高偏移;存在极高/低异常值;
  • 最少预订天/夜晚[minimum_nights]存在极高异常值;(如果有特殊情况,结合实际处理,这里以常规情况为例)
  • 每位房东的房源数量[calculated_host_listings_count]均值是7,中位数是1,说明大部分房东都只有1个房源,存在极高异常值;
  • 最近一次浏览日期[last_review]和每月浏览数量均值[reviews_per_month]都有10000+个缺失值,0填充;

1.数据处理

df['last_review'].fillna(0,inplace=True)
df['reviews_per_month'].fillna(0,inplace=True)
df['unit_price'] = (df.price / df.minimum_nights)  #增加一列字段 均价
df.unit_price.plot()  #查看单价的大概分布
<matplotlib.axes._subplots.AxesSubplot at 0x19f1beffa20>

Airbin_0720

df.unit_price.describe()
count    48895.000000
mean        70.174247
std        157.620388
min          0.000000
25%         20.000000
50%         44.500000
75%         81.500000
max       8000.000000
Name: unit_price, dtype: float64
df[df['unit_price'] > 500].count()
id                                316
name                              316
host_id                           316
host_name                         316
neighbourhood_group               316
neighbourhood                     316
latitude                          316
longitude                         316
room_type                         316
price                             316
minimum_nights                    316
number_of_reviews                 316
last_review                       316
reviews_per_month                 316
calculated_host_listings_count    316
availability_365                  316
unit_price                        316
dtype: int64
  • 有316条房间单价unit_price大于500,占总样本数的0.64%,根据统计学,这部分数据我们做删除处理;
df = df[df.unit_price < 500]
df = df[df.unit_price > 1]
df.describe()
id host_id latitude longitude price minimum_nights number_of_reviews reviews_per_month calculated_host_listings_count availability_365 unit_price
count 4.822100e+04 4.822100e+04 48221.000000 48221.000000 48221.000000 48221.000000 48221.000000 48221.000000 48221.000000 48221.000000 48221.000000
mean 1.903368e+07 6.764915e+07 40.728922 -73.952029 143.473466 6.173472 23.470521 1.100040 7.190871 111.849899 61.690299
std 1.097072e+07 7.860904e+07 0.054550 0.046198 176.823568 10.942889 44.688066 1.602071 33.161159 131.099858 63.639782
min 2.539000e+03 2.438000e+03 40.499790 -74.244420 10.000000 1.000000 0.000000 0.000000 1.000000 0.000000 1.016667
25% 9.488762e+06 7.833913e+06 40.690040 -73.982910 69.000000 1.000000 1.000000 0.040000 1.000000 0.000000 20.000000
50% 1.969877e+07 3.086886e+07 40.722960 -73.955570 105.000000 3.000000 5.000000 0.380000 1.000000 44.000000 44.000000
75% 2.914645e+07 1.074344e+08 40.763170 -73.936120 175.000000 5.000000 24.000000 1.610000 2.000000 224.000000 80.000000
max 3.648724e+07 2.743213e+08 40.913060 -73.712990 10000.000000 365.000000 629.000000 58.500000 327.000000 365.000000 499.500000
df.unit_price.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x19f178b8208>

Airbin_0720

2.民宿区域分析

  • 从neighbourhood_group/neighbourhood两个维度查看不同区域民宿数量分布
df_pie = df.neighbourhood_group.value_counts().reset_index()
df_pie
index neighbourhood_group
0 Manhattan 21287
1 *lyn 19880
2 Queens 5614
3 Bronx 1070
4 Staten Island 370
plt.figure(figsize=(15,9.3))
explode = (0.1,0,0,0,0) #为了使最大占比区域分离
plt.pie(x='neighbourhood_group',labels='index',explode=explode,autopct='%1.1f%%',data=df_pie)
plt.title("The Rate of different Landmark Point")
plt.show()

Airbin_0720

pivot = df.pivot_table(index='neighbourhood',values='id',aggfunc='count')
df_pie_2 = pivot.sort_values(by='id',ascending=False).reset_index().head(20) #这里只取值前20,否则显示太多
df_pie_2.head()
neighbourhood id
0 Williamsburg 3888
1 Bedford-Stuyvesant 3666
2 Harlem 2638
3 Bushwick 2445
4 Upper West Side 1936
plt.figure(figsize=(15,9.3))
plt.pie(x='id',labels='neighbourhood',autopct='%1.1f%%',data=df_pie_2)
plt.show()

Airbin_0720

plt.figure(figsize=(10,8))
pivot[pivot.id > 2500].sort_values(by='id',ascending=False).plot.bar()
plt.xticks(rotation=30)
(array([0, 1, 2]), <a list of 3 Text xticklabel objects>)




<Figure size 720x576 with 0 Axes>

Airbin_0720

  • 房间最多的区域前3个区域是Williamsburg 3888,Bedford-Stuyvesant 3666,Harlem 2638

3.对比Top3区域的均价

df1 = df[['neighbourhood','unit_price']].groupby(by=['neighbourhood','unit_price']).count().reset_index()
df1.head()
neighbourhood unit_price
0 Allerton 5.000000
1 Allerton 12.142857
2 Allerton 15.750000
3 Allerton 16.333333
4 Allerton 17.000000
top3 = ['Williamsburg','Bedford-Stuyvesant','Harlem']
for i in top3:
    plt.hist(df1[df1.neighbourhood == i].unit_price,bins=20)
    plt.xlabel(i)
    plt.show()

Airbin_0720

Airbin_0720

Airbin_0720

  • 前两个价格区间类似,都集中在20美元以下,最后一个集中在20美元左右,之后都呈递减趋势;
  • 值得注意的是,前3个区域的民宿价格根据数量递减,最高价格也呈递减趋势;
  • 第一多的Williamsburg最高房价高达500,而第二多的Bedford-Stuyvesant价格最高达400多(小于500),而第三的Harlem则仅需350;

4.最受欢迎的民宿

df.sort_values(by=['number_of_reviews','reviews_per_month'],ascending=False).iloc[:10,:] #注意by后面参数的顺序影响结果
id name host_id host_name neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365 unit_price
5321 9145202 Room near JFK Queen Bed 47621202 Dona Queens Jamaica 40.66730 -73.76831 Private room 47 1 629 2019/7/5 14.58 2 333 47.0
8515 903972 Great Bedroom in Manhattan 4734398 Jj Manhattan Harlem 40.82085 -73.94025 Private room 49 1 607 2019/6/21 7.75 3 293 49.0
4259 903947 Beautiful Bedroom in Manhattan 4734398 Jj Manhattan Harlem 40.82124 -73.93838 Private room 49 1 597 2019/6/23 7.72 3 342 49.0
4709 891117 Private Bedroom in Manhattan 4734398 Jj Manhattan Harlem 40.82264 -73.94041 Private room 49 1 594 2019/6/15 7.57 3 339 49.0
15330 10101135 Room Near JFK Twin Beds 47621202 Dona Queens Jamaica 40.66939 -73.76975 Private room 47 1 576 2019/6/27 13.40 2 173 47.0
15923 8168619 Steps away from Laguardia airport 37312959 Maya Queens East Elmhurst 40.77006 -73.87683 Private room 46 1 543 2019/7/1 11.59 5 163 46.0
14803 834190 Manhattan Lux Loft.Like.Love.Lots.Look ! 2369681 Carol Manhattan Lower East Side 40.71921 -73.99116 Private room 99 2 540 2019/7/6 6.95 1 179 49.5
4427 16276632 Cozy Room Family Home LGA Airport NO CLEANING FEE 26432133 Danielle Queens East Elmhurst 40.76335 -73.87007 Private room 48 1 510 2019/7/6 16.22 5 341 48.0
10001 3474320 Private brownstone studio *lyn 12949460 Asa *lyn Park Slope 40.67926 -73.97711 Entire home/apt 160 1 488 2019/7/1 8.14 1 269 160.0
44566 166172 LG Private Room/Family Friendly 792159 Wanda *lyn Bushwick 40.70283 -73.92131 Private room 60 3 480 2019/7/7 6.70 1 0 20.0
  • 前10个最受欢迎的民宿,其中4家在Manhattan,4家在Queens,2家在*lyn
  • 其中前8个民宿均价都在50上下浮动,最末有1个极高值160,有1个极低值20,再看房型,160的房型是整屋出租,其他都是单间
  • 前10家民宿的浏览量在480~630之间稳定浮动
  • 前10家民宿中,排前5的房子由2个人经营,即一个人开多家民宿,猜测是有专业的运营团队,后5个目前不确定(因为这里只显示高人气房子)

下面我们看下排行后面5个的房东是否只经营1家民宿,从后面的全部数据中筛选

df[df['host_name'] == 'Maya'].count()
id                                46
name                              46
host_id                           46
host_name                         46
neighbourhood_group               46
neighbourhood                     46
latitude                          46
longitude                         46
room_type                         46
price                             46
minimum_nights                    46
number_of_reviews                 46
last_review                       46
reviews_per_month                 46
calculated_host_listings_count    46
availability_365                  46
unit_price                        46
dtype: int64
df['host_name'].isin(['Carol']).value_counts()
False    48176
True        45
Name: host_name, dtype: int64
df['host_name'].isin(['Danielle']).value_counts()
False    48145
True        76
Name: host_name, dtype: int64
df['host_name'].isin(['Asa']).value_counts()
False    48220
True         1
Name: host_name, dtype: int64
df['host_name'].isin(['Wanda']).value_counts()
False    48217
True         4
Name: host_name, dtype: int64
  • 这里通过两种方法可以求得后5个人气民宿的房东所拥有的民宿数量,排行前10的民宿中,有9家都是团队运营,只有一家房东Asa是个人运营;
  • 可能Airbin更趋于一种商业模式,而非最初的出发点,把自己闲置的房间share给旅行者;
  • 有点类似中国的二手房东,只是目标客户不再是本地人,而是旅行者,租赁方式也由长租改为短租;
  • 不过Airbin经过这么久的发展,已经是成熟的市场,团队应该比个人更专业些,排行靠前也就不足为奇了;

下面我们算下个人和团体的比例。计算方式:房东名下只有1套房的的视为个体,1次以上的视为团体

df2 = df['host_name'].value_counts()
df2.head()
Michael         400
David           390
Sonder (NYC)    327
John            291
Alex            272
Name: host_name, dtype: int64
df2.describe()
count    11355.000000
mean         4.244826
std         14.076734
min          1.000000
25%          1.000000
50%          1.000000
75%          2.000000
max        400.000000
Name: host_name, dtype: float64
(df2 > 1).sum() #团体经营
4518
(df2 == 1).sum() #个体经营
6837
  • 个体经营者约6837人,团体经营者约4518人,个人运营者更多,不过注意前者只有1套房子,后者有多套房子
  • 一共有48895条数据,减去6837条个人运营的1个房子,剩下的都是团体运营的房子,团体拥有总房子数量为个人的6倍左右;

5. 受欢迎的房子特点

keyword = df['name'].to_frame().to_csv('hotword.txt', index=False, sep=',', header=None) #方法to_frame()将Series转DataFrame
from wordcloud import WordCloud
text = open('hotword.txt',encoding = 'utf_8_sig').read()
wordcloud = WordCloud(background_color="white",max_font_size=50).generate(text)
plt.figure(figsize=(10,6.18))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')
plt.show()

Airbin_0720

通过词云呈现出高热点搜索词:

  • 搜索词最多的是Private/Apartment/Bedroom等房型,所以一定要标注出房价类型
  • 其次出现heart of/Central Park/In *lyn/East Village等描述地理位置的词汇
  • 同样出现的有形容词类如Beautiful/Cozy/Bright/Charming/Moden/Lovely/Quiet等
  • 房东在添加标题的时候可以考虑加入高频搜索词汇,以吸引更多的人浏览

后记:这里只是粗略的进行分析,如果有面积,实际上更深入的还可以利用统计学/机器学习领域的最小二乘法去分析下定价,或者天气等不同因素的影响,挖掘更深度的东西;

已经开始看机器学习了,不确定是否用得上,不过一口气听了10多集,觉得还行,另外,在学习machine learning之前,建议先打好基础,包括统计学,很多人推荐很多经典书,这里推荐一本深入浅出统计学,我刷了2遍,竟然发现machine learning很多用到统计学的。

最近看的一本书里提到一句话,大概意思是:数据分析只能在现有的格局里找到最优解,但是人类的思想却可以突破当下的困局,从而带来全面的革新。所以数据分析有优势,但也确实有其弊端。

继续学习吧~

相关标签: 分析

推荐阅读