欢迎您访问程序员文章站本站旨在为大家提供分享程序员计算机编程知识!
您现在的位置是: 首页

pandas 数据处理

程序员文章站 2022-06-05 22:08:29
...


Pandas


*pandas* is a Python library for data analysis. It offers a number of data exploration, cleaning and transformation operations that are critical in working with data in Python. *pandas* build upon *numpy* and *scipy* providing easy-to-use data structures and data manipulation functions with integrated indexing. The main data structures *pandas* provides are *Series* and *DataFrames*. After a brief introduction to these two data structures and data ingestion, the key features of *pandas* this notebook covers are: * Generating descriptive statistics on data * Data cleaning using built in pandas functions * Frequent data operations for subsetting, filtering, insertion, deletion and aggregation of data * Merging multiple datasets using dataframes * Working with timestamps and time-series data **Additional Recommended Resources:** * *pandas* Documentation: http://pandas.pydata.org/pandas-docs/stable/ * *Python for Data Analysis* by Wes McKinney * *Python Data Science Handbook* by Jake VanderPlas Let’s get started with our first *pandas* notebook! Import Libraries
import pandas as pd

Introduction to pandas Data Structures


*pandas* has two main data structures it uses, namely, *Series* and *DataFrames*.

pandas Series

*pandas Series* one-dimensional labeled array.
ser = pd.Series(data = [100, 'foo', 300, 'bar', 500], index = ['tom', 'bob', 'nancy', 'dan', 'eric'])
ser
tom 100 bob foo nancy 300 dan bar eric 500 dtype: object
ser.index
Index([‘tom’, ‘bob’, ‘nancy’, ‘dan’, ‘eric’], dtype=’object’)
ser.loc[['nancy','bob']]
nancy 300 bob foo dtype: object
ser[[4, 3, 1]]
eric 500 dan bar bob foo dtype: object
ser.iloc[2]
300
'bob' in ser
True
ser
tom 100 bob foo nancy 300 dan bar eric 500 dtype: object
ser * 2
tom 200 bob foofoo nancy 600 dan barbar eric 1000 dtype: object
ser[['nancy', 'eric']] ** 2
nancy 90000 eric 250000 dtype: object

pandas DataFrame

*pandas DataFrame* is a 2-dimensional labeled data structure.

Create DataFrame from dictionary of Python Series

d = {'one' : pd.Series([100., 200., 300.], index=['apple', 'ball', 'clock']),
     'two' : pd.Series([111., 222., 333., 4444.], index=['apple', 'ball', 'cerill', 'dancy'])}
df = pd.DataFrame(d)
print(df)
one two apple 100.0 111.0 ball 200.0 222.0 cerill NaN 333.0 clock 300.0 NaN dancy NaN 4444.0
df.index
Index([‘apple’, ‘ball’, ‘cerill’, ‘clock’, ‘dancy’], dtype=’object’)
df.columns
Index([‘one’, ‘two’], dtype=’object’)
pd.DataFrame(d, index=['dancy', 'ball', 'apple'])
one two
dancy NaN 4444.0
ball 200.0 222.0
apple 100.0 111.0
pd.DataFrame(d, index=['dancy', 'ball', 'apple'], columns=['two', 'five'])
two five
dancy 4444.0 NaN
ball 222.0 NaN
apple 111.0 NaN

Create DataFrame from list of Python dictionaries

data = [{'alex': 1, 'joe': 2}, {'ema': 5, 'dora': 10, 'alice': 20}]
pd.DataFrame(data)
alex alice dora ema joe
0 1.0 NaN NaN NaN 2.0
1 NaN 20.0 10.0 5.0 NaN
pd.DataFrame(data, index=['orange', 'red'])
alex alice dora ema joe
orange 1.0 NaN NaN NaN 2.0
red NaN 20.0 10.0 5.0 NaN
pd.DataFrame(data, columns=['joe', 'dora','alice'])
joe dora alice
0 2.0 NaN NaN
1 NaN 10.0 20.0

Basic DataFrame operations

df
one two
apple 100.0 111.0
ball 200.0 222.0
cerill NaN 333.0
clock 300.0 NaN
dancy NaN 4444.0
df['one']
apple 100.0 ball 200.0 cerill NaN clock 300.0 dancy NaN Name: one, dtype: float64
df['three'] = df['one'] * df['two']
df
one two three
apple 100.0 111.0 11100.0
ball 200.0 222.0 44400.0
cerill NaN 333.0 NaN
clock 300.0 NaN NaN
dancy NaN 4444.0 NaN
df['flag'] = df['one'] > 250
df
one two three flag
apple 100.0 111.0 11100.0 False
ball 200.0 222.0 44400.0 False
cerill NaN 333.0 NaN False
clock 300.0 NaN NaN True
dancy NaN 4444.0 NaN False
three = df.pop('three')
three
apple 11100.0 ball 44400.0 cerill NaN clock NaN dancy NaN Name: three, dtype: float64
df
one two flag
apple 100.0 111.0 False
ball 200.0 222.0 False
cerill NaN 333.0 False
clock 300.0 NaN True
dancy NaN 4444.0 False
del df['two']
df
one flag
apple 100.0 False
ball 200.0 False
cerill NaN False
clock 300.0 True
dancy NaN False
df.insert(2, 'copy_of_one', df['one'])
df
one flag copy_of_one
apple 100.0 False 100.0
ball 200.0 False 200.0
cerill NaN False NaN
clock 300.0 True 300.0
dancy NaN False NaN
df['one_upper_half'] = df['one'][:2]
df
one flag copy_of_one one_upper_half
apple 100.0 False 100.0 100.0
ball 200.0 False 200.0 200.0
cerill NaN False NaN NaN
clock 300.0 True 300.0 NaN
dancy NaN False NaN NaN

Case Study: Movie Data Analysis


This notebook uses a dataset from the MovieLens website. We will describe the dataset further as we explore with it using *pandas*. ## Download the Dataset Please note that **you will need to download the dataset**. Although the video for this notebook says that the data is in your folder, the folder turned out to be too large to fit on the edX platform due to size constraints. Here are the links to the data source and location: * **Data Source: ** MovieLens web site (filename: ml-20m.zip) * **Location:** https://grouplens.org/datasets/movielens/ Once the download completes, please make sure the data files are in a directory called *movielens* in your *Week-3-pandas* folder. Let us look at the files in this dataset using the UNIX command ls.
# Note: Adjust the name of the folder to match your local directory
#linux 使用
!ls ./movielens
!cat ./movielens/movies.csv | wc -l
!head -5 ./movielens/ratings.csv

Use Pandas to Read the Dataset


In this notebook, we will be using three CSV files:

  • ratings.csv : userId,movieId,rating, timestamp
  • tags.csv : userId,movieId, tag, timestamp
  • movies.csv : movieId, title, genres

Using the read_csv function in pandas, we will ingest these three files.

movies = pd.read_csv('./movielens/movies.csv', sep=',')
print(type(movies))
movies.head(15)
# Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970

tags = pd.read_csv('./movielens/tags.csv', sep=',')
tags.head()
userId movieId tag timestamp
0 18 4141 Mark Waters 1240597180
1 65 208 dark hero 1368150078
2 65 353 dark hero 1368150079
3 65 521 noir thriller 1368149983
4 65 592 dark hero 1368150078
ratings = pd.read_csv('./movielens/ratings.csv', sep=',', parse_dates=['timestamp'])
ratings.head()
userId movieId rating timestamp
0 1 2 3.5 1112486027
1 1 29 3.5 1112484676
2 1 32 3.5 1112484819
3 1 47 3.5 1112484727
4 1 50 3.5 1112484580
# For current analysis, we will remove timestamp (we will come back to it!)

del ratings['timestamp']
del tags['timestamp']

Data Structures

Series

#Extract 0th row: notice that it is infact a Series

row_0 = tags.iloc[0]
type(row_0)
pandas.core.series.Series
print(row_0)
userId 18 movieId 4141 tag Mark Waters Name: 0, dtype: object
row_0.index
Index([‘userId’, ‘movieId’, ‘tag’], dtype=’object’)
row_0['userId']
18
'rating' in row_0
False
row_0.name
0
row_0 = row_0.rename('first_row')
row_0.name
‘first_row’

DataFrames

tags.head()
userId movieId tag
0 18 4141 Mark Waters
1 65 208 dark hero
2 65 353 dark hero
3 65 521 noir thriller
4 65 592 dark hero
tags.index
RangeIndex(start=0, stop=465564, step=1)
tags.columns
Index([‘userId’, ‘movieId’, ‘tag’], dtype=’object’)
# Extract row 0, 11, 2000 from DataFrame

tags.iloc[ [0,11,2000] ]
userId movieId tag
0 18 4141 Mark Waters
11 65 1783 noir thriller
2000 910 68554 conspiracy theory

Descriptive Statistics

Let’s look how the ratings are distributed!

ratings['rating'].describe()
count 2.000026e+07 mean 3.525529e+00 std 1.051989e+00 min 5.000000e-01 25% 3.000000e+00 50% 3.500000e+00 75% 4.000000e+00 max 5.000000e+00 Name: rating, dtype: float64
ratings.describe()
userId movieId rating
count 2.000026e+07 2.000026e+07 2.000026e+07
mean 6.904587e+04 9.041567e+03 3.525529e+00
std 4.003863e+04 1.978948e+04 1.051989e+00
min 1.000000e+00 1.000000e+00 5.000000e-01
25% 3.439500e+04 9.020000e+02 3.000000e+00
50% 6.914100e+04 2.167000e+03 3.500000e+00
75% 1.036370e+05 4.770000e+03 4.000000e+00
max 1.384930e+05 1.312620e+05 5.000000e+00
ratings['rating'].mean()
3.5255285642993797
ratings.mean()
userId 69045.872583 movieId 9041.567330 rating 3.525529 dtype: float64
ratings['rating'].min()
0.5
ratings['rating'].max()
5.0
ratings['rating'].std()
1.051988919275684
ratings['rating'].mode()
0 4.0 dtype: float64
ratings.corr()
userId movieId rating
userId 1.000000 -0.000850 0.001175
movieId -0.000850 1.000000 0.002606
rating 0.001175 0.002606 1.000000
filter_1 = ratings['rating'] > 5
print(filter_1)
filter_1.any()
0 False 1 False 2 False 3 False 4 False 5 False 6 False 7 False 8 False 9 False 10 False 11 False 12 False 13 False 14 False 15 False 16 False 17 False 18 False 19 False 20 False 21 False 22 False 23 False 24 False 25 False 26 False 27 False 28 False 29 False … 20000233 False 20000234 False 20000235 False 20000236 False 20000237 False 20000238 False 20000239 False 20000240 False 20000241 False 20000242 False 20000243 False 20000244 False 20000245 False 20000246 False 20000247 False 20000248 False 20000249 False 20000250 False 20000251 False 20000252 False 20000253 False 20000254 False 20000255 False 20000256 False 20000257 False 20000258 False 20000259 False 20000260 False 20000261 False 20000262 False Name: rating, dtype: bool False
filter_2 = ratings['rating'] > 0
filter_2.all()
True

Data Cleaning: Handling Missing Data

movies.shape
(27278, 3)
#is any row NULL ?

movies.isnull().any()
movieId False title False genres False dtype: bool Thats nice ! No NULL values !
ratings.shape
(20000263, 3)
#is any row NULL ?

ratings.isnull().any()
userId False movieId False rating False dtype: bool Thats nice ! No NULL values !
tags.shape
(465564, 3)
#is any row NULL ?

tags.isnull().any()
userId False movieId False tag True dtype: bool We have some tags which are NULL.
tags = tags.dropna()
#Check again: is any row NULL ?

tags.isnull().any()
userId False movieId False tag False dtype: bool
tags.shape
(465548, 3) Thats nice ! No NULL values ! Notice the number of lines have reduced.

Data Visualization

%matplotlib inline

ratings.hist(column='rating', figsize=(15,10))
array([[
ratings.boxplot(column='rating', figsize=(15,20))
tags['tag'].head()
0 Mark Waters 1 dark hero 2 dark hero 3 noir thriller 4 dark hero Name: tag, dtype: object
movies[['title','genres']].head()
title genres
0 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
1 Jumanji (1995) Adventure|Children|Fantasy
2 Grumpier Old Men (1995) Comedy|Romance
3 Waiting to Exhale (1995) Comedy|Drama|Romance
4 Father of the Bride Part II (1995) Comedy
ratings[-10:]
userId movieId rating
20000253 138493 60816 4.5
20000254 138493 61160 4.0
20000255 138493 65682 4.5
20000256 138493 66762 4.5
20000257 138493 68319 4.5
20000258 138493 68954 4.5
20000259 138493 69526 4.5
20000260 138493 69644 3.0
20000261 138493 70286 5.0
20000262 138493 71619 2.5
tag_counts = tags['tag'].value_counts()
tag_counts[-10:]
Venice Film Festival Winner 2002 1 based on the life of Buford Pusser 1 but no way as good as the other two 1 see 1 Jeffrey Kimball 1 tolerable 1 Fake History - Don’t Believe a Thing 1 Boy 1 urlaub 1 conservative 1 Name: tag, dtype: int64
tag_counts[:10].plot(kind='bar', figsize=(15,10))
is_highly_rated = ratings['rating'] >= 4.0

ratings[is_highly_rated][30:50]
userId movieId rating
68 1 2021 4.0
69 1 2100 4.0
70 1 2118 4.0
71 1 2138 4.0
72 1 2140 4.0
73 1 2143 4.0
74 1 2173 4.0
75 1 2174 4.0
76 1 2193 4.0
79 1 2288 4.0
80 1 2291 4.0
81 1 2542 4.0
82 1 2628 4.0
90 1 2762 4.0
92 1 2872 4.0
94 1 2944 4.0
96 1 2959 4.0
97 1 2968 4.0
101 1 3081 4.0
102 1 3153 4.0
is_animation = movies['genres'].str.contains('Animation')

movies[is_animation][5:15]
movieId title genres
310 313 Swan Princess, The (1994) Animation|Children
360 364 Lion King, The (1994) Adventure|Animation|Children|Drama|Musical|IMAX
388 392 Secret Adventures of Tom Thumb, The (1993) Adventure|Animation
547 551 Nightmare Before Christmas, The (1993) Animation|Children|Fantasy|Musical
553 558 Pagemaster, The (1994) Action|Adventure|Animation|Children|Fantasy
582 588 Aladdin (1992) Adventure|Animation|Children|Comedy|Musical
588 594 Snow White and the Seven Dwarfs (1937) Animation|Children|Drama|Fantasy|Musical
589 595 Beauty and the Beast (1991) Animation|Children|Fantasy|Musical|Romance|IMAX
590 596 Pinocchio (1940) Animation|Children|Fantasy|Musical
604 610 Heavy Metal (1981) Action|Adventure|Animation|Horror|Sci-Fi
movies[is_animation].head(15)
movieId title genres
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
12 13 Balto (1995) Adventure|Animation|Children
47 48 Pocahontas (1995) Animation|Children|Drama|Musical|Romance
236 239 Goofy Movie, A (1995) Animation|Children|Comedy|Romance
241 244 Gumby: The Movie (1995) Animation|Children
310 313 Swan Princess, The (1994) Animation|Children
360 364 Lion King, The (1994) Adventure|Animation|Children|Drama|Musical|IMAX
388 392 Secret Adventures of Tom Thumb, The (1993) Adventure|Animation
547 551 Nightmare Before Christmas, The (1993) Animation|Children|Fantasy|Musical
553 558 Pagemaster, The (1994) Action|Adventure|Animation|Children|Fantasy
582 588 Aladdin (1992) Adventure|Animation|Children|Comedy|Musical
588 594 Snow White and the Seven Dwarfs (1937) Animation|Children|Drama|Fantasy|Musical
589 595 Beauty and the Beast (1991) Animation|Children|Fantasy|Musical|Romance|IMAX
590 596 Pinocchio (1940) Animation|Children|Fantasy|Musical
604 610 Heavy Metal (1981) Action|Adventure|Animation|Horror|Sci-Fi

Group By and Aggregate

ratings_count = ratings[['movieId','rating']].groupby('rating').count()
ratings_count
movieId
rating
0.5 239125
1.0 680732
1.5 279252
2.0 1430997
2.5 883398
3.0 4291193
3.5 2200156
4.0 5561926
4.5 1534824
5.0 2898660
average_rating = ratings[['movieId','rating']].groupby('movieId').mean()
average_rating.head()
rating
movieId
1 3.921240
2 3.211977
3 3.151040
4 2.861393
5 3.064592
movie_count = ratings[['movieId','rating']].groupby('movieId').count()
movie_count.head()
rating
movieId
1 49695
2 22243
3 12735
4 2756
5 12161
movie_count = ratings[['movieId','rating']].groupby('movieId').count()#选择某一个维度,然后根据维度group by,和sql操作类似
movie_count.tail()
rating
movieId
131254 1
131256 1
131258 1
131260 1
131262 1

Merge Dataframes

tags.head()
userId movieId tag
0 18 4141 Mark Waters
1 65 208 dark hero
2 65 353 dark hero
3 65 521 noir thriller
4 65 592 dark hero
movies.head()
movieId title genres
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
1 2 Jumanji (1995) Adventure|Children|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama|Romance
4 5 Father of the Bride Part II (1995) Comedy
t = movies.merge(tags, on='movieId', how='inner')
t.head()
#?movies.merge 详细说明
movieId title genres userId tag
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy 1644 Watched
1 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy 1741 computer animation
2 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy 1741 Disney animated feature
3 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy 1741 Pixar animation
4 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy 1741 Téa Leoni does not star in this movie

More examples: http://pandas.pydata.org/pandas-docs/stable/merging.html


Combine aggreagation, merging, and filters to get useful analytics

avg_ratings = ratings.groupby('movieId', as_index=False).mean().rename(columns={'rating':'avg_rating'})#指定columns field 就能重命名了
del avg_ratings['userId']
avg_ratings.head()
movieId avg_rating
0 1 3.921240
1 2 3.211977
2 3 3.151040
3 4 2.861393
4 5 3.064592
avg_ratings = avg_ratings.rename({'ratings':'avg_rating'})
box_office = movies.merge(avg_ratings, on='movieId', how='inner')
box_office.head()
movieId title genres avg_rating
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy 3.921240
1 2 Jumanji (1995) Adventure|Children|Fantasy 3.211977
2 3 Grumpier Old Men (1995) Comedy|Romance 3.151040
3 4 Waiting to Exhale (1995) Comedy|Drama|Romance 2.861393
4 5 Father of the Bride Part II (1995) Comedy 3.064592
is_highly_rated = box_office['avg_rating'] >= 4.0

box_office[is_highly_rated][-5:]
movieId title genres avg_rating
26737 131250 No More School (2000) Comedy 4.0
26738 131252 Forklift Driver Klaus: The First Day on the Jo… Comedy|Horror 4.0
26739 131254 Kein Bund für’s Leben (2007) Comedy 4.0
26740 131256 Feuer, Eis & Dosenbier (2002) Comedy 4.0
26743 131262 Innocence (2014) Adventure|Fantasy|Horror 4.0
is_comedy = box_office['genres'].str.contains('Comedy')

box_office[is_comedy][:5]
movieId title genres avg_rating
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy 3.921240
2 3 Grumpier Old Men (1995) Comedy|Romance 3.151040
3 4 Waiting to Exhale (1995) Comedy|Drama|Romance 2.861393
4 5 Father of the Bride Part II (1995) Comedy 3.064592
6 7 Sabrina (1995) Comedy|Romance 3.366484
box_office[is_comedy & is_highly_rated][-5:]
movieId title genres avg_rating
26736 131248 Brother Bear 2 (2006) Adventure|Animation|Children|Comedy|Fantasy 4.0
26737 131250 No More School (2000) Comedy 4.0
26738 131252 Forklift Driver Klaus: The First Day on the Jo… Comedy|Horror 4.0
26739 131254 Kein Bund für’s Leben (2007) Comedy 4.0
26740 131256 Feuer, Eis & Dosenbier (2002) Comedy 4.0

Vectorized String Operations

movies.head()
movieId title genres
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
1 2 Jumanji (1995) Adventure|Children|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama|Romance
4 5 Father of the Bride Part II (1995) Comedy


Split ‘genres’ into multiple columns

movie_genres = movies['genres'].str.split('|', expand=True)
movie_genres[:10]
0 1 2 3 4 5 6 7 8 9
0 Adventure Animation Children Comedy Fantasy None None None None None
1 Adventure Children Fantasy None None None None None None None
2 Comedy Romance None None None None None None None None
3 Comedy Drama Romance None None None None None None None
4 Comedy None None None None None None None None None
5 Action Crime Thriller None None None None None None None
6 Comedy Romance None None None None None None None None
7 Adventure Children None None None None None None None None
8 Action None None None None None None None None None
9 Action Adventure Thriller None None None None None None None


Add a new column for comedy genre flag

movie_genres['isComedy'] = movies['genres'].str.contains('Comedy')
movie_genres[:10]
0 1 2 3 4 5 6 7 8 9 isComedy
0 Adventure Animation Children Comedy Fantasy None None None None None True
1 Adventure Children Fantasy None None None None None None None False
2 Comedy Romance None None None None None None None None True
3 Comedy Drama Romance None None None None None None None True
4 Comedy None None None None None None None None None True
5 Action Crime Thriller None None None None None None None False
6 Comedy Romance None None None None None None None None True
7 Adventure Children None None None None None None None None False
8 Action None None None None None None None None None False
9 Action Adventure Thriller None None None None None None None False


Extract year from title e.g. (1995)

movies['year'] = movies['title'].str.extract('.*\((.*)\).*', expand=True)#接受正则表达式
movies.tail()
movieId title genres year
27273 131254 Kein Bund für’s Leben (2007) Comedy 2007
27274 131256 Feuer, Eis & Dosenbier (2002) Comedy 2002
27275 131258 The Pirates (2014) Adventure 2014
27276 131260 Rentun Ruusu (2001) (no genres listed) 2001
27277 131262 Innocence (2014) Adventure|Fantasy|Horror 2014


More here: http://pandas.pydata.org/pandas-docs/stable/text.html#text-string-methods

Parsing Timestamps

Timestamps are common in sensor data or other time series datasets. Let us revisit the *tags.csv* dataset and read the timestamps!
tags = pd.read_csv('./movielens/tags.csv', sep=',')
tags.dtypes
userId int64 movieId int64 tag object timestamp int64 dtype: object

Unix time / POSIX time / epoch time records time in seconds
since midnight Coordinated Universal Time (UTC) of January 1, 1970

tags.head(5)
userId movieId tag timestamp
0 18 4141 Mark Waters 1240597180
1 65 208 dark hero 1368150078
2 65 353 dark hero 1368150079
3 65 521 noir thriller 1368149983
4 65 592 dark hero 1368150078
tags['parsed_time'] = pd.to_datetime(tags['timestamp'], unit='s')#解析时间

Data Type datetime64[ns] maps to either


tags['parsed_time'].dtype
dtype(‘
tags.head(2)
userId movieId tag timestamp parsed_time
0 18 4141 Mark Waters 1240597180 2009-04-24 18:19:40
1 65 208 dark hero 1368150078 2013-05-10 01:41:18

Selecting rows based on timestamps

greater_than_t = tags['parsed_time'] > '2015-02-01'

selected_rows = tags[greater_than_t]

tags.shape, selected_rows.shape
((465564, 5), (12130, 5))

Sorting the table using the timestamps

tags.sort_values(by='parsed_time', ascending=True)[:10]
userId movieId tag timestamp parsed_time
333932 100371 2788 monty python 1135429210 2005-12-24 13:00:10
333927 100371 1732 coen brothers 1135429236 2005-12-24 13:00:36
333924 100371 1206 stanley kubrick 1135429248 2005-12-24 13:00:48
333923 100371 1193 jack nicholson 1135429371 2005-12-24 13:02:51
333939 100371 5004 peter sellers 1135429399 2005-12-24 13:03:19
333922 100371 47 morgan freeman 1135429412 2005-12-24 13:03:32
333921 100371 47 brad pitt 1135429412 2005-12-24 13:03:32
333936 100371 4011 brad pitt 1135429431 2005-12-24 13:03:51
333937 100371 4011 guy ritchie 1135429431 2005-12-24 13:03:51
333920 100371 32 bruce willis 1135429442 2005-12-24 13:04:02

Average Movie Ratings over Time

## Are Movie ratings related to the year of launch?
average_rating = ratings[['movieId','rating']].groupby('movieId', as_index=False).mean()
average_rating.tail()
movieId rating
26739 131254 4.0
26740 131256 4.0
26741 131258 2.5
26742 131260 3.0
26743 131262 4.0
joined = movies.merge(average_rating, on='movieId', how='inner')
joined.head()
joined.corr()
movieId rating
movieId 1.000000 -0.090369
rating -0.090369 1.000000
yearly_average = joined[['year','rating']].groupby('year', as_index=False).mean()#将asindex设为false会作为列名,否则作为行
yearly_average[:10]
year rating
0 1891 3.000000
1 1893 3.375000
2 1894 3.071429
3 1895 3.125000
4 1896 3.183036
5 1898 3.850000
6 1899 3.625000
7 1900 3.166667
8 1901 5.000000
9 1902 3.738189
import matplotlib.pyplot as plt
yearly_average[-20:].plot(x='year', y='rating', figsize=(15,10), grid=True)
plt.show()

pandas 数据处理
Do some years look better for the boxoffice movies than others?

Does any data point seem like an outlier in some sense?

注意: 本文为edx上 UCSD 的课程py for data science笔记