Pandas
*pandas* is a Python library for data analysis. It offers a number of data exploration, cleaning and transformation operations that are critical in working with data in Python.
*pandas* build upon *numpy* and *scipy* providing easy-to-use data structures and data manipulation functions with integrated indexing.
The main data structures *pandas* provides are *Series* and *DataFrames*. After a brief introduction to these two data structures and data ingestion, the key features of *pandas* this notebook covers are:
* Generating descriptive statistics on data
* Data cleaning using built in pandas functions
* Frequent data operations for subsetting, filtering, insertion, deletion and aggregation of data
* Merging multiple datasets using dataframes
* Working with timestamps and time-series data
**Additional Recommended Resources:**
* *pandas* Documentation: http://pandas.pydata.org/pandas-docs/stable/
* *Python for Data Analysis* by Wes McKinney
* *Python Data Science Handbook* by Jake VanderPlas
Let’s get started with our first *pandas* notebook!
Import Libraries
import pandas as pd
Introduction to pandas Data Structures
*pandas* has two main data structures it uses, namely, *Series* and *DataFrames*.
pandas Series
*pandas Series* one-dimensional labeled array.
ser = pd.Series(data = [100, 'foo', 300, 'bar', 500], index = ['tom', 'bob', 'nancy', 'dan', 'eric'])
ser
tom 100
bob foo
nancy 300
dan bar
eric 500
dtype: object
ser.index
Index([‘tom’, ‘bob’, ‘nancy’, ‘dan’, ‘eric’], dtype=’object’)
ser.loc[['nancy','bob']]
nancy 300
bob foo
dtype: object
ser[[4, 3, 1]]
eric 500
dan bar
bob foo
dtype: object
ser.iloc[2]
300
'bob' in ser
True
ser
tom 100
bob foo
nancy 300
dan bar
eric 500
dtype: object
ser * 2
tom 200
bob foofoo
nancy 600
dan barbar
eric 1000
dtype: object
ser[['nancy', 'eric']] ** 2
nancy 90000
eric 250000
dtype: object
pandas DataFrame
*pandas DataFrame* is a 2-dimensional labeled data structure.
Create DataFrame from dictionary of Python Series
d = {'one' : pd.Series([100., 200., 300.], index=['apple', 'ball', 'clock']),
'two' : pd.Series([111., 222., 333., 4444.], index=['apple', 'ball', 'cerill', 'dancy'])}
df = pd.DataFrame(d)
print(df)
one two
apple 100.0 111.0
ball 200.0 222.0
cerill NaN 333.0
clock 300.0 NaN
dancy NaN 4444.0
df.index
Index([‘apple’, ‘ball’, ‘cerill’, ‘clock’, ‘dancy’], dtype=’object’)
df.columns
Index([‘one’, ‘two’], dtype=’object’)
pd.DataFrame(d, index=['dancy', 'ball', 'apple'])
|
one |
two |
dancy |
NaN |
4444.0 |
ball |
200.0 |
222.0 |
apple |
100.0 |
111.0 |
pd.DataFrame(d, index=['dancy', 'ball', 'apple'], columns=['two', 'five'])
|
two |
five |
dancy |
4444.0 |
NaN |
ball |
222.0 |
NaN |
apple |
111.0 |
NaN |
Create DataFrame from list of Python dictionaries
data = [{'alex': 1, 'joe': 2}, {'ema': 5, 'dora': 10, 'alice': 20}]
pd.DataFrame(data)
|
alex |
alice |
dora |
ema |
joe |
0 |
1.0 |
NaN |
NaN |
NaN |
2.0 |
1 |
NaN |
20.0 |
10.0 |
5.0 |
NaN |
pd.DataFrame(data, index=['orange', 'red'])
|
alex |
alice |
dora |
ema |
joe |
orange |
1.0 |
NaN |
NaN |
NaN |
2.0 |
red |
NaN |
20.0 |
10.0 |
5.0 |
NaN |
pd.DataFrame(data, columns=['joe', 'dora','alice'])
|
joe |
dora |
alice |
0 |
2.0 |
NaN |
NaN |
1 |
NaN |
10.0 |
20.0 |
Basic DataFrame operations
df
|
one |
two |
apple |
100.0 |
111.0 |
ball |
200.0 |
222.0 |
cerill |
NaN |
333.0 |
clock |
300.0 |
NaN |
dancy |
NaN |
4444.0 |
df['one']
apple 100.0
ball 200.0
cerill NaN
clock 300.0
dancy NaN
Name: one, dtype: float64
df['three'] = df['one'] * df['two']
df
|
one |
two |
three |
apple |
100.0 |
111.0 |
11100.0 |
ball |
200.0 |
222.0 |
44400.0 |
cerill |
NaN |
333.0 |
NaN |
clock |
300.0 |
NaN |
NaN |
dancy |
NaN |
4444.0 |
NaN |
df['flag'] = df['one'] > 250
df
|
one |
two |
three |
flag |
apple |
100.0 |
111.0 |
11100.0 |
False |
ball |
200.0 |
222.0 |
44400.0 |
False |
cerill |
NaN |
333.0 |
NaN |
False |
clock |
300.0 |
NaN |
NaN |
True |
dancy |
NaN |
4444.0 |
NaN |
False |
three = df.pop('three')
three
apple 11100.0
ball 44400.0
cerill NaN
clock NaN
dancy NaN
Name: three, dtype: float64
df
|
one |
two |
flag |
apple |
100.0 |
111.0 |
False |
ball |
200.0 |
222.0 |
False |
cerill |
NaN |
333.0 |
False |
clock |
300.0 |
NaN |
True |
dancy |
NaN |
4444.0 |
False |
del df['two']
df
|
one |
flag |
apple |
100.0 |
False |
ball |
200.0 |
False |
cerill |
NaN |
False |
clock |
300.0 |
True |
dancy |
NaN |
False |
df.insert(2, 'copy_of_one', df['one'])
df
|
one |
flag |
copy_of_one |
apple |
100.0 |
False |
100.0 |
ball |
200.0 |
False |
200.0 |
cerill |
NaN |
False |
NaN |
clock |
300.0 |
True |
300.0 |
dancy |
NaN |
False |
NaN |
df['one_upper_half'] = df['one'][:2]
df
|
one |
flag |
copy_of_one |
one_upper_half |
apple |
100.0 |
False |
100.0 |
100.0 |
ball |
200.0 |
False |
200.0 |
200.0 |
cerill |
NaN |
False |
NaN |
NaN |
clock |
300.0 |
True |
300.0 |
NaN |
dancy |
NaN |
False |
NaN |
NaN |
Case Study: Movie Data Analysis
This notebook uses a dataset from the MovieLens website. We will describe the dataset further as we explore with it using *pandas*.
## Download the Dataset
Please note that **you will need to download the dataset**. Although the video for this notebook says that the data is in your folder, the folder turned out to be too large to fit on the edX platform due to size constraints.
Here are the links to the data source and location:
* **Data Source: ** MovieLens web site (filename: ml-20m.zip)
* **Location:** https://grouplens.org/datasets/movielens/
Once the download completes, please make sure the data files are in a directory called *movielens* in your *Week-3-pandas* folder.
Let us look at the files in this dataset using the UNIX command ls.
!ls ./movielens
!cat ./movielens/movies.csv | wc -l
!head -5 ./movielens/ratings.csv
Use Pandas to Read the Dataset
In this notebook, we will be using three CSV files:
-
ratings.csv : userId,movieId,rating, timestamp
-
tags.csv : userId,movieId, tag, timestamp
-
movies.csv : movieId, title, genres
Using the read_csv function in pandas, we will ingest these three files.
movies = pd.read_csv('./movielens/movies.csv', sep=',')
print(type(movies))
movies.head(15)
tags = pd.read_csv('./movielens/tags.csv', sep=',')
tags.head()
|
userId |
movieId |
tag |
timestamp |
0 |
18 |
4141 |
Mark Waters |
1240597180 |
1 |
65 |
208 |
dark hero |
1368150078 |
2 |
65 |
353 |
dark hero |
1368150079 |
3 |
65 |
521 |
noir thriller |
1368149983 |
4 |
65 |
592 |
dark hero |
1368150078 |
ratings = pd.read_csv('./movielens/ratings.csv', sep=',', parse_dates=['timestamp'])
ratings.head()
|
userId |
movieId |
rating |
timestamp |
0 |
1 |
2 |
3.5 |
1112486027 |
1 |
1 |
29 |
3.5 |
1112484676 |
2 |
1 |
32 |
3.5 |
1112484819 |
3 |
1 |
47 |
3.5 |
1112484727 |
4 |
1 |
50 |
3.5 |
1112484580 |
del ratings['timestamp']
del tags['timestamp']
Data Structures
Series
row_0 = tags.iloc[0]
type(row_0)
pandas.core.series.Series
print(row_0)
userId 18
movieId 4141
tag Mark Waters
Name: 0, dtype: object
row_0.index
Index([‘userId’, ‘movieId’, ‘tag’], dtype=’object’)
row_0['userId']
18
'rating' in row_0
False
row_0.name
0
row_0 = row_0.rename('first_row')
row_0.name
‘first_row’
DataFrames
tags.head()
|
userId |
movieId |
tag |
0 |
18 |
4141 |
Mark Waters |
1 |
65 |
208 |
dark hero |
2 |
65 |
353 |
dark hero |
3 |
65 |
521 |
noir thriller |
4 |
65 |
592 |
dark hero |
tags.index
RangeIndex(start=0, stop=465564, step=1)
tags.columns
Index([‘userId’, ‘movieId’, ‘tag’], dtype=’object’)
tags.iloc[ [0,11,2000] ]
|
userId |
movieId |
tag |
0 |
18 |
4141 |
Mark Waters |
11 |
65 |
1783 |
noir thriller |
2000 |
910 |
68554 |
conspiracy theory |
Descriptive Statistics
Let’s look how the ratings are distributed!
ratings['rating'].describe()
count 2.000026e+07
mean 3.525529e+00
std 1.051989e+00
min 5.000000e-01
25% 3.000000e+00
50% 3.500000e+00
75% 4.000000e+00
max 5.000000e+00
Name: rating, dtype: float64
ratings.describe()
|
userId |
movieId |
rating |
count |
2.000026e+07 |
2.000026e+07 |
2.000026e+07 |
mean |
6.904587e+04 |
9.041567e+03 |
3.525529e+00 |
std |
4.003863e+04 |
1.978948e+04 |
1.051989e+00 |
min |
1.000000e+00 |
1.000000e+00 |
5.000000e-01 |
25% |
3.439500e+04 |
9.020000e+02 |
3.000000e+00 |
50% |
6.914100e+04 |
2.167000e+03 |
3.500000e+00 |
75% |
1.036370e+05 |
4.770000e+03 |
4.000000e+00 |
max |
1.384930e+05 |
1.312620e+05 |
5.000000e+00 |
ratings['rating'].mean()
3.5255285642993797
ratings.mean()
userId 69045.872583
movieId 9041.567330
rating 3.525529
dtype: float64
ratings['rating'].min()
0.5
ratings['rating'].max()
5.0
ratings['rating'].std()
1.051988919275684
ratings['rating'].mode()
0 4.0
dtype: float64
ratings.corr()
|
userId |
movieId |
rating |
userId |
1.000000 |
-0.000850 |
0.001175 |
movieId |
-0.000850 |
1.000000 |
0.002606 |
rating |
0.001175 |
0.002606 |
1.000000 |
filter_1 = ratings['rating'] > 5
print(filter_1)
filter_1.any()
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
10 False
11 False
12 False
13 False
14 False
15 False
16 False
17 False
18 False
19 False
20 False
21 False
22 False
23 False
24 False
25 False
26 False
27 False
28 False
29 False
…
20000233 False
20000234 False
20000235 False
20000236 False
20000237 False
20000238 False
20000239 False
20000240 False
20000241 False
20000242 False
20000243 False
20000244 False
20000245 False
20000246 False
20000247 False
20000248 False
20000249 False
20000250 False
20000251 False
20000252 False
20000253 False
20000254 False
20000255 False
20000256 False
20000257 False
20000258 False
20000259 False
20000260 False
20000261 False
20000262 False
Name: rating, dtype: bool
False
filter_2 = ratings['rating'] > 0
filter_2.all()
True
Data Cleaning: Handling Missing Data
movies.shape
(27278, 3)
movies.isnull().any()
movieId False
title False
genres False
dtype: bool
Thats nice ! No NULL values !
ratings.shape
(20000263, 3)
ratings.isnull().any()
userId False
movieId False
rating False
dtype: bool
Thats nice ! No NULL values !
tags.shape
(465564, 3)
tags.isnull().any()
userId False
movieId False
tag True
dtype: bool
We have some tags which are NULL.
tags = tags.dropna()
tags.isnull().any()
userId False
movieId False
tag False
dtype: bool
tags.shape
(465548, 3)
Thats nice ! No NULL values ! Notice the number of lines have reduced.
Data Visualization
%matplotlib inline
ratings.hist(column='rating', figsize=(15,10))
array([[
ratings.boxplot(column='rating', figsize=(15,20))
tags['tag'].head()
0 Mark Waters
1 dark hero
2 dark hero
3 noir thriller
4 dark hero
Name: tag, dtype: object
movies[['title','genres']].head()
|
title |
genres |
0 |
Toy Story (1995) |
Adventure|Animation|Children|Comedy|Fantasy |
1 |
Jumanji (1995) |
Adventure|Children|Fantasy |
2 |
Grumpier Old Men (1995) |
Comedy|Romance |
3 |
Waiting to Exhale (1995) |
Comedy|Drama|Romance |
4 |
Father of the Bride Part II (1995) |
Comedy |
ratings[-10:]
|
userId |
movieId |
rating |
20000253 |
138493 |
60816 |
4.5 |
20000254 |
138493 |
61160 |
4.0 |
20000255 |
138493 |
65682 |
4.5 |
20000256 |
138493 |
66762 |
4.5 |
20000257 |
138493 |
68319 |
4.5 |
20000258 |
138493 |
68954 |
4.5 |
20000259 |
138493 |
69526 |
4.5 |
20000260 |
138493 |
69644 |
3.0 |
20000261 |
138493 |
70286 |
5.0 |
20000262 |
138493 |
71619 |
2.5 |
tag_counts = tags['tag'].value_counts()
tag_counts[-10:]
Venice Film Festival Winner 2002 1
based on the life of Buford Pusser 1
but no way as good as the other two 1
see 1
Jeffrey Kimball 1
tolerable 1
Fake History - Don’t Believe a Thing 1
Boy 1
urlaub 1
conservative 1
Name: tag, dtype: int64
tag_counts[:10].plot(kind='bar', figsize=(15,10))
is_highly_rated = ratings['rating'] >= 4.0
ratings[is_highly_rated][30:50]
|
userId |
movieId |
rating |
68 |
1 |
2021 |
4.0 |
69 |
1 |
2100 |
4.0 |
70 |
1 |
2118 |
4.0 |
71 |
1 |
2138 |
4.0 |
72 |
1 |
2140 |
4.0 |
73 |
1 |
2143 |
4.0 |
74 |
1 |
2173 |
4.0 |
75 |
1 |
2174 |
4.0 |
76 |
1 |
2193 |
4.0 |
79 |
1 |
2288 |
4.0 |
80 |
1 |
2291 |
4.0 |
81 |
1 |
2542 |
4.0 |
82 |
1 |
2628 |
4.0 |
90 |
1 |
2762 |
4.0 |
92 |
1 |
2872 |
4.0 |
94 |
1 |
2944 |
4.0 |
96 |
1 |
2959 |
4.0 |
97 |
1 |
2968 |
4.0 |
101 |
1 |
3081 |
4.0 |
102 |
1 |
3153 |
4.0 |
is_animation = movies['genres'].str.contains('Animation')
movies[is_animation][5:15]
|
movieId |
title |
genres |
310 |
313 |
Swan Princess, The (1994) |
Animation|Children |
360 |
364 |
Lion King, The (1994) |
Adventure|Animation|Children|Drama|Musical|IMAX |
388 |
392 |
Secret Adventures of Tom Thumb, The (1993) |
Adventure|Animation |
547 |
551 |
Nightmare Before Christmas, The (1993) |
Animation|Children|Fantasy|Musical |
553 |
558 |
Pagemaster, The (1994) |
Action|Adventure|Animation|Children|Fantasy |
582 |
588 |
Aladdin (1992) |
Adventure|Animation|Children|Comedy|Musical |
588 |
594 |
Snow White and the Seven Dwarfs (1937) |
Animation|Children|Drama|Fantasy|Musical |
589 |
595 |
Beauty and the Beast (1991) |
Animation|Children|Fantasy|Musical|Romance|IMAX |
590 |
596 |
Pinocchio (1940) |
Animation|Children|Fantasy|Musical |
604 |
610 |
Heavy Metal (1981) |
Action|Adventure|Animation|Horror|Sci-Fi |
movies[is_animation].head(15)
|
movieId |
title |
genres |
0 |
1 |
Toy Story (1995) |
Adventure|Animation|Children|Comedy|Fantasy |
12 |
13 |
Balto (1995) |
Adventure|Animation|Children |
47 |
48 |
Pocahontas (1995) |
Animation|Children|Drama|Musical|Romance |
236 |
239 |
Goofy Movie, A (1995) |
Animation|Children|Comedy|Romance |
241 |
244 |
Gumby: The Movie (1995) |
Animation|Children |
310 |
313 |
Swan Princess, The (1994) |
Animation|Children |
360 |
364 |
Lion King, The (1994) |
Adventure|Animation|Children|Drama|Musical|IMAX |
388 |
392 |
Secret Adventures of Tom Thumb, The (1993) |
Adventure|Animation |
547 |
551 |
Nightmare Before Christmas, The (1993) |
Animation|Children|Fantasy|Musical |
553 |
558 |
Pagemaster, The (1994) |
Action|Adventure|Animation|Children|Fantasy |
582 |
588 |
Aladdin (1992) |
Adventure|Animation|Children|Comedy|Musical |
588 |
594 |
Snow White and the Seven Dwarfs (1937) |
Animation|Children|Drama|Fantasy|Musical |
589 |
595 |
Beauty and the Beast (1991) |
Animation|Children|Fantasy|Musical|Romance|IMAX |
590 |
596 |
Pinocchio (1940) |
Animation|Children|Fantasy|Musical |
604 |
610 |
Heavy Metal (1981) |
Action|Adventure|Animation|Horror|Sci-Fi |
Group By and Aggregate
ratings_count = ratings[['movieId','rating']].groupby('rating').count()
ratings_count
|
movieId |
rating |
|
0.5 |
239125 |
1.0 |
680732 |
1.5 |
279252 |
2.0 |
1430997 |
2.5 |
883398 |
3.0 |
4291193 |
3.5 |
2200156 |
4.0 |
5561926 |
4.5 |
1534824 |
5.0 |
2898660 |
average_rating = ratings[['movieId','rating']].groupby('movieId').mean()
average_rating.head()
|
rating |
movieId |
|
1 |
3.921240 |
2 |
3.211977 |
3 |
3.151040 |
4 |
2.861393 |
5 |
3.064592 |
movie_count = ratings[['movieId','rating']].groupby('movieId').count()
movie_count.head()
|
rating |
movieId |
|
1 |
49695 |
2 |
22243 |
3 |
12735 |
4 |
2756 |
5 |
12161 |
movie_count = ratings[['movieId','rating']].groupby('movieId').count()
movie_count.tail()
|
rating |
movieId |
|
131254 |
1 |
131256 |
1 |
131258 |
1 |
131260 |
1 |
131262 |
1 |
Merge Dataframes
tags.head()
|
userId |
movieId |
tag |
0 |
18 |
4141 |
Mark Waters |
1 |
65 |
208 |
dark hero |
2 |
65 |
353 |
dark hero |
3 |
65 |
521 |
noir thriller |
4 |
65 |
592 |
dark hero |
movies.head()
|
movieId |
title |
genres |
0 |
1 |
Toy Story (1995) |
Adventure|Animation|Children|Comedy|Fantasy |
1 |
2 |
Jumanji (1995) |
Adventure|Children|Fantasy |
2 |
3 |
Grumpier Old Men (1995) |
Comedy|Romance |
3 |
4 |
Waiting to Exhale (1995) |
Comedy|Drama|Romance |
4 |
5 |
Father of the Bride Part II (1995) |
Comedy |
t = movies.merge(tags, on='movieId', how='inner')
t.head()
|
movieId |
title |
genres |
userId |
tag |
0 |
1 |
Toy Story (1995) |
Adventure|Animation|Children|Comedy|Fantasy |
1644 |
Watched |
1 |
1 |
Toy Story (1995) |
Adventure|Animation|Children|Comedy|Fantasy |
1741 |
computer animation |
2 |
1 |
Toy Story (1995) |
Adventure|Animation|Children|Comedy|Fantasy |
1741 |
Disney animated feature |
3 |
1 |
Toy Story (1995) |
Adventure|Animation|Children|Comedy|Fantasy |
1741 |
Pixar animation |
4 |
1 |
Toy Story (1995) |
Adventure|Animation|Children|Comedy|Fantasy |
1741 |
Téa Leoni does not star in this movie |
More examples: http://pandas.pydata.org/pandas-docs/stable/merging.html
Combine aggreagation, merging, and filters to get useful analytics
avg_ratings = ratings.groupby('movieId', as_index=False).mean().rename(columns={'rating':'avg_rating'})
del avg_ratings['userId']
avg_ratings.head()
|
movieId |
avg_rating |
0 |
1 |
3.921240 |
1 |
2 |
3.211977 |
2 |
3 |
3.151040 |
3 |
4 |
2.861393 |
4 |
5 |
3.064592 |
avg_ratings = avg_ratings.rename({'ratings':'avg_rating'})
box_office = movies.merge(avg_ratings, on='movieId', how='inner')
box_office.head()
|
movieId |
title |
genres |
avg_rating |
0 |
1 |
Toy Story (1995) |
Adventure|Animation|Children|Comedy|Fantasy |
3.921240 |
1 |
2 |
Jumanji (1995) |
Adventure|Children|Fantasy |
3.211977 |
2 |
3 |
Grumpier Old Men (1995) |
Comedy|Romance |
3.151040 |
3 |
4 |
Waiting to Exhale (1995) |
Comedy|Drama|Romance |
2.861393 |
4 |
5 |
Father of the Bride Part II (1995) |
Comedy |
3.064592 |
is_highly_rated = box_office['avg_rating'] >= 4.0
box_office[is_highly_rated][-5:]
|
movieId |
title |
genres |
avg_rating |
26737 |
131250 |
No More School (2000) |
Comedy |
4.0 |
26738 |
131252 |
Forklift Driver Klaus: The First Day on the Jo… |
Comedy|Horror |
4.0 |
26739 |
131254 |
Kein Bund für’s Leben (2007) |
Comedy |
4.0 |
26740 |
131256 |
Feuer, Eis & Dosenbier (2002) |
Comedy |
4.0 |
26743 |
131262 |
Innocence (2014) |
Adventure|Fantasy|Horror |
4.0 |
is_comedy = box_office['genres'].str.contains('Comedy')
box_office[is_comedy][:5]
|
movieId |
title |
genres |
avg_rating |
0 |
1 |
Toy Story (1995) |
Adventure|Animation|Children|Comedy|Fantasy |
3.921240 |
2 |
3 |
Grumpier Old Men (1995) |
Comedy|Romance |
3.151040 |
3 |
4 |
Waiting to Exhale (1995) |
Comedy|Drama|Romance |
2.861393 |
4 |
5 |
Father of the Bride Part II (1995) |
Comedy |
3.064592 |
6 |
7 |
Sabrina (1995) |
Comedy|Romance |
3.366484 |
box_office[is_comedy & is_highly_rated][-5:]
|
movieId |
title |
genres |
avg_rating |
26736 |
131248 |
Brother Bear 2 (2006) |
Adventure|Animation|Children|Comedy|Fantasy |
4.0 |
26737 |
131250 |
No More School (2000) |
Comedy |
4.0 |
26738 |
131252 |
Forklift Driver Klaus: The First Day on the Jo… |
Comedy|Horror |
4.0 |
26739 |
131254 |
Kein Bund für’s Leben (2007) |
Comedy |
4.0 |
26740 |
131256 |
Feuer, Eis & Dosenbier (2002) |
Comedy |
4.0 |
Vectorized String Operations
movies.head()
|
movieId |
title |
genres |
0 |
1 |
Toy Story (1995) |
Adventure|Animation|Children|Comedy|Fantasy |
1 |
2 |
Jumanji (1995) |
Adventure|Children|Fantasy |
2 |
3 |
Grumpier Old Men (1995) |
Comedy|Romance |
3 |
4 |
Waiting to Exhale (1995) |
Comedy|Drama|Romance |
4 |
5 |
Father of the Bride Part II (1995) |
Comedy |
Split ‘genres’ into multiple columns
movie_genres = movies['genres'].str.split('|', expand=True)
movie_genres[:10]
|
0 |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
0 |
Adventure |
Animation |
Children |
Comedy |
Fantasy |
None |
None |
None |
None |
None |
1 |
Adventure |
Children |
Fantasy |
None |
None |
None |
None |
None |
None |
None |
2 |
Comedy |
Romance |
None |
None |
None |
None |
None |
None |
None |
None |
3 |
Comedy |
Drama |
Romance |
None |
None |
None |
None |
None |
None |
None |
4 |
Comedy |
None |
None |
None |
None |
None |
None |
None |
None |
None |
5 |
Action |
Crime |
Thriller |
None |
None |
None |
None |
None |
None |
None |
6 |
Comedy |
Romance |
None |
None |
None |
None |
None |
None |
None |
None |
7 |
Adventure |
Children |
None |
None |
None |
None |
None |
None |
None |
None |
8 |
Action |
None |
None |
None |
None |
None |
None |
None |
None |
None |
9 |
Action |
Adventure |
Thriller |
None |
None |
None |
None |
None |
None |
None |
Add a new column for comedy genre flag
movie_genres['isComedy'] = movies['genres'].str.contains('Comedy')
movie_genres[:10]
|
0 |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
isComedy |
0 |
Adventure |
Animation |
Children |
Comedy |
Fantasy |
None |
None |
None |
None |
None |
True |
1 |
Adventure |
Children |
Fantasy |
None |
None |
None |
None |
None |
None |
None |
False |
2 |
Comedy |
Romance |
None |
None |
None |
None |
None |
None |
None |
None |
True |
3 |
Comedy |
Drama |
Romance |
None |
None |
None |
None |
None |
None |
None |
True |
4 |
Comedy |
None |
None |
None |
None |
None |
None |
None |
None |
None |
True |
5 |
Action |
Crime |
Thriller |
None |
None |
None |
None |
None |
None |
None |
False |
6 |
Comedy |
Romance |
None |
None |
None |
None |
None |
None |
None |
None |
True |
7 |
Adventure |
Children |
None |
None |
None |
None |
None |
None |
None |
None |
False |
8 |
Action |
None |
None |
None |
None |
None |
None |
None |
None |
None |
False |
9 |
Action |
Adventure |
Thriller |
None |
None |
None |
None |
None |
None |
None |
False |
Extract year from title e.g. (1995)
movies['year'] = movies['title'].str.extract('.*\((.*)\).*', expand=True)
movies.tail()
|
movieId |
title |
genres |
year |
27273 |
131254 |
Kein Bund für’s Leben (2007) |
Comedy |
2007 |
27274 |
131256 |
Feuer, Eis & Dosenbier (2002) |
Comedy |
2002 |
27275 |
131258 |
The Pirates (2014) |
Adventure |
2014 |
27276 |
131260 |
Rentun Ruusu (2001) |
(no genres listed) |
2001 |
27277 |
131262 |
Innocence (2014) |
Adventure|Fantasy|Horror |
2014 |
More here: http://pandas.pydata.org/pandas-docs/stable/text.html#text-string-methods
Parsing Timestamps
Timestamps are common in sensor data or other time series datasets.
Let us revisit the *tags.csv* dataset and read the timestamps!
tags = pd.read_csv('./movielens/tags.csv', sep=',')
tags.dtypes
userId int64
movieId int64
tag object
timestamp int64
dtype: object
Unix time / POSIX time / epoch time records
time in seconds
since midnight Coordinated Universal Time (UTC) of January 1, 1970
tags.head(5)
|
userId |
movieId |
tag |
timestamp |
0 |
18 |
4141 |
Mark Waters |
1240597180 |
1 |
65 |
208 |
dark hero |
1368150078 |
2 |
65 |
353 |
dark hero |
1368150079 |
3 |
65 |
521 |
noir thriller |
1368149983 |
4 |
65 |
592 |
dark hero |
1368150078 |
tags['parsed_time'] = pd.to_datetime(tags['timestamp'], unit='s')
Data Type datetime64[ns] maps to either
tags['parsed_time'].dtype
dtype(‘
tags.head(2)
|
userId |
movieId |
tag |
timestamp |
parsed_time |
0 |
18 |
4141 |
Mark Waters |
1240597180 |
2009-04-24 18:19:40 |
1 |
65 |
208 |
dark hero |
1368150078 |
2013-05-10 01:41:18 |
Selecting rows based on timestamps
greater_than_t = tags['parsed_time'] > '2015-02-01'
selected_rows = tags[greater_than_t]
tags.shape, selected_rows.shape
((465564, 5), (12130, 5))
Sorting the table using the timestamps
tags.sort_values(by='parsed_time', ascending=True)[:10]
|
userId |
movieId |
tag |
timestamp |
parsed_time |
333932 |
100371 |
2788 |
monty python |
1135429210 |
2005-12-24 13:00:10 |
333927 |
100371 |
1732 |
coen brothers |
1135429236 |
2005-12-24 13:00:36 |
333924 |
100371 |
1206 |
stanley kubrick |
1135429248 |
2005-12-24 13:00:48 |
333923 |
100371 |
1193 |
jack nicholson |
1135429371 |
2005-12-24 13:02:51 |
333939 |
100371 |
5004 |
peter sellers |
1135429399 |
2005-12-24 13:03:19 |
333922 |
100371 |
47 |
morgan freeman |
1135429412 |
2005-12-24 13:03:32 |
333921 |
100371 |
47 |
brad pitt |
1135429412 |
2005-12-24 13:03:32 |
333936 |
100371 |
4011 |
brad pitt |
1135429431 |
2005-12-24 13:03:51 |
333937 |
100371 |
4011 |
guy ritchie |
1135429431 |
2005-12-24 13:03:51 |
333920 |
100371 |
32 |
bruce willis |
1135429442 |
2005-12-24 13:04:02 |
Average Movie Ratings over Time
## Are Movie ratings related to the year of launch?
average_rating = ratings[['movieId','rating']].groupby('movieId', as_index=False).mean()
average_rating.tail()
|
movieId |
rating |
26739 |
131254 |
4.0 |
26740 |
131256 |
4.0 |
26741 |
131258 |
2.5 |
26742 |
131260 |
3.0 |
26743 |
131262 |
4.0 |
joined = movies.merge(average_rating, on='movieId', how='inner')
joined.head()
joined.corr()
|
movieId |
rating |
movieId |
1.000000 |
-0.090369 |
rating |
-0.090369 |
1.000000 |
yearly_average = joined[['year','rating']].groupby('year', as_index=False).mean()
yearly_average[:10]
|
year |
rating |
0 |
1891 |
3.000000 |
1 |
1893 |
3.375000 |
2 |
1894 |
3.071429 |
3 |
1895 |
3.125000 |
4 |
1896 |
3.183036 |
5 |
1898 |
3.850000 |
6 |
1899 |
3.625000 |
7 |
1900 |
3.166667 |
8 |
1901 |
5.000000 |
9 |
1902 |
3.738189 |
import matplotlib.pyplot as plt
yearly_average[-20:].plot(x='year', y='rating', figsize=(15,10), grid=True)
plt.show()
Do some years look better for the boxoffice movies than others?
Does any data point seem like an outlier in some sense?
注意: 本文为edx上 UCSD 的课程py for data science笔记