【课程】Introduction to Data Science in Python Week3
Week 3 : Advanced Python Pandas
Merging Dataframes
When we want to join the DataFrames together, we have some choices to make.
outer join: a union. In the Venn diagram, it represents everyone in any circle.
pd.merge(dataframe1, dataframe2, how='outer', left_index=True, right_index=True)
iner join: the intersection. This is represented in the Venn diagram as the overlapping parts of each circle.
pd.merge(df1, df2, how='inner', left_index=True, right_index=True)
staff_df = pd.DataFrame([{'Name': 'Kelly', 'Role': 'Director of HR'},
{'Name': 'Sally', 'Role': 'Course liasion'},
{'Name': 'James', 'Role': 'Grader'}])
staff_df = staff_df.set_index('Name')
student_df = pd.DataFrame([{'Name': 'James', 'School': 'Business'},
{'Name': 'Mike', 'School': 'Law'},
{'Name': 'Sally', 'School': 'Engineering'}])
student_df = student_df.set_index('Name')
print(staff_df.head())
print()
print(student_df.head())
left join: we want to get a list of all staff regardless of whether they were students or not. But if they were students, we would want to get their student details as well.
pd.merge(staff_df, student_df, how='left', left_index=True, right_index=True)
right join: we want a list of all of the students and their roles if they were also staff.
pd.merge(staff_df, student_df, how='right', left_index=True, right_index=True)
表格中_x是left dataframe的信息, _y是right dataframe的信息。
Eg. Here are two DataFrames, products and invoices. The product DataFrame has an identifier and a sticker price. The invoices DataFrame lists the people, product identifiers, and quantity. Assuming that we want to generate totals, how do we join these two DataFrames together so that we have one which lists all of the information we need?
print(pd.merge(products, invoices, left_index=True, right_on='Product ID'))
staff_df = pd.DataFrame([{'First Name': 'Kelly', 'Last Name': 'Desjardins', 'Role': 'Director of HR'},
{'First Name': 'Sally', 'Last Name': '*s', 'Role': 'Course liasion'},
{'First Name': 'James', 'Last Name': 'Wilde', 'Role': 'Grader'}])
student_df = pd.DataFrame([{'First Name': 'James', 'Last Name': 'Hammond', 'School': 'Business'},
{'First Name': 'Mike', 'Last Name': 'Smith', 'School': 'Law'},
{'First Name': 'Sally', 'Last Name': '*s', 'School': 'Engineering'}])
staff_df
student_df
pd.merge(staff_df, student_df, how='inner', left_on=['First Name','Last Name'], right_on=['First Name','Last Name'])
left_on: 左侧Dataframe中的列或索引级别用作键。可以是列名,索引级名称,可以是长度等于Dataframe长度的数组。
left_index: 如果为True,则使用左侧Dataframe中的索引(行标签)作为其链接键。
Pandas Idioms
Pandorable
(df.where(df['SUMLEV']==50)
.dropna()
.set_index(['STNAME','CTYNAME'])
.rename(columns={'ESTIMATESBASE2010': 'Estimates Base 2010'}))
Eg. Suppose we are working on a DataFrame that holds information on our equipment for an upcoming backpacking trip. Can you use method chaining to modify the DataFrame df in one statement to drop any entries where ‘Quantity’ is 0 and rename the column ‘Weight’ to ‘Weight (oz.)’?
print(df.where(df['Quantity']!=0)
.dropna()
.rename(columns={'Weight':'Weight(oz.)'}))
# OR
print(df.drop(df[df['Quantity'] == 0].index).rename(columns={'Weight': 'Weight (oz.)'}))
Group by
Split
df = df.set_index('STNAME')
def fun(item):
if item[0]<'M':
return 0
if item[0]<'Q':
return 1
return 2
for group, frame in df.groupby(fun):
print('There are ' + str(len(frame)) + ' records in group ' + str(group) + ' for processing.')
Apply
df.groupby('STNAME').agg({'CENSUS2010POP': np.average})
print(df.groupby('Category').apply(lambda df,a,b: sum(df[a] * df[b]), 'Weight (oz.)', 'Quantity'))
Or alternatively without using a lambda:
# def totalweight(df, w, q):
# return sum(df[w] * df[q])
#
# print(df.groupby('Category').apply(totalweight, 'Weight (oz.)', 'Quantity'))
agg 方法将一个函数使用在一个数列上,然后返回一个标量的值。也就是说agg每次传入的是一列数据,对其聚合后返回标量。
apply 是一个更一般化的方法:将一个数据分拆-应用-汇总。而apply会将当前分组后的数据一起传入,可以返回多维数据。
一个讲解groupby很好的网页 Pandas Groupby 使用
推荐阅读
-
Introduction to Data Science in Python 第 2 周 Assignment
-
introduction to data science w4
-
Coursera | Introduction to Data Science in Python(University of Michigan)| Assignment2
-
【课程】Introduction to Data Science in Python Week3
-
Introduction to Data Science w3 Advanced python pandas笔记
-
Coursera Introduction to Data Science in Python Assignment2
-
Intro to Python for Data Science Learning 6 - NumPy
-
Python for Data Science
-
Python Data Science, NumPy 1
-
Data Science完整学习路径Python版