欢迎您访问程序员文章站本站旨在为大家提供分享程序员计算机编程知识!
您现在的位置是: 首页

【课程】Introduction to Data Science in Python Week3

程序员文章站 2024-01-04 20:58:16
...

Week 3 : Advanced Python Pandas

Merging Dataframes

When we want to join the DataFrames together, we have some choices to make.
outer join: a union. In the Venn diagram, it represents everyone in any circle.

pd.merge(dataframe1, dataframe2, how='outer', left_index=True, right_index=True)

iner join: the intersection. This is represented in the Venn diagram as the overlapping parts of each circle.

pd.merge(df1, df2, how='inner', left_index=True, right_index=True)
staff_df = pd.DataFrame([{'Name': 'Kelly', 'Role': 'Director of HR'},
                         {'Name': 'Sally', 'Role': 'Course liasion'},
                         {'Name': 'James', 'Role': 'Grader'}])
staff_df = staff_df.set_index('Name')
student_df = pd.DataFrame([{'Name': 'James', 'School': 'Business'},
                           {'Name': 'Mike', 'School': 'Law'},
                           {'Name': 'Sally', 'School': 'Engineering'}])
student_df = student_df.set_index('Name')
print(staff_df.head())
print()
print(student_df.head())

left join: we want to get a list of all staff regardless of whether they were students or not. But if they were students, we would want to get their student details as well.

pd.merge(staff_df, student_df, how='left', left_index=True, right_index=True)

right join: we want a list of all of the students and their roles if they were also staff.

pd.merge(staff_df, student_df, how='right', left_index=True, right_index=True)

表格中_x是left dataframe的信息, _y是right dataframe的信息。

Eg. Here are two DataFrames, products and invoices. The product DataFrame has an identifier and a sticker price. The invoices DataFrame lists the people, product identifiers, and quantity. Assuming that we want to generate totals, how do we join these two DataFrames together so that we have one which lists all of the information we need?

print(pd.merge(products, invoices, left_index=True, right_on='Product ID'))
staff_df = pd.DataFrame([{'First Name': 'Kelly', 'Last Name': 'Desjardins', 'Role': 'Director of HR'},
                         {'First Name': 'Sally', 'Last Name': '*s', 'Role': 'Course liasion'},
                         {'First Name': 'James', 'Last Name': 'Wilde', 'Role': 'Grader'}])
student_df = pd.DataFrame([{'First Name': 'James', 'Last Name': 'Hammond', 'School': 'Business'},
                           {'First Name': 'Mike', 'Last Name': 'Smith', 'School': 'Law'},
                           {'First Name': 'Sally', 'Last Name': '*s', 'School': 'Engineering'}])
staff_df
student_df
pd.merge(staff_df, student_df, how='inner', left_on=['First Name','Last Name'], right_on=['First Name','Last Name'])

left_on: 左侧Dataframe中的列或索引级别用作键。可以是列名,索引级名称,可以是长度等于Dataframe长度的数组。
left_index: 如果为True,则使用左侧Dataframe中的索引(行标签)作为其链接键。

Pandas Idioms

Pandorable

(df.where(df['SUMLEV']==50)
    .dropna()
    .set_index(['STNAME','CTYNAME'])
    .rename(columns={'ESTIMATESBASE2010': 'Estimates Base 2010'}))

Eg. Suppose we are working on a DataFrame that holds information on our equipment for an upcoming backpacking trip. Can you use method chaining to modify the DataFrame df in one statement to drop any entries where ‘Quantity’ is 0 and rename the column ‘Weight’ to ‘Weight (oz.)’?

print(df.where(df['Quantity']!=0)
        .dropna()
        .rename(columns={'Weight':'Weight(oz.)'}))
# OR
print(df.drop(df[df['Quantity'] == 0].index).rename(columns={'Weight': 'Weight (oz.)'}))

Group by

Split

df = df.set_index('STNAME')

def fun(item):
    if item[0]<'M':
        return 0
    if item[0]<'Q':
        return 1
    return 2

for group, frame in df.groupby(fun):
    print('There are ' + str(len(frame)) + ' records in group ' + str(group) + ' for processing.')

Apply

df.groupby('STNAME').agg({'CENSUS2010POP': np.average})
print(df.groupby('Category').apply(lambda df,a,b: sum(df[a] * df[b]), 'Weight (oz.)', 'Quantity'))
 Or alternatively without using a lambda:
# def totalweight(df, w, q):
#        return sum(df[w] * df[q])
#        
# print(df.groupby('Category').apply(totalweight, 'Weight (oz.)', 'Quantity'))

agg 方法将一个函数使用在一个数列上,然后返回一个标量的值。也就是说agg每次传入的是一列数据,对其聚合后返回标量。
apply 是一个更一般化的方法:将一个数据分拆-应用-汇总。而apply会将当前分组后的数据一起传入,可以返回多维数据。

一个讲解groupby很好的网页 Pandas Groupby 使用

上一篇:

下一篇: