Introduction to Data Science w3 Advanced python pandas笔记

程序员文章站 2024-01-04 20:49:16

...

merging data frame
查看Python文档，DataFrame.merge()方法能够按照一定条件将两个表相连。
参数为：

right : DataFrame
how : {‘left’, ‘right’, ‘outer’, ‘inner’}, default ‘inner’
left: use only keys from left frame, preserve key order
right: use only keys from right frame, preserve key order
outer: use union of keys from both frames, sort keys lexicographically
on : label or list
Field names to join on. Must be found in both DataFrames. If on is None and not merging on indexes, then it merges on the intersection of the columns by default.
left_on, right_on: 用哪些左或右的列名作为join on
left_index, right_index : boolean, default False
保留左或右的Index.

pandas idioms
1.method chaining
(注意chain indexing是一定要避免的，因为有可能会返回一个复制的dataframe或一个dataframe的视图，不确定。
例如df.loc[‘a’][‘b’])

method chaining

df.drop(df[df['Quantity'] == 0].index)
.rename(columns={'Weight': 'Weight (oz.)'})

.apply()
apply可以使用lambda或function，要想让function适用在全部行上，需要参数axis=1

rows = ['POPESTIMATE2010',
        'POPESTIMATE2011',
        'POPESTIMATE2012',
        'POPESTIMATE2013',
        'POPESTIMATE2014',
        'POPESTIMATE2015']
df.apply(lambda x: np.max(x[rows]), axis=1)

groupby（）
发现了一个神奇的用法：groupby以后直接用group和frame来遍历。注意groupby之前

for group, frame in df.groupby('STNAME'):
    avg = np.average(frame['CENSUS2010POP'])
    print('Counties in state ' + group + ' have an average population of ' + str(avg))

按照stname作为groupby，只投影其中一列，计算它的sum和avg的方法如下：
(df.set_index('STNAME').groupby(level=0)['CENSUS2010POP'] .agg({'avg': np.average, 'sum': np.sum}))

scales and ordinals
这是一列用字母和加减号标识的成绩，为了能用boolean mask寻找比C小的，需要使用astype把它们转换成category value，再使用ordinal。

df = pd.DataFrame(['A+', 'A', 'A-', 'B+', 'B', 'B-', 'C+', 'C', 'C-', 'D+', 'D'],
                  index=['excellent', 'excellent', 'excellent', 'good', 'good', 'good', 'ok', 'ok', 'ok', 'poor', 'poor'])
df.rename(columns={0: 'Grades'}, inplace=True)

grades = df['Grades'].astype('category',
                             categories=['D', 'D+', 'C-', 'C', 'C+', 'B-', 'B', 'B+', 'A-', 'A', 'A+'],
                             ordered=True)

如果不这样做，c+和c-都会被显示为大于c。

cut

df = pd.read_csv('census.csv')
df = df[df['SUMLEV']==50]
df = df.set_index('STNAME').groupby(level=0)['CENSUS2010POP'].agg({'avg': np.average})
pd.cut(df['avg'],10)

pivot table
The levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the index and columns of the result DataFrame.
参数：
1.values : 用来归类的值
2.index : 用哪个或哪些列的值作为归类的index
3.columns: 用哪个或哪些列的值作为归类的column
4.aggfunc : 默认为numpy.mean
5.fill_value : Value to replace missing values with
6.margins : Add all row / columns (e.g. for subtotal / grand totals)

例子：
A B C D
0 foo one small 1
1 foo one large 2
2 foo one large 2
3 foo two small 3
4 foo two small 3
5 bar one large 4
6 bar one small 5
7 bar two small 6
8 bar two large 7

>>> table = pivot_table(df, values='D', index=['A', 'B'],
...                     columns=['C'], aggfunc=np.sum)
>>> table
... 
C        large  small
A   B
bar one    4.0    5.0
    two    7.0    6.0
foo one    4.0    1.0
    two    NaN    6.0

新知识总结：
1. 连接两张表
pd.merge(df1, df2, left_index = True, right_index = True, left_on = ,right_on = ,how = 'inner, outer, left, right')
2.在一个Index上，对列做groupby操作，然后使用集合函数

 df.set_index('STNAME').groupby(level=0)['CENSUS2010POP']
    .agg({'avg': np.average, 'sum': np.sum})

3.怎样得到ordinal category

grades = df['Grades'].astype('category',
categories=['D', 'D+', 'C-', 'C', 'C+', 'B-', 'B', 'B+', 'A-', 'A', 'A+'],
ordered=True)

4.让某一列的值生成10个bin，用来做后续的groupby

pd.cut(df['avg'],10)

5.pivot_table，index取a和b，以c的内容作为列的分类，以d的值作为agg的来源

table = pivot_table(df, values='D', index=['A', 'B'],
...                     columns=['C'], aggfunc=np.sum)