Introduction to Data Science, Assignment 3: More Pandas

程序员文章站 2022-03-09 15:26:34

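These are my notes on the Week 3 assignment of the University of Michigan's Introduction to Data Science course.
1. read_excel()
pandas can read Excel files. These are the parameters I used most often: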

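sheet_name: the sheet to read; an integer position, a string name, or a list of these
skiprows: number of rows to skip at the top
skipfooter: number of rows to skip at the bottom (older pandas versions spelled this skip_footer)
usecols: which columns to read, e.g. range(2,6)
names: list of new column names, used to rename the columns
na_values: the markers the sheet currently uses for missing values; they are converted to NaN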
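As a rough sketch of how these parameters fit together, the snippet below bundles them into one call. The row counts, column positions, the "..." missing-value marker, and the helper name load_table are all assumptions made up for illustration, not values taken from the assignment.

```python
import pandas as pd

# Hypothetical option values; adjust them to the layout of your own workbook.
READ_OPTIONS = {
    "sheet_name": 0,                   # first sheet (a sheet name string also works)
    "skiprows": 2,                     # skip 2 rows of title text above the table
    "skipfooter": 1,                   # skip 1 row of notes below the table
    "usecols": list(range(2, 6)),      # keep the columns at positions 2..5
    "names": ["Country", "Energy Supply",
              "Energy Supply per Capita", "% Renewable"],
    "na_values": ["..."],              # cells containing "..." are read as NaN
}


def load_table(path):
    """Read one sheet of an Excel file using the options above."""
    return pd.read_excel(path, **READ_OPTIONS)
```

Note that names must have as many entries as usecols selects, since each kept column receives one new name.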

2. Data preprocessing
The first question has the following requirements:
(1) Rename the following list of countries (for use in later questions):
"Republic of Korea": "South Korea",
"United States of America": "United States",
"United Kingdom of Great Britain and Northern Ireland": "United Kingdom",
"China, * Special Administrative Region": "*"

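To rename these country entries I used DataFrame.replace("value", "replace value") directly. Note that replace works on cell values, so the country names have to still be an ordinary column at this point, not the index. To rename a column, use rename:

GDP = GDP.rename(columns={'Country Name': 'Country'})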
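The difference between the two calls is easier to see on a tiny table. This is a minimal sketch with made-up numbers; only the country names and the 'Country Name' column label come from the text above.

```python
import pandas as pd

# Made-up miniature table for illustration.
energy = pd.DataFrame({
    "Country Name": ["Republic of Korea", "United States of America", "Japan"],
    "Energy Supply": [1.0, 2.0, 3.0],
})

renames = {
    "Republic of Korea": "South Korea",
    "United States of America": "United States",
}

# rename() changes column labels; replace() changes cell values.
energy = energy.rename(columns={"Country Name": "Country"})
energy["Country"] = energy["Country"].replace(renames)

print(energy["Country"].tolist())
```

Passing a dict to replace handles all the renames in one call, and values that are not keys of the dict (Japan here) are left unchanged.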

(2) There are also several countries with numbers and/or parenthesis in their name. Be sure to remove these,
e.g.
'Bolivia (Plurinational State of)' should be 'Bolivia',
'Switzerland17' should be 'Switzerland'.

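Removing the parentheses and digits uses the re module for regular-expression matching, so import re is needed at the top.
Removing digits: the digits only appear at the very end of a name, so it is enough to split the string on digits and return the first piece.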

def remove_nums(line):
    # Split on runs of digits and keep the text before the first run.
    parts = re.split(r'\d+', line)
    return parts[0]

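To remove the parenthesised part I tried re.sub(), the substitution function. Its arguments are the pattern to match, the replacement, and the input string. Here the pattern matches the parentheses together with everything inside them and replaces the match with an empty string; .strip() then removes the space left at the end. strip() deletes the given characters from both ends of a string and, when called without arguments, deletes whitespace.

energy['Country'] = energy['Country'].apply(lambda x: re.sub(r'\(.*\)', '', x))
energy['Country'] = energy['Country'].map(lambda x: x.strip())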
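The two cleaning steps can also be combined into one helper. The sketch below is a variant rather than the code above: it removes digits with re.sub instead of re.split, and the helper name clean_country and the sample Series are made up for illustration.

```python
import re

import pandas as pd


def clean_country(name):
    """Drop a parenthesised suffix and any digits, then trim whitespace."""
    name = re.sub(r"\(.*\)", "", name)  # remove "(...)" and everything inside
    name = re.sub(r"\d+", "", name)     # remove footnote digits such as "17"
    return name.strip()


countries = pd.Series(["Bolivia (Plurinational State of)", "Switzerland17", "Japan"])
cleaned = countries.map(clean_country)
print(cleaned.tolist())
```

Unlike the split-based version, re.sub removes digits anywhere in the name, which is fine as long as no real country name contains a digit.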

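3. Converting a numpy float64 to a plain float
On one question I noticed I got no credit: the returned value was correct but its type was wrong, because the question asks for a float. The value computed with series.mean() came back as a numpy float64; calling .item() on it turns it into a plain float.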

def answer_five():
    df = answer_one()
    mean_energy = df['Energy Supply per Capita'].mean(axis=0, skipna = False).item()

    return mean_energy

answer_five()
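The type difference is easy to miss because numpy's float64 is a subclass of Python's float, so an isinstance check passes while an exact type check fails. A minimal sketch with made-up numbers (the mean is taken on the underlying array here so that the result is certain to be a NumPy scalar):

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, 2.0, 6.0])   # made-up numbers
mean_np = values.to_numpy().mean()    # ndarray.mean() returns a NumPy scalar
mean_py = mean_np.item()              # plain Python float

print(type(mean_np).__name__, type(mean_py).__name__)
print(isinstance(mean_np, float), type(mean_np) is float)
```

float(mean_np) gives the same result as .item() and also works when the value already happens to be a plain Python float.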

More generally, the following NumPy scalar types can all be converted to their Python counterparts:

import numpy as np
# examples using a.item(); the types shown are for Python 3
type(np.float32(0).item())  # <class 'float'>
type(np.float64(0).item())  # <class 'float'>
type(np.uint32(0).item())   # <class 'int'>
# np.asscalar(a) has been removed from NumPy; a.item() replaces it
type(np.int16(0).item())             # <class 'int'>
type(np.complex128(0).item())        # <class 'complex'>
type(np.datetime64(0, 's').item())   # <class 'datetime.datetime'>
type(np.timedelta64(0, 's').item())  # <class 'datetime.timedelta'>

4. Dropping columns
To drop rows, call drop with the row's index label.
To drop a column, pass the column name:

df = df.drop('column_name', axis=1)
where 1 is the axis number (0 for rows and 1 for columns). Recent pandas versions require axis to be passed as a keyword.

If you want to drop by integer position instead of by column name, you have to look the labels up through df.columns[ ]:

df.drop(df.columns[[0, 1, 3]], axis=1)  # df.columns is zero-based pd.Index
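The variants above can be tried side by side on a small made-up frame. This sketch uses the columns= and index= keywords, which are an alternative to passing axis.

```python
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6], "d": [7, 8]})

by_name = df.drop(columns=["b"])                   # drop a column by label
by_position = df.drop(columns=df.columns[[0, 3]])  # drop the 1st and 4th columns
without_row = df.drop(index=0)                     # drop the row labelled 0

print(list(by_name.columns), list(by_position.columns), list(without_row.index))
```

drop returns a new DataFrame each time, so df itself still has all four columns afterwards.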

5. Getting a row's label from its integer position

df.iloc[2].name

Worth remembering!
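A small sketch of the lookup, using a made-up frame indexed by country name. df.iloc[2] returns the third row as a Series, and that Series carries the row label in its name attribute.

```python
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame({"GDP": [1.0, 2.0, 3.0]}, index=["China", "Japan", "Brazil"])

label = df.iloc[2].name     # label of the row at position 2
same_label = df.index[2]    # equivalent lookup without building the row

print(label, same_label)
```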

6. Using if-else inside a lambda

df['HighRenew']=df['% Renewable'].map(lambda x:1 if x>=renewable_mean else 0)
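Here is the same pattern run on made-up percentages, next to a vectorized comparison that gives the same result without a lambda. The variable name vectorized is just for illustration.

```python
import pandas as pd

# Made-up percentages for illustration.
df = pd.DataFrame({"% Renewable": [10.0, 20.0, 60.0]})
renewable_mean = df["% Renewable"].mean()

# Conditional expression inside a lambda: 1 if at or above the mean, else 0.
df["HighRenew"] = df["% Renewable"].map(lambda x: 1 if x >= renewable_mean else 0)

# Alternative: compare the whole column at once and cast the booleans to int.
vectorized = (df["% Renewable"] >= renewable_mean).astype(int)

print(df["HighRenew"].tolist(), vectorized.tolist())
```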

7. Returning the number of elements in each group after groupby

grouped=df.groupby(['Country','bin'])    
return grouped.size()
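To see what size() returns, here is a minimal sketch on a made-up frame. The 'Continent' column and its values are assumptions for illustration; the snippet above groups by 'Country' and 'bin'.

```python
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame({
    "Continent": ["Asia", "Asia", "Europe", "Asia"],
    "bin": ["low", "low", "high", "high"],
})

# size() returns a Series indexed by the group keys (a MultiIndex here).
sizes = df.groupby(["Continent", "bin"]).size()

print(sizes)
```

Only combinations that actually occur in the data appear in the result, so there is no ('Europe', 'low') entry here.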

8. Number format
Format numbers with a comma as the thousands separator:

df['population']=df['population'].apply(lambda x:format(x,','))
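A quick check of the format call on made-up population figures. One thing to keep in mind is that the column holds strings afterwards, so it can no longer be used for arithmetic.

```python
import pandas as pd

# Made-up figures for illustration.
df = pd.DataFrame({"population": [1367645161, 317615384]})

# format(x, ',') inserts a comma every three digits and returns a string.
df["population"] = df["population"].apply(lambda x: format(x, ","))

print(df["population"].tolist())
```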