python数分之PM2.5案例

程序员文章站 2024-03-07 17:03:15

...

文章目录

问题和数据
PeriodIndex方法介绍
绘制北京地区中国数据和美国统计的中国的数据随时间变化图
做一个真实的项目

提出问题
观察数据

利用jupyter notebook快速分析

问题和数据

现在我们有北上广、深圳、和沈阳5个城市空气质量数据，请绘制出北京这个城市的PM2.5随时间的变化情况
观察这组数据中的时间结构，并不是字符串，这个时候我们应该怎么办？
数据来源： https://www.kaggle.com/uciml/pm25-data-for-five-chinese-cities

PeriodIndex方法介绍

之前所学习的DatetimeIndex可以理解为时间戳
那么现在我们要学习的PeriodIndex可以理解为时间段

periods = pd.PeriodIndex(year=data["year"],month=data["month"],day=data["day"],hour=data["hour"],freq="H")

# 那么如果给这个时间段降采样呢？
data = df.set_index(periods).resample("10D").mean()

绘制北京地区中国数据和美国统计的中国的数据随时间变化图

import pandas as pd
from matplotlib import pyplot as plt

file_path='C:/Users/ming/Desktop/DataAnalysis-master/day06/code/PM2.5/BeijingPM20100101_20151231.csv'
df=pd.read_csv(file_path)
# # 显示所以的列
# pd.set_option('display.max_columns', None)
# print(df.head(5))
# print(df.info())
# 把分开的时间字符串通过periodIndex的方法转化为pandas的时间类型
period=pd.PeriodIndex(year=df['year'],month=df['month'],day=df['day'],hour=df['hour'],freq="H")
# print(period)
# 增加一列名为datetime
df['datetime']=period
# pd.set_option('display.max_columns', None)
# print(df.head(5))
# 将datetime设置为索引
df.set_index('datetime',inplace=True)

# 按天进行降采样
df=df.resample('7D').mean()
# 看看有多少数据，结果为313
print(df.shape)

pd.set_option('display.max_columns', None)
print(df.head(5))

# 处理缺失数据，删除缺失数据
# print(df['PM_US Post'])
# data=df['PM_US Post'].dropna()
# data_china=df['PM_Dongsi'].dropna()
# # 查看data_china后20行
# print(data_china.tail(20))

# 不取消缺失值
data=df['PM_US Post']
data_china=df['PM_Dongsi']



# 画图
_x=data.index
_x=[i.strftime("%Y%m%d")  for i in _x]
_x_china=[i.strftime("%Y%m%d")  for i in data_china.index]
_y=data.values
_y_china=data_china.values

plt.figure(figsize=(10,6),dpi=80)

plt.plot(range(len(_x)), _y,label='US Post')
plt.plot(range(len(_x_china)), _y_china,label='CN Post')

# 取步长操作
plt.xticks(range(0,len(_x),10),list(_x)[::10],rotation=45)

plt.legend()

plt.show()

python数分之PM2.5案例

做一个真实的项目

我们不应该只会简单的画图，要学会分析问题

提出问题

在此项目中，你将以一名数据分析师的身份执行数据的探索性分析。你将了解数据分析过程的基本流程。但是在你开始查看数据前，请先思考几个你需要理解的关于PM2.5的问题，例如，如果你是一名环境工作者，你会想要获得什么类型的信息来了解不同城市的环境情况？如果你是一名生活在这个城市的普通人，你可以思考PM 2.5的变化会有什么样的周期性规律？选择什么时段出行空气质量最佳？

观察数据

书写代码如下来查看数据

import pandas as pd
from matplotlib import pyplot as plt

file_path='C:/Users/ming/Desktop/DataAnalysis-master/day06/code/PM2.5/BeijingPM20100101_20151231.csv'
df=pd.read_csv(file_path)
print(df.info())

我们看到结果如下

Data columns (total 18 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   No               52584 non-null  int64  
 1   year             52584 non-null  int64  
 2   month            52584 non-null  int64  
 3   day              52584 non-null  int64  
 4   hour             52584 non-null  int64  
 5   season           52584 non-null  int64  
 6   PM_Dongsi        25052 non-null  float64
 7   PM_Dongsihuan    20508 non-null  float64
 8   PM_Nongzhanguan  24931 non-null  float64
 9   PM_US Post       50387 non-null  float64
 10  DEWP             52579 non-null  float64
 11  HUMI             52245 non-null  float64
 12  PRES             52245 non-null  float64
 13  TEMP             52579 non-null  float64
 14  cbwd             52579 non-null  object 
 15  Iws              52579 non-null  float64
 16  precipitation    52100 non-null  float64
 17  Iprec            52100 non-null  float64

对每列的标题进行解释

No: 行号
year: 年份
month: 月份
day: 日期
hour: 小时
season: 季节
PM: PM2.5浓度 (ug/m^3)
DEWP: 露点 (摄氏温度) 指在固定气压之下，空气中所含的气态水达到饱和而凝结成液态水所需要降至的温度。
TEMP: Temperature (摄氏温度)
HUMI: 湿度 (%)
PRES: 气压 (hPa)
cbwd: 组合风向
Iws: 累计风速 (m/s)
precipitation: 降水量/时 (mm)
Iprec: 累计降水量 (mm)

**问题 1：**至少写下两个你感兴趣的问题，请确保这些问题能够由现有的数据进行回答。

（问题示例：1. 2012年-2015年上海市PM 2.5的数据在不同的月份有什么变化趋势？2. 哪个城市的PM 2.5的含量较低？）

答案：

第一个问题：北京哪一年的PM 2.5平均值最高？

**第二个问题：**2015年各季节北京的平均PM 2.5是多少？

利用jupyter notebook快速分析

import csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn
%matplotlib inline

以北京数据为例子，我们先使用Pandas的read_csv函数导入第一个数据集，并使用head、info、describe方法来查看数据中的基本信息。

file_path='C:/Users/ming/Desktop/DataAnalysis-master/day06/code/PM2.5/BeijingPM20100101_20151231.csv'
Beijing_data=pd.read_csv(file_path)

Beijing_data.head()

python数分之PM2.5案例

北京数据中还包含有PM_Dongsi PM_Dongsihuan PM_Nongzhanguan PM_US Post 四个观测站点的数据。并且数据中PM2.5的这四列包含有缺失值“NaN”

Beijing_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52584 entries, 0 to 52583
Data columns (total 18 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   No               52584 non-null  int64  
 1   year             52584 non-null  int64  
 2   month            52584 non-null  int64  
 3   day              52584 non-null  int64  
 4   hour             52584 non-null  int64  
 5   season           52584 non-null  int64  
 6   PM_Dongsi        25052 non-null  float64
 7   PM_Dongsihuan    20508 non-null  float64
 8   PM_Nongzhanguan  24931 non-null  float64
 9   PM_US Post       50387 non-null  float64
 10  DEWP             52579 non-null  float64
 11  HUMI             52245 non-null  float64
 12  PRES             52245 non-null  float64
 13  TEMP             52579 non-null  float64
 14  cbwd             52579 non-null  object 
 15  Iws              52579 non-null  float64
 16  precipitation    52100 non-null  float64
 17  Iprec            52100 non-null  float64
dtypes: float64(11), int64(6), object(1)
memory usage: 7.2+ MB

变量名PM_US Post中包含空格，这也可能对我们后续的分析造成一定的困扰。因为大多数命令中，都是默认以空格做为值与值之间的分隔符，而不是做为文件名的一部分。因此我们需要将变量名中的空格改为下划线:

Beijing_data.columns = [c.replace(' ', '_') for c in Beijing_data.columns]
Beijing_data.head()

python数分之PM2.5案例

python数分之PM2.5案例

文章目录

问题和数据

PeriodIndex方法介绍

绘制北京地区中国数据和美国统计的中国的数据随时间变化图

做一个真实的项目

提出问题

观察数据

利用jupyter notebook快速分析

python数分之PM2.5案例

Python3 完全平方数案例

Python之条件判断if语句、案例及习题[ random 随机数、剪刀石头布、判断平闰年]、循环控制语句（ break、continue、exit()）

Python3 完全平方数案例