Learn Python Data Analysis and Machine Learning Key Points in 2 Minutes (Part 2)
程序员文章站
2023-12-21 13:06:58
Section 3: The Pandas Toolkit
1. Reading Files with Pandas in Two Tools
1.1 Reading in Jupyter
Pandas: a library for data analysis and processing
import pandas as pd
df = pd.read_csv('./data/titanic.csv')
.head() returns the first rows of the DataFrame: 5 by default, or as many as you pass in.
df.head(6)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
df.tail()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.00 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.00 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.45 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.00 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.75 NaN Q
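As a small self-contained illustration of head and tail, the sketch below uses a made-up five-row frame. The `toy` DataFrame and its values are assumptions for illustration, not the Titanic data.

```python
import pandas as pd

# Hypothetical toy data standing in for a larger CSV file
toy = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                    'fare': [7.25, 71.28, 7.92, 53.10, 8.05]})

first_two = toy.head(2)     # first 2 rows
last_two = toy.tail(2)      # last 2 rows
default_head = toy.head()   # no argument: at most 5 rows

print(first_two)
print(last_two)
```

Both methods return a new DataFrame and leave the original unchanged.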
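.info() prints a summary of the DataFrame: the index range, each column's non-null count and dtype, and the memory usage.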
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
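The non-null counts above imply missing values (Age, for example, has 714 non-null entries out of 891). One way to count them directly is isnull().sum(). The following is a minimal sketch on made-up data; the `toy` frame is an assumption, not the Titanic file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with some missing entries
toy = pd.DataFrame({'age': [22.0, np.nan, 26.0, np.nan],
                    'cabin': ['C85', None, None, None],
                    'fare': [7.25, 71.28, 7.92, 53.10]})

# isnull() marks missing cells as True; sum() counts them per column
missing = toy.isnull().sum()
print(missing)
```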
df.index
RangeIndex(start=0, stop=891, step=1)
df.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
dtype='object')
df.dtypes
PassengerId int64
Survived int64
Pclass int64
Name object
Sex object
Age float64
SibSp int64
Parch int64
Ticket object
Fare float64
Cabin object
Embarked object
dtype: object
df.values
array([[1, 0, 3, ..., 7.25, nan, 'S'],
[2, 1, 1, ..., 71.2833, 'C85', 'C'],
[3, 1, 3, ..., 7.925, nan, 'S'],
...,
[889, 0, 3, ..., 23.45, nan, 'S'],
[890, 1, 1, ..., 30.0, 'C148', 'C'],
[891, 0, 3, ..., 7.75, nan, 'Q']], dtype=object)
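The dtype=object in the array above comes from mixing text and numeric columns in a single NumPy array. A small sketch on assumed toy data shows the difference when only a numeric column is selected:

```python
import pandas as pd

# Hypothetical toy frame with one text column and one float column
toy = pd.DataFrame({'name': ['a', 'b'], 'age': [22.0, 38.0]})

mixed = toy.values              # text + float columns -> object array
numeric = toy[['age']].values   # float column only -> float64 array

print(mixed.dtype, numeric.dtype)
print(toy.shape)                # (rows, columns)
```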
Creating a DataFrame yourself
data = {'country':['aaa','bbb','ccc'],
'population':[10,12,14]}
df_data = pd.DataFrame(data)
df_data
country population
0 aaa 10
1 bbb 12
2 ccc 14
df_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
country 3 non-null object
population 3 non-null int64
dtypes: int64(1), object(1)
memory usage: 128.0+ bytes
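A DataFrame can also be built from a list of row dictionaries. The sketch below reuses the same toy values; the `rows` variable is a hypothetical name introduced here. It should yield the same frame as the dictionary-of-columns form.

```python
import pandas as pd

# Dictionary of columns, as in the text
data = {'country': ['aaa', 'bbb', 'ccc'],
        'population': [10, 12, 14]}
by_columns = pd.DataFrame(data)

# The same data as a list of row dictionaries (hypothetical alternative layout)
rows = [{'country': 'aaa', 'population': 10},
        {'country': 'bbb', 'population': 12},
        {'country': 'ccc', 'population': 14}]
by_rows = pd.DataFrame(rows)

same = by_columns.equals(by_rows)
print(by_rows)
print(same)
```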
Selecting specific data
age = df['Age']
age[:5]
0 22.0
1 38.0
2 26.0
3 35.0
4 35.0
Name: Age, dtype: float64
A Series is a single row or column of a DataFrame.
age.index
RangeIndex(start=0, stop=891, step=1)
age.values[:5]
array([ 22., 38., 26., 35., 35.])
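To make the row/column point concrete, here is a minimal sketch on an assumed toy frame: selecting a column and selecting a row both give a Series, but with different indexes.

```python
import pandas as pd

# Hypothetical toy frame
toy = pd.DataFrame({'name': ['a', 'b', 'c'], 'age': [22.0, 38.0, 26.0]})

col = toy['age']    # a column: Series indexed by the row labels
row = toy.iloc[0]   # a row: Series indexed by the column names

print(type(col).__name__, type(row).__name__)
print(list(col.index))
print(list(row.index))
```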
df.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
We can also set the index ourselves
df = df.set_index('Name')
df.head()
PassengerId Survived Pclass Sex Age SibSp Parch Ticket Fare Cabin Embarked
Name
Braund, Mr. Owen Harris 1 0 3 male 22.0 1 0 A/5 21171 7.2500 NaN S
Cumings, Mrs. John Bradley (Florence Briggs Thayer) 2 1 1 female 38.0 1 0 PC 17599 71.2833 C85 C
Heikkinen, Miss. Laina 3 1 3 female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
Futrelle, Mrs. Jacques Heath (Lily May Peel) 4 1 1 female 35.0 1 0 113803 53.1000 C123 S
Allen, Mr. William Henry 5 0 3 male 35.0 0 0 373450 8.0500 NaN S
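A compact sketch of what set_index does, on assumed toy data (the names and ages are made up): it returns a new frame whose rows are labelled by the chosen column, after which .loc can look values up by label.

```python
import pandas as pd

# Hypothetical toy frame
toy = pd.DataFrame({'name': ['Ann', 'Bob'], 'age': [22.0, 35.0]})

indexed = toy.set_index('name')        # returns a new frame; toy is unchanged
bob_age = indexed.loc['Bob', 'age']    # label-based lookup: row 'Bob', column 'age'
restored = indexed.reset_index()       # turns the index back into a column

print(indexed)
print(bob_age)
```

Because set_index returns a new object, any Series extracted before the call still carries the old integer index.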
df['Age'][:5]
Name
Braund, Mr. Owen Harris 22.0
Cumings, Mrs. John Bradley (Florence Briggs Thayer) 38.0
Heikkinen, Miss. Laina 26.0
Futrelle, Mrs. Jacques Heath (Lily May Peel) 35.0
Allen, Mr. William Henry 35.0
Name: Age, dtype: float64
With names as the index, a value can be looked up by label. Take age from df after set_index; a Series taken before it still has the integer index, and the same lookup raises KeyError: 'Allen, Mr. William Henry'.
age = df['Age']
age['Allen, Mr. William Henry']
35.0
age = age + 10
age[:5]
Name
Braund, Mr. Owen Harris 32.0
Cumings, Mrs. John Bradley (Florence Briggs Thayer) 48.0
Heikkinen, Miss. Laina 36.0
Futrelle, Mrs. Jacques Heath (Lily May Peel) 45.0
Allen, Mr. William Henry 45.0
Name: Age, dtype: float64
age = age *10
age[:5]
Name
Braund, Mr. Owen Harris 320.0
Cumings, Mrs. John Bradley (Florence Briggs Thayer) 480.0
Heikkinen, Miss. Laina 360.0
Futrelle, Mrs. Jacques Heath (Lily May Peel) 450.0
Allen, Mr. William Henry 450.0
Name: Age, dtype: float64
age.mean()
396.99117647058824
age.max()
900.0
age.min()
104.2
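The arithmetic above is applied element by element, and the statistics skip missing values by default. A minimal sketch on an assumed three-element Series (labels and values are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing value
age = pd.Series([22.0, np.nan, 26.0], index=['a', 'b', 'c'])

shifted = (age + 10) * 10                     # elementwise; NaN stays NaN, labels are kept
mean_skipping = shifted.mean()                # NaN is ignored by default (skipna=True)
mean_strict = shifted.mean(skipna=False)      # NaN propagates when skipna=False

print(shifted)
print(mean_skipping, mean_strict)
```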
.describe() returns basic summary statistics for the numeric columns
df.describe()
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200
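Text columns such as Name or Sex do not appear in the table above because describe() summarises numeric columns by default. The sketch below, on assumed toy data, shows the default next to include='all', which adds count/unique/top/freq rows for the non-numeric columns.

```python
import pandas as pd

# Hypothetical toy frame with one text and one numeric column
toy = pd.DataFrame({'sex': ['male', 'female', 'male'],
                    'age': [22.0, 38.0, 26.0]})

numeric_stats = toy.describe()             # numeric columns only
all_stats = toy.describe(include='all')    # every column; inapplicable cells are NaN

print(numeric_stats)
print(all_stats)
```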
1.2 Reading in PyCharm
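import numpy as np
import pandas as pd
# m = n = 3
# test = np.ones((m, n), dtype=int)
# print(test)
# Absolute path to the CSV file
path = r'G:\nodebookPython3\lesson\titanic_train.csv'
df = pd.read_csv(path)
# Show all columns when printing
pd.set_option('display.max_columns', 100)
pd.set_option('display.width', 200)
# Keep the first 6 rows (the header row is not counted); all later steps use only these rows
df = df.head(6)
print(df)
# Help for a function
# help(pd.read_csv)
# info() prints its summary itself and returns None, so it is not wrapped in print()
df.info()
# By default pandas treats the first row of the file as column names and the rest as data
# print(df.index)
# print(df.columns)
# print(df.dtypes)
# The underlying data, as a NumPy array
# print(df.values)
# Create a DataFrame yourself from a dictionary
# data = {'country':['aaa','bbb','ccc'],
# 'population':[10,12,14]}
# df_data = pd.DataFrame(data)
# print(df_data)
# Once the data is loaded, select from it, e.g. the Age column of the CSV
df = df.set_index('Name')
age = df['Age']
# First 5 values of the column
print(age[:5])
# A single column is a Series
# print(age.index)
# print(age.values)
print(df)
# Look up Allen's age by name
print(age['Allen, Mr. William Henry'])
# Add 100 to every value in the column
age = age + 100
print(age)
print(age[:6])
# Mean
print(age.mean())
# Maximum
print(age.max())
# Minimum
print(age.min())
# Statistics are computed for numeric columns only
print(df.describe())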
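Note that the script above replaces df with its first six rows, so the later statistics cover only those rows. A possible variation is to keep the full frame and store the preview separately. The sketch below assumes a tiny in-memory CSV (`csv_text` is made up) so that it runs without the original file.

```python
import io
import pandas as pd

# Hypothetical CSV content; the last row has a missing Age
csv_text = "Name,Age\nAnn,22\nBob,35\nCid,\n"

# read_csv also accepts file-like objects, not only paths
df = pd.read_csv(io.StringIO(csv_text))

preview = df.head(2)                  # for display only; df keeps every row
age = df.set_index('Name')['Age']     # Age column labelled by Name

print(preview)
print(age['Bob'])
print(age.mean(), age.max(), age.min())
```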