欢迎您访问程序员文章站本站旨在为大家提供分享程序员计算机编程知识!
您现在的位置是: 首页

2分钟学会python数据分析与机器学习知识点(二)

程序员文章站 2023-12-21 13:06:58
...

第三节、Pandas工具包

1、Pandas读取文件操作两种工具读取

1.1 jupyter读取
Pandas:数据分析处理库
import pandas as pd
df = pd.read_csv('./data/titanic.csv')
.head()可以读取前几条数据,指定前几条都可以

6
df.head(6)
PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S
5	6	0	3	Moran, Mr. James	male	NaN	0	0	330877	8.4583	NaN	Q
df.tail()
PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
886	887	0	2	Montvila, Rev. Juozas	male	27.0	0	0	211536	13.00	NaN	S
887	888	1	1	Graham, Miss. Margaret Edith	female	19.0	0	0	112053	30.00	B42	S
888	889	0	3	Johnston, Miss. Catherine Helen "Carrie"	female	NaN	1	2	W./C. 6607	23.45	NaN	S
889	890	1	1	Behr, Mr. Karl Howell	male	26.0	0	0	111369	30.00	C148	C
890	891	0	3	Dooley, Mr. Patrick	male	32.0	0	0	370376	7.75	NaN	Q
.info返回当前的信息

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
df.index
RangeIndex(start=0, stop=891, step=1)
df.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
df.dtypes
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object
df.values
df.values
array([[1, 0, 3, ..., 7.25, nan, 'S'],
       [2, 1, 1, ..., 71.2833, 'C85', 'C'],
       [3, 1, 3, ..., 7.925, nan, 'S'],
       ..., 
       [889, 0, 3, ..., 23.45, nan, 'S'],
       [890, 1, 1, ..., 30.0, 'C148', 'C'],
       [891, 0, 3, ..., 7.75, nan, 'Q']], dtype=object)
自己创建一个dataframe结构

data = {'country':['aaa','bbb','ccc'],
       'population':[10,12,14]}
df_data = pd.DataFrame(data)
df_data
data = {'country':['aaa','bbb','ccc'],
       'population':[10,12,14]}
df_data = pd.DataFrame(data)
df_data
country	population
0	aaa	10
1	bbb	12
2	ccc	14
df_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
country       3 non-null object
population    3 non-null int64
dtypes: int64(1), object(1)
memory usage: 128.0+ bytes
取指定的数据

age = df['Age']
age[:5]
0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
Name: Age, dtype: float64
series:dataframe中的一行/列

age.index
RangeIndex(start=0, stop=891, step=1)
age.values[:5]
array([ 22.,  38.,  26.,  35.,  35.])
df.head()
PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S
df['Age'][:5]
0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
Name: Age, dtype: float64
索引我们可以自己指定

df = df.set_index('Name')
df.head()
PassengerId	Survived	Pclass	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
Name											
Braund, Mr. Owen Harris	1	0	3	male	22.0	1	0	A/5 21171	7.2500	NaN	S
Cumings, Mrs. John Bradley (Florence Briggs Thayer)	2	1	1	female	38.0	1	0	PC 17599	71.2833	C85	C
Heikkinen, Miss. Laina	3	1	3	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
Futrelle, Mrs. Jacques Heath (Lily May Peel)	4	1	1	female	35.0	1	0	113803	53.1000	C123	S
Allen, Mr. William Henry	5	0	3	male	35.0	0	0	373450	8.0500	NaN	S
df['Age'][:5]
0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
Name: Age, dtype: float64
age = df['Age']
age[:5]
0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
Name: Age, dtype: float64
import pandas as pd;
age['Allen, Mr. William Henry']
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-17-a6cf1fb56631> in <module>
      1 import pandas as pd;
----> 2 age['Allen, Mr. William Henry']

~\Anaconda3\lib\site-packages\pandas\core\series.py in __getitem__(self, key)
    866         key = com.apply_if_callable(key, self)
    867         try:
--> 868             result = self.index.get_value(self, key)
    869 
    870             if not is_scalar(result):

~\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_value(self, series, key)
   4373         try:
   4374             return self._engine.get_value(s, k,
-> 4375                                           tz=getattr(series.dtype, 'tz', None))
   4376         except KeyError as e1:
   4377             if len(self) > 0 and (self.holds_integer() or self.is_boolean()):

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index_class_helper.pxi in pandas._libs.index.Int64Engine._check_type()

KeyError: 'Allen, Mr. William Henry'

age = age + 10
age[:5]
0    32.0
1    48.0
2    36.0
3    45.0
4    45.0
Name: Age, dtype: float64
age = age *10 
age[:5]
Name
Braund, Mr. Owen Harris                                320.0
Cumings, Mrs. John Bradley (Florence Briggs Thayer)    480.0
Heikkinen, Miss. Laina                                 360.0
Futrelle, Mrs. Jacques Heath (Lily May Peel)           450.0
Allen, Mr. William Henry                               450.0
Name: Age, dtype: float64
age.mean()
396.99117647058824
age.max()
900.0
age.min()
104.2
.describe()可以得到数据的基本统计特性

df.describe()
df.describe()
PassengerId	Survived	Pclass	Age	SibSp	Parch	Fare
count	891.000000	891.000000	891.000000	714.000000	891.000000	891.000000	891.000000
mean	446.000000	0.383838	2.308642	29.699118	0.523008	0.381594	32.204208
std	257.353842	0.486592	0.836071	14.526497	1.102743	0.806057	49.693429
min	1.000000	0.000000	1.000000	0.420000	0.000000	0.000000	0.000000
25%	223.500000	0.000000	2.000000	20.125000	0.000000	0.000000	7.910400
50%	446.000000	0.000000	3.000000	28.000000	0.000000	0.000000	14.454200
75%	668.500000	1.000000	3.000000	38.000000	1.000000	0.000000	31.000000
max	891.000000	1.000000	3.000000	80.000000	8.000000	6.000000	512.329200
1.2 pycharm读取
import numpy as np;
import pandas as pd;
# m = n = 3
# test = np.ones((m, n), dtype=np.int)
# print(test)
#绝对路径
path = r'G:\nodebookPython3\lesson\titanic_train.csv'
df=pd.read_csv(path)
#设置列名全部展示
pd.set_option('display.max_columns', 100)
pd.set_option('display.width', 200)
#取前6行,不包括列名
df=df.head(6)
print(df)
#帮助函数
#print(help(pd.read_csv))
print(df.info())
#pandas默认支持,第一列就是列名,然后第二列就是数据
#print(pd.index)
#print(pd.columns)
#print(pd.dtypes)
#重要数据打印出来了,类型和结构是数组结构
#print(df.values)
#自己创建一个dataframe,字典的方式
# data = {'country':['aaa','bbb','ccc'],
#        'population':[10,12,14]}
# df_data = pd.DataFrame(data)
#print(df_data)
#创建好了以后取数据,比如取读取的csv文件中的年龄
df=df.set_index('Name')
age=df['Age']
#取得一列数据的前5条数据
age[:5]
print(age[:5])
#取出一列就是series
#print(age.index)
#print(age.values)
print(df)
#获取Allen这个人的年龄
print(age['Allen, Mr. William Henry'])
#对列整体加100
age=age+100
print(age)
print(age[:6])
#平均
print(age.mean())
#最大值
print(age.max())
#最小值
print(age.min())

#数值类型才能得到统计结果
print(df.describe())

2、Pandas索引结构

上一篇:

下一篇: