2 Python Data Analysis: Tushare Dual Moving Averages and Golden/Death Cross Dates, Pandas Data Cleaning, Concatenation, and Merging

Posted on 程序员文章站, 2022-07-10 21:28:09


Python Data Analysis

1 Stock Analysis with Tushare

1.1 Preparing the Data

Ping An Bank [000001]

import tushare as ts
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

df = ts.get_k_data(code='000001', start='2010-01')
df.to_csv('pingan.csv')
df = pd.read_csv('pingan.csv')
df.drop(labels='Unnamed: 0', axis=1, inplace=True)
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)
df.head()
'''
			 open	close	 high	  low	   volume	code
	  date						
2010-01-04	8.149	7.880	8.169	7.870	241922.76	   1
2010-01-05	7.893	7.743	7.943	7.560	556499.82	   1
2010-01-06	7.727	7.610	7.727	7.551	412143.13	   1
2010-01-07	7.610	7.527	7.660	7.444	355336.85	   1
2010-01-08	7.477	7.511	7.560	7.428	288543.06	   1
'''

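1.2 Dual Moving Average Analysis

1.2.1 Introduction to Moving Averages

A moving average (MA) averages a stock's price over a fixed period and connects the averages computed for successive dates into a line. It is a technical indicator used to observe the trend of a stock's price.
Commonly used moving averages are the 5-day, 10-day, 30-day, 60-day, 120-day, and 240-day MAs.
The 5-day and 10-day MAs are reference indicators for short-term trading and are called daily moving averages;
the 30-day and 60-day MAs are medium-term indicators and are called quarterly moving averages;
the 120-day and 240-day MAs are long-term indicators and are called yearly moving averages.

How the moving average is computed
MA = (C1 + C2 + C3 + … + CN) / N
where Ci is the closing price on day i and N is the moving-average period (number of days).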
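
The formula can be checked by hand with a few lines of plain Python. This is a minimal sketch: the helper name simple_moving_average is hypothetical, and the prices are the first five closing prices shown in section 1.1.

```python
def simple_moving_average(closes, n):
    """Return the n-day simple moving average for each day.

    Days that do not yet have n closing prices available get None.
    """
    averages = []
    for i in range(len(closes)):
        if i + 1 < n:
            averages.append(None)  # not enough history yet
        else:
            window = closes[i + 1 - n : i + 1]  # the last n closes, including day i
            averages.append(sum(window) / n)
    return averages


# Closing prices of the first five trading days shown in section 1.1
closes = [7.880, 7.743, 7.610, 7.527, 7.511]
ma5 = simple_moving_average(closes, 5)
print(ma5)
```

Rounded to four decimals, the fifth value is 7.6542, the first non-NaN entry of the 5-day average for this data.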

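1.2.2 Example

Requirement: compute the 5-day and 30-day moving averages of this stock's historical data.

1.2.2.1 The rolling Window Function

A window expands a single point into an interval that contains it, so that an operation can be applied to the whole interval.
The window parameter of rolling sets the window size; an integer gives the number of observations to look back over, including the current one.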

df['close'].rolling(5)  # Rolling [window=5, center=False, axis=0]

Applying an operation to each window
The first 4 rows do not have 5 observations to look back over, so their result is NaN.

df['close'].rolling(5).mean().head(10)
'''
date
2010-01-04       NaN
2010-01-05       NaN
2010-01-06       NaN
2010-01-07       NaN
2010-01-08    7.6542
2010-01-11    7.5804
2010-01-12    7.5240
2010-01-13    7.3952
2010-01-14    7.2836
2010-01-15    7.2058
Name: close, dtype: float64
'''
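
If the leading NaN values are unwanted, rolling also accepts a min_periods argument. The sketch below uses a small made-up Series (not the stock data) to compare the default behaviour with min_periods=1.

```python
import pandas as pd

prices = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Default: a window yields NaN until it holds 3 observations
strict = prices.rolling(3).mean()

# min_periods=1: average whatever observations are available so far
lenient = prices.rolling(3, min_periods=1).mean()

print(strict.tolist())
print(lenient.tolist())
```

With min_periods=1 the first values are averages over fewer than 3 days, so they are not true 3-day averages. Whether that is acceptable depends on the analysis.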

1.2.2.2 Plotting the Two Moving Averages

import matplotlib.pyplot as plt

ma5 = df['close'].rolling(5).mean()
ma30 = df['close'].rolling(30).mean()
plt.plot(ma5)
plt.plot(ma30)


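1.3 Golden Cross and Death Cross Dates

1.3.1 Introduction to Golden and Death Crosses

In stock analysis, a cross is the point where a short-term moving average intersects a long-term moving average.
If the short-term MA crosses above the long-term MA, the intersection is called a golden cross;
if the short-term MA crosses below the long-term MA, the intersection is called a death cross.
However, if the long-term MA is falling or flattening while the short-term MA crosses above it, the intersection does not count as a golden cross; the same applies, in reverse, to a death cross.
A golden cross, where the short-term line breaks through the long-term line from below, is a buy signal;
a death cross, where the short-term line falls through the long-term line from above, is a sell signal.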

1.3.2 Example 1

Requirement: find all golden cross and death cross dates from 2010 to the present.

ma5 = df['close'].rolling(5).mean()
ma30 = df['close'].rolling(30).mean()

s1 = ma5 < ma30
s2 = ma5 > ma30

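Analysis
To the left of a golden cross the short-term average ma5 is below the long-term average ma30, and to the right ma5 is above ma30. So on the left of a golden cross s1 is True and s2 is False, and on the right s1 is False and s2 is True. A golden cross occurs when s1 changes from True to False, and a death cross occurs when s1 changes from False to True.

day	0	1	2	3	4	5	6	7
s1	T	T	F	F	F	T	T	F
s2	F	F	T	T	T	F	F	T
s2.shift(1)		F	F	T	T	T	F	F
s1 & s2.shift(1)		F	F	F	F	T	F	F
s1 | s2.shift(1)		T	F	T	T	T	T	F
~(s1 | s2.shift(1))		F	T	F	F	F	F	T
cross			golden			death		golden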

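# Death cross: ma5 was above ma30 yesterday and is below it today
df.loc[s1 & s2.shift(1)]  # rows where a death cross occurs
death_date = df.loc[s1 & s2.shift(1)].index  # death cross dates

# Golden cross: ma5 was not above ma30 yesterday and is not below it today
# Caution: the warm-up rows where ma30 is still NaN also match this mask; Example 2 drops them
df.loc[~(s1 | s2.shift(1))]  # rows where a golden cross occurs
golden_date = df.loc[~(s1 | s2.shift(1))].index  # golden cross dates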
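
The masks can be verified on a tiny made-up series that reproduces the True/False pattern from the analysis. This is a sketch under stated assumptions: the values of short and long are hypothetical, and shift(1, fill_value=False) is used so the shifted mask stays boolean.

```python
import pandas as pd

# Hypothetical short-term average; the long-term average is held at 3 for simplicity
short = pd.Series([1, 2, 4, 5, 4, 2, 1, 5])
long = pd.Series([3] * 8)

below = short < long   # plays the role of s1
above = short > long   # plays the role of s2

# Explicit form: today's relation is the opposite of yesterday's
golden = above & below.shift(1, fill_value=False)
death = below & above.shift(1, fill_value=False)

# The form used in the text, with the first shifted value filled as False
golden_text = ~(below | above.shift(1, fill_value=False))

print(list(golden[golden].index))  # positions of golden crosses
print(list(death[death].index))    # positions of death crosses
```

On this data both golden-cross forms agree. They can differ when the two averages are exactly equal or when one of them is NaN, because the form in the text treats "not below" as if it were "above".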

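1.3.3 Example 2

Starting on January 1, 2010 with initial capital of 100,000 yuan, buy as many shares as possible at each golden cross and sell everything at each death cross. Compute the trading profit to date.

Select the data for the chosen period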

df_new = df['2010':'2020']

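Get the golden cross and death cross dates
Note that ma30 cannot be computed for the first 29 days and is NaN there, so those rows must be dropped.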

ma5_new = df_new['close'].rolling(5).mean()
ma30_new = df_new['close'].rolling(30).mean()
ma5_new = ma5_new[29:] 
ma30_new = ma30_new[29:] 
s1_new = ma5_new < ma30_new 
s2_new = ma5_new > ma30_new

# Death cross dates
death_date_new = df_new[29:].loc[s1_new & s2_new.shift(1)].index
# Golden cross dates
golden_date_new = df_new[29:].loc[~(s1_new | s2_new.shift(1))].index

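Buy as much as possible at a golden cross; sell everything at a death cross.
Shares are bought and sold at the opening price.
Special cases:

If yesterday was a golden cross, the only possible action is to buy, not to sell;
if shares are still held at the end, their value must be counted in the total profit.

Store the golden cross and death cross dates in two Series indexed by date, with value 1 for golden cross dates and 0 for death cross dates.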

golden_date_series = Series(data=1, index=golden_date_new)
death_date_series = Series(data=0, index=death_date_new)

Concatenate the two Series

all_series = pd.concat([golden_date_series, death_date_series]).sort_index()

Compute the total profit

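initial_money = 100000  # initial capital (fixed)
hold_money = initial_money  # cash on hand (varies; starts equal to the initial capital)
hold_shares = 0  # number of shares held (varies; starts at 0)

for date in all_series.index:
    # date is a label from the index of all_series
    price = df_new.loc[date, 'open']  # price per share: the opening price
    if all_series[date] == 1:
        # Golden cross: buy as many lots of 100 shares as the cash allows
        max_purchases = hold_money // (price * 100)
        hold_shares += max_purchases * 100
        hold_money -= max_purchases * 100 * price  # pay only for the shares bought today
    else:
        # Death cross: sell all shares
        hold_money += price * hold_shares
        hold_shares = 0

# Value of any shares still held (hold_shares is 0 if none remain)
hold_shares_money = hold_shares * df_new['open'].iloc[-1]
print(hold_shares_money)
# Total profit, counting the remaining shares at the last opening price
print(hold_money + hold_shares_money - initial_money)
# The original post printed hold_money - initial_money and reported -36499.500000000466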
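
The same bookkeeping can be packaged as a function and checked on numbers small enough to verify by hand. This is a sketch, not the author's code: the function name backtest, the lot_size parameter, and all prices and signals below are hypothetical.

```python
import pandas as pd


def backtest(signals, open_prices, initial_money, lot_size=100):
    """Replay buy (1) and sell (0) signals at the opening price.

    Returns the profit, counting any shares still held at the last opening price.
    """
    cash = float(initial_money)
    shares = 0
    for date, signal in signals.sort_index().items():
        price = open_prices[date]
        if signal == 1:
            lots = int(cash // (price * lot_size))  # whole lots affordable today
            shares += lots * lot_size
            cash -= lots * lot_size * price
        else:
            cash += shares * price
            shares = 0
    remaining_value = shares * open_prices.iloc[-1]
    return cash + remaining_value - initial_money


# Hypothetical prices and signals, not real market data
dates = pd.to_datetime(['2020-01-02', '2020-01-03', '2020-01-06', '2020-01-07'])
open_prices = pd.Series([10.0, 12.0, 11.0, 11.5], index=dates)
signals = pd.Series([1, 0, 1], index=dates[:3])

profit = backtest(signals, open_prices, 100000)
print(profit)
```

Here the first buy takes 10,000 shares at 10, the sale at 12 yields 120,000, and the second buy takes 10,900 shares at 11 with 100 left in cash. Valuing those shares at 11.5 gives a profit of 25,450.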

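2 Data Cleaning with Pandas

2.1 Introduction

Data cleaning is the process of re-examining and validating data in order to remove duplicate information and handle invalid and missing values.

2.2 Handling Missing Data

2.2.1 Types of Missing Data

Import the packages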

import numpy as np
import pandas as pd
from pandas import DataFrame

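Pandas has two kinds of missing values: None and np.nan.

Differences:
None cannot take part in arithmetic; doing so raises TypeError;
np.nan is a float and can take part in arithmetic; the result is nan.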

type(None)  # NoneType
type(np.nan)  # float
np.nan + 1  # nan

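Data analysis often applies arithmetic to raw data. If the missing values in the raw data are nan, they neither disturb nor interrupt the computation.
When Pandas encounters a missing value in the form of None in numeric data, it automatically converts it to nan.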

df = DataFrame(data=np.random.randint(0, 100, size=(3, 4)))
df.iloc[1,1] = None
df.iloc[0,0] = np.nan
'''
    0	    1       2	3
0	NaN	    31.0	92	81
1	17.0	NaN	    8	84
2	38.0	49.0	33	88
'''
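
The difference between the two kinds of missing value can be seen directly. The following is a small sketch on a made-up Series; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# None in arithmetic raises TypeError
try:
    None + 1
    none_failed = False
except TypeError:
    none_failed = True

# np.nan propagates through arithmetic instead of raising
nan_result = np.nan + 1

# In a numeric Series, None is stored as NaN and the dtype becomes float64
s = pd.Series([1, None, 3])

print(none_failed, nan_result, s.dtype)
print(s.sum())  # NaN is skipped by default
```

Because the None became NaN, reductions such as sum still work; they skip missing values unless skipna=False is passed.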

2.2.2 Handling Missing Data in Pandas

2.2.2.1 Detecting Missing Data: isnull, notnull, any, all

isnull, notnull
These test whether each element is missing.

df.isnull()  # True for missing values, False otherwise
'''
    0	    1	    2	    3
0	True	False	False	False
1	False	True	False	False
2	False	False	False	False
'''

df.notnull()  # False for missing values, True otherwise
'''
    0	    1	    2	    3
0	False	True	True	True
1	True	False	True	True
2	True	True	True	True
'''

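any, all
any: returns True if at least one value in the sequence is True, otherwise False;
all: returns True only if every value in the sequence is True, otherwise False.
Combining isnull/notnull with any/all tests whether each row or column contains missing data.
axis=0 (the default) checks each column for missing values;
axis=1 checks each row for missing values.

df.isnull().any()
df.isnull().any(axis=0)  # columns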
'''
0     True
1     True
2     False
3     False
dtype: bool
'''

df.isnull().any(axis=1)  # rows
'''
0     True
1     True
2     False
dtype: bool
'''
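
A related idiom counts the missing values instead of only detecting them, since True sums as 1. The sketch below rebuilds a DataFrame with the same layout of missing values as the example; the name df_demo is illustrative.

```python
import numpy as np
import pandas as pd

# Same layout of missing values as the example above
df_demo = pd.DataFrame([[np.nan, 31, 92, 81],
                        [17, np.nan, 8, 84],
                        [38, 49, 33, 88]])

missing_per_column = df_demo.isnull().sum()       # axis=0 by default
missing_per_row = df_demo.isnull().sum(axis=1)
total_missing = int(df_demo.isnull().sum().sum())

print(missing_per_column.tolist())
print(missing_per_row.tolist())
print(total_missing)
```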

Combining with df.loc

Select the rows that contain no missing data, and their row labels

df.loc[df.notnull().all(axis=1)]  # rows
'''
	0		1		2	3
2	38.0	49.0	33	88
'''
df.loc[df.notnull().all(axis=1)].index
# Int64Index([2], dtype='int64')

Select the rows that contain missing data, and their row labels

df.loc[df.isnull().any(axis=1)]
'''
	0		1		2	3
0	NaN		31.0	92	81
1	17.0	NaN		8	84
'''
df.loc[df.isnull().any(axis=1)].index
# Int64Index([0, 1], dtype='int64')

2.2.2.2 Removing Missing Data: dropna

Drop rows by their row labels

nan_indexs = df.loc[df.isnull().any(axis=1)].index
df.drop(labels=nan_indexs, axis=0)
'''
	0		1		2	3
2	38.0	49.0	33	88
'''

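Drop the rows/columns that contain missing values directly
Note that in the drop family of functions, axis=0 refers to rows and axis=1 to columns.

df.dropna(axis=0)  # rows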
'''
	0		1		2	3
2	38.0	49.0	33	88
'''
df.dropna(axis=1)  # columns
'''
	2	3
0	92	81
1	8	84
2	33	88
'''
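
dropna has further options that control how strict the removal is. The sketch below, on a made-up DataFrame, illustrates the how, thresh, and subset parameters; the column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    'a': [1.0, np.nan, np.nan, 4.0],
    'b': [np.nan, np.nan, 3.0, 5.0],
    'c': [7.0, np.nan, 9.0, 6.0],
})

all_missing_dropped = df_demo.dropna(how='all')  # drop rows where every value is missing
complete_enough = df_demo.dropna(thresh=3)       # keep rows with at least 3 non-missing values
a_required = df_demo.dropna(subset=['a'])        # only consider missing values in column 'a'

print(all_missing_dropped.index.tolist())
print(complete_enough.index.tolist())
print(a_required.index.tolist())
```

These options help when dropping every incomplete row would discard too much data.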

2.2.2.3 Filling Missing Data: fillna

Fill missing values with a specified value.

df.fillna(value=0)
'''
	0		1		2	3
0	0.0		31.0	92	81
1	17.0	0.0		8	84
2	38.0	49.0	33	88
'''

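Fill missing values with a neighboring value.
method='ffill' fills a missing value with the value before it;
method='bfill' fills a missing value with the value after it.
axis={0 or 'index', 1 or 'columns'}: 0 fills up and down along each column, 1 fills left and right along each row.
The axis parameter is used together with method. Usually a missing value is filled with the previous or next value in the same column, which calls for axis=0.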

df
'''
	0		1		2	3
0	NaN		84.0	43	14
1	28.0	NaN		67	25
2	66.0	94.0	48	24
'''

df.fillna(method='ffill', axis=0)  
df.fillna(method='ffill', axis='index')  
# Fill each missing value with the value before it.
'''
	0		1		2	3
0	NaN		84.0	43	14
1	28.0	84.0	67	25
2	66.0	94.0	48	24
'''
df.fillna(method='bfill', axis=0)  
df.fillna(method='bfill', axis='index')  
# Fill each missing value with the value after it.
'''
	0		1		2	3
0	28.0	84.0	43	14
1	28.0	94.0	67	25
2	66.0	94.0	48	24
'''
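
Recent pandas releases deprecate the method keyword of fillna. The dedicated ffill and bfill methods give the same results, as this sketch shows by rebuilding the DataFrame from the example; the name df_demo is illustrative.

```python
import numpy as np
import pandas as pd

# Same values as the DataFrame shown above
df_demo = pd.DataFrame([[np.nan, 84, 43, 14],
                        [28, np.nan, 67, 25],
                        [66, 94, 48, 24]])

forward = df_demo.ffill(axis=0)   # same result as fillna(method='ffill', axis=0)
backward = df_demo.bfill(axis=0)  # same result as fillna(method='bfill', axis=0)

print(forward)
print(backward)
```

The forward fill leaves the NaN in the first row of column 0 untouched because no earlier value exists in that column.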

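2.2.3 Missing Data Case Study

About the data: temperature readings from one cold-storage room; columns 1-7 correspond to 7 temperature sensors, sampled once per minute.
Data source: https://download.csdn.net/download/qq_36565509/12619016
Inspect the data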

raw_data = pd.read_excel('./testData.xlsx')
raw_data.head()
'''
				   time none	    1	    2	    3	    4 none1	    5	    6	    7
0	2019-01-27 17:00:00	 NaN	-24.8	-18.2	-20.8	-18.8	NaN	  NaN	  NaN	  NaN
1	2019-01-27 17:01:00	 NaN	-23.5	-18.8	-20.5	-19.8	NaN	-15.2	-14.5	-16.0
2	2019-01-27 17:02:00	 NaN	-23.2	-19.2	  NaN	  NaN	NaN	-13.0	  NaN	-14.0
3	2019-01-27 17:03:00	 NaN	-22.8	-19.2	-20.0	-20.5	NaN	  NaN	-12.2	 -9.8
4	2019-01-27 17:04:00	 NaN	-23.2	-18.5	-20.0	-18.8	NaN	-10.2	-10.8	 -8.8
'''

Drop the columns none and none1

raw_data = pd.read_excel('./testData.xlsx')
data = raw_data.drop(labels=['none', 'none1'], axis=1)
data.shape  # (1060, 8)

Option 1: drop the rows that contain missing values

data1 = data.dropna(axis=0)
data1.shape  # (927, 8)

Option 2: fill missing values with neighboring values

data2 = data.fillna(method='ffill', axis=0).fillna(method='bfill', axis=0)
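
The forward fill is followed by a backward fill because a forward fill alone cannot repair missing values at the start of a column. The sketch below shows this on a short made-up sensor series (hypothetical readings, not taken from testData.xlsx), using the ffill and bfill methods.

```python
import numpy as np
import pandas as pd

# Hypothetical readings from one sensor; the first and third values are missing
temps = pd.Series([np.nan, -15.2, np.nan, -12.2, -10.2])

only_forward = temps.ffill()   # the leading NaN has no earlier value to copy
both = temps.ffill().bfill()   # the backward pass fills what is left

print(only_forward.tolist())
print(both.tolist())
```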

2.2.4 Handling Duplicate Data

2.2.4.1 Preparing the Data

df = DataFrame(data=np.random.randint(0, 100, size=(6, 8)))
df.iloc[1] = [1,1,1,1,1,1,1,1]
df.iloc[3] = [1,1,1,1,1,1,1,1]
df.iloc[5] = [1,1,1,1,1,1,1,1]
'''
	0	1	2	3	4	5	6	7
0	37	94	78	53	36	72	36	69
1	1	1	1	1	1	1	1	1
2	68	6	28	89	5	38	35	48
3	1	1	1	1	1	1	1	1
4	84	2	87	47	62	29	92	40
5	1	1	1	1	1	1	1	1
'''

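2.2.4.2 duplicated

duplicated marks whether each row of a Series or DataFrame is a duplicate: True for duplicates, False otherwise.

pandas.DataFrame.duplicated(self, subset=None, keep='first')

subset: column label or sequence of labels used to identify duplicates; defaults to all columns;
keep='first': every occurrence except the first is marked as a duplicate;
keep='last': every occurrence except the last is marked as a duplicate;
keep=False: all identical rows are marked as duplicates.

2.2.4.3 Testing Whether Rows Are Duplicates

Every occurrence except the first is marked as a duplicate.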

df.duplicated(keep='first')
'''
0    False
1    False
2    False
3     True
4    False
5     True
dtype: bool
'''

2.2.4.4 Selecting the Non-duplicate Rows

df.loc[~df.duplicated(keep='first')]
'''
	0	1	2	3	4	5	6	7
0	37	94	78	53	36	72	36	69
1	1	1	1	1	1	1	1	1
2	68	6	28	89	5	38	35	48
4	84	2	87	47	62	29	92	40
'''

2.2.4.5 Dropping Duplicate Rows Directly: drop_duplicates

df.drop_duplicates(keep='first')
'''
	0	1	2	3	4	5	6	7
0	37	94	78	53	36	72	36	69
1	1	1	1	1	1	1	1	1
2	68	6	28	89	5	38	35	48
4	84	2	87	47	62	29	92	40
'''
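
The subset and keep parameters described in 2.2.4.2 apply to drop_duplicates as well. The following sketch uses a small made-up table; the column names sensor and value are hypothetical.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'sensor': ['a', 'a', 'b', 'b', 'c'],
    'value': [1, 1, 2, 3, 4],
})

all_copies = df_demo.duplicated(keep=False)                  # flag every member of a duplicate group
last_kept = df_demo.drop_duplicates(keep='last')             # keep the last of each identical row
one_per_sensor = df_demo.drop_duplicates(subset=['sensor'])  # compare on 'sensor' only

print(all_copies.tolist())
print(last_kept.index.tolist())
print(one_per_sensor.index.tolist())
```

Rows 2 and 3 share a sensor but differ in value, so they only count as duplicates when subset restricts the comparison to the sensor column.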

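2.2.5 Handling Outliers

Goal: build a 1000-row, 3-column (A, B, C) data set with values in the range 0-1, then clean out the outliers in column C, defined here as values greater than twice the column's standard deviation.

2.2.5.1 Preparing the Data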

df = DataFrame(data=np.random.random(size=(1000, 3)), columns=['A', 'B', 'C'])
df.head()
'''
		   A	       B	       C
0	0.118990	0.649505	0.978472
1	0.830928	0.120502	0.743489
2	0.081553	0.648753	0.952130
3	0.994736	0.408437	0.011220
4	0.959329	0.865392	0.110883
'''

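2.2.5.2 Cleaning the Data

Compute the standard deviation of column C

df['C'].std()  # 0.28356614551282616

Outlier condition: values in column C greater than twice its standard deviation.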

twice_std = df['C'].std() * 2  # 0.5671322910256523
df['C'] > twice_std
'''
0       True
1       True
2       True
3      False
4      False
...
'''

df.loc[~(df['C'] > twice_std)]
'''
3	0.537346	0.339687	0.276611
4	0.414794	0.179321	0.094958
5	0.397169	0.610316	0.420824
7	0.740718	0.730160	0.302804
...
'''
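
The rule above compares each raw value with twice the standard deviation. A commonly used alternative, which is not the rule used in this article, measures the distance from the column mean instead. The sketch below shows that variant on a small made-up column.

```python
import pandas as pd

# Hypothetical column with one value far from the rest
df_demo = pd.DataFrame({'C': [1.0] * 9 + [10.0]})

mean = df_demo['C'].mean()
std = df_demo['C'].std()

# Flag values more than two standard deviations away from the mean
is_outlier = (df_demo['C'] - mean).abs() > 2 * std
cleaned = df_demo.loc[~is_outlier]

print(len(cleaned))
```

Only the value 10.0 is flagged here. Under the raw-value rule the result depends on the scale of the data, which is why roughly the upper part of a uniform 0-1 column is removed in the article's example.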

3 Concatenation and Merging

3.1 Concatenation

3.1.1 concat

import pandas as pd

pd.concat(objs, axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False)

Key parameters
objs: a sequence (such as a list) of Series or DataFrame objects;
axis: the axis to concatenate along; 0 stacks rows, 1 places columns side by side;
join: how to handle the other axis, inner or outer.

3.1.2 Preparing the Data

import numpy as np
import pandas as pd

df1 = pd.DataFrame(data=np.random.randint(0, 100, size=(5, 3)), columns=['A', 'B', 'C'])
df1_copy = pd.DataFrame(data=np.random.randint(0, 100, size=(5, 3)), columns=['A', 'B', 'C'])
df2 = pd.DataFrame(data=np.random.randint(0, 100, size=(5, 3)), columns=['A', 'D', 'C'])

3.1.3 Types of Concatenation

3.1.3.1 Matched Concatenation

Matched concatenation: the row and column indexes are identical.

pd.concat((df1, df1_copy), axis=0)
'''
	A	B	C
0	74	1	71
1	80	84	61
2	59	93	74
3	80	59	52
4	72	4	8
0	65	0	1
1	30	70	13
2	48	40	11
3	63	48	29
4	96	93	91
'''

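3.1.3.2 Unmatched Concatenation

Unmatched concatenation: the indexes on the non-concatenation axis differ, that is, the column indexes differ when stacking vertically, or the row indexes differ when joining horizontally.

There are two ways to handle unmatched concatenation: an inner join and an outer join.
inner: keeps only the labels that match;
outer: keeps all labels whether or not they match and sets missing entries to nan.
The join parameter selects the method; the default is outer.

Outer join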

pd.concat((df1, df2), axis=0)
pd.concat((df1, df2), axis=0, join='outer')
'''
	A	B		C	D
0	67	61.0	36	NaN
1	67	63.0	11	NaN
2	80	43.0	52	NaN
3	59	39.0	18	NaN
4	7	75.0	21	NaN
0	49	NaN	    97	97.0
1	65	NaN	    21	89.0
2	64	NaN		42	88.0
3	32	NaN		62	42.0
4	58	NaN		12	66.0
'''

Inner join

pd.concat((df1, df2), axis=0, join='inner')
'''
	A	C
0	67	36
1	67	11
2	80	52
3	59	18
4	7	21
0	49	97
1	65	21
2	64	42
3	32	62
4	58	12
'''
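
In the outputs above the row labels 0-4 repeat after vertical concatenation. The sketch below, on two made-up frames, shows how ignore_index=True renumbers the rows and what axis=1 produces.

```python
import pandas as pd

left = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
right = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

stacked = pd.concat([left, right], axis=0)                        # row labels repeat
renumbered = pd.concat([left, right], axis=0, ignore_index=True)  # row labels are renumbered
side_by_side = pd.concat([left, right], axis=1)                   # columns are placed next to each other

print(stacked.index.tolist())
print(renumbered.index.tolist())
print(list(side_by_side.columns))
```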

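3.1.4 append

The append method cannot choose the concatenation direction; it only stacks rows vertically, like concat with axis=0. DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, where pd.concat replaces it.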

df1.append(df2)
'''
	A	B		C	D
0	67	61.0	36	NaN
1	67	63.0	11	NaN
2	80	43.0	52	NaN
3	59	39.0	18	NaN
4	7	75.0	21	NaN
0	49	NaN	    97	97.0
1	65	NaN	    21	89.0
2	64	NaN		42	88.0
3	32	NaN		62	42.0
4	58	NaN		12	66.0
'''
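
On pandas versions that no longer provide DataFrame.append, the same result comes from pd.concat. This is a minimal sketch with two made-up frames whose columns mirror df1 (A, B, C) and df2 (A, D, C).

```python
import pandas as pd

df_a = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]})
df_b = pd.DataFrame({'A': [7, 8], 'D': [9, 10], 'C': [11, 12]})

# Equivalent of df_a.append(df_b): stack rows, outer join on the columns
combined = pd.concat([df_a, df_b])

print(combined)
```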

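3.2 Merging

3.2.1 merge

merge combines different data sets on one or more fields (columns) and returns a new data set.

pd.merge(left=DataFrame1, right=DataFrame2, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'))

Parameters
how: inner (default), outer, left, right;
on: the column(s) to merge on; must exist in both DataFrames;
left_on: the column(s) of DataFrame1 to use as the join key;
right_on: the column(s) of DataFrame2 to use as the join key;
left_index: whether to use the row index of DataFrame1 as the join key;
right_index: whether to use the row index of DataFrame2 as the join key;
sort: whether to sort the result by the join key;
suffixes: when both data sets contain a non-key column with the same name, the suffixes _x and _y are appended in the result to tell them apart.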
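
The suffixes parameter is easiest to understand from a case where both tables carry a non-key column with the same name. The tables and column names in this sketch are made up for illustration.

```python
import pandas as pd

sales_2019 = pd.DataFrame({'employee': ['Bob', 'Lisa'], 'total': [10, 20]})
sales_2020 = pd.DataFrame({'employee': ['Bob', 'Lisa'], 'total': [15, 25]})

# Default suffixes: the clashing column becomes total_x and total_y
default_names = pd.merge(sales_2019, sales_2020, on='employee')

# Custom suffixes make the origin of each column explicit
custom_names = pd.merge(sales_2019, sales_2020, on='employee',
                        suffixes=('_2019', '_2020'))

print(list(default_names.columns))
print(list(custom_names.columns))
```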

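3.2.2 Differences Between merge and concat

merge combines data by matching values in one or more key columns (or the index);
concat simply glues tables together along the row or column direction.

3.2.3 Merge Operations

3.2.3.1 One-to-One Merge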

from pandas import DataFrame

df1 = DataFrame({
    'employee': ['Bob', 'Jake', 'Lisa'],
    'group': ['Accounting', 'Engineering', 'Engineering'],
})
'''
	employee		group
0		 Bob   Accounting
1		Jake  Engineering
2		Lisa  Engineering
'''

df2 = DataFrame({
    'employee': ['Lisa', 'Bob', 'Jake'],
    'hire_date': [2004, 2008, 2012],
})
'''
    employee	hire_date
0		Lisa	     2004
1		 Bob	     2008
2		Jake		 2012
'''

Parameter on: the column to merge on; it must exist in both DataFrames.

pd.merge(df1, df2, on='employee')
'''
	employee		group	hire_date
0		 Bob   Accounting		 2008
1		Jake  Engineering		 2012
2		Lisa  Engineering		 2004
'''

If on is not given, merge uses the columns common to both tables; for df1 and df2 the common column is employee.

pd.merge(df1, df2)
'''
	employee		group	hire_date
0		 Bob   Accounting		 2008
1		Jake  Engineering		 2012
2		Lisa  Engineering		 2004
'''
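
When a merge is expected to be one-to-one, the validate parameter of merge can enforce that assumption. The sketch below rebuilds the two tables under the illustrative names staff and hired, then adds a duplicated key to show the check failing.

```python
import pandas as pd

staff = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa'],
                      'group': ['Accounting', 'Engineering', 'Engineering']})
hired = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake'],
                      'hire_date': [2004, 2008, 2012]})

# Passes: each employee appears once on both sides
merged = pd.merge(staff, hired, on='employee', validate='one_to_one')

# A duplicated key on the right side makes the same check fail
hired_dup = pd.concat([hired, hired.iloc[[0]]], ignore_index=True)
try:
    pd.merge(staff, hired_dup, on='employee', validate='one_to_one')
    check_failed = False
except pd.errors.MergeError:
    check_failed = True

print(len(merged), check_failed)
```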

3.2.3.2 One-to-Many Merge

df3 = DataFrame({
    'employee': ['Lisa', 'Jake'],
    'group': ['Accounting', 'Engineering'],
    'hire_date': [2004, 2016],
})
'''
	employee		 group	hire_date
0		Lisa	Accounting		 2004
1		Jake   Engineering		 2016
'''

df4 = DataFrame({
	'group': ['Accounting', 'Engineering', 'Engineering'],
	'supervisor': ['Carly', 'Guido', 'Steve'],
})
'''
		  group		supervisor
0	 Accounting			 Carly
1	Engineering			 Guido
2	Engineering			 Steve
'''

pd.merge(df3, df4)  # the common column is group
'''
	employee	  	group	hire_date	supervisor
0		Lisa   Accounting		 2004	     Carly
1		Jake  Engineering	     2016		 Guido
2		Jake  Engineering		 2016		 Steve
'''

3.2.3.3 Many-to-Many Merge

Parameter how: inner (default), outer, left, right.

df1 = DataFrame({
    'employee': ['Bob', 'Jake', 'Lisa'],
    'group': ['Accounting', 'Engineering', 'Engineering'],
})
'''
	employee		group
0		 Bob   Accounting
1		Jake  Engineering
2		Lisa  Engineering
'''

df5 = DataFrame({
    'group': ['Engineering', 'Engineering', 'HR'],
    'supervisor': ['Carly', 'Guido', 'Steve'],
})
'''
		  group		supervisor
0	Engineering			 Carly
1	Engineering	         Guido
2			 HR		     Steve
'''

pd.merge(df1, df5, how='inner')
'''
	employee		group	supervisor
0		Jake  Engineering		 Carly
1		Jake  Engineering		 Guido
2		Lisa  Engineering		 Carly
3		Lisa  Engineering		 Guido
'''

pd.merge(df1, df5, how='outer')
'''
	employee		group	supervisor
0		 Bob   Accounting		   NaN
1		Jake  Engineering		 Carly
2		Jake  Engineering		 Guido
3		Lisa  Engineering		 Carly
4		Lisa  Engineering		 Guido
5		 NaN		   HR		 Steve
'''

pd.merge(df1, df5, how='left')
'''
	employee		group	supervisor
0		 Bob   Accounting		   NaN
1		Jake  Engineering		 Carly
2		Jake  Engineering		 Guido
3		Lisa  Engineering		 Carly
4		Lisa  Engineering		 Guido
'''

pd.merge(df1, df5, how='right')
'''
	employee		group	supervisor
0		Jake  Engineering		 Carly
1		Jake  Engineering		 Guido
2		Lisa  Engineering		 Carly
3		Lisa  Engineering		 Guido
4		 NaN		   HR		 Steve
'''
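
To see which side each row of an outer merge came from, merge accepts indicator=True, which adds a _merge column. The sketch rebuilds the two tables above under the illustrative names staff and bosses.

```python
import pandas as pd

staff = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa'],
                      'group': ['Accounting', 'Engineering', 'Engineering']})
bosses = pd.DataFrame({'group': ['Engineering', 'Engineering', 'HR'],
                       'supervisor': ['Carly', 'Guido', 'Steve']})

# _merge takes the values 'left_only', 'right_only', or 'both'
tagged = pd.merge(staff, bosses, how='outer', indicator=True)
counts = tagged['_merge'].value_counts()

print(tagged)
print(counts)
```

Rows marked left_only or right_only are those that an inner merge would drop: Bob's Accounting group has no supervisor row, and the HR group has no employee row.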

3.2.3.4 Normalizing Keys

When the two tables have no column in common, left_on and right_on specify the join column in the left and right table respectively.

df1 = DataFrame({
    'employee': ['Bob', 'Jake', 'Lisa'],
    'group': ['Accounting', 'Engineering', 'Engineering'],
})
'''
	employee		group
0		 Bob   Accounting
1		Jake  Engineering
2		Lisa  Engineering
'''

df6 = DataFrame({
    'name': ['Lisa', 'Bob', 'Bill'],
    'hire_date': [1998, 2016, 2007],
})
'''
	name	hire_date
0	Lisa		 1998
1	 Bob		 2016
2	Bill		 2007
'''

pd.merge(df1, df6, left_on='employee', right_on='name')
'''
	employee		group		name	hire_date
0		 Bob   Accounting		 Bob		 2016
1		Lisa  Engineering		Lisa		 1998
'''
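
After such a merge both key columns remain in the result, and for an inner merge they hold the same values. A common follow-up, sketched here with the tables rebuilt under the illustrative names staff and hired, is to drop the redundant one.

```python
import pandas as pd

staff = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa'],
                      'group': ['Accounting', 'Engineering', 'Engineering']})
hired = pd.DataFrame({'name': ['Lisa', 'Bob', 'Bill'],
                      'hire_date': [1998, 2016, 2007]})

merged = pd.merge(staff, hired, left_on='employee', right_on='name')
tidy = merged.drop(columns='name')  # 'name' repeats 'employee' after an inner merge

print(tidy)
```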

Original article: https://blog.csdn.net/qq_36565509/article/details/107338818

