Official US Department of Labor Statistics: An Employee Attrition Case Study
程序员文章站
2024-03-22 09:41:22
By analyzing the data, we try to predict how likely an employee is to leave.
First, check whether the dataset contains any dirty data:
import pandas as pd
import numpy as np
df = pd.read_csv('HR_comma_sep.csv')
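# print(df.isnull().any())  # check whether any column contains null values
# print(np.count_nonzero(df != df))  # count NaN cells (NaN is the only value not equal to itself)
print(df.info())  # the dataset is clean, with no missing values
Output: the data is fairly clean. There are no missing values, and only two features are of object type.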
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 10 columns):
satisfaction_level 14999 non-null float64
last_evaluation 14999 non-null float64
number_project 14999 non-null int64
average_montly_hours 14999 non-null int64
time_spend_company 14999 non-null int64
Work_accident 14999 non-null int64
left 14999 non-null int64
promotion_last_5years 14999 non-null int64
sales 14999 non-null object
salary 14999 non-null object
dtypes: float64(2), int64(6), object(2)
memory usage: 1.1+ MB
None
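As a minimal sketch of what these checks report when values really are missing, the snippet below uses a small hypothetical frame (the `toy` data is invented for illustration and is not part of HR_comma_sep.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two missing cells injected on purpose
toy = pd.DataFrame({
    "satisfaction_level": [0.38, np.nan, 0.11],
    "salary": ["low", "medium", None],
})

# Number of missing values in each column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Total number of missing cells in the frame
total_missing = int(toy.isnull().sum().sum())
print(total_missing)
```

On the real dataset both numbers would be zero, which matches the non-null counts printed by df.info().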
Correct the column names (fix the misspelled 'montly' and rename 'sales' to 'department'):
df.rename(columns = {'average_montly_hours':'average_monthly_hours',
'sales':'department'},inplace = True)
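Analyze the distribution of the numeric features
# Print summary statistics for the numeric columns
print(df.describe())
Output: count, mean, standard deviation, minimum, lower quartile, median, upper quartile, and maximum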
satisfaction_level last_evaluation number_project \
count 14999.000000 14999.000000 14999.000000
mean 0.612834 0.716102 3.803054
std 0.248631 0.171169 1.232592
min 0.090000 0.360000 2.000000
25% 0.440000 0.560000 3.000000
50% 0.640000 0.720000 4.000000
75% 0.820000 0.870000 5.000000
max 1.000000 1.000000 7.000000
average_monthly_hours time_spend_company Work_accident left \
count 14999.000000 14999.000000 14999.000000 14999.000000
mean 201.050337 3.498233 0.144610 0.238083
std 49.943099 1.460136 0.351719 0.425924
min 96.000000 2.000000 0.000000 0.000000
25% 156.000000 3.000000 0.000000 0.000000
50% 200.000000 3.000000 0.000000 0.000000
75% 245.000000 4.000000 0.000000 0.000000
max 310.000000 10.000000 1.000000 1.000000
promotion_last_5years
count 14999.000000
mean 0.021268
std 0.144281
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000
Check how the values of the categorical features are distributed
print('Departments:')
print(df['department'].value_counts())
print('\nSalary:')
print(df['salary'].value_counts())
Output:
Departments:
sales 4140
technical 2720
support 2229
IT 1227
product_mng 902
marketing 858
RandD 787
accounting 767
hr 739
management 630
Name: department, dtype: int64
Salary:
low 7316
medium 6446
high 1237
Name: salary, dtype: int64
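Raw counts are easier to compare as shares. A small sketch, reusing the salary counts printed above (on the full frame, `df['salary'].value_counts(normalize=True)` gives the same proportions directly):

```python
import pandas as pd

# Salary counts copied from the value_counts() output above
salary_counts = pd.Series({"low": 7316, "medium": 6446, "high": 1237})

# Convert counts to shares of the whole workforce
shares = (salary_counts / salary_counts.sum()).round(4)
print(shares)
```

Only about 8% of employees are in the high salary band, which is worth keeping in mind when reading the per-band attrition rates later on.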
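Check the correlations between features to see whether any of them are strongly correlated. This serves as a reference for later dimensionality reduction.
print(df.select_dtypes('number').corr())  # numeric columns only; recent pandas versions raise an error on the object columns
Output: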
satisfaction_level last_evaluation number_project \
satisfaction_level 1.000000 0.105021 -0.142970
last_evaluation 0.105021 1.000000 0.349333
number_project -0.142970 0.349333 1.000000
average_monthly_hours -0.020048 0.339742 0.417211
time_spend_company -0.100866 0.131591 0.196786
Work_accident 0.058697 -0.007104 -0.004741
left -0.388375 0.006567 0.023787
promotion_last_5years 0.025605 -0.008684 -0.006064
average_monthly_hours time_spend_company \
satisfaction_level -0.020048 -0.100866
last_evaluation 0.339742 0.131591
number_project 0.417211 0.196786
average_monthly_hours 1.000000 0.127755
time_spend_company 0.127755 1.000000
Work_accident -0.010143 0.002120
left 0.071287 0.144822
promotion_last_5years -0.003544 0.067433
Work_accident left promotion_last_5years
satisfaction_level 0.058697 -0.388375 0.025605
last_evaluation -0.007104 0.006567 -0.008684
number_project -0.004741 0.023787 -0.006064
average_monthly_hours -0.010143 0.071287 -0.003544
time_spend_company 0.002120 0.144822 0.067433
Work_accident 1.000000 -0.154622 0.039245
left -0.154622 1.000000 -0.061788
promotion_last_5years 0.039245 -0.061788 1.000000
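Scanning a full matrix by eye is error-prone, so one option is to list only the pairs whose absolute correlation exceeds a cutoff. The helper below is a sketch; `strong_pairs` and its `threshold` parameter are hypothetical names, and the toy frame is invented for illustration:

```python
import pandas as pd

def strong_pairs(frame, threshold=0.3):
    """Return (feature_a, feature_b, r) for column pairs with |r| >= threshold."""
    corr = frame.select_dtypes("number").corr()
    cols = list(corr.columns)
    pairs = []
    # Walk the upper triangle so each pair is reported once
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = float(corr.iloc[i, j])
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(r, 3)))
    # Strongest relationships first
    return sorted(pairs, key=lambda p: -abs(p[2]))

# Toy data: b is an exact multiple of a, c is only weakly related to both
toy = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [1, -1, 1, -1]})
pairs = strong_pairs(toy, threshold=0.9)
print(pairs)
```

Reading the matrix above with a cutoff of 0.3, the pairs that stand out are number_project and average_monthly_hours (0.417), satisfaction_level and left (-0.388), last_evaluation and number_project (0.349), and last_evaluation and average_monthly_hours (0.340). None of these is strong enough to make a feature obviously redundant.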
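Examine the relationship between a single feature and the target class:
import matplotlib.pyplot as plt
import seaborn as sns
# factorplot was renamed to catplot in seaborn 0.9 and is no longer available in current releases
plot = sns.catplot(x='department',y='left',kind='bar',data = df)
plot.set_xticklabels(rotation = 45, horizontalalignment = 'right')
plt.show()
Output: [figure: bar chart of attrition rate by department]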
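Analyze the relationship between salary level and attrition
plot = sns.catplot(x='salary',y='left',kind='bar',data = df)
plt.show()
Output: [figure: bar chart of attrition rate by salary level]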
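The height of each bar in such a plot is the mean of `left` within the group, that is, the share of employees in that group who left. A sketch that computes the same numbers as a table, on a hypothetical toy frame:

```python
import pandas as pd

# Hypothetical toy data: 1 means the employee left, 0 means they stayed
toy = pd.DataFrame({
    "salary": ["low", "low", "low", "medium", "medium", "high"],
    "left":   [1, 1, 0, 1, 0, 0],
})

# Mean of a 0/1 column per group is the attrition rate of that group
rates = toy.groupby("salary")["left"].mean()
print(rates)
```

On the real data the equivalent call would be `df.groupby('salary')['left'].mean()`, which gives exact figures to quote alongside the chart.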
Analyze the salary distribution in the management department:
df[df['department'] == 'management']['salary'].value_counts().plot(kind = 'pie',title = 'Management salary level distribution')
plt.show()
Analyze the salary distribution in the R&D department
df[df['department'] == 'RandD']['salary'].value_counts().plot(kind = 'pie',title = 'R&D dept salary level distribution')
plt.show()
Use histograms to compare satisfaction levels of employees who left with those who stayed
bins = np.linspace(0.0001,1.0001,21)
plt.hist(df[df['left'] == 1]['satisfaction_level'],bins = bins,alpha = 0.7,label = 'Employees Left')
plt.hist(df[df['left'] == 0]['satisfaction_level'],bins = bins,alpha = 0.5,label = 'Employees Stayed')
plt.xlabel('satisfaction_level')
plt.xlim((0,1.05))
plt.legend(loc = 'best')
plt.show()
Output: [figure: overlaid histograms of satisfaction_level for employees who left and who stayed]
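Overlaid histograms show counts, so a large group can visually drown out a small one. A complementary view is the attrition rate inside each bin. The sketch below uses `pd.cut` on a hypothetical toy frame; the bin edges are chosen only for illustration:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "satisfaction_level": [0.10, 0.15, 0.40, 0.45, 0.80, 0.90],
    "left":               [1,    1,    1,    0,    0,    0],
})

# Assign each employee to a satisfaction bin
bins = [0.0, 0.25, 0.5, 0.75, 1.0]
toy["bin"] = pd.cut(toy["satisfaction_level"], bins=bins)

# Attrition rate and head count per bin (empty bins are kept)
rate_per_bin = toy.groupby("bin", observed=False)["left"].agg(["mean", "count"])
print(rate_per_bin)
```

The same approach works for last_evaluation, average_monthly_hours, and the other numeric features examined in this post.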
Analyze the relationship between the company's evaluation of an employee and attrition
bins = np.linspace(0.3501,1.0001,14)
plt.hist(df[df['left'] == 1]['last_evaluation'],bins = bins,alpha = 1,label = 'Employees Left')
plt.hist(df[df['left'] == 0]['last_evaluation'],bins = bins,alpha = 0.4,label = 'Employees Stayed')
plt.xlabel('last_evaluation')
plt.legend(loc = 'best')
plt.show()
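Output: [figure: overlaid histograms of last_evaluation for employees who left and who stayed]
Analyze the relationship between the number of projects an employee has worked on and attrition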
bins = np.linspace(1.5,7.5,7)
plt.hist(df[df['left'] == 1]['number_project'],bins = bins,alpha = 1,label = 'Employees Left')
plt.hist(df[df['left'] == 0]['number_project'],bins = bins,alpha = 0.4,label = 'Employees Stayed')
plt.xlabel('number_project')
plt.grid(axis = 'x')
plt.legend(loc = 'best')
plt.show()
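Output: [figure: overlaid histograms of number_project for employees who left and who stayed]
Analyze the relationship between average monthly working hours and attrition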
bins = np.linspace(75,325,11)
plt.hist(df[df['left'] == 1]['average_monthly_hours'],bins = bins,alpha = 1,label = 'Employees Left')
plt.hist(df[df['left'] == 0]['average_monthly_hours'],bins = bins,alpha = 0.4,label = 'Employees Stayed')
plt.xlabel('average_monthly_hours')
plt.legend(loc = 'best')
plt.show()
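Output: [figure: overlaid histograms of average_monthly_hours for employees who left and who stayed]
Analyze the relationship between years spent at the company and attrition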
bins = np.linspace(1.5,10.5,10)
plt.hist(df[df['left'] == 1]['time_spend_company'],bins = bins,alpha = 1,label = 'Employees Left')
plt.hist(df[df['left'] == 0]['time_spend_company'],bins = bins,alpha = 0.4,label = 'Employees Stayed')
plt.xlabel('time_spend_company')
plt.xlim((1,11))
plt.grid(axis = 'x')
plt.xticks(np.arange(2,11))
plt.legend(loc = 'best')
plt.show()
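Output: [figure: overlaid histograms of time_spend_company for employees who left and who stayed]
A few other features are also worth a look, for example workplace accidents (Work_accident)
Input: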
plot = sns.catplot(x='Work_accident',y='left',kind='bar',data = df)
plt.show()
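Output: [figure: bar chart of attrition rate by Work_accident]
Feature: whether the employee was promoted in the last 5 years
Input: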
plot = sns.catplot(x='promotion_last_5years',y='left',kind='bar',data = df)
plt.show()
Output: [figure: bar chart of attrition rate by promotion_last_5years]
One-hot encode the object-type features in the dataset:
X = df.drop('left',axis = 1)
y = df['left']
X.drop(['department','salary'],axis = 1,inplace=True)
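# dtype=int keeps the 0/1 indicator columns shown below (newer pandas versions default to booleans)
salary_dummy = pd.get_dummies(df['salary'],dtype = int)
department_dummy = pd.get_dummies(df['department'],dtype = int)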
X = pd.concat([X,salary_dummy],axis = 1)
X = pd.concat([X,department_dummy],axis = 1)
print(X.head())
Output:
satisfaction_level last_evaluation number_project average_monthly_hours \
0 0.38 0.53 2 157
1 0.80 0.86 5 262
2 0.11 0.88 7 272
3 0.72 0.87 5 223
4 0.37 0.52 2 159
time_spend_company Work_accident promotion_last_5years high low \
0 3 0 0 0 1
1 6 0 0 0 0
2 4 0 0 0 0
3 5 0 0 0 1
4 3 0 0 0 1
medium IT RandD accounting hr management marketing product_mng \
0 0 0 0 0 0 0 0 0
1 1 0 0 0 0 0 0 0
2 1 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0
sales support technical
0 1 0 0
1 1 0 0
2 1 0 0
3 1 0 0
4 1 0 0
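One-hot encoding treats low, medium, and high as unrelated categories. Since salary bands have a natural order, an alternative is to map them to integers. This is a sketch of that alternative, not what the post does; the `salary_order` mapping is an assumption for illustration:

```python
import pandas as pd

# Hypothetical toy column
toy = pd.DataFrame({"salary": ["low", "high", "medium", "low"]})

# Assumed ordering of the salary bands
salary_order = {"low": 0, "medium": 1, "high": 2}

# Replace each label with its rank
toy["salary_level"] = toy["salary"].map(salary_order)
print(toy)
```

For a tree-based model such as a random forest either encoding is usable. The ordinal version uses one column instead of three, at the cost of imposing an order on the categories.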
Standardize the features (shown here on a small example array):
from sklearn.preprocessing import StandardScaler
stdsc = StandardScaler()
X_example = np.array([[1.,-2.,2.],
[5.,3.,2.],
[0.,1.,-10.]])
X_example = stdsc.fit_transform(X_example)
X_example = pd.DataFrame(X_example)
print(X_example)
print(X_example.describe())
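Output: after standardization every column has mean 0. describe() reports a standard deviation of about 1.22 rather than 1 because StandardScaler divides by the population standard deviation (ddof=0), while describe() reports the sample standard deviation (ddof=1). With 3 rows the ratio between the two is sqrt(3/2) ≈ 1.2247.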
0 1 2
0 -0.46291 -1.297771 0.707107
1 1.38873 1.135550 0.707107
2 -0.92582 0.162221 -1.414214
0 1 2
count 3.000000 3.000000e+00 3.000000
mean 0.000000 3.700743e-17 0.000000
std 1.224745 1.224745e+00 1.224745
min -0.925820 -1.297771e+00 -1.414214
25% -0.694365 -5.677750e-01 -0.353553
50% -0.462910 1.622214e-01 0.707107
75% 0.462910 6.488857e-01 0.707107
max 1.388730 1.135550e+00 0.707107
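The 1.22 figure can be reproduced by hand. The sketch below standardizes the same example array with plain NumPy, using the population standard deviation, and then measures the result both ways:

```python
import numpy as np

X_example = np.array([[1., -2., 2.],
                      [5., 3., 2.],
                      [0., 1., -10.]])

# Column-wise mean and population standard deviation (ddof=0)
mean = X_example.mean(axis=0)
std_pop = X_example.std(axis=0)

# z-score: subtract the mean, divide by the standard deviation
Z = (X_example - mean) / std_pop

# Population std of the result, and the sample std (ddof=1)
# that pandas describe() reports
pop_std_after = Z.std(axis=0, ddof=0)
sample_std_after = Z.std(axis=0, ddof=1)
print(Z)
print(pop_std_after, sample_std_after)
```

With n rows the sample standard deviation of standardized data is sqrt(n / (n - 1)), so on a dataset of 14999 rows the difference from 1 is negligible.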
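Build a random forest classifier
from sklearn.model_selection import train_test_split
# The original post omits the train/test split. test_size = 0.3 is an assumption:
# it yields 4500 test rows, which is consistent with the test score reported below.
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.3)
from sklearn.preprocessing import StandardScaler
stdsc = StandardScaler()
# Fit the scaler on the training set only, then apply it to the test set.
# Note: the scaled arrays are not used below; the forest is trained on the raw features.
X_train_std = stdsc.fit_transform(X_train)
X_test_std = stdsc.transform(X_test)
from sklearn.model_selection import ShuffleSplit
# n_splits is the number of random train/validation splits used for cross-validation
cv = ShuffleSplit(n_splits = 20,test_size = 0.3)
# Build the random forest model and grid-search the number of trees
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
rf_model = RandomForestClassifier()
rf_param = {'n_estimators':range(1,11)}
rf_grid = GridSearchCV(rf_model,rf_param,cv = cv)
rf_grid.fit(X_train,y_train)
print('Parameter with best score:')
print(rf_grid.best_params_)
print('Cross validation score:',rf_grid.best_score_)
Output: the best parameter and its mean cross-validated accuracy on the training set
Parameter with best score:
{'n_estimators': 10}
Cross validation score: 0.9824920634920635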
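ShuffleSplit is not k-fold cross-validation: each of its n_splits iterations draws a fresh random train/validation partition, so validation sets from different iterations can overlap. A small sketch on ten dummy samples (`random_state=0` is added here only to make the illustration repeatable):

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Ten dummy samples standing in for the training rows
data = np.arange(10)

# Three independent random splits, each holding out 30% for validation
cv = ShuffleSplit(n_splits=3, test_size=0.3, random_state=0)

sizes = []
for train_idx, val_idx in cv.split(data):
    # Within one split, training and validation indices never overlap
    assert set(train_idx).isdisjoint(val_idx)
    sizes.append((len(train_idx), len(val_idx)))
print(sizes)
```

With 20 splits and 10 candidate values of n_estimators, the grid search above fits 200 forests, plus one final refit on the whole training set.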
Accuracy on the test set
best_rf = rf_grid.best_estimator_
print('Test score:',best_rf.score(X_test,y_test))
Output:
Test score: 0.9837777777777778
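Accuracy is easier to judge against a baseline. About 23.8% of employees in this dataset left (the mean of `left` in the describe() output), so a model that always predicts "stayed" is already right about 76% of the time. The sketch below computes that baseline and, on hypothetical labels invented for illustration, the precision and recall for the "left" class:

```python
import numpy as np

# Share of leavers, taken from the mean of `left` in the describe() output
left_rate = 0.238083
# Accuracy of a trivial model that always predicts the majority class
baseline_accuracy = max(left_rate, 1 - left_rate)
print(round(baseline_accuracy, 6))

# Hypothetical true labels and predictions, for illustration only
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 1])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # leavers correctly flagged
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # stayers wrongly flagged
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # leavers that were missed

precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = float(np.mean(y_true == y_pred))
print(precision, recall, accuracy)
```

On the real model, `sklearn.metrics.classification_report(y_test, best_rf.predict(X_test))` would report the same per-class quantities.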
Use the random forest to see which features are most important
features = X.columns
feature_importances = best_rf.feature_importances_
features_df = pd.DataFrame({'Features':features,'Importance Score':feature_importances})
features_df.sort_values('Importance Score',inplace = True,ascending = False)
print(features_df)
Output:
Features Importance Score
0 satisfaction_level 0.287197
2 number_project 0.229172
4 time_spend_company 0.159471
3 average_monthly_hours 0.141351
1 last_evaluation 0.130275
5 Work_accident 0.011622
8 low 0.009073
7 high 0.007717
17 sales 0.003182
19 technical 0.003036
9 medium 0.002938
18 support 0.002562
6 promotion_last_5years 0.002516
12 accounting 0.001969
10 IT 0.001931
14 management 0.001648
13 hr 0.001362
15 marketing 0.001211
11 RandD 0.001118
16 product_mng 0.000651
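Sum of the top five importance scores:
# Sum of the top five importance scores (the frame is already sorted in descending order)
print(features_df['Importance Score'].iloc[:5].sum())
Output: the top five features account for about 95% of the total importance. The exact figure varies between runs because the forest is randomized; the top five scores in the table above sum to about 0.947, so the value below presumably comes from a different run.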
0.954224518253131
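A running total makes it easy to see how many features are needed to reach a given share of the importance. The sketch below reuses the six largest scores from the table above; the 0.90 cutoff is an arbitrary value chosen for illustration:

```python
import pandas as pd

# Six largest importance scores, copied from the table above
features_df = pd.DataFrame({
    "Features": ["satisfaction_level", "number_project", "time_spend_company",
                 "average_monthly_hours", "last_evaluation", "Work_accident"],
    "Importance Score": [0.287197, 0.229172, 0.159471,
                         0.141351, 0.130275, 0.011622],
})

# Running total of importance, from most to least important
features_df["Cumulative"] = features_df["Importance Score"].cumsum()
print(features_df)

# Smallest number of top features whose importance reaches the cutoff
cutoff = 0.90
n_needed = int((features_df["Cumulative"] >= cutoff).idxmax()) + 1
print(n_needed)
```

For the scores in the table, five features are needed to pass 90%, and everything after the fifth adds only about 5% in total.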