Employee Attrition Case Study Using Official U.S. Department of Labor Statistics

程序员文章站 (Programmer Article Station), 2024-03-22 09:41:22

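The goal is to analyze the data and predict how likely an employee is to leave.

First, check whether the dataset contains dirty data such as missing values: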

import pandas as pd
import numpy as np

df = pd.read_csv('HR_comma_sep.csv')
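# print(df.isnull().any())            # check each column for null values
# print(np.count_nonzero(df != df))   # count NaN values (NaN is the only value not equal to itself)
print(df.info())  # the dataset is clean, with no missing values; info() prints its report and returns None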

Output: the data is fairly clean. There are no missing values, and only two features (sales and salary) have the object dtype.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 10 columns):
satisfaction_level       14999 non-null float64
last_evaluation          14999 non-null float64
number_project           14999 non-null int64
average_montly_hours     14999 non-null int64
time_spend_company       14999 non-null int64
Work_accident            14999 non-null int64
left                     14999 non-null int64
promotion_last_5years    14999 non-null int64
sales                    14999 non-null object
salary                   14999 non-null object
dtypes: float64(2), int64(6), object(2)
memory usage: 1.1+ MB
None
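
The commented-out checks above can also be written as a per-column count of missing values. The following is a minimal sketch on a hypothetical toy frame (the `toy` data is an assumption standing in for HR_comma_sep.csv, which reports no missing values):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for HR_comma_sep.csv
toy = pd.DataFrame({
    "satisfaction_level": [0.38, np.nan, 0.11],
    "salary": ["low", "medium", None],
})

# Number of missing entries in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Total number of missing cells in the whole frame
total_missing = int(missing_per_column.sum())
print("total missing:", total_missing)
```

On the real dataset the same call, `df.isna().sum()`, should report zero for every column, in line with the `df.info()` output.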

Fix the column names (one is misspelled, and the column labeled sales actually holds the department):

df.rename(columns = {'average_montly_hours':'average_monthly_hours',
                     'sales':'department'},inplace = True)

Analyze the distribution of the numeric features:

# Print summary statistics for the numeric columns
print(df.describe())

Output: count, mean, standard deviation, minimum, lower quartile, median, upper quartile, and maximum

       satisfaction_level  last_evaluation  number_project  \
count        14999.000000     14999.000000    14999.000000   
mean             0.612834         0.716102        3.803054   
std              0.248631         0.171169        1.232592   
min              0.090000         0.360000        2.000000   
25%              0.440000         0.560000        3.000000   
50%              0.640000         0.720000        4.000000   
75%              0.820000         0.870000        5.000000   
max              1.000000         1.000000        7.000000   

       average_monthly_hours  time_spend_company  Work_accident          left  \
count           14999.000000        14999.000000   14999.000000  14999.000000   
mean              201.050337            3.498233       0.144610      0.238083   
std                49.943099            1.460136       0.351719      0.425924   
min                96.000000            2.000000       0.000000      0.000000   
25%               156.000000            3.000000       0.000000      0.000000   
50%               200.000000            3.000000       0.000000      0.000000   
75%               245.000000            4.000000       0.000000      0.000000   
max               310.000000           10.000000       1.000000      1.000000   

       promotion_last_5years  
count           14999.000000  
mean                0.021268  
std                 0.144281  
min                 0.000000  
25%                 0.000000  
50%                 0.000000  
75%                 0.000000  
max                 1.000000

Inspect how the values of each categorical feature are distributed:

print('Departments:')
print(df['department'].value_counts())
   
print('\nSalary:')
print(df['salary'].value_counts())

Output:

Departments:
sales          4140
technical      2720
support        2229
IT             1227
product_mng     902
marketing       858
RandD           787
accounting      767
hr              739
management      630
Name: department, dtype: int64

Salary:
low       7316
medium    6446
high      1237
Name: salary, dtype: int64
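
Raw counts are easier to compare as proportions. The sketch below simply divides the salary counts printed above by their total; on the original frame, `df['salary'].value_counts(normalize=True)` is expected to give the same shares.

```python
import pandas as pd

# Salary counts copied from the output above
salary_counts = pd.Series({"low": 7316, "medium": 6446, "high": 1237})

# Share of each salary level among all employees
salary_share = salary_counts / salary_counts.sum()
print(salary_share.round(4))
```

Roughly 49% of employees are in the low band, 43% in the medium band, and 8% in the high band.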

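Inspect the correlations between attributes to see whether any features are strongly correlated, which informs later dimensionality reduction:

# numeric_only=True (pandas 1.5+) skips the object columns; pandas 2.0+ raises an error on them otherwise
print(df.corr(numeric_only=True))

Output: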

                       satisfaction_level  last_evaluation  number_project  \
satisfaction_level               1.000000         0.105021       -0.142970   
last_evaluation                  0.105021         1.000000        0.349333   
number_project                  -0.142970         0.349333        1.000000   
average_monthly_hours           -0.020048         0.339742        0.417211   
time_spend_company              -0.100866         0.131591        0.196786   
Work_accident                    0.058697        -0.007104       -0.004741   
left                            -0.388375         0.006567        0.023787   
promotion_last_5years            0.025605        -0.008684       -0.006064   

                       average_monthly_hours  time_spend_company  \
satisfaction_level                 -0.020048           -0.100866   
last_evaluation                     0.339742            0.131591   
number_project                      0.417211            0.196786   
average_monthly_hours               1.000000            0.127755   
time_spend_company                  0.127755            1.000000   
Work_accident                      -0.010143            0.002120   
left                                0.071287            0.144822   
promotion_last_5years              -0.003544            0.067433   

                       Work_accident      left  promotion_last_5years  
satisfaction_level          0.058697 -0.388375               0.025605  
last_evaluation            -0.007104  0.006567              -0.008684  
number_project             -0.004741  0.023787              -0.006064  
average_monthly_hours      -0.010143  0.071287              -0.003544  
time_spend_company          0.002120  0.144822               0.067433  
Work_accident               1.000000 -0.154622               0.039245  
left                       -0.154622  1.000000              -0.061788  
promotion_last_5years       0.039245 -0.061788               1.000000
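
The full matrix is hard to scan, so it can help to rank the features by how strongly they correlate with left. This sketch copies the left column from the matrix above into a Series; on the original frame the same values would come from `df.corr(numeric_only=True)['left'].drop('left')`.

```python
import pandas as pd

# Correlations with `left`, copied from the matrix above
corr_with_left = pd.Series({
    "satisfaction_level": -0.388375,
    "last_evaluation": 0.006567,
    "number_project": 0.023787,
    "average_monthly_hours": 0.071287,
    "time_spend_company": 0.144822,
    "Work_accident": -0.154622,
    "promotion_last_5years": -0.061788,
})

# Order features by the strength of the linear relationship, ignoring the sign
order = corr_with_left.abs().sort_values(ascending=False).index
ranked = corr_with_left.reindex(order)
print(ranked)
```

satisfaction_level stands out with the strongest (negative) linear relationship to leaving. Pearson correlation only captures linear effects, so a near-zero value such as the one for last_evaluation does not rule out a nonlinear relationship.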

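Examine the relationship between a single feature and the target class:

import matplotlib.pyplot as plt
import seaborn as sns
# catplot is the current name of factorplot, which was renamed in seaborn 0.9 and later removed
plot = sns.catplot(x='department',y='left',kind='bar',data = df)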
plot.set_xticklabels(rotation = 45, horizontalalignment = 'right')
plt.show()

Output:

[Figure: bar chart of the mean of left (the attrition rate) for each department]

Analyze the relationship between salary level and attrition:

# catplot replaces the removed factorplot
plot = sns.catplot(x='salary',y='left',kind='bar',data = df)
plt.show()

Output:

[Figure: bar chart of the mean of left (the attrition rate) for each salary level]
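
By default a seaborn bar plot shows the mean of y within each category, so the bar heights here are attrition rates. The same numbers can be obtained directly with a groupby, sketched below on hypothetical toy data (the `toy` rows are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data; the real frame has the same two columns
toy = pd.DataFrame({
    "salary": ["low", "low", "low", "low", "medium", "medium", "high", "high"],
    "left":   [1,     1,     0,     0,     1,        0,        0,      0],
})

# The mean of a 0/1 column is the fraction of ones, i.e. the attrition rate
attrition_by_salary = toy.groupby("salary")["left"].mean()
print(attrition_by_salary)
```

Applied to the real frame, `df.groupby('salary')['left'].mean()` would give the exact values behind the bars.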

Analyze the salary distribution in the management department:

df[df['department'] == 'management']['salary'].value_counts().plot(kind = 'pie',title = 'Management salary level distribution')
plt.show()

[Figure: pie chart of salary levels in the management department]

Analyze the salary distribution in the R&D department:

df[df['department'] == 'RandD']['salary'].value_counts().plot(kind = 'pie',title = 'R&D dept salary level distribution')
plt.show()

[Figure: pie chart of salary levels in the R&D department]

Use histograms to compare satisfaction level between employees who left and those who stayed:

bins = np.linspace(0.0001,1.0001,21)
plt.hist(df[df['left'] == 1]['satisfaction_level'],bins = bins,alpha = 0.7,label = 'Employees Left')
plt.hist(df[df['left'] == 0]['satisfaction_level'],bins = bins,alpha = 0.5,label = 'Employees Stayed')
plt.xlabel('satisfaction_level')
plt.xlim((0,1.05))
plt.legend(loc = 'best')
plt.show()

Output:

[Figure: overlaid histograms of satisfaction_level for employees who left and employees who stayed]
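
Overlaid histograms show counts, which makes it hard to read the attrition rate within a bin when the two groups differ in size. A possible complement is to bin the feature with `pd.cut` and compute the rate per bin. This is a sketch on hypothetical toy data, reusing the bin edges from the histogram code:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with the same two columns as the real frame
toy = pd.DataFrame({
    "satisfaction_level": [0.09, 0.10, 0.38, 0.40, 0.45, 0.72, 0.80, 0.91],
    "left":               [1,    1,    1,    1,    0,    0,    0,    0],
})

# Same bin edges as the histogram: 20 bins of width 0.05
bins = np.linspace(0.0001, 1.0001, 21)
toy["bin"] = pd.cut(toy["satisfaction_level"], bins=bins)

# Attrition rate and head count per occupied bin
summary = toy.groupby("bin", observed=True)["left"].agg(["mean", "count"])
print(summary)
```

The count column matters when reading the result: a rate computed from a handful of employees is much less reliable than one computed from hundreds.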

Analyze the relationship between the company's evaluation of the employee and attrition:

bins = np.linspace(0.3501,1.0001,14)
plt.hist(df[df['left'] == 1]['last_evaluation'],bins = bins,alpha = 1,label = 'Employees Left')
plt.hist(df[df['left'] == 0]['last_evaluation'],bins = bins,alpha = 0.4,label = 'Employees Stayed')
plt.xlabel('last_evaluation')
plt.legend(loc = 'best')
plt.show()

Output:

[Figure: overlaid histograms of last_evaluation for employees who left and employees who stayed]

Analyze the relationship between the number of projects an employee has worked on and attrition:

bins = np.linspace(1.5,7.5,7)
plt.hist(df[df['left'] == 1]['number_project'],bins = bins,alpha = 1,label = 'Employees Left')
plt.hist(df[df['left'] == 0]['number_project'],bins = bins,alpha = 0.4,label = 'Employees Stayed')
plt.xlabel('number_project')
plt.grid(axis = 'x')
plt.legend(loc = 'best')
plt.show()

Output:

[Figure: overlaid histograms of number_project for employees who left and employees who stayed]
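
For a discrete feature such as number_project, a normalized cross-tabulation gives the share of leavers at each value directly. The following sketch uses hypothetical toy rows, not the real data:

```python
import pandas as pd

# Hypothetical toy data with the same two columns as the real frame
toy = pd.DataFrame({
    "number_project": [2, 2, 2, 3, 3, 4, 4, 7, 7],
    "left":           [1, 1, 0, 0, 0, 0, 1, 1, 1],
})

# Rows: project count; columns: left (0 = stayed, 1 = left); each row sums to 1
rates = pd.crosstab(toy["number_project"], toy["left"], normalize="index")
print(rates)
```

The same call on the real frame, `pd.crosstab(df['number_project'], df['left'], normalize='index')`, would put a number on what the histogram shows visually.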

Analyze the relationship between average monthly working hours and attrition:

bins = np.linspace(75,325,11)
plt.hist(df[df['left'] == 1]['average_monthly_hours'],bins = bins,alpha = 1,label = 'Employees Left')
plt.hist(df[df['left'] == 0]['average_monthly_hours'],bins = bins,alpha = 0.4,label = 'Employees Stayed')
plt.xlabel('average_monthly_hours')
plt.legend(loc = 'best')
plt.show()

Output:

[Figure: overlaid histograms of average_monthly_hours for employees who left and employees who stayed]

Analyze the relationship between years spent at the company and attrition:

bins = np.linspace(1.5,10.5,10)
plt.hist(df[df['left'] == 1]['time_spend_company'],bins = bins,alpha = 1,label = 'Employees Left')
plt.hist(df[df['left'] == 0]['time_spend_company'],bins = bins,alpha = 0.4,label = 'Employees Stayed')
plt.xlabel('time_spend_company')
plt.xlim((1,11))
plt.grid(axis = 'x')
plt.xticks(np.arange(2,11))
plt.legend(loc = 'best')
plt.show()

Output:

[Figure: overlaid histograms of time_spend_company for employees who left and employees who stayed]

Other features are worth checking too, such as workplace accidents (Work_accident).

Input:

# catplot replaces the removed factorplot
plot = sns.catplot(x='Work_accident',y='left',kind='bar',data = df)
plt.show()

Output:

[Figure: bar chart of the mean of left (the attrition rate) by Work_accident]

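Feature: whether the employee was promoted in the last five years (promotion_last_5years)

Input:

# catplot replaces the removed factorplot
plot = sns.catplot(x='promotion_last_5years',y='left',kind='bar',data = df)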
plt.show()

Output:

[Figure: bar chart of the mean of left (the attrition rate) by promotion_last_5years]

One-hot encode the features whose dtype is object:

X = df.drop('left',axis = 1)
y = df['left']
X.drop(['department','salary'],axis = 1,inplace=True)
salary_dummy = pd.get_dummies(df['salary'])
department_dummy = pd.get_dummies(df['department'])
X = pd.concat([X,salary_dummy],axis = 1)
X = pd.concat([X,department_dummy],axis = 1)
print(X.head())

Output:

   satisfaction_level  last_evaluation  number_project  average_monthly_hours  \
0                0.38             0.53               2                    157   
1                0.80             0.86               5                    262   
2                0.11             0.88               7                    272   
3                0.72             0.87               5                    223   
4                0.37             0.52               2                    159   

   time_spend_company  Work_accident  promotion_last_5years  high  low  \
0                   3              0                      0     0    1   
1                   6              0                      0     0    0   
2                   4              0                      0     0    0   
3                   5              0                      0     0    1   
4                   3              0                      0     0    1   

   medium  IT  RandD  accounting  hr  management  marketing  product_mng  \
0       0   0      0           0   0           0          0            0   
1       1   0      0           0   0           0          0            0   
2       1   0      0           0   0           0          0            0   
3       0   0      0           0   0           0          0            0   
4       0   0      0           0   0           0          0            0   

   sales  support  technical  
0      1        0          0  
1      1        0          0  
2      1        0          0  
3      1        0          0  
4      1        0          0
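
The drop, encode, and concat steps can also be collapsed into a single `pd.get_dummies` call with the `columns` argument. This is an alternative sketch on a hypothetical toy frame, not the code that produced the output above. Two differences are worth noting: the new columns are prefixed with the source column name (salary_low rather than low), and `dtype=int` is passed because pandas 2.0+ returns boolean dummies by default.

```python
import pandas as pd

# Hypothetical toy frame with the two object columns from the dataset
toy = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.11],
    "department": ["sales", "sales", "IT"],
    "salary": ["low", "medium", "medium"],
})

# Encode both object columns at once; each new column is prefixed with its source column
encoded = pd.get_dummies(toy, columns=["salary", "department"], dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

The prefixes avoid ambiguity between the department value sales and any other column that might share a name with a category.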

Standardize the features (demonstrated here on a small example array):

from sklearn.preprocessing import StandardScaler
stdsc = StandardScaler()
X_example = np.array([[1.,-2.,2.],
                      [5.,3.,2.],
                      [0.,1.,-10.]])
X_example = stdsc.fit_transform(X_example)
X_example = pd.DataFrame(X_example)
print(X_example)
print(X_example.describe())

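Output: every column has a mean of 0. The std row reads 1.22 rather than 1 because describe() reports the sample standard deviation (ddof=1), whereas StandardScaler divides by the population standard deviation (ddof=0); with three rows the ratio is sqrt(3/2) ≈ 1.2247.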

         0         1         2
0 -0.46291 -1.297771  0.707107
1  1.38873  1.135550  0.707107
2 -0.92582  0.162221 -1.414214
              0             1         2
count  3.000000  3.000000e+00  3.000000
mean   0.000000  3.700743e-17  0.000000
std    1.224745  1.224745e+00  1.224745
min   -0.925820 -1.297771e+00 -1.414214
25%   -0.694365 -5.677750e-01 -0.353553
50%   -0.462910  1.622214e-01  0.707107
75%    0.462910  6.488857e-01  0.707107
max    1.388730  1.135550e+00  0.707107
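
The 1.2247 in the std row can be reproduced with NumPy alone. The sketch below standardizes the same example array by hand using the population standard deviation, the convention StandardScaler follows, and then measures the result both ways:

```python
import numpy as np

X_example = np.array([[1., -2., 2.],
                      [5., 3., 2.],
                      [0., 1., -10.]])

# Standardize by hand with the population standard deviation (ddof=0)
z = (X_example - X_example.mean(axis=0)) / X_example.std(axis=0, ddof=0)

print(z.std(axis=0, ddof=0))  # population std of each column
print(z.std(axis=0, ddof=1))  # sample std, the one describe() reports
```

With n rows the two differ by a factor of sqrt(n / (n - 1)), which is noticeable for n = 3 but negligible for the 14999 rows of the real dataset.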

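Build a random forest classifier

from sklearn.model_selection import train_test_split
# The original listing omits the train/test split. A 70/30 split is assumed here,
# which is consistent with the test score reported below (4427/4500).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

from sklearn.preprocessing import StandardScaler
stdsc = StandardScaler()
# Scaled copies are computed but not used below: the forest is fitted on the raw features
X_train_std = stdsc.fit_transform(X_train)
X_test_std = stdsc.transform(X_test)

from sklearn.model_selection import ShuffleSplit
# n_splits is the number of random train/validation reshuffles used for cross-validation
cv = ShuffleSplit(n_splits = 20,test_size = 0.3)

# Build the random forest model and grid-search the number of trees
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
rf_model = RandomForestClassifier()
rf_param = {'n_estimators':range(1,11)}
rf_grid = GridSearchCV(rf_model,rf_param,cv = cv)
rf_grid.fit(X_train,y_train)
print('Parameter with best score:')
print(rf_grid.best_params_)
print('Cross validation score:',rf_grid.best_score_)

Output: the best parameter and the cross-validation accuracy on the training set

Parameter with best score:
{'n_estimators': 10}
Cross validation score: 0.9824920634920635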
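
Because both the split and the forest are randomized, the scores above change from run to run. A reproducible variant can fix `random_state` everywhere and stratify the split so both parts keep the same class ratio. The sketch below is an assumption-laden illustration on synthetic data from `make_classification` (not the HR dataset), with a smaller grid to keep it fast:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, ShuffleSplit, train_test_split

# Synthetic stand-in for the HR features (hypothetical data, not the real dataset)
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

# stratify keeps the class ratio equal in both parts; random_state fixes the shuffle
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=0)

cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    {"n_estimators": [5, 10, 20]}, cv=cv)
grid.fit(X_tr, y_tr)

print(grid.best_params_)
print("cross-validation score:", grid.best_score_)
print("test score:", grid.score(X_te, y_te))
```

Note that the best value found in the article, 10, sits at the upper edge of the searched range 1 to 10, which suggests that a wider range might be worth trying.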

Classification accuracy on the test set

best_rf = rf_grid.best_estimator_
print('Test score:',best_rf.score(X_test,y_test))

Output:

Test score: 0.9837777777777778
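
To put this accuracy in context, it helps to compare it with a trivial model that always predicts "stayed". The arithmetic below uses only two numbers reported earlier: the row count from df.info() and the mean of left from df.describe().

```python
n_total = 14999            # rows reported by df.info()
left_rate = 0.238083       # mean of `left` from df.describe()

n_left = round(n_total * left_rate)      # employees who left
n_stayed = n_total - n_left              # employees who stayed

# Accuracy of a trivial model that always predicts "stayed"
baseline_accuracy = n_stayed / n_total
print(n_left, n_stayed, round(baseline_accuracy, 4))
```

The baseline is about 76%, so the random forest's test accuracy of about 98% is a substantial improvement and not just an effect of class imbalance.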

Use the random forest to see which features matter most

features = X.columns
feature_importances = best_rf.feature_importances_
features_df = pd.DataFrame({'Features':features,'Importance Score':feature_importances})
features_df.sort_values('Importance Score',inplace = True,ascending = False)
print(features_df)

Output:

                 Features  Importance Score
0      satisfaction_level          0.287197
2          number_project          0.229172
4      time_spend_company          0.159471
3   average_monthly_hours          0.141351
1         last_evaluation          0.130275
5           Work_accident          0.011622
8                     low          0.009073
7                    high          0.007717
17                  sales          0.003182
19              technical          0.003036
9                  medium          0.002938
18                support          0.002562
6   promotion_last_5years          0.002516
12             accounting          0.001969
10                     IT          0.001931
14             management          0.001648
13                     hr          0.001362
15              marketing          0.001211
11                  RandD          0.001118
16            product_mng          0.000651

Sum the importance scores of the top five features:

# Sum of the top five importance scores
print(features_df['Importance Score'][:5].sum())

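Output: the top five features account for about 95% of the total importance (the exact value varies between runs because random forest training is randomized)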

0.954224518253131
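
A running total makes it easy to see how many features are needed to reach a given share of the importance. This sketch copies the six largest scores from the table above. Those listed values sum to about 0.947 for the top five, slightly below the printed 0.954, presumably because the two outputs come from different runs.

```python
import pandas as pd

# Top importance scores copied from the table above
scores = pd.Series({
    "satisfaction_level": 0.287197,
    "number_project": 0.229172,
    "time_spend_company": 0.159471,
    "average_monthly_hours": 0.141351,
    "last_evaluation": 0.130275,
    "Work_accident": 0.011622,
})

# Running total of importance, most important feature first
cumulative = scores.sort_values(ascending=False).cumsum()
print(cumulative.round(4))

# Number of features needed to reach 90% of the total importance
n_needed = int((cumulative < 0.90).sum()) + 1
print("features needed for 90%:", n_needed)
```

On the fitted model the same running total would come from `features_df['Importance Score'].cumsum()` after sorting.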
