Official U.S. Department of Labor Statistics: An Employee Attrition Case Study

程序员文章站 2024-03-22 09:41:22
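By analyzing the data, we try to predict how likely an employee is to leave.

First, check whether the data set contains dirty data such as missing values: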

import pandas as pd
import numpy as np

df = pd.read_csv('HR_comma_sep.csv')
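# print(df.isnull().any())           # check each column for null values
# print(np.count_nonzero(df != df))  # count NaN entries (NaN != NaN)
print(df.info())  # info() prints its report and returns None, hence the trailing None

Output: the data set is fairly clean. There are no missing values, and only two features are of object type.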

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 10 columns):
satisfaction_level       14999 non-null float64
last_evaluation          14999 non-null float64
number_project           14999 non-null int64
average_montly_hours     14999 non-null int64
time_spend_company       14999 non-null int64
Work_accident            14999 non-null int64
left                     14999 non-null int64
promotion_last_5years    14999 non-null int64
sales                    14999 non-null object
salary                   14999 non-null object
dtypes: float64(2), int64(6), object(2)
memory usage: 1.1+ MB
None
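
As a small illustration of the missing-value check, the sketch below runs the same kind of test on a toy frame. The toy values are invented for this example and are not taken from HR_comma_sep.csv.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the HR data (values are made up for illustration).
toy = pd.DataFrame({
    "satisfaction_level": [0.38, np.nan, 0.11, 0.72],
    "salary": ["low", "medium", None, "low"],
})

# Number of missing entries in each column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Total number of missing cells in the whole frame.
total_missing = int(toy.isnull().sum().sum())
print("total missing:", total_missing)
```

On the real data set both numbers would be zero, which agrees with the non-null counts reported by info().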

Correct the column names:

df.rename(columns = {'average_montly_hours':'average_monthly_hours',
                     'sales':'department'},inplace = True)

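Analyze the distribution of the numeric features:

# Print summary statistics for every numeric column
print(df.describe())

Output: count, mean, standard deviation, minimum, lower quartile, median, upper quartile, and maximum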

       satisfaction_level  last_evaluation  number_project  \
count        14999.000000     14999.000000    14999.000000   
mean             0.612834         0.716102        3.803054   
std              0.248631         0.171169        1.232592   
min              0.090000         0.360000        2.000000   
25%              0.440000         0.560000        3.000000   
50%              0.640000         0.720000        4.000000   
75%              0.820000         0.870000        5.000000   
max              1.000000         1.000000        7.000000   

       average_monthly_hours  time_spend_company  Work_accident          left  \
count           14999.000000        14999.000000   14999.000000  14999.000000   
mean              201.050337            3.498233       0.144610      0.238083   
std                49.943099            1.460136       0.351719      0.425924   
min                96.000000            2.000000       0.000000      0.000000   
25%               156.000000            3.000000       0.000000      0.000000   
50%               200.000000            3.000000       0.000000      0.000000   
75%               245.000000            4.000000       0.000000      0.000000   
max               310.000000           10.000000       1.000000      1.000000   

       promotion_last_5years  
count           14999.000000  
mean                0.021268  
std                 0.144281  
min                 0.000000  
25%                 0.000000  
50%                 0.000000  
75%                 0.000000  
max                 1.000000  

Look at how the values of each categorical feature are distributed:

print('Departments:')
print(df['department'].value_counts())
   
print('\nSalary:')
print(df['salary'].value_counts())

Output:

Departments:
sales          4140
technical      2720
support        2229
IT             1227
product_mng     902
marketing       858
RandD           787
accounting      767
hr              739
management      630
Name: department, dtype: int64

Salary:
low       7316
medium    6446
high      1237
Name: salary, dtype: int64
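
The raw counts are easier to compare as shares. The sketch below rebuilds the salary counts printed above as a Series and divides by the total; on the real DataFrame, df['salary'].value_counts(normalize=True) should give the same proportions.

```python
import pandas as pd

# Salary counts copied from the output above.
salary_counts = pd.Series({"low": 7316, "medium": 6446, "high": 1237}, name="salary")

# Convert the counts to shares of the whole data set.
shares = salary_counts / salary_counts.sum()
print(shares.round(3))
```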

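Check the correlation between attributes to see whether any features are strongly correlated, as a reference for later dimensionality reduction:

# Restrict to numeric columns; recent pandas versions raise an error on object columns
print(df.select_dtypes(include = 'number').corr())

Output: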

                       satisfaction_level  last_evaluation  number_project  \
satisfaction_level               1.000000         0.105021       -0.142970   
last_evaluation                  0.105021         1.000000        0.349333   
number_project                  -0.142970         0.349333        1.000000   
average_monthly_hours           -0.020048         0.339742        0.417211   
time_spend_company              -0.100866         0.131591        0.196786   
Work_accident                    0.058697        -0.007104       -0.004741   
left                            -0.388375         0.006567        0.023787   
promotion_last_5years            0.025605        -0.008684       -0.006064   

                       average_monthly_hours  time_spend_company  \
satisfaction_level                 -0.020048           -0.100866   
last_evaluation                     0.339742            0.131591   
number_project                      0.417211            0.196786   
average_monthly_hours               1.000000            0.127755   
time_spend_company                  0.127755            1.000000   
Work_accident                      -0.010143            0.002120   
left                                0.071287            0.144822   
promotion_last_5years              -0.003544            0.067433   

                       Work_accident      left  promotion_last_5years  
satisfaction_level          0.058697 -0.388375               0.025605  
last_evaluation            -0.007104  0.006567              -0.008684  
number_project             -0.004741  0.023787              -0.006064  
average_monthly_hours      -0.010143  0.071287              -0.003544  
time_spend_company          0.002120  0.144822               0.067433  
Work_accident               1.000000 -0.154622               0.039245  
left                       -0.154622  1.000000              -0.061788  
promotion_last_5years       0.039245 -0.061788               1.000000  
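
Reading a full correlation matrix by eye is error-prone, so a short helper can list the pairs that stand out. This is a minimal sketch: the function name strong_pairs and the 0.3 threshold are assumptions made for the example, and only a few coefficients from the matrix above are copied in.

```python
import numpy as np
import pandas as pd

def strong_pairs(corr, threshold=0.3):
    """Return (feature_a, feature_b, r) for each pair with |r| >= threshold."""
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, skip the diagonal
            r = float(corr.iloc[i, j])
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], r))
    # Strongest relationships first.
    return sorted(pairs, key=lambda p: -abs(p[2]))

# A subset of the coefficients printed above.
names = ["satisfaction_level", "last_evaluation", "number_project", "left"]
values = np.array([
    [1.000000, 0.105021, -0.142970, -0.388375],
    [0.105021, 1.000000, 0.349333, 0.006567],
    [-0.142970, 0.349333, 1.000000, 0.023787],
    [-0.388375, 0.006567, 0.023787, 1.000000],
])
corr = pd.DataFrame(values, index=names, columns=names)

pairs = strong_pairs(corr)
for a, b, r in pairs:
    print(f"{a} vs {b}: {r:+.3f}")
```

With this threshold the subset yields two pairs: satisfaction_level with left (negative) and last_evaluation with number_project (positive). None of the coefficients is large enough to suggest that a feature is redundant.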

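Look at the relationship between a single feature and the output class:

import matplotlib.pyplot as plt
import seaborn as sns
# factorplot was renamed to catplot in seaborn 0.9 and later removed
plot = sns.catplot(x='department',y='left',kind='bar',data = df)
plot.set_xticklabels(rotation = 45, horizontalalignment = 'right')
plt.show()

Output:

[Figure: bar chart of the attrition rate (mean of left) for each department]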
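
A seaborn bar plot of left against department draws, by default, the mean of left within each department, which is the attrition rate. The same numbers can be computed directly with a groupby. The toy frame below is invented for illustration and does not reproduce the real rates.

```python
import pandas as pd

# Toy data: 1 means the employee left, 0 means the employee stayed.
toy = pd.DataFrame({
    "department": ["sales", "sales", "hr", "hr", "IT", "IT", "IT", "sales"],
    "left":       [1, 0, 1, 1, 0, 0, 1, 0],
})

# The mean of a 0/1 column is the fraction of ones, i.e. the attrition rate.
rate_by_department = (
    toy.groupby("department")["left"].mean().sort_values(ascending=False)
)
print(rate_by_department)
```

Applying the same groupby to the real df gives the exact values behind the bars, which is handy when two bars are too close to compare visually.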

Analyze the relationship between salary level and attrition:

plot = sns.catplot(x='salary',y='left',kind='bar',data = df)
plt.show()

Output:

[Figure: bar chart of the attrition rate for each salary level]

Salary distribution within the management department:

df[df['department'] == 'management']['salary'].value_counts().plot(kind = 'pie',title = 'Management salary level distribution')
plt.show()

[Figure: pie chart "Management salary level distribution"]

Salary distribution within the R&D department:

df[df['department'] == 'RandD']['salary'].value_counts().plot(kind = 'pie',title = 'R&D dept salary level distribution')
plt.show()

[Figure: pie chart "R&D dept salary level distribution"]

Use histograms to compare employees' satisfaction with the company against whether they left:

bins = np.linspace(0.0001,1.0001,21)
plt.hist(df[df['left'] == 1]['satisfaction_level'],bins = bins,alpha = 0.7,label = 'Employees Left')
plt.hist(df[df['left'] == 0]['satisfaction_level'],bins = bins,alpha = 0.5,label = 'Employees Stayed')
plt.xlabel('satisfaction_level')
plt.xlim((0,1.05))
plt.legend(loc = 'best')
plt.show()

Output:

[Figure: overlaid histograms of satisfaction_level for employees who left and employees who stayed]
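
Overlaid histograms show counts, so a tall bar can simply mean that many employees fall in that range. Dividing the "left" counts by the total counts in each bin gives an attrition rate per bin instead. This is a sketch on invented values; the bin edges are also an assumption made for the example.

```python
import numpy as np

# Toy satisfaction scores and labels (1 = left, 0 = stayed), made up for illustration.
satisfaction = np.array([0.10, 0.12, 0.40, 0.42, 0.45, 0.80, 0.85, 0.90])
left = np.array([1, 1, 1, 0, 1, 0, 0, 1])

bins = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
left_counts, _ = np.histogram(satisfaction[left == 1], bins=bins)
all_counts, _ = np.histogram(satisfaction, bins=bins)

# Fraction who left in each bin; bins with no employees are reported as NaN.
rate = np.full(len(all_counts), np.nan)
nonempty = all_counts > 0
rate[nonempty] = left_counts[nonempty] / all_counts[nonempty]
print(rate)
```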

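Analyze the relationship between the company's last evaluation of an employee and attrition:

bins = np.linspace(0.3501,1.0001,14)
plt.hist(df[df['left'] == 1]['last_evaluation'],bins = bins,alpha = 1,label = 'Employees Left')
plt.hist(df[df['left'] == 0]['last_evaluation'],bins = bins,alpha = 0.4,label = 'Employees Stayed')
plt.xlabel('last_evaluation')  # the original labelled this axis satisfaction_level by mistake
plt.legend(loc = 'best')
plt.show()

Output:

[Figure: overlaid histograms of last_evaluation for employees who left and employees who stayed]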

Analyze the relationship between the number of projects an employee worked on and attrition:

bins = np.linspace(1.5,7.5,7)
plt.hist(df[df['left'] == 1]['number_project'],bins = bins,alpha = 1,label = 'Employees Left')
plt.hist(df[df['left'] == 0]['number_project'],bins = bins,alpha = 0.4,label = 'Employees Stayed')
plt.xlabel('number_project')
plt.grid(axis = 'x')
plt.legend(loc = 'best')
plt.show()

Output:

[Figure: overlaid histograms of number_project for employees who left and employees who stayed]

Analyze the relationship between average monthly working hours and attrition:

bins = np.linspace(75,325,11)
plt.hist(df[df['left'] == 1]['average_monthly_hours'],bins = bins,alpha = 1,label = 'Employees Left')
plt.hist(df[df['left'] == 0]['average_monthly_hours'],bins = bins,alpha = 0.4,label = 'Employees Stayed')
plt.xlabel('average_monthly_hours')
plt.legend(loc = 'best')
plt.show()

Output:

[Figure: overlaid histograms of average_monthly_hours for employees who left and employees who stayed]

Analyze the relationship between years spent at the company and attrition:

bins = np.linspace(1.5,10.5,10)
plt.hist(df[df['left'] == 1]['time_spend_company'],bins = bins,alpha = 1,label = 'Employees Left')
plt.hist(df[df['left'] == 0]['time_spend_company'],bins = bins,alpha = 0.4,label = 'Employees Stayed')
plt.xlabel('time_spend_company')
plt.xlim((1,11))
plt.grid(axis = 'x')
plt.xticks(np.arange(2,11))
plt.legend(loc = 'best')
plt.show()

Output:

[Figure: overlaid histograms of time_spend_company for employees who left and employees who stayed]

There are a few other features as well, such as work accidents (Work_accident).

Input:

plot = sns.catplot(x='Work_accident',y='left',kind='bar',data = df)
plt.show()

Output:

[Figure: bar chart of the attrition rate by Work_accident]

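The feature promotion_last_5years: whether the employee was promoted in the last five years.

Input:

plot = sns.catplot(x='promotion_last_5years',y='left',kind='bar',data = df)
plt.show()

Output:

[Figure: bar chart of the attrition rate by promotion_last_5years]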

One-hot encode the features whose type is object:

X = df.drop('left',axis = 1)
y = df['left']
X.drop(['department','salary'],axis = 1,inplace=True)
# dtype = int keeps the 0/1 columns shown below; newer pandas defaults to booleans
salary_dummy = pd.get_dummies(df['salary'],dtype = int)
department_dummy = pd.get_dummies(df['department'],dtype = int)
X = pd.concat([X,salary_dummy],axis = 1)
X = pd.concat([X,department_dummy],axis = 1)
print(X.head())

Output:

   satisfaction_level  last_evaluation  number_project  average_monthly_hours  \
0                0.38             0.53               2                    157   
1                0.80             0.86               5                    262   
2                0.11             0.88               7                    272   
3                0.72             0.87               5                    223   
4                0.37             0.52               2                    159   

   time_spend_company  Work_accident  promotion_last_5years  high  low  \
0                   3              0                      0     0    1   
1                   6              0                      0     0    0   
2                   4              0                      0     0    0   
3                   5              0                      0     0    1   
4                   3              0                      0     0    1   

   medium  IT  RandD  accounting  hr  management  marketing  product_mng  \
0       0   0      0           0   0           0          0            0   
1       1   0      0           0   0           0          0            0   
2       1   0      0           0   0           0          0            0   
3       0   0      0           0   0           0          0            0   
4       0   0      0           0   0           0          0            0   

   sales  support  technical  
0      1        0          0  
1      1        0          0  
2      1        0          0  
3      1        0          0  
4      1        0          0  
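
The drop-and-concatenate steps can also be written as a single get_dummies call on the whole frame. The sketch below uses an invented three-row frame; note that this variant prefixes each new column with the source column name, so the resulting column names differ from the ones shown above.

```python
import pandas as pd

# Toy frame with one numeric and two categorical columns (values are made up).
toy = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.11],
    "department": ["sales", "hr", "sales"],
    "salary": ["low", "medium", "high"],
})

# Encode only the listed columns; the numeric column passes through unchanged.
encoded = pd.get_dummies(toy, columns=["department", "salary"], dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

The prefixes make it clear where each indicator came from, which helps when reading feature importances later on.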

Standardize the features:

from sklearn.preprocessing import StandardScaler
stdsc = StandardScaler()
X_example = np.array([[1.,-2.,2.],
                      [5.,3.,2.],
                      [0.,1.,-10.]])
X_example = stdsc.fit_transform(X_example)
X_example = pd.DataFrame(X_example)
print(X_example)
print(X_example.describe())

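Output: every column has mean 0 (up to floating-point error) and a standard deviation of about 1.22 rather than 1. StandardScaler divides by the population standard deviation (ddof = 0), whereas describe() reports the sample standard deviation (ddof = 1); with three rows the ratio is sqrt(3/2) ≈ 1.2247.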

         0         1         2
0 -0.46291 -1.297771  0.707107
1  1.38873  1.135550  0.707107
2 -0.92582  0.162221 -1.414214
              0             1         2
count  3.000000  3.000000e+00  3.000000
mean   0.000000  3.700743e-17  0.000000
std    1.224745  1.224745e+00  1.224745
min   -0.925820 -1.297771e+00 -1.414214
25%   -0.694365 -5.677750e-01 -0.353553
50%   -0.462910  1.622214e-01  0.707107
75%    0.462910  6.488857e-01  0.707107
max    1.388730  1.135550e+00  0.707107
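
To make the scaling step concrete, the sketch below recomputes the standardized values by hand with z = (x - mean) / std and compares the two ways of measuring the standard deviation of the result. It reuses the 3x3 example matrix from above; the variable names are chosen for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1., -2., 2.],
              [5., 3., 2.],
              [0., 1., -10.]])

# z = (x - mean) / std, using the population standard deviation (ddof=0).
manual = (X - X.mean(axis=0)) / X.std(axis=0, ddof=0)
scaled = StandardScaler().fit_transform(X)

pop_std = scaled.std(axis=0, ddof=0)     # the quantity StandardScaler normalizes to 1
sample_std = scaled.std(axis=0, ddof=1)  # the quantity DataFrame.describe() reports

print(np.allclose(manual, scaled))
print(pop_std)
print(sample_std)
```

The sample standard deviation is larger by a factor of sqrt(n / (n - 1)), which is sqrt(3/2) for three rows. On the full data set with 14999 rows the two values are practically identical.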

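Build a random forest classification model (the grid search below is fit on the unscaled X_train; tree-based models do not depend on feature scaling)

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# The split is not shown in the original; test_size = 0.3 is an assumption
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.3)
stdsc = StandardScaler()
X_train_std = stdsc.fit_transform(X_train)  # fit on the training set only
X_test_std = stdsc.transform(X_test)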

from sklearn.model_selection import ShuffleSplit
# n_splits is the number of random train/validation splits to draw
cv = ShuffleSplit(n_splits = 20,test_size = 0.3)

# Build the random forest model and tune the number of trees
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
rf_model = RandomForestClassifier()
rf_param = {'n_estimators':range(1,11)}
rf_grid = GridSearchCV(rf_model,rf_param,cv = cv)
rf_grid.fit(X_train,y_train)
print('Parameter with best score:')
print(rf_grid.best_params_)
print('Cross validation score:',rf_grid.best_score_)

Output: the best parameter and its mean cross-validated accuracy on the training set

Parameter with best score:
{'n_estimators': 10}
Cross validation score: 0.9824920634920635
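
ShuffleSplit is not k-fold cross-validation: each of the n_splits iterations draws a fresh random partition, so validation sets can overlap between iterations. The sketch below inspects the splits on a placeholder array of 1000 rows; the row count and random_state are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Placeholder features; only the number of rows matters for the split sizes.
X_dummy = np.zeros((1000, 1))

cv = ShuffleSplit(n_splits=20, test_size=0.3, random_state=0)
sizes = [(len(train_idx), len(val_idx)) for train_idx, val_idx in cv.split(X_dummy)]

print(len(sizes))  # number of train/validation splits
print(sizes[0])    # (train rows, validation rows) in the first split
```

GridSearchCV fits one model per parameter value and per split, so ten values of n_estimators with twenty splits means 200 fits before the final refit.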

Classification accuracy on the test set:

best_rf = rf_grid.best_estimator_
print('Test score:',best_rf.score(X_test,y_test))

Output:

Test score: 0.9837777777777778
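
An accuracy figure is easier to judge against a baseline, because the classes are imbalanced: the mean of left is 0.238083 over 14999 rows, which implies about 3571 leavers. The sketch below, under that assumption, measures how well a model that always predicts "stayed" would score. The feature array is a placeholder, since this baseline ignores the features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Class counts implied by the describe() output (assumption: 3571 leavers).
n_total, n_left = 14999, 3571
y_all = np.r_[np.ones(n_left, dtype=int), np.zeros(n_total - n_left, dtype=int)]
X_placeholder = np.zeros((n_total, 1))  # DummyClassifier does not use the features

# Always predict the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_placeholder, y_all)
baseline_accuracy = baseline.score(X_placeholder, y_all)
print(round(baseline_accuracy, 4))
```

The baseline lands at roughly 76%, so the random forest's test accuracy of about 98% is a real improvement and not an artifact of the class balance.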

Use the random forest to see which features are most important:

features = X.columns
feature_importances = best_rf.feature_importances_
features_df = pd.DataFrame({'Features':features,'Importance Score':feature_importances})
features_df.sort_values('Importance Score',inplace = True,ascending = False)
print(features_df)

Output:

                 Features  Importance Score
0      satisfaction_level          0.287197
2          number_project          0.229172
4      time_spend_company          0.159471
3   average_monthly_hours          0.141351
1         last_evaluation          0.130275
5           Work_accident          0.011622
8                     low          0.009073
7                    high          0.007717
17                  sales          0.003182
19              technical          0.003036
9                  medium          0.002938
18                support          0.002562
6   promotion_last_5years          0.002516
12             accounting          0.001969
10                     IT          0.001931
14             management          0.001648
13                     hr          0.001362
15              marketing          0.001211
11                  RandD          0.001118
16            product_mng          0.000651

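Sum of the top five importance scores:

# Sum of the five largest importance scores
print(features_df['Importance Score'][:5].sum())

Output: the top five features together account for about 95% of the total importance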

0.954224518253131
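
A running total shows how quickly the importance accumulates, not just the sum at one cut-off. The sketch below copies the seven largest scores from the table above and finds how many features are needed to pass a 0.9 threshold; the threshold is an assumption chosen for the example. The top five scores in that table add up to about 0.947, slightly different from the 0.954 printed above, presumably because the forest was refit between the two outputs (a guess; the post does not say).

```python
import pandas as pd

# The seven largest importance scores copied from the table above.
top = pd.DataFrame({
    "Features": ["satisfaction_level", "number_project", "time_spend_company",
                 "average_monthly_hours", "last_evaluation", "Work_accident", "low"],
    "Importance Score": [0.287197, 0.229172, 0.159471,
                         0.141351, 0.130275, 0.011622, 0.009073],
})

# Running total of importance, from most to least important feature.
top["Cumulative"] = top["Importance Score"].cumsum()

# Smallest number of features whose cumulative importance reaches the threshold.
threshold = 0.9
n_needed = int((top["Cumulative"] < threshold).sum()) + 1

print(top)
print("features needed:", n_needed)
```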