
How to Prepare a Machine Learning Dataset: Machine Learning Walkthrough, Part One: Preparing the Data

程序员文章站 2022-05-01 14:53:06

Cleaning and preparing data is a critical first step in any machine learning project. In this blog post, Dataquest student Daniel Osei takes us through examining a dataset, selecting columns for features, exploring the data visually, and then encoding the features for machine learning.

This post is based on a Dataquest ‘Monthly Challenge’, where our students are given a free-form task to complete.

After first reading about Machine Learning on Quora in 2015, Daniel became excited at the prospect of an area that could combine his love of Mathematics and Programming. After reading this article on how to learn data science, Daniel started following the steps, eventually joining Dataquest to learn Data Science with us in April 2016.

We’d like to thank Daniel for his hard work, and for generously letting us publish this post. This walkthrough uses Python 3.5 and Jupyter notebook.

Machine Learning Challenge Overview

Lending Club is a marketplace for personal loans that matches borrowers who are seeking a loan with investors looking to lend money and make a return. Each borrower fills out a comprehensive application, providing their past financial history, the reason for the loan, and more. Lending Club evaluates each borrower’s credit score using past historical data (and their own data science process!) and assigns an interest rate to the borrower.

[Figure: The Lending Club website.]

The loan is then listed on the Lending Club marketplace. You can read more about their marketplace here.

Investors are primarily interested in receiving a return on their investments. Approved loans are listed on the Lending Club website, where qualified investors can browse recently approved loans, the borrower’s credit score, the purpose for the loan, and other information from the application.

Once an investor decides to fund a loan, the borrower then makes monthly payments back to Lending Club. Lending Club redistributes these payments to the investors. This means that investors don’t have to wait until the full amount is paid off to start to see money back. If a loan is fully paid off on time, the investors make a return which corresponds to the interest rate the borrower had to pay in addition to the requested amount. Many loans aren’t completely paid off on time, however, and some borrowers default on the loan.

The Challenge

Suppose an investor has approached us and asked us to build a machine learning model that can reliably predict whether a loan will be paid off. This investor describes himself/herself as a conservative investor who only wants to invest in loans that have a good chance of being paid off on time. Thus, this client is more interested in a machine learning model that does a good job of filtering out a high percentage of loan defaulters.

Our task is to construct a machine learning model that achieves a True Positive Rate greater than 70% while maintaining a False Positive Rate below 7%.
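
To make this target concrete, both rates can be computed from confusion-matrix counts. The sketch below is illustrative only: the counts are made up, the helper name rates is hypothetical, and which class counts as "positive" is an assumption that depends on how the target column is encoded later.

```python
def rates(tp, fp, tn, fn):
    """Return (true positive rate, false positive rate) from confusion counts."""
    tpr = tp / (tp + fn)  # share of actual positives the model catches
    fpr = fp / (fp + tn)  # share of actual negatives wrongly flagged as positive
    return tpr, fpr

# Hypothetical counts, chosen only to illustrate the arithmetic.
tpr, fpr = rates(tp=75, fp=6, tn=94, fn=25)

# The challenge target: TPR above 70% and FPR below 7%.
meets_target = tpr > 0.70 and fpr < 0.07
print(tpr, fpr, meets_target)
```

With these made-up counts the model would catch 75 of 100 actual positives and wrongly flag 6 of 100 actual negatives, which satisfies both thresholds.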

1. Examining the Data Set

Lending Club periodically releases data for all the approved and declined loan applications on their website. You can select different year ranges to download the dataset (in CSV format) for both approved and declined loans.

You’ll also find a data dictionary (in XLS format), towards the bottom of the page, which contains information on the different column names. The data dictionary is useful to help understand what a column represents in the dataset.

The data dictionary contains two sheets:

  • LoanStats sheet: describes the approved loans dataset
  • RejectStats sheet: describes the rejected loans dataset

We’ll be using the LoanStats sheet since we’re interested in the approved loans dataset.

The approved loans dataset contains information on current loans, completed loans, and defaulted loans. For this challenge, we’ll be working with approved loans data for the years 2007 to 2011.

First, let’s import some of the libraries that we’ll be using, and set some parameters to make the output easier to read.

import pandas as pd
import numpy as np
# Full option names are used because the short form 'max_columns' can be ambiguous in newer pandas versions.
pd.set_option('display.max_columns', 120)
pd.set_option('display.max_colwidth', 5000)

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.rcParams['figure.figsize'] = (12, 8)

Loading the Data Into a Pandas DataFrame

We’ve downloaded our dataset and named it lending_club_loans.csv, but now we need to load it into a pandas DataFrame to explore it.

To ensure that our code runs fast, we need to reduce the size of lending_club_loans.csv by doing the following:

  • Remove the first line: it contains extraneous text instead of the column titles. This text prevents the dataset from being parsed properly by the pandas library.
  • Remove the ‘desc’ column: it contains a long text explanation for the loan.
  • Remove the ‘url’ column: it contains a link to each loan on Lending Club, which can only be accessed with an investor account.
  • Remove all columns with more than 50% missing values: this allows us to move faster since we don’t need to spend time trying to fill these values.

We’ll also name the filtered dataset loans_2007 and later at the end of this section save it as loans_2007.csv to keep it separate from the raw data. This is good practice and makes sure we have our original data in case we need to go back and retrieve any of the original data we’re removing.

Now, let’s go ahead and perform these steps:
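
The code for these steps did not survive in this copy of the post, so here is a minimal sketch of how they could be carried out with pandas. The helper name load_loans and the tiny in-memory CSV are assumptions added so the snippet runs on its own; with the real file, the path to lending_club_loans.csv would be passed instead.

```python
import io
import pandas as pd

def load_loans(path_or_buffer):
    """Load the raw loans CSV and apply the size-reduction steps listed above."""
    # Skip the first line, which holds extraneous text rather than column titles.
    loans = pd.read_csv(path_or_buffer, skiprows=1, low_memory=False)
    # Drop the long free-text and link columns if they are present.
    loans = loans.drop(columns=['desc', 'url'], errors='ignore')
    # Keep only the columns where at most 50% of the values are missing.
    loans = loans.loc[:, loans.isnull().mean() <= 0.5]
    return loans

# Tiny stand-in for lending_club_loans.csv (made-up rows, for illustration only).
raw = io.StringIO(
    "extraneous first line\n"
    "id,loan_amnt,desc,url,mostly_empty\n"
    "1,5000,some text,http://example.com/1,\n"
    "2,2500,more text,http://example.com/2,\n"
    "3,2400,other text,http://example.com/3,7\n"
)
loans_2007 = load_loans(raw)
print(loans_2007.columns.tolist())

# Saving the filtered copy keeps it separate from the raw data:
# loans_2007.to_csv('loans_2007.csv', index=False)
```

In the toy data, mostly_empty is missing in two of three rows, so it is dropped along with desc and url.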

Let’s use the pandas head() method to display the first three rows of the loans_2007 DataFrame, just to make sure we were able to load the dataset properly:

loans_2007.head(3)

   id  member_id  loan_amnt  funded_amnt  funded_amnt_inv  term  int_rate  installment  grade  sub_grade  emp_title  emp_length  home_ownership  annual_inc  verification_status  issue_d  loan_status  pymnt_plan  purpose  title  zip_code  addr_state  dti  delinq_2yrs  earliest_cr_line  fico_range_low  fico_range_high  inq_last_6mths  open_acc  pub_rec  revol_bal  revol_util  total_acc  initial_list_status  out_prncp  out_prncp_inv  total_pymnt  total_pymnt_inv  total_rec_prncp  total_rec_int  total_rec_late_fee  recoveries  collection_recovery_fee  last_pymnt_d  last_pymnt_amnt  last_credit_pull_d  last_fico_range_high  last_fico_range_low  collections_12_mths_ex_med  policy_code  application_type  acc_now_delinq  chargeoff_within_12_mths  delinq_amnt  pub_rec_bankruptcies  tax_liens
0  1077501  1296599.0  5000.0  5000.0  4975.0  36 months  10.65%  162.87  B  B2  NaN  10+ years  RENT  24000.0  Verified  Dec-2011  Fully Paid  n  credit_card  Computer  860xx  AZ  27.65  0.0  Jan-1985  735.0  739.0  1.0  3.0  0.0  13648.0  83.7%  9.0  f  0.0  0.0  5863.155187  5833.84  5000.00  863.16  0.0  0.00  0.00  Jan-2015  171.62  Sep-2016  744.0  740.0  0.0  1.0  INDIVIDUAL  0.0  0.0  0.0  0.0  0.0
1  1077430  1314167.0  2500.0  2500.0  2500.0  60 months  15.27%  59.83  C  C4  Ryder  < 1 year  RENT  30000.0  Source Verified  Dec-2011  Charged Off  n  car  bike  309xx  GA  1.00  0.0  Apr-1999  740.0  744.0  5.0  3.0  0.0  1687.0  9.4%  4.0  f  0.0  0.0  1008.710000  1008.71  456.46  435.17  0.0  117.08  1.11  Apr-2013  119.66  Sep-2016  499.0  0.0  0.0  1.0  INDIVIDUAL  0.0  0.0  0.0  0.0  0.0
2  1077175  1313524.0  2400.0  2400.0  2400.0  36 months  15.96%  84.33  C  C5  NaN  10+ years  RENT  12252.0  Not Verified  Dec-2011  Fully Paid  n  small_business  real estate business  606xx  IL  8.72  0.0  Nov-2001  735.0  739.0  2.0  2.0  0.0  2956.0  98.5%  10.0  f  0.0  0.0  3005.666844  3005.67  2400.00  605.67  0.0  0.00  0.00  Jun-2014  649.91  Sep-2016  719.0  715.0  0.0  1.0  INDIVIDUAL  0.0  0.0  0.0  0.0  0.0

Let’s also use the pandas .shape attribute to view the number of samples and features we’re dealing with at this stage:

loans_2007.shape

(42538, 56)

2. Narrowing Down Our Columns

It’s a great idea to spend some time to familiarize ourselves with the columns in the dataset, to understand what each feature represents. This is important, because a poor understanding of the features could cause us to make mistakes in the data analysis and the modeling process.

We’ll be using the data dictionary Lending Club provided to help us become familiar with the columns and what each represents in the dataset. To make the process easier, we’ll create a DataFrame to contain the names of the columns, data type, first row’s values, and description from the data dictionary.

To make this easier, we’ve pre-converted the data dictionary from Excel format to a CSV.

data_dictionary = pd.read_csv('LCDataDictionary.csv')  # Loading in the data dictionary
print(data_dictionary.shape[0])
print(data_dictionary.columns.tolist())

117
['LoanStatNew', 'Description']
   LoanStatNew  Description
0  acc_now_delinq  The number of accounts on which the borrower is now delinquent.
1  acc_open_past_24mths  Number of trades opened in past 24 months.
2  addr_state  The state provided by the borrower in the loan application
3  all_util  Balance to credit limit on all trades
4  annual_inc  The self-reported annual income provided by the borrower during registration.

Now that we’ve got the data dictionary loaded, let’s join the first row of loans_2007 to the data_dictionary DataFrame to give us a preview DataFrame with the following columns:

  • name – contains the column names of loans_2007.
  • dtypes – contains the data types of the loans_2007 columns.
  • first value – contains the values of loans_2007 first row.
  • description – explains what each column in loans_2007 represents.

loans_2007_dtypes = pd.DataFrame(loans_2007.dtypes, columns=['dtypes'])
loans_2007_dtypes = loans_2007_dtypes.reset_index()
loans_2007_dtypes['name'] = loans_2007_dtypes['index']
loans_2007_dtypes = loans_2007_dtypes[['name', 'dtypes']]

loans_2007_dtypes['first value'] = loans_2007.loc[0].values
# The merge key must exist in both frames, so rename the dictionary columns to match the preview columns.
data_dictionary = data_dictionary.rename(columns={'LoanStatNew': 'name', 'Description': 'description'})
preview = loans_2007_dtypes.merge(data_dictionary, on='name', how='left')
   name  dtypes  first value  description
0  id  object  1077501  A unique LC assigned ID for the loan listing.
1  member_id  float64  1.2966e+06  A unique LC assigned Id for the borrower member.
2  loan_amnt  float64  5000  The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.
3  funded_amnt  float64  5000  The total amount committed to that loan at that point in time.
4  funded_amnt_inv  float64  4975  The total amount committed by investors for that loan at that point in time.

When we printed the shape of loans_2007 earlier, we noticed that it had 56 columns, which also means this preview DataFrame has 56 rows. It can be cumbersome to try to explore all the rows of preview at once, so instead we’ll break it up into three parts and look at a smaller selection of features each time.

As you explore the features to better understand each of them, you’ll want to pay attention to any column that:

  • leaks information from the future (after the loan has already been funded),
  • doesn’t affect the borrower’s ability to pay back the loan (e.g. a randomly generated ID value by Lending Club),
  • is formatted poorly,
  • requires more data or a lot of preprocessing to turn into a useful feature, or
  • contains redundant information.

I’ll say it again because it’s important: we need to pay especially close attention to data leakage, which can cause the model to overfit. This is because the model would also be learning from features that won’t be available when we use it to make predictions on future loans.
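
As a rough illustration of why leakage matters, the sketch below reuses three values from the head(3) output shown earlier. It checks whether total_rec_prncp (principal received to date, a value only known after the loan has run its course) is enough on its own to recover loan_status. The rule used here is a made-up heuristic for demonstration, not part of the original analysis.

```python
import pandas as pd

# Three loans, with values copied from the head(3) output earlier in the post.
sample = pd.DataFrame({
    'loan_amnt': [5000.0, 2500.0, 2400.0],
    'total_rec_prncp': [5000.00, 456.46, 2400.00],
    'loan_status': ['Fully Paid', 'Charged Off', 'Fully Paid'],
})

# Hypothetical rule: if all principal has been received, guess "Fully Paid".
principal_repaid = sample['total_rec_prncp'] >= sample['loan_amnt']
guess = principal_repaid.map({True: 'Fully Paid', False: 'Charged Off'})

# The leaked column reproduces the outcome for every row in this sample.
matches = (guess == sample['loan_status']).all()
print(matches)
```

A model trained with such a column would look accurate on historical data but could not be used on a new loan, where the repaid principal is not yet known.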

First Group Of Columns

Let’s display the first 19 rows of preview and analyze them:

preview[:19]

    name  dtypes  first value  description
0   id  object  1077501  A unique LC assigned ID for the loan listing.
1   member_id  float64  1.2966e+06  A unique LC assigned Id for the borrower member.
2   loan_amnt  float64  5000  The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.
3   funded_amnt  float64  5000  The total amount committed to that loan at that point in time.
4   funded_amnt_inv  float64  4975  The total amount committed by investors for that loan at that point in time.
5   term  object  36 months  The number of payments on the loan. Values are in months and can be either 36 or 60.
6   int_rate  object  10.65%  Interest Rate on the loan
7   installment  float64  162.87  The monthly payment owed by the borrower if the loan originates.
8   grade  object  B  LC assigned loan grade
9   sub_grade  object  B2  LC assigned loan subgrade
10  emp_title  object  NaN  The job title supplied by the Borrower when applying for the loan.*
11  emp_length  object  10+ years  Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years.
12  home_ownership  object  RENT  The home ownership status provided by the borrower during registration. Our values are: RENT, OWN, MORTGAGE, OTHER.
13  annual_inc  float64  24000  The self-reported annual income provided by the borrower during registration.
14  verification_status  object  Verified  Indicates if income was verified by LC, not verified, or if the income source was verified
15  issue_d  object  Dec-2011  The month which the loan was funded
16  loan_status  object  Fully Paid  Current status of the loan
17  pymnt_plan  object  n  Indicates if a payment plan has been put in place for the loan
18  purpose  object  credit_card  A category provided by the borrower for the loan request.

After analyzing the columns, we can conclude that the following features can be removed:

  • id – randomly generated field by Lending Club for unique identification purposes only.
  • member_id – also a randomly generated field by Lending Club for identification purposes only.
  • funded_amnt – leaks information from the future (after the loan has already started to be funded).
  • funded_amnt_inv – also leaks data from the future.
  • sub_grade – contains redundant information that is already in the grade column (more below).
  • int_rate – also included within the grade column.
  • emp_title – requires other data and a lot of processing to become potentially useful.
  • issue_d – leaks data from the future.

Lending Club uses a borrower’s grade and payment term (36 or 60 months) to assign an interest rate (you can read more about Rates & Fees). This causes variations in interest rate within a given grade. What may be more useful for our model is to focus on clusters of borrowers instead of individuals. That’s exactly what grading does: it segments borrowers based on their credit score and other behaviors, which is why we should keep the grade column and drop int_rate and sub_grade.

Let’s drop these columns from the DataFrame before moving on to the next group of columns.
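
The drop itself is missing from this copy of the post. A minimal sketch, assuming the eight column names listed above and a one-row stand-in for loans_2007 so the snippet runs on its own:

```python
import pandas as pd

# Columns flagged for removal in the first group (names taken from the list above).
first_group_drop = ['id', 'member_id', 'funded_amnt', 'funded_amnt_inv',
                    'int_rate', 'sub_grade', 'emp_title', 'issue_d']

# Toy one-row stand-in for loans_2007 (placeholder values, not real data).
loans_2007 = pd.DataFrame([[0] * (len(first_group_drop) + 2)],
                          columns=first_group_drop + ['grade', 'loan_amnt'])

# Drop the flagged columns and keep everything else.
loans_2007 = loans_2007.drop(columns=first_group_drop)
print(loans_2007.columns.tolist())
```

On the real DataFrame, only the last two statements would be needed.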

Second Group Of Columns

Let’s move on to the next 19 columns:

preview[19:38]

    name  dtypes  first value  description
19  title  object  Computer  The loan title provided by the borrower
20  zip_code  object  860xx  The first 3 numbers of the zip code provided by the borrower in the loan application.
21  addr_state  object  AZ  The state provided by the borrower in the loan application
22  dti  float64  27.65  A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income.
23  delinq_2yrs  float64  0  The number of 30+ days past-due incidences of delinquency in the borrower’s credit file for the past 2 years
24  earliest_cr_line  object  Jan-1985  The month the borrower’s earliest reported credit line was opened
25  fico_range_low  float64  735  The lower boundary range the borrower’s FICO at loan origination belongs to.
26  fico_range_high  float64  739  The upper boundary range the borrower’s FICO at loan origination belongs to.
27  inq_last_6mths  float64  1  The number of inquiries in past 6 months (excluding auto and mortgage inquiries)
28  open_acc  float64  3  The number of open credit lines in the borrower’s credit file.
29  pub_rec  float64  0  Number of derogatory public records
30  revol_bal  float64  13648  Total credit revolving balance
31  revol_util  object  83.7%  Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit.
32  total_acc  float64  9  The total number of credit lines currently in the borrower’s credit file
33  initial_list_status  object  f  The initial listing status of the loan. Possible values are – W, F
34  out_prncp  float64  0  Remaining outstanding principal for total amount funded
35  out_prncp_inv  float64  0  Remaining outstanding principal for portion of total amount funded by investors
36  total_pymnt  float64  5863.16  Payments received to date for total amount funded
37  total_pymnt_inv  float64  5833.84  Payments received to date for portion of total amount funded by investors

In this group, take note of the fico_range_low and fico_range_high columns. Both are in this second group of columns, but because they are related to some other columns, we’ll talk more about them after looking at the last group of columns.

We can drop the following columns:

  • zip_code – mostly redundant with the addr_state column since only the first 3 digits of the 5 digit zip code are visible.
  • out_prncp – leaks data from the future.
  • out_prncp_inv – also leaks data from the future.
  • total_pymnt – also leaks data from the future.
  • total_pymnt_inv – also leaks data from the future.

Let’s go ahead and remove these 5 columns from the DataFrame:
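
As with the first group, the code is missing here. A minimal sketch under the same assumptions (column names from the list above, a one-row stand-in for loans_2007):

```python
import pandas as pd

# Columns flagged for removal in the second group (names taken from the list above).
second_group_drop = ['zip_code', 'out_prncp', 'out_prncp_inv',
                     'total_pymnt', 'total_pymnt_inv']

# Toy one-row stand-in for loans_2007 (placeholder values, not real data).
loans_2007 = pd.DataFrame([[0] * (len(second_group_drop) + 2)],
                          columns=second_group_drop + ['addr_state', 'dti'])

loans_2007 = loans_2007.drop(columns=second_group_drop)
print(loans_2007.columns.tolist())
```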

Third Group Of Columns

Let’s analyze the last group of features:

preview[38:]

    name  dtypes  first value  description
38  total_rec_prncp  float64  5000  Principal received to date
39  total_rec_int  float64  863.16  Interest received to date
40  total_rec_late_fee  float64  0  Late fees received to date
41  recoveries  float64  0  post charge off gross recovery
42  collection_recovery_fee  float64  0  post charge off collection fee
43  last_pymnt_d  object  Jan-2015  Last month payment was received
44  last_pymnt_amnt  float64  171.62  Last total payment amount received
45  last_credit_pull_d  object  Sep-2016  The most recent month LC pulled credit for this loan
46  last_fico_range_high  float64  744  The upper boundary range the borrower’s last FICO pulled belongs to.
47  last_fico_range_low  float64  740  The lower boundary range the borrower’s last FICO pulled belongs to.
48  collections_12_mths_ex_med  float64  0  Number of collections in 12 months excluding medical collections
49  policy_code  float64  1  publicly available policy_code=1, new products not publicly available policy_code=2
50  application_type  object  INDIVIDUAL  Indicates whether the loan is an individual application or a joint application with two co-borrowers
51  acc_now_delinq  float64  0  The number of accounts on which the borrower is now delinquent.
52  chargeoff_within_12_mths  float64  0  Number of charge-offs within 12 months
53  delinq_amnt  float64  0  The past-due amount owed for the accounts on which the borrower is now delinquent.
54  pub_rec_bankruptcies  float64  0  Number of public record bankruptcies
55  tax_liens  float64  0  Number of tax liens

In this last group of columns, we need to drop the following, all of which leak data from the future:

  • total_rec_prncp
  • total_rec_int
  • total_rec_late_fee
  • recoveries
  • collection_recovery_fee
  • last_pymnt_d
  • last_pymnt_amnt

Let’s drop our last group of columns:
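
A minimal sketch of this final drop, again using the names from the list above and a one-row stand-in for loans_2007:

```python
import pandas as pd

# Columns flagged for removal in the third group (names taken from the list above).
third_group_drop = ['total_rec_prncp', 'total_rec_int', 'total_rec_late_fee',
                    'recoveries', 'collection_recovery_fee',
                    'last_pymnt_d', 'last_pymnt_amnt']

# Toy one-row stand-in for loans_2007 (placeholder values, not real data).
loans_2007 = pd.DataFrame([[0] * (len(third_group_drop) + 2)],
                          columns=third_group_drop + ['policy_code', 'tax_liens'])

loans_2007 = loans_2007.drop(columns=third_group_drop)
print(loans_2007.columns.tolist())
```

Counting the three lists, 8 + 5 + 7 = 20 columns are removed in total, so 36 of the original 56 columns should remain on the real DataFrame at this point.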

Investigating FICO Score Columns

Now, besides the explanations provided in the description column, let’s learn more about fico_range_low, fico_range_high, last_fico_range_low, and last_fico_range_high.

A FICO score is a credit score: a number used by banks and credit card companies to represent how creditworthy a person is. While there are a few types of credit scores used in the United States, the FICO score is the best known and most widely used.

When a borrower applies for a loan, Lending Club gets the borrower’s credit score from FICO. They are given the lower and upper limits of the range that the borrower’s score belongs to, and they store those values as fico_range_low and fico_range_high. After that, any updates to the borrower’s score are recorded as last_fico_range_low and last_fico_range_high.

A key part of any data science project is to do everything you can to understand the data. While researching this data set, I found a project done in 2014 by a group of students from Stanford University on this same dataset.

In the report for the project, the group listed the current credit score (last_fico_range) along with late fees and recovery fees as fields they mistakenly added to the features, and stated that they later learned these columns all leak information from the future.

However, following this group’s project, another group from Stanford worked on the same Lending Club dataset. They used the FICO score columns in their modeling, dropping only last_fico_range_low. This second group’s report described last_fico_range_high as one of the more important features in predicting accurate results.

The question we must answer is: do the FICO credit scores leak information from the future? Recall that a column is considered to leak information when it won’t be available at the time we use our model, in this case when we use our model on future loans.

This blog examines in depth the FICO scores for Lending Club loans. It notes that while the trend of the FICO scores is a great predictor of whether a loan will default, FICO scores continue to be updated by Lending Club after a loan is funded, so a defaulting loan can lower the borrower’s score. In other words, the updated scores leak data.

Therefore we can safely use fico_range_low and fico_range_high, but not last_fico_range_low and last_fico_range_high. Let’s take a look at the values in these columns:

print(loans_2007['fico_range_low'].unique())
print(loans_2007['fico_range_high'].unique())

[ 735.  740.  690.  695.  730.  660.  675.  725.  710.  705.  720.  665.
  670.  760.  685.  755.  680.  700.  790.  750.  715.  765.  745.  770.
  780.  775.  795.  810.  800.  815.  785.  805.  825.  820.  630.  625.
   nan  650.  655.  645.  640.  635.  610.  620.  615.]
[ 739.  744.  694.  699.  734.  664.  679.  729.  714.  709.  724.  669.
  674.  764.  689.  759.  684.  704.  794.  754.  719.  769.  749.  774.
  784.  779.  799.  814.  804.  819.  789.  809.  829.  824.  634.  629.
   nan  654.  659.  649.  644.  639.  614.  624.  619.]

Let’s get rid of the missing values, then plot histograms to look at the ranges of the two columns:
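
The code for this step is not shown here. A minimal sketch of how it might look, assuming the two numbers printed below are the row counts before and after dropping rows with missing FICO values. The small `loans` frame is made-up stand-in data, and the plotting call is left as a comment because it needs matplotlib:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for loans_2007 with one row of missing FICO values.
loans = pd.DataFrame({
    "fico_range_low": [735.0, 740.0, np.nan, 690.0],
    "fico_range_high": [739.0, 744.0, np.nan, 694.0],
})

fico_columns = ["fico_range_low", "fico_range_high"]

print(loans.shape[0])  # number of rows before dropping
# Keep only the rows where both FICO columns are present.
loans = loans.dropna(subset=fico_columns)
print(loans.shape[0])  # number of rows after dropping

# With matplotlib installed, the two ranges can be compared with:
# loans[fico_columns].plot.hist(alpha=0.5, bins=20)
```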


42538
42535

[Figure: histograms of fico_range_low and fico_range_high]

Let’s now go ahead and create a column for the average of the fico_range_low and fico_range_high columns and name it fico_average. Note that this is not the average FICO score for each borrower, but rather an average of the high and low ends of the range that we know the borrower is in.

loans_2007['fico_average'] = (loans_2007['fico_range_high'] + loans_2007['fico_range_low']) / 2

Let’s check what we just did.
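
The check itself is not shown. One way to do it, sketched here on made-up stand-in rows rather than the real loans_2007 data, is to print the three columns side by side:

```python
import pandas as pd

# Hypothetical stand-in rows; the real values come from loans_2007.
loans = pd.DataFrame({
    "fico_range_low": [735.0, 740.0, 690.0],
    "fico_range_high": [739.0, 744.0, 694.0],
})
loans["fico_average"] = (loans["fico_range_high"] + loans["fico_range_low"]) / 2

# Show the inputs next to the derived column.
cols = ["fico_range_low", "fico_range_high", "fico_average"]
print(loans[cols].head())
```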

   fico_range_low  fico_range_high  fico_average
0           735.0            739.0         737.0
1           740.0            744.0         742.0
2           735.0            739.0         737.0
3           690.0            694.0         692.0
4           695.0            699.0         697.0

Good! We got the mean calculations and everything right. Now, we can go ahead and drop fico_range_low, fico_range_high, last_fico_range_low, and last_fico_range_high columns.

drop_cols = ['fico_range_low', 'fico_range_high', 'last_fico_range_low',
             'last_fico_range_high']
loans_2007 = loans_2007.drop(drop_cols, axis=1)
loans_2007.shape

(42535, 33)

Notice just by becoming familiar with the columns in the dataset, we’re able to reduce the number of columns from 56 to 33.

Decide On A Target Column

Now, let’s decide on the appropriate column to use as a target column for modeling. Keep in mind that the main goal is to predict who will pay off a loan and who will default.

We learned from the description of columns in the preview DataFrame that loan_status is the only field in the main dataset that describes a loan status, so let’s use this column as the target column.

    name         dtypes  first value  description
16  loan_status  object  Fully Paid   Current status of the loan

Currently, this column contains text values that need to be converted to numerical values before we can use them to train a model.

Let’s explore the different values in this column and come up with a strategy for converting them. We’ll use the Series method value_counts() to return the frequency of the unique values in the loan_status column.

loans_2007["loan_status"].value_counts()

Fully Paid                                             33586
Charged Off                                             5653
Does not meet the credit policy. Status:Fully Paid      1988
Does not meet the credit policy. Status:Charged Off      761
Current                                                  513
In Grace Period                                           16
Late (31-120 days)                                        12
Late (16-30 days)                                          5
Default                                                    1
Name: loan_status, dtype: int64

The loan status has nine different possible values!

Let’s learn about these unique values to determine the ones that best describe the final outcome of a loan, and also the kind of classification problem we’ll be dealing with.

You can read about most of the different loan statuses on the Lending Club website as well as these posts on the Lend Academy and Orchard forums. I have pulled that data together in a table below so we can see the unique values, their frequency in the dataset and what each means:

   Loan Status                                           Count  Meaning
0  Fully Paid                                            33586  Loan has been fully paid off.
1  Charged Off                                           5653   Loan for which there is no longer a reasonable expectation of further payments.
2  Does not meet the credit policy. Status:Fully Paid    1988   While the loan was paid off, the loan application today would no longer meet the credit policy and wouldn’t be approved on to the marketplace.
3  Does not meet the credit policy. Status:Charged Off   761    While the loan was charged off, the loan application today would no longer meet the credit policy and wouldn’t be approved on to the marketplace.
4  Current                                               513    Loan is up to date on current payments.
5  In Grace Period                                       16     The loan is past due but still in the grace period of 15 days.
6  Late (31-120 days)                                    12     Loan hasn’t been paid in 31 to 120 days (late on the current payment).
7  Late (16-30 days)                                     5      Loan hasn’t been paid in 16 to 30 days (late on the current payment).
8  Default                                               1      Loan is defaulted on and no payment has been made for more than 121 days.

Remember, our goal is to build a machine learning model that can learn from past loans in trying to predict which loans will be paid off and which won’t. From the above table, only the Fully Paid and Charged Off values describe the final outcome of a loan. The other values describe loans that are still ongoing, and even though some loans are late on payments, we can’t jump the gun and classify them as Charged Off.

Also, while the Default status resembles the Charged Off status, in Lending Club’s eyes, loans that are charged off have essentially no chance of being repaid while default ones have a small chance. Therefore, we should use only samples where the loan_status column is 'Fully Paid' or 'Charged Off'.

We’re not interested in any statuses that indicate that the loan is ongoing or in progress, because predicting that something is in progress doesn’t tell us anything.

Since we’re interested in being able to predict which of these 2 values a loan will fall under, we can treat the problem as binary classification.

Let’s remove all the loans that don’t contain either 'Fully Paid' or 'Charged Off' as the loan’s status and then transform the 'Fully Paid' values to 1 for the positive case and the 'Charged Off' values to 0 for the negative case.

This will mean that out of the ~42,000 rows we have, we’ll be removing just over 3,000.

There are a few different ways to transform all of the values in a column; we’ll use the DataFrame method replace().

loans_2007 = loans_2007[(loans_2007["loan_status"] == "Fully Paid") |
                        (loans_2007["loan_status"] == "Charged Off")]

mapping_dictionary = {"loan_status": {"Fully Paid": 1, "Charged Off": 0}}
loans_2007 = loans_2007.replace(mapping_dictionary)

Visualizing the Target Column Outcomes

[Figure: plots of the loan_status outcomes, paid off versus charged off]

These plots indicate that a significant number of borrowers in our dataset paid off their loan: 85.62% of borrowers paid off the amount borrowed, while 14.38% unfortunately defaulted. It is these ‘defaulters’ that we’re most interested in filtering out as much as possible, to reduce losses on investment returns.
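
The plotting code is not shown, but the percentages behind such plots can be computed directly. A minimal sketch on a made-up target column (the values below are illustrative, not the real loan data):

```python
import pandas as pd

# Hypothetical encoded target column: 1 = Fully Paid, 0 = Charged Off.
loan_status = pd.Series([1, 1, 1, 0, 1, 1, 0, 1], name="loan_status")

# normalize=True returns fractions instead of raw counts.
shares = loan_status.value_counts(normalize=True) * 100
print(shares.round(2))
```

Running the same two lines on the real loan_status column would give the share of paid-off and charged-off loans.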

In part two of our walkthrough, we’ll learn that this significant percentage difference, or class imbalance, in the target variable needs to be considered when we build our model.

Remove Columns with only One Value

To wrap up this section, let’s look for any columns that contain only one unique value and remove them. These columns won’t be useful for the model since they don’t add any information to each loan application. In addition, removing these columns will reduce the number of columns we’ll need to explore further in the next stage.

The pandas Series method nunique() returns the number of unique values, excluding any null values. We can apply this method across the dataset to remove these columns in one easy step.

loans_2007 = loans_2007.loc[:, loans_2007.apply(pd.Series.nunique) != 1]

There may also be some columns with more than one unique value where one of the values occurs only rarely in the dataset. Let’s find and drop any such column(s):
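
The code that produced the listing below is not shown. A minimal sketch of one way to get it, assuming the cut-off is "fewer than 4 unique values" (an assumption chosen to match the columns printed below) and using made-up stand-in data:

```python
import pandas as pd

# Hypothetical stand-in data; pymnt_plan is dominated by a single value.
loans = pd.DataFrame({
    "term": [" 36 months", " 60 months", " 36 months", " 36 months"],
    "pymnt_plan": ["n", "n", "n", "y"],
    "loan_amnt": [5000.0, 2500.0, 2400.0, 10000.0],
})

# Print the value counts of every column with fewer than 4 unique values.
low_cardinality = []
for col in loans.columns:
    if loans[col].nunique() < 4:
        low_cardinality.append(col)
        print(loans[col].value_counts(), end="\n\n")
```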


 36 months    29096
 60 months    10143
Name: term, dtype: int64

Not Verified       16845
Verified           12526
Source Verified     9868
Name: verification_status, dtype: int64

1    33586
0     5653
Name: loan_status, dtype: int64

n    39238
y        1
Name: pymnt_plan, dtype: int64


The payment plan column (pymnt_plan) has two unique values, 'y' and 'n', with 'y' occurring only once. Let’s drop this column:

print(loans_2007.shape[1])
loans_2007 = loans_2007.drop('pymnt_plan', axis=1)
print("We've been able to reduce the features to => {}".format(loans_2007.shape[1]))

25
We've been able to reduce the features to => 24

Lastly, let’s save our work in this section to a CSV file.
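
The save step is not shown. A minimal sketch, assuming the output path is the one the next section reads from (processed_data/filtered_loans_2007.csv) and using a made-up stand-in DataFrame:

```python
import os
import pandas as pd

# Hypothetical stand-in for the cleaned loans_2007 DataFrame.
loans = pd.DataFrame({"loan_amnt": [5000.0, 2500.0], "loan_status": [1, 0]})

# Assumed output location, chosen to match the path read in the next section.
output_dir = "processed_data"
output_path = os.path.join(output_dir, "filtered_loans_2007.csv")

os.makedirs(output_dir, exist_ok=True)
# index=False keeps the row index out of the file.
loans.to_csv(output_path, index=False)

# Read the file back to confirm it round-trips.
reloaded = pd.read_csv(output_path)
print(reloaded.shape)
```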

3. Preparing the Features for Machine Learning

In this section, we’ll prepare the filtered_loans_2007.csv data for machine learning. We’ll focus on handling missing values, converting categorical columns to numeric columns and removing any other extraneous columns.

We need to handle missing values and categorical features before feeding the data into a machine learning algorithm, because the mathematics underlying most machine learning models assumes that the data is numerical and contains no missing values. To reinforce this requirement, scikit-learn will raise an error if you try to train a model like linear regression or logistic regression on data that contains missing values or non-numeric values.

Here’s an outline of what we’ll be doing in this stage:

  • Handle Missing Values
  • Investigate Categorical Columns
    • Convert Categorical Columns To Numeric Features
      • Map Ordinal Values To Integers
      • Encode Nominal Values As Dummy Variables

First though, let’s load in the data from last section’s final output:

filtered_loans = pd.read_csv('processed_data/filtered_loans_2007.csv')
print(filtered_loans.shape)
filtered_loans.head()

(39239, 24)
   loan_amnt       term  installment  grade  emp_length  home_ownership  annual_inc  verification_status  loan_status         purpose                                  title  addr_state    dti  delinq_2yrs  earliest_cr_line  inq_last_6mths  open_acc  pub_rec  revol_bal  revol_util  total_acc  last_credit_pull_d  pub_rec_bankruptcies  fico_average
0     5000.0  36 months       162.87      B   10+ years            RENT     24000.0             Verified            1     credit_card                               Computer          AZ  27.65          0.0          Jan-1985             1.0       3.0      0.0    13648.0       83.7%        9.0            Sep-2016                   0.0         737.0
1     2500.0  60 months        59.83      C    < 1 year            RENT     30000.0      Source Verified            0             car                                   bike          GA   1.00          0.0          Apr-1999             5.0       3.0      0.0     1687.0        9.4%        4.0            Sep-2016                   0.0         742.0
2     2400.0  36 months        84.33      C   10+ years            RENT     12252.0         Not Verified            1  small_business                   real estate business          IL   8.72          0.0          Nov-2001             2.0       2.0      0.0     2956.0       98.5%       10.0            Sep-2016                   0.0         737.0
3    10000.0  36 months       339.31      C   10+ years            RENT     49200.0      Source Verified            1           other                               personel          CA  20.00          0.0          Feb-1996             1.0      10.0      0.0     5598.0         21%       37.0            Apr-2016                   0.0         692.0
4     5000.0  36 months       156.46      A     3 years            RENT     36000.0      Source Verified            1         wedding  My wedding loan I promise to pay back          AZ  11.20          0.0          Nov-2004             3.0       9.0      0.0     7963.0       28.3%       12.0            Jan-2016                   0.0         732.0

Handle Missing Values

Let’s compute the number of missing values and determine how to handle them. We can return the number of missing values across the DataFrame by:

  • First, use the pandas DataFrame method isnull() to return a DataFrame containing Boolean values:
    • True if the original value is null
    • False if the original value isn’t null
  • Then, use the pandas DataFrame method sum() to calculate the number of null values in each column.
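
The two steps above can be chained into one expression. A minimal sketch on a made-up stand-in for filtered_loans (the column values here are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for filtered_loans.
loans = pd.DataFrame({
    "loan_amnt": [5000.0, 2500.0, 2400.0],
    "title": ["Computer", None, "bike"],
    "revol_util": ["83.7%", "9.4%", np.nan],
})

# isnull() marks each missing cell as True; sum() counts the Trues per column.
null_counts = loans.isnull().sum()
print("Number of null values in each column:\n{}".format(null_counts))
```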

Number of null values in each column:
loan_amnt                 0
term                      0
installment               0
grade                     0
emp_length                0
home_ownership            0
annual_inc                0
verification_status       0
loan_status               0
purpose                   0
title                    10
addr_state                0
dti                       0
delinq_2yrs               0
earliest_cr_line          0
inq_last_6mths            0
open_acc                  0
pub_rec                   0
revol_bal                 0
revol_util               50
total_acc                 0
last_credit_pull_d        2
pub_rec_bankruptcies    697
fico_average              0
dtype: int64

Notice that while most of the columns have 0 missing values, title has 10 missing values, last_credit_pull_d has 2, revol_util has 50, and pub_rec_bankruptcies contains 697 rows with missing values. Let’s remove columns entirely where more than 1% (392) of the rows for that column contain a null value. In addition, we’ll remove the remaining rows containing null values, which means we’ll lose a bit of data, but in return keep some extra features to use for prediction.
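
Rather than reading the columns off the listing, the 1% rule can also be applied programmatically. A minimal sketch on made-up data; the column names `sparse` and `mostly_full` are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in: 200 rows, "sparse" is missing in 5 of them (2.5%),
# "mostly_full" in 1 of them (0.5%).
loans = pd.DataFrame({
    "sparse": [np.nan] * 5 + [1.0] * 195,
    "mostly_full": [np.nan] + [1.0] * 199,
})

threshold = 0.01  # drop columns with more than 1% missing values
null_share = loans.isnull().mean()  # fraction of missing rows per column
cols_to_drop = null_share[null_share > threshold].index.tolist()

# Drop the sparse columns first, then any rows that still contain nulls.
loans = loans.drop(cols_to_drop, axis=1).dropna()
print(cols_to_drop, loans.shape)
```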

This means that we’ll keep the title and revol_util columns, just removing rows containing missing values, but drop the pub_rec_bankruptcies column entirely since more than 1% of the rows have a missing value for this column.

Here’s a list of steps we can use to achieve that:

  • Use the drop method to remove the pub_rec_bankruptcies column from filtered_loans.
  • Use the dropna method to remove all rows from filtered_loans containing any missing values.

filtered_loans = filtered_loans.drop("pub_rec_bankruptcies", axis=1)
filtered_loans = filtered_loans.dropna()

Next, we’ll focus on the categorical columns.

Investigate Categorical Columns

Keep in mind, the goal in this section is to have all the columns as numeric columns (int or float data type), and containing no missing values. We just dealt with the missing values, so let’s now find out the number of columns that are of the object data type and then move on to process them into numeric form.
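
The code that counts the data types is not shown. A minimal sketch of one way to do it, on a made-up stand-in for filtered_loans:

```python
import pandas as pd

# Hypothetical stand-in for filtered_loans.
loans = pd.DataFrame({
    "loan_amnt": [5000.0, 2500.0],
    "loan_status": [1, 0],
    "term": [" 36 months", " 60 months"],
    "grade": ["B", "C"],
})

print("Data types and their frequency")
# dtypes is a Series of column types, so value_counts() tallies them.
dtype_counts = loans.dtypes.value_counts()
print(dtype_counts)
```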


Data types and their frequency
float64    11
object     11
int64       1
dtype: int64

We have 11 object columns that contain text which needs to be converted into numeric features. Let’s select just the object columns using the DataFrame method select_dtypes(), then display a sample row to get a better sense of how the values in each column are formatted.

object_columns_df = filtered_loans.select_dtypes(include=['object'])
print(object_columns_df.iloc[0])

term                     36 months
grade                            B
emp_length               10+ years
home_ownership                RENT
verification_status       Verified
purpose                credit_card
title                     Computer
addr_state                      AZ
earliest_cr_line          Jan-1985
revol_util                   83.7%
last_credit_pull_d        Sep-2016
Name: 0, dtype: object

Notice that the revol_util column contains numeric values, but is formatted as object. We learned from the description of columns in the preview DataFrame earlier that revol_util is the revolving line utilization rate, or the amount of credit the borrower is using relative to all available credit.

We need to format revol_util as numeric values. Here’s what we should do:

  • Use the str.rstrip() string method to strip the trailing percent sign (%).
  • On the resulting Series object, use the astype() method to convert to the type float.
  • Assign the new Series of float values back to the revol_util column in filtered_loans.
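
The three steps above can be sketched as follows, using a made-up stand-in for filtered_loans with a few sample percentages:

```python
import pandas as pd

# Hypothetical stand-in for filtered_loans.
filtered_loans = pd.DataFrame({"revol_util": ["83.7%", "9.4%", "21%"]})

# Strip the trailing percent sign, then convert the strings to floats.
filtered_loans["revol_util"] = (
    filtered_loans["revol_util"].str.rstrip("%").astype("float")
)
print(filtered_loans["revol_util"].dtype)
```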

Moving on, these columns seem to represent categorical values:

  • home_ownership – home ownership status, can only be 1 of 4 categorical values according to the data dictionary.
  • verification_status – indicates if income was verified by Lending Club.
  • emp_length – number of years the borrower was employed at the time of application.
  • term – number of payments on the loan, either 36 or 60.
  • addr_state – borrower’s state of residence.
  • grade – LC assigned loan grade based on credit score.
  • purpose – a category provided by the borrower for the loan request.
  • title – loan title provided by the borrower.

To be sure, let’s confirm by checking the number of unique values in each of them.

Also, based on the first row’s values for purpose and title, it appears these two columns reflect the same information. We’ll explore their unique value counts separately to confirm if this is true.

Lastly, notice the first row’s values for both earliest_cr_line and last_credit_pull_d columns contain date values that would require a good amount of feature engineering for them to be potentially useful:

  • earliest_cr_line – The month the borrower’s earliest reported credit line was opened
  • last_credit_pull_d – The most recent month Lending Club pulled credit for this loan

We’ll remove these date columns from the DataFrame.

First, let’s explore the unique value counts of the six columns that seem to contain categorical values:

cols = ['home_ownership', 'grade', 'verification_status', 'emp_length', 'term', 'addr_state']
for name in cols:
    print(name, ':')
    print(object_columns_df[name].value_counts(), '\n')

home_ownership :
RENT        18677
MORTGAGE    17381
OWN          3020
OTHER          96
NONE            3
Name: home_ownership, dtype: int64 

grade :
B    11873
A    10062
C     7970
D     5194
E     2760
F     1009
G      309
Name: grade, dtype: int64 

verification_status :
Not Verified       16809
Verified           12515
Source Verified     9853
Name: verification_status, dtype: int64 

emp_length :
10+ years    8715
< 1 year     4542
2 years      4344
3 years      4050
4 years      3385
5 years      3243
1 year       3207
6 years      2198
7 years      1738
8 years      1457
9 years      1245
n/a          1053
Name: emp_length, dtype: int64 

term :
 36 months    29041
 60 months    10136
Name: term, dtype: int64 

addr_state :
CA    7019
NY    3757
FL    2831
TX    2693
NJ    1825
IL    1513
PA    1493
VA    1388
GA    1381
MA    1322
OH    1197
MD    1039
AZ     863
WA     830
CO     777
NC     772
CT     738
MI     718
MO     677
MN     608
NV     488
SC     469
WI     447
OR     441
AL     441
LA     432
KY     319
OK     294
KS     264
UT     255
AR     241
DC     211
RI     197
NM     187
WV     174
HI     170
NH     169
DE     113
MT      84
WY      83
AK      79
SD      61
VT      53
MS      19
TN      17
IN       9
ID       6
IA       5
NE       5
ME       3
Name: addr_state, dtype: int64 

Most of these columns contain discrete categorical values which we can encode as dummy variables and keep. The addr_state column, however, contains too many unique values, so it’s better to drop it.

Next, let’s look at the unique value counts for the purpose and title columns to understand which columns we want to keep.


Unique Values in column: purpose

debt_consolidation    18355
credit_card            5073
other                  3921
home_improvement       2944
major_purchase         2178
small_business         1792
car                    1534
wedding                 940
medical                 688
moving                  580
vacation                377
house                   372
educational             320
renewable_energy        103
Name: purpose, dtype: int64 

Unique Values in column: title

Debt Consolidation                         2142
Debt Consolidation Loan                    1670
Personal Loan                               650
Consolidation                               501
debt consolidation                          495
Credit Card Consolidation                   354
Home Improvement                            350
Debt consolidation                          331
Small Business Loan                         317
Credit Card Loan                            310
Personal                                    306
Consolidation Loan                          255
Home Improvement Loan                       243
personal loan                               231
personal                                    217
Loan                                        210
Wedding Loan                                206
Car Loan                                    198
consolidation                               197
Other Loan                                  187
Credit Card Payoff                          153
Wedding                                     152
Major Purchase Loan                         144
Credit Card Refinance                       143
Consolidate                                 126
Medical                                     120
Credit Card                                 115
home improvement                            109
My Loan                                      94
Credit Cards                                 92
                                           ... 
toddandkim4ever                               1
Remainder of down payment                     1
Building a Financial Future                   1
Higher interest payoff                        1
Chase Home Improvement Loan                   1
Sprinter Purchase                             1
Refi credit card-great payment record         1
Karen's Freedom Loan                          1
Business relocation and partner buyout        1
Update My New House                           1
tito                                          1
florida vacation                              1
Back to 0                                     1
Bye Bye credit card                           1
britschool                                    1
Consolidation 16X60                           1
Last Call                                     1
Want to be debt free in "3"                   1
for excellent credit                          1
loaney                                        1
jamal's loan                                  1
Refying Lending Club-I LOVE THIS PLACE!       1
Consoliation Loan                             1
Personal/ Consolidation                       1
Pauls Car                                     1
Road to freedom loan                          1
Pay it off FINALLY!                           1
MASH consolidation                            1
Destination Wedding                           1
Store Charge Card                             1
Name: title, dtype: int64 


It appears the purpose and title columns do contain overlapping information, but the purpose column contains fewer discrete values and is cleaner, so we’ll keep it and drop title.

Let’s drop the columns we’ve decided not to keep so far:

drop_cols = ['last_credit_pull_d', 'addr_state', 'title', 'earliest_cr_line']
filtered_loans = filtered_loans.drop(drop_cols, axis=1)

Convert Categorical Columns to Numeric Features

First, let’s understand the two types of categorical features we have in our dataset and how we can convert each to numerical features:

  • Ordinal values: these categorical values have a natural order, that is, you can sort them in either increasing or decreasing order. For instance, we learnt earlier that Lending Club grades loan applicants from A to G and assigns each applicant a corresponding interest rate. Grade A is the least risky, grade B is riskier than A, and so on:

A < B < C < D < E < F < G ; where < means less risky than

  • Nominal values: these are regular categorical values that can’t be ordered. For instance, while we can order loan applicants in the employment length column (emp_length) based on years spent in the workforce:

year 1 < year 2 < year 3 … < year N,

we can’t do that with the column purpose. It wouldn’t make sense to say:

car < wedding < education < moving < house

These are the columns we now have in our dataset:

  • Ordinal Values
    • grade
    • emp_length
  • Nominal Values
    • home_ownership
    • verification_status
    • purpose
    • term

There are different approaches to handling each of these two types. In the following steps, we’ll convert each of them accordingly.

To map the ordinal values to integers, we can use the pandas DataFrame method replace() to map both grade and emp_length to appropriate numeric values:
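
The mapping code is not shown. The sketch below is one way to do it, under two assumptions: the mapping values are inferred from the sample output that follows ("10+ years" becomes 10, "< 1 year" becomes 0, grade A becomes 1), and "n/a" is mapped to 0 as a guess. It uses Series.map() per column instead of the DataFrame replace() call mentioned above, because map() returns an integer column directly. The stand-in rows are made up:

```python
import pandas as pd

# Hypothetical stand-in for filtered_loans.
filtered_loans = pd.DataFrame({
    "emp_length": ["10+ years", "< 1 year", "3 years", "n/a"],
    "grade": ["B", "C", "A", "G"],
})

# Assumed mappings; treating "n/a" as 0 years is a guess, not a given.
mapping_dict = {
    "emp_length": {
        "10+ years": 10, "9 years": 9, "8 years": 8, "7 years": 7,
        "6 years": 6, "5 years": 5, "4 years": 4, "3 years": 3,
        "2 years": 2, "1 year": 1, "< 1 year": 0, "n/a": 0,
    },
    "grade": {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6, "G": 7},
}

# Apply each mapping to its column; unmapped values would become NaN.
for column, mapping in mapping_dict.items():
    filtered_loans[column] = filtered_loans[column].map(mapping)

print(filtered_loans[["emp_length", "grade"]].head())
```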

   emp_length  grade
0          10      2
1           0      3
2          10      3
3          10      3
4           3      1

Perfect! Let’s move on to the Nominal Values. The approach to converting nominal features into numerical features is to encode them as dummy variables. The process will be:

  • Use the pandas get_dummies() function to return a new DataFrame containing a new column for each dummy variable
  • Use the concat() function to add these dummy columns back to the original DataFrame
  • Then drop the original columns entirely using the drop() method

Let's go ahead and encode the nominal columns that we now have in our dataset.

nominal_columns = ["home_ownership", "verification_status", "purpose", "term"]
dummy_df = pd.get_dummies(filtered_loans[nominal_columns])
filtered_loans = pd.concat([filtered_loans, dummy_df], axis=1)
filtered_loans = filtered_loans.drop(nominal_columns, axis=1)
First five rows of filtered_loans (shown transposed, one column per line):

                                            0        1        2        3        4
loan_amnt                              5000.0   2500.0   2400.0  10000.0   5000.0
installment                            162.87    59.83    84.33   339.31   156.46
grade                                       2        3        3        3        1
emp_length                                 10        0       10       10        3
annual_inc                            24000.0  30000.0  12252.0  49200.0  36000.0
loan_status                                 1        0        1        1        1
dti                                     27.65     1.00     8.72    20.00    11.20
delinq_2yrs                               0.0      0.0      0.0      0.0      0.0
inq_last_6mths                            1.0      5.0      2.0      1.0      3.0
open_acc                                  3.0      3.0      2.0     10.0      9.0
pub_rec                                   0.0      0.0      0.0      0.0      0.0
revol_bal                             13648.0   1687.0   2956.0   5598.0   7963.0
revol_util                               83.7      9.4     98.5     21.0     28.3
total_acc                                 9.0      4.0     10.0     37.0     12.0
fico_average                            737.0    742.0    737.0    692.0    732.0
home_ownership_MORTGAGE                     0        0        0        0        0
home_ownership_NONE                         0        0        0        0        0
home_ownership_OTHER                        0        0        0        0        0
home_ownership_OWN                          0        0        0        0        0
home_ownership_RENT                         1        1        1        1        1
verification_status_Not Verified            0        0        1        0        0
verification_status_Source Verified         0        1        0        1        1
verification_status_Verified                1        0        0        0        0
purpose_car                                 0        1        0        0        0
purpose_credit_card                         1        0        0        0        0
purpose_debt_consolidation                  0        0        0        0        0
purpose_educational                         0        0        0        0        0
purpose_home_improvement                    0        0        0        0        0
purpose_house                               0        0        0        0        0
purpose_major_purchase                      0        0        0        0        0
purpose_medical                             0        0        0        0        0
purpose_moving                              0        0        0        0        0
purpose_other                               0        0        0        1        0
purpose_renewable_energy                    0        0        0        0        0
purpose_small_business                      0        0        1        0        0
purpose_vacation                            0        0        0        0        0
purpose_wedding                             0        0        0        0        1
term_ 36 months                             1        0        1        1        1
term_ 60 months                             0        1        0        0        0
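To see the three steps in isolation, here is a small sketch on a toy frame. The frame toy_loans and its values are made up for illustration. One detail to be aware of: in pandas 2.0 and later, get_dummies() returns boolean columns by default, so the sketch passes dtype="uint8" on the assumption that you want the 0/1 integer columns shown in this post.

```python
import pandas as pd

# Toy stand-in for filtered_loans (values invented for the example).
toy_loans = pd.DataFrame(
    {
        "loan_amnt": [5000.0, 2500.0, 2400.0],
        "purpose": ["credit_card", "car", "small_business"],
        "term": [" 36 months", " 60 months", " 36 months"],
    }
)

nominal_columns = ["purpose", "term"]

# 1. One 0/1 column per distinct value, named <column>_<value>.
dummy_df = pd.get_dummies(toy_loans[nominal_columns], dtype="uint8")

# 2. Attach the dummy columns to the original frame, side by side.
encoded = pd.concat([toy_loans, dummy_df], axis=1)

# 3. Drop the original text columns, which are now redundant.
encoded = encoded.drop(nominal_columns, axis=1)

print(encoded.columns.tolist())
```

The leading space in "term_ 36 months" comes straight from the raw values, which is why the same space shows up in the column names of the real dataset.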

To wrap things up, let's inspect our final output from this section to make sure all the features are of the same length, contain no null values, and are numeric.

Let's use the pandas info() method to inspect the filtered_loans DataFrame:

filtered_loans.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 39177 entries, 0 to 39238
Data columns (total 39 columns):
loan_amnt                              39177 non-null float64
installment                            39177 non-null float64
grade                                  39177 non-null int64
emp_length                             39177 non-null int64
annual_inc                             39177 non-null float64
loan_status                            39177 non-null int64
dti                                    39177 non-null float64
delinq_2yrs                            39177 non-null float64
inq_last_6mths                         39177 non-null float64
open_acc                               39177 non-null float64
pub_rec                                39177 non-null float64
revol_bal                              39177 non-null float64
revol_util                             39177 non-null float64
total_acc                              39177 non-null float64
fico_average                           39177 non-null float64
home_ownership_MORTGAGE                39177 non-null uint8
home_ownership_NONE                    39177 non-null uint8
home_ownership_OTHER                   39177 non-null uint8
home_ownership_OWN                     39177 non-null uint8
home_ownership_RENT                    39177 non-null uint8
verification_status_Not Verified       39177 non-null uint8
verification_status_Source Verified    39177 non-null uint8
verification_status_Verified           39177 non-null uint8
purpose_car                            39177 non-null uint8
purpose_credit_card                    39177 non-null uint8
purpose_debt_consolidation             39177 non-null uint8
purpose_educational                    39177 non-null uint8
purpose_home_improvement               39177 non-null uint8
purpose_house                          39177 non-null uint8
purpose_major_purchase                 39177 non-null uint8
purpose_medical                        39177 non-null uint8
purpose_moving                         39177 non-null uint8
purpose_other                          39177 non-null uint8
purpose_renewable_energy               39177 non-null uint8
purpose_small_business                 39177 non-null uint8
purpose_vacation                       39177 non-null uint8
purpose_wedding                        39177 non-null uint8
term_ 36 months                        39177 non-null uint8
term_ 60 months                        39177 non-null uint8
dtypes: float64(12), int64(3), uint8(24)
memory usage: 5.7 MB
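Reading the info() output by eye works for 39 columns, but the same checks can also be expressed in code. The sketch below is one possible way to do that; the helper name check_model_ready and the toy frame are hypothetical and not part of the original walkthrough.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def check_model_ready(df):
    """Return a list of problems found; an empty list means the frame passed."""
    problems = []
    null_counts = df.isnull().sum()
    for col in df.columns:
        if null_counts[col] > 0:
            problems.append(f"{col}: {null_counts[col]} null values")
        if not is_numeric_dtype(df[col]):
            problems.append(f"{col}: non-numeric dtype {df[col].dtype}")
    return problems


# Toy frame with one deliberately bad column (text plus a missing value).
toy = pd.DataFrame(
    {"loan_amnt": [5000.0, 2500.0], "grade": [2, 3], "purpose": ["car", None]}
)
problems = check_model_ready(toy)
print(problems)
```

Because every column of a DataFrame shares the same index, the "same length" requirement is already guaranteed; the null counts are what reveal gaps inside that length.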

Save to CSV

It is good practice to store the final output of each section or stage of your workflow in a separate CSV file. One benefit is that we can change a later step in our data processing flow without having to recalculate everything that came before it.
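A minimal sketch of that save step, assuming a toy frame and a made-up file name (cleaned_loans.csv) written to a temporary directory so the example is self-contained. In a real project you would point the path at your own output folder.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the cleaned filtered_loans DataFrame.
toy = pd.DataFrame({"loan_amnt": [5000.0, 2500.0], "grade": [2, 3]})

# Hypothetical output location for this stage of the workflow.
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "cleaned_loans.csv")

# index=False leaves the row index out of the file, so reloading
# does not produce an extra unnamed column.
toy.to_csv(out_path, index=False)

# A later stage can start from the file instead of redoing the cleaning.
reloaded = pd.read_csv(out_path)
print(reloaded.dtypes)
```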

Next Steps

In this post, we used the data dictionary that Lending Club provides, together with the values in the first row of the Loans_2007 DataFrame, to become familiar with the columns in the dataset, and we were able to remove many columns that aren't useful for modeling. We also selected loan_status as our target column and decided to focus our modeling efforts on binary classification.

Then, we performed the final data preparation needed to get the features into data types that can be fed into machine learning algorithms. We converted all columns of the object data type (categorical features) to numerical values, because those are the only type of values scikit-learn can work with.

Translated from: https://www.pybloggers.com/2016/12/machine-learning-walkthrough-part-one-preparing-the-data/
