Random Forests in Python
Random forest is a highly versatile machine learning method with numerous applications ranging from marketing to healthcare and insurance. It can be used to model the impact of marketing on customer acquisition, retention, and churn or to predict disease risk and susceptibility in patients.
Random forest is capable of regression and classification. It can handle a large number of features, and it’s helpful for estimating which of your variables are important in the underlying data being modeled.
This is a post about random forests using Python.
What is a Random Forest?
Random forest is a solid choice for nearly any prediction problem (even non-linear ones). It’s a relatively new machine learning strategy (it came out of Bell Labs in the 90s) and it can be used for just about anything. It belongs to a larger class of machine learning algorithms called ensemble methods.
Ensemble Learning
Ensemble learning involves the combination of several models to solve a single prediction problem. It works by generating multiple classifiers/models which learn and make predictions independently. Those predictions are then combined into a single (mega) prediction that should be as good as or better than the prediction made by any one classifier.
Random forest is a kind of ensemble learning, as it relies on an ensemble of decision trees. More on ensemble learning in Python here: Scikit-Learn docs.
Randomized Decision Trees
So we know that random forest is an aggregation of other models, but what types of models is it aggregating? As you might have guessed from its name, random forest aggregates Classification (or Regression) Trees. A decision tree is composed of a series of decisions that can be used to classify an observation in a dataset.
Random Forest
The algorithm to induce a random forest will create a bunch of random decision trees automatically. Since the trees are generated at random, most won’t be all that meaningful to learning your classification/regression problem (maybe 99.9% of trees).
If an observation has a length of 45, blue eyes, and 2 legs, it’s going to be classified as red.
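To make the idea of "a series of decisions" concrete, here is a minimal sketch of one hand-written decision tree that reproduces the classification in the sentence above. The split thresholds, the feature names (`length`, `eyes`, `legs`), and the labels other than red are hypothetical and chosen only for illustration; a real tree learns its splits from data.

```python
def classify(observation):
    """Walk a tiny hand-written decision tree and return a color label."""
    # Hypothetical split points; a trained tree would learn these from data.
    if observation["length"] > 40:
        if observation["eyes"] == "blue":
            if observation["legs"] <= 2:
                return "red"
            return "green"
        return "green"
    return "blue"


# The observation from the text: length 45, blue eyes, 2 legs.
print(classify({"length": 45, "eyes": "blue", "legs": 2}))
```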
Arboreal Voting
So what good are 10000 (probably) bad models? Well, it turns out that they really aren’t that helpful. What is helpful are the few really good decision trees that you also generated along with the bad ones.
When you make a prediction, the new observation gets pushed down each decision tree and assigned a predicted value/label. Once each of the trees in the forest has reported its predicted value/label, the predictions are tallied up and the mode (the most common vote) across all trees is returned as the final prediction.
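The tallying step can be sketched in a few lines of plain Python. This is only an illustration of a majority vote: the `majority_vote` helper and the list of votes are made up for the example. Note also that scikit-learn's `RandomForestClassifier` averages the per-tree class probabilities rather than counting hard votes, and regression forests average the per-tree values.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the most common label among the per-tree predictions."""
    label, _count = Counter(tree_predictions).most_common(1)[0]
    return label


# Hypothetical votes from a ten-tree forest for one new observation.
votes = ["red", "white", "red", "red", "white",
         "red", "red", "white", "red", "red"]
print(majority_vote(votes))  # prints "red" (7 votes to 3)
```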
Put simply, the 99.9% of trees that are irrelevant make predictions that are all over the map and cancel each other out. The predictions of the minority of trees that are good rise above that noise and yield a good prediction.
Why Should I Use It?
It’s Easy
Random forest is the Leatherman of learning methods. You can throw pretty much anything at it and it’ll do a serviceable job. It does a particularly good job of estimating inferred transformations and, as a result, doesn’t require much tuning the way an SVM does (i.e. it’s good for folks with tight deadlines).
An Example Transformation
Random forest is capable of learning without carefully crafted data transformations. Take the f(x) = log(x) function for example.
Alright let’s write some code. We’ll be writing our Python code in Yhat’s very own interactive environment built for analyzing data, Rodeo. You can download Rodeo for Mac, Windows or Linux [here](https://www.yhat.com/products/rodeo).
First, create some fake data and add a little noise.
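```python
import numpy as np
import pylab as pl

# 1,000 points drawn uniformly from [1, 100)
x = np.random.uniform(1, 100, 1000)
# log(x) plus Gaussian noise with a standard deviation of 0.3
y = np.log(x) + np.random.normal(0, .3, 1000)

# Noisy samples as a scatter plot, the true function as a solid line
pl.scatter(x, y, s=1, label="log(x) with noise")
pl.plot(np.arange(1, 100), np.log(np.arange(1, 100)), c="b", label="log(x) true function")
pl.xlabel("x")
pl.ylabel("f(x) = log(x)")
pl.legend(loc="best")
pl.title("A Basic Log Function")
pl.show()
```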
Following along in Rodeo? Here’s what you should see.
Let’s take a closer look at that plot.
If we try to build a basic linear model to predict y using x, we wind up with a straight line that sort of bisects the log(x) function. If we use a random forest instead, it does a much better job of approximating the log(x) curve, and we get something that looks much more like the true function.
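The comparison described above can be sketched with scikit-learn. This is a minimal sketch rather than the code behind the original plots: the fixed seed, the choice of `LinearRegression` as the "basic linear model", and the mean squared error against the noise-free log(x) curve are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Same kind of fake data as before, with a fixed seed (an assumption) for repeatability
rng = np.random.RandomState(0)
x = rng.uniform(1, 100, 1000)
y = np.log(x) + rng.normal(0, 0.3, 1000)
X = x.reshape(-1, 1)  # scikit-learn expects a 2-D feature matrix

linear = LinearRegression().fit(X, y)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Score both models against the noise-free curve on an evenly spaced grid
grid = np.arange(1, 100).reshape(-1, 1)
truth = np.log(grid.ravel())
linear_mse = np.mean((linear.predict(grid) - truth) ** 2)
forest_mse = np.mean((forest.predict(grid) - truth) ** 2)

print("linear model MSE vs. true log(x):", linear_mse)
print("random forest MSE vs. true log(x):", forest_mse)
```

Plotting `linear.predict(grid)` and `forest.predict(grid)` on top of the scatter plot should show the straight line cutting through the curve and the forest tracking it.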
You could argue that the random forest overfits the log(x) function a little bit. Either way, I think this does a nice job of illustrating how the random forest isn’t bound by linear constraints.
Uses
Variable Selection
One of the best use cases for random forest is feature selection. One of the byproducts of trying lots of decision tree variations is that you can examine which variables are working best/worst in each tree.
When a certain tree uses one variable and another doesn’t, you can compare the value lost or gained from the inclusion/exclusion of that variable. The good random forest implementations are going to do that for you, so all you need to do is know which method or variable to look at.
In the following examples, we’re trying to figure out which variables are most important for classifying a wine as being red or white.
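In scikit-learn, the attribute to look at is `feature_importances_` on a fitted forest. The wine data is not included in this post, so the sketch below uses the iris dataset as a stand-in; the dataset choice, the seed, and the number of trees are assumptions for illustration only.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Stand-in data: iris instead of the red/white wine data (an assumption)
iris = load_iris()
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(iris.data, iris.target)

# One importance score per feature; the scores sum to 1
importances = pd.Series(clf.feature_importances_, index=iris.feature_names)
print(importances.sort_values(ascending=False))
```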
Classification
Random forest is also great for classification. It can be used to make predictions for categories with multiple possible values and it can be calibrated to output probabilities as well. One thing you do need to watch out for is overfitting. Random forest can be prone to overfitting, especially when working with relatively small datasets. You should be suspicious if your model is making predictions that are “too good” on your test set.
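Here is a minimal sketch of both points using scikit-learn: class probabilities via `predict_proba`, and a quick comparison of training and test accuracy as a sanity check. The iris data, the 75/25 split, and the seed are assumptions for illustration. `predict_proba` returns the averaged per-tree estimates; no separate calibration step is shown here.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# One row per observation, one column per class; each row sums to 1
proba = clf.predict_proba(X_test[:3])
print(proba)

# Compare accuracy on the data the model has seen with data it has not
train_acc = clf.score(X_train, y_train)
test_acc = clf.score(X_test, y_test)
print("train accuracy:", train_acc)
print("test accuracy:", test_acc)
```

A large gap between the two accuracies is one hint that the forest has memorized the training set.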
One way to avoid overfitting is to only use really relevant features in your model. While this isn’t always cut and dried, using a feature selection technique (like the one mentioned previously) can make it a lot easier.
Regression
Yep. It does regression too.
I’ve found that random forest, unlike other algorithms, does really well learning on categorical variables or a mixture of categorical and real variables. Categorical variables with high cardinality (# of possible values) can be tricky, so having something like this in your back pocket can come in quite useful.
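As a rough sketch of regression on mixed inputs: scikit-learn's forests expect numeric features, so the example below encodes the categorical column with `OrdinalEncoder` first. The fake dataset, the column names (`color`, `acidity`, `quality`), and the encoding choice are all assumptions for illustration, not something taken from the original post.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Fake data (hypothetical): one categorical and one real-valued feature
rng = np.random.RandomState(0)
n = 500
df = pd.DataFrame({
    "color": rng.choice(["red", "white", "rose"], size=n),
    "acidity": rng.uniform(0, 10, size=n),
})
offsets = {"red": 3.0, "white": -1.0, "rose": 0.5}
df["quality"] = (
    df["color"].map(offsets) + 0.5 * df["acidity"] + rng.normal(0, 0.2, size=n)
)

# Encode the categorical column as integers; pass the numeric column through
preprocess = ColumnTransformer(
    [("cat", OrdinalEncoder(), ["color"])],
    remainder="passthrough",
)
model = make_pipeline(
    preprocess, RandomForestRegressor(n_estimators=100, random_state=0)
)

# Fit on the first 400 rows and score (R^2) on the remaining 100
train, test = df.iloc[:400], df.iloc[400:]
features = ["color", "acidity"]
model.fit(train[features], train["quality"])
r2 = model.score(test[features], test["quality"])
print("held-out R^2:", r2)
```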
A Short Python Example
Scikit-Learn is a great way to get started with random forest. The scikit-learn API is extremely consistent across algorithms, so you can horse-race models and switch between them very easily. A lot of times I start with something simple and then move to random forest.
One of the best features of the random forest implementation in scikit-learn is the n_jobs parameter. This will automatically parallelize fitting your random forest based on the number of cores you want to use. Here’s a great presentation by scikit-learn contributor Olivier Grisel where he talks about training a random forest on a 20 node EC2 cluster.
```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np

# Load iris into a DataFrame and randomly flag roughly 75% of the rows for training
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
print(df.head())

train, test = df[df['is_train']], df[~df['is_train']]
features = df.columns[:4]

# n_jobs=2 fits the trees on two cores in parallel
clf = RandomForestClassifier(n_jobs=2)
# Use the categorical codes so the labels always line up with iris.target_names
y = train['species'].cat.codes
clf.fit(train[features], y)

# Map the predicted codes back to species names and compare with the actual labels
preds = iris.target_names[clf.predict(test[features])]
print(pd.crosstab(test['species'], preds, rownames=['actual'], colnames=['preds']))
```
Following along? Here’s what you should see(ish). We’re using *randomly* selected data, so your exact values will differ each time.
actual \ preds | setosa | versicolor | virginica
---|---|---|---
setosa | 6 | 0 | 0
versicolor | 0 | 16 | 1
virginica | 0 | 0 | 12
Final Thoughts
Translated from: https://www.pybloggers.com/2016/11/random-forests-in-python/