
A simple explanation of Naive Bayes Classification


Translated from: A simple explanation of Naive Bayes Classification

I am finding it hard to understand the process of Naive Bayes, and I was wondering if someone could explain it with a simple step by step process in English. I understand it takes comparisons by times occurred as a probability, but I have no idea how the training data is related to the actual dataset.

Please give me an explanation of what role the training set plays. I am giving a very simple example for fruits here, like banana for example.

training set---
round-red
round-orange
oblong-yellow
round-red

dataset----
round-red
round-orange
round-red
round-orange
oblong-yellow
round-red
round-orange
oblong-yellow
oblong-yellow
round-red

#1

Reference: https://stackoom.com/question/gCxW/朴素贝叶斯分类的简单解释


#2

Your question, as I understand it, is divided into two parts: part one being that you need a better understanding of the Naive Bayes classifier, and part two being the confusion surrounding the training set.

In general, all machine learning algorithms need to be trained for supervised learning tasks like classification and prediction, or for unsupervised learning tasks like clustering.

During the training step, the algorithms are taught with a particular input dataset (the training set) so that later on we may test them on unknown inputs (which they have never seen before), which they may classify or predict etc. (in the case of supervised learning) based on their learning. This is what most of the machine learning techniques like Neural Networks, SVM, Bayesian etc. are based upon.

So in a general machine learning project, basically, you have to divide your input set into a Development Set (Training Set + Dev-Test Set) and a Test Set (or Evaluation Set). Remember, your basic objective is that your system learns and classifies new inputs which it has never seen before in either the Dev set or the test set.

The test set typically has the same format as the training set. However, it is very important that the test set be distinct from the training corpus: if we simply reused the training set as the test set, then a model that simply memorized its input, without learning how to generalize to new examples, would receive misleadingly high scores.

In general, for example, 70% of our data can be used as training set cases. Also remember to partition the original set into the training and test sets randomly.
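For instance, a minimal sketch of such a random 70/30 split (scikit-learn is assumed here purely for illustration; any random shuffling of the rows would do the same job):

from sklearn.model_selection import train_test_split

# Made-up feature rows and labels, just to show the mechanics of the split.
X = [[1, 0], [0, 1], [1, 1], [0, 0], [1, 0], [0, 1], [1, 1], [0, 0], [1, 0], [0, 1]]
y = [0, 1, 1, 0, 0, 1, 1, 0, 0, 1]

# 70% of the rows go to the training set, 30% to the test set, chosen at random.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, shuffle=True
)
print(len(X_train), len(X_test))   # 7 3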

Now I come to your other question about Naive Bayes.

To demonstrate the concept of Naïve Bayes Classification, consider the example given below:

[Figure: a collection of objects, each labeled either GREEN or RED]

As indicated, the objects can be classified as either GREEN or RED. Our task is to classify new cases as they arrive, ie, decide to which class label they belong, based on the currently existing objects.

Since there are twice as many GREEN objects as RED, it is reasonable to believe that a new case (which hasn't been observed yet) is twice as likely to have membership GREEN rather than RED. In the Bayesian analysis, this belief is known as the prior probability. Prior probabilities are based on previous experience, in this case the percentage of GREEN and RED objects, and are often used to predict outcomes before they actually happen.

Thus, we can write:

Prior Probability of GREEN: number of GREEN objects / total number of objects

Prior Probability of RED: number of RED objects / total number of objects

Since there is a total of 60 objects, 40 of which are GREEN and 20 RED, our prior probabilities for class membership are:

Prior Probability for GREEN: 40 / 60

Prior Probability for RED: 20 / 60

Having formulated our prior probability, we are now ready to classify a new object (the WHITE circle in the diagram below). Since the objects are well clustered, it is reasonable to assume that the more GREEN (or RED) objects in the vicinity of X, the more likely that the new case belongs to that particular color. To measure this likelihood, we draw a circle around X which encompasses a number (to be chosen a priori) of points irrespective of their class labels. Then we calculate the number of points in the circle belonging to each class label. From this we calculate the likelihood:

Likelihood of X given GREEN = Number of GREEN in the vicinity of X / Total number of GREEN cases = 1 / 40

Likelihood of X given RED = Number of RED in the vicinity of X / Total number of RED cases = 3 / 20

From the illustration above, it is clear that Likelihood of X given GREEN is smaller than Likelihood of X given RED, since the circle encompasses 1 GREEN object and 3 RED ones. Thus:

Probability of X given GREEN = 1 / 40

Probability of X given RED = 3 / 20

Although the prior probabilities indicate that X may belong to GREEN (given that there are twice as many GREEN compared to RED), the likelihood indicates otherwise; the class membership of X is RED (given that there are more RED objects in the vicinity of X than GREEN). In the Bayesian analysis, the final classification is produced by combining both sources of information, ie, the prior and the likelihood, to form a posterior probability using the so-called Bayes' rule (named after Rev. Thomas Bayes 1702-1761).

Posterior probability of X being GREEN = Prior probability of GREEN x Likelihood of X given GREEN = 4/6 x 1/40 = 1/60

Posterior probability of X being RED = Prior probability of RED x Likelihood of X given RED = 2/6 x 3/20 = 1/20

Finally, we classify X as RED since its class membership achieves the largest posterior probability.
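As a quick recap, here is a minimal sketch of the arithmetic in this answer (my own illustration, not code from the original):

# Counts taken from the example: 40 GREEN and 20 RED objects overall,
# 1 GREEN and 3 RED objects inside the circle drawn around X.
total_green, total_red = 40, 20
green_near_x, red_near_x = 1, 3

prior_green = total_green / (total_green + total_red)   # 40/60
prior_red = total_red / (total_green + total_red)        # 20/60

likelihood_green = green_near_x / total_green            # 1/40
likelihood_red = red_near_x / total_red                   # 3/20

posterior_green = prior_green * likelihood_green          # 1/60
posterior_red = prior_red * likelihood_red                 # 1/20

print("X is classified as:", "RED" if posterior_red > posterior_green else "GREEN")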


#3

I realize that this is an old question, with an established answer. The reason I'm posting is that the accepted answer has many elements of k-NN (k-nearest neighbors), a different algorithm.

Both k-NN and NaiveBayes are classification algorithms. Conceptually, k-NN uses the idea of "nearness" to classify new entities. In k-NN, 'nearness' is modeled with ideas such as Euclidean Distance or Cosine Distance. By contrast, in NaiveBayes, the concept of 'probability' is used to classify new entities.

Since the question is about Naive Bayes, here's how I'd describe the ideas and steps to someone. I'll try to do it with as few equations as possible, and in plain English as much as possible.

First, Conditional Probability & Bayes' Rule

Before someone can understand and appreciate the nuances of Naive Bayes', they need to know a couple of related concepts first, namely, the idea of Conditional Probability, and Bayes' Rule. (If you are familiar with these concepts, skip to the section titled Getting to Naive Bayes')

Conditional Probability in plain English: What is the probability that something will happen, given that something else has already happened.

Let's say that there is some Outcome O, and some Evidence E. From the way these probabilities are defined: the Probability of having both the Outcome O and Evidence E is: (Probability of O occurring) multiplied by the (Prob of E given that O happened)

One Example to understand Conditional Probability:

Let's say we have a collection of US Senators. Senators could be Democrats or Republicans. They are also either male or female.

If we select one senator completely randomly, what is the probability that this person is a female Democrat? Conditional Probability can help us answer that.

Probability of (Democrat and Female Senator) = Prob(Senator is Democrat) multiplied by Conditional Probability of Being Female given that they are a Democrat.

  P(Democrat & Female) = P(Democrat) * P(Female | Democrat) 

We could compute the exact same thing, the reverse way:

  P(Democrat & Female) = P(Female) * P(Democrat | Female) 

Understanding Bayes' Rule

Conceptually, this is a way to go from P(Evidence|Known Outcome) to P(Outcome|Known Evidence). Often, we know how frequently some particular evidence is observed, given a known outcome. We have to use this known fact to compute the reverse, to compute the chance of that outcome happening, given the evidence.

P(Outcome given that we know some Evidence) = P(Evidence given that we know the Outcome) times Prob(Outcome), scaled by the P(Evidence)

The classic example to understand Bayes' Rule:

Probability of Disease D given Test-positive = 

               Prob(Test is positive|Disease) * P(Disease)
     _______________________________________________________________
     (scaled by) Prob(Testing Positive, with or without the disease)
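To make that concrete, here is a small sketch with made-up numbers (prevalence and test accuracy are assumptions chosen only for illustration):

# Assumed numbers: 1% of people have the disease, the test catches 95% of
# true cases, and it wrongly flags 5% of healthy people.
p_disease = 0.01
p_pos_given_disease = 0.95
p_pos_given_healthy = 0.05

# P(Testing positive, with or without the disease) -- the scaling term.
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))   # ~0.161, despite the test looking "95% accurate"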

Now, all this was just preamble, to get to Naive Bayes.

Getting to Naive Bayes'

So far, we have talked only about one piece of evidence. In reality, we have to predict an outcome given multiple pieces of evidence. In that case, the math gets very complicated. To get around that complication, one approach is to 'uncouple' multiple pieces of evidence, and to treat each piece of evidence as independent. This approach is why this is called naive Bayes.

P(Outcome|Multiple Evidence) = 
P(Evidence1|Outcome) * P(Evidence2|outcome) * ... * P(EvidenceN|outcome) * P(Outcome)
scaled by P(Multiple Evidence)

Many people choose to remember this as:

                      P(Likelihood of Evidence) * Prior prob of outcome
P(outcome|evidence) = _________________________________________________
                                         P(Evidence)

Notice a few things about this equation:

  • If the Prob(evidence|outcome) is 1, then we are just multiplying by 1.
  • If the Prob(some particular evidence|outcome) is 0, then the whole prob. becomes 0. If you see contradicting evidence, we can rule out that outcome.
  • Since we divide everything by P(Evidence), we can even get away without calculating it.
  • The intuition behind multiplying by the prior is so that we give high probability to more common outcomes, and low probabilities to unlikely outcomes. These are also called base rates and they are a way to scale our predicted probabilities.

How to Apply NaiveBayes to Predict an Outcome?

Just run the formula above for each possible outcome. Since we are trying to classify, each outcome is called a class and it has a class label. Our job is to look at the evidence, to consider how likely it is to be this class or that class, and assign a label to each entity. Again, we take a very simple approach: the class that has the highest probability is declared the "winner" and that class label gets assigned to that combination of evidences.

Fruit Example

Let's try it out on an example to increase our understanding: the OP asked for a 'fruit' identification example.

Let's say that we have data on 1000 pieces of fruit. They happen to be Banana, Orange or some Other Fruit. We know 3 characteristics about each fruit:

  1. Whether it is Long
  2. Whether it is Sweet, and
  3. Whether its color is Yellow.

This is our 'training set.' We will use this to predict the type of any new fruit we encounter.

Type           Long | Not Long || Sweet | Not Sweet || Yellow |Not Yellow|Total
             ___________________________________________________________________
Banana      |  400  |    100   || 350   |    150    ||  450   |  50      |  500
Orange      |    0  |    300   || 150   |    150    ||  300   |   0      |  300
Other Fruit |  100  |    100   || 150   |     50    ||   50   | 150      |  200
            ____________________________________________________________________
Total       |  500  |    500   || 650   |    350    ||  800   | 200      | 1000
             ___________________________________________________________________

We can pre-compute a lot of things about our fruit collection.

The so-called "Prior" probabilities. 所谓“先验”概率。 (If we didn't know any of the fruit attributes, this would be our guess.) These are our base rates. (如果我们不知道任何水果属性,这就是我们的猜测。)这些是我们的base rates.

 P(Banana)      = 0.5 (500/1000)
 P(Orange)      = 0.3
 P(Other Fruit) = 0.2

Probability of "Evidence" “证据”的概率

P(Long)   = 0.5
P(Sweet)  = 0.65
P(Yellow) = 0.8

Probability of "Likelihood" “可能性”的可能性

P(Long|Banana) = 0.8
P(Long|Orange) = 0  [Oranges are never long in all the fruit we have seen.]
 ....

P(Yellow|Other Fruit)     =  50/200 = 0.25
P(Not Yellow|Other Fruit) = 0.75

Given a Fruit, how to classify it?

Let's say that we are given the properties of an unknown fruit, and asked to classify it. We are told that the fruit is Long, Sweet and Yellow. Is it a Banana? Is it an Orange? Or is it some Other Fruit?

We can simply run the numbers for each of the 3 outcomes, one by one. Then we choose the highest probability and 'classify' our unknown fruit as belonging to the class that had the highest probability based on our prior evidence (our 1000 fruit training set):

P(Banana|Long, Sweet and Yellow) 
      P(Long|Banana) * P(Sweet|Banana) * P(Yellow|Banana) * P(banana)
    = _______________________________________________________________
                      P(Long) * P(Sweet) * P(Yellow)

    = 0.8 * 0.7 * 0.9 * 0.5 / P(evidence)

    = 0.252 / P(evidence)


P(Orange|Long, Sweet and Yellow) = 0


P(Other Fruit|Long, Sweet and Yellow)
      P(Long|Other fruit) * P(Sweet|Other fruit) * P(Yellow|Other fruit) * P(Other Fruit)
    = ____________________________________________________________________________________
                                          P(evidence)

    = (100/200 * 150/200 * 50/200 * 200/1000) / P(evidence)

    = 0.01875 / P(evidence)

By an overwhelming margin (0.252 >> 0.01875), we classify this Sweet/Long/Yellow fruit as likely to be a Banana.
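Here is a minimal sketch of that computation, working directly from the counts in the training table above (P(evidence) is dropped because it is the same for every class):

# (total, long, sweet, yellow) counts for each class, read off the table.
counts = {
    "Banana":      (500, 400, 350, 450),
    "Orange":      (300,   0, 150, 300),
    "Other Fruit": (200, 100, 150,  50),
}
total_fruit = 1000

scores = {}
for fruit, (n, long_, sweet, yellow) in counts.items():
    prior = n / total_fruit
    scores[fruit] = prior * (long_ / n) * (sweet / n) * (yellow / n)

print(scores)                       # Banana: 0.252, Orange: 0.0, Other Fruit: 0.01875
print(max(scores, key=scores.get))  # Banana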

Why is Bayes Classifier so popular?

Look at what it eventually comes down to. Just some counting and multiplication. We can pre-compute all these terms, and so classifying becomes easy, quick and efficient.

Let z = 1 / P(evidence). Now we quickly compute the following three quantities.

P(Banana|evidence) = z * Prob(Banana) * Prob(Evidence1|Banana) * Prob(Evidence2|Banana) ...
P(Orange|Evidence) = z * Prob(Orange) * Prob(Evidence1|Orange) * Prob(Evidence2|Orange) ...
P(Other|Evidence)  = z * Prob(Other)  * Prob(Evidence1|Other)  * Prob(Evidence2|Other)  ...

Assign the class label of whichever has the highest number, and you are done.

Despite the name, Naive Bayes turns out to be excellent in certain applications. Text classification is one area where it really shines.
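For instance, a minimal text-classification sketch with scikit-learn (the library, the tiny corpus and the labels are my own assumptions, not part of the original answer):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus, just to show the shape of the API.
texts = ["cheap pills buy now", "meeting at noon", "buy cheap now", "lunch meeting today"]
labels = ["spam", "ham", "spam", "ham"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)
print(model.predict(["cheap meeting now"]))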

Hope that helps in understanding the concepts behind the Naive Bayes algorithm.


#4

I'll try to explain Bayes' rule with an example.

Suppose that you know that 10% of people are smokers.

Now you see someone who is a man and 15 years old. You want to know the chance that he is a smoker:

 P(smoker | he is a man and under 20) 

Since you know that 10% of people are smokers, your initial guess is 10% (the prior probability, without knowing anything about the person), but the other pieces of evidence (that he is a man and he is 15) can affect this guess.

Each piece of evidence may increase or decrease this chance. For example, the fact that he is a man may increase the chance, provided that this percentage (being a man) among non-smokers is lower. In other words, being a man must be a good indicator of being a smoker rather than a non-smoker.

We can show this contribution in another way. For each feature, you need to compare the commonness (probability) of that feature under the given conditions with its commonness alone (P(f | x) vs. P(f)). For example, if we know that 90% of smokers are men, it's not enough to say being a man is an indicator of being a smoker. For example, if the probability of being a man in the society is also 90%, then knowing that someone is a man doesn't help us (10% * (90% / 90%) = 10%). But if men make up 40% of the society, but 90% of the smokers, then knowing that someone is a man increases the chance of being a smoker (10% * (90% / 40%) = 22.5%). In the same way, if the probability of being a man in the society were 95%, then regardless of the fact that the percentage of men among smokers is high (90%), the evidence that someone is a man decreases the chance of him being a smoker! (10% * (90% / 95%) = 9.5%).

So we have:

P(smoker | f1, f2, f3,... ) = P(smoker) * contribution of f1* contribution of f2 *... =
P(smoker)* 
(P(being a man | smoker)/P(being a man))*
(P(under 20 | smoker)/ P(under 20))

Note that in this formula we assumed that being a man and being under 20 are independent features, so we multiplied them; it means that knowing that someone is under 20 has no effect on guessing whether he is a man or a woman. But this may not be true; for example, maybe most adolescents in a society are men...

To use this formula in a classifier

The classifier is given some features (being a man and being under 20) and it must decide if he is a smoker or not (these are two classes). It uses the above formula to calculate the probability of each class under the evidence (features), and it assigns the class with the highest probability to the input. To provide the required probabilities (90%, 10%, 80%...) it uses the training set. For example, it counts the people in the training set that are smokers and finds that they make up 10% of the sample. Then, for smokers, it checks how many of them are men or women... how many are above 20 or under 20... In other words, it tries to build the probability distribution of the features for each class based on the training data.
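A small sketch of the 'contribution' framing above (the 10%, 90% and 40% figures come from this answer; the under-20 numbers are made-up assumptions):

p_smoker = 0.10                  # prior: 10% of people are smokers

p_man_given_smoker = 0.90        # 90% of smokers are men
p_man = 0.40                     # men make up 40% of the society

p_under20_given_smoker = 0.15    # assumption, for illustration only
p_under20 = 0.20                 # assumption, for illustration only

# P(smoker) times the contribution of each (assumed independent) feature.
score = p_smoker * (p_man_given_smoker / p_man) * (p_under20_given_smoker / p_under20)
print(round(score, 4))   # 0.1688 -- compared against the same score computed for "non-smoker"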


#5

Ram Narasimhan explained the concept very nicely; below is an alternative explanation through a code example of Naive Bayes in action.
It uses an example problem from this book on page 351.
This is the data set that we will be using:
[Figure: the example data set from the book, reproduced as CSV below]
In the above dataset, if we give the hypothesis = {"Age":'<=30', "Income":"medium", "Student":'yes' , "Creadit_Rating":'fair'}, then what is the probability that he will buy or will not buy a computer?
The code below answers exactly that question.
Just create a file named new_dataset.csv and paste the following content.

Age,Income,Student,Creadit_Rating,Buys_Computer
<=30,high,no,fair,no
<=30,high,no,excellent,no
31-40,high,no,fair,yes
>40,medium,no,fair,yes
>40,low,yes,fair,yes
>40,low,yes,excellent,no
31-40,low,yes,excellent,yes
<=30,medium,no,fair,no
<=30,low,yes,fair,yes
>40,medium,yes,fair,yes
<=30,medium,yes,excellent,yes
31-40,medium,no,excellent,yes
31-40,high,yes,fair,yes
>40,medium,no,excellent,no

Here is the code; the comments explain everything we are doing here!

import pandas as pd
import pprint
from functools import reduce


class Classifier():
    data = None
    class_attr = None
    priori = {}
    cp = {}
    hypothesis = None

    def __init__(self, filename=None, class_attr=None):
        self.data = pd.read_csv(filename, sep=',', header=0)
        self.class_attr = class_attr

    '''
        probability(class) =    How many times it appears in the column
                             __________________________________________
                                  count of all class attribute values
    '''
    def calculate_priori(self):
        class_values = list(set(self.data[self.class_attr]))
        class_data = list(self.data[self.class_attr])
        for i in class_values:
            self.priori[i] = class_data.count(i) / float(len(class_data))
        print("Priori Values: ", self.priori)

    '''
        Here we calculate the individual probabilities
        P(outcome|evidence) =   P(Likelihood of Evidence) x Prior prob of outcome
                               ___________________________________________
                                                    P(Evidence)
    '''
    def get_cp(self, attr, attr_type, class_value):
        data_attr = list(self.data[attr])
        class_data = list(self.data[self.class_attr])
        total = 1  # start the count at 1 so an unseen attribute/class combination never yields 0
        for i in range(0, len(data_attr)):
            if class_data[i] == class_value and data_attr[i] == attr_type:
                total += 1
        return total / float(class_data.count(class_value))

    '''
        Here we calculate the likelihood of the evidence and multiply all individual probabilities with the priori
        P(Outcome|Multiple Evidence) = P(Evidence1|Outcome) x P(Evidence2|Outcome) x ... x P(EvidenceN|Outcome) x P(Outcome)
        scaled by P(Multiple Evidence)
    '''
    def calculate_conditional_probabilities(self, hypothesis):
        for i in self.priori:
            self.cp[i] = {}
            for j in hypothesis:
                self.cp[i].update({hypothesis[j]: self.get_cp(j, hypothesis[j], i)})
        print("\nCalculated Conditional Probabilities: \n")
        pprint.pprint(self.cp)

    def classify(self):
        print("Result: ")
        for i in self.cp:
            print(i, " ==> ", reduce(lambda x, y: x * y, self.cp[i].values()) * self.priori[i])


if __name__ == "__main__":
    c = Classifier(filename="new_dataset.csv", class_attr="Buys_Computer")
    c.calculate_priori()
    c.hypothesis = {"Age": '<=30', "Income": "medium", "Student": 'yes', "Creadit_Rating": 'fair'}

    c.calculate_conditional_probabilities(c.hypothesis)
    c.classify()

output:

Priori Values:  {'yes': 0.6428571428571429, 'no': 0.35714285714285715}

Calculated Conditional Probabilities: 

{
 'no': {
        '<=30': 0.8,
        'fair': 0.6, 
        'medium': 0.6, 
        'yes': 0.4
        },
'yes': {
        '<=30': 0.3333333333333333,
        'fair': 0.7777777777777778,
        'medium': 0.5555555555555556,
        'yes': 0.7777777777777778
      }
}

Result: 
yes  ==>  0.0720164609053
no  ==>  0.0411428571429

Hope it helps in better understanding the problem.

peace


#6

Naive Bayes: Naive Bayes comes under supervised machine learning, and is used to classify data sets. It is used to predict things based on its prior knowledge and independence assumptions.

They call it naive because its assumptions (it assumes that all of the features in the dataset are equally important and independent) are really optimistic and rarely true in most real-world applications.

It is a classification algorithm which makes decisions for unknown data sets. It is based on Bayes' Theorem, which describes the probability of an event based on its prior knowledge.

The diagram below shows how naive Bayes works.

[Figure: diagram of how naive Bayes works]

Formula to predict NB:

P(class|evidence) = P(evidence|class) * P(class) / P(evidence)

How to use the Naive Bayes Algorithm?

Let's take an example of how NB works.

Step 1: First we find the likelihood table, which shows the probability of yes or no in the diagram below. Step 2: Find the posterior probability of each class.

[Figure: frequency and likelihood tables for the Weather vs. Play example]

Problem: Find the probability that the player plays when the weather is Rainy.

P(Yes|Rainy) = P(Rainy|Yes) * P(Yes) / P(Rainy)

P(Rainy|Yes) = 2/9 = 0.222
P(Yes) = 9/14 = 0.64
P(Rainy) = 5/14 = 0.36

Now, P(Yes|Rainy) = 0.222 * 0.64 / 0.36 = 0.39, which is a lower probability, meaning the chances of the match being played are low.
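The same arithmetic as a quick check, with the fractions written out instead of the rounded decimals:

p_rainy_given_yes = 2 / 9
p_yes = 9 / 14
p_rainy = 5 / 14

p_yes_given_rainy = p_rainy_given_yes * p_yes / p_rainy
print(round(p_yes_given_rainy, 2))   # 0.4 (the rounded figures above give 0.39)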

For more reference, refer to these blogs.

Refer to the GitHub repository Naive-Bayes-Examples