Machine Learning with sklearn: Bayesian Networks in Practice (Part 3)
Structure Learning for Bayesian Networks
Structure learning methods fall into three groups: score-based structure learning, constraint-based structure learning, and methods that combine the two (hybrid structure learning).
Scoring functions come in two main families: Bayesian scoring functions and information-theoretic scoring functions.
Bayesian scoring functions include the K2 score, the BD score, and the BDeu score.
Information-theoretic scoring functions include the MDL score, the BIC score, and the AIC score.
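For reference, the BIC score has the familiar penalized log-likelihood form (a standard textbook definition, not specific to pgmpy):

BIC(G; D) = log P(D | G, θ̂) − (d/2) · log N

where θ̂ are the maximum-likelihood parameters, d is the number of free parameters of the structure G, and N is the sample size. AIC uses the same log-likelihood but penalizes with d instead of (d/2) · log N.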
Constraint-based methods rely on dependency analysis, i.e. conditional independence tests.
Structure learning with Bayesian scoring functions
import pandas as pd
import numpy as np
from pgmpy.estimators import BdeuScore, K2Score, BicScore
from pgmpy.models import BayesianModel
# Randomly generate a data sample with three variables, where Z depends on X and Y
data = pd.DataFrame(np.random.randint(0, 4, size=(5000, 2)), columns=list('XY'))
data['Z'] = data['X'] + data['Y']
bdeu = BdeuScore(data, equivalent_sample_size=5)
k2 = K2Score(data)
bic = BicScore(data)
model1 = BayesianModel([('X', 'Z'), ('Y', 'Z')]) # X -> Z <- Y
model2 = BayesianModel([('X', 'Z'), ('X', 'Y')]) # Y <- X -> Z
print("-----model1的评分-----")
print(bdeu.score(model1))
print(k2.score(model1))
print(bic.score(model1))
print("-----model2的评分-----")
print(bdeu.score(model2))
print(k2.score(model2))
print(bic.score(model2))
----- Scores for model1 -----
-13938.59210841577
-14329.38201774425
-14294.632146121177
----- Scores for model2 -----
-20900.11792216164
-20926.96302682963
-20944.16319103185
The results show that model1 scores higher (is less negative) than model2.
Importantly, these scores are decomposable: each variable's contribution can be computed locally given its potential parents, independent of the rest of the network:
print(bdeu.local_score('Z', parents=[]))
print(bdeu.local_score('Z', parents=['X']))
print(bdeu.local_score('Z', parents=['X', 'Y']))
-9271.730661171412
-6990.193641137852
-57.119275509385034
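As a quick sanity check of this decomposability, the total model score should equal the sum of the per-node local scores. A minimal sketch, reusing the bdeu and model1 objects from above (for these scores the additional structure prior is zero by default):

total = sum(bdeu.local_score(node, parents=list(model1.predecessors(node)))
            for node in model1.nodes())
print(total)  # should agree with bdeu.score(model1)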
Search methods that optimize such a scoring function commonly include exhaustive search (usable for up to about 5 nodes) and hill climbing (a greedy algorithm). The pgmpy implementations look as follows.
Exhaustive search, for 5 nodes or fewer:
from pgmpy.estimators import ExhaustiveSearch
es = ExhaustiveSearch(data, scoring_method=bic)
best_model = es.estimate()
print(best_model.edges())
print("\nAll DAGs by score:")
for score, dag in reversed(es.all_scores()):
    print(score, dag.edges())
[('X', 'Z'), ('Y', 'Z')]
All DAGs by score:
-14294.632146121177 [('X', 'Z'), ('Y', 'Z')]
-14327.184045586026 [('X', 'Z'), ('X', 'Y'), ('Y', 'Z')]
-14327.184045586027 [('Z', 'X'), ('Z', 'Y'), ('X', 'Y')]
-14327.184045586027 [('Z', 'Y'), ('X', 'Z'), ('X', 'Y')]
-14327.18404558603 [('Z', 'X'), ('Z', 'Y'), ('Y', 'X')]
-14327.18404558603 [('Z', 'X'), ('Y', 'Z'), ('Y', 'X')]
-14327.18404558603 [('X', 'Z'), ('Y', 'Z'), ('Y', 'X')]
-16575.5120019897 [('Z', 'X'), ('Y', 'X')]
-16577.020154881986 [('Z', 'Y'), ('X', 'Y')]
-18663.28333516333 [('Z', 'X'), ('Z', 'Y')]
-18663.28333516333 [('Z', 'X'), ('Y', 'Z')]
-18663.28333516333 [('Z', 'Y'), ('X', 'Z')]
-20911.611291567002 [('Z', 'X')]
-20911.611291567002 [('X', 'Z')]
-20913.11944445929 [('Z', 'Y')]
-20913.11944445929 [('Y', 'Z')]
-20944.16319103185 [('X', 'Z'), ('Y', 'X')]
-20944.16319103185 [('Z', 'X'), ('X', 'Y')]
-20944.16319103185 [('X', 'Z'), ('X', 'Y')]
-20945.67134392414 [('Z', 'Y'), ('Y', 'X')]
-20945.67134392414 [('Y', 'Z'), ('Y', 'X')]
-20945.67134392414 [('X', 'Y'), ('Y', 'Z')]
-23161.447400862962 []
-23193.999300327807 [('X', 'Y')]
-23193.99930032781 [('Y', 'X')]
For larger numbers of nodes, hill climbing can be used:
from pgmpy.estimators import HillClimbSearch
# create some data with dependencies
data = pd.DataFrame(np.random.randint(0, 3, size=(2500, 8)), columns=list('ABCDEFGH'))
data['A'] += data['B'] + data['C']
data['H'] = data['G'] - data['A']
hc = HillClimbSearch(data, scoring_method=BicScore(data))
best_model = hc.estimate()
print(best_model.edges())
[('A', 'C'), ('A', 'B'), ('G', 'A'), ('G', 'H'), ('H', 'A'), ('C', 'B')]
Search strategies:
The search space of DAGs is super-exponential in the number of variables, and the scoring functions above admit local maxima. The first property makes exhaustive search intractable for all but very small networks; the second prevents efficient local optimization algorithms from always finding the optimal structure. Identifying the ideal structure is therefore often intractable. Despite this bad news, heuristic search strategies frequently yield good results.
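To get a feeling for how fast this space grows, the number of labeled DAGs follows Robinson's recurrence. A small sketch (standard combinatorics, independent of pgmpy):

from math import comb

def num_dags(n, _cache={0: 1}):
    # Robinson's recurrence for the number of DAGs on n labeled nodes
    if n not in _cache:
        _cache[n] = sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * num_dags(n - k)
                        for k in range(1, n + 1))
    return _cache[n]

for n in range(1, 7):
    print(n, num_dags(n))  # 1, 3, 25, 543, 29281, 3781503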
If only a few nodes are involved (read: fewer than 5), ExhaustiveSearch can be used to compute the score of every possible DAG and return the best-scoring one.
Once more nodes are involved, one needs to switch to heuristic search. HillClimbSearch implements a greedy local search that starts from a DAG (default: a fully disconnected DAG) and proceeds by iteratively performing the single-edge manipulation that maximally increases the score. The search terminates once a local maximum is found.
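To make the greedy step concrete, here is a deliberately stripped-down sketch of such a search (an illustration under simplifying assumptions, not pgmpy's actual implementation: it only considers edge additions, whereas HillClimbSearch also scores removals and reversals). It assumes the data frame defined above:

import networkx as nx
from pgmpy.estimators import BicScore

def naive_hill_climb(data):
    score = BicScore(data)
    dag = nx.DiGraph()
    dag.add_nodes_from(data.columns)
    while True:
        best_delta, best_edge = 0, None
        for u in data.columns:
            for v in data.columns:
                # skip self-loops, existing edges, and additions that would close a cycle
                if u == v or dag.has_edge(u, v) or nx.has_path(dag, v, u):
                    continue
                old_parents = list(dag.predecessors(v))
                # decomposability: only v's local score changes when u -> v is added
                delta = (score.local_score(v, old_parents + [u])
                         - score.local_score(v, old_parents))
                if delta > best_delta:
                    best_delta, best_edge = delta, (u, v)
        if best_edge is None:
            return dag  # local maximum reached
        dag.add_edge(*best_edge)

print(naive_hill_climb(data).edges())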
The following example combines structure learning with parameter learning. Continuous values are not handled here; readers who want to experiment can preprocess the data accordingly. Only the workflow is shown, and other concerns are set aside. A complete practical example with a Bayesian network will follow in a later post; stay tuned…
from pgmpy.models import BayesianModel
from pgmpy.estimators import BayesianEstimator, HillClimbSearch, BicScore
import pandas as pd
import numpy as np
import networkx as nx
from matplotlib import pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
data = pd.DataFrame(np.random.randint(0, 3, size=(2500, 8)), columns=list('ABCDEFGH'))
data['A'] += data['B'] + data['C']
data['H'] = data['G'] - data['A']
# Learn the DAG structure from the data
hc = HillClimbSearch(data, scoring_method=BicScore(data))
best_model = hc.estimate()
edges = best_model.edges()
print(edges)
model = BayesianModel(edges)
# With the DAG structure now known, learn the CPD parameters for it
model.fit(data, estimator=BayesianEstimator)
[('A', 'C'), ('A', 'B'), ('G', 'A'), ('H', 'A'), ('H', 'G'), ('C', 'B')]
nx.draw(model,
        with_labels=True,
        node_size=1000,
        font_weight='bold',
        node_color='y',
        pos={"A": [5, 6], "C": [4, 5], "H": [6, 5], "B": [4.5, 4], "G": [5.5, 4]})
plt.text(1, 6, model.get_cpds("A"), fontsize=10, color='b')
plt.text(4.5, 4, model.get_cpds("B"), fontsize=10, color='b')
plt.text(4, 5, model.get_cpds("C"), fontsize=10, color='b')
plt.text(5.5, 4, model.get_cpds("G"), fontsize=10, color='b')
plt.text(6, 5, model.get_cpds("H"), fontsize=10, color='b')
plt.show()
print(model.cpds)
[<TabularCPD representing P(A:7 | G:3, H:9) at 0x1b9be308710>, <TabularCPD representing P(C:3 | A:7) at 0x1b9be3087f0>, <TabularCPD representing P(G:3 | H:9) at 0x1b9be308630>, <TabularCPD representing P(H:9) at 0x1b9be308588>, <TabularCPD representing P(B:3 | A:7, C:3) at 0x1b9be308f28>]
Constraint-based structure learning
A different, but quite straightforward, approach to building a DAG from data is to:
- identify the independencies in the data set using hypothesis tests
- construct a DAG (pattern) according to the identified independencies
(Conditional) independence tests
Independencies in the data can be identified using chi-squared conditional independence tests. To this end, the constraint-based estimators in pgmpy provide a test_conditional_independence(X, Y, Zs) method that runs a hypothesis test on the data sample. It lets us check whether X is independent of Y given a set of variables Zs:
from pgmpy.estimators import ConstraintBasedEstimator
data = pd.DataFrame(np.random.randint(0, 3, size=(2500, 8)), columns=list('ABCDEFGH'))
data['A'] += data['B'] + data['C']
data['H'] = data['G'] - data['A']
data['E'] *= data['F']
est = ConstraintBasedEstimator(data)
print(est.test_conditional_independence('B', 'H')) # dependent
print(est.test_conditional_independence('B', 'E')) # independent
print(est.test_conditional_independence('B', 'H', ['A'])) # independent
print(est.test_conditional_independence('A', 'G')) # independent
print(est.test_conditional_independence('A', 'G', ['H'])) # dependent
(666.7771836117021, 6.639207610049184e-124, True)
(7.060271006520653, 0.7941818308528195, True)
(10.223630812021746, 0.9999999733553018, True)
(8.600668236644461, 0.9870879541949906, True)
(4612.0, 0.0, True)
test_conditional_independence() returns a tuple (chi2, p_value, sufficient_data), consisting of the computed chi-squared test statistic, the p_value of the test, and a heuristic flag indicating whether the sample size was sufficient. The p_value is the probability of observing the computed chi-squared statistic (or an even higher value) under the hypothesis that X and Y are independent given Zs.
This can be used to make independence judgments at a given significance level:
def is_independent(X, Y, Zs=[], significance_level=0.05):
    return est.test_conditional_independence(X, Y, Zs)[1] >= significance_level
print(is_independent('B', 'H'))
print(is_independent('B', 'E'))
print(is_independent('B', 'H', ['A']))
print(is_independent('A', 'G'))
print(is_independent('A', 'G', ['H']))
False
True
True
True
False
DAG construction:
With these independence tests at hand, we can construct a DAG from a data set in three steps:
- construct an undirected skeleton - estimate_skeleton()
- orient the compelled edges to obtain a partially directed acyclic graph (PDAG; an equivalence class of DAGs) - skeleton_to_pdag()
- extend the DAG pattern to a DAG by conservatively orienting the remaining edges in some way - pdag_to_dag()
skel, separating_sets = est.estimate_skeleton(significance_level=0.01)
print("Undirected edges: ", skel.edges())
pdag = est.skeleton_to_pdag(skel, separating_sets)
print("PDAG edges: ", pdag.edges())
model = est.pdag_to_dag(pdag)
print("DAG edges: ", model.edges())
Undirected edges: [('A', 'H'), ('A', 'C'), ('A', 'B'), ('G', 'H'), ('E', 'F')]
PDAG edges: [('A', 'H'), ('G', 'H'), ('E', 'F'), ('C', 'A'), ('F', 'E'), ('B', 'A')]
DAG edges: [('A', 'H'), ('G', 'H'), ('C', 'A'), ('F', 'E'), ('B', 'A')]
The estimate() method provides a shorthand for these three steps and directly returns a BayesianModel:
est.estimate(significance_level=0.01).edges()
[('A', 'H'), ('G', 'H'), ('C', 'A'), ('F', 'E'), ('B', 'A')]
Alternatively, a BayesianModel can be constructed from a provided set of independencies, using the estimate_from_independencies() method:
from pgmpy.independencies import Independencies
ind = Independencies(['B', 'C'],
                     ['A', ['B', 'C'], 'D'])
ind = ind.closure() # required (!) for faithfulness
model = ConstraintBasedEstimator.estimate_from_independencies("ABCD", ind)
print(model.edges())
[('A', 'D'), ('C', 'D'), ('B', 'D')]
PDAG construction is only guaranteed to work under the assumption that the identified set of independencies is faithful, i.e. that there exists a DAG corresponding to it exactly. Spurious dependencies in the data set can cause the reported independencies to violate faithfulness. It can then happen that the estimated PDAG does not have any faithful completion (i.e. no edge orientation that avoids introducing new v-structures). In that case a warning is issued.
Hybrid structure learning
The MMHC algorithm combines the constraint-based and score-based approaches. It has two parts:
- learn an undirected graph skeleton using the constraint-based construction procedure MMPC
- orient the edges using score-based optimization (BDeu score + a modified hill climb)
We can perform these two steps separately:
from pgmpy.estimators import MmhcEstimator
data = pd.DataFrame(np.random.randint(0, 3, size=(2500, 8)), columns=list('ABCDEFGH'))
data['A'] += data['B'] + data['C']
data['H'] = data['G'] - data['A']
data['E'] *= data['F']
mmhc = MmhcEstimator(data)
skeleton = mmhc.mmpc()
print("Part 1) Skeleton: ", skeleton.edges())
# use hill climb search to orient the edges:
hc = HillClimbSearch(data, scoring_method=BdeuScore(data))
model = hc.estimate(tabu_length=10, white_list=skeleton.to_directed().edges())
print("Part 2) Model: ", model.edges())
---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
<ipython-input-12-a28c7de0d45e> in <module>()
----> 1 from pgmpy.estimators import MmhcEstimator
2
3 data = pd.DataFrame(np.random.randint(0, 3, size=(2500, 8)), columns=list('ABCDEFGH'))
4 data['A'] += data['B'] + data['C']
5 data['H'] = data['G'] - data['A']
ImportError: cannot import name 'MmhcEstimator'
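The ImportError indicates that the pgmpy release installed here predates MmhcEstimator, so the hybrid example fails at the import; upgrading pgmpy should make the estimator available. A quick way to check which release is installed before upgrading (assuming the package exposes __version__, as recent pgmpy releases do):

import pgmpy
print(pgmpy.__version__)  # upgrade pgmpy if MmhcEstimator is missing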