关联规则挖掘——FPTree
程序员文章站
2022-03-25 23:34:21
...
FPTree
- 前面介绍过用Aprior算法来挖掘关联规则,但是Aprior算法的最大缺陷就是生成了候选项集,通过求组合数可以知道候选项集的数量是非常庞大的。
- 从候选项集到频繁项集,需要不断地扫描整个数据库来计算支持度是否满足min_sup的要求。
- 因此为了消除这种“候选产生过程”,频繁增长模式(Frequent-Pattern Growth,FP-Growth)产生了。
算法思想
- 采用分治策略
- 首先,将代表频繁项集的数据库压缩到一颗频繁模式树(FP-Tree),该树仍然保留项集的关联信息。
- 然后,把这种压缩后的数据库划分成一组条件数据库,每个数据库关联一个频繁项或者是“模式段”,并分别挖掘每个条件数据库。
算法实现过程
- 数据库的第一次扫描和Aprior算法相同,他导出频繁1项集的集合,并得到他们的支持度计数。频繁1项集按照支持度递减进行排序,结果记为L。
- 首先创造根节点,用“null”进行标记。
- 第二次扫描数据库D,每个事物中的项都按照L中的次序处理(按照递减支持度计数排序),并且对每个事物创建一个分枝。
案例分析
事物ID | 项 |
---|---|
001 | r,z,h,j,p |
002 | z,y,x,w,v,u,t,s |
003 | z |
004 | r,x,n,o,s |
005 | y,r,x,z,q,t,p |
006 | y,z,x,e,q,s,t,m |
代码实现
实现过程
- 首先遍历一次数据库,对数据库中的所有事物进行记录frozenDataDict{transaction:count},通过transaction构造初始的headerTable({item:count}),根据minSupport来筛选出频繁1项集,并且将headerTable修改为{item:[counter, treeNodeLink]}
- 遍历fronzenDataDict中的所有key值,即每一个事物,从每一个事物当中筛选出属于频繁1项集当中的项,将筛选出来的项按照headerTable当中的counter值降序排列得到orderedItems
- 将orderItems当中的每一项插入到FPTree当中,对于每一个item都包含两种情况:一种是当前的item已经在子节点children当中,此时只需要将children当中对应节点的counter值增加即可;第二种是children当中并不包含item,那么我们需要创建一个新的节点
- 在新的节点创建完成之后,我们需要更新headerTable当中的treeNodeLink,此时又存在两种情况:一种是treeNodeLink值为None,那么这个时候只需要将treeNodeLink指向新创建的节点即可;第二种情况就是treeNodeLink不是None,那么这个时候需要遍历treeNodeLink指向的treeNode,直到treeNode的nodeLink为None,这个时候将这个空的nodeLink指向刚刚创建的节点即可
- 检查完所有的事物之后FPTree创建完成
FPTree中节点的声明
class TreeNode(object):
def __init__(self, nameValue: str, numOccur: int, parentNode):
# 项的名字
self.name = nameValue
# 项在FPTree当中出现的次数
self.count = numOccur
# 相同项的下一个节点
self.nodeLink = None
# 父节点
self.parentNode = parentNode
# 子节点
# for example, the children like 'milk': TreeNode('milk')
self.children = {}
def inc(self, count):
self.count += count
def show(self, ind=1):
print(' ' * ind, self.name, ' ', self.count)
for child in self.children.values():
child.show(ind + 1)
FPTree类以及相关的函数声明
def makeHeaderTable(headerTable: dict) -> dict:
for item in headerTable:
headerTable[item] = [headerTable[item], None]
return headerTable
def updateHeaderTable(toastNode: TreeNode, targetNode: TreeNode):
while toastNode.nodeLink is not None:
toastNode = toastNode.nodeLink
toastNode.nodeLink = targetNode
class FPTree:
def __init__(self, frozenDataDict: dict, headerTable: dict, minSupport: int):
self.treeNode = TreeNode('null', 1, None)
# 'milk': [counter, nodeLink]
self.headerTable = makeHeaderTable(headerTable)
self.frozenDataDict = frozenDataDict
self.minSupport = minSupport
def updateTree(self, treeNode, items: list, count: int):
item = items[0]
if item in treeNode.children:
treeNode.children[item].inc(count)
else:
treeNode.children[item] = TreeNode(item, count, treeNode)
if self.headerTable[item][1] is None:
self.headerTable[item][1] = treeNode.children[item]
else:
updateHeaderTable(self.headerTable[item][1], treeNode)
if len(items) > 1:
self.updateTree(treeNode.children[item], items[1::], count)
def createFPTree(self):
freqItems = set(self.headerTable.keys())
if len(freqItems) == 0:
self.headerTable = None
return
for transaction, count in self.frozenDataDict.items():
learnSet = {}
for item in transaction:
if item in freqItems:
learnSet[item] = self.headerTable[item][0]
if len(learnSet) > 0:
orderedItems = [item[0] for item in sorted(learnSet.items(), key=lambda k: (k[1], k[0]), reverse=True)]
self.updateTree(self.treeNode, orderedItems, count)
完整代码
def loadSimpleData():
simpleData = [['beer', 'milk', 'chicken'], ['milk', 'bread'], ['milk', 'diaper'],
['beer', 'milk', 'bread'], ['beer', 'diaper'], ['milk', 'diaper'],
['beer', 'diaper'], ['beer', 'milk', 'diaper', 'chicken'], ['beer', 'milk', 'diaper']]
return simpleData
def createInitSet(dataSet: list) -> dict:
returnSet = {}
for item in dataSet:
frozenItem = frozenset(item)
returnSet[frozenItem] = returnSet.get(frozenItem, 0) + 1
return returnSet
class TreeNode(object):
def __init__(self, nameValue: str, numOccur: int, parentNode):
# 项的名字
self.name = nameValue
# 项在FPTree当中出现的次数
self.count = numOccur
# 相同项的下一个节点
self.nodeLink = None
# 父节点
self.parentNode = parentNode
# 子节点
# for example, the children like 'milk': TreeNode('milk')
self.children = {}
def inc(self, count):
self.count += count
def show(self, ind=1):
print(' ' * ind, self.name, ' ', self.count)
for child in self.children.values():
child.show(ind + 1)
def getHeaderTable(dataSet, minSupport=1) -> dict:
headerTable = {}
for key, value in dataSet.items():
for item in key:
headerTable[item] = headerTable.get(item, 0) + 1
lessThanMinSupportList = list(filter(lambda k: headerTable[k] < minSupport, headerTable))
for x in lessThanMinSupportList:
del headerTable[x]
return headerTable
def makeHeaderTable(headerTable: dict) -> dict:
for item in headerTable:
headerTable[item] = [headerTable[item], None]
return headerTable
def updateHeaderTable(toastNode: TreeNode, targetNode: TreeNode):
while toastNode.nodeLink is not None:
toastNode = toastNode.nodeLink
toastNode.nodeLink = targetNode
class FPTree:
def __init__(self, frozenDataDict: dict, headerTable: dict, minSupport: int):
self.treeNode = TreeNode('null', 1, None)
# 'milk': [counter, nodeLink]
self.headerTable = makeHeaderTable(headerTable)
self.frozenDataDict = frozenDataDict
self.minSupport = minSupport
def updateTree(self, treeNode, items: list, count: int):
item = items[0]
if item in treeNode.children:
treeNode.children[item].inc(count)
else:
treeNode.children[item] = TreeNode(item, count, treeNode)
if self.headerTable[item][1] is None:
self.headerTable[item][1] = treeNode.children[item]
else:
updateHeaderTable(self.headerTable[item][1], treeNode)
if len(items) > 1:
self.updateTree(treeNode.children[item], items[1::], count)
def createFPTree(self):
freqItems = set(self.headerTable.keys())
if len(freqItems) == 0:
self.headerTable = None
return
for transaction, count in self.frozenDataDict.items():
learnSet = {}
for item in transaction:
if item in freqItems:
learnSet[item] = self.headerTable[item][0]
if len(learnSet) > 0:
orderedItems = [item[0] for item in sorted(learnSet.items(), key=lambda k: (k[1], k[0]), reverse=True)]
self.updateTree(self.treeNode, orderedItems, count)
def main():
data = loadSimpleData()
dataDict = createInitSet(data)
headerTable = getHeaderTable(dataDict, 3)
fpTree = FPTree(dataDict, headerTable, 3)
fpTree.createFPTree()
fpTree.treeNode.show(3)
if __name__ == '__main__':
main()
上一篇: 关于MySQL实现指定编码遇到的坑