欢迎您访问程序员文章站本站旨在为大家提供分享程序员计算机编程知识!
您现在的位置是: 首页

关联规则挖掘——FPTree

程序员文章站 2022-03-25 23:34:21
...

FPTree

  • 前面介绍过用Aprior算法来挖掘关联规则,但是Aprior算法的最大缺陷就是生成了候选项集,通过求组合数可以知道候选项集的数量是非常庞大的。
  • 从候选项集到频繁项集,需要不断地扫描整个数据库来计算支持度是否满足min_sup的要求。
  • 因此为了消除这种“候选产生过程”,频繁增长模式(Frequent-Pattern Growth,FP-Growth)产生了。

算法思想

  • 采用分治策略
  • 首先,将代表频繁项集的数据库压缩到一颗频繁模式树(FP-Tree),该树仍然保留项集的关联信息。
  • 然后,把这种压缩后的数据库划分成一组条件数据库,每个数据库关联一个频繁项或者是“模式段”,并分别挖掘每个条件数据库。

算法实现过程

  • 数据库的第一次扫描和Aprior算法相同,他导出频繁1项集的集合,并得到他们的支持度计数。频繁1项集按照支持度递减进行排序,结果记为L。
  • 首先创造根节点,用“null”进行标记。
  • 第二次扫描数据库D,每个事物中的项都按照L中的次序处理(按照递减支持度计数排序),并且对每个事物创建一个分枝。

案例分析

事物ID
001 r,z,h,j,p
002 z,y,x,w,v,u,t,s
003 z
004 r,x,n,o,s
005 y,r,x,z,q,t,p
006 y,z,x,e,q,s,t,m

代码实现

实现过程

  • 首先遍历一次数据库,对数据库中的所有事物进行记录frozenDataDict{transaction:count},通过transaction构造初始的headerTable({item:count}),根据minSupport来筛选出频繁1项集,并且将headerTable修改为{item:[counter, treeNodeLink]}
  • 遍历fronzenDataDict中的所有key值,即每一个事物,从每一个事物当中筛选出属于频繁1项集当中的项,将筛选出来的项按照headerTable当中的counter值降序排列得到orderedItems
  • 将orderItems当中的每一项插入到FPTree当中,对于每一个item都包含两种情况:一种是当前的item已经在子节点children当中,此时只需要将children当中对应节点的counter值增加即可;第二种是children当中并不包含item,那么我们需要创建一个新的节点
  • 在新的节点创建完成之后,我们需要更新headerTable当中的treeNodeLink,此时又存在两种情况:一种是treeNodeLink值为None,那么这个时候只需要将treeNodeLink指向新创建的节点即可;第二种情况就是treeNodeLink不是None,那么这个时候需要遍历treeNodeLink指向的treeNode,直到treeNode的nodeLink为None,这个时候将这个空的nodeLink指向刚刚创建的节点即可
  • 检查完所有的事物之后FPTree创建完成

FPTree中节点的声明

class TreeNode(object):
    def __init__(self, nameValue: str, numOccur: int, parentNode):
        # 项的名字
        self.name = nameValue
        # 项在FPTree当中出现的次数
        self.count = numOccur
        # 相同项的下一个节点
        self.nodeLink = None
        # 父节点
        self.parentNode = parentNode
        # 子节点
        # for example, the children like 'milk': TreeNode('milk')
        self.children = {}

    def inc(self, count):
        self.count += count

    def show(self, ind=1):
        print('   ' * ind, self.name, ' ', self.count)
        for child in self.children.values():
            child.show(ind + 1)

FPTree类以及相关的函数声明

def makeHeaderTable(headerTable: dict) -> dict:
    for item in headerTable:
        headerTable[item] = [headerTable[item], None]

    return headerTable


def updateHeaderTable(toastNode: TreeNode, targetNode: TreeNode):
    while toastNode.nodeLink is not None:
        toastNode = toastNode.nodeLink
    toastNode.nodeLink = targetNode


class FPTree:
    def __init__(self, frozenDataDict: dict, headerTable: dict, minSupport: int):
        self.treeNode = TreeNode('null', 1, None)
        # 'milk': [counter, nodeLink]
        self.headerTable = makeHeaderTable(headerTable)
        self.frozenDataDict = frozenDataDict
        self.minSupport = minSupport

    def updateTree(self, treeNode, items: list, count: int):
        item = items[0]
        if item in treeNode.children:
            treeNode.children[item].inc(count)
        else:
            treeNode.children[item] = TreeNode(item, count, treeNode)
            if self.headerTable[item][1] is None:
                self.headerTable[item][1] = treeNode.children[item]
            else:
                updateHeaderTable(self.headerTable[item][1], treeNode)
        if len(items) > 1:
            self.updateTree(treeNode.children[item], items[1::], count)

    def createFPTree(self):
        freqItems = set(self.headerTable.keys())

        if len(freqItems) == 0:
            self.headerTable = None
            return

        for transaction, count in self.frozenDataDict.items():
            learnSet = {}
            for item in transaction:
                if item in freqItems:
                    learnSet[item] = self.headerTable[item][0]

            if len(learnSet) > 0:
                orderedItems = [item[0] for item in sorted(learnSet.items(), key=lambda k: (k[1], k[0]), reverse=True)]
                self.updateTree(self.treeNode, orderedItems, count)

完整代码

def loadSimpleData():
    simpleData = [['beer', 'milk', 'chicken'], ['milk', 'bread'], ['milk', 'diaper'],
                  ['beer', 'milk', 'bread'], ['beer', 'diaper'], ['milk', 'diaper'],
                  ['beer', 'diaper'], ['beer', 'milk', 'diaper', 'chicken'], ['beer', 'milk', 'diaper']]
    return simpleData


def createInitSet(dataSet: list) -> dict:
    returnSet = {}

    for item in dataSet:
        frozenItem = frozenset(item)
        returnSet[frozenItem] = returnSet.get(frozenItem, 0) + 1

    return returnSet


class TreeNode(object):
    def __init__(self, nameValue: str, numOccur: int, parentNode):
        # 项的名字
        self.name = nameValue
        # 项在FPTree当中出现的次数
        self.count = numOccur
        # 相同项的下一个节点
        self.nodeLink = None
        # 父节点
        self.parentNode = parentNode
        # 子节点
        # for example, the children like 'milk': TreeNode('milk')
        self.children = {}

    def inc(self, count):
        self.count += count

    def show(self, ind=1):
        print('   ' * ind, self.name, ' ', self.count)
        for child in self.children.values():
            child.show(ind + 1)


def getHeaderTable(dataSet, minSupport=1) -> dict:
    headerTable = {}

    for key, value in dataSet.items():
        for item in key:
            headerTable[item] = headerTable.get(item, 0) + 1

    lessThanMinSupportList = list(filter(lambda k: headerTable[k] < minSupport, headerTable))
    for x in lessThanMinSupportList:
        del headerTable[x]

    return headerTable


def makeHeaderTable(headerTable: dict) -> dict:
    for item in headerTable:
        headerTable[item] = [headerTable[item], None]

    return headerTable


def updateHeaderTable(toastNode: TreeNode, targetNode: TreeNode):
    while toastNode.nodeLink is not None:
        toastNode = toastNode.nodeLink
    toastNode.nodeLink = targetNode


class FPTree:
    def __init__(self, frozenDataDict: dict, headerTable: dict, minSupport: int):
        self.treeNode = TreeNode('null', 1, None)
        # 'milk': [counter, nodeLink]
        self.headerTable = makeHeaderTable(headerTable)
        self.frozenDataDict = frozenDataDict
        self.minSupport = minSupport

    def updateTree(self, treeNode, items: list, count: int):
        item = items[0]
        if item in treeNode.children:
            treeNode.children[item].inc(count)
        else:
            treeNode.children[item] = TreeNode(item, count, treeNode)
            if self.headerTable[item][1] is None:
                self.headerTable[item][1] = treeNode.children[item]
            else:
                updateHeaderTable(self.headerTable[item][1], treeNode)
        if len(items) > 1:
            self.updateTree(treeNode.children[item], items[1::], count)

    def createFPTree(self):
        freqItems = set(self.headerTable.keys())

        if len(freqItems) == 0:
            self.headerTable = None
            return

        for transaction, count in self.frozenDataDict.items():
            learnSet = {}
            for item in transaction:
                if item in freqItems:
                    learnSet[item] = self.headerTable[item][0]

            if len(learnSet) > 0:
                orderedItems = [item[0] for item in sorted(learnSet.items(), key=lambda k: (k[1], k[0]), reverse=True)]
                self.updateTree(self.treeNode, orderedItems, count)


def main():
    data = loadSimpleData()
    dataDict = createInitSet(data)
    headerTable = getHeaderTable(dataDict, 3)
    fpTree = FPTree(dataDict, headerTable, 3)
    fpTree.createFPTree()
    fpTree.treeNode.show(3)


if __name__ == '__main__':
    main()