Data Mining 9 & 10: Association Analysis

程序员文章站 2022-04-23 08:45:23


Theory

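1. Two key measures need to be computed for a rule A => B (A is the antecedent, B is the consequent):
① support: the fraction of all transactions that contain both A and B
② confidence: the fraction of the transactions containing A that also contain B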
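
The two measures can be computed by hand on a small example. The following is a minimal base R sketch; the toy transactions and the helper names `support` and `confidence` are made up for illustration and are not from the original notes.

```r
# Toy transactions (hypothetical data, for illustration only)
transactions <- list(
  c("bread", "milk"),
  c("bread", "butter"),
  c("bread", "milk", "butter"),
  c("milk"),
  c("bread", "milk")
)

# support: fraction of transactions that contain every item in `items`
support <- function(items, trans) {
  mean(vapply(trans, function(tr) all(items %in% tr), logical(1)))
}

# confidence(lhs => rhs) = support(lhs and rhs together) / support(lhs)
confidence <- function(lhs, rhs, trans) {
  support(c(lhs, rhs), trans) / support(lhs, trans)
}

support(c("bread", "milk"), transactions)   # 3 of 5 transactions: 0.6
confidence("bread", "milk", transactions)   # 3 of the 4 bread transactions: 0.75
```

Confidence is not symmetric: swapping the antecedent and the consequent changes the denominator.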

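2. Two key steps:
① Find the frequent itemsets: all itemsets whose support S meets the minimum support min_S
② Generate the rules: from the itemsets found in ①, keep every rule whose confidence C meets the minimum confidence min_c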

3. For the detailed procedure and the optimization techniques, see the course PPT slides.

4. Two main algorithms:
① Apriori (breadth-first)
② FP-Growth (depth-first)
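
To make the breadth-first idea concrete, here is a rough base R sketch of a level-wise search in the spirit of Apriori. It is an illustration under simplifying assumptions (toy data, no hash tree or other optimizations), not the implementation used by any library; the function name `apriori_sketch` is hypothetical.

```r
# Toy transactions (hypothetical data, for illustration only)
transactions <- list(
  c("bread", "milk"),
  c("bread", "butter"),
  c("bread", "milk", "butter"),
  c("milk"),
  c("bread", "milk")
)

apriori_sketch <- function(trans, min_support) {
  support <- function(items) {
    mean(vapply(trans, function(tr) all(items %in% tr), logical(1)))
  }
  # Level 1: frequent single items
  items <- sort(unique(unlist(trans)))
  level <- Filter(function(s) support(s) >= min_support, as.list(items))
  frequent <- list()
  k <- 1
  while (length(level) > 0) {
    frequent <- c(frequent, level)
    k <- k + 1
    keys <- vapply(level, paste, character(1), collapse = ",")
    # Candidate generation: unions of two frequent (k-1)-itemsets of size k
    candidates <- list()
    for (i in seq_along(level)) {
      for (j in seq_along(level)) {
        if (i < j) {
          cand <- sort(union(level[[i]], level[[j]]))
          if (length(cand) == k) {
            candidates[[paste(cand, collapse = ",")]] <- cand
          }
        }
      }
    }
    # Prune: every (k-1)-subset of a candidate must itself be frequent
    candidates <- Filter(function(cand) {
      subsets <- combn(cand, k - 1, simplify = FALSE)
      all(vapply(subsets,
                 function(s) paste(s, collapse = ",") %in% keys,
                 logical(1)))
    }, candidates)
    # Count support only for the surviving candidates
    level <- unname(Filter(function(cand) support(cand) >= min_support,
                           candidates))
  }
  frequent
}

result <- apriori_sketch(transactions, min_support = 0.5)
vapply(result, paste, character(1), collapse = ",")
```

The pruning step is where the search saves work: a candidate with any infrequent subset is dropped before its support is ever counted.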

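5. Points to note:
① The support of a superset can never exceed the support of its subsets (e.g. S{A,B,C} <= S{A,B}).
② With A as the antecedent and B as the consequent, lift(A,B) is read as follows:
=1: A and B are independent
>1: A has a positive effect on B
<1: A has a negative effect on B
(Keep Simpson's paradox in mind: an association seen in the pooled data can weaken or reverse within subgroups.)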
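
Lift compares the observed joint support with what independence would predict: lift(A,B) = support(A and B) / (support(A) * support(B)). A minimal base R sketch on toy data follows; the data and the helper names are hypothetical.

```r
# Toy transactions (hypothetical data, for illustration only)
transactions <- list(
  c("bread", "milk"),
  c("bread", "butter"),
  c("bread", "milk", "butter"),
  c("milk"),
  c("bread", "milk")
)

# Fraction of transactions that contain every item in `items`
support <- function(items, trans) {
  mean(vapply(trans, function(tr) all(items %in% tr), logical(1)))
}

# lift = joint support divided by the product of the individual supports
lift <- function(lhs, rhs, trans) {
  support(c(lhs, rhs), trans) / (support(lhs, trans) * support(rhs, trans))
}

lift("bread", "milk", transactions)    # 0.6 / (0.8 * 0.8) = 0.9375, below 1
lift("bread", "butter", transactions)  # 0.4 / (0.8 * 0.4) = 1.25, above 1
```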

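③ Three kinds of itemsets
Maximal frequent itemsets (the smallest of the three collections; the supports of their subsets cannot be recovered from them)
Closed frequent itemsets (a superset of the maximal frequent itemsets; the supports of all frequent subsets can be recovered from them, see the PPT for the derivation)
Frequent itemsets (contain both of the above; see the PPT for the diagram)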
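
As a rough illustration of how the three collections nest, the sketch below applies the arules functions `is.closed()` and `is.maximal()` to the Groceries data. The 0.01 support threshold is an arbitrary choice for this example, not a value from the original notes.

```r
library(arules)
data("Groceries")

# All frequent itemsets at an assumed minimum support of 0.01
freq <- apriori(Groceries,
                parameter = list(supp = 0.01, target = "frequent itemsets"),
                control = list(verbose = FALSE))

closed  <- freq[is.closed(freq)]    # no superset has the same support
maximal <- freq[is.maximal(freq)]   # no superset is frequent at all

# maximal <= closed <= frequent in size
c(frequent = length(freq), closed = length(closed), maximal = length(maximal))
```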

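④ Handling long-tail item distributions: use Multiple Minimum Support (a different minimum support for different items).

⑤ Extension: association analysis of events that occur within a specific time period or region (for example, the policies issued by different countries during the pandemic).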

Code

1. Before running
① Apriori (breadth-first)
② FP-Growth (depth-first)
the dataset must first be converted into the transactions format.
Example conversion code:

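library(arules)  # provides the "transactions" class, inspect() and image()

## example: creating transactions from a matrix
## note: matrix() fills column by column, so each group of five values is one item column
a_matrix <- matrix(
  c(1,1,1,0,0,
    1,1,0,0,0,
    1,1,0,1,0,
    0,0,1,0,1,
    1,1,0,1,1), ncol = 5)

## set dim names
dimnames(a_matrix) <-  list(paste("Tr",c(1:5), sep = ""), c("a","b","c","d","e"))  # row names, then column names
a_matrix

## coerce
trans2 <-  as(a_matrix, "transactions")  # convert to the "transactions" class
trans2
inspect(trans2)
image(trans2)  # plot the transaction-by-item matrix to see what was bought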
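
Transactions are often available as a list of item vectors rather than a 0/1 matrix. A minimal sketch of that conversion follows; the list contents and the name `trans1` are made up for illustration.

```r
library(arules)

# One character vector of items per transaction (hypothetical data)
a_list <- list(
  c("a", "b", "c"),
  c("a", "b"),
  c("a", "b", "d"),
  c("c", "e"),
  c("a", "b", "d", "e")
)
names(a_list) <- paste0("Tr", 1:5)  # transaction IDs

# Coerce the list to the "transactions" class
trans1 <- as(a_list, "transactions")
summary(trans1)
inspect(trans1)
```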

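2. Mining frequent itemsets with eclat (Eclat is its own depth-first algorithm based on intersecting transaction ID lists; it does not choose between Apriori and FP-Growth automatically)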

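library(arules)
data("Groceries")  # grocery transactions dataset shipped with arules

# find the frequent itemsets
## parameter = list(supp = 0.07, maxlen = 15): minimum support 0.07, at most 15 items per itemset
frequentItems <- eclat (Groceries, parameter = list(supp = 0.07, maxlen = 15)) # calculates support for frequent items
inspect(frequentItems) ## list all frequent itemsets

## plot the 10 most frequent items, sorted (y-axis: absolute support count)
itemFrequencyPlot(Groceries, topN=10, type="absolute", main="Item Frequency") # plot frequent items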

The inspect function lists the frequent itemsets as follows:

> inspect(frequentItems) ## list all frequent itemsets
     items                         support    transIdenticalToItemsets count
[1]  {other vegetables,whole milk} 0.07483477  736                      736 
[2]  {whole milk}                  0.25551601 2513                     2513 
[3]  {other vegetables}            0.19349263 1903                     1903 
[4]  {rolls/buns}                  0.18393493 1809                     1809 
[5]  {yogurt}                      0.13950178 1372                     1372 
[6]  {soda}                        0.17437722 1715                     1715 
[7]  {root vegetables}             0.10899847 1072                     1072 
[8]  {tropical fruit}              0.10493137 1032                     1032 
[9]  {bottled water}               0.11052364 1087                     1087 
[10] {sausage}                     0.09395018  924                      924 
[11] {shopping bags}               0.09852567  969                      969 
[12] {citrus fruit}                0.08276563  814                      814 
[13] {pastry}                      0.08896797  875                      875 
[14] {pip fruit}                   0.07564820  744                      744 
[15] {whipped/sour cream}          0.07168277  705                      705 
[16] {fruit/vegetable juice}       0.07229283  711                      711 
[17] {newspapers}                  0.07981698  785                      785 
[18] {bottled beer}                0.08052872  792                      792 
[19] {canned beer}                 0.07768175  764                      764

The item frequency plot, sorted by support:
[Figure: item frequency plot of the 10 most frequent items]

3. Apriori algorithm

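library(arules)
library(arulesViz)  # provides inspectDT()

# minimum support 0.001, minimum confidence 0.5 (the values passed in the call below)
## prerequisite: the dataset must be of class "transactions"
## lift > 1: the antecedent raises the chance of the consequent; lift < 1: it lowers it; lift = 1: the two are independent
rules <- apriori (Groceries, parameter = list(supp = 0.001, conf = 0.5,minlen=2,maxlen=15))
inspectDT(rules)  # interactive table of all rules
# 'high-confidence' rules.
rules_conf <- sort (rules, by="confidence", decreasing=TRUE)
# show the support, lift and confidence for the top rules
inspect(head(rules_conf))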

The inspect output looks like this:

inspect(head(rules_conf)) 
    lhs                                           rhs                support     confidence coverage   
[1] {rice,sugar}                               => {whole milk}       0.001220132 1          0.001220132
[2] {canned fish,hygiene articles}             => {whole milk}       0.001118454 1          0.001118454
[3] {root vegetables,butter,rice}              => {whole milk}       0.001016777 1          0.001016777
[4] {root vegetables,whipped/sour cream,flour} => {whole milk}       0.001728521 1          0.001728521
[5] {butter,soft cheese,domestic eggs}         => {whole milk}       0.001016777 1          0.001016777
[6] {citrus fruit,root vegetables,soft cheese} => {other vegetables} 0.001016777 1          0.001016777
    lift     count
[1] 3.913649 12   
[2] 3.913649 11   
[3] 3.913649 10   
[4] 3.913649 17   
[5] 3.913649 10   
[6] 5.168156 10
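
The printed lift values can be cross-checked by hand, since lift equals confidence divided by the support of the consequent. The sketch below only reuses numbers printed in the two outputs shown in this post; the variable names are made up for illustration.

```r
# Supports of the two consequents, copied from the eclat output
supp_whole_milk       <- 0.25551601
supp_other_vegetables <- 0.19349263

# All six top rules have confidence 1
conf <- 1

# lift = confidence / support(consequent)
lift_whole_milk       <- conf / supp_whole_milk        # close to the printed 3.913649
lift_other_vegetables <- conf / supp_other_vegetables  # close to the printed 5.168156

c(lift_whole_milk, lift_other_vegetables)
```

Small differences in the last digit are expected because the printed supports are rounded.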

Visualizing the results:
①

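library(arulesViz)
library(RColorBrewer)  # provides brewer.pal()

# a. Scatter Plot
## x-axis: support, y-axis: confidence, color: lift
plot(rules, control=list(jitter=2, col = rev(brewer.pal(9, "Blues")[4:9])),shading = "lift")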

Result:
[Figure: scatter plot of the rules]

②

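# b. Grouped Matrix
## columns: grouped antecedents, rows: consequents; with the arulesViz defaults, circle size shows support and color shows lift
plot(rules, method="grouped",control=list(col = rev(brewer.pal(9, "Greens")[4:9])))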

[Figure: grouped matrix plot of the rules]

③

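# c. Graph
## arrows run from antecedent items to consequent items; with the arulesViz defaults, circle size shows support and color shows lift
plot(rules, method="graph", control=list(type="items"))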

[Figure: graph plot of the rules]