R {arules} によるアソシエーション分析をちょっと詳しく <1>

今週は系列パターンマイニング用 R パッケージ {arulesSequences} と格闘していた。使い方にところどころよくわからないポイントがあり、思ったよりも時間がかかってしまった。

関連パッケージである {arules} ともども、ネットには簡単な分析についての情報はあるが、データの作り方/操作についてはまとまったものがないようだ。とりあえず自分が調べたことをまとめておきたい。2 パッケージで結構なボリュームになるため、全 4 記事分くらいの予定。

概要

まずはパターンマイニングの手法を簡単に整理する。いずれもトランザクションと呼ばれるデータの系列を対象にする。トランザクションとは 1レコード中に複数の要素 (アイテム) を含むもの。例えば、

POSデータ: 1トランザクション = POSレジの売上 1回。アイテムはそのときに売れた個々の商品。
アンケート調査: 1トランザクション = アンケートの1回答。アイテムは個々の設問への解答内容。

複数のトランザクションからなるデータから、よく起きるパターンを発見・列挙する手法が頻出パターンマイニング。{arules} と {arulesSequences} でできることは、

	アソシエーション分析 (Association analysis)	系列パターンマイニング (Sequential pattern mining)
抽出されるパターンのイメージ	XのときYが発生しやすい	Xの後にYが発生しやすい
パッケージ	`{arules}`	`{arulesSequences}`
備考	トランザクション中のアイテムの関係をみる。トランザクションの順序は関係ない	連続するトランザクションに含まれるアイテムの関係をみる。トランザクションの順序が重要

補足以降、処理したい一群のトランザクションを "データ", データ中の1レコードを "トランザクション" と書く。

補足 IDA@SMU: Intelligent Data Analysis Lab - Southern Methodist University をみると、{arules} ファミリーには他にも {arulesClassify}, {arulesNBMiner} という関連パッケージがあるようだ。こちらは使ったことないのでそのうち。

アソシエーション分析とは

非常に簡単にいうと、データ全体を集計してありそうなパターン (相関ルール) を抽出するもの。トランザクション中にアイテム ${ X }$ が含まれるときにアイテム ${ Y }$ も含まれる、なんてのが相関ルール。アソシエーション分析の用語では ${ X }$ は条件 (left hand side:lhs)、 ${ Y }$ は結論 (right hand side:rhs) と呼ぶ。相関ルールは ${ X \Rightarrow Y }$ と書く。

アソシエーション分析では、以下のような尺度でルールを評価する。

支持度 (support)

${ X, Y }$ 両方のアイテム (の和集合) を含むトランザクションがデータ中に占める比率になる。支持度が高いルールはデータ全体でみてよく起きやすい。

データ全体のトランザクション数を ${ M }$ , アイテム ${ X }$ を含むトランザクション数を ${ \sigma(X) }$ とあらわすとき、支持度は、

${ supp(X \Rightarrow Y) = \sigma(X \bigcup Y) / M = supp(Y \Rightarrow X) }$

確信度 (confidence)

トランザクション中にアイテム ${ X }$ が含まれるとき、アイテム ${ Y }$ も同時に含まれる比率。確信度が高いルールでは、アイテム ${ X }$ を含むトランザクションにはアイテム ${ Y }$ も含まれやすい。

${ conf(X \Rightarrow Y) = \sigma(X \bigcup Y) / \sigma(X) = supp(X \Rightarrow Y) / supp(X) }$

リフト (lift)

日本語難しいが、"ルールによって起こる ${ Y }$ の頻度 / 全体での ${ Y }$ の頻度"。一般に、リフトが 1 より大きい場合にルールが有効 = ルールによってアイテム ${ Y }$ が含まれやすくなる、とみなす。

${ lift(X \Rightarrow Y) = conf(X \Rightarrow Y) / supp(Y) = supp(X \Rightarrow Y) / supp(X) \times supp(Y) }$

で、何を重視すれば？

というのは目的によってことなる。

広く浅くいきたいなら支持度が大きく確信度/リフトがそこそこのルールを選ぶ。
狭く深くいきたいなら支持度は小さめでも確信度/リフトの高いルールを選ぶ。

という感じ。また、各数値はトランザクションの総数 / アイテムの偏りによっても変わるため、他のルールと比べてよしあしを判断する。

補足これらを、トランザクションにあるアイテムが含まれる確率 ${ P }$ の推定量としてとらえるなら、支持度 = ${ P(X \bigcup Y) }$ , 確信度 = ${ P(Y | X) }$ , リフト = ${ P(X \bigcup Y) / P(X) \times P(Y) }$ 。

パッケージのインストール

アソシエーション分析を行う {arules} パッケージをインストール / ロードする。

install.packages('arules')
library(arules)

サンプルデータのロード

{arules} ではデータを作るところが一番わかりにくい (そして動きを理解していないとデータ作れない) なので、今回はとりあえず天下りで。arules に入っている Income データを使う。

このデータは、世帯収入と世帯の属性をアイテムとした一群のトランザクションからなる。このデータをアソシエーション分析することにより、" ${ X }$ ならば世帯収入が ${ Y }$ (高い/低い) であることが多い" といったルールが出てくる。

data(Income)
tran <- Income

読み込んだデータは arules::transactions インスタンスになっている。

class(tran)
# [1] "transactions"
# attr(,"package")
# [1] "arules"

`transactions` の確認

データの中身を表示してみる。

tran
# transactions in sparse format with
#  6876 transactions (rows) and
#  50 items (columns)

、、、これ、中身どう見るのかな？？というのが最初のつまずきポイントだと思う。

arules::transactions 中の各トランザクションはスライシングで取得できる。そこからトランザクションの詳細 (アイテム) をみる場合は arules::LIST を使うのがよい。

表示されるアイテム、例えば "income=$40,000+" は、 data.frame でいうと "income" カラムの値が "$40,000+" であるという意味。ほかのアイテムも世帯の属性を示すものであることがわかる。

# スライシングした結果も transactions インスタンス
tran[1]
# transactions in sparse format with
#  1 transactions (rows) and
# 50 items (columns)

LIST(tran[1])
# $`2`
#  [1] "income=$40,000+"             "sex=male"                   
#  [3] "marital status=married"      "age=35+"                    
#  [5] "education=college graduate"  "occupation=homemaker"       
#  [7] "years in bay area=10+"       "dual incomes=no"            
#  [9] "number in household=2+"      "number of children=1+"      
# [11] "householder status=own"      "type of home=house"         
# [13] "ethnic classification=white" "language in home=english"

補足 as(tran, 'data.frame') で data.frameへ、もしくは as(tran, 'matrix') で matrix へ変換しても中身を確認できるが、アイテム数が多い場合は表示がわかりにくい。このとき、as.data.frame(tran), as.matrix(tran) では coerce エラーになるので注意。

また、複数トランザクションを選択してスライシング & アイテム表示もできる。

tran[5:10]
# transactions in sparse format with
#  6 transactions (rows) and
#  50 items (columns)

LIST(tran[5:10])
# $`6`
#  [1] "income=$40,000+"               "sex=male"                     
#  [3] "marital status=married"        "age=35+"                      
#  [5] "education=no college graduate" "occupation=retired"           
#  [7] "years in bay area=10+"         "dual incomes=no"              
#  [9] "number in household=1"         "number of children=0"         
# [11] "householder status=own"        "type of home=house"           
# [13] "ethnic classification=white"   "language in home=english"
# .....

`transactions` の要約表示

要約表示は summary。頻出するアイテムの頻度、分布などが表示される。

summary(tran)
# transactions as itemMatrix in sparse format with
#  6876 rows (elements/itemsets/transactions) and
#  50 columns (items) and a density of 0.28 
# 
# most frequent items:
#      language in home=english education=no college graduate 
#                          6277                          4849 
#         number in household=1   ethnic classification=white 
#                          4757                          4605 
#         years in bay area=10+                       (Other) 
#                          4446                         71330 
# 
# element (itemset/transaction) length distribution:
# sizes
#   14 
# 6876 
# 
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#      14      14      14      14      14      14 
# 
# includes extended item information - examples:
#              labels variables     levels
# 1 income=$0-$40,000    income $0-$40,000
# 2   income=$40,000+    income   $40,000+
# 3          sex=male       sex       male
# 
# includes extended transaction information - examples:
#   transactionID
# 1             2
# 2             3
# 3             4

トランザクションデータの詳細

arules::transactions は S4 クラスで、以下のようなプロパティを持つ。

@transactionInfo: 各トランザクションの番号を含む data.frame
@data: 各トランザクションについて、個々のアイテムが含まれるかどうかを示す Matrix::ngCMatrix クラス (真偽値をセルの値とする疎行列)
@itemInfo: データに含まれるアイテムの一覧を含む data.frame
@itemsetInfo: アイテムのセットを含む data.frame

arules にはこれらのプロパティにアクセスするための関数が用意されている。トランザクションの IDを確認するには arules::transactionInfo。一部のみ表示するために head をかます。

head(transactionInfo(tran))
#   transactionID
# 1             2
# 2             3
# 3             4
# 4             5
# 5             6
# 6             7

アイテムの一覧を表示するには arules::itemInfo。

head(itemInfo(tran))
#                        labels      variables       levels
# 1           income=$0-$40,000         income   $0-$40,000
# 2             income=$40,000+         income     $40,000+
# 3                    sex=male            sex         male
# 4                  sex=female            sex       female
# 5      marital status=married marital status      married
# 6 marital status=cohabitation marital status cohabitation

またアイテムのセットをあらわす arules::itemsetInfo は、 Income データには定義されていないので空の data.frameになる。これの使い方は別途。

head(itemsetInfo(tran))
# data frame with 0 columns and 0 rows

アイテムの頻度の確認

transactions に含まれるアイテムの発生頻度を確認するには arules::itemFrequency。既定だと相対頻度になるため、発生回数(絶対頻度)をみたい場合は type = 'absolute' を指定。

head(itemFrequency(tran, type = 'absolute'))
#      income=$0-$40,000             income=$40,000+                    sex=male 
#                   4280                        2596                        3067 
#             sex=female      marital status=married marital status=cohabitation 
#                   3809                        2652                         536

また、arules::itemFrequencyPlot で頻度をプロットできる。アイテム数が多い場合は、 topN キーワードでプロットするアイテム数を指定する。

itemFrequencyPlot(tran, type = 'absolute', topN = 5, cex = 0.8)

f:id:sinhrks:20141212075936p:plain

また、個人的によく使うのは arules::affinity。アイテム ${ X }$ に対する ${ Y }$ の affinity ${ A(X, Y) }$ は、

${ A(X, Y) = supp(X \bigcup Y)/(supp(X) + supp(Y) - supp(X \bigcup Y)) }$

で計算される。各アイテム別々での発生頻度に対する同時発生頻度で、affinity が大きいほどアイテム ${ X }$ と ${ Y }$ は同時に含まれやすい (別々に含まれることは少ない)。

出力大量になるためスライシングして表示。同時発生しえない組み合わせの affinity は 0。

affinity(tran)[1:4, 1:4]
#                   income=$0-$40,000 income=$40,000+  sex=male sex=female
# income=$0-$40,000         0.0000000       0.0000000 0.3399599   0.425877
# income=$40,000+           0.0000000       0.0000000 0.2697309   0.277933
# sex=male                  0.3399599       0.2697309 0.0000000   0.000000
# sex=female                0.4258770       0.2779330 0.0000000   0.000000

ルール抽出

今回は arules::apriori 。

`apriori` アルゴリズムでの頻出パターンマイニング

arules::transactions インスタンスをそのまま arules::apriori に渡せばルールが出てくる。出力は 8664 個のルールが見つかったことを示す。

rule <- apriori(tran)
# parameter specification:
#  confidence minval smax arem  aval originalSupport support minlen maxlen target   ext
#         0.8    0.1    1 none FALSE            TRUE     0.1      1     10  rules FALSE
# 
# algorithmic control:
#  filter tree heap memopt load sort verbose
#     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
# 
# apriori - find association rules with the apriori algorithm
# version 4.21 (2004.05.09)        (c) 1996-2004   Christian Borgelt
# set item appearances ...[0 item(s)] done [0.00s].
# set transactions ...[50 item(s), 6876 transaction(s)] done [0.00s].
# sorting and recoding items ... [30 item(s)] done [0.00s].
# creating transaction tree ... done [0.00s].
# checking subsets of size 1 2 3 4 5 6 7 8 done [0.06s].
# writing ... [8664 rule(s)] done [0.00s].
# creating S4 object  ... done [0.00s].

経過出力を消したい場合は control オプションで verbose = FALSE を指定。

apriori(tran, control = list(verbose = FALSE))
# set of 8664 rules

作成されたルールのセットは、arules::rules インスタンスになっている。

class(rule)
# [1] "rules"
# attr(,"package")
# [1] "arules"

`rules` の確認

スライシングによって個々のルールが取得できるのは arules::transactions と同様。中身を表示する場合は arules::inspect。

例えば 5番目のルールは、"dual incomes=no のとき language in home=english である" というルールを示す。列名はそれぞれ、条件(lhs)、結論(rhs)、支持度 (support)、確信度 (confidence)、リフト (lift) に対応。

inspect(rules[5])
#   lhs                  rhs                          support confidence     lift
# 1 {dual incomes=no} => {language in home=english} 0.1364165  0.9196078 1.007364

補足 arules::inspect はルールを表示するための関数である (ルールをdata.frameに変換する関数ではない)。そのため、ルールの一部を表示したいときには arules::inspect に渡す前にフィルタする必要がある。

# OK
inspect(head(rules))
# 略

# NG! 
head(inspect(rules))
# 略 (全ルール表示される)

補足 arules::LIST はrules に対しては使えない。

ルールの並べ替えと条件抽出

抽出されたルールを、並べ替え / フィルタして確認したいということはよくある。並べ替えには arules::sort。

# 信頼度 で並べ替え
sort(rules, by = 'confidence')
# set of 8664 rules

# 信頼度の上位5件を表示
inspect(head(sort(rules, by = 'confidence'), n = 5))
#   lhs                                              rhs                          support confidence     lift
# 1 {marital status=single}                       => {dual incomes=not married} 0.4091041          1 1.671366
# 2 {marital status=single,                                                                                  
#    occupation=student}                          => {dual incomes=not married} 0.1449971          1 1.671366
# 3 {marital status=single,                                                                                  
#    householder status=live with parents/family} => {dual incomes=not married} 0.1884817          1 1.671366
# 4 {marital status=single,                                                                                  
#    type of home=apartment}                      => {dual incomes=not married} 0.1339442          1 1.671366
# 5 {marital status=single,                                                                                  
#    number in household=2+}                      => {dual incomes=not married} 0.1545957          1 1.671366

ルールの左辺、もしくは右辺に対して条件を付けてフィルタする場合は arules::subset。arules::subsetには条件抽出を行うための条件式が渡せる。条件式中では、lhs, rhs, support, confidence, lift が変数として、また以下の演算子も使える。上記ヘルプ中に各演算子の使用例が記載されている。

%in%: 指定したアイテムを含むルールを抽出
%pin%: 文字列を指定し、部分一致したアイテムを含むルールを抽出
%ain%: 複数のアイテムを vector で指定し、それらを全て含むルールを抽出

例えば、"income=$0-$40,000" になりやすいルールを見つけたい場合は、rhs が "income=$0-$40,000" であるルールを抽出すればよいので、

subset(rules, subset = rhs %in% 'income=$0-$40,000')
# set of 571 rules 

# 詳細表示
inspect(head(subset(rules, subset = rhs %in% 'income=$0-$40,000')))
#   lhs                                              rhs                   support confidence     lift
# 1 {occupation=student}                          => {income=$0-$40,000} 0.1381617  0.8421986 1.353027
# 2 {householder status=live with parents/family} => {income=$0-$40,000} 0.1669575  0.8141844 1.308021
# 3 {type of home=apartment}                      => {income=$0-$40,000} 0.2242583  0.8137203 1.307276
# 4 {marital status=single}                       => {income=$0-$40,000} 0.3302792  0.8073231 1.296999
# 5 {marital status=single,                                                                           
#    occupation=student}                          => {income=$0-$40,000} 0.1230366  0.8485456 1.363224
# 6 {age=14-34,                                                                                       
#    occupation=student}                          => {income=$0-$40,000} 0.1348168  0.8465753 1.360059

# 上記 かつ 信頼度が 0.84 より大きいルールを表示
inspect(head(subset(rules, subset = rhs %in% 'income=$0-$40,000' & confidence > 0.84)))
#   lhs                                rhs                   support confidence     lift
# 1 {occupation=student}            => {income=$0-$40,000} 0.1381617  0.8421986 1.353027
# 2 {marital status=single,                                                             
#    occupation=student}            => {income=$0-$40,000} 0.1230366  0.8485456 1.363224
# 3 {age=14-34,                                                                         
#    occupation=student}            => {income=$0-$40,000} 0.1348168  0.8465753 1.360059
# 4 {occupation=student,                                                                
#    dual incomes=not married}      => {income=$0-$40,000} 0.1301629  0.8475379 1.361605
# 5 {education=no college graduate,                                                     
#    occupation=student}            => {income=$0-$40,000} 0.1282723  0.8440191 1.355952
# 6 {occupation=student,                                                                
#    language in home=english}      => {income=$0-$40,000} 0.1195462  0.8456790 1.358619

`rules` の要約表示

arules::transactions の場合と同じく summary。ルールの長さごとの頻度や、支持度 / 確信度 / リフトの分布が表示される。

summary(rules)
# set of 8664 rules
# 
# rule length distribution (lhs + rhs):sizes
#    1    2    3    4    5    6    7    8 
#    1   56  615 2287 3387 1925  385    8 
# 
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#   1.000   4.000   5.000   4.888   6.000   8.000 
# 
# summary of quality measures:
#     support         confidence          lift       
#  Min.   :0.1001   Min.   :0.8000   Min.   :0.8971  
#  1st Qu.:0.1101   1st Qu.:0.8436   1st Qu.:1.0813  
#  Median :0.1241   Median :0.9027   Median :1.3297  
#  Mean   :0.1393   Mean   :0.9021   Mean   :1.4099  
#  3rd Qu.:0.1510   3rd Qu.:0.9574   3rd Qu.:1.5309  
#  Max.   :0.9129   Max.   :1.0000   Max.   :4.3554  
# 
# mining info:
#  data ntransactions support confidence
#  tran          6876     0.1        0.8
# head(quality(rules))

`apriori` 実行時の条件指定

arules::apriori の実行時点でフィルタ条件がわかっている場合は、その条件をキーワードとして渡すことによって抽出されるルールを絞り込める。各キーワードに対して引数のリストを渡す形なのでちょっとわかりにくい。詳細は arules::apriori。ならびに、各引数クラスのヘルプを参照。

parameter: 引数のリストまたは APparameter インスタンス
appearance: 引数のリストまたは APappearance インスタンス
control: 引数のリストまたは APcontrol インスタンス

# 経過出力を抑制
control <- list(verbose = FALSE)

# 支持度が 0.6 以上のルールのみ抽出
r <- apriori(tran, parameter = list(supp = 0.6, target = "rules"), control = control)
inspect(r)
#   lhs                                rhs                          support confidence      lift
# 1 {}                              => {language in home=english} 0.9128854  0.9128854 1.0000000
# 2 {years in bay area=10+}         => {language in home=english} 0.6013671  0.9300495 1.0188020
# 3 {ethnic classification=white}   => {language in home=english} 0.6595404  0.9847991 1.0787763
# 4 {number in household=1}         => {language in home=english} 0.6495055  0.9388270 1.0284171
# 5 {education=no college graduate} => {language in home=english} 0.6343805  0.8995669 0.9854106

# 条件 (左辺) が "occupation=student" のルールのみ抽出
r <- apriori(tran, appearance = list(items = c("occupation=student"), default = 'rhs'), control = control)
inspect(r)
#   lhs                     rhs                               support confidence      lift
# 1 {}                   => {language in home=english}      0.9128854  0.9128854 1.0000000
# 2 {occupation=student} => {marital status=single}         0.1449971  0.8838652 2.1604897
# 3 {occupation=student} => {age=14-34}                     0.1592496  0.9707447 1.6583454
# 4 {occupation=student} => {dual incomes=not married}      0.1535777  0.9361702 1.5646831
# 5 {occupation=student} => {income=$0-$40,000}             0.1381617  0.8421986 1.3530274
# 6 {occupation=student} => {education=no college graduate} 0.1519779  0.9264184 1.3136839
# 7 {occupation=student} => {language in home=english}      0.1413613  0.8617021 0.9439324

APparameter, APappearance, APcontrolはそれぞれ空のリスト、もしくは NULL からインスタンス化すれば、どんなインスタンスが作られているのかわかる。

as(list(), 'APparameter')
#  confidence minval smax arem  aval originalSupport support minlen maxlen target   ext
#         0.8    0.1    1 none FALSE            TRUE     0.1      1     10  rules FALSE

データの作り方

最後、単純な transactions データの作り方。よく使うのは以下の 3パターンだと思う。

`vector` のリストから作る

最も単純な方法。リストの各要素 (vector) が 1 トランザクション、 vector 内の各要素がアイテムになる。

l <- list(c('x', 'z'), c('x', 'y'), c('x', 'y'))

as(l, 'transactions')
# transactions in sparse format with
#  3 transactions (rows) and
#  3 items (columns)

LIST(as(l, 'transactions'))
# [[1]]
# [1] "x" "z"
# 
# [[2]]
# [1] "x" "y"
# 
# [[3]]
# [1] "x" "y"

値が真偽値の `data.frame` から作る

data.frame のカラム名について、各カラムが含まれる / 含まれない場合それぞれをアイテムとしてルールを作成したい場合はこの形式。

df <- data.frame(x = c(TRUE, FALSE, TRUE),
                 y = c(TRUE, TRUE, FALSE),
                 z = c(TRUE, TRUE, FALSE))

as(df, 'transactions')
# transactions in sparse format with
#  3 transactions (rows) and
#  6 items (columns)

LIST(as(df, 'transactions'))
# $`1`
# [1] "x=TRUE" "y=TRUE" "z=TRUE"
# 
# $`2`
# [1] "x=FALSE" "y=TRUE"  "z=TRUE" 
# 
# $`3`
# [1] "x=TRUE"  "y=FALSE" "z=FALSE"

値が `factor` の `data.frame` から作る

data.frame のカラム名が複数の値をとりうるカテゴリで、各値をアイテムとして扱いたい場合はこの形式。transactions インスタンスにする対象は全て factor である必要がある。

df <- data.frame(x = factor(c(1, 2, 1)),
                 y = factor(c('a', 'a', 'b')),
                 z = factor(c('A', 'A', 'B')))

as(df, 'transactions')
# transactions in sparse format with
#  3 transactions (rows) and
#  6 items (columns)

LIST(as(df, 'transactions'))
# $`1`
# [1] "x=1" "y=a" "z=A"
# 
# $`2`
# [1] "x=2" "y=a" "z=A"
# 
# $`3`
# [1] "x=1" "y=b" "z=B"