Chapter 29 Association rule mining in R
This lab activity is adapted from the following tutorials/guides:
- Association mining (Market Basket Analysis)
- Introduction to Association Rule Mining in R
In this lab activity, we will
- use the arules package to perform association rule mining
- load our transaction data from file (with the appropriate format for the arules package)
29.1 Dependencies
We’ll use the following packages in this lab activity (you will need to install any that you do not already have installed):
library(tidyverse) # For data wrangling
library(arules) # Association rule mining algorithms/functions
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
## Attaching package: 'arules'
## The following object is masked from 'package:khroma':
##
## info
## The following object is masked from 'package:dplyr':
##
## recode
## The following objects are masked from 'package:base':
##
## abbreviate, write
library(arulesViz) # Contains visualizations for association rule mining
29.2 Data preparation and inspection
In this lab, we’ll load our transaction data from file: transactions.dat, which you can download on Blackboard or access online here.
Before continuing on with loading these data, take a look at transactions.dat.
Notice that each line (after the first) contains a comma-separated list of items.
Each line describes a single transaction.
For example, the line B,E,C describes a transaction containing the items B, C, and E.
We can use the read.transactions function from the arules package to read our data file containing transactions.
# You will need to adjust the file path to run this lab activity locally.
transactions <- read.transactions(
  "lecture-material/week-11/transactions.dat",
  sep = ",", # Items within a transaction are separated by commas
  skip = 1   # Skip the first line of the file
)
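To see the file format in miniature, the sketch below writes a tiny basket-format file to a temporary location and reads it back. The file contents, the `items` header line, and the names `toy_file` and `toy_trans` are made up for illustration; the first line of the real transactions.dat may look different.

```r
library(arules)

# Hypothetical toy file: one header line, then one transaction per line.
toy_file <- tempfile(fileext = ".dat")
writeLines(c("items", "B,E,C", "A", "A,B"), toy_file)

toy_trans <- read.transactions(
  toy_file,
  format = "basket", # One transaction per line
  sep = ",",         # Items are separated by commas
  skip = 1           # Skip the header line
)

length(toy_trans)     # Number of transactions read
itemLabels(toy_trans) # Distinct items found in the file
```

Passing `format = "basket"` explicitly is optional here, since it is the default, but it makes the expected layout of the file obvious to a reader.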
In your R console, run ?read.transactions
to see more information about loading transaction data from file.
The arules package provides some useful functions for inspecting our transaction data.
To view the different items represented across all transactions in your data, you can use the itemLabels function:
itemLabels(transactions)
## [1] "A" "B" "C" "D" "E"
To view the size of each transaction in our dataset, we can use the size function:
# "arules::" before calling the size function tells R that we want to use the
# size function provided by the arules package.
arules::size(transactions)
## [1] 4 1 1 1 5 3 5 4 5 5 2 5 2 2 2 4 3 1 5 3 5 4 1 1 3 1 5 3 1 1 3 3 5 5 4 2 1
## [38] 2 2 3 5 5 4 1 1 5 2 3 5 4 5 2 1 2 3 5 5 2 3 4 1 3 5 3 5 5 1 2 1 5 4 2 5 5
## [75] 2 4 1 1 4 4 2 2 4 2 4 3 5 2 3 2 3 1 1 1 5 2 5 2 3 5
If you wanted to get a list object containing all transactions, you could use the LIST
function:
transaction_list <- LIST(transactions)
# For brevity, we'll just show the first few entries in the list we created.
head(transaction_list)
## [[1]]
## [1] "A" "B" "C" "D"
##
## [[2]]
## [1] "B"
##
## [[3]]
## [1] "A"
##
## [[4]]
## [1] "B"
##
## [[5]]
## [1] "A" "B" "C" "D" "E"
##
## [[6]]
## [1] "B" "C" "D"
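Once transactions are in a plain list, you can compute the support of an itemset (the fraction of transactions that contain every item in the set) by hand. The following is a minimal base R sketch, not how arules computes support internally. The helper name `support_of` is hypothetical, and because it only uses the six transactions printed above, its values will differ from those for the full dataset.

```r
# The six transactions shown above, typed in by hand.
first_six <- list(
  c("A", "B", "C", "D"),
  c("B"),
  c("A"),
  c("B"),
  c("A", "B", "C", "D", "E"),
  c("B", "C", "D")
)

# Hypothetical helper: fraction of transactions that contain every item
# in `itemset`.
support_of <- function(itemset, transactions) {
  contains_all <- vapply(
    transactions,
    function(items) all(itemset %in% items),
    logical(1)
  )
  mean(contains_all)
}

support_of("B", first_six)         # B appears in 5 of the 6 transactions
support_of(c("C", "D"), first_six) # C and D appear together in 3 of the 6
```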
The good old summary
function works with transaction data, too:
summary(transactions)
## transactions as itemMatrix in sparse format with
## 100 rows (elements/itemsets/transactions) and
## 5 columns (items) and a density of 0.61
##
## most frequent items:
## C A B D E (Other)
## 67 66 63 59 50 0
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5
## 21 21 17 14 27
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 2.00 3.00 3.05 5.00 5.00
##
## includes extended item information - examples:
## labels
## 1 A
## 2 B
## 3 C
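The density and mean transaction size reported by summary can be reproduced from the counts in the same output. The sketch below copies those counts in by hand (the variable names are made up for illustration) and checks that the numbers agree with each other.

```r
# Counts copied from the summary() output above.
size_counts <- c("1" = 21, "2" = 21, "3" = 17, "4" = 14, "5" = 27)
item_counts <- c(C = 67, A = 66, B = 63, D = 59, E = 50)

n_transactions <- sum(size_counts) # 100 transactions
n_items <- length(item_counts)     # 5 distinct items

# Total number of (transaction, item) entries in the incidence matrix.
total_entries <- sum(as.integer(names(size_counts)) * size_counts)

# The item counts must add up to the same total.
sum(item_counts) == total_entries

mean_size <- total_entries / n_transactions           # Mean transaction size
density <- total_entries / (n_transactions * n_items) # Fraction of filled cells
c(mean_size = mean_size, density = density)
```

In other words, the density is the share of cells in the 100 by 5 binary incidence matrix that are filled.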
The arules
package also provides the image
function to visually inspect binary incidence matrices describing your data.
In most cases, you’ll have too many transactions for this to be a particularly useful visualization.
# look at all transactions
image(transactions)
The arules package also provides a quick function for visualizing the frequencies of individual items in your transaction data:
itemFrequencyPlot(transactions, topN=10, cex.names=1)
In these data, we can see that C is the most frequent item (67 of 100 transactions) and E is the least frequent (50 of 100).
29.3 Generating association rules
The data provided in transactions.dat were generated randomly (for demonstrative purposes).
That is, the goal was to show a small example of transaction data formatted in a way that works with the functions in the arules package.
Because these data were generated randomly, the association rules are not going to be particularly meaningful.
We can use the apriori
function to generate association rules.
Run ?apriori in your R console for more information about using and parameterizing the apriori function.
rules <- apriori(
  transactions,
  parameter = list(
    supp = 0.3, # Sets our minimum support threshold
    conf = 0.5, # Sets our minimum confidence threshold
    minlen = 2, # Rules must have at least two items. Eliminates null rules.
    target = "rules" # We'd like rules as our output.
  )
)
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.3 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 30
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[5 item(s), 100 transaction(s)] done [0.00s].
## sorting and recoding items ... [5 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [58 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
summary(rules)
## set of 58 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4
## 20 30 8
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.000 3.000 2.793 3.000 4.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.3000 Min. :0.5758 Min. :0.320 Min. :0.9759
## 1st Qu.:0.3200 1st Qu.:0.7234 1st Qu.:0.400 1st Qu.:1.1890
## Median :0.3500 Median :0.7983 Median :0.465 Median :1.2819
## Mean :0.3671 Mean :0.7835 Mean :0.479 Mean :1.2927
## 3rd Qu.:0.4000 3rd Qu.:0.8563 3rd Qu.:0.590 3rd Qu.:1.3719
## Max. :0.4800 Max. :0.9375 Max. :0.670 Max. :1.7647
## count
## Min. :30.00
## 1st Qu.:32.00
## Median :35.00
## Mean :36.71
## 3rd Qu.:40.00
## Max. :48.00
##
## mining info:
## data ntransactions support confidence
## transactions 100 0.3 0.5
## call
## apriori(data = transactions, parameter = list(supp = 0.3, conf = 0.5, minlen = 2, target = "rules"))
We can use the inspect
function to display the association rules that we found:
inspect(rules)
## lhs rhs support confidence coverage lift count
## [1] {E} => {D} 0.37 0.7400000 0.50 1.2542373 37
## [2] {D} => {E} 0.37 0.6271186 0.59 1.2542373 37
## [3] {E} => {B} 0.40 0.8000000 0.50 1.2698413 40
## [4] {B} => {E} 0.40 0.6349206 0.63 1.2698413 40
## [5] {E} => {A} 0.40 0.8000000 0.50 1.2121212 40
## [6] {A} => {E} 0.40 0.6060606 0.66 1.2121212 40
## [7] {E} => {C} 0.42 0.8400000 0.50 1.2537313 42
## [8] {C} => {E} 0.42 0.6268657 0.67 1.2537313 42
## [9] {D} => {B} 0.41 0.6949153 0.59 1.1030401 41
## [10] {B} => {D} 0.41 0.6507937 0.63 1.1030401 41
## [11] {D} => {A} 0.38 0.6440678 0.59 0.9758603 38
## [12] {A} => {D} 0.38 0.5757576 0.66 0.9758603 38
## [13] {D} => {C} 0.47 0.7966102 0.59 1.1889704 47
## [14] {C} => {D} 0.47 0.7014925 0.67 1.1889704 47
## [15] {B} => {A} 0.47 0.7460317 0.63 1.1303511 47
## [16] {A} => {B} 0.47 0.7121212 0.66 1.1303511 47
## [17] {B} => {C} 0.46 0.7301587 0.63 1.0897891 46
## [18] {C} => {B} 0.46 0.6865672 0.67 1.0897891 46
## [19] {A} => {C} 0.48 0.7272727 0.66 1.0854817 48
## [20] {C} => {A} 0.48 0.7164179 0.67 1.0854817 48
## [21] {D, E} => {B} 0.32 0.8648649 0.37 1.3728014 32
## [22] {B, E} => {D} 0.32 0.8000000 0.40 1.3559322 32
## [23] {B, D} => {E} 0.32 0.7804878 0.41 1.5609756 32
## [24] {D, E} => {A} 0.32 0.8648649 0.37 1.3104013 32
## [25] {A, E} => {D} 0.32 0.8000000 0.40 1.3559322 32
## [26] {A, D} => {E} 0.32 0.8421053 0.38 1.6842105 32
## [27] {D, E} => {C} 0.34 0.9189189 0.37 1.3715208 34
## [28] {C, E} => {D} 0.34 0.8095238 0.42 1.3720743 34
## [29] {C, D} => {E} 0.34 0.7234043 0.47 1.4468085 34
## [30] {B, E} => {A} 0.35 0.8750000 0.40 1.3257576 35
## [31] {A, E} => {B} 0.35 0.8750000 0.40 1.3888889 35
## [32] {A, B} => {E} 0.35 0.7446809 0.47 1.4893617 35
## [33] {B, E} => {C} 0.35 0.8750000 0.40 1.3059701 35
## [34] {C, E} => {B} 0.35 0.8333333 0.42 1.3227513 35
## [35] {B, C} => {E} 0.35 0.7608696 0.46 1.5217391 35
## [36] {A, E} => {C} 0.36 0.9000000 0.40 1.3432836 36
## [37] {C, E} => {A} 0.36 0.8571429 0.42 1.2987013 36
## [38] {A, C} => {E} 0.36 0.7500000 0.48 1.5000000 36
## [39] {B, D} => {A} 0.32 0.7804878 0.41 1.1825573 32
## [40] {A, D} => {B} 0.32 0.8421053 0.38 1.3366750 32
## [41] {A, B} => {D} 0.32 0.6808511 0.47 1.1539849 32
## [42] {B, D} => {C} 0.35 0.8536585 0.41 1.2741172 35
## [43] {C, D} => {B} 0.35 0.7446809 0.47 1.1820331 35
## [44] {B, C} => {D} 0.35 0.7608696 0.46 1.2896094 35
## [45] {A, D} => {C} 0.34 0.8947368 0.38 1.3354281 34
## [46] {C, D} => {A} 0.34 0.7234043 0.47 1.0960671 34
## [47] {A, C} => {D} 0.34 0.7083333 0.48 1.2005650 34
## [48] {A, B} => {C} 0.38 0.8085106 0.47 1.2067323 38
## [49] {B, C} => {A} 0.38 0.8260870 0.46 1.2516469 38
## [50] {A, C} => {B} 0.38 0.7916667 0.48 1.2566138 38
## [51] {A, D, E} => {C} 0.30 0.9375000 0.32 1.3992537 30
## [52] {C, D, E} => {A} 0.30 0.8823529 0.34 1.3368984 30
## [53] {A, C, E} => {D} 0.30 0.8333333 0.36 1.4124294 30
## [54] {A, C, D} => {E} 0.30 0.8823529 0.34 1.7647059 30
## [55] {A, B, E} => {C} 0.32 0.9142857 0.35 1.3646055 32
## [56] {B, C, E} => {A} 0.32 0.9142857 0.35 1.3852814 32
## [57] {A, C, E} => {B} 0.32 0.8888889 0.36 1.4109347 32
## [58] {A, B, C} => {E} 0.32 0.8421053 0.38 1.6842105 32
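The quality measures in this table are all derived from itemset supports. As a minimal sketch (the helper name `rule_measures` is hypothetical, and the supports are copied by hand from the output in this chapter), the values for rules [1] and [11] can be reproduced as follows:

```r
# Hypothetical helper: derive rule quality measures from three supports.
rule_measures <- function(supp_lhs, supp_rhs, supp_both) {
  confidence <- supp_both / supp_lhs
  c(
    support = supp_both,        # Fraction of transactions with lhs and rhs
    confidence = confidence,    # How often rhs appears when lhs does
    coverage = supp_lhs,        # Fraction of transactions containing lhs
    lift = confidence / supp_rhs # Confidence relative to the support of rhs
  )
}

# Rule [1] {E} => {D}: supp(E) = 0.50, supp(D) = 0.59, supp(D, E) = 0.37
rule_e_d <- rule_measures(supp_lhs = 0.50, supp_rhs = 0.59, supp_both = 0.37)
rule_e_d

# Rule [11] {D} => {A}: supp(D) = 0.59, supp(A) = 0.66, supp(A, D) = 0.38
rule_d_a <- rule_measures(supp_lhs = 0.59, supp_rhs = 0.66, supp_both = 0.38)
rule_d_a
```

A lift below 1, as for rule [11], means the two sides appear together slightly less often than they would if they were independent, which is unsurprising for randomly generated data.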
29.3.1 Generating frequent itemsets
What if you just want to generate the set of frequent itemsets using the apriori algorithm?
You can use the apriori
function to do that if you change the target parameter:
freq_itemsets <- apriori(
  transactions,
  parameter = list(
    supp = 0.3, # Sets our minimum support threshold
    minlen = 1, # Sets must have at least one item. Eliminates null sets.
    target = "frequent itemsets"
  )
)
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## NA 0.1 1 none FALSE TRUE 5 0.3 1
## maxlen target ext
## 10 frequent itemsets TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 30
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[5 item(s), 100 transaction(s)] done [0.00s].
## sorting and recoding items ... [5 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## sorting transactions ... done [0.00s].
## writing ... [27 set(s)] done [0.00s].
## creating S4 object ... done [0.00s].
summary(freq_itemsets)
## set of 27 itemsets
##
## most frequent items:
## A C E B D (Other)
## 13 13 13 12 12 0
##
## element (itemset/transaction) length distribution:sizes
## 1 2 3 4
## 5 10 10 2
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 2.000 2.333 3.000 4.000
##
## summary of quality measures:
## support count
## Min. :0.3000 Min. :30.00
## 1st Qu.:0.3450 1st Qu.:34.50
## Median :0.3800 Median :38.00
## Mean :0.4207 Mean :42.07
## 3rd Qu.:0.4700 3rd Qu.:47.00
## Max. :0.6700 Max. :67.00
##
## includes transaction ID lists: FALSE
##
## mining info:
## data ntransactions support confidence
## transactions 100 0.3 1
## call
## apriori(data = transactions, parameter = list(supp = 0.3, minlen = 1, target = "frequent itemsets"))
Again, we can use the inspect
function from the arules
package to see all of the frequent itemsets:
inspect(freq_itemsets)
## items support count
## [1] {E} 0.50 50
## [2] {D} 0.59 59
## [3] {B} 0.63 63
## [4] {A} 0.66 66
## [5] {C} 0.67 67
## [6] {D, E} 0.37 37
## [7] {B, E} 0.40 40
## [8] {A, E} 0.40 40
## [9] {C, E} 0.42 42
## [10] {B, D} 0.41 41
## [11] {A, D} 0.38 38
## [12] {C, D} 0.47 47
## [13] {A, B} 0.47 47
## [14] {B, C} 0.46 46
## [15] {A, C} 0.48 48
## [16] {B, D, E} 0.32 32
## [17] {A, D, E} 0.32 32
## [18] {C, D, E} 0.34 34
## [19] {A, B, E} 0.35 35
## [20] {B, C, E} 0.35 35
## [21] {A, C, E} 0.36 36
## [22] {A, B, D} 0.32 32
## [23] {B, C, D} 0.35 35
## [24] {A, C, D} 0.34 34
## [25] {A, B, C} 0.38 38
## [26] {A, C, D, E} 0.30 30
## [27] {A, B, C, E} 0.32 32
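Frequent itemsets and rules are closely linked: the confidence of a rule is the support of the full itemset divided by the support of its left-hand side. The sketch below illustrates this for the itemset {A, C, D, E}, using supports copied by hand from the output above. It is a worked example only, not a description of how arules generates rules internally.

```r
# Supports copied from the inspect(freq_itemsets) output above.
supp <- c(
  "A,C,D,E" = 0.30,
  "A,D,E" = 0.32,
  "C,D,E" = 0.34,
  "A,C,E" = 0.36,
  "A,C,D" = 0.34
)

# Each three-item subset is a possible left-hand side; the item that was
# left out becomes the right-hand side.
lhs <- c("A,D,E", "C,D,E", "A,C,E", "A,C,D")
rhs <- c("C", "A", "D", "E")

confidence <- unname(supp["A,C,D,E"] / supp[lhs])
data.frame(lhs = lhs, rhs = rhs, confidence = confidence)
```

These confidences agree with rules [51] through [54] in the rule listing from the previous section.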
29.4 Exercises
- Adjust the confidence and support thresholds used for rule generation. What happens if you decrease these thresholds? What happens if you increase them?
- Apply association rule mining to the Groceries dataset.
The Groceries dataset comes with the arules package. You can load it as follows:
data("Groceries")
grocery_trans <- Groceries
- Describe a type of data that does not describe customer transactions that you could imagine applying association rule mining to. For example, how might you apply association rule mining to generate association rules for characters in a set of words? What other types of data might be useful to apply association rule mining to?