Lab_MBA_1_RM
Asmi Ariv
2022-10-11
Market Basket Analysis in R
In this lab, we will walk through market basket analysis. Kindly go through the video lectures before proceeding with this lab.
We will require a few packages to run the analysis: arules and arulesSequences.
library(arules)
library(arulesSequences)
Data
Let's begin with the dataset Adult that comes with the package arules. This dataset has been prepared for association mining from the original data AdultUCI, as explained on the arules help page. You can get more details on the dataset by typing help(package="arules") in R and then clicking "Adult" on the help page.
The Adult database was extracted from the census bureau database found at https://www.census.gov/ in 1994 by Ronny Kohavi and Barry Becker (Data Mining and Visualization, Silicon Graphics). It was originally used to predict whether income exceeds USD 50K/yr based on census data. We added the attribute income with levels small and large (>50K).
data("Adult")
dim(Adult)
## [1] 48842   115
Adult
## transactions in sparse format with
## 48842 transactions (rows) and
## 115 items (columns)
There are 48,842 transactions and 115 items (columns).
To view the data, we will use the inspect() function from arules.
inspect(Adult[1:10]) # Since the number of records is very large, we will just look at the first 10
## items transactionID
## [1] {age=Middle-aged,
## workclass=State-gov,
## education=Bachelors,
## marital-status=Never-married,
## occupation=Adm-clerical,
## relationship=Not-in-family,
## race=White,
## sex=Male,
## capital-gain=Low,
## capital-loss=None,
## hours-per-week=Full-time,
## native-country=United-States,
## income=small} 1
## [2] {age=Senior,
## workclass=Self-emp-not-inc,
## education=Bachelors,
## marital-status=Married-civ-spouse,
## occupation=Exec-managerial,
## relationship=Husband,
## race=White,
## sex=Male,
## capital-gain=None,
## capital-loss=None,
## hours-per-week=Part-time,
## native-country=United-States,
## income=small} 2
## [3] {age=Middle-aged,
## workclass=Private,
## education=HS-grad,
## marital-status=Divorced,
## occupation=Handlers-cleaners,
## relationship=Not-in-family,
## race=White,
## sex=Male,
## capital-gain=None,
## capital-loss=None,
## hours-per-week=Full-time,
## native-country=United-States,
## income=small} 3
## [4] {age=Senior,
## workclass=Private,
## education=11th,
## marital-status=Married-civ-spouse,
## occupation=Handlers-cleaners,
## relationship=Husband,
## race=Black,
## sex=Male,
## capital-gain=None,
## capital-loss=None,
## hours-per-week=Full-time,
## native-country=United-States,
## income=small} 4
## [5] {age=Middle-aged,
## workclass=Private,
## education=Bachelors,
## marital-status=Married-civ-spouse,
## occupation=Prof-specialty,
## relationship=Wife,
## race=Black,
## sex=Female,
## capital-gain=None,
## capital-loss=None,
## hours-per-week=Full-time,
## native-country=Cuba,
## income=small} 5
## [6] {age=Middle-aged,
## workclass=Private,
## education=Masters,
## marital-status=Married-civ-spouse,
## occupation=Exec-managerial,
## relationship=Wife,
## race=White,
## sex=Female,
## capital-gain=None,
## capital-loss=None,
## hours-per-week=Full-time,
## native-country=United-States,
## income=small} 6
## [7] {age=Senior,
## workclass=Private,
## education=9th,
## marital-status=Married-spouse-absent,
## occupation=Other-service,
## relationship=Not-in-family,
## race=Black,
## sex=Female,
## capital-gain=None,
## capital-loss=None,
## hours-per-week=Part-time,
## native-country=Jamaica,
## income=small} 7
## [8] {age=Senior,
## workclass=Self-emp-not-inc,
## education=HS-grad,
## marital-status=Married-civ-spouse,
## occupation=Exec-managerial,
## relationship=Husband,
## race=White,
## sex=Male,
## capital-gain=None,
## capital-loss=None,
## hours-per-week=Over-time,
## native-country=United-States,
## income=large} 8
## [9] {age=Middle-aged,
## workclass=Private,
## education=Masters,
## marital-status=Never-married,
## occupation=Prof-specialty,
## relationship=Not-in-family,
## race=White,
## sex=Female,
## capital-gain=High,
## capital-loss=None,
## hours-per-week=Over-time,
## native-country=United-States,
## income=large} 9
## [10] {age=Middle-aged,
## workclass=Private,
## education=Bachelors,
## marital-status=Married-civ-spouse,
## occupation=Exec-managerial,
## relationship=Husband,
## race=White,
## sex=Male,
## capital-gain=Low,
## capital-loss=None,
## hours-per-week=Full-time,
## native-country=United-States,
## income=large} 10
As you can see from the output above, this is how an association mining dataset looks: each record is a collection of items.
Some of the attributes in this dataset include: age, workclass, education, marital-status, occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week, native-country, income.
Generating rules
Based on the records, let's generate association rules. To generate the rules, we will use the apriori() function from the arules package.
In the function, we need to set the support and confidence values and, optionally, the maximum rule length.
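To make these parameters concrete, here is a minimal base-R sketch, on toy transactions with hypothetical items (not the Adult data), of how support and confidence are computed for a single rule A => B:

```r
# Five toy transactions (hypothetical, not from the Adult data)
trans <- list(c("A", "B"), c("A", "B"), c("A"), c("B"), c("A", "B", "C"))
n <- length(trans)

has <- function(item) sapply(trans, function(t) item %in% t)

# support(A => B): fraction of transactions containing both A and B
supp <- sum(has("A") & has("B")) / n             # 3/5 = 0.6
# confidence(A => B): support(A => B) relative to the support of A alone
conf <- sum(has("A") & has("B")) / sum(has("A")) # 3/4 = 0.75
```

A support threshold of 0.009 therefore keeps any rule whose item combination appears in at least 0.9% of the 48,842 records; the "absolute minimum support count: 439" in the output below is exactly 0.009 x 48842, rounded down.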
rules <- apriori(Adult, parameter = list(supp = 0.009, conf = 0.5))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.009 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 439
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[115 item(s), 48842 transaction(s)] done [0.03s].
## sorting and recoding items ... [68 item(s)] done [0.01s].
## creating transaction tree ... done [0.02s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10
## Warning in apriori(Adult, parameter = list(supp = 0.009, conf = 0.5)): Mining
## stopped (maxlen reached). Only patterns up to a length of 10 returned!
## done [0.55s].
## writing ... [362223 rule(s)] done [0.10s].
## creating S4 object ... done [0.13s].
length(rules)
## [1] 362223
As we can see, the algorithm has generated 362,223 rules, a very large number. Perhaps we have set the support value too low.
Let’s change the support value to 0.1 and see how many rules are generated.
rules <- apriori(Adult, parameter = list(supp = 0.1, conf = 0.5))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.1 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 4884
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[115 item(s), 48842 transaction(s)] done [0.05s].
## sorting and recoding items ... [31 item(s)] done [0.00s].
## creating transaction tree ... done [0.02s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 done [0.07s].
## writing ... [9259 rule(s)] done [0.00s].
## creating S4 object ... done [0.02s].
length(rules)
## [1] 9259
Now we have 9259 rules; even this is a very large number. Let's try changing the confidence value to 0.75.
rules <- apriori(Adult, parameter = list(supp = 0.1, conf = 0.75))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.75 0.1 1 none FALSE TRUE 5 0.1 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 4884
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[115 item(s), 48842 transaction(s)] done [0.03s].
## sorting and recoding items ... [31 item(s)] done [0.01s].
## creating transaction tree ... done [0.02s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 done [0.07s].
## writing ... [6392 rule(s)] done [0.00s].
## creating S4 object ... done [0.02s].
length(rules)
## [1] 6392
Now we have 6392 rules.
As you can see, altering the values of support and confidence brings the number of rules down drastically.
Therefore, it is very important to take care when setting these parameters, and the right values depend a lot on the domain we are dealing with.
Inspecting rules
Now let’s look at some of the rules
inspect(rules[100:102])
## lhs rhs support confidence coverage
## [1] {age=Middle-aged} => {capital-gain=None} 0.4632079 0.9170281 0.5051185
## [2] {age=Middle-aged} => {capital-loss=None} 0.4800786 0.9504276 0.5051185
## [3] {income=small} => {race=White} 0.4237951 0.8373382 0.5061218
## lift count
## [1] 0.9996091 22624
## [2] 0.9970100 23448
## [3] 0.9792939 20699
So, rule number 100: when the antecedent is age=Middle-aged, the consequent capital-gain=None holds with support 0.46 and confidence 0.92, and the lift value is close to 1. In other words, we have 92% confidence that middle-aged people in this group have no capital gain.
As per rule number 101, there is 95% confidence that middle-aged people in this group have no capital loss.
As per rule number 102, there is 84% confidence that people with a small income in this group are white.
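The three measures are linked by lift(lhs => rhs) = confidence / support(rhs), so from the numbers reported above for rule 100 we can back out the marginal support of the consequent (simple arithmetic, not an arules call):

```r
confidence <- 0.9170281  # reported confidence of {age=Middle-aged} => {capital-gain=None}
lift       <- 0.9996091  # reported lift of the same rule
supp_rhs   <- confidence / lift
round(supp_rhs, 4)       # 0.9174: capital-gain=None holds in about 92% of all records
```

A lift this close to 1 means that knowing someone is middle-aged barely changes the probability of capital-gain=None, so the rule is unremarkable despite its high confidence.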
We can also change the appearance of the rules by supplying some arguments to the inspect() function.
inspect(rules[1000:1002], ruleSep = "~~>", itemSep = " + ", setStart = "", setEnd = "",
linebreak = FALSE)
## lhs
## [1] age=Young + native-country=United-States + income=small ~~>
## [2] age=Young + capital-loss=None + income=small ~~>
## [3] age=Young + capital-gain=None + income=small ~~>
## rhs support confidence coverage lift count
## [1] capital-loss=None 0.1138365 0.9774965 0.1164571 1.025406 5560
## [2] native-country=United-States 0.1138365 0.9021580 0.1261824 1.005275 5560
## [3] capital-loss=None 0.1229475 0.9781723 0.1256910 1.026115 6005
Rule 1001 (the second rule above) says: there is 90% confidence that young people with a small income and no capital loss are natives of the United States.
Eliminating redundant rules
Some rules are redundant, and it makes no sense to include them, since we want simple rules that we can use in our work.
For example, two rules may have the same consequent with nearly the same confidence, while the antecedent of one rule is a subset of the antecedent of the other, such as (A, B, C) -> (D) and (A, B) -> (D). Here we can drop the first (longer) rule, since both rules have almost the same confidence.
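The criterion can be illustrated with hypothetical counts in base R: the longer rule adds an item to the antecedent without improving confidence, so it is redundant.

```r
# Hypothetical counts out of 100 transactions
n_AB   <- 40; n_ABD  <- 30  # transactions containing {A,B}; containing {A,B,D}
n_ABC  <- 20; n_ABCD <- 15  # transactions containing {A,B,C}; containing {A,B,C,D}

conf_general  <- n_ABD  / n_AB   # confidence of (A, B) -> (D):    0.75
conf_specific <- n_ABCD / n_ABC  # confidence of (A, B, C) -> (D): 0.75
# Adding C buys no extra confidence, so the longer rule is redundant
conf_specific <= conf_general    # TRUE
```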
We can do this using the is.redundant() function from the arules package.
new_rules = rules[!is.redundant(rules)]
length(new_rules)
## [1] 593
We have brought the number of rules down from 6392 to 593.
We can also generate rules subject to some constraints for our analysis.
Let's say we want only the rules whose consequent is income=large, with minimum support 0.01 and minimum confidence 0.6.
rules <- apriori(data = Adult, parameter = list(supp = 0.01, conf = 0.6), appearance = list(default = "lhs", rhs = "income=large"), control = list(verbose = F))
length(rules)
## [1] 88
So, we found 88 rules that satisfy our conditions.
Let's inspect one of the rules:
inspect(rules[10])
## lhs rhs support confidence coverage lift count
## [1] {capital-gain=High,
## capital-loss=None} => {income=large} 0.02319725 0.6704142 0.03460137 4.176045 1133
Rule 10 says that there is 67% confidence that people with a high capital gain and no capital loss have a large income. The lift value is 4.18, much higher than 1, which is strong evidence that these events are dependent.
Sequential Data Mining
When the order of events is important, i.e., when there is a sequential pattern in the data, we need something called sequence mining. For example, you apply the brake to stop a moving car, but you do not stop the car to apply the brake. In this example, the order of events becomes extremely important.
We need data that preserves this ordering in order to mine frequent sequential patterns.
For this kind of data mining we will use the cspade() function from the arulesSequences package.
We will use the dataset zaki that comes with this package.
The data must contain temporal information such as a sequenceID (a sequence or customer identifier) and an eventID (an event identifier).
Loading data
data(zaki)
class(zaki)
## [1] "transactions"
## attr(,"package")
## [1] "arules"
length(zaki)
## [1] 10
So, the data is stored as transactions and there are only 10 records.
Let’s look at the data:
as(zaki, "data.frame")
## items sequenceID eventID SIZE
## 1 {C,D} 1 10 2
## 2 {A,B,C} 1 15 3
## 3 {A,B,F} 1 20 3
## 4 {A,C,D,F} 1 25 4
## 5 {A,B,F} 2 15 3
## 6 {E} 2 20 1
## 7 {A,B,F} 3 10 3
## 8 {D,G,H} 4 10 3
## 9 {B,F} 4 20 2
## 10 {A,G,H} 4 25 3
As we can see, each record has items, a sequenceID, an eventID and a size.
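To make sequence support concrete before running cspade(), here is a hedged base-R sketch that recomputes the support of a pattern directly from the four customer sequences shown above; the helpers contains_seq() and seq_support() are our own, not part of arulesSequences:

```r
# The four zaki customer sequences, each an ordered list of event itemsets
seqs <- list(
  list(c("C","D"), c("A","B","C"), c("A","B","F"), c("A","C","D","F")), # sequenceID 1
  list(c("A","B","F"), c("E")),                                         # sequenceID 2
  list(c("A","B","F")),                                                 # sequenceID 3
  list(c("D","G","H"), c("B","F"), c("A","G","H"))                      # sequenceID 4
)

# Does a pattern (an ordered list of itemsets) occur within one customer sequence?
contains_seq <- function(s, pattern) {
  pos <- 1
  for (p in pattern) {
    found <- FALSE
    while (pos <= length(s)) {
      ok <- all(p %in% s[[pos]])  # pattern element must be a subset of the event
      pos <- pos + 1
      if (ok) { found <- TRUE; break }
    }
    if (!found) return(FALSE)
  }
  TRUE
}

# Support = fraction of customer sequences that contain the pattern
seq_support <- function(pattern) mean(sapply(seqs, contains_seq, pattern = pattern))

seq_support(list("D", "A"))  # <{D},{A}>: D occurs before A in sequences 1 and 4, so 0.5
```

With support = 0.4 passed to cspade() below, a pattern must occur in at least two of the four customer sequences to be kept.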
s1 <- cspade(zaki, parameter = list(support = 0.4),
control = list(verbose = TRUE, tidLists = TRUE))
##
## parameter specification:
## support : 0.4
## maxsize : 10
## maxlen : 10
##
## algorithmic control:
## bfstype : FALSE
## verbose : TRUE
## summary : FALSE
## tidLists : TRUE
##
## preprocessing ... 1 partition(s), 0 MB [0.14s]
## mining transactions ... 0 MB [0.14s]
## reading sequences ... [0.05s]
##
## total elapsed time: 0.33s
summary(s1)
## set of 18 sequences with
##
## most frequent items:
## A B F D (Other)
## 11 10 10 8 28
##
## most frequent elements:
## {A} {D} {B} {F} {B,F} (Other)
## 8 8 4 4 4 3
##
## element (sequence) size distribution:
## sizes
## 1 2 3
## 8 7 3
##
## sequence length distribution:
## lengths
## 1 2 3 4
## 4 8 5 1
##
## summary of quality measures:
## support
## Min. :0.5000
## 1st Qu.:0.5000
## Median :0.5000
## Mean :0.6528
## 3rd Qu.:0.7500
## Max. :1.0000
##
## includes transaction ID lists: TRUE
##
## mining info:
## data ntransactions nsequences support
## zaki 10 4 0.4
as(s1, "data.frame")
## sequence support
## 1 <{A}> 1.00
## 2 <{B}> 1.00
## 3 <{D}> 0.50
## 4 <{F}> 1.00
## 5 <{A,F}> 0.75
## 6 <{B,F}> 1.00
## 7 <{D},{F}> 0.50
## 8 <{D},{B,F}> 0.50
## 9 <{A,B,F}> 0.75
## 10 <{A,B}> 0.75
## 11 <{D},{B}> 0.50
## 12 <{B},{A}> 0.50
## 13 <{D},{A}> 0.50
## 14 <{F},{A}> 0.50
## 15 <{D},{F},{A}> 0.50
## 16 <{B,F},{A}> 0.50
## 17 <{D},{B,F},{A}> 0.50
## 18 <{D},{B},{A}> 0.50
This gives each frequent sequence of items together with its support value. For example, the support of sequence <{D},{A}> is 0.50: it occurs in two of the four customer sequences.
Top sequences
Let's look at the top 5 sequences.
x = as(s1, "data.frame")
x_or = x[order(x$support, decreasing = T),]
x_or[1:5,]
## sequence support
## 1 <{A}> 1.00
## 2 <{B}> 1.00
## 4 <{F}> 1.00
## 6 <{B,F}> 1.00
## 5 <{A,F}> 0.75
barplot(x_or[1:5,]$support ~ x_or[1:5,]$sequence, col = 1:5)
Using this set of frequent sequences s1, we can also generate association rules. We can do this using the ruleInduction() function from the arulesSequences package.
rules = ruleInduction(s1, confidence = 0.6, control = list(verbose = FALSE))
inspect(rules)
## lhs rhs support confidence lift
## 1 <{D}> => <{F}> 0.5 1 1
## 2 <{D}> => <{B, 0.5 1 1
## F}>
## 3 <{D}> => <{B}> 0.5 1 1
## 4 <{D}> => <{A}> 0.5 1 1
## 5 <{D},
## {F}> => <{A}> 0.5 1 1
## 6 <{D},
## {B,
## F}> => <{A}> 0.5 1 1
## 7 <{D},
## {B}> => <{A}> 0.5 1 1
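The quality measures of these sequence rules follow the same arithmetic as for ordinary association rules: confidence is the support of the full sequence divided by the support of the left-hand side, and lift divides that by the support of the right-hand side. As a check, for rule 1 (<{D}> => <{F}>), using the supports reported in the cspade output above:

```r
supp_DF <- 0.5  # support of <{D},{F}> from the cspade output
supp_D  <- 0.5  # support of <{D}>
supp_F  <- 1.0  # support of <{F}>

confidence <- supp_DF / supp_D     # 1: every sequence containing D later contains F
lift       <- confidence / supp_F  # 1: but F appears in every sequence anyway
```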
Exercise: Try using some other dataset from R or other sources and apply these market basket analysis techniques using association rules.