Lab_MBA_1_RM
Asmi Ariv
2022-10-11
Market Basket Analysis in R
In this lab, we will walk through market basket analysis. Kindly go through the video lectures before proceeding with this lab.
We will require a few packages to run the analysis: arules and arulesSequences.
library(arules)
library(arulesSequences)
Data
Let's begin with the dataset Adult that comes with the package arules. This dataset has been prepared for association mining from the original data AdultUCI, as explained on the arules help page. You can get more details on the dataset by typing help(package="arules") in R and then clicking "Adult" on the help page.
The Adult database was extracted from the census bureau database found at https://www.census.gov/ in 1994 by Ronny Kohavi and Barry Becker (Data Mining and Visualization, Silicon Graphics). It was originally used to predict whether income exceeds USD 50K/yr based on census data. We added the attribute income with levels small and large (>50K).
data("Adult")
dim(Adult)
## [1] 48842   115
Adult
## transactions in sparse format with
## 48842 transactions (rows) and
## 115 items (columns)
There are 48,842 transactions and 115 items (columns).
To view the data, we will use the inspect() function from arules.
inspect(Adult[1:10]) # Since the number of records is very large, we will just look at the first 10
## items transactionID
## [1] {age=Middle-aged,
## workclass=State-gov,
## education=Bachelors,
## marital-status=Never-married,
## occupation=Adm-clerical,
## relationship=Not-in-family,
## race=White,
## sex=Male,
## capital-gain=Low,
## capital-loss=None,
## hours-per-week=Full-time,
## native-country=United-States,
## income=small} 1
## [2] {age=Senior,
## workclass=Self-emp-not-inc,
## education=Bachelors,
## marital-status=Married-civ-spouse,
## occupation=Exec-managerial,
## relationship=Husband,
## race=White,
## sex=Male,
## capital-gain=None,
## capital-loss=None,
## hours-per-week=Part-time,
## native-country=United-States,
## income=small} 2
## [3] {age=Middle-aged,
## workclass=Private,
## education=HS-grad,
## marital-status=Divorced,
## occupation=Handlers-cleaners,
## relationship=Not-in-family,
## race=White,
## sex=Male,
## capital-gain=None,
## capital-loss=None,
## hours-per-week=Full-time,
## native-country=United-States,
## income=small} 3
## [4] {age=Senior,
## workclass=Private,
## education=11th,
## marital-status=Married-civ-spouse,
## occupation=Handlers-cleaners,
## relationship=Husband,
## race=Black,
## sex=Male,
## capital-gain=None,
## capital-loss=None,
## hours-per-week=Full-time,
## native-country=United-States,
## income=small} 4
## [5] {age=Middle-aged,
## workclass=Private,
## education=Bachelors,
## marital-status=Married-civ-spouse,
## occupation=Prof-specialty,
## relationship=Wife,
## race=Black,
## sex=Female,
## capital-gain=None,
## capital-loss=None,
## hours-per-week=Full-time,
## native-country=Cuba,
## income=small} 5
## [6] {age=Middle-aged,
## workclass=Private,
## education=Masters,
## marital-status=Married-civ-spouse,
## occupation=Exec-managerial,
## relationship=Wife,
## race=White,
## sex=Female,
## capital-gain=None,
## capital-loss=None,
## hours-per-week=Full-time,
## native-country=United-States,
## income=small} 6
## [7] {age=Senior,
## workclass=Private,
## education=9th,
## marital-status=Married-spouse-absent,
## occupation=Other-service,
## relationship=Not-in-family,
## race=Black,
## sex=Female,
## capital-gain=None,
## capital-loss=None,
## hours-per-week=Part-time,
## native-country=Jamaica,
## income=small} 7
## [8] {age=Senior,
## workclass=Self-emp-not-inc,
## education=HS-grad,
## marital-status=Married-civ-spouse,
## occupation=Exec-managerial,
## relationship=Husband,
## race=White,
## sex=Male,
## capital-gain=None,
## capital-loss=None,
## hours-per-week=Over-time,
## native-country=United-States,
## income=large} 8
## [9] {age=Middle-aged,
## workclass=Private,
## education=Masters,
## marital-status=Never-married,
## occupation=Prof-specialty,
## relationship=Not-in-family,
## race=White,
## sex=Female,
## capital-gain=High,
## capital-loss=None,
## hours-per-week=Over-time,
## native-country=United-States,
## income=large} 9
## [10] {age=Middle-aged,
## workclass=Private,
## education=Bachelors,
## marital-status=Married-civ-spouse,
## occupation=Exec-managerial,
## relationship=Husband,
## race=White,
## sex=Male,
## capital-gain=Low,
## capital-loss=None,
## hours-per-week=Full-time,
## native-country=United-States,
## income=large} 10
As you can see from the output above, this is how an association mining dataset looks: each record is a collection of items.
Some of the attributes in this dataset include: age, workclass, education, marital-status, occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week, native-country, income.
Generating rules
Based on the records, let's generate association rules. To generate the rules, we will use the apriori() function from the arules package.
In the function, we need to set the support and confidence values and, optionally, the maximum rule length.
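To make these parameters concrete, here is a minimal base-R sketch, on toy transactions with hypothetical items (not the Adult data), of how support and confidence are computed for a single rule A => B:

```r
# Five toy transactions (hypothetical, not from the Adult data)
trans <- list(c("A", "B"), c("A", "B"), c("A"), c("B"), c("A", "B", "C"))
n <- length(trans)

has <- function(item) sapply(trans, function(t) item %in% t)

# support(A => B): fraction of transactions containing both A and B
supp <- sum(has("A") & has("B")) / n             # 3/5 = 0.6
# confidence(A => B): support(A => B) relative to the support of A alone
conf <- sum(has("A") & has("B")) / sum(has("A")) # 3/4 = 0.75
```

A support threshold of 0.009 therefore keeps any rule whose item combination appears in at least 0.9% of the 48,842 records; the "absolute minimum support count: 439" in the output below is exactly 0.009 x 48842, rounded down.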
rules <- apriori(Adult, parameter = list(supp = 0.009, conf = 0.5))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.009 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 439
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[115 item(s), 48842 transaction(s)] done [0.03s].
## sorting and recoding items ... [68 item(s)] done [0.01s].
## creating transaction tree ... done [0.02s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10
## Warning in apriori(Adult, parameter = list(supp = 0.009, conf = 0.5)): Mining
## stopped (maxlen reached). Only patterns up to a length of 10 returned!
## done [0.55s].
## writing ... [362223 rule(s)] done [0.10s].
## creating S4 object ... done [0.13s].
length(rules)
## [1] 362223
As we can see, the algorithm has generated 362,223 rules, a very large number. Perhaps we have set the support value too low.
Let’s change the support value to 0.1 and see how many rules are generated.
rules <- apriori(Adult, parameter = list(supp = 0.1, conf = 0.5))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.1 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 4884
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[115 item(s), 48842 transaction(s)] done [0.05s].
## sorting and recoding items ... [31 item(s)] done [0.00s].
## creating transaction tree ... done [0.02s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 done [0.07s].
## writing ... [9259 rule(s)] done [0.00s].
## creating S4 object ... done [0.02s].
length(rules)
## [1] 9259
Now we have 9259 rules; even this is a very large number. Let's try changing the confidence value to 0.75.
rules <- apriori(Adult, parameter = list(supp = 0.1, conf = 0.75))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.75 0.1 1 none FALSE TRUE 5 0.1 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 4884
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[115 item(s), 48842 transaction(s)] done [0.03s].
## sorting and recoding items ... [31 item(s)] done [0.01s].
## creating transaction tree ... done [0.02s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 done [0.07s].
## writing ... [6392 rule(s)] done [0.00s].
## creating S4 object ... done [0.02s].
length(rules)
## [1] 6392
Now we have 6392 rules.
As you can see, altering the values of support and confidence brings the number of rules down drastically.
Therefore, it is very important to take care when setting these parameters, and the right values depend a lot on the domain we are dealing with.
Inspecting rules
Now let’s look at some of the rules
inspect(rules[100:102])
## lhs rhs support confidence coverage
## [1] {age=Middle-aged} => {capital-gain=None} 0.4632079 0.9170281 0.5051185
## [2] {age=Middle-aged} => {capital-loss=None} 0.4800786 0.9504276 0.5051185
## [3] {income=small} => {race=White} 0.4237951 0.8373382 0.5061218
## lift count
## [1] 0.9996091 22624
## [2] 0.9970100 23448
## [3] 0.9792939 20699
So, rule number 100: when the antecedent is age=Middle-aged, the consequent capital-gain=None holds with support 0.46 and confidence 0.92, and the lift value is close to 1. In other words, we have 92% confidence that middle-aged people in this group have no capital gain.
As per rule number 101, there is 95% confidence that middle-aged people in this group have no capital loss.
As per rule number 102, there is 84% confidence that people with a small income in this group are white.
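The three measures are linked by lift(lhs => rhs) = confidence / support(rhs), so from the numbers reported above for rule 100 we can back out the marginal support of the consequent (simple arithmetic, not an arules call):

```r
confidence <- 0.9170281  # reported confidence of {age=Middle-aged} => {capital-gain=None}
lift       <- 0.9996091  # reported lift of the same rule
supp_rhs   <- confidence / lift
round(supp_rhs, 4)       # 0.9174: capital-gain=None holds in about 92% of all records
```

A lift this close to 1 means that knowing someone is middle-aged barely changes the probability of capital-gain=None, so the rule is unremarkable despite its high confidence.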
We can also change the appearance of the rules by supplying some arguments to the inspect() function.
inspect(rules[1000:1002], ruleSep = "~~>", itemSep = " + ", setStart = "", setEnd = "",
linebreak = FALSE)
## lhs
## [1] age=Young + native-country=United-States + income=small ~~>
## [2] age=Young + capital-loss=None + income=small ~~>
## [3] age=Young + capital-gain=None + income=small ~~>
## rhs support confidence coverage lift count
## [1] capital-loss=None 0.1138365 0.9774965 0.1164571 1.025406 5560
## [2] native-country=United-States 0.1138365 0.9021580 0.1261824 1.005275 5560
## [3] capital-loss=None 0.1229475 0.9781723 0.1256910 1.026115 6005
Rule 1001 (the second rule above) says: there is 90% confidence that young people with a small income and no capital loss are natives of the United States.
Eliminating redundant rules
Some rules are redundant, and it makes no sense to include them, since we want simple rules that we can use in our work.
For example, two rules may have the same consequent with nearly the same confidence, while the antecedent of one rule is a subset of the antecedent of the other, such as (A, B, C) -> (D) and (A, B) -> (D). Here we can drop the first (longer) rule, since both rules have almost the same confidence.
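The criterion can be illustrated with hypothetical counts in base R: the longer rule adds an item to the antecedent without improving confidence, so it is redundant.

```r
# Hypothetical counts out of 100 transactions
n_AB   <- 40; n_ABD  <- 30  # transactions containing {A,B}; containing {A,B,D}
n_ABC  <- 20; n_ABCD <- 15  # transactions containing {A,B,C}; containing {A,B,C,D}

conf_general  <- n_ABD  / n_AB   # confidence of (A, B) -> (D):    0.75
conf_specific <- n_ABCD / n_ABC  # confidence of (A, B, C) -> (D): 0.75
# Adding C buys no extra confidence, so the longer rule is redundant
conf_specific <= conf_general    # TRUE
```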
We can do this using the is.redundant() function from the arules package.
new_rules = rules[!is.redundant(rules)]
length(new_rules)
## [1] 593
We have brought the number of rules down from 6392 to 593.
We can also generate rules subject to some constraints for our analysis.
Let's say we want only the rules whose consequent is income=large, with minimum support 0.01 and minimum confidence 0.6.
rules <- apriori(data = Adult, parameter = list(supp = 0.01, conf = 0.6), appearance = list(default = "lhs", rhs = "income=large"), control = list(verbose = F))
length(rules)
## [1] 88
So, we found 88 rules that satisfy our conditions.
Let's inspect one of the rules:
inspect(rules[10])
## lhs rhs support confidence coverage lift count
## [1] {capital-gain=High,
## capital-loss=None} => {income=large} 0.02319725 0.6704142 0.03460137 4.176045 1133
Rule 10 says that there is 67% confidence that people with a high capital gain and no capital loss have a large income. The lift value is 4.18, much higher than 1, which is strong evidence that these events are dependent.
Sequential Data Mining
When the order of events is important, i.e., when there is a sequential pattern in the data, we need something called sequence mining. For example, you apply the brake to stop a moving car, but you do not stop the car to apply the brake. In this example, the order of events becomes extremely important.
We need data that preserves this ordering in order to mine frequent sequential patterns.
For this kind of data mining we will use the cspade() function from the arulesSequences package.
We will use the dataset zaki that comes with this package.
The data must contain temporal information such as a sequenceID (a sequence or customer identifier) and an eventID (an event identifier).
Loading data
data(zaki)
class(zaki)
## [1] "transactions"
## attr(,"package")
## [1] "arules"
length(zaki)
## [1] 10
So, the data is stored as transactions and there are only 10 records.
Let’s look at the data:
as(zaki, "data.frame")
## items sequenceID eventID SIZE
## 1 {C,D} 1 10 2
## 2 {A,B,C} 1 15 3
## 3 {A,B,F} 1 20 3
## 4 {A,C,D,F} 1 25 4
## 5 {A,B,F} 2 15 3
## 6 {E} 2 20 1
## 7 {A,B,F} 3 10 3
## 8 {D,G,H} 4 10 3
## 9 {B,F} 4 20 2
## 10 {A,G,H} 4 25 3
As we can see, each record has items, a sequenceID, an eventID and a size.
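To make sequence support concrete before running cspade(), here is a hedged base-R sketch that recomputes the support of a pattern directly from the four customer sequences shown above; the helpers contains_seq() and seq_support() are our own, not part of arulesSequences:

```r
# The four zaki customer sequences, each an ordered list of event itemsets
seqs <- list(
  list(c("C","D"), c("A","B","C"), c("A","B","F"), c("A","C","D","F")), # sequenceID 1
  list(c("A","B","F"), c("E")),                                         # sequenceID 2
  list(c("A","B","F")),                                                 # sequenceID 3
  list(c("D","G","H"), c("B","F"), c("A","G","H"))                      # sequenceID 4
)

# Does a pattern (an ordered list of itemsets) occur within one customer sequence?
contains_seq <- function(s, pattern) {
  pos <- 1
  for (p in pattern) {
    found <- FALSE
    while (pos <= length(s)) {
      ok <- all(p %in% s[[pos]])  # pattern element must be a subset of the event
      pos <- pos + 1
      if (ok) { found <- TRUE; break }
    }
    if (!found) return(FALSE)
  }
  TRUE
}

# Support = fraction of customer sequences that contain the pattern
seq_support <- function(pattern) mean(sapply(seqs, contains_seq, pattern = pattern))

seq_support(list("D", "A"))  # <{D},{A}>: D occurs before A in sequences 1 and 4, so 0.5
```

With support = 0.4 passed to cspade() below, a pattern must occur in at least two of the four customer sequences to be kept.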
s1 <- cspade(zaki, parameter = list(support = 0.4),
control = list(verbose = TRUE, tidLists = TRUE))
##
## parameter specification:
## support : 0.4
## maxsize : 10
## maxlen : 10
##
## algorithmic control:
## bfstype : FALSE
## verbose : TRUE
## summary : FALSE
## tidLists : TRUE
##
## preprocessing ... 1 partition(s), 0 MB [0.14s]
## mining transactions ... 0 MB [0.14s]
## reading sequences ... [0.05s]
##
## total elapsed time: 0.33s
summary(s1)
## set of 18 sequences with
##
## most frequent items:
## A B F D (Other)
## 11 10 10 8 28
##
## most frequent elements:
## {A} {D} {B} {F} {B,F} (Other)
## 8 8 4 4 4 3
##
## element (sequence) size distribution:
## sizes
## 1 2 3
## 8 7 3
##
## sequence length distribution:
## lengths
## 1 2 3 4
## 4 8 5 1
##
## summary of quality measures:
## support
## Min. :0.5000
## 1st Qu.:0.5000
## Median :0.5000
## Mean :0.6528
## 3rd Qu.:0.7500
## Max. :1.0000
##
## includes transaction ID lists: TRUE
##
## mining info:
## data ntransactions nsequences support
## zaki 10 4 0.4
as(s1, "data.frame")
## sequence support
## 1 <{A}> 1.00
## 2 <{B}> 1.00
## 3 <{D}> 0.50
## 4 <{F}> 1.00
## 5 <{A,F}> 0.75
## 6 <{B,F}> 1.00
## 7 <{D},{F}> 0.50
## 8 <{D},{B,F}> 0.50
## 9 <{A,B,F}> 0.75
## 10 <{A,B}> 0.75
## 11 <{D},{B}> 0.50
## 12 <{B},{A}> 0.50
## 13 <{D},{A}> 0.50
## 14 <{F},{A}> 0.50
## 15 <{D},{F},{A}> 0.50
## 16 <{B,F},{A}> 0.50
## 17 <{D},{B,F},{A}> 0.50
## 18 <{D},{B},{A}> 0.50
This gives each frequent sequence of items together with its support value. For example, the support of sequence <{D},{A}> is 0.50: it occurs in two of the four customer sequences.
Top sequences
Let's look at the top 5 sequences.
x = as(s1, "data.frame")
x_or = x[order(x$support, decreasing = T),]
x_or[1:5,]
## sequence support
## 1 <{A}> 1.00
## 2 <{B}> 1.00
## 4 <{F}> 1.00
## 6 <{B,F}> 1.00
## 5 <{A,F}> 0.75
barplot(x_or[1:5,]$support ~ x_or[1:5,]$sequence, col = 1:5)
Using this set of frequent sequences s1, we can also generate association rules. We can do this using the ruleInduction() function from the arulesSequences package.
rules = ruleInduction(s1, confidence = 0.6, control = list(verbose = FALSE))
inspect(rules)
## lhs rhs support confidence lift
## 1 <{D}> => <{F}> 0.5 1 1
## 2 <{D}> => <{B, 0.5 1 1
## F}>
## 3 <{D}> => <{B}> 0.5 1 1
## 4 <{D}> => <{A}> 0.5 1 1
## 5 <{D},
## {F}> => <{A}> 0.5 1 1
## 6 <{D},
## {B,
## F}> => <{A}> 0.5 1 1
## 7 <{D},
## {B}> => <{A}> 0.5 1 1
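The quality measures of these sequence rules follow the same arithmetic as for ordinary association rules: confidence is the support of the full sequence divided by the support of the left-hand side, and lift divides that by the support of the right-hand side. As a check, for rule 1 (<{D}> => <{F}>), using the supports reported in the cspade output above:

```r
supp_DF <- 0.5  # support of <{D},{F}> from the cspade output
supp_D  <- 0.5  # support of <{D}>
supp_F  <- 1.0  # support of <{F}>

confidence <- supp_DF / supp_D     # 1: every sequence containing D later contains F
lift       <- confidence / supp_F  # 1: but F appears in every sequence anyway
```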
Exercise: Try using some other dataset from R or other sources and apply these market basket analysis techniques using association rules.