R: Market Basket Analysis - Tech Products











Context

Market Basket Analysis is a technique used in data mining and analytics to identify patterns of co-occurrence among items purchased by customers. It aims to uncover associations or relationships between products that are frequently bought together in a transaction or shopping basket. We need transactional data for a project like this, such as point-of-sale records or online purchase histories, where each transaction consists of a list of items purchased by a customer. The goal is to determine which items tend to be bought together and understand the underlying associations between them.

In this article I study a dataset covering two years of transaction history at a small telecommunications equipment store, and use the appropriate algorithms to determine which sets of items can be reported to the executive staff as strongly related. By related, we mean rules such as, “if a customer buys a ‘Razer Wireless Gaming Mouse Venom Green Edition’, then they are likely also to buy a ‘Logitech High-Performance Mouse Pad with Wireless Charging and 40,000 Color RGB Light Strip w/ Anime Girl Design’”. Pre-processing of the data is also discussed, and the project is carried out in the R programming language.

Apriori Association Mining

The market basket technique first involves transforming the data, whatever its original form, into a formal transaction register that the relevant packages can search for patterns in buying habits. If customers who purchase the set of items on the “left” side of a rule often also buy the item on the “right” side, we want to quantify the strength of that relation. For example,

{NVIDIA GTX 1090, ROC 32” Gaming Monitor} => {Shadow of the Tomb Raider}

is a hypothetical pattern in which a very graphically demanding game is purchased along with two high-end PC gaming items. The Apriori algorithm quantifies several metrics for rules like this one. Support is the fraction of all transactions that contain every item in the rule. Confidence is the fraction of transactions containing the graphics card and monitor that also contain Tomb Raider. Lift is the confidence divided by the overall fraction of transactions containing Tomb Raider, so a lift above 1 means the items appear together more often than would be expected if they were bought independently. We can decide on what thresholds are appropriate using analysis and visualization, and then make recommendations for targeted marketing based on the results.¹
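
As a rough illustration of these three metrics, here is a minimal base R sketch on five made-up baskets. The basket contents and the helper name `contains` are assumptions for illustration only; they are not part of the store's data or of any package.

```r
# Toy transactions (hypothetical, not from the store's data)
baskets <- list(
  c("graphics card", "monitor", "game"),
  c("graphics card", "monitor", "game"),
  c("graphics card", "monitor"),
  c("game"),
  c("mouse pad")
)
lhs <- c("graphics card", "monitor")  # antecedent ("left" side)
rhs <- "game"                         # consequent ("right" side)

# TRUE for each basket that contains every item in `items`
contains <- function(items) sapply(baskets, function(b) all(items %in% b))

support    <- mean(contains(c(lhs, rhs)))       # share of baskets holding lhs and rhs together
confidence <- support / mean(contains(lhs))     # share of lhs baskets that also hold rhs
lift       <- confidence / mean(contains(rhs))  # confidence relative to rhs's overall share

round(c(support = support, confidence = confidence, lift = lift), 3)
```

In this toy set the rule holds in 2 of 5 baskets (support 0.4) and in 2 of the 3 baskets that contain the left side (confidence of about 0.67). Since the game appears in 3 of 5 baskets overall, the lift is about 1.11.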

Data Preparation

This dataset was proprietary to my graduate school and as such a public link to the source is not available. 

#import dataset
data <- read.csv("*/teleco_market_basket.csv",na.strings=c(""," "),sep = ',')
#investigate structure
str(data)
## 'data.frame':    15002 obs. of  20 variables:
##  $ Item01: chr  NA "Logitech M510 Wireless mouse" NA "Apple Lightning to Digital AV Adapter" ...
##  $ Item02: chr  NA "HP 63 Ink" NA "TP-Link AC1750 Smart WiFi Router" ...
##  $ Item03: chr  NA "HP 65 ink" NA "Apple Pencil" ...
##  $ Item04: chr  NA "nonda USB C to USB Adapter" NA NA ...
##  $ Item05: chr  NA "10ft iPHone Charger Cable" NA NA ...
##  $ Item06: chr  NA "HP 902XL ink" NA NA ...
##  $ Item07: chr  NA "Creative Pebble 2.0 Speakers" NA NA ...
##  $ Item08: chr  NA "Cleaning Gel Universal Dust Cleaner" NA NA ...
##  $ Item09: chr  NA "Micro Center 32GB Memory card" NA NA ...
##  $ Item10: chr  NA "YUNSONG 3pack 6ft Nylon Lightning Cable" NA NA ...
##  $ Item11: chr  NA "TopMate C5 Laptop Cooler pad" NA NA ...
##  $ Item12: chr  NA "Apple USB-C Charger cable" NA NA ...
##  $ Item13: chr  NA "HyperX Cloud Stinger Headset" NA NA ...
##  $ Item14: chr  NA "TONOR USB Gaming Microphone" NA NA ...
##  $ Item15: chr  NA "Dust-Off Compressed Gas 2 pack" NA NA ...
##  $ Item16: chr  NA "3A USB Type C Cable 3 pack 6FT" NA NA ...
##  $ Item17: chr  NA "HOVAMP iPhone charger" NA NA ...
##  $ Item18: chr  NA "SanDisk Ultra 128GB card" NA NA ...
##  $ Item19: chr  NA "FEEL2NICE 5 pack 10ft Lighning cable" NA NA ...
##  $ Item20: chr  NA "FEIYOLD Blue light Blocking Glasses" NA NA ...

The raw data contains 15002 records. It appears to be formatted so that up to 20 items may be recorded per transaction, with each item slot forming a column and any transaction with fewer than 20 items left blank in the remaining columns. It is also apparent that every other row is blank. We need to reshape the data before it can be coded as a transaction database for apriori.

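#delete blank rows (every odd-numbered row is empty)
data <- data[seq(2, nrow(data), 2),]
#create two-column dataframe with items attached to each transaction;
#unlist() reads the columns one after another, so the ids repeat once per column
t <- data.frame(tid=rep(1:nrow(data),ncol(data)))
t$item <- unlist(data)
#drop the empty item slots, then sort by transaction id
t <- na.omit(t)
t <- t[order(t$tid),]
#write the dataframe to file
write.csv(t,"*/transaction_frame.csv")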
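
To see why repeating the row ids once per column lines up with `unlist()`, here is a small sketch of the same reshape on a made-up three-transaction table. The object names `wide` and `long` and the item values are hypothetical.

```r
# Hypothetical 3-transaction table in the same wide layout (NA = empty slot)
wide <- data.frame(Item01 = c("mouse", "router", "cable"),
                   Item02 = c("ink", NA, "charger"),
                   Item03 = c(NA, NA, "headset"),
                   stringsAsFactors = FALSE)

# unlist() reads column by column, so the row ids must repeat once per column
long <- data.frame(tid  = rep(seq_len(nrow(wide)), ncol(wide)),
                   item = unlist(wide, use.names = FALSE))

# drop the empty slots and group each transaction's items together
long <- na.omit(long)
long <- long[order(long$tid), ]
long
```

Each remaining row pairs one purchased item with the id of the transaction it came from, which is the shape that `split()` needs in the next step.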

The set is now prepared for the creation of a transaction register with the help of the arules package. Along with this, the accompanying arulesViz package is loaded for later analysis.

#load package
suppressMessages(library(arules))
suppressMessages(library(arulesViz))
#create transaction object
t_list <- split(t$item,t$tid)
trx <- as(t_list,"transactions")
## Warning in asMethod(object): removing duplicated items in transactions
#verify the second transaction against the original data
inspect(trx[2])
##     items                                    transactionID
## [1] {Apple Lightning to Digital AV Adapter,               
##      Apple Pencil,                                        
##      TP-Link AC1750 Smart WiFi Router}                   2

With the last command, it's evident by comparison to the original dataframe that the integrity of transactions has been maintained in this transformation.

Analysis

Apriori is run here with two parameters set in advance: the minimum support threshold and the minimum confidence threshold (there are also methods to search for good values programmatically). These control the number and quality of the rules, or patterns, that we can use as insights on shopping behavior. Set them too low and we get many rules that cannot be trusted; set them too high and we get far fewer dependable shopping patterns than the size of our enterprise data should allow. There are no established rules for what these parameters should be, but below I'll present an entry-level thought process.

Consider the 100 most frequently purchased single items:

#view top items
itemFrequencyPlot(trx,topN=100)
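
The numbers behind a plot like this can also be read off directly with `itemFrequency()` from arules. The sketch below uses four made-up baskets (the object `toy` and its items are hypothetical) so that it runs on its own; the same call should work on any transactions object.

```r
library(arules)

# Hypothetical baskets standing in for the store's transaction object
toy <- as(list(c("air duster", "cable"),
               c("air duster", "mouse"),
               c("cable"),
               c("air duster")), "transactions")

# relative frequency (support) of each single item, most frequent first
item_support <- sort(itemFrequency(toy, type = "relative"), decreasing = TRUE)
head(item_support, 3)
```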

The most frequent item occurs in about 25% of the transactions. No itemset can be more frequent than the items it contains, so no rule at all can appear if the minimum support threshold is set above 0.25. To capture a wide variety of possible antecedent-consequent rules, we can opt to start the algorithm with a minimum support of 0.005. Given the relatively small size of this dataset, a much higher value would leave few rules to choose from. With this in mind, we can be more quantitative in our selection of minimum confidence. The following is an evaluation of the model behavior for different confidences, assuming the support is fixed:

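library(ggplot2)
#candidate minimum confidence thresholds
confidenceLevels <- seq(from=0.1, to=0.9, by=0.1)
#count the rules apriori finds at each confidence level, with support fixed at 0.005
rules_sup0005 <- sapply(confidenceLevels, function(conf) {
  length(apriori(trx,control=list(verbose=FALSE),
                 parameter=list(supp=0.005,conf=conf,target="rules")))})
#plot number of rules against confidence level (qplot() is deprecated in recent ggplot2)
ggplot(data.frame(confidenceLevels, rules_sup0005),
       aes(x=confidenceLevels, y=rules_sup0005)) +
  geom_point() + geom_line() +
  labs(x="Confidence level", y="Number of Rules") + theme_bw()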

This reveals that very few rules survive a confidence threshold above 0.5, and none at all from 0.7 upward, so selecting 0.5 as the confidence threshold should result in a fairly compact but useful set of rules.

#create rules with apriori
rules <- apriori(trx,parameter=list(supp=0.005,conf=0.5),control=list(verbose = FALSE))

Findings

summary(rules)
## set of 20 rules
## 
## rule length distribution (lhs + rhs):sizes
##  3 
## 20 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       3       3       3       3       3       3 
## 
## summary of quality measures:
##     support           confidence        coverage             lift      
##  Min.   :0.005066   Min.   :0.5000   Min.   :0.007999   Min.   :2.098  
##  1st Qu.:0.005933   1st Qu.:0.5065   1st Qu.:0.011332   1st Qu.:2.148  
##  Median :0.007132   Median :0.5219   Median :0.013865   Median :2.275  
##  Mean   :0.007272   Mean   :0.5336   Mean   :0.013771   Mean   :2.358  
##  3rd Qu.:0.008532   3rd Qu.:0.5472   3rd Qu.:0.016531   3rd Qu.:2.424  
##  Max.   :0.011065   Max.   :0.6333   Max.   :0.021997   Max.   :3.005  
##      count      
##  Min.   :38.00  
##  1st Qu.:44.50  
##  Median :53.50  
##  Mean   :54.55  
##  3rd Qu.:64.00  
##  Max.   :83.00  
## 
## mining info:
##  data ntransactions support confidence
##   trx          7501   0.005        0.5
##                                                                                              call
##  apriori(data = trx, parameter = list(supp = 0.005, conf = 0.5), control = list(verbose = FALSE))

The summary table shows that 20 rules have been discovered with a minimum confidence of 0.5 and a minimum support of 0.005, each made up of three items. The mean support, confidence, and lift of the rules are 0.0073, 0.5336, and 2.358. The highest support is 0.0111, the highest confidence is 0.6333, and the highest lift is 3.005. It may be of interest to view the first few rules that apriori has found:

DATAFRAME(rules[1:3])
##                                                                       LHS
## 1       {3A USB Type C Cable 3 pack 6FT,VIVO Dual LCD Monitor Desk mount}
## 2  {10ft iPHone Charger Cable 2 Pack,FEIYOLD Blue light Blocking Glasses}
## 3 {10ft iPHone Charger Cable 2 Pack,Nylon Braided Lightning to USB cable}
##                                RHS     support confidence    coverage     lift
## 1 {Dust-Off Compressed Gas 2 pack} 0.006799093  0.5049505 0.013464871 2.118363
## 2 {Dust-Off Compressed Gas 2 pack} 0.005199307  0.5820896 0.008932142 2.441976
## 3 {Dust-Off Compressed Gas 2 pack} 0.005065991  0.6333333 0.007998933 2.656954
##   count
## 1    51
## 2    39
## 3    38

Although the item names are very long, we can tell that all three of these rules have the Dust-Off Compressed Gas product as the consequent. The individual rules are not very strong according to the critical metrics, and the small size and scope of this data are partially to blame. Still, there is a clear pattern: nearly every rule has the air duster product on the right-hand side. Is this an expected result? This visualization shows the rules converging on this product:

#plot rules
plot(rules,method="graph")
#view top items with names (unfortunately we cannot control their rotation)
itemFrequencyPlot(trx,topN=10)

Yes, we expect this result, since the Compressed Gas 2-pack is indeed the highest-selling item by a fair margin. Additionally, it is an item of particular use as an accessory to electronic equipment. One conclusion, then, is that the Compressed Gas 2-pack is a major upsell opportunity, as it helps maintain expensive electronic equipment. Customers purchasing electronics should be targeted with this product. Additional research with a greater scope of data would be the next step to investigate these patterns further.
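
One simple follow-up is to rank rules by a quality measure such as lift instead of reading them in stored order. The sketch below is only an illustration under stated assumptions: it uses the public `Groceries` dataset that ships with arules in place of the store's data, and the thresholds and the names `demo_rules` and `top_by_lift` are chosen just so the example runs on its own.

```r
library(arules)

# Groceries ships with arules; it stands in here for the store's transactions
data("Groceries")

# thresholds picked only so that this public dataset yields a handful of rules
demo_rules <- apriori(Groceries,
                      parameter = list(supp = 0.01, conf = 0.5),
                      control = list(verbose = FALSE))

# sort() on a rules object orders by the named quality measure, largest first
top_by_lift <- head(sort(demo_rules, by = "lift"), 3)
inspect(top_by_lift)
```

Applied to the rules object from this analysis, the same `sort()` call would bring the rule with the highest lift to the top.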

Sources

¹McColl, L. (2022, March 1). Market basket analysis: Understanding customer behaviour. Select Statistical Consultants. https://select-statistics.co.uk/blog/market-basket-analysis-understanding-customer-behaviour/