Fraudulent Transaction Detection with GBM
Introduction
In this post I will be creating a predictive model to identify fraudulent credit card transactions on a public data set from kaggle. Along the way, I will be reviewing some of the functionality of R
’s gbm
package for predictive modelling.
Data Overview
The data set contiains 280K+ records of credit card transactions from a two-week period in which a small percentage of transactions have been labeled as fraudulent. This data contains 31 columns (features) of which 28 are unnamed and pre-normalized in order to make their contents anonymous.
Gradient Boosting
To identify the fraudulent transactions, I will be using a prediction model in R
using the gbm
(Gradient Boosted Machines) package.
If you’re not familiar, gradient boosting is a machine learning technique for regression and classification that produces a prediction model in the form of an ensemble of weak prediction models, similar to the random Forrest model I employed in this post. A UC Business Analytics post describes the difference well:
Whereas random forests build an ensemble of deep independent trees, GBMs build an ensemble of shallow and weak successive trees with each tree learning and improving on the previous.
Let’s get started!
library(tidyverse) # Data wrangling
library(gbm) # Predictive modelling
library(pROC) # Model evaluation
library(kableExtra) # Table formatting
Data & Summary
First we download the compressed data files (3 of them in total) from my general github repo, uncompress them, and combine them into a single file called ccdat
. This file contains over 280K records so it may take some time.
mult_zip <- function(baseurl, s){
# function for downloading multiple gzip files
# from a github URL where s = seq(start,end)
ccdat <- data_frame()
for (i in s){
daturl = paste0(baseurl, as.character(i), ".csv.gz")
tmp = tempfile()
download.file(daturl, tmp)
df = read.csv(gzfile(tmp))
ccdat = rbind(ccdat, df)
}
return(ccdat)
}
ccurl = paste0("https://raw.githubusercontent.com/jbrnbrg/",
"msda-datasets/master/ccdat0")
ccdat <- mult_zip(ccurl, 0:2) %>%
select(-X)
## Warning: `data_frame()` was deprecated in tibble 1.1.0.
## Please use `tibble()` instead.
Next we’ll take a quick look at the data using glimpse
- not all the V
features are shown as they’re pre-normalized and not very enlightening in this summary view:
ccdat %>%
select(colnames(ccdat)[c(1:5,20:ncol(ccdat))]) %>%
glimpse(width = 50)
## Rows: 284,807
## Columns: 17
## $ Time <dbl> 0, 0, 1, 1, 2, 2, 4, 7, 7, 9, 10,~
## $ V1 <dbl> -1.3598071, 1.1918571, -1.3583541~
## $ V2 <dbl> -0.07278117, 0.26615071, -1.34016~
## $ V3 <dbl> 2.53634674, 0.16648011, 1.7732093~
## $ V4 <dbl> 1.37815522, 0.44815408, 0.3797795~
## $ V19 <dbl> 0.40399296, -0.14578304, -2.26185~
## $ V20 <dbl> 0.25141210, -0.06908314, 0.524979~
## $ V21 <dbl> -0.018306778, -0.225775248, 0.247~
## $ V22 <dbl> 0.277837576, -0.638671953, 0.7716~
## $ V23 <dbl> -0.110473910, 0.101288021, 0.9094~
## $ V24 <dbl> 0.06692807, -0.33984648, -0.68928~
## $ V25 <dbl> 0.12853936, 0.16717040, -0.327641~
## $ V26 <dbl> -0.18911484, 0.12589453, -0.13909~
## $ V27 <dbl> 0.133558377, -0.008983099, -0.055~
## $ V28 <dbl> -0.021053053, 0.014724169, -0.059~
## $ Amount <dbl> 149.62, 2.69, 378.66, 123.50, 69.~
## $ Class <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
Above, we can see that the only named features are Time
, Amount
, and Class
.
Below, a tabular preview; again, not all the variables are shown.
Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 |
---|---|---|---|---|---|---|---|---|---|
0 | -1.3598071 | -0.0727812 | 2.5363467 | 1.3781552 | -0.3383208 | 0.4623878 | 0.2395986 | 0.0986979 | 0.3637870 |
0 | 1.1918571 | 0.2661507 | 0.1664801 | 0.4481541 | 0.0600176 | -0.0823608 | -0.0788030 | 0.0851017 | -0.2554251 |
1 | -1.3583541 | -1.3401631 | 1.7732093 | 0.3797796 | -0.5031981 | 1.8004994 | 0.7914610 | 0.2476758 | -1.5146543 |
1 | -0.9662717 | -0.1852260 | 1.7929933 | -0.8632913 | -0.0103089 | 1.2472032 | 0.2376089 | 0.3774359 | -1.3870241 |
Feature Review
As previously stated, there are 31 features in this dataset but only Time
, Amount
, and Class
are named. The Class
variable identifies the type of transaction (0 for legit, 1 for fraudulent) and is our target variable, Amount
is the currency amount of the transaction, and Time
is the seconds elapsed between each transaction and the first transaction. Per the data description, the remaining 28 features have been provided post-transformation:
Features V1, V2, … V28 are the principal components obtained with PCA….
Since there’s a decent number of features and I’m not given any information regarding what the features are or what they contain, I’ll create side-by-side box plots that are separated by Class
to give me a quick overview of the differences between legit (Red) and fraudulent (Blue) transaction classes:
There are a number of features that appear to provide good separation between the fraudulent and legit transactions (e.g. v11
, V12
, v14
, & V17
) but looks can be deceiving - I’ll know better after I fit the model and review the feature importance measures.
Train & Test Split
I will be using a 66%/33% split for model training and testing.
tr_te <- sample(c(0,1), nrow(ccdat),
replace = T,
prob = c(0.66, 0.33))
train <- ccdat[tr_te == 0,]
test <- ccdat[tr_te == 1,]
Gradient Boosted Model
For a logistic regression, gbm
will automatically assume a bernoulli
distribution for binary target variables - convenient! There are tons of options not shown in the code chunk below that are also available and the documentation is very good and worth a read.
system.time(
my_gbm <- gbm(Class ~ .,
distribution = 'bernoulli',
data = train,
n.trees = 300, # or number of iters
interaction.depth = 3, # max depth of each tree
n.minobsinnode = 100, # min number of obs in nodes
shrinkage = 0.01, # the learning rate
bag.fraction = 0.5, # fraction of training
train.fraction = 1)
)
## user system elapsed
## 97.63 0.09 97.72
While certainly not the fasted run time, it’s important to note that the gbm
module does perform training and testing of the labeled data.
Feature Importance Review
Below, the summary computes the relative influence of each variable in the gbm model - I’m showing just the top 10 to keep things neat:
feat_importance <- summary(my_gbm, plotit = F)
feat_importance %>% head(10)
## var rel.inf
## V12 V12 31.5554169
## V14 V14 28.8967653
## V17 V17 25.9827440
## V10 V10 5.0675057
## V20 V20 4.3388632
## V7 V7 1.4728345
## V1 V1 0.6891196
## V11 V11 0.6534464
## V4 V4 0.4352978
## V9 V9 0.3512785
Model Evaluation
Given the very small number of fraudulent transactions compared to real transactions in this data, I’ll be measuring the accuracy using the Area Under the Curve (Precision/Recall Curve, that is).
test_gbm = predict(my_gbm, newdata = test, n.trees = 300)
gbm_auc = roc(test$Class, test_gbm, plot = TRUE, col = "red")
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
gbm_auc
##
## Call:
## roc.default(response = test$Class, predictor = test_gbm, plot = TRUE, col = "red")
##
## Data: test_gbm in 94917 controls (test$Class 0) < 155 cases (test$Class 1).
## Area under the curve: 0.9669
While I believe there’s room for improvement, an AUC of ~.93 is pretty good for a basic demonstration of the power of gradient boosting machines on this data. Note that one can get a big performance boost through variable normalization but since this data came with that work already complete, it wasn’t necessary here.