
Predicting Car Crashes with Insurance Data

Purpose

Today I review a data set containing information on approximately 8,000 customers of an insurance company. Each record contains two response variables: TARGET_FLAG, which indicates whether a customer was in a car crash, and TARGET_AMT, the cost of that crash.

library(tidyverse)
library(stargazer)
library(GGally)
library(kableExtra)
library(ResourceSelection)
library(RCurl)
library(pROC)
library(psych)  # provides describe() used below
clr_dollar <- function(x){
  # cleans out commas and $ of string-formatted currency
  return(as.numeric(gsub('[$,]','',x)))
}
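# e.g. clr_dollar("$14,230") returns 14230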

# Mappings to clean-up data entries for easier manipulation and analysis
job_map <- data.frame(JOB = c("Doctor", "Lawyer", "Professional", 
                              "Manager", "z_Blue Collar", "Clerical", 
                              "Home Maker", "Student", ""),
                      EMPL = c("Doc", "Law", "Pro", "Mgr", "BC", 
                               "Cler", "HM", "Stu", "-")) # '-' instead of blank

urb_map <- data.frame(URBANICITY = c("Highly Urban/ Urban", 
                                     "z_Highly Rural/ Rural"), 
                      URBCITY = c("Urb", "Rur"))

edu_map <- data.frame(EDUCATION = c("<High School", "z_High School", 
                                    "Bachelors", "Masters", "PhD"),
                      EDU = c("HS", "HS", "BS", "MS", "PhD"))

car_map <- data.frame(CAR_TYPE = c("Minivan", "z_SUV", "Sports Car", 
                                   "Van", "Panel Truck", "Pickup"),
                      CTYPE = c("MV", "SUV", "Sprt", "Van", 
                                "panTr", "pikTr"))

carU_map <- data.frame(CAR_USE = c("Commercial", "Private"), 
                       CAR_U = c("Comm", "Priv"))

my_theme <- function(){
    theme(panel.background = element_rect(fill = 'grey92', color = NA),
          panel.grid = element_line(color = 'white'),
          strip.background = element_rect(fill = "grey85", colour = NA),
          plot.margin = margin(.1, .1, .1, .1, "cm"),
          legend.position="none",
          strip.text = element_text(color = "black", face = "bold", size = rel(.8)),
          axis.title.x=element_blank(),
          axis.title.y=element_blank())
}

Data Pre-Processing

Several actions were taken to ensure the data was ready for analysis:

  • Cleaned up consistency of factors (e.g. z_F changed to f under SEX)
  • Shortened string names of factors for categorical variables (e.g. Professional to Pro)
  • Added a tag field that labels TARGET_FLAG values (1 = crash, 0 = safe)
  • Stripped out any non-standard characters from numerical entries (e.g. $ and ,)
# Repair data types, normalize factors, shorten entries, re-order columns
read_ins_factor <- function(filename){
  read_csv(filename) %>%
    mutate(INCOME = clr_dollar(INCOME),
           HOME_VAL = clr_dollar(HOME_VAL),
           BLUEBOOK = clr_dollar(BLUEBOOK),
           OLDCLAIM = clr_dollar(OLDCLAIM),
           REVOKED = as.factor(tolower(REVOKED)),
           PARENT1 = as.factor(tolower(PARENT1)),
           MSTATUS = as.factor(ifelse(MSTATUS == 'z_No', 'no', 'yes')),
           SEX = as.factor(ifelse(SEX == 'z_F', 'f', 'm')),
           RED_CAR = as.factor(RED_CAR),
           tag = ifelse(TARGET_FLAG == 1, 'crash', 'safe')) %>%
    left_join(job_map, by = "JOB") %>%
    left_join(urb_map, by = "URBANICITY") %>%
    left_join(edu_map, by = "EDUCATION") %>%
    left_join(car_map, by = "CAR_TYPE") %>%
    left_join(carU_map, by = "CAR_USE") %>%
    select(tag, TARGET_FLAG, TARGET_AMT, KIDSDRIV, AGE,
           HOMEKIDS, YOJ, TRAVTIME, TIF, CLM_FREQ,
           MVR_PTS, CAR_AGE, INCOME, HOME_VAL,
           BLUEBOOK, OLDCLAIM, PARENT1, MSTATUS,
           SEX, RED_CAR, REVOKED, URBCITY, CAR_U,
           EDU, CTYPE, EMPL) %>%
    rename(JOB = EMPL, URBANICITY = URBCITY, EDUCATION = EDU,
           CAR_TYPE = CTYPE, CAR_USE = CAR_U)
}

ins_url <- paste0("https://raw.githubusercontent.com/jbrnbrg/",
                  "msda-datasets/master/insurance_training_data.csv")
ins_ev_url <- paste0("https://raw.githubusercontent.com/jbrnbrg/",
                     "msda-datasets/master/insurance-evaluation-data.csv")

insurf <- read_ins_factor(ins_url)

stat_info <- describe(insurf) %>%  # psych::describe
  as.data.frame()

Data Exploration

Summary findings:

  • 8161 total records with 25 variables: 2 response, 13 numerical, 7 binary, 3 categorical
  • Many NAs: CAR_AGE (510), HOME_VAL (464), YOJ (454), INCOME (445), AGE (6); see the snippet below
  • Collinearity is present in at least one variable pair
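The per-variable NA counts above can be verified directly, e.g.:

# Count missing values per column, largest first
sort(colSums(is.na(insurf)), decreasing = TRUE)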

Data preview:

A preview of 4 random records of insurance_training_data.csv used to develop the model. As there are 26 columns (the added tag column is omitted here), the preview is broken into 3 parts:

| TARGET_FLAG | TARGET_AMT | KIDSDRIV | AGE | HOMEKIDS | YOJ | TRAVTIME | TIF | CLM_FREQ | MVR_PTS |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 46 | 0 | 11 | 13 | 1 | 3 | 6 |
| 0 | 0 | 2 | 45 | 3 | 14 | 28 | 11 | 0 | 3 |
| 0 | 0 | 0 | 42 | 0 | NA | 36 | 8 | 0 | 1 |
| 1 | 4085 | 0 | 28 | 1 | 9 | 32 | 7 | 3 | 6 |

knitr::kable(head(t2, n=4), format = 'markdown', booktabs = T) %>%
  row_spec(0, bold = T) %>%
  kable_styling(font_size = 5, position = "left", full_width = F)

| CAR_AGE | INCOME | HOME_VAL | BLUEBOOK | OLDCLAIM | PARENT1 | MSTATUS | SEX | RED_CAR | REVOKED |
|---|---|---|---|---|---|---|---|---|---|
| 15 | 104017 | 0 | 21610 | 28005 | no | no | m | no | no |
| 10 | 60769 | 218782 | 8310 | 0 | no | yes | m | no | no |
| 15 | 107562 | 281386 | 21600 | 0 | no | yes | f | no | no |
| 1 | NA | 0 | 24560 | 20778 | yes | no | f | no | yes |

knitr::kable(head(t3, n=4), format = 'markdown', booktabs = T) %>%
  row_spec(0, bold = T) %>%
  kable_styling(font_size = 5, position = "left", full_width = F)

| URBANICITY | CAR_USE | EDUCATION | CAR_TYPE | JOB |
|---|---|---|---|---|
| Urb | Comm | MS | Van | NA |
| Urb | Priv | BS | MV | Mgr |
| Urb | Priv | BS | SUV | Mgr |
| Urb | Comm | HS | SUV | BC |

Summary statistics:

The table below provides the summary statistics. Variables marked with an * are binary/categorical.
| variable | n | mean | sd | median | min | max | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|
| TARGET_FLAG | 8161 | 0 | 0 | 0 | 0 | 1 | 1 | -1 | 0 |
| TARGET_AMT | 8161 | 1504 | 4704 | 0 | 0 | 107586 | 9 | 112 | 52 |
| KIDSDRIV | 8161 | 0 | 1 | 0 | 0 | 4 | 3 | 12 | 0 |
| AGE | 8155 | 45 | 9 | 45 | 16 | 81 | 0 | 0 | 0 |
| HOMEKIDS | 8161 | 1 | 1 | 0 | 0 | 5 | 1 | 1 | 0 |
| YOJ | 7707 | 10 | 4 | 11 | 0 | 23 | -1 | 1 | 0 |
| TRAVTIME | 8161 | 33 | 16 | 33 | 5 | 142 | 0 | 1 | 0 |
| TIF | 8161 | 5 | 4 | 4 | 1 | 25 | 1 | 0 | 0 |
| CLM_FREQ | 8161 | 1 | 1 | 0 | 0 | 5 | 1 | 0 | 0 |
| MVR_PTS | 8161 | 2 | 2 | 1 | 0 | 13 | 1 | 1 | 0 |
| CAR_AGE | 7651 | 8 | 6 | 8 | -3 | 28 | 0 | -1 | 0 |
| INCOME | 7716 | 61898 | 47573 | 54028 | 0 | 367030 | 1 | 2 | 542 |
| HOME_VAL | 7697 | 154867 | 129124 | 161160 | 0 | 885282 | 0 | 0 | 1472 |
| BLUEBOOK | 8161 | 15710 | 8420 | 14440 | 1500 | 69740 | 1 | 1 | 93 |
| OLDCLAIM | 8161 | 4037 | 8777 | 0 | 0 | 57037 | 3 | 10 | 97 |
| PARENT1* | 8161 | 1 | 0 | 1 | 1 | 2 | 2 | 3 | 0 |
| MSTATUS* | 8161 | 2 | 0 | 2 | 1 | 2 | 0 | -2 | 0 |
| SEX* | 8161 | 1 | 0 | 1 | 1 | 2 | 0 | -2 | 0 |
| RED_CAR* | 8161 | 1 | 0 | 1 | 1 | 2 | 1 | -1 | 0 |
| REVOKED* | 8161 | 1 | 0 | 1 | 1 | 2 | 2 | 3 | 0 |
| URBANICITY* | 8161 | 2 | 0 | 2 | 1 | 2 | -1 | 0 | 0 |
| CAR_USE* | 8161 | 2 | 0 | 2 | 1 | 2 | -1 | -2 | 0 |
| EDUCATION* | 8161 | 2 | 1 | 2 | 1 | 4 | 1 | -1 | 0 |
| CAR_TYPE* | 8161 | 3 | 2 | 3 | 1 | 6 | 0 | -1 | 0 |
| JOB* | 7635 | 4 | 3 | 4 | 1 | 8 | 0 | -2 | 0 |

Continuous Variables:

The plots below use binomial smoothing to highlight how the probability of a crash changes over the range of each variable's values.

binomial_smooth <- function(...){
  # Via: http://ggplot2.tidyverse.org/reference/geom_smooth.html
  geom_smooth(method='glm',  method.args = list(family = 'binomial'), ...)
}

insurf %>% select(-c(17:26)) %>% 
  gather(-c(TARGET_AMT, TARGET_FLAG, tag), key = "var", value = "val") %>%
  ggplot(aes(x = val, y = TARGET_FLAG)) +
  geom_point(aes(color=tag),
             position=position_jitter(height=.3, width=0),
             alpha=.05) +
  binomial_smooth(formula = y ~ splines::ns(x, 2)) +
  facet_wrap(~ var, ncol =5, scales = "free_x") + my_theme()


The primary take-aways from the above visualization:

  • As CAR_AGE, HOME_VAL, INCOME, or TIF increases, the probability of a crash decreases.
  • MVR_PTS appears to be a good predictor of crashes.
  • The remaining variables appear to have a polynomial relationship; i.e., crash probability reaches a maximum or minimum somewhere between the low and high ends of the range before changing direction (see the sketch below).
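Such a curved relationship can be captured with a polynomial or spline term inside the glm formula. A minimal sketch (the choice of AGE and the quadratic degree are illustrative, not from the original analysis):

# A quadratic term lets crash probability rise and then fall with AGE;
# poly() cannot handle NAs, so drop them first
m_age2 <- glm(TARGET_FLAG ~ poly(AGE, 2),
              data = insurf %>% drop_na(AGE),
              family = "binomial")
summary(m_age2)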

Further, the visualization below provides histograms of each continuous variable (with a log-scaled y-axis), separated by the crash/safe tag. These plots support the above take-aways for the continuous variables:

insurf %>% select(-c(17:26)) %>% 
  gather(-c(TARGET_AMT, TARGET_FLAG, tag), key = "var", value = "val") %>%
  ggplot(aes(x = val, fill = tag, color = tag)) +
  geom_histogram(alpha=.4) + scale_y_log10() + 
  facet_wrap(~ var, ncol =5, scales = "free_x") + my_theme()

The above plot highlights the splits between crash/safe among the continuous variables. Note that as TRAVTIME, INCOME, HOMEKIDS, KIDSDRIV, MVR_PTS, and TIF increase, the presence of crashes decreases.

Categorical Variables:

The below visualization provides a way to review the density of the crash/safe populations among the categorical variables in the data:

# binary and categorical
insurf %>%
  select(names(insurf)[c(1:3,17:26)]) %>% ungroup() %>% 
  gather(-c(tag, TARGET_FLAG, TARGET_AMT), key = "var", value = "val") %>%
  ggplot(aes(x = val, y=tag, color = tag)) + 
  geom_jitter(alpha = .1) +
  facet_wrap(~ var, scales = "free_x", ncol =5) + my_theme()

The above visualization suggests an apparently higher presence of crashes for the following groups:

  • CAR_TYPE = SUV
  • EDUCATION = High School (HS)
  • JOB = Blue Collar (BC)
  • PARENT1 = non-single parent households (no)
  • URBANICITY = Urban (Urb)

Each of these could simply reflect population sizes, however. Next, I've provided violin plots of TARGET_AMT for each of the categorical variables, restricted to customers who crashed. The width of each violin is proportional to the sample size:

# binary and categorical
insurf %>% filter(TARGET_FLAG != 0) %>% 
  select(names(insurf)[c(1:3,17:26)]) %>% ungroup() %>% 
  gather(-c(tag, TARGET_FLAG, TARGET_AMT), key = "var", value = "val") %>%
  ggplot(aes(x = val, y=TARGET_AMT, fill = var)) + 
  geom_violin(alpha = .5, 
              draw_quantiles = c(0.25, 0.5, 0.75), 
              adjust = .8, scale = "count") + 
  scale_y_log10(labels = scales::comma) +
  facet_wrap(~ var, scales = "free_y", ncol =5) + coord_flip() + my_theme()

Model Build

First I will create and select a logistic regression to predict whether a crash is likely, and then create and select a linear regression to predict crash cost.

Logistic

I created 3 glm models, m1, m2, and m3, and compared them on AIC. I performed an additional review comparing the results of their Hosmer-Lemeshow goodness-of-fit tests for the binary response TARGET_FLAG, which I've included (commented out) in the code.

insur.no.na <- insurf %>% na.omit()

# Model 1: All variables included ******************************
m1 <- glm(TARGET_FLAG ~  KIDSDRIV + AGE + HOMEKIDS + YOJ + TRAVTIME +
            TIF + CLM_FREQ + MVR_PTS + CAR_AGE + INCOME + HOME_VAL +
            BLUEBOOK + OLDCLAIM + PARENT1 + MSTATUS + SEX + RED_CAR + 
            REVOKED + URBANICITY + CAR_USE + EDUCATION + CAR_TYPE + 
            JOB, data = insur.no.na, family = "binomial")

# Model 2: uses variables from: car::Anova(MASS::stepAIC(m1)) ***
m2 <- glm(TARGET_FLAG ~  KIDSDRIV + TRAVTIME + TIF + CLM_FREQ + MVR_PTS + 
            INCOME + HOME_VAL + BLUEBOOK + OLDCLAIM + PARENT1 + MSTATUS + 
            REVOKED + URBANICITY + CAR_USE + EDUCATION + CAR_TYPE + JOB,
          data = insur.no.na, family = "binomial")

# Model 3: is m2 w/out HOME_VAL ********************************
m3 <- glm(TARGET_FLAG ~  KIDSDRIV + TRAVTIME + TIF + CLM_FREQ + MVR_PTS + 
            INCOME + BLUEBOOK + OLDCLAIM + PARENT1 + MSTATUS + 
            REVOKED + URBANICITY + CAR_USE + EDUCATION + CAR_TYPE + JOB,
          data = insur.no.na, family = "binomial")
# hoslem.test(insur.no.na$TARGET_FLAG, fitted(m1))
# hoslem.test(insur.no.na$TARGET_FLAG, fitted(m2))
# hoslem.test(insur.no.na$TARGET_FLAG, fitted(m3))
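# Note: for the Hosmer-Lemeshow test, a small p-value is evidence of
# poor fit; a large p-value is consistent with adequate calibration.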

Logistic Eval/Selection:

The summaries of each of the models created are shown below:

stargazer(m1, m2, m3,
          type = 'html',  # html so the table renders in this post
          column.sep.width = "1pt",
          single.row = T,
          omit.stat = c('ser'),
          header = F)

As shown above, m2 has the lowest AIC. Below, I have completed the diagnostics on each of the models; the ROC curves and the table rows below are in model order (m1, m2, m3):

As shown above, there is little difference between the three ROC curves.

| model | AIC | accuracy | class_err | precision | sensitivity | specificity | f1_score | AUC |
|---|---|---|---|---|---|---|---|---|
| m1 | 5434.272 | 0.7932175 | 0.2067825 | 0.6682600 | 0.4363296 | 0.9218996 | 0.5279456 | 0.8186477 |
| m2 | 5429.962 | 0.7943755 | 0.2056245 | 0.6711153 | 0.4394507 | 0.9223498 | 0.5311203 | 0.8179944 |
| m3 | 5439.371 | 0.7932175 | 0.2067825 | 0.6702128 | 0.4325843 | 0.9232501 | 0.5257967 | 0.8170804 |
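The code that produced these diagnostics isn't echoed above. A minimal sketch of how such metrics could be computed for each model (the helper name and the 0.5 cutoff are my assumptions):

# Classification metrics for a fitted binomial glm at a 0.5 cutoff
model_metrics <- function(model, data, cutoff = 0.5){
  probs <- predict(model, type = "response")
  preds <- ifelse(probs > cutoff, 1, 0)
  tp <- sum(preds == 1 & data$TARGET_FLAG == 1)
  tn <- sum(preds == 0 & data$TARGET_FLAG == 0)
  fp <- sum(preds == 1 & data$TARGET_FLAG == 0)
  fn <- sum(preds == 0 & data$TARGET_FLAG == 1)
  prec <- tp / (tp + fp)
  sens <- tp / (tp + fn)
  data.frame(AIC = AIC(model),
             accuracy = (tp + tn) / nrow(data),
             class_err = (fp + fn) / nrow(data),
             precision = prec,
             sensitivity = sens,
             specificity = tn / (tn + fp),
             f1_score = 2 * prec * sens / (prec + sens),
             AUC = as.numeric(pROC::auc(data$TARGET_FLAG, probs)))
}
# e.g. model_metrics(m2, insur.no.na)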

I note that there's little difference between the 3 sets of scores. I choose m2 because it has the lowest AIC and slightly fewer variables than the full set:

## glm(formula = TARGET_FLAG ~ KIDSDRIV + TRAVTIME + TIF + CLM_FREQ + 
##     MVR_PTS + INCOME + HOME_VAL + BLUEBOOK + OLDCLAIM + PARENT1 + 
##     MSTATUS + REVOKED + URBANICITY + CAR_USE + EDUCATION + CAR_TYPE + 
##     JOB, family = "binomial", data = insur.no.na)
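The evaluation file loaded earlier via ins_ev_url is not scored in this section. A minimal sketch of how m2 could be applied to it (assuming the evaluation file shares the training columns, with blank responses; rows with missing predictors return NA):

ins_eval <- read_ins_factor(ins_ev_url)
# Predicted crash probabilities for the evaluation customers
eval_probs <- predict(m2, newdata = ins_eval, type = "response")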

Linear Model:

I created 3 lm models, linM1, linM2, and linM3, and selected among them based on R^2 and residual analysis. Each linear model was refined using MASS::stepAIC, as sketched below.
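The construction code for these models isn't echoed here. A minimal sketch, assuming each model starts from all predictors and they differ in the response transformation (the starting formulas are my assumption; the third model's transformation is only partly legible in the table below):

# Start from all predictors (excluding the responses and the tag label);
# MASS::stepAIC then prunes terms by AIC
linM1 <- lm(TARGET_AMT ~ . - TARGET_FLAG - tag, data = insur.no.na)
linM2 <- lm(log(TARGET_AMT + 1) ~ . - TARGET_FLAG - tag - TARGET_AMT,
            data = insur.no.na)
linM1_step <- MASS::stepAIC(linM1, trace = FALSE)
linM2_step <- MASS::stepAIC(linM2, trace = FALSE)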

Linear Eval/Selection

After several iterations and reviews I settled on the second linear model, linM2_step. While its R^2 was the best, I use this model with some reservation given the results of the residual analysis below: the residuals do not appear normally distributed.

stargazer(linM1_step, linM2_step, linM3_step,
          type = 'html',
          column.sep.width = "1pt",
          single.row = T,
          omit.stat = c('ser'),
          header = F)
Dependent variable: (1) TARGET_AMT; (2) log(TARGET_AMT + 1); (3) (TARGET_AMT + 1)

| | (1) | (2) | (3) |
|---|---|---|---|
| KIDSDRIV | 217.739* (113.522) | 0.386*** (0.083) | -0.048*** (0.010) |
| TRAVTIME | 11.061*** (3.608) | 0.018*** (0.003) | -0.002*** (0.0003) |
| TIF | -44.805*** (13.673) | -0.063*** (0.010) | 0.007*** (0.001) |
| CLM_FREQ | | 0.264*** (0.045) | -0.033*** (0.005) |
| MVR_PTS | 184.040*** (27.039) | 0.189*** (0.021) | -0.022*** (0.003) |
| CAR_AGE | -27.544* (14.399) | | |
| INCOME | | -0.00000** (0.00000) | 0.00000*** (0.00000) |
| HOME_VAL | -0.002** (0.001) | -0.00000*** (0.00000) | |
| BLUEBOOK | | -0.00002*** (0.00001) | 0.00000*** (0.00000) |
| OLDCLAIM | | -0.00002*** (0.00001) | 0.00000*** (0.00000) |
| PARENT1yes | 528.163*** (195.907) | 0.763*** (0.144) | -0.088*** (0.017) |
| MSTATUSyes | -478.386*** (154.050) | -0.500*** (0.117) | 0.079*** (0.012) |
| SEXm | 302.750* (161.942) | 0.251* (0.148) | |
| RED_CARyes | | -0.268** (0.125) | |
| REVOKEDyes | 443.978** (173.335) | 1.175*** (0.143) | -0.143*** (0.017) |
| URBANICITYUrb | 1,678.558*** (149.773) | 2.439*** (0.113) | -0.290*** (0.013) |
| CAR_USEPriv | -693.703*** (174.442) | -1.079*** (0.128) | 0.128*** (0.015) |
| EDUCATIONHS | 215.761 (168.255) | 0.544*** (0.115) | -0.064*** (0.014) |
| EDUCATIONMS | 50.920 (249.731) | 0.015 (0.176) | -0.001 (0.021) |
| EDUCATIONPhD | 730.822** (346.171) | 0.588** (0.255) | -0.066** (0.030) |
| CAR_TYPEpanTr | 691.566** (297.048) | 0.645*** (0.239) | -0.079*** (0.027) |
| CAR_TYPEpikTr | 415.945** (185.211) | 0.591*** (0.137) | -0.069*** (0.016) |
| CAR_TYPESprt | 1,195.211*** (221.704) | 1.288*** (0.172) | -0.145*** (0.017) |
| CAR_TYPESUV | 850.150*** (179.344) | 0.933*** (0.142) | -0.103*** (0.013) |
| CAR_TYPEVan | 566.810** (234.540) | 0.559*** (0.177) | -0.068*** (0.020) |
| JOBCler | -139.469 (206.070) | 0.236 (0.153) | -0.031* (0.018) |
| JOBDoc | -1,453.438*** (492.990) | -1.038*** (0.363) | 0.119*** (0.043) |
| JOBHM | -199.516 (269.640) | -0.067 (0.209) | 0.007 (0.024) |
| JOBLaw | -315.007 (335.933) | -0.206 (0.247) | 0.024 (0.029) |
| JOBMgr | -1,210.374*** (250.123) | -1.187*** (0.184) | 0.139*** (0.022) |
| JOBPro | -105.704 (228.042) | -0.161 (0.168) | 0.023 (0.020) |
| JOBStu | -372.302 (243.145) | -0.070 (0.184) | -0.006 (0.021) |
| Constant | 227.456 (342.076) | 0.272 (0.284) | 0.956*** (0.029) |
| Observations | 6,045 | 6,045 | 6,045 |
| R2 | 0.075 | 0.235 | 0.235 |
| Adjusted R2 | 0.071 | 0.231 | 0.232 |
| F Statistic | 18.165*** (df = 27; 6017) | 59.570*** (df = 31; 6013) | 66.153*** (df = 28; 6016) |

Note: *p<0.1; **p<0.05; ***p<0.01

The plots below suggest that my linear model leaves some underlying relationships unaddressed, though its predictive power remains. Further transformation and training would likely improve the results.

par(mfrow=c(1,2))
plot(linM2_step, which = 2)  # normal Q-Q plot of standardized residuals
hist(linM2_step$residuals)

par(mfrow=c(1,1))
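As a possible next step (my suggestion, not part of the original analysis), a Box-Cox search over power transformations of the response could guide that further transformation; the predictor subset here is illustrative only:

# Profile log-likelihood over lambda for (TARGET_AMT + 1)^lambda;
# the +1 keeps the response strictly positive, as Box-Cox requires
bc <- MASS::boxcox(lm(I(TARGET_AMT + 1) ~ MVR_PTS + CLM_FREQ + URBANICITY,
                      data = insur.no.na),
                   lambda = seq(-1, 1, by = 0.05))
bc$x[which.max(bc$y)]  # lambda with the highest profile log-likelihood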