Tag: modeling
Multilevel data is data that includes repeated measures of the same subjects or variables over some period of time, e.g.:
A five-year study on the test scores of students grouped by cohorts and classes across a state’s K-12 schools
Patient satisfaction of care grouped by attending doctors and their respective practices
Police use of force incidents by race of suspect grouped by precinct, patrol, and arresting officer
Analysts can encounter data of this type in just about any industry that produces data, and the grouping structure must be fully understood before any exploration, modeling, or simulation can produce usable insight.
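To make the idea concrete, here is a minimal simulation sketch (in Python with numpy; the class sizes, effect sizes, and variable names are all hypothetical) of what a grouping structure does to the data: each group gets its own random intercept, so the spread within a group is much smaller than the spread across the whole pooled sample.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical structure: 4 classes, 25 students each.
# Each class draws its own random intercept (the group effect),
# and individual student noise is added on top of it.
n_classes, n_students = 4, 25
class_effects = rng.normal(0, 10, size=n_classes)      # between-class variation
scores = np.concatenate([
    70 + effect + rng.normal(0, 5, size=n_students)    # within-class variation
    for effect in class_effects
])
class_ids = np.repeat(np.arange(n_classes), n_students)

# Pooling everything treats all variation as one lump; the average
# within-class spread is noticeably smaller than the overall spread.
overall_sd = scores.std()
within_sd = np.mean([scores[class_ids == c].std() for c in range(n_classes)])
print(overall_sd, within_sd)
```

A mixed-effect model exploits exactly this: it estimates the within-group and between-group variation separately instead of lumping them together.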
This post is going to follow along with, and expand on, a well-known tutorial on Mixed Effect Linear Models by Bodo Winter, a lecturer in Cognitive Linguistics at the University of Birmingham, Dept. of English Language and Applied Linguistics.
While this tutorial will generally follow Winter's, I have added additional detail for EDA, expanded on the code and the errors encountered along the way, and added notes from many other sources on Mixed Effect Models (see the Sources section at the end for links) - let's get started!
In this post I’m going to perform simplified bottom-up EDA, model development, prediction, and model evaluation using a public data set that contains data for houses sold in Ames, Iowa. This post essentially revisits a previous analysis I performed in a group for my MS degree program at CUNY. The main differences here are that my analysis will be abbreviated to illustrate my skillset, I’ll be using python instead of R, and I will be relying on python’s fantastic scikit-learn library to perform the regressions.
Introduction
In this post I will be creating a predictive model to identify fraudulent credit card transactions on a public data set from kaggle. Along the way, I will be reviewing some of the functionality of R’s gbm package for predictive modelling.
Data Overview
The data set contains 280K+ records of credit card transactions from a two-week period, in which a small percentage of transactions have been labeled as fraudulent.
Purpose
Today I review a data set containing information on approximately 8,000 customers of an insurance company. Each record contains two response variables that indicate whether a customer was in a car crash and, if so, how much that crash cost.
library(tidyverse)
library(stargazer)
library(GGally)
library(kableExtra)
library(ResourceSelection)
library(RCurl)
library(pROC)
clr_dollar <- function(x){
# cleans out commas and $ of string-formatted currency
return(as.numeric(gsub('[$,]','',x)))
}
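For readers following along in Python rather than R, a rough analogue of the clr_dollar helper above might look like this (a hypothetical translation, not from the original post):

```python
import re

def clr_dollar(x):
    """Strip commas and dollar signs from a string-formatted currency value."""
    return float(re.sub(r"[$,]", "", x))

print(clr_dollar("$1,234.56"))  # → 1234.56
```

Like the R version, it simply removes the `$` and `,` characters and coerces the remainder to a number.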
# Mappings to clean-up data entries for easier manipulation and analysis
job_map <- data.
In this post I will pre-process, explore, transform, and model data from baseball team seasons from 1871 to 2006 inclusive (stats adjusted to match a 162-game season). Feature additions and transformations can improve the predictive power of the resulting models, and in this post I’ll employ a Box-Cox transformation to achieve this.
The goal is to create a model that can predict baseball team wins based on the performance metrics captured in the training dataset moneyball-training-data.
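As a quick sketch of the Box-Cox transformation mentioned above (written directly in numpy rather than via a library; the sample values are purely illustrative), the transform raises each strictly positive value to a power λ and rescales, reducing to the natural log as λ approaches zero:

```python
import numpy as np

def box_cox(x, lam):
    """Box-Cox transform: (x**lam - 1)/lam for lam != 0, log(x) in the limit lam -> 0."""
    x = np.asarray(x, dtype=float)
    if lam == 0:
        return np.log(x)
    return (x ** lam - 1.0) / lam

wins = np.array([70.0, 81.0, 95.0, 102.0])  # illustrative, strictly positive values
print(box_cox(wins, 0.5))
# For lam near 0 the transform approaches the natural log:
print(np.allclose(box_cox(wins, 1e-8), np.log(wins), atol=1e-5))
```

In practice λ is chosen to make the transformed variable as close to normal as possible (e.g. by maximum likelihood), rather than fixed by hand as here.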