Tag: modeling
Multilevel data is data that includes repeated measures of the same subjects or variables over some period of time, e.g.:
A five-year study on the test scores of students grouped by cohorts and classes across a state’s K-12 schools
Patient satisfaction of care grouped by attending doctors and their respective practices
Police use of force incidents by race of suspect grouped by precinct, patrol, and arresting officer
Analysts can encounter data of this type in just about any industry that produces data, and the grouping structure must be fully understood before any exploration, modeling, or simulation can produce usable insight.
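To make the idea concrete, here is a minimal simulation sketch (in Python with numpy; the class sizes, effect sizes, and variable names are all hypothetical) of what a grouping structure does to the data: each group gets its own random intercept, so the spread within a group is much smaller than the spread across the whole pooled sample.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical structure: 4 classes, 25 students each.
# Each class draws its own random intercept (the group effect),
# and individual student noise is added on top of it.
n_classes, n_students = 4, 25
class_effects = rng.normal(0, 10, size=n_classes)      # between-class variation
scores = np.concatenate([
    70 + effect + rng.normal(0, 5, size=n_students)    # within-class variation
    for effect in class_effects
])
class_ids = np.repeat(np.arange(n_classes), n_students)

# Pooling everything treats all variation as one lump; the average
# within-class spread is noticeably smaller than the overall spread.
overall_sd = scores.std()
within_sd = np.mean([scores[class_ids == c].std() for c in range(n_classes)])
print(overall_sd, within_sd)
```

A mixed-effect model exploits exactly this: it estimates the within-group and between-group variation separately instead of lumping them together.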
This post is going to follow along with, and expand on, a well-known tutorial on Mixed Effect Linear Models by Bodo Winter, a lecturer in Cognitive Linguistics at the University of Birmingham, Dept. of English Language and Applied Linguistics.
While this tutorial will generally follow Winter's, I have added additional detail for EDA, expanded on the code and the errors encountered along the way, and added notes from many other sources on Mixed Effect Models (see the Sources section at the end for links) - let's get started!
In this post I’m going to perform simplified bottom-up EDA, model development, prediction, and model evaluation using a public data set that contains data for houses sold in Ames, Iowa. This post essentially revisits a previous analysis I performed in a group for my MS degree program at CUNY. The main differences here are that my analysis will be abbreviated to illustrate my skillset, I’ll be using python instead of R, and I will be relying on python’s fantastic scikit-learn library to perform the regressions.
Introduction
In this post I will be creating a predictive model to identify fraudulent credit card transactions on a public data set from kaggle. Along the way, I will be reviewing some of the functionality of R’s gbm package for predictive modelling.
Data Overview
The data set contains 280K+ records of credit card transactions from a two-week period, in which a small percentage of transactions have been labeled as fraudulent.
Purpose
Today I review a data set containing information on approximately 8,000 customers of an insurance company. Each record contains two response variables that indicate whether a customer was in a car crash and, if so, how much that crash cost.
library(tidyverse)
library(stargazer)
library(GGally)
library(kableExtra)
library(ResourceSelection)
library(RCurl)
library(pROC)
clr_dollar <- function(x){
# cleans out commas and $ of string-formatted currency
return(as.numeric(gsub('[$,]','',x)))
}
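For readers following along in Python rather than R, a rough analogue of the clr_dollar helper above might look like this (a hypothetical translation, not from the original post):

```python
import re

def clr_dollar(x):
    """Strip commas and dollar signs from a string-formatted currency value."""
    return float(re.sub(r"[$,]", "", x))

print(clr_dollar("$1,234.56"))  # → 1234.56
```

Like the R version, it simply removes the `$` and `,` characters and coerces the remainder to a number.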
# Mappings to clean-up data entries for easier manipulation and analysis
job_map <- data.
In this post I will pre-process, explore, transform, and model data from baseball team seasons from 1871 to 2006 inclusive (stats adjusted to match a 162-game season). Feature additions and transformations can improve the predictive power of the resulting models, and in this post I’ll employ a Box-Cox transformation to achieve this.
The goal is to create a model that can predict baseball team wins based on the performance metrics captured in the training dataset moneyball-training-data.
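As a quick sketch of the Box-Cox transformation mentioned above (written directly in numpy rather than via a library; the sample values are purely illustrative), the transform raises each strictly positive value to a power λ and rescales, reducing to the natural log as λ approaches zero:

```python
import numpy as np

def box_cox(x, lam):
    """Box-Cox transform: (x**lam - 1)/lam for lam != 0, log(x) in the limit lam -> 0."""
    x = np.asarray(x, dtype=float)
    if lam == 0:
        return np.log(x)
    return (x ** lam - 1.0) / lam

wins = np.array([70.0, 81.0, 95.0, 102.0])  # illustrative, strictly positive values
print(box_cox(wins, 0.5))
# For lam near 0 the transform approaches the natural log:
print(np.allclose(box_cox(wins, 1e-8), np.log(wins), atol=1e-5))
```

In practice λ is chosen to make the transformed variable as close to normal as possible (e.g. by maximum likelihood), rather than fixed by hand as here.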