Tag: eda
The SEC’s fails-to-deliver data, per the SEC website, represents the “aggregate net balance of [equity] shares that failed to be delivered as of a particular settlement date.” Further, the SEC website clarifies what these values represent:
The figure is not a daily amount of fails, but a combined figure that includes both new fails on the reporting day as well as existing fails. In other words, these numbers reflect aggregate fails as of a specific point in time, and may have little or no relationship to yesterday’s aggregate fails.
Multilevel data includes repeated measures of the same subjects or variables over some period of time, e.g.:
A five-year study on the test scores of students grouped by cohorts and classes across a state’s K-12 schools
Patient satisfaction of care grouped by attending doctors and their respective practices
Police use of force incidents by race of suspect grouped by precinct, patrol, and arresting officer
Analysts can encounter data of this type in just about any industry that produces data, and the grouping structure must be fully understood before the data can be properly explored, modeled, and simulated to produce any usable insight.
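The grouping structure above can be sketched with a toy example. The schools, classes, and scores below are hypothetical, invented purely to illustrate the nesting; summarizing at each level of the hierarchy is a first EDA step for multilevel data:

```python
import pandas as pd

# Hypothetical multilevel data: student test scores nested within
# classes, which are nested within schools
df = pd.DataFrame({
    "school":  ["A", "A", "A", "B", "B", "B"],
    "class":   ["A1", "A1", "A2", "B1", "B1", "B2"],
    "student": [1, 2, 3, 4, 5, 6],
    "score":   [82, 90, 75, 88, 70, 95],
})

# Group means at each level of the hierarchy expose the nesting
school_means = df.groupby("school")["score"].mean()
class_means = df.groupby(["school", "class"])["score"].mean()
print(school_means)
print(class_means)
```

Comparing variation between group means at each level (schools vs. classes within schools) is what motivates multilevel modeling rather than pooling all observations together.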
At this point it’s pretty well-established that the pandemic has had a tremendous impact on human life around the world, and particularly within New York City’s five boroughs. At the time of this post, NYC alone (23,852 deaths) accounts for over 10% of the USA’s 209K+ deaths.
The death toll from the pandemic is staggering on its own - and continues to shock in its reach - but the addition of the shutdown (and continued restrictions, even in Phase 4 of the reopening) has had a compounding effect on all aspects of socioeconomic activity in the city.
In this post I’ll be investigating equity between different census tracts in NYC based on economic measures and the open-to-close service request (SR) duration for 311 heat/hot water complaints.
Census Tract Granularity
If you’ve checked out my previous posts, you know I’ve done a bunch of work with NYC’s OpenData at the ZIP-code level. Today I’m going to go a step further: obtain NYC data with latitude and longitude coordinates and merge it, by census tract, with data from the 2016 5-year American Community Survey (ACS).
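The merge step above hinges on mapping each (longitude, latitude) pair to a census tract. A minimal sketch of that matching logic follows; the tract GEOIDs and bounding boxes here are hypothetical placeholders, since real tract boundaries are polygons from TIGER/Line shapefiles and would typically be matched with a spatial join (e.g., via geopandas):

```python
# Hypothetical tract lookup: each census tract approximated by a
# (min_lon, min_lat, max_lon, max_lat) bounding box. The GEOIDs
# below are made up for illustration.
TRACTS = {
    "36061000100": (-74.02, 40.70, -74.00, 40.72),
    "36061000200": (-74.00, 40.70, -73.98, 40.72),
}

def assign_tract(lon, lat):
    """Return the GEOID of the tract whose box contains the point, else None."""
    for geoid, (min_lon, min_lat, max_lon, max_lat) in TRACTS.items():
        if min_lon <= lon < max_lon and min_lat <= lat < max_lat:
            return geoid
    return None

print(assign_tract(-74.01, 40.71))
```

Once every record carries a tract GEOID, joining against ACS tract-level tables is an ordinary key-based merge.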
In this post I’m going to perform simplified bottom-up EDA, model development, prediction, and model evaluation using a public data set on houses sold in Ames, Iowa. This post essentially revisits a previous analysis I performed in a group for my MS degree program at CUNY. The main differences here are that the analysis is abbreviated to illustrate my skillset, I’ll be using python instead of R, and I’ll be relying on python’s fantastic scikit-learn library to perform the regressions.
Purpose
Today I review a data set containing information on approximately 8,000 customers of an insurance company. Each record contains two response variables: whether the customer was in a car crash, and how much that crash cost.
library(tidyverse)
library(stargazer)
library(GGally)
library(kableExtra)
library(ResourceSelection)
library(RCurl)
library(pROC)
clr_dollar <- function(x){
# cleans out commas and $ of string-formatted currency
return(as.numeric(gsub('[$,]','',x)))
}
# Mappings to clean-up data entries for easier manipulation and analysis
job_map <- data.
In this post I will pre-process, explore, transform, and model data from baseball team seasons from 1871 to 2006, inclusive (stats adjusted to match a 162-game season). Feature additions and transformations can improve the predictive power of the resulting models, and in this post I’ll employ a Box-Cox transformation to achieve this.
The goal is to create a model that can predict baseball team wins based on the performance metrics captured in the training dataset moneyball-training-data.
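A Box-Cox transformation finds a power transform that makes a strictly positive, skewed variable more nearly normal. A minimal sketch follows; the right-skewed sample here is synthetic (lognormal), standing in for a skewed counting stat from the team data, and uses scipy's implementation rather than the R tooling of the original analysis:

```python
import numpy as np
from scipy import stats

# Hypothetical right-skewed feature; Box-Cox requires strictly
# positive values, which a lognormal sample guarantees
rng = np.random.default_rng(7)
skewed = rng.lognormal(mean=3.0, sigma=0.8, size=1000)

# boxcox estimates the power parameter lambda by maximum likelihood
transformed, lam = stats.boxcox(skewed)

print(f"lambda: {lam:.3f}")
print(f"skew before: {stats.skew(skewed):.2f}, after: {stats.skew(transformed):.2f}")
```

For lognormal-like data the fitted lambda lands near zero (the log transform), and the post-transform skewness drops toward zero, which is exactly the behavior that helps a linear model of team wins.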