jbrnbrg

Tag: eda

Failure-to-Deliver Data via SEC and the GameStop Stock

The SEC’s fails-to-deliver data, per the SEC website, represents the “aggregate net balance of [equity] shares that failed to be delivered as of a particular settlement date.” The SEC website further clarifies what these values represent: “The figure is not a daily amount of fails, but a combined figure that includes both new fails on the reporting day as well as existing fails. In other words, these numbers reflect aggregate fails as of a specific point in time, and may have little or no relationship to yesterday’s aggregate fails.”
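To make the "combined figure" point concrete, here is a minimal Python sketch. It assumes the reported balance evolves as the previous balance plus new fails minus fails that settled; the function name `next_balance` and all the numbers are hypothetical and are not taken from SEC data.

```python
def next_balance(prev_balance, new_fails, settled_fails):
    """Aggregate fails balance for a day, under the assumed identity:
    previous balance + new fails - fails that settled."""
    return prev_balance + new_fails - settled_fails


# Two hypothetical days that start from the same balance of 1,000 shares.
quiet_day = next_balance(1000, new_fails=0, settled_fails=100)
busy_day = next_balance(1000, new_fails=900, settled_fails=1000)

# Both days report the same aggregate balance despite very different activity.
same_reported_balance = quiet_day == busy_day
```

Under this assumption the reported balance alone cannot tell you how many fails were new on a given day, which is why the day-over-day change should not be read as a daily count of fails.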

Basic EDA for Multilevel Data in R

Multilevel data is data that includes repeated measures of the same subjects or variables over some period of time, e.g.: a five-year study on the test scores of students grouped by cohorts and classes across a state’s K-12 schools; patient satisfaction of care grouped by attending doctors and their respective practices; or police use-of-force incidents by race of suspect grouped by precinct, patrol, and arresting officer. Analysts can encounter data of this type in just about any industry that produces data, and the grouping structure must be fully understood in order to properly explore, model, and simulate the data before drawing any usable insight from it.

StreetEasy Neighborhood Rentals & CrossTalk

At this point it’s pretty well established that the pandemic has had a tremendous impact on human life around the world and particularly within New York City’s five boroughs. At the time of this post, over 10% of the USA’s 209K+ deaths have come from NYC alone (23,852). The death toll from the pandemic is staggering on its own - and continues to shock in its reach - but the addition of the shutdown (and continued restrictions - even in Phase 4 of the reopening) has had a compounding effect on all aspects of socioeconomic activity in the city.

Equity and 311 Heat & Hot Water Service Requests

In this post I’ll be investigating equity between different census tracts in NYC based on economic measures and the open-to-close service request (SR) duration for 311 heat/hot water complaints.

Census Tract Granularity

If you’ve checked out my previous posts, you know I’ve done a bunch of work with NYC’s OpenData at the ZIP-code level. Today I’m going to go a step further and obtain NYC data with latitude and longitude coordinates and merge it with census data by census tract from the 5-year American Community Survey (ACS) from 2016.

Predicting House Sale Price with the Ames Housing Data

In this post I’m going to perform simplified bottom-up EDA, model development, prediction, and model evaluation using a public data set that contains data for houses that sold in Ames, Iowa. This post essentially revisits a previous analysis I performed in a group for my MS degree program at CUNY. The main differences here are that my analysis will be abbreviated to illustrate my skillset, I’ll be using Python instead of R, and I’ll be relying on Python’s fantastic scikit-learn library to perform the regressions.

Predicting Car Crashes with Insurance Data

Purpose

Today I review a data set containing information on approximately 8,000 customers of an insurance company. Each record contains two response variables that indicate whether a customer was in a car crash or not and how much said car-crash cost.

    library(tidyverse)
    library(stargazer)
    library(GGally)
    library(kableExtra)
    library(ResourceSelection)
    library(RCurl)
    library(pROC)

    clr_dollar <- function(x){
      # cleans out commas and $ of string-formatted currency
      return(as.numeric(gsub('[$,]','',x)))
    }

    # Mappings to clean-up data entries for easier manipulation and analysis
    job_map <- data.

Multiple Linear Regression Modelling; Money Ball

In this post I will pre-process, explore, transform, and model data from baseball team seasons from 1871-2006 inclusive (stats adjusted to match a 162-game season). Feature additions and transformations can improve the predictive power of a model, and in this post I’ll employ a Box-Cox transformation to achieve this. The goal is to create a model that can predict baseball team wins based on the performance metrics captured in the training dataset moneyball-training-data.
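For readers unfamiliar with the Box-Cox transformation, here is a minimal Python sketch of the standard one-parameter form: (x^λ - 1) / λ for λ ≠ 0, and ln(x) for λ = 0. This is an illustration only and not the code from the post; the function name `box_cox`, the sample values, and the choice of λ = 0.5 are all hypothetical.

```python
import math


def box_cox(values, lam):
    """Apply the one-parameter Box-Cox transform to strictly positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        # The lambda -> 0 limit of (x**lam - 1) / lam is the natural log.
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]


# Hypothetical, made-up predictor values (not from the moneyball data).
sample = [4.0, 9.0, 16.0, 25.0]
transformed = box_cox(sample, 0.5)
```

In practice λ is usually estimated from the data (for example by maximum likelihood) rather than picked by hand; the fixed value here only keeps the sketch short.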