Web Scraping for Public Health Data
Part of the new normal for many data analytics professionals is being able to obtain Covid-19 data by geographic location. To address this need, today’s post is going to walk through how to scrape the repo for the NYC Department of Health and Mental Hygiene using the flexible rvest
library to to create a time-series of positive cases by NYC zip code.
The Target Data
As a resident of NYC, the former epicenter of the outbreak, I’ve been keen on getting some time series data on Covid 19 to see how things are coming along.
To that end, I’ve been pulling NYC Health’s coronavirus-data
since they began sharing it and I’ve seen how it’s evolved over time. E.g. at first, tests-by-zcta.csv
was the only zip-code-focused and later they added data-by-modzcta.csv
; essentially the same data but with different field names and additional columns.
In order to create the most comprehensive dataset with the longest time-frame available, I would have to combine these files over time referencing the historical commits to their respective repos.
Web Scraping
library(tidyverse) # data manipulation
library(glue) # string manipulation
library(rvest) # webscraping
library(xml2) # dependency for rvest
library(tsibble) # time-series aggregation
## Other libraries in use but not downloaded here:
# plyr::rbind.fill() # stacking dfs with only some matching columns
# tidyr::fill() # for days when data isn't collected
# lubridate:: # all things date formatting
A key component of web-scrapping is being able to identify different parts of a webpage using the DOM (Document Object Model). I do this with the help of the SelectorGadget for Chrome, going to the web page of interest, enabling SelectorGadget, and clicking on the components of interest. Below I’ve included a screen cap of the ".js-navigation-open"
component that contains the link to the historical files.
After a few minutes playing around, I was able to find the page elements containing the data I needed. This stuff isn’t straight-forward so it’s probably best to start with the example given in the rvest
documentation: it’s very good and worth a read if you want to understand what’s going on under the hood before diving in.
The first step will be to obtain the URLs to all historical files and the second step will be to download the data stored in each of those URLs.
The File URLs
base_url = "https://raw.githubusercontent.com/nychealth/coronavirus-data/"
historical_urls_pg1 <- glue("https://github.com/nychealth/coronavirus-data/",
"commits/master/tests-by-zcta.csv")
second_page_url <- function(url){
# expects the URL pointing to the file's commit history on git
# pull the last button from the page (Older)
pg <- xml2::read_html(url)
paginate <- rvest::html_nodes(pg, ".paginate-container , .BtnGroup-item")
rvest::html_attr(paginate[[length(paginate)]], "href")
}
# it would be better to make this such that it can handle any page length.
historical_urls_pg2 <- second_page_url(historical_urls_pg1)
historical_urls_new <- glue("https://github.com/nychealth/coronavirus-data/",
"commits/master/data-by-modzcta.csv")
c19_url <- function(git_code, closer = "tests-by-zcta.csv"){
base_url = "https://raw.githubusercontent.com/nychealth/coronavirus-data/"
glue::glue(base_url, git_code, "/", closer)
}
get_historical_covid <- function(url){
# scrapes github for previously committed versions over time
# places the URLs and commit dates into a df
pg <- xml2::read_html(url)
date_nodes <- rvest::html_nodes(pg, ".no-wrap")
dates_raw <- rvest::html_nodes(pg, "relative-time") %>% as.list()
file_urls <- rvest::html_nodes(pg, ".js-navigation-open")
rel_urls <- rvest::html_attr(file_urls, "href")
rel_urls <- rel_urls[lapply(rel_urls,nchar)>0] %>% as.list()
# dates of historical files (commit datetime)
dates <- substr(dates_raw, 26, 44) #%>% length()
# url codes for historical files
files <- substr(rel_urls,36, 75)
if (length(dates) == length(files)){
return(tibble::tibble(commit_date = dates, git_code = files))
}else{
print("broken!")
}
}
This will allow me to obtain a combined list of URLs for each of the files. I then loop through the list and download a tibble
for each historical version of the files.
Note: If this were many, many sources and multiple pages of history, I would probably set this up differently:
url <- c(historical_urls_pg1, historical_urls_pg2)
# combine the pages of the URLS for test-by-zcta...
df <- rbind(get_historical_covid(url[[1]]),
get_historical_covid(url[[2]])) %>%
mutate(source = "tests-by-zcta.csv")
# obtain historical urls for data-by-modacta.csv
dff <- get_historical_covid(historical_urls_new) %>%
mutate(source = "data-by-modzcta.csv")
all_urls <- rbind(df, dff) %>%
mutate(commit_date = lubridate::ymd_hms(commit_date)) %>%
arrange(commit_date) %>%
mutate(file_url = glue::glue("{base_url}{git_code}/{source}"))
read_csv_check <- function(f, c_date, src, ...) {
# for reading the GITHUB url, adding details & src.
tryCatch({vroom::vroom(f) %>%
select_all(tolower) %>%
mutate(commit_date = c_date,
source = src)},
error = function(c) {
print(paste0(c$message, " (in ", f, ")"))
invisible()
})
}
Here’s a preview of the all_urls
variable and it’s contents:
## Rows: 102
## Columns: 4
## $ commit_date <dttm> 2020-04-01 16:35:56, 2020-04-03 22:56~
## $ git_code <chr> "097cbd70aa00eb635b17b177bc4546b2fce21~
## $ source <chr> "tests-by-zcta.csv", "tests-by-zcta.cs~
## $ file_url <glue> "https://raw.githubusercontent.com/ny~
The Files
The below chunk does the actual URL-reading but I’ve suppressed the output because it’s a bit verbose.
covid_dfs <- list()
covid <- tibble()
## Create a long list dfs from the URLs (skip the 404s)
for (i in seq_along(all_urls$git_code)){
temp <- read_csv_check(f = all_urls[i,]$file_url %>% as.character(),
c_date = all_urls$commit_date[[i]],
src = all_urls$source[[i]])
## if you're intereted in seeing it work
#print(glue::glue("File {temp$source %>% unique()}"))
#print(glue::glue("Commit Date {temp$commit_date %>% unique()}"))
#print(glue::glue("Columns {temp %>% colnames()}"))
covid_dfs <- covid_dfs %>%
append(list(temp))
}
Above, covid_dfs
is a list of tibble
s, each containing a historical version of the files, here’s a look at it:
## Rows: 178
## Columns: 5
## $ modzcta <dbl> NA, 10001, 10002, 10003, 10004, 10005,~
## $ positive <dbl> 32, 113, 250, 161, 16, 25, 6, 26, 181,~
## $ total <dbl> 36, 265, 542, 379, 38, 81, 24, 67, 450~
## $ commit_date <dttm> 2020-04-01 16:35:56, 2020-04-01 16:35~
## $ source <chr> "tests-by-zcta.csv", "tests-by-zcta.cs~
Data Cleanup
Metadata Prep
First step is creating a mapping table containing zip code, neighborhood names, and borough fields. It appears they started using a consistent naming format after 2020-06-27.
c_date_func <- function(a_date){substr(as.character(a_date),1,10)}
neighborhood <- covid_dfs %>%
plyr::rbind.fill() %>%
as_tibble() %>%
dplyr::select(modified_zcta, neighborhood_name,
borough_group, commit_date) %>%
na.omit() %>%
# I filter on the 2020-06-27 as this appears to be the latest
# neighborhood_name version
filter(commit_date > as.Date("2020-06-27")) %>%
mutate(modified_zcta = as.character(modified_zcta)) %>%
distinct() %>%
dplyr::select(zip_code = modified_zcta,
neighborhood_name, borough_group)
And here’s a preview of the cleaned up neighborhood
variable:
## Rows: 5,841
## Columns: 3
## $ zip_code <chr> "10001", "10002", "10003", "1000~
## $ neighborhood_name <chr> "Chelsea/NoMad/West Chelsea", "C~
## $ borough_group <chr> "Manhattan", "Manhattan", "Manha~
Core Data Prep
Next I create the core data. I have narrowed the example to include the date, count of positive tests recorded, and the zip code. There is a bit of overlap between the files and I take the median if more than one measure is encountered per day.
Since this data is a time-series, I employ the tsibble
package as it’s got utilities to deal with gaps in the records.
# covid time series = c_ts containing
# - f_date: commit date of the file
# - zip_code:
# - covid_case_count = "positive tests recorded"
covid_ts <- covid_dfs %>%
plyr::rbind.fill() %>%
as_tibble() %>%
mutate(modified_zcta = ifelse(is.na(modified_zcta),
modzcta, modified_zcta),
covid_case_count = ifelse(is.na(covid_case_count),
as.numeric(positive),
as.numeric(covid_case_count)),
commit_date = as.POSIXct(commit_date),
f_date = c_date_func(commit_date),
modified_zcta = ifelse(modified_zcta == 99999,
NA_real_, modified_zcta)
) %>%
dplyr::select(source:f_date) %>%
dplyr::distinct() %>%
dplyr::group_by(f_date, zip_code = modified_zcta) %>%
## Create the time series of covid case count (positive covid tests)
## by combining sources & taking median where there's overlap
dplyr::summarise(covid_case_count = mean(covid_case_count, na.rm = T),
.groups = "drop") %>%
mutate(f_date = lubridate::ymd(f_date),
zip_code = as.character(zip_code)) %>%
left_join(neighborhood, by="zip_code", all.x = T) %>%
distinct()
# These dates appear to show data corrections:
odd_entries <- c(as.Date("2020-04-26"), as.Date("2020-05-23"),
as.Date("2020-09-26"), as.Date("2020-09-27"),
as.Date("2020-09-28"))
covid_ts <- as_tsibble(covid_ts,
key = c(zip_code, neighborhood_name, borough_group),
index = f_date) %>%
filter(!(f_date %in% odd_entries))
And here’s a preview of first-pass view of the covid_ts
data that I’ve created:
## Rows: 16,526
## Columns: 5
## Key: zip_code, neighborhood_name, borough_group [179]
## $ f_date <date> 2020-04-01, 2020-04-03, 2020-04~
## $ zip_code <chr> "10001", "10001", "10001", "1000~
## $ covid_case_count <dbl> 113, 136, 146, 158, 170, 178, 19~
## $ neighborhood_name <chr> "Chelsea/NoMad/West Chelsea", "C~
## $ borough_group <chr> "Manhattan", "Manhattan", "Manha~
Gaps in Temporal Data
Whenever you receive time-series data it’s a good idea to check for gaps. Observations that may have been missed one day but return the next. Thankfully, the tsibble
library contains several helpful utilities for looking over temporal data.
The first function is count_gaps
that takes a tsibble
and returns the elements of the index (zip_code
in this example) along with the .from
and .to
dates in which there is missing data. Here’s a glimpse of that data:
## Rows: 1,778
## Columns: 6
## $ zip_code <chr> "10001", "10001", "10001", "1000~
## $ neighborhood_name <chr> "Chelsea/NoMad/West Chelsea", "C~
## $ borough_group <chr> "Manhattan", "Manhattan", "Manha~
## $ .from <date> 2020-04-02, 2020-04-06, 2020-04~
## $ .to <date> 2020-04-02, 2020-04-06, 2020-04~
## $ .n <int> 1, 1, 1, 1, 1, 1, 1, 1, 120, 103~
From the above we can see that zip code 10001, for example, is missing data on 2020-04-02, 2020-04-06, etc. We can fill those gaps with NA
s using the fill_gaps
function:
covid_ts %>%
fill_gaps(.full = start()) %>%
head()
## # A tsibble: 6 x 5 [1D]
## # Key: zip_code, neighborhood_name, borough_group [1]
## f_date zip_code covid_case_count neighborhood_name borough_group
## <date> <chr> <dbl> <chr> <chr>
## 1 2020-04-01 10001 113 Chelsea/NoMad/West Chelsea Manhattan
## 2 2020-04-02 10001 NA Chelsea/NoMad/West Chelsea Manhattan
## 3 2020-04-03 10001 136 Chelsea/NoMad/West Chelsea Manhattan
## 4 2020-04-04 10001 146 Chelsea/NoMad/West Chelsea Manhattan
## 5 2020-04-05 10001 158 Chelsea/NoMad/West Chelsea Manhattan
## 6 2020-04-06 10001 NA Chelsea/NoMad/West Chelsea Manhattan
In time series analysis, it’s common to replace NA
’s with their previous observation for each origin, which I do here using tidyr::fill
:
covid_ts %>%
fill_gaps(.full = start()) %>%
tidyr::fill(covid_case_count, .direction = "down") %>%
head()
## # A tsibble: 6 x 5 [1D]
## # Key: zip_code, neighborhood_name, borough_group [1]
## f_date zip_code covid_case_count neighborhood_name borough_group
## <date> <chr> <dbl> <chr> <chr>
## 1 2020-04-01 10001 113 Chelsea/NoMad/West Chelsea Manhattan
## 2 2020-04-02 10001 113 Chelsea/NoMad/West Chelsea Manhattan
## 3 2020-04-03 10001 136 Chelsea/NoMad/West Chelsea Manhattan
## 4 2020-04-04 10001 146 Chelsea/NoMad/West Chelsea Manhattan
## 5 2020-04-05 10001 158 Chelsea/NoMad/West Chelsea Manhattan
## 6 2020-04-06 10001 158 Chelsea/NoMad/West Chelsea Manhattan
Note that we’ve filled in the missing dates with the prior day’s value. Finally, I put all the steps together and filter out a few zip codes that appear to have been included in error as they’re only included in a single day of data 5/23/2020:
Visualize the Data
Finally, to complete the example, I plot the sum of the daily differences for a random zip code, 10040
my_plot <- covid_ts %>%
arrange(zip_code, f_date) %>%
mutate(f_month = lubridate::month(f_date, label=T),
f_week = lubridate::week(f_date),
f_day = lubridate::day(f_date),
f_wday = lubridate::wday(f_date, label = T),
dailydiff = difference(covid_case_count,lag = 1),
diff_7day_ma = slider::slide(dailydiff, mean,
na.rm = TRUE, .size = 7)) %>%
filter(f_date > covid_ts$f_date %>% min()+7) %>%
group_by(f_month, f_day, borough_group) %>%
summarise(SumOfDailyDiff = sum(dailydiff)) %>%
na.omit() %>%
filter(f_date < as.Date("2020-06-7")) %>%
ggplot(aes(x = f_date, y = SumOfDailyDiff))+
geom_line()+
facet_wrap(~borough_group)+
labs(title = "Sum, Daily Differences in C19 Case Count;10040")+
theme(axis.title.x = element_blank(),
axis.title.y = element_blank())
plotly::ggplotly(my_plot)
As of the latest available data for this post, the plots paint a promising picture but I am going to remain cautiously optimistic given the recent spikes showing in the Bronx and Brooklyn. The weather’s been really nice, too, which is a blessing and curse in times like these.
Please stay safe out there and take care!