Rows: 6,804
Columns: 7
$ ParkName <chr> "Acadia NP", "Acadia NP", "Acadia~
$ pid <chr> "ACAD", "ACAD", "ACAD", "ACAD", "~
$ Region <chr> "Northeast", "Northeast", "Northe~
$ State <chr> "ME", "ME", "ME", "ME", "ME", "ME~
$ Year <dbl> 2012, 2012, 2012, 2012, 2012, 201~
$ Month <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11~
$ RecreationVisits <dbl> 11930, 12388, 22684, 59471, 15740~
Following full vaccination, folks have begun venturing outdoors for the first time in long while - myself included! In support of outdoor-activity planning, today’s post is going to cover forecasting national park visits for each month of 2021.
Since there are 63 National Parks in the US across 30 states and two territories, I will employ the sweep
package to address the scale of forecasts to be made/reviewed all while keeping the data in a tidy format.
Let’s get started!
National Park Service (NPS) Data
The NPS offers historical visitor data for querying via their “NPS Query Tool” found here. While the NPS does offer an API, visit counts are unfortunately not included.
The data for the forecast was pulled from the NPS query tool and contains the most-recent 8 years of data including monthly visitor totals from 2012 through 2020.
There are many more fields available in this data that I have not shown - you can find additional details about the them here.
I’ve cleaned and stored the data in a variable called visit_stats
and I’ve made an interactive plot, below, for exploring the 63 time series for each park (see my previous post for details behind how this interactive plot was made with plotly
and highlight
).
Making Many Forecasts
The below example is similar to the sweep
package’s excellent package vignette for making forecasts of groups - be sure to check it out for additional information and features not covered here.
library(tidyverse)
library(forecast)
library(timetk) # Coercing tibbles into ts objects
library(sweep) # Organizing forecast results
Here’s a preview of the data that I will build the forecast on, visit_stats
:
Rows: 6,804
Columns: 6
$ pid <chr> "ACAD", "ACAD", "ACAD", "ACAD", "~
$ ParkName <chr> "Acadia NP", "Acadia NP", "Acadia~
$ State <chr> "ME", "ME", "ME", "ME", "ME", "ME~
$ Region <chr> "Northeast", "Northeast", "Northe~
$ yearmonth <date> 2012-01-01, 2012-02-01, 2012-03-~
$ RecreationVisits <dbl> 11930, 12388, 22684, 59471, 15740~
Similar to the vignette, I will be going through five steps including:
- nesting the data by park
- coercing each group (park) into a
ts
object - this is required for using theforecast
package. - fitting an “exponential smoothing state space (
ets
) model” to each park and reviewing the parameters withsweep
- use the fit to make a 12-month forecast for 2021
- leverage the
sweep
package to deliver forecast results in a tidy, easy-to-plot format
1. Nesting Data
When making grouped forecasts, it’s often advantageous to work with nested data as it reduces the chance you’ll inadvertantly mix data and it opens the door for a variety of advanced methods - see this previous post under the section: “tidyr::nest
and purrr:map
for Grouped Modeling” for a refresher.
I nest the visit_stats
by distinct, park-level attributes to create visit_nest
. A glimpse of visit_nest
shows that there is now one row per park, and each park contains a nested tibble containing the visit counts and dates:
<- visit_stats %>%
visit_nest group_by(pid, ParkName, State, Region) %>%
nest()
glimpse(visit_nest, width = 70)
Rows: 63
Columns: 5
Groups: pid, ParkName, State, Region [63]
$ pid <chr> "ACAD", "ARCH", "BADL", "BIBE", "BISC", "BLCA", "BR~
$ ParkName <chr> "Acadia NP", "Arches NP", "Badlands NP", "Big Bend ~
$ State <chr> "ME", "UT", "SD", "TX", "FL", "CO", "UT", "UT", "UT~
$ Region <chr> "Northeast", "Intermountain", "Midwest", "Intermoun~
$ data <list> [<tbl_df[108 x 2]>], [<tbl_df[108 x 2]>], [<tbl_df~
2. Coercing to ts
:
I create visit_ts
which is just visit_nest
with a time-series version of the nested date/visit data
column. If you’ve ever worked with ts
objects, you may find the timetk
package a breath of fresh air.
<- visit_nest %>%
visit_ts mutate(data.ts = map(
.x = data,
.f = timetk::tk_ts,
select = -yearmonth,
start = 2012,
freq = 12
))
3. Fit the ets
model
Fitting 63 individual models can take a bit depending on your computer:
tic()
<- visit_ts %>%
visit_fit mutate(fit.ets = map(data.ts, forecast::ets))
toc()
22.08 sec elapsed
To review the fits, sweep
provides the sw_tidy
function for extracting the parameters of the models for review (notice the “term” and “estimate” columns):
# view just one park's fit:
%>%
visit_fit mutate(ets.tidy = map(fit.ets, sweep::sw_tidy)) %>%
unnest(ets.tidy) %>%
filter(pid == "ACAD") %>%
glimpse(width = 70)
Rows: 14
Columns: 9
Groups: pid, ParkName, State, Region [1]
$ pid <chr> "ACAD", "ACAD", "ACAD", "ACAD", "ACAD", "ACAD", "AC~
$ ParkName <chr> "Acadia NP", "Acadia NP", "Acadia NP", "Acadia NP",~
$ State <chr> "ME", "ME", "ME", "ME", "ME", "ME", "ME", "ME", "ME~
$ Region <chr> "Northeast", "Northeast", "Northeast", "Northeast",~
$ data <list> [<tbl_df[108 x 2]>], [<tbl_df[108 x 2]>], [<tbl_df~
$ data.ts <list> <<ts[108 x 1]>>, <<ts[108 x 1]>>, <<ts[108 x 1]>>,~
$ fit.ets <list> [-1304.68, 2639.36, 2679.592, 2644.577, 1871814098~
$ term <chr> "alpha", "gamma", "l", "s0", "s1", "s2", "s3", "s4"~
$ estimate <dbl> 4.704027e-01, 1.415324e-03, 2.220751e+05, 5.341076e~
4. Create the 2021 visit forecast
The only parameter is the forecast window of h = 12
representing the 12 months of 2021.
<- visit_fit %>%
visit_fcast mutate(fcast.ets = map(fit.ets, forecast::forecast, h = 12))
5. Tidy the forcast results with sweep::sw_sweep
Finally, sw_sweep
provides the forecast results in a very convenient structure with with ggplot
-ready tidy data.
<- visit_fcast %>%
visit_fcast_tidy mutate(sweep = map(fcast.ets, sweep::sw_sweep,
fitted = FALSE,
timetk_idx = TRUE
%>%
)) unnest(sweep)
glimpse(visit_fcast_tidy, width = 70)
Rows: 7,560
Columns: 15
Groups: pid, ParkName, State, Region [63]
$ pid <chr> "ACAD", "ACAD", "ACAD", "ACAD", "ACAD", "AC~
$ ParkName <chr> "Acadia NP", "Acadia NP", "Acadia NP", "Aca~
$ State <chr> "ME", "ME", "ME", "ME", "ME", "ME", "ME", "~
$ Region <chr> "Northeast", "Northeast", "Northeast", "Nor~
$ data <list> [<tbl_df[108 x 2]>], [<tbl_df[108 x 2]>], ~
$ data.ts <list> <<ts[108 x 1]>>, <<ts[108 x 1]>>, <<ts[108~
$ fit.ets <list> [-1304.68, 2639.36, 2679.592, 2644.577, 18~
$ fcast.ets <list> [-1304.68, 2639.36, 2679.592, 2644.577, 18~
$ index <date> 2012-01-01, 2012-02-01, 2012-03-01, 2012-0~
$ key <chr> "actual", "actual", "actual", "actual", "ac~
$ RecreationVisits <dbl> 11930, 12388, 22684, 59471, 157406, 309198,~
$ lo.80 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
$ lo.95 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
$ hi.80 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
$ hi.95 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
Below, I plot the forecasts for the parks in the Northeast region along with the past 3 years of actual visits. You can hover over the lines to see high & low 95% confidence intervals for the forecast, too.
And this only scratches the surface of what’s possible with the sweep
package - see the documentation for sw_agument
that allows for comparing the actual vs. forecast, residuals, and cluster assignments - it’s a truly exciting time to be in data analytics!
National Park Locations & Information
If you enjoy the outdoors, consider including a national park visit in your post-pandemic vacation plans - you can use the map below for additional park research: simply hover over the green points on the map to read the NPS description and weather information
The map with Alaska and Hawaii placed below the contiguous states was created with the usmap
package which currently lacks positioning for Puerto Rico, American Somoa, and the US Virgin Islands - the latter two of which contain the National Park of American Samoa and Virgin Islands National Park, respectively.