Forecasting National Park Visits

Following full vaccination, folks have begun venturing outdoors for the first time in long while - myself included! In support of outdoor-activity planning, today’s post is going to cover forecasting national park visits for each month of 2021.

Since there are 63 National Parks in the US across 30 states and two territories, I will employ the sweep package to address the scale of forecasts to be made/reviewed all while keeping the data in a tidy format.

Let’s get started!

National Park Service (NPS) Data

The NPS offers historical visitor data for querying via their “NPS Query Tool” found here. While the NPS does offer an API, visit counts are unfortunately not included.

The data for the forecast was pulled from the NPS query tool and contains the most-recent 8 years of data including monthly visitor totals from 2012 through 2020.

Rows: 6,804
Columns: 7
$ ParkName         <chr> "Acadia NP", "Acadia NP", "Acadia~
$ pid              <chr> "ACAD", "ACAD", "ACAD", "ACAD", "~
$ Region           <chr> "Northeast", "Northeast", "Northe~
$ State            <chr> "ME", "ME", "ME", "ME", "ME", "ME~
$ Year             <dbl> 2012, 2012, 2012, 2012, 2012, 201~
$ Month            <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11~
$ RecreationVisits <dbl> 11930, 12388, 22684, 59471, 15740~

There are many more fields available in this data that I have not shown - you can find additional details about the them here.

I’ve cleaned and stored the data in a variable called visit_stats and I’ve made an interactive plot, below, for exploring the 63 time series for each park (see my previous post for details behind how this interactive plot was made with plotly and highlight).

Making Many Forecasts

The below example is similar to the sweep package’s excellent package vignette for making forecasts of groups - be sure to check it out for additional information and features not covered here.

library(tidyverse)
library(forecast)
library(timetk)     # Coercing tibbles into ts objects
library(sweep)      # Organizing forecast results

Here’s a preview of the data that I will build the forecast on, visit_stats:

Rows: 6,804
Columns: 6
$ pid              <chr> "ACAD", "ACAD", "ACAD", "ACAD", "~
$ ParkName         <chr> "Acadia NP", "Acadia NP", "Acadia~
$ State            <chr> "ME", "ME", "ME", "ME", "ME", "ME~
$ Region           <chr> "Northeast", "Northeast", "Northe~
$ yearmonth        <date> 2012-01-01, 2012-02-01, 2012-03-~
$ RecreationVisits <dbl> 11930, 12388, 22684, 59471, 15740~

Similar to the vignette, I will be going through five steps including:

nesting the data by park
coercing each group (park) into a ts object - this is required for using the forecast package.
fitting an “exponential smoothing state space (ets) model” to each park and reviewing the parameters with sweep
use the fit to make a 12-month forecast for 2021
leverage the sweep package to deliver forecast results in a tidy, easy-to-plot format

1. Nesting Data

When making grouped forecasts, it’s often advantageous to work with nested data as it reduces the chance you’ll inadvertantly mix data and it opens the door for a variety of advanced methods - see this previous post under the section: “tidyr::nest and purrr:map for Grouped Modeling” for a refresher.

I nest the visit_stats by distinct, park-level attributes to create visit_nest. A glimpse of visit_nest shows that there is now one row per park, and each park contains a nested tibble containing the visit counts and dates:

visit_nest <- visit_stats %>%
  group_by(pid, ParkName, State, Region) %>%
  nest()

glimpse(visit_nest, width = 70)

Rows: 63
Columns: 5
Groups: pid, ParkName, State, Region [63]
$ pid      <chr> "ACAD", "ARCH", "BADL", "BIBE", "BISC", "BLCA", "BR~
$ ParkName <chr> "Acadia NP", "Arches NP", "Badlands NP", "Big Bend ~
$ State    <chr> "ME", "UT", "SD", "TX", "FL", "CO", "UT", "UT", "UT~
$ Region   <chr> "Northeast", "Intermountain", "Midwest", "Intermoun~
$ data     <list> [<tbl_df[108 x 2]>], [<tbl_df[108 x 2]>], [<tbl_df~

2. Coercing to `ts`:

I create visit_ts which is just visit_nest with a time-series version of the nested date/visit data column. If you’ve ever worked with ts objects, you may find the timetk package a breath of fresh air.

visit_ts <- visit_nest %>%
  mutate(data.ts = map(
    .x = data,
    .f = timetk::tk_ts,
    select = -yearmonth,
    start = 2012,
    freq = 12
  ))

3. Fit the `ets` model

Fitting 63 individual models can take a bit depending on your computer:

tic()
visit_fit <- visit_ts %>%
  mutate(fit.ets = map(data.ts, forecast::ets))
toc()

22.08 sec elapsed

To review the fits, sweep provides the sw_tidy function for extracting the parameters of the models for review (notice the “term” and “estimate” columns):

# view just one park's fit:
visit_fit %>%
  mutate(ets.tidy = map(fit.ets, sweep::sw_tidy)) %>%
  unnest(ets.tidy) %>%
  filter(pid == "ACAD") %>%
  glimpse(width = 70)

Rows: 14
Columns: 9
Groups: pid, ParkName, State, Region [1]
$ pid      <chr> "ACAD", "ACAD", "ACAD", "ACAD", "ACAD", "ACAD", "AC~
$ ParkName <chr> "Acadia NP", "Acadia NP", "Acadia NP", "Acadia NP",~
$ State    <chr> "ME", "ME", "ME", "ME", "ME", "ME", "ME", "ME", "ME~
$ Region   <chr> "Northeast", "Northeast", "Northeast", "Northeast",~
$ data     <list> [<tbl_df[108 x 2]>], [<tbl_df[108 x 2]>], [<tbl_df~
$ data.ts  <list> <<ts[108 x 1]>>, <<ts[108 x 1]>>, <<ts[108 x 1]>>,~
$ fit.ets  <list> [-1304.68, 2639.36, 2679.592, 2644.577, 1871814098~
$ term     <chr> "alpha", "gamma", "l", "s0", "s1", "s2", "s3", "s4"~
$ estimate <dbl> 4.704027e-01, 1.415324e-03, 2.220751e+05, 5.341076e~

4. Create the 2021 visit forecast

The only parameter is the forecast window of h = 12 representing the 12 months of 2021.

visit_fcast <- visit_fit %>%
  mutate(fcast.ets = map(fit.ets, forecast::forecast, h = 12))

5. Tidy the forcast results with `sweep::sw_sweep`

Finally, sw_sweep provides the forecast results in a very convenient structure with with ggplot-ready tidy data.

visit_fcast_tidy <- visit_fcast %>%
  mutate(sweep = map(fcast.ets, sweep::sw_sweep,
    fitted = FALSE,
    timetk_idx = TRUE
  )) %>%
  unnest(sweep)

glimpse(visit_fcast_tidy, width = 70)

Rows: 7,560
Columns: 15
Groups: pid, ParkName, State, Region [63]
$ pid              <chr> "ACAD", "ACAD", "ACAD", "ACAD", "ACAD", "AC~
$ ParkName         <chr> "Acadia NP", "Acadia NP", "Acadia NP", "Aca~
$ State            <chr> "ME", "ME", "ME", "ME", "ME", "ME", "ME", "~
$ Region           <chr> "Northeast", "Northeast", "Northeast", "Nor~
$ data             <list> [<tbl_df[108 x 2]>], [<tbl_df[108 x 2]>], ~
$ data.ts          <list> <<ts[108 x 1]>>, <<ts[108 x 1]>>, <<ts[108~
$ fit.ets          <list> [-1304.68, 2639.36, 2679.592, 2644.577, 18~
$ fcast.ets        <list> [-1304.68, 2639.36, 2679.592, 2644.577, 18~
$ index            <date> 2012-01-01, 2012-02-01, 2012-03-01, 2012-0~
$ key              <chr> "actual", "actual", "actual", "actual", "ac~
$ RecreationVisits <dbl> 11930, 12388, 22684, 59471, 157406, 309198,~
$ lo.80            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
$ lo.95            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
$ hi.80            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
$ hi.95            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~

Below, I plot the forecasts for the parks in the Northeast region along with the past 3 years of actual visits. You can hover over the lines to see high & low 95% confidence intervals for the forecast, too.

And this only scratches the surface of what’s possible with the sweep package - see the documentation for sw_agument that allows for comparing the actual vs. forecast, residuals, and cluster assignments - it’s a truly exciting time to be in data analytics!

National Park Locations & Information

If you enjoy the outdoors, consider including a national park visit in your post-pandemic vacation plans - you can use the map below for additional park research: simply hover over the green points on the map to read the NPS description and weather information

The map with Alaska and Hawaii placed below the contiguous states was created with the usmap package which currently lacks positioning for Puerto Rico, American Somoa, and the US Virgin Islands - the latter two of which contain the National Park of American Samoa and Virgin Islands National Park, respectively.

National Park Service (NPS) Data

Making Many Forecasts

1. Nesting Data

2. Coercing to ts:

3. Fit the ets model

4. Create the 2021 visit forecast

5. Tidy the forcast results with sweep::sw_sweep

National Park Locations & Information

Links and Sources

2. Coercing to `ts`:

3. Fit the `ets` model

5. Tidy the forcast results with `sweep::sw_sweep`