Overview

The sars2pack R package provides one-line access to over 40 COVID-related datasets. Datasets are accessed in real time directly from their sources and then transformed to tidy-data form where possible and applicable. The result of each dataset accessor is a ready-to-use R dataset, often a dataframe. Documentation includes dataset descriptions, sources and references, and examples. Online documentation is available in two locations:

Questions addressed by sars2pack

  • What are the current and historical counts of total cases, new cases, and deaths from COVID-19 at the city, county, state, national, and international levels?
  • How do changes in infection rates differ across locations?
  • What are the non-pharmacological interventions in place at the local and national levels?
  • In the United States, what is the geographical distribution of healthcare capacity (ICU beds, total beds, doctors, etc.)?
  • What are the published values of key epidemic parameters, as curated from the literature?

Installation

# If you do not have BiocManager installed:
install.packages('BiocManager')

# Then, if sars2pack is not already installed:
BiocManager::install('seandavi/sars2pack')

After the one-time installation, load the package to get started.

library(sars2pack)

Available datasets

| name | accessor | data_type | geographical | geospatial | region | resolution | url |
|---|---|---|---|---|---|---|---|
| United States county-level geographic details | us_county_geo_details | demographics, geographic | TRUE | TRUE | United States | admin2 | LINK |
| OECD International Unemployment Data | oecd_unemployment_data | economics, time series | TRUE | FALSE | World | admin0 | LINK |
| healthdata.org COVID-19 Mobility Observations and Projections | healthdata_mobility_data | mobility, time series, projections | TRUE | FALSE | International | admin0, admin1 | LINK |
| healthdata.org COVID-19 Testing Observations and Projections | healthdata_testing_data | testing, time series, projections | TRUE | FALSE | International | admin0, admin1 | LINK |
| Our World In Data testing and cases reporting | owid_data | time series, cases, deaths, testing | TRUE | FALSE | World | admin0 | LINK |
| CovidTracker data | covidtracker_data | time series, cases, deaths, testing | TRUE | FALSE | United States | admin1 | LINK |
| European CDC world tracking | ecdc_data | time series, cases, deaths | TRUE | FALSE | World | admin0 | LINK |
| EU data GitHub aggregator | eu_data_cache_data | time series, cases, deaths | TRUE | FALSE | Europe | admin0, admin1 | LINK |
| USA Facts | usa_facts_data | time series, cases, deaths | TRUE | FALSE | United States | admin1 | LINK |
| Johns Hopkins dataset | jhu_data | time series, cases, deaths | TRUE | FALSE | World | admin0 | LINK |
| Johns Hopkins US-centric data | jhu_us_data | time series, cases, deaths | TRUE | FALSE | United States | admin1, admin2 | LINK |
| New York Times county level data | nytimes_county_data | time series, cases, deaths | TRUE | FALSE | United States | admin2 | LINK |
| New York Times state level data | nytimes_state_data | time series, cases, deaths | TRUE | FALSE | United States | admin1 | LINK |
| The Economist: Excess deaths during COVID pandemic | economist_excess_deaths | time series, deaths, excess deaths | TRUE | FALSE | International | admin0, admin1 | LINK |
| The Financial Times: Excess deaths during COVID pandemic | financial_times_excess_deaths | time series, deaths, excess deaths | TRUE | FALSE | International | admin0, admin1 | LINK |
| US CDC excess deaths dataset | cdc_excess_deaths | time series, deaths, excess deaths | TRUE | FALSE | United States | admin1 | LINK |
| Descartes Labs Mobility Data | descartes_mobility_data | time series, mobility | TRUE | FALSE | United States | admin1 | LINK |
| Apple mobility data from maps | apple_mobility_data | time series, mobility | TRUE | FALSE | World | admin0, admin1, admin2, admin3 | LINK |
| Healthdata.org projections of hospital utilization and deaths | healthdata_projections_data | time series, projections, cases, deaths | TRUE | FALSE | United States, World | admin1, admin2 | LINK |
| Healthdata.org mobility data | healthdata_mobility_data | time series, projections, mobility | TRUE | FALSE | United States, World | admin1, admin2 | LINK |
| United States CDC Social Vulnerability Index | cdc_social_vulnerability_index | demographics | TRUE | FALSE | United States | admin2 | LINK |
| US county health rankings from https://www.countyhealthrankings.org | us_county_health_rankings | demographics | TRUE | FALSE | United States | admin0, admin1, admin2 | LINK |
| Country metadata from restcountries.eu | country_metadata | demographics | TRUE | FALSE | World | admin0 | LINK |
| Extensive United States hospital capabilities | us_hospital_details | healthcare capacity | TRUE | TRUE | United States | individual hospital | LINK |
| Kaiser Family Foundation ICU bed data | kff_icu_beds | healthcare capacity | TRUE | TRUE | United States | individual hospital | LINK |
| CovidCare United States Healthcare Capacity | us_healthcare_capacity | healthcare capacity | TRUE | TRUE | United States | individual hospital | LINK |
| GISAID metadata from thousands of SARS-CoV-2 sequences | cov_glue_lineage_data | line list | TRUE | FALSE | World | multiple | LINK |
| beoutbreakprepared | beoutbreakprepared_data | line list | TRUE | FALSE | World | patient | LINK |
| Published epidemic parameters for COVID-19 | param_estimates_published | miscellaneous | FALSE | FALSE | (none) | (none) | LINK |
| Google mobility data | google_mobility_data | mobility | TRUE | FALSE | World | admin0, admin1, admin2 | LINK |
| Newick tree from thousands of SARS-CoV-2 sequences | cov_glue_newick_data | phylogenetic | FALSE | FALSE | World | multiple | LINK |
| Aggregated projections from US CDC | cdc_aggregated_projections | projections | TRUE | FALSE | (none) | admin0, admin1 | LINK |
| CoronaNet government response database | coronanet_government_response_data | public policy | TRUE | FALSE | World | admin0, admin1 | LINK |
| Oxford Government Policy Intervention time series | government_policy_timeline | public policy | TRUE | FALSE | World | admin0 | LINK |
| United States social distancing policies | us_state_distancing_policy | public policy | TRUE | FALSE | United States | admin1 | LINK |
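
Several columns in this table (data_type, region, resolution) can hold more than one value per dataset, so narrowing the list to one kind of data means testing membership row by row. The following is a minimal sketch of that idea. The `catalog` data frame is a hand-built stand-in for a few rows of the table and `has_type` is a hypothetical helper; neither is part of sars2pack. Only the accessor names and data types are taken from the table.

```r
# Hand-built stand-in for a few rows of the dataset table (an assumption for
# illustration; this object is not produced by sars2pack).
catalog <- data.frame(
  accessor = c("ecdc_data", "jhu_data", "apple_mobility_data", "google_mobility_data"),
  region = c("World", "World", "World", "World"),
  stringsAsFactors = FALSE
)

# data_type is a list-column: each dataset can carry several types.
catalog$data_type <- list(
  c("time series", "cases", "deaths"),
  c("time series", "cases", "deaths"),
  c("time series", "mobility"),
  "mobility"
)

# Hypothetical helper: TRUE for rows whose data_type includes `type`.
has_type <- function(catalog, type) {
  vapply(catalog$data_type, function(types) type %in% types, logical(1))
}

# Accessor names of the datasets tagged as mobility data.
mobility_accessors <- catalog$accessor[has_type(catalog, "mobility")]
print(mobility_accessors)
```

Each name returned this way corresponds to an accessor function that can be called with no arguments, in the same way as ecdc_data() in the next section.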

Case tracking

Up-to-date tracking of city, county, state, national, and international confirmed cases, deaths, and testing is critical to driving policy, implementing interventions, and measuring their effectiveness. Case-tracking datasets include a date, a count of cases, and usually other fields such as the reporting location.

Case-tracking datasets are typically accessed with one function per dataset. The example here uses data from the European Centre for Disease Prevention and Control (ECDC).

ecdc = ecdc_data()

Get a quick overview of the dataset.

head(ecdc)

## # A tibble: 6 x 8
## # Groups:   location_name, subset [6]
##   date       location_name iso2c iso3c population_2019 continent subset    count
##   <date>     <chr>         <chr> <chr>           <dbl> <chr>     <chr>     <dbl>
## 1 2019-12-31 Afghanistan   AF    AFG          38041757 Asia      confirmed     0
## 2 2019-12-31 Afghanistan   AF    AFG          38041757 Asia      deaths        0
## 3 2019-12-31 Algeria       DZ    DZA          43053054 Africa    confirmed     0
## 4 2019-12-31 Algeria       DZ    DZA          43053054 Africa    deaths        0
## 5 2019-12-31 Armenia       AM    ARM           2957728 Europe    confirmed     0
## 6 2019-12-31 Armenia       AM    ARM           2957728 Europe    deaths        0

The ecdc dataset is just a data.frame (actually, a tibble), so standard R or tidyverse functions can answer basic questions with little code. The next code block builds a top-10 list of the countries with the most deaths recorded to date. Note that if you run this on your own computer, the data will reflect the values available today.

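library(dplyr)
# Keep only the death counts, then work country by country.
top10 = ecdc %>% filter(subset=='deaths') %>%
    group_by(location_name) %>%
    # Rows where the count equals the country's maximum; tied dates all remain.
    filter(count==max(count)) %>%
    # Sort all rows by count, largest first.
    arrange(desc(count)) %>%
    # Take the first 10 rows overall and drop columns not needed in the table.
    head(10) %>% select(-starts_with('iso'),-continent,-subset) %>%
    mutate(rate_per_100k = 1e5*count/population_2019)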

Finally, present a nice table of those countries:

knitr::kable(
    top10,
    caption = "Reported COVID-19-related deaths in ten most affected countries.",
    format = 'pandoc')

Reported COVID-19-related deaths in ten most affected countries.

| date | location_name | population_2019 | count | rate_per_100k |
|---|---|---|---|---|
| 2020-07-06 | United_States_of_America | 329064917 | 129947 | 39.489776 |
| 2020-07-06 | Brazil | 211049519 | 64867 | 30.735441 |
| 2020-07-06 | United_Kingdom | 66647112 | 44220 | 66.349462 |
| 2020-07-06 | Italy | 60359546 | 34861 | 57.755570 |
| 2020-07-06 | Mexico | 127575529 | 30639 | 24.016361 |
| 2020-07-04 | France | 67012883 | 29893 | 44.607841 |
| 2020-07-05 | France | 67012883 | 29893 | 44.607841 |
| 2020-07-06 | France | 67012883 | 29893 | 44.607841 |
| 2020-05-24 | Spain | 46937060 | 28752 | 61.256500 |
| 2020-07-06 | India | 1366417756 | 19693 | 1.441214 |
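
France appears three times in this table because its maximum count was reported on three consecutive dates, and filtering on count==max(count) keeps every tied row. One possible way to keep a single row per country is to break ties by date. The sketch below is an illustration on a few rows copied from the table, not a change to the pipeline that produced it; keeping the earliest date at the peak is an assumption about which row is most useful.

```r
library(dplyr)

# Rows copied from the table above (France tied on three dates, Spain peaking
# earlier), so the example runs without downloading anything.
deaths <- data.frame(
  date = as.Date(c("2020-07-04", "2020-07-05", "2020-07-06", "2020-05-24")),
  location_name = c("France", "France", "France", "Spain"),
  population_2019 = c(67012883, 67012883, 67012883, 46937060),
  count = c(29893, 29893, 29893, 28752),
  stringsAsFactors = FALSE
)

one_row_each <- deaths %>%
  group_by(location_name) %>%
  filter(count == max(count)) %>%   # same step as the pipeline above
  filter(date == min(date)) %>%     # break ties: earliest date at the peak
  ungroup() %>%
  arrange(desc(count)) %>%
  mutate(rate_per_100k = 1e5 * count / population_2019)

print(one_row_each)
```

With one row per country, a following head(10) would return ten distinct countries rather than ten rows.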

Examine the spread of the pandemic throughout the world by plotting cumulative deaths reported for the top 10 countries above.

ecdc_top10 = ecdc %>% filter(location_name %in% top10$location_name & subset=='deaths')
plot_epicurve(ecdc_top10,
              filter_expression = count > 10, 
              color='location_name')
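
The counts plotted here are cumulative. For questions about new cases or deaths per day, daily increments can be derived by differencing within each location. The following is a minimal sketch on made-up numbers; the `cumulative` data frame and the `new_count` column are illustrative names, not sars2pack output.

```r
library(dplyr)

# Made-up cumulative counts for two locations over four days.
cumulative <- data.frame(
  location_name = rep(c("A", "B"), each = 4),
  date = rep(as.Date("2020-03-01") + 0:3, times = 2),
  count = c(0, 2, 5, 9, 1, 1, 4, 10),
  stringsAsFactors = FALSE
)

daily <- cumulative %>%
  group_by(location_name) %>%
  arrange(date, .by_group = TRUE) %>%
  # Difference from the previous day; the first day is compared against 0.
  mutate(new_count = count - dplyr::lag(count, default = 0)) %>%
  ungroup()

print(daily)
```

Real reporting data can be revised downward, in which case this differencing yields negative increments that may need separate handling.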

Comparing the features of disease spread is easiest if all curves are shifted to “start” at the same absolute level. In this case, shift the origin for each country to the first time point at which more than 100 cumulative deaths had been reported. Note how some curves cross others, which suggests less infection control at the same relative time in the pandemic for that country (e.g., Brazil).

ecdc_top10 %>% align_to_baseline(count>100,group_vars=c('location_name')) %>%
    plot_epicurve(date_column = 'index',color='location_name')
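
To make the alignment idea concrete, here is a rough sketch of the same kind of shift done by hand with dplyr on made-up numbers. It is not the implementation of align_to_baseline, whose internals are not shown here; the `toy` data and the choice to measure `index` in days since the first qualifying date are assumptions for illustration.

```r
library(dplyr)

# Made-up cumulative counts; location B starts reporting later than A.
toy <- data.frame(
  location_name = rep(c("A", "B"), each = 4),
  date = c(as.Date("2020-03-01") + 0:3, as.Date("2020-03-10") + 0:3),
  count = c(50, 120, 300, 700, 90, 101, 150, 400),
  stringsAsFactors = FALSE
)

aligned <- toy %>%
  group_by(location_name) %>%
  # Keep rows from the first date on which count exceeded 100.
  filter(date >= min(date[count > 100])) %>%
  # Days elapsed since that first date, computed per location.
  mutate(index = as.integer(date - min(date))) %>%
  ungroup()

print(aligned)
```

After the shift, both locations start at index 0 even though their calendar dates differ, which is what allows the curves to be compared on a common axis.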

Contributions

Pull requests are gladly accepted on GitHub.

Adding new datasets

See the Adding new datasets vignette.