sars2pack

Overview

The sars2pack R package provides one-line access to over 40 COVID-related datasets. Datasets are accessed in real time directly from their sources and then transformed to tidy-data form where possible and applicable. The result of each dataset accessor is a ready-to-use R dataset, often a dataframe. Documentation includes dataset descriptions, sources and references, and examples. Online documentation is available in two locations:

The sars2pack documentation, which includes reference docs and detailed dataset descriptions.
Extended workflows and use cases, available as an online book.

Questions addressed by sars2pack

What are the current and historical counts of total cases, new cases, and deaths from COVID-19 at the city, county, state, national, and international levels?
How do changes in infection rates differ across locations?
What are the non-pharmacological interventions in place at the local and national levels?
In the United States, what is the geographical distribution of healthcare capacity (ICU beds, total beds, doctors, etc.)?
What are the published values of key epidemic parameters, as curated from the literature?

Installation

# If you do not have BiocManager installed:
install.packages('BiocManager')

# Then, if sars2pack is not already installed:
BiocManager::install('seandavi/sars2pack')

After the one-time installation, load the package to get started.

library(sars2pack)
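The installation steps above can be made conditional, so that nothing is reinstalled when it is already present. The following is a minimal sketch, not part of sars2pack: missing_pkgs is a hypothetical helper name, and the install calls are left commented out so that running the sketch has no side effects.

```r
# Return the subset of `pkgs` that cannot be loaded in this R session.
# `missing_pkgs` is a hypothetical helper, not part of sars2pack.
missing_pkgs <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_pkgs(c("BiocManager", "sars2pack"))
print(to_install)

# Left commented out so that running the sketch never installs anything:
# if ("BiocManager" %in% to_install) install.packages("BiocManager")
# if ("sars2pack" %in% to_install) BiocManager::install("seandavi/sars2pack")
```

An empty result means both packages are available and library(sars2pack) should succeed.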

Available datasets

name	accessor	data_type	geographical	geospatial	region	resolution	url
United States county-level geographic details	us_county_geo_details	c("demographics", "geographic")	TRUE	TRUE	United States	admin2	LINK
OECD International Unemployment Data	oecd_unemployment_data	c("economics", "time series")	TRUE	FALSE	World	admin0	LINK
healthdata.org COVID-19 Mobility Observations and Projections	healthdata_mobility_data	c("mobility", "time series", "projections")	TRUE	FALSE	International	c("admin0", "admin1")	LINK
healthdata.org COVID-19 Testing Observations and Projections	healthdata_testing_data	c("testing", "time series", "projections")	TRUE	FALSE	International	c("admin0", "admin1")	LINK
Our World In Data testing and cases reporting	owid_data	c("time series", "cases", "deaths", "testing")	TRUE	FALSE	World	admin0	LINK
CovidTracker data	covidtracker_data	c("time series", "cases", "deaths", "testing")	TRUE	FALSE	United States	admin1	LINK
European CDC world tracking	ecdc_data	c("time series", "cases", "deaths")	TRUE	FALSE	World	admin0	LINK
EU data Github aggregator	eu_data_cache_data	c("time series", "cases", "deaths")	TRUE	FALSE	Europe	c("admin0", "admin1")	LINK
USA Facts	usa_facts_data	c("time series", "cases", "deaths")	TRUE	FALSE	United States	admin1	LINK
Johns Hopkins dataset	jhu_data	c("time series", "cases", "deaths")	TRUE	FALSE	World	admin0	LINK
Johns Hopkins US-centric data	jhu_us_data	c("time series", "cases", "deaths")	TRUE	FALSE	United States	c("admin1", "admin2")	LINK
New York Times county level data	nytimes_county_data	c("time series", "cases", "deaths")	TRUE	FALSE	United States	admin2	LINK
New York Times state level data	nytimes_state_data	c("time series", "cases", "deaths")	TRUE	FALSE	United States	admin1	LINK
The Economist: Excess deaths during COVID pandemic	economist_excess_deaths	c("time series", "deaths", "excess deaths")	TRUE	FALSE	International	c("admin0", "admin1")	LINK
The Financial Times: Excess deaths during COVID pandemic	financial_times_excess_deaths	c("time series", "deaths", "excess deaths")	TRUE	FALSE	International	c("admin0", "admin1")	LINK
US CDC excess deaths dataset	cdc_excess_deaths	c("time series", "deaths", "excess deaths")	TRUE	FALSE	United States	admin1	LINK
Descartes Labs Mobility Data	descartes_mobility_data	c("time series", "mobility")	TRUE	FALSE	United States	admin1	LINK
Apple mobility data from maps	apple_mobility_data	c("time series", "mobility")	TRUE	FALSE	World	c("admin0", "admin1", "admin2", "admin3")	LINK
Healthdata.org projections of hospital utilization and deaths	healthdata_projections_data	c("time series", "projections", "cases", "deaths")	TRUE	FALSE	c("United States", "World")	c("admin1", "admin2")	LINK
Healthdata.org mobility data	healthdata_mobility_data	c("time series", "projections", "mobility")	TRUE	FALSE	c("United States", "World")	c("admin1", "admin2")	LINK
United States CDC Social Vulnerability Index	cdc_social_vulnerability_index	demographics	TRUE	FALSE	United States	admin2	LINK
US county health rankings from https://www.countyhealthrankings.org	us_county_health_rankings	demographics	TRUE	FALSE	United States	c("admin0", "admin1", "admin2")	LINK
Country metadata from restcountries.eu	country_metadata	demographics	TRUE	FALSE	World	admin0	LINK
Extensive United States hospital capabilities	us_hospital_details	healthcare capacity	TRUE	TRUE	United States	individual hospital	LINK
Kaiser Family Foundation ICU bed data	kff_icu_beds	healthcare capacity	TRUE	TRUE	United States	individual hospital	LINK
CovidCare United States Healthcare Capacity	us_healthcare_capacity	healthcare capacity	TRUE	TRUE	United States	individual hospital	LINK
GISAID metadata from thousands of SARS-CoV-2 sequences	cov_glue_lineage_data	line list	TRUE	FALSE	World	multiple	LINK
beoutbreakprepared	beoutbreakprepared_data	line list	TRUE	FALSE	World	patient	LINK
Published epidemic parameters for COVID-19	param_estimates_published	miscellaneous	FALSE	FALSE	list()	list()	LINK
Google mobility data	google_mobility_data	mobility	TRUE	FALSE	World	c("admin0", "admin1", "admin2")	LINK
Newick tree from thousands of SARS-CoV-2 sequences	cov_glue_newick_data	phylogenetic	FALSE	FALSE	World	multiple	LINK
Aggregated projections from US CDC	cdc_aggregated_projections	projections	TRUE	FALSE	list()	c("admin0", "admin1")	LINK
CoronaNet government response database	coronanet_government_response_data	public policy	TRUE	FALSE	World	c("admin0", "admin1")	LINK
Oxford Government Policy Intervention time series	government_policy_timeline	public policy	TRUE	FALSE	World	admin0	LINK
United States social distancing policies	us_state_distancing_policy	public policy	TRUE	FALSE	United States	admin1	LINK
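Several cells in the table hold more than one value (for example, a data_type of both time series and mobility), which suggests that the catalog is a data frame with list columns. Assuming that layout, the sketch below shows one way to find every accessor tagged with a given data type. The small catalog data frame is built by hand from four rows of the table and is illustrative only; it is not produced by a sars2pack function.

```r
# Hand-built stand-in for the dataset catalog (four rows from the table above).
catalog <- data.frame(
  accessor = c("apple_mobility_data", "google_mobility_data",
               "jhu_data", "descartes_mobility_data"),
  stringsAsFactors = FALSE
)
# data_type is a list column: each cell is a character vector of tags.
catalog$data_type <- list(
  c("time series", "mobility"),
  "mobility",
  c("time series", "cases", "deaths"),
  c("time series", "mobility")
)

# TRUE for rows whose data_type tags include "mobility".
is_mobility <- vapply(catalog$data_type,
                      function(tags) "mobility" %in% tags,
                      logical(1))
mobility_accessors <- catalog$accessor[is_mobility]
print(mobility_accessors)
```

The same pattern works for the region and resolution columns, which also hold multiple values per cell.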

Case tracking

Up-to-date tracking of city, county, state, national, and international confirmed cases, deaths, and testing is critical to driving policy, implementing interventions, and measuring their effectiveness. Case-tracking datasets include a date, a count, and usually additional fields such as the reporting location.

Accessing case-tracking datasets is typically done with one function per dataset. The example here uses data from the European Centre for Disease Prevention and Control (ECDC).

ecdc = ecdc_data()

Get a quick overview of the dataset.

head(ecdc)

## # A tibble: 6 x 8
## # Groups:   location_name, subset [6]
##   date       location_name iso2c iso3c population_2019 continent subset    count
##   <date>     <chr>         <chr> <chr>           <dbl> <chr>     <chr>     <dbl>
## 1 2019-12-31 Afghanistan   AF    AFG          38041757 Asia      confirmed     0
## 2 2019-12-31 Afghanistan   AF    AFG          38041757 Asia      deaths        0
## 3 2019-12-31 Algeria       DZ    DZA          43053054 Africa    confirmed     0
## 4 2019-12-31 Algeria       DZ    DZA          43053054 Africa    deaths        0
## 5 2019-12-31 Armenia       AM    ARM           2957728 Europe    confirmed     0
## 6 2019-12-31 Armenia       AM    ARM           2957728 Europe    deaths        0
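The tidy layout shown above (one row per date, location, and subset) makes it straightforward to derive daily new counts from a cumulative count column. The following base-R sketch uses made-up numbers for a hypothetical location, not real ECDC data, and assumes that count is cumulative, as the rest of this section treats it.

```r
# Toy data in the same long layout as the ECDC tibble (values are invented).
toy <- data.frame(
  date = rep(as.Date("2020-03-01") + 0:4, times = 2),
  location_name = "Exampleland",
  subset = rep(c("confirmed", "deaths"), each = 5),
  count = c(0, 3, 7, 7, 12, 0, 0, 1, 1, 2),
  stringsAsFactors = FALSE
)

# Differences are only meaningful when rows are in date order within a group.
toy <- toy[order(toy$location_name, toy$subset, toy$date), ]

# Within each location/subset group, new_count is the day-over-day change;
# the first day keeps its cumulative value.
toy$new_count <- ave(toy$count, toy$location_name, toy$subset,
                     FUN = function(x) c(x[1], diff(x)))
print(toy)
```

With real data, downward revisions of a cumulative series would show up as negative daily values, so it is worth checking for them before plotting.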

The ecdc dataset is just a data.frame (actually, a tibble), so standard R or tidyverse functions can answer basic questions with little code. The next code block builds top10, the rows with the highest cumulative death counts recorded to date. Note that if you run this on your own computer, the data will reflect the latest available values.

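library(dplyr)
# Keep death counts only, then find each country's maximum cumulative count.
top10 = ecdc %>% filter(subset=='deaths') %>%
    group_by(location_name) %>%
    # Ties are kept: a country can contribute several rows with the same count.
    filter(count==max(count)) %>%
    arrange(desc(count)) %>%
    # head() takes the first 10 rows overall, not 10 per group.
    head(10) %>% select(-starts_with('iso'),-continent,-subset) %>%
    mutate(rate_per_100k = 1e5*count/population_2019)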

Finally, present a nice table of those countries:

knitr::kable(
    top10,
    caption = "Reported COVID-19-related deaths in ten most affected countries.",
    format = 'pandoc')

Reported COVID-19-related deaths in ten most affected countries.
date	location_name	population_2019	count	rate_per_100k
2020-07-06	United_States_of_America	329064917	129947	39.489776
2020-07-06	Brazil	211049519	64867	30.735441
2020-07-06	United_Kingdom	66647112	44220	66.349462
2020-07-06	Italy	60359546	34861	57.755570
2020-07-06	Mexico	127575529	30639	24.016361
2020-07-04	France	67012883	29893	44.607841
2020-07-05	France	67012883	29893	44.607841
2020-07-06	France	67012883	29893	44.607841
2020-05-24	Spain	46937060	28752	61.256500
2020-07-06	India	1366417756	19693	1.441214
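France appears three times in the table because filter(count==max(count)) keeps every date on which the cumulative count equals its maximum, so the ten rows cover eight distinct countries. One way to keep a single row per country is to break ties by taking the latest date. The base-R sketch below does this on five rows copied from the table; in dplyr 1.0.0 or later, slice_max(count, n = 1, with_ties = FALSE) is another option inside the pipeline. This is an illustrative alternative, not the code that produced the table.

```r
# Rows copied from the table above; France ties on three consecutive dates.
deaths <- data.frame(
  date = as.Date(c("2020-07-06", "2020-07-04", "2020-07-05",
                   "2020-07-06", "2020-05-24")),
  location_name = c("Italy", "France", "France", "France", "Spain"),
  population_2019 = c(60359546, 67012883, 67012883, 67012883, 46937060),
  count = c(34861, 29893, 29893, 29893, 28752),
  stringsAsFactors = FALSE
)

# Sort by count (largest first), then by date (latest first) to break ties.
deaths <- deaths[order(-deaths$count, -as.numeric(deaths$date)), ]

# Keep only the first row seen for each country.
one_per_country <- deaths[!duplicated(deaths$location_name), ]
rownames(one_per_country) <- NULL

# Deaths per 100,000 inhabitants, as in the pipeline above.
one_per_country$rate_per_100k <-
  1e5 * one_per_country$count / one_per_country$population_2019
print(one_per_country)
```

Applied before head(10), a deduplication step like this would yield ten distinct countries.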

Track the spread of the pandemic throughout the world by plotting cumulative deaths reported for the countries in the table above.

ecdc_top10 = ecdc %>% filter(location_name %in% top10$location_name & subset=='deaths')
plot_epicurve(ecdc_top10,
              filter_expression = count > 10, 
              color='location_name')

Comparing the features of disease spread is easiest if all curves are shifted to “start” at the same absolute level. In this case, shift the origin for each country to the first time point at which more than 100 cumulative deaths had been reported. Note how some curves cross others, which suggests less infection control at the same relative time in the pandemic for that country (e.g., Brazil).

ecdc_top10 %>% align_to_baseline(count>100,group_vars=c('location_name')) %>%
    plot_epicurve(date_column = 'index',color='location_name')
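To make the alignment step concrete, the sketch below shows what shifting to a common baseline involves: for each location, find the first date on which the count exceeds the threshold, drop earlier rows, and number the remaining days from zero. This is a base-R illustration on invented data. The function name align_sketch is hypothetical, and the code is not the implementation of align_to_baseline, whose internals may differ.

```r
# Hypothetical helper: align each location to its first day above `threshold`.
align_sketch <- function(df, threshold = 100) {
  pieces <- lapply(split(df, df$location_name), function(g) {
    g <- g[order(g$date), ]
    above <- g$count > threshold
    if (!any(above)) return(NULL)          # location never crosses the threshold
    start <- g$date[which(above)[1]]
    g <- g[g$date >= start, ]
    g$index <- as.integer(g$date - start)  # days since crossing the threshold
    g
  })
  pieces <- Filter(Negate(is.null), pieces)
  out <- do.call(rbind, unname(pieces))
  rownames(out) <- NULL
  out
}

# Invented cumulative counts for two hypothetical locations.
toy <- data.frame(
  date = c(as.Date("2020-03-01") + 0:4, as.Date("2020-03-10") + 0:3),
  location_name = c(rep("A", 5), rep("B", 4)),
  count = c(50, 90, 120, 200, 310, 10, 101, 150, 400),
  stringsAsFactors = FALSE
)

aligned <- align_sketch(toy, threshold = 100)
print(aligned)
```

After alignment both locations start at index 0, so their curves can be compared day for day even though they crossed the threshold more than a week apart.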
