Add daily incidence to cumulative case counts data.frame

For a data.frame that contains cumulative case counts over time, optionally with extra columns such as location, this function adds a column holding the corresponding daily incidence counts.

add_incidence_column(
  df,
  date_column = "date",
  count_column = "count",
  incidence_col_name = "inc",
  grouping_columns = c()
)

Arguments

df: a data.frame with at least two columns: a date (or other ordered quantity) column and a cumulative count column. Data of this form typically come from one of the case-count datasets.
date_column: character(1) giving the name of the date column in the dataset
count_column: character(1) giving the column name of the cumulative counts in the dataset
incidence_col_name: character(1) giving the desired column name to add
grouping_columns: character() vector with the column names to use for grouping when calculating the incidence data. See examples for details. Be very careful to include the appropriate columns in grouping, or results will be misleading.
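
The grouping caveat can be illustrated with a minimal sketch on made-up data. It uses plain dplyr rather than add_incidence_column itself, so it shows the general pitfall and not the package's internals; the toy data frame and its region column are hypothetical.

```r
library(dplyr)

# Hypothetical cumulative counts for two regions, stacked in one data.frame
toy <- data.frame(
  date   = rep(as.Date("2020-03-01") + 0:2, times = 2),
  region = rep(c("A", "B"), each = 3),
  count  = c(10, 12, 15, 1, 2, 4)
)

# Without grouping, the first row of region B is differenced against
# the last row of region A, giving a meaningless negative incidence.
ungrouped <- toy %>%
  mutate(inc = count - dplyr::lag(count))

# With grouping, each region is differenced only against its own history.
grouped <- toy %>%
  group_by(region) %>%
  mutate(inc = count - dplyr::lag(count)) %>%
  ungroup()

ungrouped$inc
grouped$inc
```

In the ungrouped result the fourth value is 1 - 15 = -14, while the grouped result has NA there because region B has no earlier day to subtract.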

Value

a data.frame: the input with an added incidence column named by incidence_col_name

Details

Multiple datasets conform to the cumulative counts form, with a date and count column of cumulative cases over time. Other columns may be present.

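This function groups the data by the grouping_columns and then, within each group, subtracts the previous day's cumulative count from each day's count. The result is the new case count for each day. The first row of each group has no previous day, so its incidence is NA.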
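
As a rough illustration of the arithmetic for a single group, the following base R sketch differences a made-up cumulative series and checks that summing the incidence recovers the original counts. The vector values are hypothetical, and this is not the function's actual implementation.

```r
# Hypothetical cumulative counts for one location on consecutive days
cumulative <- c(0, 2, 5, 9)

# Daily incidence: each day's count minus the previous day's count.
# The first day has no predecessor, so it is NA.
incidence <- c(NA, diff(cumulative))

# Sanity check: the first count plus the running sum of incidence
# reproduces the cumulative series.
recovered <- cumulative[1] + cumsum(c(0, incidence[-1]))

incidence
recovered
```

Here incidence is NA, 2, 3, 4 and recovered equals the original cumulative vector.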

Author

Sean Davis seandavi@gmail.com

Examples

library(ggplot2)
library(dplyr)
#> 
#> Attaching package: ‘dplyr’
#> The following object is masked from ‘package:MASS’:
#> 
#>     select
#> The following objects are masked from ‘package:stats’:
#> 
#>     filter, lag
#> The following objects are masked from ‘package:base’:
#> 
#>     intersect, setdiff, setequal, union

j = jhu_data()
head(j)
#> # A tibble: 6 × 7
#>   ProvinceState CountryRegion   Lat  Long date       count subset   
#>   <chr>         <chr>         <dbl> <dbl> <date>     <dbl> <chr>    
#> 1 NA            Afghanistan    33.9  67.7 2020-01-22     0 confirmed
#> 2 NA            Afghanistan    33.9  67.7 2020-01-23     0 confirmed
#> 3 NA            Afghanistan    33.9  67.7 2020-01-24     0 confirmed
#> 4 NA            Afghanistan    33.9  67.7 2020-01-25     0 confirmed
#> 5 NA            Afghanistan    33.9  67.7 2020-01-26     0 confirmed
#> 6 NA            Afghanistan    33.9  67.7 2020-01-27     0 confirmed
colnames(j)
#> [1] "ProvinceState" "CountryRegion" "Lat"           "Long"         
#> [5] "date"          "count"         "subset"       

add_incidence_column(j, grouping_columns=c('CountryRegion','ProvinceState'))
#> # A tibble: 701,406 × 8
#> # Groups:   CountryRegion, ProvinceState [285]
#>    ProvinceState CountryRegion   Lat  Long date       count subset      inc
#>    <chr>         <chr>         <dbl> <dbl> <date>     <dbl> <chr>     <dbl>
#>  1 NA            Afghanistan    33.9  67.7 2020-01-22     0 confirmed    NA
#>  2 NA            Afghanistan    33.9  67.7 2020-01-23     0 confirmed     0
#>  3 NA            Afghanistan    33.9  67.7 2020-01-24     0 confirmed     0
#>  4 NA            Afghanistan    33.9  67.7 2020-01-25     0 confirmed     0
#>  5 NA            Afghanistan    33.9  67.7 2020-01-26     0 confirmed     0
#>  6 NA            Afghanistan    33.9  67.7 2020-01-27     0 confirmed     0
#>  7 NA            Afghanistan    33.9  67.7 2020-01-28     0 confirmed     0
#>  8 NA            Afghanistan    33.9  67.7 2020-01-29     0 confirmed     0
#>  9 NA            Afghanistan    33.9  67.7 2020-01-30     0 confirmed     0
#> 10 NA            Afghanistan    33.9  67.7 2020-01-31     0 confirmed     0
#> # … with 701,396 more rows

# get top 10 countries by cumulative
# number of deaths
j_top_10 = j %>%
    filter(subset=='deaths') %>%
    dplyr::group_by(CountryRegion) %>%
    dplyr::summarize(count = max(count)) %>%
    dplyr::arrange(dplyr::desc(count)) %>%
    head(10)

j_top_10
#> # A tibble: 10 × 2
#>    CountryRegion   count
#>    <chr>           <dbl>
#>  1 US             997526
#>  2 Brazil         664374
#>  3 India          524093
#>  4 Russia         369350
#>  5 Mexico         324350
#>  6 Peru           212946
#>  7 United Kingdom 176212
#>  8 Italy          164489
#>  9 Indonesia      156381
#> 10 France         143524

# The JHU data divides some countries into
# regions, so we collapse regions to the
# country level by summing over date/country
j = j %>% filter(CountryRegion %in% j_top_10[['CountryRegion']] & subset=='deaths') %>%
    dplyr::group_by(date, CountryRegion) %>%
    dplyr::summarize(count = sum(count))
#> `summarise()` has grouped output by 'date'. You can override using the
#> `.groups` argument.

j
#> # A tibble: 8,380 × 3
#> # Groups:   date [838]
#>    date       CountryRegion  count
#>    <date>     <chr>          <dbl>
#>  1 2020-01-22 Brazil             0
#>  2 2020-01-22 France             0
#>  3 2020-01-22 India              0
#>  4 2020-01-22 Indonesia          0
#>  5 2020-01-22 Italy              0
#>  6 2020-01-22 Mexico             0
#>  7 2020-01-22 Peru               0
#>  8 2020-01-22 Russia             0
#>  9 2020-01-22 US                 0
#> 10 2020-01-22 United Kingdom     0
#> # … with 8,370 more rows

# Add an incidence column to the cumulative dataset
j_inc = add_incidence_column(j, grouping_columns='CountryRegion')

j_inc
#> # A tibble: 8,380 × 4
#> # Groups:   CountryRegion [10]
#>    date       CountryRegion  count   inc
#>    <date>     <chr>          <dbl> <dbl>
#>  1 2020-01-22 Brazil             0    NA
#>  2 2020-01-22 France             0    NA
#>  3 2020-01-22 India              0    NA
#>  4 2020-01-22 Indonesia          0    NA
#>  5 2020-01-22 Italy              0    NA
#>  6 2020-01-22 Mexico             0    NA
#>  7 2020-01-22 Peru               0    NA
#>  8 2020-01-22 Russia             0    NA
#>  9 2020-01-22 US                 0    NA
#> 10 2020-01-22 United Kingdom     0    NA
#> # … with 8,370 more rows

j_inc %>%
    dplyr::filter(count>0) %>%
    plot_epicurve(color='CountryRegion', case_column='inc') +
        geom_smooth() +
        ggtitle('Daily death counts in the top 10 most infected countries')
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'



# Hospitalizations by day in Maryland
covidtracker_data() %>%
    filter(state=='MD') %>%
    add_incidence_column(count_column='hospitalized') %>%
    ggplot(aes(x=date,y=inc)) + geom_smooth() +
    ylab("New Hospitalizations per day") +
    ggtitle('Hospitalizations in Maryland', subtitle = 'From covidtracker')
#> Warning: One or more parsing issues, see `problems()` for details
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'
#> Warning: Removed 22 rows containing non-finite values (stat_smooth).
