Abstract

Bioconductor has a rich ecosystem of metadata around packages, usage, and build status. This package is a simple collection of functions to access that metadata from R in a tidy data format. The goal is to expose metadata for data mining and value-added functionality such as package searching, text mining, and analytics on packages.

Introduction

Functionality includes access to:

  • Download statistics
  • General package listing
  • Build reports
  • Package dependency graphs
  • Vignettes

Build reports

The Bioconductor build reports are available online as HTML pages, a format that is awkward to process programmatically. The biocBuildReport function does some heroic parsing of the HTML to produce a tidy data.frame for further processing in R.

## # A tibble: 6 × 11
##   pkg   author         version git_last_commit git_last_commit_date Deprecated
##   <chr> <chr>          <chr>   <chr>           <dttm>               <lgl>     
## 1 a4    Laure Cougnaud 1.44.0  5b0fc5a         2022-04-26 11:07:42  FALSE     
## 2 a4    Laure Cougnaud 1.44.0  5b0fc5a         2022-04-26 11:07:42  FALSE     
## 3 a4    Laure Cougnaud 1.44.0  5b0fc5a         2022-04-26 11:07:42  FALSE     
## 4 a4    Laure Cougnaud 1.44.0  5b0fc5a         2022-04-26 11:07:42  FALSE     
## 5 a4    Laure Cougnaud 1.44.0  5b0fc5a         2022-04-26 11:07:42  FALSE     
## 6 a4    Laure Cougnaud 1.44.0  5b0fc5a         2022-04-26 11:07:42  FALSE     
## # … with 5 more variables: PackageStatus <chr>, node <chr>, stage <chr>,
## #   result <chr>, bioc_version <chr>
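The node, stage, and result columns suggest that each row describes one package at one build stage on one build node, which would explain why a4 appears several times above. As a minimal sketch of how such a table might be filtered for problems, the block below uses a small hand-made data.frame in place of a live biocBuildReport() call. The node and stage values are made up for illustration, and treating "OK" as the passing result is an assumption.

```r
# Toy stand-in for the output of biocBuildReport(), with a subset of columns.
# The node, stage, and result values are illustrative, not real report data.
report <- data.frame(
  pkg    = c("a4", "a4", "a4", "a4"),
  node   = c("nodeA", "nodeA", "nodeB", "nodeB"),
  stage  = c("install", "checksrc", "install", "checksrc"),
  result = c("OK", "OK", "OK", "ERROR"),
  stringsAsFactors = FALSE
)

# Keep only the rows whose result is not "OK" (subset() also drops NA results).
problems <- subset(report, result != "OK")
print(problems)

# Cross-tabulate packages against results for a quick overview.
overview <- table(report$pkg, report$result)
print(overview)
```

The same two steps should apply unchanged to the real table returned by biocBuildReport(), provided the column names match those shown above.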

Personal build report

Because developers may be interested in a quick view of their own packages, there is a simple function, problemPage, to produce an HTML report of the build status of packages matching a given author regex supplied to the authorPattern argument. The default is to report only “problem” build statuses (ERROR, WARNING).

problemPage(authorPattern = "V.*Carey")

In similar fashion, maintainers of packages that have many downstream packages that depend on them may wish to check that a change they introduced hasn’t suddenly broken a large number of these. You can use the dependsOn argument to produce the summary report of those packages that “depend on” the given package.

problemPage(dependsOn = "limma")

When run in an interactive environment, the problemPage function will open a browser window for user interaction. Note that if you want to include all your package results, not just the broken ones, simply specify includeOK = TRUE.
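Since authorPattern is a regular expression, it can help to check what a pattern selects before generating a report. The sketch below assumes the matching behaves like base R's grepl(), which is an assumption about the implementation rather than something stated here, and the author strings are hypothetical.

```r
# Hypothetical author strings used only to illustrate the pattern.
authors <- c("Vincent J. Carey", "V. Carey", "Laure Cougnaud")

# "V.*Carey" matches a "V", then any run of characters, then "Carey".
matches <- grepl("V.*Carey", authors)
print(authors[matches])

# The corresponding report, including passing packages, would then be:
# problemPage(authorPattern = "V.*Carey", includeOK = TRUE)
```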

Download statistics

Bioconductor supplies download statistics for all packages. The biocDownloadStats function retrieves all available download statistics for Experiment Data, Annotation Data, and Software packages. The results are returned as a tidy data.frame for further analysis.

## # A tibble: 6 × 7
##   pkgType  Package  Year Month Nb_of_distinct_IPs Nb_of_downloads Date      
##   <chr>    <chr>   <int> <chr>              <int>           <int> <date>    
## 1 software ABarray  2022 Jan                   51              78 2022-01-01
## 2 software ABarray  2022 Feb                   33              68 2022-02-01
## 3 software ABarray  2022 Mar                   62              99 2022-03-01
## 4 software ABarray  2022 Apr                   55              85 2022-04-01
## 5 software ABarray  2022 May                   66              84 2022-05-01
## 6 software ABarray  2022 Jun                   16              21 2022-06-01

The download statistics reported are for all available versions of a package. There are no separate, publicly available statistics broken down by version.
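Because the table has one row per package and month, yearly totals are a simple aggregation. The sketch below uses the six ABarray rows from the table above, copied by hand, rather than a live biocDownloadStats() call. The variable names stats and yearly are illustrative.

```r
# Rows copied from the table above (a subset of the columns).
stats <- data.frame(
  Package            = "ABarray",
  Year               = 2022L,
  Month              = c("Jan", "Feb", "Mar", "Apr", "May", "Jun"),
  Nb_of_distinct_IPs = c(51L, 33L, 62L, 55L, 66L, 16L),
  Nb_of_downloads    = c(78L, 68L, 99L, 85L, 84L, 21L),
  stringsAsFactors   = FALSE
)

# Total downloads per package and year.
yearly <- aggregate(Nb_of_downloads ~ Package + Year, data = stats, FUN = sum)
print(yearly)

# Downloads per distinct IP address in each month.
stats$downloads_per_ip <- stats$Nb_of_downloads / stats$Nb_of_distinct_IPs
print(stats[, c("Month", "downloads_per_ip")])
```

Note that summing Nb_of_distinct_IPs across months would overcount users, since the same address can appear in more than one month.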

The majority of Bioconductor Software packages are also available through other channels such as Anaconda, which also provides download statistics for packages installed from its repositories. Access to these counts is provided by the anacondaDownloadStats function:

## # A tibble: 6 × 7
##   Package Year  Month Nb_of_distinct_IPs Nb_of_downloads repo     Date      
##   <chr>   <chr> <chr>              <int>           <dbl> <chr>    <date>    
## 1 a4      2019  Apr                   NA               1 Anaconda 2019-04-01
## 2 a4      2019  Aug                   NA              12 Anaconda 2019-08-01
## 3 a4      2019  Dec                   NA              20 Anaconda 2019-12-01
## 4 a4      2019  Feb                   NA               5 Anaconda 2019-02-01
## 5 a4      2019  Jan                   NA               1 Anaconda 2019-01-01
## 6 a4      2019  Jul                   NA              18 Anaconda 2019-07-01

Note that Anaconda do not provide counts for distinct IP addresses, but this column is included for compatibility with the Bioconductor count tables.
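The rows shown above are in alphabetical month order (Apr, Aug, Dec, ...) rather than calendar order. A small sketch of sorting chronologically with the Date column, using rows copied by hand from the table above instead of a live anacondaDownloadStats() call:

```r
# Rows copied from the Anaconda table above (a subset of the columns).
anaconda <- data.frame(
  Package          = "a4",
  Month            = c("Apr", "Aug", "Dec", "Feb", "Jan", "Jul"),
  Nb_of_downloads  = c(1, 12, 20, 5, 1, 18),
  Date             = as.Date(c("2019-04-01", "2019-08-01", "2019-12-01",
                               "2019-02-01", "2019-01-01", "2019-07-01")),
  stringsAsFactors = FALSE
)

# Sort chronologically using the Date column.
anaconda <- anaconda[order(anaconda$Date), ]
print(anaconda[, c("Month", "Nb_of_downloads", "Date")])
```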

Package details

The R DESCRIPTION file contains a plethora of information regarding package authors, dependencies, versions, etc. In a repository such as Bioconductor, these details are available in bulk for all included packages. The biocPkgList function returns a data.frame with a row for each package. A great deal of information is available, as evidenced by the column names of the results.

bpi = biocPkgList()
colnames(bpi)
##  [1] "Package"               "Version"               "Depends"              
##  [4] "Suggests"              "License"               "MD5sum"               
##  [7] "NeedsCompilation"      "Title"                 "Description"          
## [10] "biocViews"             "Author"                "Maintainer"           
## [13] "git_url"               "git_branch"            "git_last_commit"      
## [16] "git_last_commit_date"  "Date/Publication"      "source.ver"           
## [19] "win.binary.ver"        "mac.binary.ver"        "vignettes"            
## [22] "vignetteTitles"        "hasREADME"             "hasNEWS"              
## [25] "hasINSTALL"            "hasLICENSE"            "Rfiles"               
## [28] "dependencyCount"       "Imports"               "Enhances"             
## [31] "dependsOnMe"           "suggestsMe"            "VignetteBuilder"      
## [34] "URL"                   "SystemRequirements"    "BugReports"           
## [37] "importsMe"             "Archs"                 "LinkingTo"            
## [40] "Video"                 "linksToMe"             "PackageStatus"        
## [43] "License_restricts_use" "OS_type"               "organism"             
## [46] "License_is_FOSS"

Some of the variables are parsed to produce list columns.

head(bpi)
## # A tibble: 6 × 46
##   Package     Version Depends   Suggests  License MD5sum  NeedsCompilation Title
##   <chr>       <chr>   <list>    <list>    <chr>   <chr>   <chr>            <chr>
## 1 a4          1.44.0  <chr [5]> <chr [6]> GPL-3   cc696d… no               Auto…
## 2 a4Base      1.44.0  <chr [2]> <chr [4]> GPL-3   094c0a… no               Auto…
## 3 a4Classif   1.44.0  <chr [2]> <chr [4]> GPL-3   1f6e6a… no               Auto…
## 4 a4Core      1.44.0  <chr [1]> <chr [2]> GPL-3   9413ac… no               Auto…
## 5 a4Preproc   1.44.0  <chr [1]> <chr [4]> GPL-3   b5b292… no               Auto…
## 6 a4Reporting 1.44.0  <chr [1]> <chr [2]> GPL-3   6bd646… no               Auto…
## # … with 38 more variables: Description <chr>, biocViews <list>, Author <list>,
## #   Maintainer <list>, git_url <chr>, git_branch <chr>, git_last_commit <chr>,
## #   git_last_commit_date <chr>, `Date/Publication` <chr>, source.ver <chr>,
## #   win.binary.ver <chr>, mac.binary.ver <chr>, vignettes <list>,
## #   vignetteTitles <list>, hasREADME <chr>, hasNEWS <chr>, hasINSTALL <chr>,
## #   hasLICENSE <chr>, Rfiles <list>, dependencyCount <chr>, Imports <list>,
## #   Enhances <list>, dependsOnMe <list>, suggestsMe <list>, …
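A list column holds one character vector per package, so per-element helpers such as lengths() and vapply() are needed in place of plain comparisons. The sketch below works on a toy table; the package names pkgA, pkgB, and pkgC and their imports are hypothetical stand-ins for rows of biocPkgList().

```r
# Toy table with a list column; package names are hypothetical.
toy <- data.frame(Package = c("pkgA", "pkgB", "pkgC"), stringsAsFactors = FALSE)
toy$Imports <- list(c("pkgB", "pkgC"), "pkgC", character(0))

# Number of imports per package.
toy$n_imports <- lengths(toy$Imports)
print(toy[, c("Package", "n_imports")])

# Packages whose Imports entry contains "pkgC".
has_pkgC <- vapply(toy$Imports, function(x) "pkgC" %in% x, logical(1))
imports_pkgC <- toy$Package[has_pkgC]
print(imports_pkgC)
```

The same pattern should carry over to the list columns of the real table, such as Depends, Suggests, or biocViews.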

As a simple example of how these columns can be used, the following code extracts the importsMe column to find the packages that import the GEOquery package.

library(dplyr)
bpi = biocPkgList()
bpi %>% 
    filter(Package=="GEOquery") %>%
    pull(importsMe) %>%
    unlist()
##  [1] "bigmelon"                         "BioPlex"                         
##  [3] "ChIPXpress"                       "conclus"                         
##  [5] "crossmeta"                        "DExMA"                           
##  [7] "EGAD"                             "GAPGOM"                          
##  [9] "GEOexplorer"                      "MACPET"                          
## [11] "minfi"                            "MoonlightR"                      
## [13] "phantasus"                        "recount"                         
## [15] "SRAdb"                            "BeadArrayUseCases"               
## [17] "GSE13015"                         "healthyControlsPresenceChecker"  
## [19] "easyDifferentialGeneCoexpression" "geneExpressionFromGEO"           
## [21] "MetaIntegrator"

Package Explorer

For the end user of Bioconductor, an analysis often starts with finding a package or set of packages that perform required tasks or are tailored to a specific operation or data type. The biocExplore() function implements an interactive bubble visualization with filtering based on biocViews terms. Bubbles are sized based on download statistics. Tooltip and detail-on-click capabilities are included. To start a local session: