Approaches to accessing ClinicalTrials.gov data

I have been attending the biannual Clinical Informatics for Cancer Centers (CI4CC) conference, where there has been a fair amount of discussion of ClinicalTrials.gov as a resource for enhancing patient engagement, trial recruitment, and results reporting. There are a number of approaches to searching ClinicalTrials.gov data and accessing it in bulk. The ones that I address here are:

  1. The CTRP RESTful API.
  2. The ClinicalTrials.gov API.
  3. The Access to Aggregate Content of ClinicalTrials.gov (AACT) database.
  4. The OpenTrials API.

The CTRP RESTful API

One of my original reasons for attending was to help run a hackathon centered around the newest incarnation of the API for accessing ClinicalTrials.gov data. The details of the API are described in text at the API endpoint. Node.js-based code for the server side is available at the GitHub repository. In preparation for the workshop discussion, I prepared a client package using R, the ClinicalTrialsAPI package. The RESTful API is very simple, so a quick code run-through using the ClinicalTrialsAPI package will give a sense of its capabilities.

The package is available on GitHub and is MIT-licensed.

# devtools::install_github('seandavi/ClinicalTrialsAPI')
library('ClinicalTrialsAPI')

After installing and loading the package, we can get a sense of the data available using a simple query. The next code block calls ct_search() with no arguments.

res = ct_search()
trialresults = res$trials
str(trialresults[1],list.len=10)
## List of 1
##  $ :List of 47
##   ..$ nci_id                              : chr "NCI-2014-01508"
##   ..$ nct_id                              : chr "NCT02193282"
##   ..$ protocol_id                         : chr "A081105"
##   ..$ ccr_id                              : NULL
##   ..$ ctep_id                             : chr "A081105"
##   ..$ dcp_id                              : NULL
##   ..$ other_ids                           :List of 1
##   .. ..$ :List of 2
##   .. .. ..$ name : chr "Study Protocol Other Identifier"
##   .. .. ..$ value: chr "CALGB A081105"
##   ..$ associated_studies                  :List of 1
##   .. ..$ :List of 2
##   .. .. ..$ study_id     : chr "NCI-2014-01509"
##   .. .. ..$ study_id_type: chr "NCI"
##   ..$ amendment_date                      : chr "2015-12-11T00:00:00"
##   ..$ current_trial_status                : chr "Active"
##   .. [list output truncated]
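The trials come back as a nested list, and some fields are NULL (ccr_id above), so pulling a few fields into a flat table takes a little care. Here is a minimal base R sketch, assuming each trial is a list shaped like the str() output above. The helpers get_field and trials_to_df are hypothetical and not part of ClinicalTrialsAPI, and the second toy trial is invented for illustration.

```r
# Toy stand-in for res$trials. The first record copies values from the output
# above; the second record is invented for illustration.
trials <- list(
  list(nci_id = "NCI-2014-01508", nct_id = "NCT02193282",
       ccr_id = NULL, current_trial_status = "Active"),
  list(nci_id = "NCI-0000-00000", nct_id = "NCT00000000",
       ccr_id = "00-C-0000", current_trial_status = "Closed")
)

# Hypothetical helper: return one scalar field, with NULL mapped to NA.
get_field <- function(trial, field) {
  value <- trial[[field]]
  if (is.null(value)) NA_character_ else as.character(value)
}

# Hypothetical helper: build a data.frame with one column per requested field.
# Only suitable for scalar fields, not nested ones such as other_ids.
trials_to_df <- function(trials, fields) {
  columns <- lapply(fields, function(f) {
    vapply(trials, get_field, character(1), field = f)
  })
  names(columns) <- fields
  as.data.frame(columns, stringsAsFactors = FALSE)
}

flat <- trials_to_df(trials, c("nct_id", "ccr_id", "current_trial_status"))
print(flat)
```

Nested fields such as other_ids or associated_studies would need their own handling, for example one data.frame per nested field keyed by nct_id.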

Filtering results is pretty straightforward, as only limited comparators are available (see docs for details). The fields() function returns all fields and their types as a named character vector.

head(fields())
## accepts_healthy_volunteers_indicator                              acronym 
##                             "string"                             "string" 
##                       amendment_date                       anatomic_sites 
##                               "date"                             "string" 
##                 arms.arm_description                        arms.arm_name 
##                             "string"                             "string"
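Because fields() returns a named character vector, ordinary vector indexing is enough to explore it. The sketch below uses a toy vector holding only the six entries printed above, so it runs without the package. Reading dotted names such as arms.arm_name as nested fields is my assumption.

```r
# Toy stand-in for fields(): the six entries printed above.
field_types <- c(
  accepts_healthy_volunteers_indicator = "string",
  acronym                              = "string",
  amendment_date                       = "date",
  anatomic_sites                       = "string",
  arms.arm_description                 = "string",
  arms.arm_name                        = "string"
)

# Names of all fields of a given type.
date_fields <- names(field_types)[field_types == "date"]

# Fields whose names start with "arms." (assumed to be nested under "arms").
arm_fields <- grep("^arms\\.", names(field_types), value = TRUE)

# Number of fields of each type.
type_counts <- table(field_types)

print(date_fields)
print(arm_fields)
print(type_counts)
```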

Finally, the API offers term mapping, which is useful for type-ahead search or for mapping free text to terms. The client returns a data.frame when called with text to match.

lookup_term("panc")
##                      term_key                        term term_type count
## 1           pancreatic_cancer           Pancreatic Cancer _diseases   314
## 2        pancreatic_carcinoma        Pancreatic Carcinoma _diseases   314
## 3   pancreatic_adenocarcinoma   Pancreatic Adenocarcinoma _diseases   213
## 4  stage_iv_pancreatic_cancer  Stage IV Pancreatic Cancer _diseases   160
## 5 stage_iii_pancreatic_cancer Stage III Pancreatic Cancer _diseases   154
##   count_normalized codes      score
## 1       0.12093690 C3850 0.06781202
## 2       0.12093690 C3850 0.06781202
## 3       0.08313159 C8294 0.04661374
## 4       0.06288961 C5711 0.02821089
## 5       0.06058007 C7787 0.02717488
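Note that several terms can share a code (C3850 appears twice above). When mapping text to a single concept per code, one option is to keep only the best-scoring term for each code. This is a minimal sketch on a toy data.frame copied from the output above, assuming codes is a plain character column; the real return value may differ.

```r
# Toy copy of three columns from the lookup_term("panc") output above.
terms <- data.frame(
  term_key = c("pancreatic_cancer", "pancreatic_carcinoma",
               "pancreatic_adenocarcinoma", "stage_iv_pancreatic_cancer",
               "stage_iii_pancreatic_cancer"),
  codes    = c("C3850", "C3850", "C8294", "C5711", "C7787"),
  score    = c(0.06781202, 0.06781202, 0.04661374, 0.02821089, 0.02717488),
  stringsAsFactors = FALSE
)

# Sort by descending score, then keep the first row seen for each code.
ordered_terms <- terms[order(-terms$score), ]
unique_terms  <- ordered_terms[!duplicated(ordered_terms$codes), ]

print(unique_terms)
```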

Some more advanced queries are available at the ClinicalTrialsAPI GitHub repo.

The ClinicalTrials.gov API

This XML-based API can be queried using the rclinicaltrials R package.

# install.packages('rclinicaltrials')
library('rclinicaltrials')

As a simple example query:

z <- clinicaltrials_search(query = 'lime+disease')
str(z)
## 'data.frame':    20 obs. of  8 variables:
##  $ score               : chr  "0.021171" "0.020384" "0.0073995" "0.0072401" ...
##  $ nct_id              : chr  "NCT01951924" "NCT01333202" "NCT01056133" "NCT01644682" ...
##  $ url                 : chr  "https://ClinicalTrials.gov/show/NCT01951924" "https://ClinicalTrials.gov/show/NCT01333202" "https://ClinicalTrials.gov/show/NCT01056133" "https://ClinicalTrials.gov/show/NCT01644682" ...
##  $ title               : chr  "LIME Study (LFB IVIg MMN Efficacy Study)" "Fresh Lime Alone for Smoking Cessation" "Effect of Fish-oil on Non-alcoholic Steatohepatitis (NASH)" "Replacement of Insecticides to Control Visceral Leishmaniasis (VL)" ...
##  $ status.text         : chr  "Completed" "Completed" "Completed" "Completed" ...
##  $ condition_summary   : chr  "Motor Neuron Disease" "Tobacco Use Disorder" "Non-alcoholic Fatty Liver Disease; Non-alcoholic Steatohepatitis" "Cost-effective and Sustainable Vector Control Methods Will be Established to Reduce VL in India, Bangladesh and Nepal" ...
##  $ intervention_summary: chr  "Drug: Biological : I10E (Human normal Immunoglobulin for intravenous administration 100mg/mL); Drug: Biological"| __truncated__ "Other: Fresh lime" "Other: Omega-3 capsules-Fish Oil" "Other: IWFPL; Other: IDWL; Other: ITN" ...
##  $ last_changed        : chr  "July 18, 2016" "April 8, 2011" "May 10, 2016" "February 16, 2015" ...

Counting records is a useful way to see the “scope” of a term in the ClinicalTrials.gov database.

clinicaltrials_count(query = "myeloma")
## [1] 2236

Terms can be combined using AND, OR, and NOT in the text query. A character vector of search terms is also supported.

clinicaltrials_count(query = c("type=Intr", "cond=cancer"))
## [1] 45731
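When the list of terms is built programmatically, the Boolean query string can be assembled with paste(). The helper combine_terms below is hypothetical and not part of rclinicaltrials. I am also assuming that a string with spaces around the operator is accepted as the query argument.

```r
# Hypothetical helper: join search terms with a Boolean operator.
combine_terms <- function(terms, operator = c("AND", "OR")) {
  operator <- match.arg(operator)
  paste(terms, collapse = paste0(" ", operator, " "))
}

any_of <- combine_terms(c("myeloma", "lymphoma"), "OR")
all_of <- combine_terms(c("myeloma", "lenalidomide"), "AND")

print(any_of)
print(all_of)

# The resulting strings could then be used as the query, for example:
# clinicaltrials_count(query = any_of)
```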

Getting bulk data from the API is possible using the clinicaltrials_download function.

y <- clinicaltrials_download(query = 'myeloma', count = 10, include_results = TRUE)
str(y,list.len=5)
## List of 2
##  $ study_information:List of 6
##   ..$ study_info   :'data.frame':    10 obs. of  45 variables:
##   .. ..$ org_study_id                        : chr [1:10] "J0997" "MMRF-11-001" "101565" "IUCRO-0498" ...
##   .. ..$ nct_id                              : chr [1:10] "NCT01045460" "NCT01454297" "NCT01410981" "NCT02212262" ...
##   .. ..$ brief_title                         : chr [1:10] "Trial of Activated Marrow Infiltrating Lymphocytes Alone or in Conjunction With an Allogeneic Granulocyte Macro"| __truncated__ "Relating Clinical Outcomes in Multiple Myeloma to Personal Assessment of Genetic Profile" "Prognostic Potential of Cell Surface Markers and Pim Kinases in Multiple Myeloma" "Role of Osteocytes in Myeloma Bone Disease" ...
##   .. ..$ official_title                      : chr [1:10] "Randomized Trial of Activated Marrow Infiltrating Lymphocytes Alone or in Conjunction With an Allogeneic GM-CSF"| __truncated__ "A Prospective, Longitudinal, Observational Study in Newly Diagnosed Multiple Myeloma (MM) Patients to Assess th"| __truncated__ "Prognostic Potential of Cell Surface Markers and Pim Kinases in Multiple Myeloma" "Role of Osteocytes in Myeloma Bone Disease" ...
##   .. ..$ overall_status                      : chr [1:10] "Active, not recruiting" "Active, not recruiting" "Unknown status" "Recruiting" ...
##   .. .. [list output truncated]
##   ..$ locations    :'data.frame':    82 obs. of  15 variables:
##   .. ..$ name                    : chr [1:82] "Sidney Kimmel Comprehensive Cancer Center at Johns Hopkins" "Mayo Clinic Campus in Scottsdale, AZ" "UC San Diego Moores Cancer Center" "Sharp Health Care" ...
##   .. ..$ address.city            : chr [1:82] "Baltimore" "Scottsdale" "San Diego" "San Diego" ...
##   .. ..$ address.state           : chr [1:82] "Maryland" "Arizona" "California" "California" ...
##   .. ..$ address.zip             : chr [1:82] "21231" "85259" "92093" "92123" ...
##   .. ..$ address.country         : chr [1:82] "United States" "United States" "United States" "United States" ...
##   .. .. [list output truncated]
##   ..$ arms         :'data.frame':    14 obs. of  4 variables:
##   .. ..$ arm_group_label: chr [1:14] "1" "2" "Newly diagnosed Multiple Myeloma" "Multiple Myeloma subjects with bone marrow aspirate/biopsy" ...
##   .. ..$ arm_group_type : chr [1:14] "Experimental" "Experimental" NA NA ...
##   .. ..$ description    : chr [1:14] "aMILs" "aMILs + allogeneic myeloma vaccine" "This is a prospective observational study in patients with symptomatic multiple myeloma who have not yet initia"| __truncated__ "All patients seen at MUSC with a diagnosis of multiple myeloma or possible multiple myeloma who undergo bone ma"| __truncated__ ...
##   .. ..$ nct_id         : chr [1:14] "NCT01045460" "NCT01045460" "NCT01454297" "NCT01410981" ...
##   ..$ interventions:'data.frame':    18 obs. of  8 variables:
##   .. ..$ intervention_type: chr [1:18] "Biological" "Biological" "Other" "Drug" ...
##   .. ..$ intervention_name: chr [1:18] "aMILs" "Allogeneic Myeloma Vaccine" "oligosecretary" "Lenalidomide" ...
##   .. ..$ description      : chr [1:18] "Activated marrow infiltrating lymphocytes" "Allogeneic granulocyte macrophage colony-stimulating factor (GM-CSF)-based myeloma cellular vaccine" "not abvailable" "Dosage forms: 5, 10, 15 and 25 mg capsules. Patients will be continued on the same dose of lenalidomide as they"| __truncated__ ...
##   .. ..$ arm_group_label  : chr [1:18] "1" "2" "oligosecretary" "Myeloma Vaccine, Prevnar-13 Vaccine, & Lenalidomide" ...
##   .. ..$ arm_group_label.1: chr [1:18] "2" NA NA NA ...
##   .. .. [list output truncated]
##   ..$ outcomes     :'data.frame':    34 obs. of  5 variables:
##   .. ..$ measure    : chr [1:34] "Evaluate clinical efficacy of activated marrow infiltrating lymphocytes (aMILs) administered alone or in combin"| __truncated__ "Evaluate Progression-free Survival and Overall Survival" "Anti-tumor immune response" "The effect of aMILs on osteoclastogenesis" ...
##   .. ..$ time_frame : chr [1:34] "Days 60, 180, and 360" "Days 60, 180, and 360" "Days 60, 180, and 360" "Days 60, 180, and 360" ...
##   .. ..$ description: chr [1:34] "2.1.1 Evaluate Response Rates utilizing the Blade' criteria\nComplete Response (CR) rate\nNear Complete Respons"| __truncated__ "Patients will be monitored for progression/relapse on Days 60, 180, and 360, and as clinically indicated. Follo"| __truncated__ "Evaluate tumor specific responses in blood and bone marrow\nExamine T cell responses to DC-pulsed myeloma cell "| __truncated__ "Parameters of bone turnover that will include:\nRANKL/OPG ratio\nSerum C Telopeptide levels\nbAlkaline phosphat"| __truncated__ ...
##   .. ..$ type       : chr [1:34] "primary_outcome" "secondary_outcome" "secondary_outcome" "secondary_outcome" ...
##   .. ..$ nct_id     : chr [1:34] "NCT01045460" "NCT01045460" "NCT01045460" "NCT01045460" ...
##   .. [list output truncated]
##  $ study_results    :List of 3
##   ..$ participant_flow: NULL
##   ..$ baseline_data   : NULL
##   ..$ outcome_data    : NULL

As with any client that accesses complicated data via an API, dealing with returned results can be challenging. To quote the package documentation:

The data come from a relational database with lots of text fields, so it may take some effort to get the data into a flat format for analysis. For that reason, results come back from the clinicaltrials_download function as a list of dataframes. Each dataframe has a common key variable: nct_id. To merge dataframes, use this key. Otherwise, you can analyze the dataframes separately. They are organized into study information, locations, outcomes, interventions, results, and textblocks. Results, where available, is itself a list with three dataframes: participant flow, baseline data, and outcome data.
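The merge on nct_id described in the quote can be sketched with base R. The two data frames below are toy stand-ins with invented values, so the example runs without a download. With a real result, the corresponding objects would be y$study_information$study_info and y$study_information$locations, judging from the str() output above.

```r
# Toy stand-ins for the study_info and locations data frames (invented values).
study_info <- data.frame(
  nct_id         = c("NCT00000001", "NCT00000002"),
  overall_status = c("Recruiting", "Completed"),
  stringsAsFactors = FALSE
)
locations <- data.frame(
  nct_id       = c("NCT00000001", "NCT00000001", "NCT00000002"),
  address.city = c("Baltimore", "San Diego", "Scottsdale"),
  stringsAsFactors = FALSE
)

# One row per study-location pair, joined on the common key.
merged <- merge(study_info, locations, by = "nct_id")

# Number of sites per study, computed from the locations table alone.
sites_per_study <- table(locations$nct_id)

print(merged)
print(sites_per_study)
```

Because a study can have many locations, arms, or outcomes, merging multiplies rows. It is often simpler to summarize each data frame by nct_id first and merge the summaries.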

The AACT database

The AACT database is a relationalized, open SQL database dump of the ClinicalTrials.gov dataset. The data are modeled as an extended star schema. An online data dictionary is also available. Remarkably, the group that developed the AACT database has exposed a public PostgreSQL database for the world to use. Of course, R speaks fluently to relational databases. A modern approach in R is to access such databases using the dplyr package.

library(dplyr)
library(RPostgreSQL)

Connect to the database. The connection parameters below should work, as this is a public instance.

aact = src_postgres(dbname = 'aact',
                    host = "aact-prod.cr4nrslb1lw7.us-east-1.rds.amazonaws.com",
                    user = 'aact',
                    password = 'aact')
show(aact)
## src:  postgres 9.5.4 [aact@aact-prod.cr4nrslb1lw7.us-east-1.rds.amazonaws.com:5432/aact]
## tbls: baseline_counts, baseline_measurements, brief_summaries,
##   browse_conditions, browse_interventions, calculated_values,
##   central_contacts, conditions, countries, design_group_interventions,
##   design_groups, design_outcomes, designs, detailed_descriptions,
##   drop_withdrawals, eligibilities, facilities, facility_contacts,
##   facility_investigators, id_information, intervention_other_names,
##   interventions, keywords, links, milestones, outcome_analyses,
##   outcome_analysis_groups, outcome_counts, outcome_measurements, outcomes,
##   overall_officials, participant_flows, reported_events,
##   responsible_parties, result_agreements, result_contacts, result_groups,
##   schema_migrations, sponsors, studies, study_references

Starting with the main table, studies, gives an example of the utility of SQL-based data dumps of complex datasets.

study_tbl = tbl(src=aact, 'studies')
head(study_tbl,3)
## Source:   query [?? x 49]
## Database: postgres 9.5.4 [aact@aact-prod.cr4nrslb1lw7.us-east-1.rds.amazonaws.com:5432/aact]
## 
##        nct_id                            nlm_download_date_description
##         <chr>                                                    <chr>
## 1 NCT03052829 ClinicalTrials.gov processed this data on March 13, 2017
## 2 NCT03074786 ClinicalTrials.gov processed this data on March 13, 2017
## 3 NCT03068169 ClinicalTrials.gov processed this data on March 13, 2017
## # ... with 47 more variables: first_received_date <date>,
## #   last_changed_date <date>, first_received_results_date <date>,
## #   received_results_disposit_date <date>, start_month_year <chr>,
## #   start_date_type <chr>, start_date <date>,
## #   verification_month_year <chr>, verification_date <date>,
## #   completion_month_year <chr>, completion_date_type <chr>,
## #   completion_date <date>, primary_completion_month_year <chr>,
## #   primary_completion_date_type <chr>, primary_completion_date <date>,
## #   target_duration <chr>, study_type <chr>, acronym <chr>,
## #   baseline_population <chr>, brief_title <chr>, official_title <chr>,
## #   overall_status <chr>, last_known_status <chr>, phase <chr>,
## #   enrollment <int>, enrollment_type <chr>, source <chr>,
## #   limitations_and_caveats <chr>, number_of_arms <int>,
## #   number_of_groups <int>, why_stopped <chr>, has_expanded_access <lgl>,
## #   expanded_access_type_individual <lgl>,
## #   expanded_access_type_intermediate <lgl>,
## #   expanded_access_type_treatment <lgl>, has_dmc <lgl>,
## #   is_fda_regulated_drug <lgl>, is_fda_regulated_device <lgl>,
## #   is_unapproved_device <lgl>, is_ppsd <lgl>, is_us_export <lgl>,
## #   biospec_retention <chr>, biospec_description <chr>,
## #   plan_to_share_ipd <chr>, plan_to_share_ipd_description <chr>,
## #   created_at <dttm>, updated_at <dttm>

Find the available study types.

study_tbl %>% select(study_type) %>%
    group_by(study_type) %>% summarize(count = n()) %>%
    collect()
## # A tibble: 5 × 2
##                         study_type  count
## *                            <chr>  <dbl>
## 1                              N/A    683
## 2                   Interventional 190919
## 3                    Observational  43469
## 4 Observational [Patient Registry]   2728
## 5                  Expanded Access    403

Filter queries (SQL “LIKE” query) are straightforward.

x  = study_tbl %>% filter(official_title %like% '%TP53%') %>% collect()
dim(x)
## [1] 13 49
x$official_title[1:3]
## [1] "Open-Label, Single Arm, Phase 3b, Multi-Center Study Evaluating the Impact of Venetoclax on the Quality of Life of Relapsed/Refractory Subjects With Chronic Lymphocytic Leukemia (CLL) Including Those With the 17p Deletion or TP53 Mutation OR Those Who Have Received Prior Treatment With B-Cell Receptor Inhibitor"                                                  
## [2] "Treatment of Patients With Advanced Breast Cancer Harboring TP53 Mutations With Dose-dense Cyclophosphamide - the p53 Breast Cancer Trial"                                                                                                                                                                                                                                 
## [3] "A Phase 3 Multicenter, Randomized, Prospective, Open-label Trial of Standard Chemoimmunotherapy (FCR/BR) Versus Rituximab Plus Venetoclax (RVe) Versus Obinutuzumab (GA101) Plus Venetoclax (GVe) Versus Obinutuzumab Plus Ibrutinib Plus Venetoclax (GIVe) in Fit Patients With Previously Untreated Chronic Lymphocytic Leukemia (CLL) Without Del(17p) or TP53 Mutation"
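PostgreSQL's LIKE is case-sensitive, so the '%TP53%' pattern above would not match a title that spells the gene name in lowercase. For a table that has already been pulled into memory, the same kind of substring filter, with or without case sensitivity, can be done locally with grepl(). The data.frame below is a toy stand-in with invented titles.

```r
# Toy stand-in for a collected table (invented IDs and titles).
collected <- data.frame(
  nct_id = c("NCT00000001", "NCT00000002", "NCT00000003"),
  official_title = c("A Study in Patients With TP53 Mutation",
                     "A Study of tp53 Status in Breast Cancer",
                     "A Study of Fish Oil"),
  stringsAsFactors = FALSE
)

# Case-sensitive substring match, comparable to LIKE '%TP53%'.
exact <- collected[grepl("TP53", collected$official_title, fixed = TRUE), ]

# Case-insensitive match, which also picks up "tp53".
any_case <- collected[grepl("tp53", collected$official_title,
                            ignore.case = TRUE), ]

print(exact$nct_id)
print(any_case$nct_id)
```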

OpenTrials.net

Rather than trying to address this one directly, I refer to an available tutorial.

Provenance

sessionInfo()
## R Under development (unstable) (2016-10-26 r71594)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Sierra 10.12.3
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] methods   stats     graphics  grDevices utils     datasets  base     
## 
## other attached packages:
## [1] RPostgreSQL_0.4-1       DBI_0.6                 dplyr_0.5.0            
## [4] rclinicaltrials_1.4.7   ClinicalTrialsAPI_0.1.0 magrittr_1.5           
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.9     knitr_1.15.1    R6_2.2.0        stringr_1.2.0  
##  [5] httr_1.2.1      plyr_1.8.4      tools_3.4.0     htmltools_0.3.5
##  [9] lazyeval_0.2.0  yaml_2.1.14     rprojroot_1.2   digest_0.6.12  
## [13] assertthat_0.1  tibble_1.2      bookdown_0.3    purrr_0.2.2    
## [17] curl_2.3        evaluate_0.10   rmarkdown_1.3   blogdown_0.0.18
## [21] stringi_1.1.2   backports_1.0.5 XML_3.98-1.5    jsonlite_1.3