Introduction to the OmicIDX API

OmicIDX parses and then serves public genomics repository metadata. These metadata are growing quickly, are updated often, and are now very large when taken as a whole. Because the metadata are interrelated and support many approaches and use cases, including search, bulk download, and data mining, we serve the data via a GraphQL endpoint.

Currently, OmicIDX contains the SRA and Biosample metadata sets. These overlap with each other, but SRA contains deeper metadata than Biosample on the same samples. Biosample, on the other hand, contains many more samples and currently includes metadata about a subset of NCBI GEO, all SRA samples, and some additional samples from projects like GenBank.

GraphQL for accessing OmicIDX data

GraphQL is a query language for APIs and a runtime for fulfilling those queries with existing data. GraphQL provides a complete and understandable description of the data in the API, gives clients the power to ask for exactly what they need and nothing more, makes it easier to evolve APIs over time, and enables powerful developer tools.

A GraphQL API has only a single URL, called the endpoint, which allows access to all data in the API. Data are returned in response to a GraphQL query submitted to that endpoint.

What is a GraphQL query?

A GraphQL query looks a bit like JSON, except without quotes or commas. Here is an example GraphQL query for a fictitious GraphQL API.

{ 
  allCharacters {
    name
  }
}

If we had a server that contained Star Wars trivia, the response from the server might look like:

{ "data": {
    "allCharacters": [
      { 
        "name":"Luke"
      },
      { 
        "name": "Darth"
      },
      ...
    ]
  }
}

If we changed the query to:

{ 
  allCharacters {
    name
    mass
  }
}

the response would now look like:

{ "data": {
    "allCharacters": [
      { 
        "name":"Luke",
        "mass": 80
      },
      { 
        "name": "Darth",
        "mass": 140
      },
      ...
    ]
  }
}

How do I know what is in the GraphQL endpoint?

The GraphQL schema describes the data model(s) contained in the GraphQL endpoint. GraphQL is strongly typed, has the concept of relationships between data types, and is self-documenting. In fact, one can use the GraphQL endpoint to discover what is in the endpoint. I will not go into the details right now, but this introspection capability makes some powerful tooling possible. One of the most ubiquitous is the so-called GraphiQL (note the i in the name) tool.

Exercise 1

Navigate to GraphiQL and follow along with the video (no sound).¹

Querying OmicIDX programmatically

GraphQL is quite easy to work with programmatically. All queries are made via a POST request to the GraphQL endpoint. The body of the POST request needs to be JSON encoded and must include the "query" key. A simple example of the JSON for a basic query might look like:

{ "query": "{ heros { name weight } }"}

Note that the query string (i.e., "{ heros { name weight } }") is just a string. It is not formatted as JSON itself.

Current OmicIDX GraphQL endpoint²: http://graphql-omicidx.cancerdatasci.org/graphql

For example, let us get the first 500 SRA studies (500 is the limit on the number of results that we can retrieve in one go). The GraphQL might look like this:

{
  allSraStudies {
    edges {
      node {
        accession
        title
        abstract
      }
    }
  }
}

From Exercise 1, you know that you can copy this query into the GraphiQL browser and get results. How about using curl? Note that I am doing some gymnastics below to make this work in one command line. In practice, one would probably save the query as a file (in JSON format) and then post that file³.

# curl strips newlines from data read via @-, so the multi-line query is sent as one line.
# The heredoc terminator (EOF) must start at the beginning of its line.
curl --silent \
  -X POST \
  -H "Content-Type: application/json" \
  http://graphql-omicidx.cancerdatasci.org/graphql \
  --data @- << 'EOF'
{ "query":"{
  allSraStudies(first: 1) {
    edges {
      node {
        accession
        title
        abstract
      }
    }
  }
}"
}
EOF

Note that the return value is simply JSON. All our normal tools for working with JSON are available to us to process and manipulate the results.

Using OmicIDX from R

As a review, we know that:

- GraphQL queries look like JSON, but are not quite the same.
- We can use the GraphiQL graphical tool to help us write our query, including autocompleting fields, etc.
- GraphQL queries are POSTed to the GraphQL endpoint and results are returned as JSON.

In order to query from R, then, we need to:

1. Define our query.
2. POST the query, encoded as a simple JSON data structure.
3. Deal with the resulting JSON data that comes back from the POST in step 2.

Defining the query

Again, the easiest way to define the query is to experiment with the GraphiQL query tool. Once a query works as expected, the query can be reused from R.

In this case, we are going to continue with the query we worked with above:

{
  allSraStudies(first: 1) {
    edges {
      node {
        accession
        title
        abstract
      }
    }
  }
}

Performing the query

The httr package from CRAN performs HTTP requests, including POST requests.

With our query defined, we need to prepare the POST body, which will be sent to the server as JSON. In R, basic lists are the equivalent of JSON objects.

post_body = list(query = "
{
  allSraStudies(first: 20) {
    edges {
      node {
        accession
        title
        sraExperimentsByStudyAccession {
          totalCount
        }
      }
    }
  }
}
")

The URL for the OmicIDX endpoint is: http://graphql-omicidx.cancerdatasci.org/graphql.⁴

endpoint = "http://graphql-omicidx.cancerdatasci.org/graphql"
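To make the correspondence between R lists and JSON concrete, the sketch below encodes a small list with jsonlite::toJSON(). To the best of my knowledge, httr relies on jsonlite with auto_unbox = TRUE when encode = 'json' is requested, so the output should be close to what is actually sent. The short query and the names example_body and example_json are only for illustration.

```r
library(jsonlite)

# A small body, built the same way as post_body above
example_body <- list(query = "{ allSraStudies(first: 1) { totalCount } }")

# auto_unbox = TRUE writes length-one vectors as scalars rather than arrays
example_json <- toJSON(example_body, auto_unbox = TRUE)
print(example_json)
```

The result is a single JSON object whose only key is "query", with the whole GraphQL query as its string value.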

I am going to lead the reader a bit here and jump to a “handler” function that will convert the returned JSON into convenient R data structures.

handler = function(response) {
  jsonlite::fromJSON(httr::content(response, as='text'),flatten = TRUE)
}
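To see what the flatten = TRUE argument does without contacting the server, the sketch below parses a hand-written response that has the same shape as the one the query above asks for. The two records reuse values from the results shown in this document; the variable names are made up for the example.

```r
library(jsonlite)

# Hand-written response text in the shape returned for the query above
sample_json <- '{"data": {"allSraStudies": {"edges": [
  {"node": {"accession": "DRP000001",
            "title": "Bacillus subtilis subsp. natto BEST195 genome sequencing project",
            "sraExperimentsByStudyAccession": {"totalCount": 1}}},
  {"node": {"accession": "DRP000007",
            "title": "Comprehensive identification and characterization of the binding sites of polymerase II",
            "sraExperimentsByStudyAccession": {"totalCount": 2}}}
]}}}'

# flatten = TRUE turns nested objects into columns with dot-separated names
parsed <- fromJSON(sample_json, flatten = TRUE)
edges <- parsed$data$allSraStudies$edges

print(names(edges))
print(edges$node.sraExperimentsByStudyAccession.totalCount)
```

Each level of nesting in the GraphQL result becomes one component of the column name, which is why the columns are called node.accession, node.title, and so on.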

Finally, we are ready to perform our query.

resp = httr::POST(endpoint, body = post_body, encode = 'json')
resp

## Response [http://graphql-omicidx.cancerdatasci.org/graphql]
##   Date: 2019-03-21 18:37
##   Status: 200
##   Content-Type: application/json; charset=utf-8
##   Size: 3.84 kB

And finally, convert the response to a convenient R data structure.

result = handler(resp)
knitr::kable(result$data[[1]]$edges)

node.accession	node.title	node.sraExperimentsByStudyAccession.totalCount
DRP000001	Bacillus subtilis subsp. natto BEST195 genome sequencing project	1
DRP000002	Model organism for prokaryotic cell differentiation and development	1
DRP000003	Comprehensive identification and characterization of the nucleosome structure	1
DRP000004	Comprehensive identification and characterization of the transcripts, their expression levels and sub-cellular localizations	1
DRP000005	Comprehensive identification and characterization of the transcripts, their expression levels and sub-cellular localizations	1
DRP000006	Comprehensive identification and characterization of the transcripts, their expression levels and sub-cellular localizations	1
DRP000007	Comprehensive identification and characterization of the binding sites of polymerase II	2
DRP000008	Comprehensive identification and characterization of the binding sites of polymerase II	2
DRP000009	Subsurface mine microbial mat metagenome	1
DRP000010	Oryza sativa Japonica group genome sequencing project by QTL Genomics Research Center, Japan	1
DRP000011	Comprehensive analysis of cytoplasmic mRNAs in HT29 cell.	1
DRP000012	Comprehensive analysis of cytoplasmic mRNAs in HT29 cell at 4hr after treatment with tunicamycin.	1
DRP000013	Comprehensive analysis of cytoplasmic mRNAs in HT29 cell at 16hr after treatment with tunicamycin.	1
DRP000014	Comprehensive analysis of polysomal mRNAs in HT29 cell.	1
DRP000015	Comprehensive analysis of polysomal mRNAs in HT29 cell at 4hr after treatment with tunicamycin.	1
DRP000016	Comprehensive analysis of polysomal mRNAs in HT29 cell at 16hr after treatment with tunicamycin.	1
DRP000017	Massive transcriptional start site mapping of Beas2B IL-4 stimulation cells.	1
DRP000018	Massive transcriptional start site mapping of Beas2B IL-4 non-stimulation cells.	1
DRP000019	Massive transcriptional start site mapping of Beas2B siRNA Stat6 IL-4 stimulation cells.	1
DRP000020	Massive transcriptional start site mapping of Beas2B siRNA Stat6 IL-4 non-stimulation cells.	1

We can change our query a bit, here limiting the results to the first 10 studies.

study_experiment_counts_query = "
{
  allSraStudies(first: 10) {
    edges {
      node {
        accession
        title
        sraExperimentsByStudyAccession {
          totalCount
        }
      }
    }
  }
}
"

And streamline our code a bit as a function.

# Reuses the "endpoint" URL defined above
graphql_query = function(gql) {
  # Parse the JSON response and return the "edges" of the first data element
  .handler = function(response) {
    jsonlite::fromJSON(httr::content(response, as='text'),flatten = TRUE)$data[[1]]$edges
  }
  # Wrap the query string in a list so that httr can encode it as JSON
  post_body = list(query = gql)
  resp = httr::POST(endpoint, body = post_body, encode = 'json')
  return(.handler(resp))
}

knitr::kable(graphql_query(study_experiment_counts_query))

node.accession	node.title	node.sraExperimentsByStudyAccession.totalCount
DRP000001	Bacillus subtilis subsp. natto BEST195 genome sequencing project	1
DRP000002	Model organism for prokaryotic cell differentiation and development	1
DRP000003	Comprehensive identification and characterization of the nucleosome structure	1
DRP000004	Comprehensive identification and characterization of the transcripts, their expression levels and sub-cellular localizations	1
DRP000005	Comprehensive identification and characterization of the transcripts, their expression levels and sub-cellular localizations	1
DRP000006	Comprehensive identification and characterization of the transcripts, their expression levels and sub-cellular localizations	1
DRP000007	Comprehensive identification and characterization of the binding sites of polymerase II	2
DRP000008	Comprehensive identification and characterization of the binding sites of polymerase II	2
DRP000009	Subsurface mine microbial mat metagenome	1
DRP000010	Oryza sativa Japonica group genome sequencing project by QTL Genomics Research Center, Japan	1
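The function above assumes that every request succeeds. A GraphQL server reports problems in a top-level "errors" element of the response, so one option is to check for that element before extracting the data. The following is only a sketch: the helper name graphql_data and both sample responses are hypothetical and hand-written so that the code runs without network access.

```r
library(jsonlite)

# Hypothetical helper: stop if the parsed response carries an "errors"
# element, otherwise return the "data" element.
graphql_data <- function(parsed) {
  if (!is.null(parsed$errors)) {
    # Report whatever fields the server put in the errors element
    stop("GraphQL request returned errors: ",
         as.character(toJSON(parsed$errors, auto_unbox = TRUE)))
  }
  parsed$data
}

# Hand-written examples of a successful and a failed response
ok <- fromJSON(
  '{"data": {"allSraStudies": {"edges": [{"node": {"accession": "DRP000001"}}]}}}',
  flatten = TRUE
)
bad <- fromJSON('{"errors": [{"message": "example error"}]}')

print(graphql_data(ok)$allSraStudies$edges)

# The failed response raises an R error that can be caught as usual
msg <- tryCatch(graphql_data(bad), error = function(e) conditionMessage(e))
print(msg)
```

A check like this could be called inside graphql_query before the edges are extracted, so that a malformed query produces an informative R error instead of a NULL result.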

One might be interested in filtering the results, also. For example, to search study titles for “cancer” (case insensitive) we can add a filter to the query.

cancer_studies_query = '
{
  allSraStudies(first: 10 filter: {title: {includesInsensitive: "cancer"}}) {
    edges {
      node {
        accession
        title
      }
    }
  }
}
'
knitr::kable(graphql_query(cancer_studies_query))

node.accession	node.title
DRP000030	human epigenomics sequencing project of breast cancer and normal cell lines
DRP000425	Homo sapiens T24 human bladder cancer cell line Transcriptome
DRP000628	Identification and characterization of miRNA-mRNAs associations in a colon cancer cell line by massively paralleled sequencing
DRP000632	Accumulation of genetic alterations in hepatic progenitor cells in mice lead to the development of liver cancers.
DRP000989	Biliary tract cancer targeted exome
DRP001071	Quantitative detection of mutation alleles derived from lung cancer in plasma cell-free DNA by use of anomaly detection with deep sequencing data
DRP001077	Mutation profiles in inflamed gastric epithelium with Helicobacter pylori infection during the development of gastric cancer.
DRP001085	The microRNA expression signature of bladder cancer assessed by deep sequencing
DRP001358	Single cell analysis of lung adenocarcinoma cell lines and the response to an anti-cancer drug stimulation
DRP002517	Hypermutated human papillomavirus16 genome in cervical cancer

To check the total count of available studies with “cancer” in the title, we can write another query.

cancer_studies_query = '
{
  allSraStudies(first: 10 filter: {title: {includesInsensitive: "cancer"}}) {
    totalCount
  }
}
'

A keen eye will notice that there is not an edges component in this query. Our handler above returns the edges component, though. So, we are going to drop back to basic httr code here, again.

httr::content(
  httr::POST(endpoint, body = list(query=cancer_studies_query), encode='json'),
  as="parsed"
)

## $data
## $data$allSraStudies
## $data$allSraStudies$totalCount
## [1] 2298
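An alternative to dropping back to basic httr code is a handler that copes with both kinds of result. The sketch below is one possible approach, not the code used elsewhere in this document: the hypothetical extract_result returns the edges table when the query asked for one and otherwise returns the first data element unchanged. The sample responses are hand-written, reusing values shown in this document.

```r
library(jsonlite)

# Hypothetical variant of the handler: prefer "edges" when present,
# otherwise return the first element of "data" as is.
extract_result <- function(parsed) {
  first <- parsed$data[[1]]
  if (!is.null(first$edges)) first$edges else first
}

# A count-only response, as produced by the totalCount query above
count_response <- fromJSON(
  '{"data": {"allSraStudies": {"totalCount": 2298}}}',
  flatten = TRUE
)

# A response with edges, as produced by the earlier queries
edges_response <- fromJSON(
  '{"data": {"allSraStudies": {"edges": [{"node": {"accession": "DRP000030"}}]}}}',
  flatten = TRUE
)

print(extract_result(count_response)$totalCount)
print(extract_result(edges_response))
```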

Related objects

One of the unique aspects of GraphQL is the ability to treat data as a graph, with data entities related to each other logically linked. For example, we can fetch SRA experiments and link them each to the study under which they were performed.

experiments_with_study = '
{
  allSraExperiments(first: 5) {
    edges{
      node{
        accession
        title
        libraryStrategy
        sraStudyByStudyAccession{
          title
          accession
        }
      }
    }
  }
}
'

Querying from R still works the same way, so we execute as before.

result = graphql_query(experiments_with_study)
knitr::kable(result)

node.accession	node.title	node.libraryStrategy	node.sraStudyByStudyAccession.title	node.sraStudyByStudyAccession.accession
DRX000001	B. subtilis subsp. natto genome sequencing September 2008	WGS	Bacillus subtilis subsp. natto BEST195 genome sequencing project	DRP000001
DRX000002	B. subtilis subsp. subtilis genome resequencing September 2008	WGS	Model organism for prokaryotic cell differentiation and development	DRP000002
DRX000003	DLD1_normoxia_nucleosome	FL-cDNA	Comprehensive identification and characterization of the nucleosome structure	DRP000003
DRX000004	DLD1_polysome	FL-cDNA	Comprehensive identification and characterization of the transcripts, their expression levels and sub-cellular localizations	DRP000004
DRX000005	DLD1_cytoplasmic	FL-cDNA	Comprehensive identification and characterization of the transcripts, their expression levels and sub-cellular localizations	DRP000005

Fetching all results

TODO

Describe cursors
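As a starting point for this section, here is a minimal sketch of cursor-based paging. It assumes that the endpoint follows the common connection pattern, in which each result carries a pageInfo element with hasNextPage and endCursor fields and the query accepts an after argument; this has not been verified against OmicIDX here. All function names are hypothetical, and the paging loop is demonstrated with hand-made pages so that it runs without network access.

```r
endpoint <- "http://graphql-omicidx.cancerdatasci.org/graphql"

# Build the query for one page; "after" and "pageInfo" are assumed to exist
studies_page_query <- function(cursor = NULL, page_size = 500) {
  after <- if (is.null(cursor)) "" else sprintf(', after: "%s"', cursor)
  sprintf(
    "{ allSraStudies(first: %d%s) { pageInfo { hasNextPage endCursor } edges { node { accession title } } } }",
    page_size, after
  )
}

# Fetch one page from the server (defined for reference, not called below)
fetch_studies_page <- function(cursor) {
  resp <- httr::POST(endpoint, body = list(query = studies_page_query(cursor)),
                     encode = "json")
  jsonlite::fromJSON(httr::content(resp, as = "text"), flatten = TRUE)$data[[1]]
}

# Keep requesting pages until the server reports that there are no more
fetch_all <- function(fetch_page, max_pages = Inf) {
  pages <- list()
  cursor <- NULL
  repeat {
    conn <- fetch_page(cursor)
    pages[[length(pages) + 1]] <- conn$edges
    if (!isTRUE(conn$pageInfo$hasNextPage) || length(pages) >= max_pages) break
    cursor <- conn$pageInfo$endCursor
  }
  do.call(rbind, pages)
}

# Two hand-made pages standing in for server responses
fake_pages <- list(
  list(edges = data.frame(node.accession = c("A1", "A2")),
       pageInfo = list(hasNextPage = TRUE, endCursor = "c1")),
  list(edges = data.frame(node.accession = "A3"),
       pageInfo = list(hasNextPage = FALSE, endCursor = NULL))
)
fake_fetch <- function(cursor) {
  if (is.null(cursor)) fake_pages[[1]] else fake_pages[[2]]
}

all_rows <- fetch_all(fake_fetch)
print(all_rows)

# Against the real endpoint (requires network access):
# some_studies <- fetch_all(fetch_studies_page, max_pages = 3)
```

Passing the page-fetching function as an argument keeps the loop independent of the network, and max_pages offers a simple guard against downloading far more than intended.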

Using variables

TODO

Describe adding variables to POST body
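As a starting point for this section, the sketch below shows the usual way to pass GraphQL variables over HTTP: the query declares its variables, and their values are sent under a "variables" key next to "query" in the POST body. The variable names and their types (String!, Int!) are assumptions about the schema, chosen to mirror the title filter used earlier.

```r
library(jsonlite)

# Assumed variable types: String! for the search term, Int! for the page size
studies_by_title_query <- '
query ($term: String!, $n: Int!) {
  allSraStudies(first: $n, filter: {title: {includesInsensitive: $term}}) {
    edges {
      node {
        accession
        title
      }
    }
  }
}
'

# Values travel separately from the query text under the "variables" key
post_body_with_variables <- list(
  query = studies_by_title_query,
  variables = list(term = "cancer", n = 10)
)

# Show the JSON that would be sent
body_json <- toJSON(post_body_with_variables, auto_unbox = TRUE)
print(body_json)

# To run it (requires network access and the endpoint defined earlier):
# resp <- httr::POST(endpoint, body = post_body_with_variables, encode = "json")
```

Keeping the values out of the query string means the same query text can be reused for different search terms without any string pasting or quoting.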

Conclusion

TODO

¹ The URLs in this document are subject to change.
² The URLs in this document are subject to change.
³ For examples of how to post JSON using curl, see this Stack Overflow post.
⁴ The URLs in this document are subject to change.

Playing with OmicIDX

Sean Davis

3/14/2019
