Using Google Container Registry for private Docker images

In this post, I will quickly build a Docker image containing the sra-toolkit and a key for dbGaP downloads. Because the key file is private, I will use the secure Google Container Registry (GCR) to store the image for later use in genomics workflows.

Background

Container technologies like Docker enable quick and easy encapsulation of software, dependencies, and operating systems. One or more containers can render entire software ecosystems portable, enhance reproducibility and reusability, and facilitate sharing of software, tools, and even infrastructure.

While DockerHub is perhaps the best-known registry where such Docker images can be housed, others such as quay.io are also available. Commercial cloud environments such as Google Cloud Platform often offer their own registries that use the same secure access controls as other cloud services, allowing Docker images with proprietary or private information to be stored and accessed securely. These registries are also typically well integrated with other commercial cloud services (see the GCR docs).

I am experimenting with the Google Container Registry (GCR) for a bioinformatics project that I plan to perform on Google Cloud. This blog post simply serves as notes to myself about the details of using that system.

Preliminaries

A Google Cloud account and project are required to follow along here.
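If the gcloud SDK is installed, the active project can be set on the command line (PROJECT-ID here is a placeholder for your own project ID):

gcloud config set project PROJECT-ID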

To allow Docker to use Google for authorization, we need to run this one-time command that ties Docker to our Google credentials:

gcloud auth configure-docker

Answer “yes” to the prompt.
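Under the hood, this registers gcloud as a Docker credential helper. The resulting ~/.docker/config.json should contain something like the following (the exact list of registry hostnames may vary by SDK version):

{
  "credHelpers": {
    "gcr.io": "gcloud",
    "us.gcr.io": "gcloud",
    "eu.gcr.io": "gcloud",
    "asia.gcr.io": "gcloud"
  }
}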

Building a Docker image and adding it to GCR

We are going to build a simple Docker image that includes the sra-tools package, import a private key for downloading protected files, and then store that image in GCR.

The Dockerfile is given below. Note that I am not including the dbGaP key file, as it is private, but you could modify the Dockerfile to include your own key file(s) or simply remove the dbGaP access details entirely.
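The original Dockerfile is not reproduced here, so the following is a minimal sketch of what it might look like. It assumes the prebuilt sratoolkit 2.9.2 Ubuntu binaries from NCBI, an ubuntu:18.04 base image, and a hypothetical key file named prj_16049.ngc sitting next to the Dockerfile; substitute your own key file name and dbGaP project number.

FROM ubuntu:18.04

# Fetch and unpack the prebuilt SRA Toolkit binaries from NCBI.
RUN apt-get update && apt-get install -y wget ca-certificates && \
    wget -q https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.9.2/sratoolkit.2.9.2-ubuntu64.tar.gz && \
    tar -xzf sratoolkit.2.9.2-ubuntu64.tar.gz -C /opt && \
    rm sratoolkit.2.9.2-ubuntu64.tar.gz
ENV PATH="/opt/sratoolkit.2.9.2-ubuntu64/bin:${PATH}"

# Import the private dbGaP key; by default this creates the
# /root/ncbi/dbGaP-16049 workspace directory used later in this post.
COPY prj_16049.ngc /tmp/prj_16049.ngc
RUN vdb-config --import /tmp/prj_16049.ngc && rm /tmp/prj_16049.ngc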

Building the image is one line:

docker build -t sratoolkit .

At this point, the Docker image has been created. We can run it locally as a test:

docker run -ti sratoolkit /bin/bash

Inside the Docker container, all of the sra-toolkit binaries are available.
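For example, a quick sanity check from within the container:

# fastq-dump is one of the sra-toolkit binaries; assuming the 2.9.2
# binaries, the second command should report version 2.9.2.
which fastq-dump
fastq-dump --version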

Each Google Cloud project has its own private GCR, so GCR URLs include the PROJECT-ID.
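Image names follow the pattern below; tagging the local image with a name of this form and pushing it uploads it to the project's registry:

gcr.io/[PROJECT-ID]/[IMAGE]:[TAG]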

docker tag sratoolkit gcr.io/isb-cgc-01-0006/sratoolkit:2.9.2
docker push gcr.io/isb-cgc-01-0006/sratoolkit:2.9.2

Usage

For fun, we can use the image and the fastq-dump utility to download a single SRA run. The container will run only as long as necessary to perform the fastq-dump and will then terminate. This post is not about the details of running Docker, but note that the volume mount (-v /tmp:/data) and the --outdir flag below result in the fastq files from the fastq-dump command ending up in the /tmp directory on your machine.

docker run -v /tmp:/data -ti gcr.io/isb-cgc-01-0006/sratoolkit:2.9.2 \
    /bin/bash -c "cd /root/ncbi/dbGaP-16049/ \
    && fastq-dump --outdir /data --split-files --gzip \
    --skip-technical -X 10000 SRR390728"

This will result in:

Read 10000 spots for SRR390728
Written 10000 spots for SRR390728

Best practices

Storing keys, key files, and any other private information in a Docker image is risky and is not a best practice. Leveraging the security model of GCR mitigates some of this risk. However, it is easy to forget about the information that could be leaked if this image were shared or pushed to a public repository like DockerHub.

For a discussion of other models for accessing private information inside a Docker container, see this blog post, for example.
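As a sketch of one such alternative, the key file can be left out of the image entirely and mounted at run time. Assuming an image built without the key and a hypothetical key file at /secure/prj_16049.ngc on the host:

# Mount the key read-only and import it when the container starts.
docker run -v /secure/prj_16049.ngc:/tmp/prj_16049.ngc:ro -ti sratoolkit \
    /bin/bash -c "vdb-config --import /tmp/prj_16049.ngc \
    && cd /root/ncbi/dbGaP-16049/ \
    && fastq-dump --split-files --gzip --skip-technical -X 10000 SRR390728"

The image then contains no secrets and can be shared freely; only hosts holding the key file can use the dbGaP workspace.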

From here

At this point, we have created an environment for building Docker images, stored a proprietary Docker image in the Google Container Registry, and successfully used the image to run a simple fastq-dump command with the resulting files ending up on the Docker host machine. This image can be used anywhere a Docker image can run, but only on hosts where Google authentication has been tied into Docker.
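For example, on any other machine with Docker and an authorized gcloud SDK:

gcloud auth configure-docker
docker pull gcr.io/isb-cgc-01-0006/sratoolkit:2.9.2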

The Google Genomics Pipelines API is built around scalable genomic workflows and uses Docker images for workflow tasks. The image built here can be used for dbGaP genomic data extraction as part of such a workflow, all within the secure environment of Google Cloud.
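As a rough sketch only (the exact schema and flags should be checked against the Pipelines API documentation), a pipeline definition referencing this image might look something like:

# fastqdump.yaml -- hypothetical pipeline definition
name: fastqdump
docker:
  imageName: gcr.io/isb-cgc-01-0006/sratoolkit:2.9.2
  cmd: >
    cd /root/ncbi/dbGaP-16049 &&
    fastq-dump --split-files --gzip --skip-technical -X 10000 SRR390728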

Sean Davis
National Cancer Institute, NIH
