Data availability can be thought of as a service. The service might simply be the file system, but it could be more complex.
Data services vary in the complexity of operations they enable. A filesystem, for example, has no “smarts” in that it delivers bits-and-bytes as asked. A more complex system might include indexing capabilities (VCF files, hdf5). More complexity again comes with systems like MongoDB and PostgreSQL. At the level of Bigquery, parallelism and rudimentary machine learning is possible. At the extreme are projects like Apache Spark that can serve as distributed analysis engines over arbitrarily large datasets.
Loom is an efficient file format based on HDF5 for very large omics datasets, consisting of a main matrix, optional additional layers, a variable number of row and column annotations, and sparse graph objects.
Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. It also provides computational libraries and zero-copy streaming messaging and interprocess communication. Languages currently supported include C, C++, C#, Go, Java, JavaScript, MATLAB, Python, R, Ruby, and Rust.
Columnar memory layout allows applications to avoid unnecessary IO and accelerate analytical processing performance on modern CPUs and GPUs. ]
.right-column[ ]
From the TileDB website:
Building these will usually require a custom API built and maintained centrally.