11-04, 13:30–14:00 (America/New_York), Regency Ballroom B
Virtual Zarr stores enable cloud-optimized access to archival-format data without data duplication. By leveraging xarray, VirtualiZarr provides a powerful tool for constructing virtual Zarr stores. This talk will situate VirtualiZarr in relation to kerchunk, icechunk, and Zarr V3.
Zarr has emerged as a flexible data format for storing cloud-optimized self-describing n-dimensional data cubes. It’s great! There is just one problem. The vast majority of data that is being generated and distributed right now is not in Zarr stores — it’s in NetCDF, or HDF5, or COG. Those legacy data formats all support partial reads, but they aren’t optimized for cloud access (except for COG). What this means is that they require many small network requests to fetch bits of metadata that are sprinkled throughout the file and — when concatenating — across many different files. One solution is to copy data over to Zarr. But this means you double your storage cost and introduce a reprocessing step.
That’s where virtual Zarr stores come in. Virtual Zarr stores capture metadata about where particular chunks of data are stored and allow tools like xarray to access the data from those chunks directly using HTTP range requests. You never need to duplicate the data to Zarr. These virtual Zarr stores can be written to disk using kerchunk or icechunk and distributed alongside the data.
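To make the idea of "metadata about where chunks are stored" concrete, here is a minimal sketch in plain Python. It is not how kerchunk or icechunk lay out their references. The manifest dict, the `read_chunk` helper, and the chunk keys are all hypothetical, and a local temporary file stands in for a remote archival file so the example runs without a network.

```python
import os
import tempfile

# Hypothetical manifest: chunk key -> (file path, byte offset, byte length).
# Real virtual stores record similar information, but their actual
# layouts differ from this toy dict.

def read_chunk(manifest, key):
    """Fetch one chunk's bytes by seeking to its recorded offset.

    Over HTTP, the same lookup would become a request with a header such as
    'Range: bytes=<offset>-<offset + length - 1>'.
    """
    path, offset, length = manifest[key]
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)

# Build a stand-in "archival file": a 16-byte header followed by two chunks.
header = b"H" * 16
chunk_a = b"A" * 8
chunk_b = b"B" * 8
tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".bin")
tmp.write(header + chunk_a + chunk_b)
tmp.close()

# The manifest points into the original file; no chunk bytes are copied.
manifest = {
    "temperature/0": (tmp.name, 16, 8),
    "temperature/1": (tmp.name, 24, 8),
}

first = read_chunk(manifest, "temperature/0")
second = read_chunk(manifest, "temperature/1")
os.remove(tmp.name)
```

The point of the sketch is that the manifest is tiny compared with the data it describes, which is why it can be distributed alongside the original files instead of a full copy.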
So how does VirtualiZarr fit in? VirtualiZarr is the easiest way to construct virtual Zarr stores out of thousands of individual data files. It takes advantage of the powerful merging and concatenation logic that xarray already has and lets you apply those same methods to virtual Zarr stores. VirtualiZarr provides readers for many legacy data formats so you can read the metadata from the original files and produce a virtual Zarr store with the combined metadata and access paths.
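As a rough illustration of what "concatenating" virtual stores involves, the sketch below merges per-file chunk manifests along one axis by shifting chunk indices. This is an assumption-laden toy, not VirtualiZarr's implementation or API: `concat_manifests`, the file names, and the byte offsets are all hypothetical.

```python
def concat_manifests(manifests, chunks_per_file):
    """Concatenate per-file manifests along the first axis.

    manifests: list of dicts mapping a chunk index (int) along the first
        axis to a (path, offset, length) tuple.
    chunks_per_file: number of chunks each file contributes along that axis.
    """
    combined = {}
    shift = 0
    for manifest, n_chunks in zip(manifests, chunks_per_file):
        for index, ref in manifest.items():
            # Only the index changes; the reference still points at the
            # original file and byte range.
            combined[index + shift] = ref
        shift += n_chunks
    return combined

# Hypothetical per-file manifests (paths and offsets are made up).
jan = {0: ("jan.nc", 1024, 4096), 1: ("jan.nc", 5120, 4096)}
feb = {0: ("feb.nc", 1024, 4096)}

combined = concat_manifests([jan, feb], [2, 1])
```

Here `combined` has three entries: the two January chunks keep indices 0 and 1, and the February chunk lands at index 2. No data is read or moved; only the metadata is rearranged.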
I have worked in the scientific Python ecosystem as an environmental researcher, an open source contributor, and a web developer. I am passionate about finding creative ways to enhance understanding of the physical world. My past experience includes maintaining Dask (an open source distributed computing tool) and HoloViz (an open source high-level visualization tool). In my current role at Element 84 (formerly Azavea), I maintain Django/React web applications and push forward tooling and open source best practices for scientists.