Exploring Cloud-Native Geospatial Formats: Hands-on with Raster Data Workshop
11-17, 13:30–16:30 (Pacific/Auckland), WF710

Dig into three cloud-native raster formats—COGs, Zarr, and Kerchunk—and learn how data access works under the hood with hands-on Python exercises, no image libraries required!


Ever wonder what GDAL is doing under the hood when you read a GeoTIFF file? Doubly so when the file is a Cloud-Optimized GeoTIFF (COG) on a remote server somewhere? Have you been wondering what this new Zarr thing is all about and how it actually works? And then there's the whole Kerchunk/VirtualiZarr approach of indexing non-cloud-native data formats to get cloud-native access. What's that about?

Cloud-native geospatial is all the rage these days, and for good reason. As file sizes grow, layer counts increase, and analytical methods become more complex, the traditional download-to-the-desktop approach is quickly becoming untenable for many applications. It's no surprise, then, that users are turning to cloud-based tools such as Dask to scale out their analyses, or that traditional tooling is adopting new ways of finding and accessing data from cloud-based sources. But as we transition away from opening whole files and toward grabbing ranges of bytes off remote servers, it seems all the more important to understand exactly how cloud-native data formats actually store data and what tools are doing to access it.

This workshop aims to dig into how cloud-native geospatial data formats are enabling new operational paradigms, with a particular focus on raster data formats. We'll start on the surface by surveying the current cloud-native geospatial landscape to gain an understanding of why cloud native is important and how it is being used, including:

  • the core tenets of cloud-native geospatial data formats
  • cloud-native data formats for both raster and non-raster geospatial data
  • the intersection with SpatioTemporal Asset Catalogs (STAC) and how higher-level STAC-based tooling can leverage cloud-native formats for efficient raster data access and processing

Then we'll get hands-on and go deep to build up an in-depth understanding of how cloud-native raster formats work. We'll examine the COG format and read a COG from a cloud source by hand using just Python, progressively grabbing data from the image until we can extract a target tile, all without using any image libraries. We'll repeat the same exercise for geospatial data in Zarr format to see how that compares to our experience with COGs. Lastly, we'll turn our attention to Kerchunk/VirtualiZarr to see how these technologies might allow us to better optimize data access with non-cloud-native formats.
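To give a flavor of the "by hand" approach, here is a minimal sketch of the very first step: fetching the opening bytes of a TIFF with an HTTP Range request and decoding its header. It is not taken from the workshop notebooks. The helper names fetch_range and parse_tiff_header are hypothetical, the header layout follows the published TIFF and BigTIFF formats, and the server is assumed to honor Range requests.

```python
import struct
import urllib.request


def fetch_range(url: str, start: int, length: int) -> bytes:
    """Fetch `length` bytes starting at byte `start` via an HTTP Range request.

    Hypothetical helper: assumes the server supports Range requests.
    """
    end = start + length - 1  # Range headers are inclusive on both ends
    request = urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})
    with urllib.request.urlopen(request) as response:
        return response.read()


def parse_tiff_header(data: bytes) -> dict:
    """Decode byte order, TIFF variant, and the offset of the first IFD."""
    if data[:2] == b"II":
        endian = "<"  # little-endian ("Intel") byte order
    elif data[:2] == b"MM":
        endian = ">"  # big-endian ("Motorola") byte order
    else:
        raise ValueError("not a TIFF: unrecognized byte-order mark")

    (magic,) = struct.unpack(endian + "H", data[2:4])
    if magic == 42:
        # Classic TIFF: a 4-byte offset to the first image file directory (IFD)
        (ifd_offset,) = struct.unpack(endian + "I", data[4:8])
        return {"endian": endian, "bigtiff": False, "first_ifd_offset": ifd_offset}
    if magic == 43:
        # BigTIFF: offset byte size (always 8), a zero pad, then an 8-byte offset
        offset_size, _pad = struct.unpack(endian + "HH", data[4:8])
        if offset_size != 8:
            raise ValueError("unexpected BigTIFF offset size")
        (ifd_offset,) = struct.unpack(endian + "Q", data[8:16])
        return {"endian": endian, "bigtiff": True, "first_ifd_offset": ifd_offset}
    raise ValueError(f"not a TIFF: unexpected magic number {magic}")
```

With a COG URL of your choosing, parse_tiff_header(fetch_range(url, 0, 16)) would report where the first IFD lives. The IFD in turn holds the tile offsets and byte counts needed to request a single tile.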

Prerequisites

This workshop expects some familiarity with geospatial programming in Python. Most of the notebook code is already provided, so gaps in understanding won't necessarily prevent you from completing the exercises. That said, a basic knowledge of STAC and cloud-native geospatial Python tooling, along with experience working with rasters as single and multidimensional arrays, is quite helpful.

A good primer is the Cloud-Native Geospatial for Earth Observation Workshop by Alex Leith of Auspatious. It is recommended to work through those activities, or to have equivalent knowledge, before working through the notebooks in this workshop.

Jarrett Keifer is a Senior Geospatial Software Engineer at Element 84, a commercial geospatial consultancy that uses open-source software to build effective customer solutions. His interests include education and outreach, geospatial data formats, and high-performance systems/network programming. He enjoys designing systems to operate at scale, particularly to support remote sensing data processing and earth science applications, and has over ten years of experience contributing to open-source projects.
