11-20, 11:30–11:55 (Pacific/Auckland), WG404
Icechunk 2.0 is a transactional, cloud-native storage engine for Zarr. This talk introduces its new features for transactionally managing massive geospatial arrays, including efficient appends and inserts.
Icechunk is an open-source, cloud-native transactional storage engine for multi-dimensional arrays, designed to manage massive geospatial datasets. Building upon the foundational features of data versioning and schema evolution presented in its first iteration, Icechunk 2.0 introduces a more stable and performant on-disk format that unlocks powerful new capabilities for managing large-scale array data. This presentation will introduce the new features of Icechunk 2.0 and demonstrate their application to common challenges in geospatial data analysis.
Managing petabyte-scale geospatial data, such as satellite imagery time-series, gridded weather forecasts, and climate model outputs, requires tools that can handle evolving data and complex operational pipelines efficiently. Icechunk 2.0 directly addresses these needs with several key innovations:
- Efficient Array Manipulation: A new indexing capability allows cheap appends, prepends, and inserts into an array without rewriting existing data chunks. This is especially valuable for time-series datasets that keep growing.
- Flexible Data Organization: Users can now rename or move arrays and groups within the Zarr hierarchy without costly data duplication, simplifying the curation and organization of large data repositories.
- Enhanced Data Governance: The introduction of an amendable commit option simplifies the version history, while a comprehensive operation log and support for repository-level metadata provide crucial data provenance.
- Improved Performance and Stability: The new format enables significantly faster and safer garbage collection and more efficient queries of a repository’s history.
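To make the first two items above concrete, here is a minimal conceptual sketch in plain Python. It is a toy model, not Icechunk's on-disk format or API: the names `ToyManifest` and `ToyRepo`, the chunk key scheme, and the array paths are all hypothetical. It assumes only the general idea that an array is described by an index of references to immutable chunk objects, so that appends, prepends, inserts, and renames change the index while the stored chunks stay untouched.

```python
from dataclasses import dataclass, field


@dataclass
class ToyManifest:
    """Hypothetical index: logical chunk position -> key of an immutable chunk."""
    refs: dict = field(default_factory=dict)

    def append(self, key):
        # The new chunk takes the next free position; nothing else changes.
        self.refs[max(self.refs, default=-1) + 1] = key

    def insert(self, at, key):
        # Shift the positions of later references by one.
        # Only the index is rewritten; the chunk objects are not touched.
        self.refs = {(i + 1 if i >= at else i): k for i, k in self.refs.items()}
        self.refs[at] = key

    def prepend(self, key):
        self.insert(0, key)

    def ordered_keys(self):
        return [self.refs[i] for i in sorted(self.refs)]


class ToyRepo:
    """Hypothetical repository: a write-once object store plus a path hierarchy."""

    def __init__(self):
        self.objects = {}  # chunk key -> bytes
        self.arrays = {}   # array path -> ToyManifest

    def write_chunk(self, data):
        key = f"chunk-{len(self.objects):04d}"
        self.objects[key] = data
        return key

    def rename(self, old_path, new_path):
        # Moving an array only re-keys its manifest; no chunk is copied.
        self.arrays[new_path] = self.arrays.pop(old_path)


repo = ToyRepo()
repo.arrays["/sst"] = ToyManifest()
manifest = repo.arrays["/sst"]

manifest.append(repo.write_chunk(b"day-2"))
manifest.append(repo.write_chunk(b"day-3"))
manifest.prepend(repo.write_chunk(b"day-1"))       # earlier data arrives late
manifest.insert(2, repo.write_chunk(b"day-2.5"))   # fill a gap in the middle

objects_before = dict(repo.objects)
repo.rename("/sst", "/ocean/sst")

data = [repo.objects[k] for k in repo.arrays["/ocean/sst"].ordered_keys()]
print(data)
```

In this sketch each operation writes at most one new chunk object, and the rename writes none, which is the property the bullets describe. How Icechunk 2.0 actually encodes its index is not covered here.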
We will demonstrate how Icechunk 2.0 integrates seamlessly into the Scientific Python ecosystem (Xarray, Dask) via its Zarr store interface and how we have built Icechunk support into the managed Earthmover platform. Through real-world examples, we will showcase how these new features can be used to build robust, high-performance data pipelines for cloud-based geospatial analytics and machine learning.
Joe is a climate scientist and civil engineer hailing from the Arizona desert. After a stint as a mountain guide on Denali, Joe launched a scientific career in hydroclimate modeling with a PhD from the University of Washington. After working as a scientist at NCAR, he co-founded CarbonPlan, an open-science climate research and policy think tank. He left CarbonPlan to start Earthmover, where he focuses on the software and data infrastructure challenges at the core of climate science.