Joachim Ungar
Sessions
Over the past few years, mapchete has evolved from a tile-based raster and vector processing library into a modular ecosystem for building and operating large-scale geospatial data processing pipelines. Previous presentations at FOSS4G focused on the core package and touched on scalable processing patterns using dask and mapchete.
This talk presents the next step: the open-source publication of additional components developed in a production context, including mapchete EO, mapchete Hub, and mapchete Hub CLI. These packages extend mapchete's core processing model towards Earth Observation (EO) use cases and distributed execution, with a focus on reproducibility, scalability, and a variety of (pre)processing capabilities relevant for EO.
mapchete EO provides higher-level primitives for working with satellite imagery (primarily Sentinel-2), including typical preprocessing steps such as cloud masking, BRDF correction, and temporal compositing. These components are derived from operational pipelines used in the EOxCloudless (cloudless.eox.at) product line, where consistent large-scale processing and data quality constraints are critical.
mapchete Hub introduces a service layer for orchestrating distributed processing of mapchete tasks. Processing jobs can be submitted, scheduled, and monitored via an API. The API design is oriented towards the OGC API - Processes standard, aligning mapchete-based workflows with emerging interoperable interfaces in the geospatial ecosystem. The accompanying CLI (mapchete Hub CLI) provides a minimal interface for interacting with this system without requiring custom integration.
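To make the reference to OGC API - Processes more concrete, the following is a minimal sketch of what a job submission looks like under that standard: a POST to `/processes/{processID}/execution` with a JSON body containing `inputs`, and job status available at `/jobs/{jobID}`. The base URL, process identifier, and input names are hypothetical and are not taken from mapchete Hub; the request is only built, not sent.

```python
import json
from urllib.request import Request

# Hypothetical values for illustration only; not taken from mapchete Hub.
BASE_URL = "https://hub.example.com"
PROCESS_ID = "example-process"


def build_execute_request(base_url, process_id, inputs):
    """Build (but do not send) an OGC API - Processes execute request."""
    body = json.dumps({"inputs": inputs}).encode("utf-8")
    return Request(
        url=f"{base_url}/processes/{process_id}/execution",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Ask the server to run the job asynchronously.
            "Prefer": "respond-async",
        },
    )


def job_status_url(base_url, job_id):
    """URL at which the standard exposes the status of a submitted job."""
    return f"{base_url}/jobs/{job_id}"


# Input names ("bounds", "zoom") are assumptions made for this sketch.
request = build_execute_request(
    BASE_URL, PROCESS_ID, {"bounds": [16.0, 48.0, 17.0, 49.0], "zoom": 12}
)
print(request.get_method(), request.full_url)
print(job_status_url(BASE_URL, "1234"))
```

A server that accepts the job asynchronously answers with a `Location` header pointing to the job resource, which a client can then poll for status.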
In addition to the software components, the talk covers recent changes in packaging and distribution. All packages are now published via both PyPI and Conda, and container images are provided through GitHub Container Registry. All packages were moved to a dedicated mapchete organization.
mapchete EO extends mapchete with abstractions and utilities to read from Earth Observation (EO) archives, with a primary focus on Sentinel-2 data. While mapchete provides a tile-based execution model for raster and vector processing, mapchete EO enables reading multidimensional arrays (time series) from well-known data archives.
Class-based abstractions for handling Sentinel-2 products were engineered to also enable usage outside of the mapchete context. They provide a unified interface to various data and metadata archives to automatically mask data using all available metadata masks (SCL, L1C, etc.) as well as to apply BRDF correction while reading the data.
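As an illustration of metadata-based masking, the sketch below masks band data with the Sentinel-2 L2A Scene Classification Layer (SCL) using plain NumPy. It is not the mapchete EO interface: the function name and the choice of classes to mask are assumptions, and the SCL array is assumed to be already resampled to the grid of the bands.

```python
import numpy as np

# SCL values treated as invalid in this sketch (an assumption, adjust as needed):
# 0 no data, 1 saturated or defective, 3 cloud shadow,
# 8 cloud medium probability, 9 cloud high probability, 10 thin cirrus.
SCL_INVALID = [0, 1, 3, 8, 9, 10]


def mask_with_scl(bands, scl, invalid_classes=SCL_INVALID):
    """Return bands as a masked array with invalid SCL pixels masked.

    bands: array of shape (bands, height, width)
    scl:   array of shape (height, width) on the same pixel grid
    """
    invalid = np.isin(scl, invalid_classes)
    # Apply the same 2D mask to every band.
    mask = np.broadcast_to(invalid, bands.shape).copy()
    return np.ma.masked_array(bands, mask=mask)


bands = np.arange(8).reshape(2, 2, 2)
scl = np.array([[4, 9], [3, 5]])  # vegetation, cloud, cloud shadow, bare soil
masked = mask_with_scl(bands, scl)
print(masked)
```

Returning a masked array keeps the original pixel values intact while letting later steps, such as temporal compositing, skip the masked pixels.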
The second part of the talk focuses on operational experience from processing Sentinel-2 data at global scale for the EOxCloudless product line. At this scale, the system needs multiple layers of fallbacks and retries in order to accommodate I/O-related and temporary failures.
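The layering of retries and fallbacks described above can be sketched generically as follows. This is a minimal illustration, not the mechanism used in the EOxCloudless pipelines: the function name, the source names, and the choice of exceptions treated as temporary are all assumptions.

```python
import time


def read_with_fallbacks(sources, read, retries=3, base_delay=0.5, sleep=time.sleep):
    """Try each source in order; retry temporary errors with exponential backoff."""
    errors = []
    for source in sources:
        for attempt in range(retries):
            try:
                return read(source)
            except (ConnectionError, TimeoutError) as exc:
                errors.append((source, attempt, exc))
                if attempt < retries - 1:
                    # Wait 0.5 s, 1 s, 2 s, ... before the next attempt.
                    sleep(base_delay * 2 ** attempt)
        # All retries for this source failed: fall back to the next source.
    raise RuntimeError(f"all sources failed after {len(errors)} attempts")


calls = []


def flaky_read(source):
    """Hypothetical reader: 'primary' is down, 'mirror' succeeds on its second call."""
    calls.append(source)
    if source == "primary":
        raise ConnectionError("archive unreachable")
    if calls.count(source) < 2:
        raise TimeoutError("temporary timeout")
    return f"data from {source}"


result = read_with_fallbacks(["primary", "mirror"], flaky_read, sleep=lambda s: None)
print(result)  # data from mirror
```

Only errors that are plausibly temporary are retried here; any other exception propagates immediately so that genuine bugs are not hidden behind retries.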
Additional challenges arise when processing data across the antimeridian, where data coverage is not consistent between various archives. These edge cases expose limitations that are not apparent in smaller-scale workflows and require careful handling within global processing pipelines.
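One common way to handle geometries that cross the antimeridian is to split them into parts on either side of the 180° meridian before querying or reading data. The sketch below does this for a longitude/latitude bounding box; it is an illustration only, and it assumes the convention (used, for example, in GeoJSON) that a box whose western edge has a larger longitude than its eastern edge crosses the antimeridian.

```python
def split_at_antimeridian(west, south, east, north):
    """Split a lon/lat bounding box that crosses the antimeridian.

    Returns a list with one box if it does not cross, otherwise two boxes:
    one ending at 180 degrees and one starting at -180 degrees.
    """
    if west <= east:
        return [(west, south, east, north)]
    return [(west, south, 180.0, north), (-180.0, south, east, north)]


# A box around Fiji spanning 170°E to 170°W.
parts = split_at_antimeridian(170, -20, -170, -10)
print(parts)
```

Each part can then be processed like any other bounding box, which sidesteps tools that would otherwise interpret the original box as spanning almost the entire globe.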
The presentation will outline these challenges and discuss their implications for the design of robust, large-scale Sentinel-2 processing pipelines within an open source framework.