FOSS4G 2022 general tracks

Charles Stern

Charles is a Data Infrastructure Engineer at Columbia University's Lamont-Doherty Earth Observatory working on Pangeo Forge. He is endlessly curious about elegant, open-source tools that help us understand our changing planet. Charles loves exploring in the mountains and tinkering with anything electronic or mechanical.


Sessions

08-26
10:00
30min
Pangeo Forge: Crowdsourcing Open Data in the Cloud
Ryan Abernathey, Charles Stern

Geospatial datacubes (large, complex, interrelated multidimensional arrays with rich metadata) arise in analysis-ready geospatial imagery, level 3/4 satellite products, and especially in ocean / weather / climate simulations and [re]analyses, where they can reach petabytes in size. The scientific Python community has developed a powerful stack for flexible, high-performance analytics of datacubes in the cloud. Xarray provides a core data model and API for analysis of such multidimensional array data. Combined with Zarr or TileDB for efficient storage in object stores (e.g. S3) and Dask for scaling out compute, these tools allow organizations to deploy analytics and machine learning solutions for both exploratory research and production in any cloud platform. Within the geosciences, the Pangeo open science community has advanced this architecture as the “Pangeo platform” (http://pangeo.io/).
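As a rough illustration of this stack (the bucket path and variable name below are hypothetical), a Zarr datacube in object storage can be opened lazily with Xarray and reduced with Dask-backed operations:

```python
import fsspec
import xarray as xr

# Map an object-store prefix (hypothetical bucket) to the key-value
# interface that the Zarr backend understands.
store = fsspec.get_mapper("s3://example-bucket/sst.zarr", anon=True)

# Open the datacube lazily; variables are backed by Dask arrays, so no
# chunk is downloaded until a computation is requested.
ds = xr.open_zarr(store)

# Build a lazy monthly climatology, then trigger the parallel read/reduce.
climatology = ds["sst"].groupby("time.month").mean("time")
result = climatology.compute()
```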

However, there is a major barrier preventing the community from easily transitioning to this cloud-native way of working: the difficulty of bringing existing data into the cloud in analysis-ready, cloud-optimized (ARCO) format. Typical workflows for moving data to the cloud currently consist of either bulk transfers of files into object storage (with a major performance penalty on subsequent analytics) or bespoke, case-by-case conversions to cloud-optimized formats such as TileDB or Zarr. The high cost of this toil is preventing the scientific community from realizing the full benefits of cloud computing. More generally, the products of preparing scientific data for efficient analysis are rarely shared in an open, collaborative way.

To address these challenges, we are building Pangeo Forge (https://pangeo-forge.org/), the first open-source, cloud-native ETL (extract / transform / load) platform focused on multidimensional scientific data. Pangeo Forge consists of two main elements. An open-source Python package, pangeo_forge_recipes, makes it simple for users to define “recipes” for extracting many individual files, combining them along arbitrary dimensions, and depositing ARCO datasets into object storage. These recipes can be “compiled” to run on many different distributed execution engines, including Dask, Prefect, and Apache Beam. The second element of Pangeo Forge is an orchestration backend that integrates tightly with GitHub as a continuous-integration-style service.
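As a hedged sketch of what such a recipe can look like (the archive URL, date range, and chunk sizes below are hypothetical placeholders, and the exact pangeo_forge_recipes API may differ between releases), a FilePattern describes the source files and an XarrayZarrRecipe describes how to combine them into a single Zarr store:

```python
import pandas as pd
from pangeo_forge_recipes.patterns import ConcatDim, FilePattern
from pangeo_forge_recipes.recipes import XarrayZarrRecipe

# Hypothetical archive layout: one NetCDF file per day for one year.
dates = pd.date_range("2020-01-01", "2020-12-31", freq="D")

def make_url(time):
    # Build the source URL for a single day's file (placeholder host).
    return f"https://data.example.org/sst/{time:%Y/%m/%d}.nc"

# The pattern maps keys along the concatenation dimension ("time") to URLs.
pattern = FilePattern(make_url, ConcatDim("time", dates, nitems_per_file=1))

# The recipe declares how to extract the files, combine them along "time",
# and deposit the result as a single chunked Zarr store in object storage.
recipe = XarrayZarrRecipe(pattern, target_chunks={"time": 40})
```

The recipe object itself carries no execution logic; an execution engine such as Dask, Prefect, or Apache Beam compiles and runs it, which is what allows the same recipe definition to move between platforms.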

We are using Pangeo Forge to populate a multi-petabyte-scale shared library of open-access, analysis-ready, cloud-optimized ocean, weather, and climate data spread across a global federation of public cloud storage: not a “data lake” but a “data ocean”. Inspired directly by the success of Conda Forge, we aim to leverage the enthusiasm of the open science community to turn data preparation and cleaning from a private chore into a shared, collaborative activity. By creating ARCO datasets only via version-controlled recipe feedstocks (GitHub repos), we also maintain full provenance tracking for all data in the library.

You will leave this talk with a clear understanding of how to access this data library, craft your own Pangeo Forge recipe, and become a contributor to our growing collection of community-sourced recipes.

Open Data
Room 9