FOSS4G NA 2024

Pangeo History: A Tale of Large Scale Computational Infrastructure
09-10, 11:30–12:00 (America/Chicago), Grand H

Pangeo is an ecosystem of OSS tools like Xarray, Dask, Jupyter, Zarr, and others used in large-scale geospatial analysis. This talk tells the history of how these tools came together and where they're going today.


This story is interesting because it blends software, science, and funding/collaboration. Pangeo was both an incredible success (it revolutionized large-scale raster computation in geoscience) and a failure (it struggled to grow into a fully self-sustaining organization or service, despite its momentum). It's a story full of lessons for people looking to make change in the geosciences through software development.

An outline might look something like the following:

Introduction

  • What problems do these tools solve?
  • Who uses them?
  • Who builds them?
  • They're pretty successful. How did they get this way, and what can we learn?

Pre-Pangeo Years

  • Dask and Xarray started at Anaconda and Climate Corp, respectively; both employers gave developers some time to focus on integrating the two
  • Grew in popularity, becoming the standard
  • Developed distributed computing capabilities, suddenly able to easily process terabytes
  • Geoscientists began organizing around the projects, and the first Pangeo meeting was held

Pangeo Working Group

  • Three institutions started working together (Columbia University, NCAR, Anaconda)
  • More joined later (UK Met Office, UCAR, ECMWF)
  • Started on HPC, but then did a cloud demo with Kubernetes
  • Massive shift to cloud computing
  • Around this time we also brought on Zarr and started to push it as a cloud alternative to HDF5/NetCDF

Cloud Infrastructure Management

  • Pangeo developers started spending lots of time setting up and maintaining Kubernetes clusters for research groups
  • This was great! But also terrible to maintain
  • Dev work started to slow down a bit

Community Development

  • Fortunately, community engagement picked up considerably at the same time
  • Pangeo today has many working groups, a Discourse forum, workshops, etc.

Pangeo Goes Corporate

  • Original Pangeo developers now all run companies
    • Dask -> Coiled for compute
    • Xarray+Zarr -> Earthmover for storage
    • Jupyter -> 2i2c for cloud notebook environments
  • Open source worked well for the individual tools, but running services was hard for an OSS community

Lessons Learned

  • Pay engineers and scientists from the same grants
  • Easy to build momentum when you're solving great pain
  • OSS is great for tools, less great for cloud services
