Bringing it all together: Zarr, Dask, Knowledge Graphs, and LLMs
11-04, 11:30–12:00 (America/New_York), Lake Anne

We describe our open source work that makes NOAA climate datasets discoverable and queryable via a chat interface, building on top of the incredible open source cloud-native geospatial ecosystem to bring science data closer to users than ever before.

In this talk, we describe our ongoing open source work building a system to which users can pose questions in natural language, such as “What are the trends in sea surface temperatures in near-coastal Gulf waters over the past several decades?”, and get answers derived from real data. These answers are not only correct and comprehensive, but also include detailed provenance information so that users may verify them.

NOAA maintains thousands of datasets of various kinds (remote sensing, in-situ, derived, etc.) across multiple domains (climate, weather, ecology, etc.) that are consumed by a wide variety of users (scientists, engineers, urban planners, etc.), yet discovering, accessing, and using these datasets remains a significant challenge. This talk describes work undertaken under the NOAA-funded Broad Agency Announcement (BAA) “Study to Determine Natural Language Processing Capabilities with the NCCF Open Knowledge Mesh (KM/NLP)”, which aims to study the feasibility of addressing this challenge through knowledge graphs and state-of-the-art Large Language Models (LLMs).

Key aspects of our solution include:

- consolidating thousands (or even hundreds of thousands) of NetCDF files into virtual Zarr datasets,
- scaling computations on these datasets using Xarray and Dask,
- representing and querying metadata about datasets and variables through knowledge graphs, and
- using LLMs to translate plain-language user questions into graph queries and data transformations that derive answers.
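To make the knowledge-graph step concrete, here is a minimal sketch of how dataset metadata can be represented as triples and matched against a structured pattern. It uses plain Python tuples in place of a real triple store; all dataset identifiers, predicates, and the question-to-pattern translation shown are illustrative assumptions, not the project's actual vocabulary or API (the real system would use a proper ontology and a graph query language such as SPARQL).

```python
# Toy metadata knowledge graph: (subject, predicate, object) triples.
# All names below are hypothetical, for illustration only.
TRIPLES = [
    ("noaa:sst-monthly", "rdf:type", "nccf:Dataset"),
    ("noaa:sst-monthly", "nccf:hasVariable", "var:sea_surface_temperature"),
    ("noaa:sst-monthly", "nccf:spatialCoverage", "region:gulf_of_mexico"),
    ("noaa:precip-daily", "rdf:type", "nccf:Dataset"),
    ("noaa:precip-daily", "nccf:hasVariable", "var:precipitation"),
]

def match(pattern):
    """Return all triples matching a (s, p, o) pattern; None is a wildcard."""
    s, p, o = pattern
    return [
        t for t in TRIPLES
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]

# In the real system, an LLM would translate a question like "What are the
# trends in sea surface temperatures...?" into a structured query; here that
# translation is stubbed out as a hand-written pattern.
hits = match((None, "nccf:hasVariable", "var:sea_surface_temperature"))
datasets = sorted({s for s, _, _ in hits})
print(datasets)  # ['noaa:sst-monthly']
```

Once the relevant dataset is identified this way, its data would be opened as a virtual Zarr store and the actual computation (e.g., a decadal trend) delegated to Xarray and Dask.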

We highlight why this work is exciting: it builds on the fruits of the open source GIS community's labor, such as cloud-optimized data formats, cloud compute, and metadata ontologies, to solve the “last mile” problem of letting users extract insights from data with minimal effort. We also discuss where these building blocks currently fall short and how they may be improved if we want to scale such solutions to even more datasets, perhaps even all of them.