FOSS4G 2022 general tracks

Geospatial and Apache Arrow: accelerating geospatial data exchange and compute
08-25, 10:00–10:05 (Europe/Rome), Room 4

The Apache Arrow (https://arrow.apache.org/) project specifies a standardized, language-independent columnar memory format. This format enables shared computational libraries, zero-copy shared memory, streaming messaging, and interprocess communication without serialization overhead. Apache Arrow now has implementations in many programming languages.

Geospatial data often comes in tabular form, with one or more columns holding feature geometries and additional columns holding feature attributes. This is a natural fit for Apache Arrow. Defining a standard, efficient way to store geospatial data in the Arrow memory layout (https://github.com/geopandas/geo-arrow-spec/) can improve interoperability between tools and lets us tap into the full Apache Arrow ecosystem:

  • Efficient, columnar data formats. Apache Arrow contains an implementation of the Apache Parquet file format, which gives us access to GeoParquet (https://github.com/opengeospatial/geoparquet) and to functionality for working with this format in partitioned and/or cloud-hosted datasets.
  • Fast data exchange. The Apache Arrow project includes several mechanisms for this: the IPC message format and Arrow Flight for transferring data between processes and machines, and the C Data Interface for zero-copy sharing of data between independent runtimes running in the same process. These mechanisms can make it easier to share data efficiently between GIS tools such as GDAL and QGIS, bindings in Python, R, and Rust, web-based applications, and so on.
  • Several projects in the Apache Arrow community are working on high-performance query engines for computing on in-memory and bigger-than-memory data. Being able to store geospatial data in Arrow will make it possible to extend those engines with spatial queries.

I am a core contributor to pandas and Apache Arrow and a maintainer of GeoPandas. I did a PhD in air quality research at Ghent University and VITO, and worked at the Paris-Saclay Center for Data Science. I am currently a freelance software developer and teacher, and I work for Voltron Data.
