Joris van den Bossche
I am a core contributor to Pandas and Apache Arrow and maintainer of GeoPandas. I did a PhD at Ghent University and VITO in air quality research, worked at the Paris-Saclay Center for Data Science, and currently, I am a freelance software developer and teacher and working for Voltron Data.
Sessions
GeoPandas is one of the core packages in the Python ecosystem to work with geospatial vector data. By combining the power of several open source geo tools (GEOS/Shapely, GDAL/fiona, PROJ/pyproj) and extending the pandas data analysis library to work with geographic objects, it is designed to make working with geospatial data in Python easier. GeoPandas enables you to easily do operations in Python that would otherwise require desktop applications like QGIS or a spatial database such as PostGIS.
This talk will give an overview of recent developments in the GeoPandas community, both in the project itself as in the broader ecosystem of packages on which GeoPandas depends or that extend GeoPandas. We will highlight some changes and new features in recent GeoPandas versions, such as the new interactive explore() visualisation method, improvements in joining based on proximity, better IO options for PostGIS and Apache Parquet and Feather files, and others. But some of the important improvements coming to GeoPandas are happening in other packages. The Shapely 2.0 release is nearing completion, and will provide fast vectorized versions of all its geospatial functionalities. This will help to substantially improve the performance of GeoPandas. In the area of reading and writing traditional GIS files using GDAL, the pyogrio package is being developed to provide a speed-up on that front. Another new project is dask-geopandas, which is merging the geospatial capabilities of GeoPandas with the scalability of Dask. This way, we can achieve parallel and distributed geospatial operations.
The Apache Arrow (https://arrow.apache.org/) project specifies a standardized language-independent columnar memory format. It enables shared computational libraries, zero-copy shared memory, streaming messaging and interprocess communication without serialization overhead, etc. Nowadays, Apache Arrow is supported by many programming languages.
Geospatial data often comes in tabular format, with one (or multiple) column with feature geometries and additional columns with feature attributes. This is a perfect match for Apache Arrow. Defining a standard and efficient way to store geospatial data in the Arrow memory layout (https://github.com/geopandas/geo-arrow-spec/) can help interoperability between different tools and enables us to tap into the full Apache Arrow ecosystem:
- Efficient, columnar data formats. Apache Arrow contains an implementation of the Apache Parquet file format, and thus gives us access to GeoParquet (https://github.com/opengeospatial/geoparquet) and functionalities to interact with this format in partitioned and/or cloud datasets.
- The Apache Arrow project includes several mechanisms for fast data exchange (the IPC message format and Arrow Flight for transferring data between processes and machines; the C Data Interface for zero-copy sharing of data between independent runtimes running in the same process). Those mechanisms can make it easier to efficiently share data between GIS tools such as GDAL and QGIS and bindings in Python, R, Rust, with web-based applications, etc.
- Several projects in the Apache Arrow community are working on high-performance query engines for computing on in-memory and bigger-than-memory data. Being able to store geospatial data in Arrow will make it possible to extend those engines with spatial queries.