FOSS4G 2022 general tracks

Agile Geo-Analytics: Stream processing of raster- and vector data with dask-geomodeling
2022-08-25, 17:15–17:45 (Europe/Rome), Room 4

We present dask-geomodeling: an open source Python library for stream processing of GIS raster and vector data. The core idea is that data is only processed when required, thereby avoiding unnecessary computations. While setting up a dask-geomodeling computation, there is instant feedback on the result, which gives the (geo) data scientist a fast feedback loop. Big datasets can be processed by parallelizing multiple data queries, either on a single machine or on a distributed system.

Abstract

In geographical information systems (GIS), we often deal with data pipelines that derive map layers from various datasets. For instance, a water depth map is computed by subtracting the digital elevation map (DEM) from a water level map. These procedures are often done using open source products such as PostGIS and QGIS. However, for medium to large datasets (> 10 GB) such analyses become costly due to memory restrictions and computational cost. As a rule, these issues are tackled by manually cutting the dataset into smaller parts. This is a tedious and time-consuming task, and it is not feasible when the analysis has to be repeated regularly.
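At its core, the water depth example is simple per-pixel array arithmetic. The following NumPy sketch (with made-up DEM and water level tiles; the array values are purely illustrative, not from any real dataset) shows the underlying computation:

```python
import numpy as np

# Hypothetical 3x3 tiles of a digital elevation map (DEM) and a
# water level map, both in meters above the same datum.
dem = np.array([[1.0, 2.0, 3.0],
                [1.5, 2.5, 3.5],
                [2.0, 3.0, 4.0]])
water_level = np.full((3, 3), 2.5)

# Water depth is the water level minus the terrain height,
# clipped at zero: cells where the terrain rises above the
# water level are dry.
depth = np.clip(water_level - dem, 0.0, None)
```

Doing this with plain NumPy requires both full rasters in memory at once, which is exactly the limitation the lazy, chunked approach described below avoids.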

We present the open source Python library dask-geomodeling [1] to solve this issue. Instead of a script, dask-geomodeling requires a so-called “graph”, which is the definition of all operations that are required to compute the derived dataset. This graph is generated by plain Python code, for instance:

from dask_geomodeling.raster import RasterFileSource

plus_one = RasterFileSource('path/to/tiff') + 1

Note that these operations are lazy: no actual computation is done yet, so the above line executes fast. Only when actual data is requested:

plus_one.get_data( 
    bbox=(155000, 463000, 156000, 464000), 
    projection='epsg:28992', width=1000, height=1000 
)

is an array containing the data computed. There is no need to load the whole TIFF file into memory if you only use a small part!

The computation occurs in two steps. First, a computational graph containing the required functions is generated; during this step, the operations may be chunked into smaller parts. Second, this graph is evaluated by dask [2], using any scheduler that dask provides (single-threaded, multithreading, multiprocessing, or distributed).
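The second step can be illustrated with a tiny stand-in written in plain Python. This is not dask's actual scheduler, just a sketch of the idea: a task graph maps keys to either literal values or (function, argument-keys) tuples, and the evaluator resolves dependencies recursively:

```python
# A dask-style task graph for "source + 1": keys map to literal
# values or to (callable, *argument_keys) tuples.
graph = {
    "source": [10, 20, 30],  # stands in for a chunk of raster data
    "plus_one": (lambda xs: [x + 1 for x in xs], "source"),
}

def evaluate(graph, key):
    """Resolve a key in the task graph (a single-threaded 'scheduler')."""
    task = graph[key]
    if isinstance(task, tuple) and callable(task[0]):
        func, *deps = task
        # Recursively evaluate dependencies, then apply the function.
        return func(*(evaluate(graph, dep) for dep in deps))
    return task  # a literal value

result = evaluate(graph, "plus_one")  # → [11, 21, 31]
```

In practice, dask's schedulers do the same dependency resolution, but can evaluate independent chunks in parallel across threads, processes, or a cluster.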

This library is open source under the name “dask-geomodeling” and is distributed via GitHub, PyPI, and Anaconda. A hosted cloud version is also available under the name Lizard Geoblocks [3]. Currently, we have implemented a range of operations for rasters, vectors, and combinations of the two. The community is welcome to use our library, benefit from it, and expand it!

References

  1. dask-geomodeling, https://github.com/nens/dask-geomodeling, https://dask-geomodeling.readthedocs.io/
  2. dask, https://dask.org/
  3. Lizard Geoblocks, https://lizard.net/

I am currently working as a lead software developer at Nelen & Schuurmans, on the Python backends of 3Di (integrated hydrodynamic model) and Lizard (a data platform for the physical environment).

My background is a mix of different branches of science: I started out with chemistry and moved via physics to software development. During my PhD (thesis) in particular, I focused on data analysis, tracking colloidal particles on lipid membranes in 3D microscopy images.

Open source projects I am actively contributing to:

  • Shapely / PyGEOS
  • dask-geomodeling
  • trackpy
  • PIMS