FOSS4G NA 2024

Speeding up raster/vector zonal analysis with exactextract
09-11, 13:00–13:30 (America/Chicago), Grand C

Exactextract uses a novel algorithm to provide a fast implementation of raster/vector zonal statistics in C++, Python, and R. This talk will compare exactextract to other implementations, demonstrate its usage, and offer tips for getting the best performance.


The exactextract library provides a fast algorithm for calculating the intersection between a geometry and a rectangular grid in C++, Python, and R. This has numerous applications, the most important of which is raster/vector zonal statistics, which answers questions such as “what are the predominant land uses of each watershed?”, “what is the population-weighted mean temperature of each county?”, or “what fraction of each state’s population is exposed to each category of drought?”
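To make the "population-weighted mean temperature" question concrete, here is a minimal sketch of how such a statistic can be formed once each cell's coverage fraction is known. The function name `weighted_mean` and the sample numbers are illustrative assumptions, not part of the exactextract API, and the formula (each value weighted by coverage fraction times population) is one common definition rather than a statement of what the library does internally.

```python
def weighted_mean(values, coverage, weights):
    """Mean of `values`, weighting each cell by coverage fraction times a secondary weight.

    Hypothetical helper for illustration; all three sequences hold one entry per grid cell.
    """
    numerator = 0.0
    denominator = 0.0
    for value, fraction, weight in zip(values, coverage, weights):
        numerator += value * fraction * weight
        denominator += fraction * weight
    if denominator == 0:
        return float("nan")  # the polygon covers no populated cells
    return numerator / denominator


# Three cells touching one county (made-up numbers):
temperature = [10.0, 20.0, 30.0]     # raster values
coverage = [1.0, 0.5, 0.0]           # fraction of each cell inside the county
population = [100.0, 200.0, 500.0]   # weighting raster values

result = weighted_mean(temperature, coverage, population)
```

The third cell lies outside the county (coverage 0), so its large population does not influence the result.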

The exactextract algorithm follows a geometry's boundary as it enters and exits each grid cell, identifying all partially covered cells in a single pass. A limited number of point-in-polygon tests then classify the remaining cells as inside or outside the polygon, and a flood fill propagates that status to adjacent cells. As a result, the algorithm computes the fraction of each grid cell that intersects a polygon, and does so more quickly than algorithms that check only cell centers but need a separate point-in-polygon test for every cell. Beyond the performance benefit, accounting for partially covered cells can matter when grid cells are large or polygons have an irregular shape. The implementation also allows control over how much memory is used, so that arbitrarily large rasters can be processed with a fixed amount of memory.
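The quantity described above, the fraction of each cell covered by a polygon, can be illustrated with a brute-force sketch. It is deliberately not the exactextract traversal algorithm: it clips the polygon against every cell (Sutherland-Hodgman clipping followed by the shoelace area formula), which is simple but much slower. The function names are hypothetical, and the sketch assumes a simple polygon without holes given as a list of (x, y) vertices.

```python
def clip_half_plane(points, axis, bound, keep_greater):
    """Clip a polygon ring against one axis-aligned half-plane (one Sutherland-Hodgman step)."""
    def inside(p):
        return p[axis] >= bound if keep_greater else p[axis] <= bound

    def crossing(p, q):
        # Point where segment p->q crosses the line coordinate[axis] == bound.
        t = (bound - p[axis]) / (q[axis] - p[axis])
        return (p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1]))

    out = []
    for i, cur in enumerate(points):
        prev = points[i - 1]
        if inside(cur):
            if not inside(prev):
                out.append(crossing(prev, cur))
            out.append(cur)
        elif inside(prev):
            out.append(crossing(prev, cur))
    return out


def ring_area(points):
    """Unsigned polygon area via the shoelace formula."""
    total = 0.0
    for i, (x1, y1) in enumerate(points):
        x0, y0 = points[i - 1]
        total += x0 * y1 - x1 * y0
    return abs(total) / 2.0


def coverage_fraction(polygon, xmin, ymin, xmax, ymax):
    """Fraction of the cell [xmin, xmax] x [ymin, ymax] covered by `polygon`."""
    ring = polygon
    for axis, bound, keep_greater in ((0, xmin, True), (0, xmax, False),
                                      (1, ymin, True), (1, ymax, False)):
        ring = clip_half_plane(ring, axis, bound, keep_greater)
        if not ring:
            return 0.0
    return ring_area(ring) / ((xmax - xmin) * (ymax - ymin))


# A triangle laid over a 2 x 2 grid of unit cells, keyed by (column, row).
triangle = [(0.0, 0.0), (1.5, 0.0), (0.0, 1.5)]
fractions = {
    (col, row): coverage_fraction(triangle, col, row, col + 1, row + 1)
    for row in range(2) for col in range(2)
}
```

In this example only the center of cell (0, 0) falls inside the triangle, so a center-only method would report a covered area of 1.0, while summing the coverage fractions recovers the triangle's true area of 1.125.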

Through a series of brief case studies, this talk will show how the library offers simple and high-performing solutions to problems in raster/vector zonal statistics. Factors affecting performance will also be discussed, including raster compression methods and chunk sizes, GDAL block cache configuration, and the spatial distribution of input features.