Efficient pixel-scale upstream covariate computation for environmental machine learning
2026-09-02, Conference Management Room1

Hydrological ML requires costly upstream catchment aggregation. We present an efficient flow-accumulation-based method bypassing per-pixel delineation, achieving orders-of-magnitude speedups. Implemented in GRASS and Python, this open-source approach enables scalable, high-resolution modeling, demonstrated by a countrywide 90 m Random Forest nitrogen prediction.


Environmental data cubes play an increasingly important role in geospatial machine learning pipelines by collating harmonised layers derived from heterogeneous sources into analysis-ready form. Open-source geospatial infrastructures are central to enabling this harmonisation, interoperability, and reproducibility across domains.
Yet domains such as freshwater hydrology require additional, costly data processing to convert gridded environmental layers into river-network-aware predictor variables. These variables summarise environmental conditions as zonal statistics (such as average clay content or standard deviation of air temperature) across the entire upstream catchment draining to a given stream pixel.
In principle, this upstream aggregation can be achieved by delineating a separate upstream catchment for every stream pixel and computing zonal statistics within it. However, this approach already becomes a major computational bottleneck at the scale of a small European country (tens of thousands of km²), even when parallelised across high-performance computing clusters.
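To make the cost of the delineation-based baseline concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes a single-flow-direction (D8) routing on a tiny elevation grid so that each catchment is a crisp set of cells; the function names `d8_receiver` and `upstream_mean_by_delineation` are hypothetical and chosen for illustration only.

```python
from collections import deque

# Offsets to the eight neighbours of a grid cell.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]


def d8_receiver(elev, r, c):
    """Return the steepest-descent neighbour of (r, c), or None for a sink."""
    rows, cols = len(elev), len(elev[0])
    best, best_slope = None, 0.0
    for dr, dc in OFFSETS:
        rr, cc = r + dr, c + dc
        if 0 <= rr < rows and 0 <= cc < cols:
            slope = (elev[r][c] - elev[rr][cc]) / (dr * dr + dc * dc) ** 0.5
            if slope > best_slope:
                best, best_slope = (rr, cc), slope
    return best


def upstream_mean_by_delineation(elev, covariate):
    """Upstream mean per cell, delineating one catchment per cell (assumed D8)."""
    rows, cols = len(elev), len(elev[0])
    # Reverse drainage graph: which cells drain directly into each cell.
    donors = {(r, c): [] for r in range(rows) for c in range(cols)}
    for cell in list(donors):
        receiver = d8_receiver(elev, *cell)
        if receiver is not None:
            donors[receiver].append(cell)

    means = [[0.0] * cols for _ in range(rows)]
    for outlet in donors:
        # Delineate the catchment of this outlet by walking upstream.
        catchment, queue = [outlet], deque([outlet])
        while queue:
            for donor in donors[queue.popleft()]:
                catchment.append(donor)
                queue.append(donor)
        # Zonal statistic (here: the mean) over the delineated catchment.
        total = sum(covariate[r][c] for r, c in catchment)
        means[outlet[0]][outlet[1]] = total / len(catchment)
    return means


# A one-row strip sloping to the right: each cell drains into its right neighbour.
elev = [[4.0, 3.0, 2.0, 1.0]]
clay = [[10.0, 20.0, 30.0, 40.0]]
print(upstream_mean_by_delineation(elev, clay))  # [[10.0, 15.0, 20.0, 25.0]]
```

In this sketch every headwater cell is revisited once for each outlet downstream of it, so the work grows with the sum of all catchment sizes rather than with the number of cells.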
Here we show an efficient and accurate method for calculating ML-ready, river-network-aware predictor variables based on a multi-flow-direction flow accumulation algorithm that requires no per-pixel catchment delineation. We validate the method against conventional delineation-based zonal statistics, demonstrating close numerical agreement while achieving orders-of-magnitude speedup.
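The idea of replacing per-pixel delineation with flow accumulation can be illustrated as follows. This is a minimal sketch under stated assumptions, not the GRASS feature or the Python package described here: it assumes that the upstream mean is obtained as the accumulated covariate divided by the accumulated cell count, that the standard deviation follows from an additional accumulation of the squared covariate, and that flow is split among all lower neighbours in proportion to slope. The names `mfd_accumulate` and `upstream_mean_std` are hypothetical.

```python
import math

# Offsets to the eight neighbours of a grid cell.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]


def mfd_accumulate(elev, values):
    """Accumulate `values` downslope, splitting flow among lower neighbours by slope."""
    rows, cols = len(elev), len(elev[0])
    acc = [[float(v) for v in row] for row in values]
    # Visit cells from highest to lowest so that every contributor is
    # complete before it passes its total downslope.
    cells = sorted(
        ((r, c) for r in range(rows) for c in range(cols)),
        key=lambda rc: elev[rc[0]][rc[1]],
        reverse=True,
    )
    for r, c in cells:
        lower = []
        for dr, dc in OFFSETS:
            rr, cc = r + dr, c + dc
            if 0 <= rr < rows and 0 <= cc < cols and elev[rr][cc] < elev[r][c]:
                slope = (elev[r][c] - elev[rr][cc]) / math.hypot(dr, dc)
                lower.append((rr, cc, slope))
        slope_sum = sum(s for _, _, s in lower)
        for rr, cc, slope in lower:
            acc[rr][cc] += acc[r][c] * slope / slope_sum
    return acc


def upstream_mean_std(elev, covariate):
    """Upstream mean and (population) standard deviation from three accumulations."""
    ones = [[1.0] * len(row) for row in covariate]
    squares = [[v * v for v in row] for row in covariate]
    area = mfd_accumulate(elev, ones)          # accumulated cell count
    total = mfd_accumulate(elev, covariate)    # accumulated covariate
    total_sq = mfd_accumulate(elev, squares)   # accumulated squared covariate
    mean = [[t / a for t, a in zip(t_row, a_row)]
            for t_row, a_row in zip(total, area)]
    std = [[math.sqrt(max(q / a - m * m, 0.0)) for q, a, m in zip(q_row, a_row, m_row)]
           for q_row, a_row, m_row in zip(total_sq, area, mean)]
    return mean, std


# A one-row strip sloping to the right: each cell drains into its right neighbour.
elev = [[4.0, 3.0, 2.0, 1.0]]
clay = [[10.0, 20.0, 30.0, 40.0]]
mean, std = upstream_mean_std(elev, clay)
print(mean)  # [[10.0, 15.0, 20.0, 25.0]]
```

Each accumulation visits every cell once after sorting, regardless of how large the catchments are, which is where the speedup over one delineation per pixel would come from in this sketch. With multiple flow directions a cell can contribute only a fraction of its value to a given downstream pixel, so the resulting statistics are flow-weighted rather than taken over a crisp catchment mask.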
We implement the method both as a new feature of the GRASS software suite and as an open-source Python package providing a generic interface to arbitrary flow accumulation backends, such as pysheds, making it directly pluggable into existing ML pipelines. The complete workflow runs reproducibly in a Jupyter notebook, lowering the barrier for users less familiar with geospatial scripting.
We demonstrate the approach on an end-to-end machine learning workflow: high-resolution (90 m) Random Forest predictions of total nitrogen across the full stream network of a European country, trained on multiple environmental predictors computed with the proposed flow-accumulation-based method.
At larger scales, this open and scalable approach supports evidence-based nutrient management and policy decisions, illustrating how FOSS4G tools can accelerate environmental modelling, reproducible research, and applied geospatial machine learning.


Level of technical complexity: 2 - intermediate
I make my conference contribution available under the CC BY 4.0 license. The conference contribution comprises the abstract, the text contribution for the conference proceedings, the presentation materials as well as the video recording and live transmission of the presentation: