Learning with Spaceborne LiDAR for Enhancement of Bare-Earth Digital Elevation Models from Global Data

Bare-earth Digital Elevation Models (DEMs), also called Digital Terrain Models (DTMs), are fundamental to geospatial applications, from flood modelling and landslide assessment to infrastructure planning and environmental management. However, original publicly accessible global elevation products, such as SRTM (Shuttle Radar Topography Mission), ASTER GDEM (Advanced Spaceborne Thermal Emission and Reflection Radiometer Global Digital Elevation Model), and Copernicus DEM, are Digital Surface Models (DSMs): they include canopy heights and building structures rather than true bare-earth topography. Their vertical accuracies (how closely elevations match the real ground height) typically range from 4 to 15 m RMSE (root mean square error), and their spatial resolutions (size of the smallest discernible detail) are limited to 30 m or coarser. DTMs derived from airborne Light Detection and Ranging (LiDAR) achieve sub-metre vertical accuracy and spatial resolution because laser pulses penetrate vegetation to measure ground elevation. However, the high cost of LiDAR surveys leads to fragmented coverage, even in developed countries. For example, about 16% of New Zealand’s land surface still lacks airborne LiDAR mapping, leaving critical data gaps in the remote, rugged, and densely vegetated terrain where elevation information is most important.

Recent deep learning approaches have the potential to enhance the vertical accuracy and spatial resolution of global DEMs through super-resolution techniques. For example, JSPSR (Joint Spatial Propagation Super-Resolution) networks improve the Copernicus GLO-30 DSM from 30 m to 3 m spatial resolution by utilising high-resolution remote sensing imagery, reducing elevation RMSE by over 70% across diverse sites. However, their performance drops under dense forest canopies. Optical sensors cannot penetrate vegetation, so models must infer ground elevation from indirect cues (such as estimated canopy height or shape). Spaceborne LiDAR missions, such as ICESat-2 with its ATL08 product, provide global terrain measurements from photons that penetrate vegetation. However, incorporating these measurements poses two main challenges. First, after rigorous data quality filtering, ATL08 photons can cover as little as 0.1–0.2% of dense-forest mountain areas. Second, spaceborne LiDAR provides sparse point data, whereas optical imagery and DEMs are dense raster grids, resulting in a data-geometry mismatch that most conventional network architectures cannot inherently accommodate.
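The data-geometry mismatch can be made concrete with a small sketch. The NumPy code below is illustrative only and is not taken from the released framework: it maps a hypothetical table of along-track point coordinates onto a hypothetical 256×256 raster patch using a GDAL-style geotransform, then reports the fraction of cells that receive a measurement. The function names, the patch origin, and the point spacing are all assumptions.

```python
import numpy as np

def points_to_pixels(x, y, geotransform):
    """Map projected coordinates to (row, col) on a north-up raster.

    geotransform uses the GDAL ordering:
    (origin_x, pixel_width, 0, origin_y, 0, -pixel_height).
    """
    x0, dx, _, y0, _, dy = geotransform
    cols = np.floor((np.asarray(x) - x0) / dx).astype(int)
    rows = np.floor((np.asarray(y) - y0) / dy).astype(int)
    return rows, cols

def coverage_fraction(rows, cols, shape):
    """Fraction of raster cells that contain at least one point."""
    h, w = shape
    inside = (rows >= 0) & (rows < h) & (cols >= 0) & (cols < w)
    covered = np.zeros(shape, dtype=bool)
    covered[rows[inside], cols[inside]] = True
    return float(covered.mean())

# Hypothetical 3 m patch and 64 points along a diagonal ground track.
gt = (1_500_000.0, 3.0, 0.0, 5_200_000.0, 0.0, -3.0)
i = np.arange(64)
x = gt[0] + 1.5 + 12.0 * i   # 12 m spacing, centred in every 4th pixel
y = gt[3] - 1.5 - 12.0 * i

rows, cols = points_to_pixels(x, y, gt)
frac = coverage_fraction(rows, cols, (256, 256))
print(f"{frac:.4%} of cells carry a measurement")
```

With 64 distinct cells in a 65,536-cell patch, under 0.1% of the grid is observed, which is why rasterising the points would yield an almost empty channel.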

This study describes an open-source deep neural network framework that tackles these two challenges using a triple-branch multi-modal fusion network. It processes three complementary data streams: high-resolution remote sensing imagery via a Swin Transformer encoder, interpolated Copernicus GLO-30 DEMs via a parallel Swin encoder, and serialised ATL08 along-track data via a sparse encoder that preserves measurement features without rasterisation or interpolation. All code and trained models will be released under an open-source license to support open-science principles.

The main innovation is a multi-scale deformable cross-attention mechanism that enables effective fusion of sparse LiDAR measurements with dense raster data. At each scale, individual ATL08 photons independently query the surrounding image and DEM features through learned deformable sampling patterns, allowing each photon to adaptively sample the contextual information most relevant to elevation prediction at its location. Inspired by Deformable DETR, the mechanism implements bidirectional information flow: image and DEM features inform the interpretation of photon measurements during downscaling, while photon-derived elevation features are injected back into the feature maps during upscaling. This design ensures that sparse yet accurate LiDAR measurements guide feature extraction, while dense image context enriches photon representations, addressing the key challenge of multi-modal feature fusion in open geospatial data science.
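The point-to-raster query step can be sketched as follows. This is a minimal single-head, single-scale NumPy illustration of deformable sampling, not the authors' implementation: in a trained network the offsets and attention logits would be predicted from each photon's features, whereas here they are passed in directly. The function names and array shapes are assumptions.

```python
import numpy as np

def bilinear_sample(feat, ys, xs):
    """Bilinearly sample a (C, H, W) feature map at fractional (row, col)."""
    _, h, w = feat.shape
    ys = np.clip(ys, 0, h - 1)
    xs = np.clip(xs, 0, w - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy, wx = ys - y0, xs - x0
    top = feat[:, y0, x0] * (1 - wx) + feat[:, y0, x1] * wx
    bottom = feat[:, y1, x0] * (1 - wx) + feat[:, y1, x1] * wx
    return top * (1 - wy) + bottom * wy          # (C, N, K)

def deformable_point_attention(feat, points, offsets, logits):
    """Each point gathers K samples around itself and mixes them.

    feat:    (C, H, W) dense feature map
    points:  (N, 2) point locations as (row, col)
    offsets: (N, K, 2) sampling offsets (learned in a real model)
    logits:  (N, K) attention scores (learned in a real model)
    returns: (N, C) one context vector per point
    """
    loc = points[:, None, :] + offsets           # (N, K, 2)
    sampled = bilinear_sample(feat, loc[..., 0], loc[..., 1])
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w = w / w.sum(axis=1, keepdims=True)         # softmax over the K samples
    return np.einsum("cnk,nk->nc", sampled, w)

# Toy feature map whose two channels store the row and column index.
rr, cc = np.meshgrid(np.arange(6.0), np.arange(6.0), indexing="ij")
feat = np.stack([rr, cc])
points = np.array([[2.0, 3.0]])
offsets = np.array([[[0.5, 0.0], [-0.5, 0.0], [0.0, 0.25], [0.0, -0.25]]])
logits = np.zeros((1, 4))
context = deformable_point_attention(feat, points, offsets, logits)
```

Because the sampling happens at the photon's own coordinates, no rasterisation of the sparse measurements is needed; only the dense feature map is indexed.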

Finally, a Spatial Propagation Network (SPN) transforms sparse DSM grids into dense predictions by conditioning content-adaptive kernels on the fused multi-modal features. Over multiple propagation iterations, corrections spread along paths guided by image content, following ridgelines, thalwegs, or areas with similar vegetation characteristics, while precise photon measurements are re-injected at each iteration to retain accuracy at known locations.
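A rough sketch of propagation with measurement re-injection is given below. It is a simplified stand-in, not the SPN described above: the affinities come from a fixed Gaussian similarity on a single guidance channel rather than from learned content-adaptive kernels, and only four neighbours are used. The function name and parameters are assumptions.

```python
import numpy as np

def propagate_corrections(init, guide, anchor_mask, anchor_vals,
                          iters=50, sigma=0.1):
    """Spread elevation corrections from sparse anchors over a grid.

    init:        (H, W) initial elevation, e.g. an interpolated coarse DEM
    guide:       (H, W) guidance channel (stand-in for fused features)
    anchor_mask: (H, W) bool, True where a sparse measurement exists
    anchor_vals: (H, W) measured elevation, read only where the mask is True
    """
    h, w = init.shape
    shifts = [(-1, 0), (1, 0), (0, -1), (0, 1)]

    def neighbour(a, dy, dx):
        # Shifted view with edge replication at the borders.
        p = np.pad(a, 1, mode="edge")
        return p[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]

    # Affinity is high where the guidance values of two cells are similar.
    aff = np.stack([
        np.exp(-((guide - neighbour(guide, dy, dx)) ** 2) / (2 * sigma ** 2))
        for dy, dx in shifts
    ])
    aff = aff / (aff.sum(axis=0) + 1.0)   # neighbour weights sum to < 1
    centre = 1.0 - aff.sum(axis=0)        # remaining weight stays on the cell

    residual = np.where(anchor_mask, anchor_vals - init, 0.0)
    corr = residual.copy()
    for _ in range(iters):
        new = centre * corr
        for k, (dy, dx) in enumerate(shifts):
            new += aff[k] * neighbour(corr, dy, dx)
        corr = np.where(anchor_mask, residual, new)  # re-inject measurements
    return init + corr

# Toy example: a flat 100 m surface with two anchors measured at 95 m.
init = np.full((8, 8), 100.0)
guide = np.zeros((8, 8))
mask = np.zeros((8, 8), dtype=bool)
mask[2, 2] = mask[5, 5] = True
vals = np.zeros((8, 8))
vals[mask] = 95.0
refined = propagate_corrections(init, guide, mask, vals)
```

Each update is a convex combination of neighbouring corrections, so the refined surface stays between the initial and measured elevations while matching the anchors exactly.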

Sparse-to-dense progressive supervision computes elevation loss only at ATL08-confirmed locations (typically fewer than 64 per 256×256 training patch), whereas multi-scale deep supervision heads propagate gradient information throughout the network, even in areas without direct elevation constraints. This strategy prevents learning invalid correlations in unobserved areas while retaining end-to-end differentiability.
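The supervision scheme can be illustrated with a masked loss. The sketch below is an assumption-laden simplification: it uses an L1 penalty (the loss type is not specified above), assumes every multi-scale head has already been resampled to the target resolution, and uses hypothetical function names and head weights.

```python
import numpy as np

def sparse_l1_loss(pred, target, valid_mask):
    """Mean absolute error over cells that carry a measurement."""
    if not valid_mask.any():
        return 0.0
    return float(np.abs(pred - target)[valid_mask].mean())

def deep_supervision_loss(preds, target, valid_mask, weights):
    """Weighted sum of sparse losses over several prediction heads."""
    return sum(wt * sparse_l1_loss(p, target, valid_mask)
               for p, wt in zip(preds, weights))

# Toy 4x4 patch with two observed cells; unobserved cells are NaN.
target = np.full((4, 4), np.nan)
target[0, 0], target[3, 3] = 2.0, 4.0
valid = ~np.isnan(target)

coarse_head = np.ones((4, 4))    # hypothetical intermediate prediction
final_head = np.zeros((4, 4))    # hypothetical final prediction
loss = deep_supervision_loss([final_head, coarse_head], target, valid,
                             weights=[1.0, 0.5])
print(loss)
```

Cells outside the mask contribute nothing to the loss, so the network is never penalised against interpolated or guessed elevations.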

We evaluate our approach using the DFC30 dataset, which we augment with spatially matching ATL08 measurements. The training set contains 12,728 image-DSM-photon tuples, and the test set contains 3,196, both with airborne LiDAR ground truth. Our results show considerable improvements versus the baseline method, JSPSR. Across all test sites at 3 m spatial resolution, elevation accuracy (RMSE) improves by 8% in open terrain (from 1.1 m to 1.01 m) and 28% in dense forest canopies (from 6.7 m to 4.8 m). Overall, we achieve a vertical accuracy of 3.2 m for all vegetation classes. The largest improvements occur where optical-only methods perform weakest. In indigenous forests with dense understory and complex terrain, our method decreases systematic bias from 15.1 m to 8.8 m.
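The relative improvements quoted above follow directly from the reported before and after values. The short check below recomputes them; the helper name is hypothetical, and the percentage for the bias reduction is derived here from the reported figures rather than stated in the text.

```python
def relative_reduction(baseline, improved):
    """Percentage reduction of an error metric relative to the baseline."""
    return 100.0 * (baseline - improved) / baseline

reported = [
    ("open terrain RMSE (m)", 1.1, 1.01),
    ("dense forest RMSE (m)", 6.7, 4.8),
    ("indigenous forest bias (m)", 15.1, 8.8),
]
for label, baseline, ours in reported:
    pct = relative_reduction(baseline, ours)
    print(f"{label}: {baseline} -> {ours} ({pct:.1f}% reduction)")
```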

All code, trained weights, and preprocessing tools will be made public under an open-source license. This supports full reproducibility and allows community adaptation. The trained model enables end-to-end generation of 3 m DTMs from global DEMs for unmapped areas of New Zealand. It will produce a seamless bare-earth elevation product at national scale, covering about 268,000 square kilometres. This supports applications such as landslide mapping, hydrological analysis for freshwater management, carbon stock assessment in forests, and infrastructure planning in rural areas.

For the FOSS4G community, this work makes three key contributions. First, it delivers a practical, open-source solution for generating high-resolution bare-earth DTMs by fusing global datasets (Copernicus DEM, remote sensing imagery, and ICESat-2 ATL08) using reproducible methods. Second, it provides an architectural strategy that respects the geometry of different data modalities (e.g., constructed tables, sparse point clouds, and dense rasters), offering a template for multi-modal fusion in open geospatial science. Third, it shows that open data and software can close real-world data gaps and produce operational products that benefit communities, environmental management, and disaster resilience. More broadly, our work supports ongoing efforts in the open geospatial community to improve global terrain characterisation by fusing diverse Earth observation assets, making high-quality elevation data accessible to all.


Full Paper (PDF): fossg4-2026-academic-track/question_uploads/FOSS4G_XCai_V1_mulfjlu.pdf

Name and affiliation of all authors, including yourself. Please use the following format, allowing one line per author: "full name - affiliation;":

Xiandong Cai
- Geospatial Research Institute, University of Canterbury, Christchurch, New Zealand
- School of Mathematics and Statistics, University of Canterbury, Christchurch, New Zealand
Matthew Wilson
- Geospatial Research Institute, University of Canterbury, Christchurch, New Zealand
- School of Earth and Environment, University of Canterbury, Christchurch, New Zealand

Indicate what is (are) the open source project(s) essential in your talk:

PyTorch, TorchGeo, QGIS, GDAL

Give indication of resources (video, web pages, papers, etc.) to read in advance, that will help get up to speed on topics.:

X. Tian, J. Shan, Comprehensive evaluation of the ICESat-2 ATL08 terrain product, IEEE Transactions on Geoscience and Remote Sensing 59 (10) (2021) 8195–8209.

X. Cai, M. D. Wilson, JSPSR: Joint spatial propagation super-resolution networks for enhancement of bare-earth digital elevation models from global data, Remote Sensing 17 (21) (2025) 3591.

Z. Liu, H. Hu, Y. Lin, Z. Yao, Z. Xie, Y. Wei, J. Ning, Y. Cao, Z. Zhang, L. Dong, et al., Swin Transformer V2: Scaling up capacity and resolution, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 12009–12019.

S. Liu, S. De Mello, J. Gu, G. Zhong, M.-H. Yang, J. Kautz, Learning affinity via spatial propagation networks, in: I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett (Eds.), Advances in Neural Information Processing Systems, Vol. 30, Curran Associates, Inc., 2017.

X. Zhu, W. Su, L. Lu, B. Li, X. Wang, J. Dai, Deformable DETR: Deformable transformers for end-to-end object detection, arXiv preprint arXiv:2010.04159 (2020).

I make my conference contribution available under the CC BY 4.0 license. The conference contribution comprises the abstract, the text contribution for the conference proceedings, the presentation materials as well as the video recording and live transmission of the presentation: