GeoAI Transformer–LSTM Boosts Maize-Yield Accuracy in Malawi’s Smallholder Fields
11-19, 16:30–16:55 (Pacific/Auckland), WG607

Leave-one-field-out tests on Malawian smallholder plots compare multiband/index linear regression, XGBoost, CNN-LSTM, a frozen ViT, and a ViT-LSTM on Sentinel-2 VT–R1 stacks. The ViT-LSTM delivers the best accuracy (RMSE 0.022 t ha⁻¹) but runs 2.5× slower than the CNN-LSTM.


  1. Introduction

Maize supplies almost 60 % of Malawi’s caloric intake, so mid-season yield forecasts are pivotal for food-security planning (FAO, 2015). Conventional ground surveys reach farmers only after harvest and sample < 1 % of the 1.8 million smallholdings. Optical Earth observation offers plot-scale coverage, and deep learning now outperforms index-based regressions (Muruganantham et al., 2022). Transformers have recently eclipsed CNNs in US Corn-Belt studies (Lin et al., 2023), yet their benefit for densely inter-cropped African fields is unknown. We therefore benchmark five modelling paradigms, ranging from linear regression to a novel Vision-Transformer–LSTM (ViT-LSTM) hybrid, on a hand-harvested dataset from Zomba District.

  2. Materials and Methods

Eight rain-fed maize fields (0.2 – 0.9 ha) in Zomba District were GPS-delineated during the 2024/25 season. Grain harvested from ten 10 m × 10 m quadrats per field was dried to 13 % moisture and weighed, yielding plot-level reference values.
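The quadrat-to-yield bookkeeping can be sketched as follows — a minimal illustration with our own function and parameter names, not code from the released pipeline. It adjusts the fresh grain weight to the 13 % standard moisture basis and scales a 10 m × 10 m quadrat mass to tonnes per hectare.

```python
def yield_t_per_ha(grain_kg, moisture_pct, quadrat_m2=100.0, target_moisture_pct=13.0):
    """Adjust fresh grain mass to the standard 13 % moisture basis,
    then scale the quadrat mass (kg per quadrat_m2) to t ha^-1."""
    # dry-matter correction: mass at target moisture
    adjusted_kg = grain_kg * (100.0 - moisture_pct) / (100.0 - target_moisture_pct)
    # 1 ha = 10 000 m^2; 1 t = 1 000 kg
    return adjusted_kg * (10_000.0 / quadrat_m2) / 1_000.0
```

Averaging this value over the ten quadrats of a field gives its plot-level reference yield.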

Sentinel-2 Level-2A surface reflectance (13 bands) was accessed via Google Earth Engine (GEE). Scenes acquired between 1 December 2024 and 28 February 2025, spanning vegetative tasselling to first silking (VT–R1), were cloud-masked with s2cloudless and aggregated into rolling 10-day medians. All bands (native 10 m / 20 m) were re-projected to EPSG:32736 (UTM 36 S) and bilinearly up-sampled to 1 m to produce inputs compatible with CNN and ViT backbones pre-trained on larger Sentinel-2 image patches (e.g., BigEarthNet). For each field, images were tiled using a 32 × 32 px sliding window with a 16 px stride, augmenting the sample count. Five spectral indices (NDVI, EVI, red-edge NDVI, NDWI, and MSI) were computed from the 1 m bands so that every raster layer shared the same grid.
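The tiling and index steps can be sketched in a few lines of NumPy (an illustration with hypothetical helper names; the released GEE/PyTorch pipeline in the repository is the authoritative version):

```python
import numpy as np

def compute_ndvi(red, nir, eps=1e-6):
    # NDVI = (NIR - Red) / (NIR + Red); eps guards against division by zero
    return (nir - red) / (nir + red + eps)

def tile_field(raster, tile=32, stride=16):
    """Slice a (bands, H, W) field raster into overlapping tile x tile
    patches with the given stride, augmenting the sample count."""
    _, h, w = raster.shape
    tiles = []
    for y in range(0, h - tile + 1, stride):
        for x in range(0, w - tile + 1, stride):
            tiles.append(raster[:, y:y + tile, x:x + tile])
    return np.stack(tiles)
```

With a 16 px stride, a 64 × 64 px field yields nine overlapping 32 × 32 px tiles rather than four disjoint ones.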

2.1. Models
* LR-Indices: ordinary-least-squares regression implemented with PyTorch’s BSD-licensed Linear layer, applied to 10-day means of the five indices and the raw bands.
* XGBoost: gradient-boosted decision trees using the Apache-2.0 XGBoost library on aggregated bands + indices, fully open source and cross-platform.
* CNN-LSTM: frozen ResNet-101 encoder pretrained on the open BigEarthNet v2.0 Sentinel-2 archive, released under the CDLA-Permissive licence; its patch embeddings and mean indices feed a three-layer LSTM, all implemented in PyTorch (BSD licence).
* Frozen ViT: ViT-B16 encoder pretrained on the same BigEarthNet weights, kept frozen; token sequence plus mean indices pass to a linear head; codebase remains entirely PyTorch and open source.
* ViT-LSTM (proposed): shares the open-source ViT encoder above, but pools tokens and mean indices with an LSTM decoder; the full pipeline (data and code) is published under GPL-3.0 in our repository.
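The proposed hybrid can be sketched schematically as below. The pooling, fusion, and layer sizes shown here are our assumptions for illustration; the exact architecture is in the GPL-3.0 repository.

```python
import torch
import torch.nn as nn

class ViTLSTMHead(nn.Module):
    """Schematic of the proposed ViT-LSTM: a frozen ViT encoder (not shown)
    produces per-date token embeddings; pooled tokens plus mean spectral
    indices feed an LSTM over the VT-R1 time steps; a linear head
    regresses yield."""

    def __init__(self, embed_dim=768, n_indices=5, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim + n_indices, hidden,
                            num_layers=3, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, tokens, indices):
        # tokens:  (B, T, N_tokens, embed_dim) from the frozen ViT encoder
        # indices: (B, T, n_indices) mean NDVI/EVI/... per 10-day composite
        pooled = tokens.mean(dim=2)               # mean-pool tokens per date
        x = torch.cat([pooled, indices], dim=-1)  # fuse spectral indices
        out, _ = self.lstm(x)
        return self.head(out[:, -1])              # yield from last time step
```

Because the encoder stays frozen, only the LSTM decoder and linear head are trained.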

All models were implemented in PyTorch 2.3 and trained on an NVIDIA Quadro P1000 GPU. Hyper-parameters were optimised with Optuna. Deep networks trained for 50 epochs using AdamW, cosine-annealed learning rates, batch = 1, FP16 mixed precision, and early-stopping (patience = 10). A leave-one-field-out (LOFO) scheme ensured each field served once as the unseen test set. Performance was assessed with RMSE and MAE. Exact paired-permutation tests compared fold-wise RMSEs, and average inference time per tile was computed for every fold.
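With eight LOFO folds the paired permutation test is exact and cheap, since all 2⁸ = 256 sign flips of the fold-wise RMSE differences can be enumerated. A minimal sketch (our own illustration, not the released evaluation code):

```python
from itertools import product

def paired_permutation_pvalue(rmse_a, rmse_b):
    """Exact two-sided paired permutation (sign-flip) test on fold-wise
    RMSE differences; enumerates every sign assignment, so no Monte
    Carlo error."""
    diffs = [a - b for a, b in zip(rmse_a, rmse_b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    for signs in product([1, -1], repeat=len(diffs)):
        total += 1
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed:
            extreme += 1
    return extreme / total
```

The smallest attainable p-value with eight folds is 2/256 ≈ 0.008, which is why fold-wise comparisons at p ≤ 0.02 are meaningful here.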

All code, configuration files, and anonymised data are released under GPL-3.0 at https://github.com/jahnical/yield-pred-models-comp, enabling full replication of the workflow.

  3. Results
    * Accuracy: ViT-LSTM achieved the lowest cross-validated RMSE 0.022 t ha⁻¹ and MAE 0.019 t ha⁻¹. CNN-LSTM followed at RMSE 0.088 t ha⁻¹; frozen ViT, 0.219 t ha⁻¹. XGBoost and LR-Indices exceeded 0.22 t ha⁻¹.
    * Significance: Both recurrent models (CNN-LSTM and ViT-LSTM) significantly outperformed the non-recurrent baselines (p ≤ 0.02). The gap between ViT-LSTM and CNN-LSTM was also significant (p = 0.046).
    * Speed: LR-Indices and XGBoost predicted in < 0.02 ms per tile. CNN-LSTM needed 14 ms, whereas ViT-LSTM required 36 ms, 2.5 times slower.

  4. Discussion

Explicit spatio-temporal learning is critical because Malawi’s smallholder plots are small, irregular, and often inter-cropped; spectral signatures therefore vary sharply over just a few metres and change quickly as plants develop. Recurrent layers already capture the crop’s phenological curve, but the self-attention blocks in the transformer let the model weigh non-contiguous pixels and dates, teasing out subtle edge effects and mixed-crop patterns that a CNN-LSTM misses (Liu et al., 2023). That extra context cuts RMSE by ≈ 0.07 t ha⁻¹ (about 75 %), yet self-attention is quadratic in sequence length, so inference jumps from 14 ms to 36 ms per 32 × 32 px tile at 1 m resolution, a 2.5× latency cost.

Data is streamed through Google Earth Engine, a free (though not open-source) cloud platform, while QGIS for vector editing, Rasterio/xarray for raster I/O, and PyTorch/XGBoost for modelling are fully open-source. This workflow demonstrates how combining free cloud access with FOSS4G tools can deliver high-resolution, scalable yield mapping in resource-constrained settings.

  5. Conclusion

We present the first open-source, plot-scale benchmark that pits classical machine-learning models, a CNN-LSTM, and a transformer–recurrent hybrid against one another on Malawian maize yields. The ViT-LSTM attains state-of-the-art accuracy (RMSE 0.022 t ha⁻¹), a 75 % improvement over the CNN-LSTM, at a 2.5-fold latency cost. All code and data are freely released, inviting the FOSS4G community to replicate, critique, and extend the workflow to other crops, sensors, and regions.

References (abridged)
1. Chen T & Guestrin C (2016) XGBoost: A scalable tree-boosting system. KDD.
2. Dosovitskiy A et al. (2021) An image is worth 16×16 words: Transformers for image recognition at scale. ICLR.
3. FAO (2015) National Investment Profile: Water for Agriculture and Energy – Malawi.
4. Gaddy D, Li K & Eisner J (2022) Exact paired-permutation testing for structured test statistics. NAACL-HLT.
5. Krause J, Smith L & Brown M (2019) A CNN-RNN framework for crop-yield prediction. Frontiers in Plant Science 10:1750.
6. Li C et al. (2022) Maize yield estimation in intercropped smallholder fields using satellite data in Southern Malawi. Remote Sensing 14(10):2458.
7. Lin F et al. (2023) MMST-ViT: Climate-change-aware crop-yield prediction via multi-modal spatial-temporal Vision Transformer. ICCV.
8. Liu F, Jiang X & Wu Z (2023) Attention mechanism-combined LSTM for grain yield prediction in China using multi-source satellite imagery. Sustainability 15(12):9210.
9. Muruganantham P et al. (2022) Systematic review on crop-yield prediction with deep learning and remote sensing. Remote Sensing 14(9):1990.
10. Sumbul G et al. (2021) BigEarthNet-MM: A large-scale multi-modal benchmark archive for remote sensing. ISPRS Journal of Photogrammetry and Remote Sensing 182:49–62.
11. Zupanc A et al. (2023) s2cloudless: An open-source cloud mask for Sentinel-2. Earth Science Informatics 16:399–415.

Kondwani Godwin Munthali is a Senior Lecturer in Computer Science and Head of the Department of Computer Science at Chancellor College, University of Malawi. He has also previously served as Deputy Head and as programme coordinator for the MSc in Informatics. Kondwani holds a PhD in Geo-Environmental Science from the University of Tsukuba, Japan; an MSc in GIS from the University of Leeds, England; and a BSc majoring in Computer Science with a minor in Electronics from the University of Malawi – Chancellor College.
Kondwani has over ten years’ teaching experience in institutions of higher learning, covering both undergraduate and postgraduate computer science and geoinformation science students, and supervises students at both levels. He has demonstrated skills in curriculum development and review through involvement in developing undergraduate programmes for the University of Malawi (BSc in Computer Science, Information Systems, and Computer Network Engineering) and the Malawi University of Science and Technology (BSc in Geoinformation Science), and postgraduate programmes for the University of Malawi (MSc in Informatics and Bioinformatics, and PhD in Computer Science and Information Systems).
Kondwani also has over ten years’ experience in software development (database and system architectural design) through a number of projects in which he served as lead software developer, database designer, and GIS expert for the Ministry of Agriculture, Irrigation and Water Development (through the World Bank), the Department of Energy Affairs (through UNDP), the Malawi Institute of Tourism (through GIZ), MHealth4Africa, the Ministry of Education (through FEDOMA), the Department of E-Government, ASI, and the National AIDS Commission. His research interests are in geospatial computing, especially geospatial database design and modelling. Through this assignment, he is keen to develop geospatial capacity in Malawi using open-source software, drawing on his accumulated experience with Ubuntu, MySQL, PostgreSQL, PostGIS, QGIS Desktop, QGIS Server, Qt, GeoServer, and GeoNode.
Kondwani is also vastly experienced in training-needs assessment, curriculum development, and IT capacity development. He has previously served as lead facilitator in training for the NAMIS, MHealth4Africa, Ministry of Education (through FEDOMA), Department of E-Government, and ESCOM staff ICT capacity-building projects.

Mathews Jere is a Malawian computer scientist and an MSc candidate in Informatics at the University of Malawi (UNIMA). His research blends open-source GeoAI, satellite imagery, and computer vision.