Mapping Soil Erosion Classes using Remote Sensing Data and Ensemble Models
Soil erosion, the displacement of topsoil by water and wind, poses a significant threat to global land health, with consequences for food security, water quality, climate change, and ecosystem stability. Earth Observation (EO) and remote sensing technologies play a crucial role in monitoring and assessing soil erosion, offering spatial and temporal data for informed decision-making. This paper applied three Machine Learning (ML) models, namely the XGBoost, LightGBM, and CatBoost classifiers, to perform soil erosion classification in the European Union (EU) region. The data used in this study were sourced from Kaggle, a large repository of community-published machine learning models and data, and include several EO datasets: Landsat 7 seasonal Analysis Ready Data (ARD); BioClim v1.2 historical (1981-2010) average climate data using the CHELSA classification system; annual MODIS EVI data; climatic variables (water vapour, monthly snow probability, annual MODIS LST in daytime or night time, annual CHELSA rainfall V2.1); human footprint (Hengl et al., 2023); land cover; landform and landscape parameters (Hengl, 2018); and lithology (Hengl, 2018). The dataset has a total of 3754 sample points and 139 features. A detailed description of the dataset features can be found here.
During Exploratory Data Analysis (EDA), plots of the Landsat bands against the target variable (erosion category) showed that the Near Infrared (NIR), Short-Wave Infrared I (SWIR1), Short-Wave Infrared II (SWIR2), and Thermal bands differentiated between the erosion categories more effectively than the other bands. This insight guided the feature engineering process. As suggested by Puente et al. (2019), vegetation indices could prove effective in predicting soil erosion. Consequently, we computed various vegetation indices, such as the Normalised Difference Water Index (NDWI), Normalised Difference Infrared Index (NDII), and Shortwave Infrared Water Stress Index (SIWSI), and applied the Tasseled Cap Transformation (Brightness, Greenness, and Wetness) to augment the features. To capture textural variations at each pixel location, elevation- and slope-based measures were computed. The Topographic Position Index (TPI) was computed for each position using a 100,000-metre radius, by subtracting the mean elevation of the points within that radius from the elevation of the point itself. Other features computed were the Topographic Wetness Index (TWI), Aspect, LS-Factor, and the Stream Power Index (SPI), which reflects the erosive power of streams. Land Surface Temperature (LST) was derived from the thermal band. As noted by Ghosal (2021), combining LST with temporal data can identify regions vulnerable to soil erosion.
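The two kinds of engineered feature described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the study's code: the function names are hypothetical, the NDII band pair (NIR and SWIR1) is one common formulation that the text does not confirm, the TPI neighbourhood is assumed to include the point itself, and coordinates are assumed to be projected in metres.

```python
import numpy as np


def normalized_difference(a, b):
    """Return (a - b) / (a + b), using 0 where the denominator is 0."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    total = a + b
    return np.divide(a - b, total, out=np.zeros_like(total), where=total != 0)


def topographic_position_index(xy, elevation, radius):
    """Elevation of each point minus the mean elevation within `radius`.

    Assumption: the neighbourhood includes the point itself.
    """
    xy = np.asarray(xy, dtype=float)
    elevation = np.asarray(elevation, dtype=float)
    # Pairwise Euclidean distances between all points, shape (n, n).
    dist = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=-1)
    within = dist <= radius
    # Sum of neighbour elevations divided by the neighbour count.
    neighbourhood_mean = (within * elevation).sum(axis=1) / within.sum(axis=1)
    return elevation - neighbourhood_mean


# Illustrative reflectance values (made up, not from the dataset).
nir = [0.40, 0.30]
swir1 = [0.20, 0.30]
ndii = normalized_difference(nir, swir1)

# Three made-up points on a line; only the first two are within 100 km.
xy = [[0, 0], [50_000, 0], [300_000, 0]]
elevation = [100.0, 300.0, 500.0]
tpi = topographic_position_index(xy, elevation, radius=100_000)
```

The pairwise distance matrix grows quadratically with the number of points, which is manageable for a few thousand samples; a spatial index would be a better fit for larger point sets.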
Model development used Scikit-Learn's Recursive Feature Elimination (RFE) for preliminary feature selection, with the XGBoost model as the estimator. RFE returns “n” features by training the model on all features, ranking them by importance, and removing the least important features until “n” remain. Here, “n” was set to 200. An XGBoost model was then trained on the 200 features, and Scikit-Learn's RandomizedSearchCV was used to optimise its hyperparameters, which improved the F1 score of the XGBoost classifier. Using the XGBoost classifier's feature importance ranking, the top 155 features were selected for the final ensemble model. To obtain a more reliable estimate of model performance, Scikit-Learn's StratifiedKFold was used with n_splits set to 5 and the erosion category as the stratification variable, so that each fold preserved the class proportions of the full dataset. To model the erosion categories, an ensemble voting classifier combined the predictions of three optimised gradient boosting models (XGBoost, LightGBM, CatBoost) using a "soft" voting scheme, in which the predicted class probabilities are averaged across the models. This approach aimed to improve accuracy and reduce overfitting compared to individual models. The ensemble's performance was evaluated using the confusion matrix together with precision, recall, and F1 score. These metrics assess the model's ability to correctly identify positive and negative cases, with a higher F1 score indicating a better balance between precision and recall.
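The selection and ensembling steps can be sketched with Scikit-Learn alone. This is a rough outline under stated assumptions rather than the study's pipeline: synthetic data replaces the erosion dataset, Scikit-Learn tree ensembles stand in for XGBoost, LightGBM, and CatBoost so that no extra packages are needed, the feature counts are scaled down from 200 to 10, and hyperparameter search is omitted.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    ExtraTreesClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.feature_selection import RFE
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the erosion dataset: 4 classes, 20 features.
X, y = make_classification(
    n_samples=400,
    n_features=20,
    n_informative=8,
    n_classes=4,
    n_clusters_per_class=1,
    random_state=0,
)

# Step 1: recursive feature elimination down to n = 10 features,
# dropping the 2 least important features per iteration.
selector = RFE(
    estimator=GradientBoostingClassifier(n_estimators=30, random_state=0),
    n_features_to_select=10,
    step=2,
)
X_selected = selector.fit_transform(X, y)

# Step 2: soft-voting ensemble that averages predicted class probabilities.
# The three estimators are stand-ins for XGBoost, LightGBM, and CatBoost.
ensemble = VotingClassifier(
    estimators=[
        ("gb", GradientBoostingClassifier(n_estimators=30, random_state=0)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("et", ExtraTreesClassifier(n_estimators=50, random_state=0)),
    ],
    voting="soft",
)

# Step 3: 5-fold stratified cross-validation scored with weighted F1.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(ensemble, X_selected, y, cv=cv, scoring="f1_weighted")
print(f"mean weighted F1 across folds: {scores.mean():.3f}")
```

Swapping in the actual boosting libraries should only require replacing the three estimators, since their Scikit-Learn-compatible classifiers expose `predict_proba`, which soft voting relies on.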
The weighted F1 score, precision, and recall all reached 0.86, indicating that the proposed method, which uses various EO data to predict soil erosion categories (No Gully/badland, Gully, Badland, Landslides), performed well. Specifically, No Gully/badland (0.89, 0.91) and Landslides (1.00, 1.00) had higher precision and recall values, which means that the model can correctly identify areas in these categories with few false positives and false negatives. Badland had the lowest recall (0.49), indicating that the model missed roughly half of the samples in this category.
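Weighted scores of this kind are support-weighted averages of the per-class values. The calculation can be sketched from a confusion matrix as follows, assuming rows are true classes and columns are predicted classes; the function name is hypothetical and the matrix is made up for illustration, not taken from the study.

```python
import numpy as np


def weighted_scores(confusion):
    """Support-weighted precision, recall, and F1 from a confusion matrix.

    Assumes rows are true classes and columns are predicted classes.
    """
    cm = np.asarray(confusion, dtype=float)
    tp = np.diag(cm)
    support = cm.sum(axis=1)    # true samples per class
    predicted = cm.sum(axis=0)  # predicted samples per class
    zeros = np.zeros_like(tp)
    precision = np.divide(tp, predicted, out=zeros.copy(), where=predicted > 0)
    recall = np.divide(tp, support, out=zeros.copy(), where=support > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=zeros.copy(), where=denom > 0)
    weights = support / support.sum()
    return {
        "precision": float(precision @ weights),
        "recall": float(recall @ weights),
        "f1": float(f1 @ weights),
    }


# Made-up 3-class confusion matrix for illustration only.
example = [
    [8, 2, 0],
    [1, 6, 3],
    [0, 1, 9],
]
result = weighted_scores(example)
```

One consequence of the weighting is that weighted recall equals overall accuracy, because each class's recall is weighted by its share of the true samples.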
According to the feature importance analysis, the top ten factors influencing soil erosion were Year, Latitude Coordinates, Topographic Wetness Index (TWI), Longitude Coordinates, Maximum Fraction of Absorbed Photosynthetically Active Radiation (FAPAR), Minimum Annual Water Vapour, Mean of Slope, Weighted Difference Vegetation Index (WDVI), Normalised Difference Snow Index (NDSI), and Standard Deviation of Slope. This indicates that topographic factors and vegetation indices were important for predicting soil erosion. Year was the most important feature, which suggests that temporal trends strongly influence the prediction of soil erosion.
In conclusion, this project explored the potential of ensemble learning and EO data for classifying soil erosion, highlighting their promising role in addressing this crucial environmental issue. The proposed framework indicates that topographic indices like the TWI and vegetation indices like the WDVI hold valuable information for predicting soil erosion. Furthermore, band combinations using the near-infrared (NIR), SWIR1, SWIR2, and thermal bands can improve the classification of soil erosion categories. Crucially, EO data such as digital elevation models (DEMs) and Analysis Ready Landsat data serve as the foundation for accurate soil erosion prediction. Incorporating multi-temporal EO data, as proposed here, offers prospects for even more accurate soil erosion classification.