Machine Learning Methods for Soil property prediction using VNIR and MIR Spectroscopy

Ternikar, Chirag Rajendra

Abstract

Soil plays a key role in food security, freshwater regulation, biodiversity and climate mitigation, and it intersects directly or indirectly with at least seven of the United Nations Sustainable Development Goals. Among the many measurable soil attributes, two of the most important and widely studied are soil texture and soil organic carbon. Soil texture (particle-size distribution), expressed through the USDA texture triangle, is a physical property that constrains hydraulic behaviour, nutrient retention, erodibility and tillage suitability. Soil organic carbon (SOC), in turn, is an informative chemical indicator of soil fertility and an essential component of the global carbon cycle. Conventional laboratory methods for measuring both properties are accurate yet expensive, destructive and logistically incompatible with the spatial and temporal coverage required by today’s data-driven land management and climate reporting frameworks. Over the past two decades, laboratory visible–near-infrared (VNIR, 400–2,500 nm) and mid-infrared (MIR, 2,500–25,000 nm) spectroscopy have emerged as attractive alternatives. Once calibrated, spectroscopic models predict multiple soil attributes from a single, rapid, non-destructive spectral scan. However, turning these spectra into trustworthy predictions across diverse soil types and climatic zones still faces three challenges: (i) disentangling the relative diagnostic value of the VNIR and MIR domains; (ii) enforcing chemical and mathematical constraints, such as the sum-to-unity of particle-size fractions; and (iii) choosing modelling frameworks and evaluation metrics that remain reliable in the face of spectral non-linearity, data heterogeneity and the curse of dimensionality. This thesis addresses those challenges through a staged inquiry that moves from soil texture to SOC and from traditional chemometrics to modern machine learning and explainable AI models.
The experimental material comprises two large global soil spectral libraries and one local library: the ICRAF–ISRIC texture archive (3,643 global samples measured in both VNIR and MIR), the Open Soil Spectral Library (OSSL; 63,321 VNIR spectra with corresponding SOC values) and a local VNIR dataset of 275 Indian surface soils. Across the four research chapters, a common methodological backbone is maintained: spectral pre-processing, clearly stated compositional or structural constraints, systematic cross-validation, bootstrapped performance measures for robust inference and a suite of classical and recently proposed accuracy diagnostics. The chapters differ in the property under study, the model classes compared and the choice of reliability and robustness evaluations.
Chapter 2 establishes a quantitative benchmark for USDA texture classification using the 3,643 samples of the ICRAF–ISRIC library, for which both VNIR and MIR data are available. Multinomial logistic regression (MNLR) and support vector machines (SVM) are trained on three full spectral matrices (VNIR, MIR and VNIR + MIR) and on their band-reduced counterparts obtained by Partial Information Correlation. Beyond the traditional confusion matrix, overall accuracy and kappa, two new neighbourhood-based metrics are introduced: Added Neighbourhood Accuracy (ANA) and a Correct-Neighbour-Far distribution matrix that explicitly partitions errors into near misses and truly wrong assignments on the texture triangle. Three high-level insights emerge. First, combined VNIR + MIR invariably outperforms either single domain, underscoring the complementary diagnostic value of overtones and combination bands (VNIR) and fundamental vibrations (MIR). Second, information-theoretic band selection reduces dimensionality by ~95 % at the cost of only marginal accuracy losses, a critical gain for lightweight field devices and satellite transfer redundancies. Third, the neighbourhood accuracy framework shows that the residual misclassifications occur predominantly between adjacent texture classes, suggesting that many apparent errors would be inconsequential in hydraulic or agronomic applications. The chapter therefore provides both a performance benchmark and a decision framework: full spectrum vs. sparse; dual domain vs. single domain; logistic vs. kernel classifier.
Chapter 3 turns to the long-neglected unity constraint on particle-size fractions: many studies routinely ignore the sum-to-unity constraint when regressing individual clay, silt and sand fractions. Four regression-assisted strategies that enforce compositional closure are benchmarked against a direct discriminant classifier on 275 Indian soils collected across a diverse soil gradient. The five approaches (A1–A5) evaluated are: A1, independent partial least squares regression (PLSR) models for each fraction followed by normalisation; A2, clay and sand by PLSR with silt as the residual; A3, multi-output PLSR with a common latent variable set; A4, log-ratio PLSR (log-ratio transformation, two latent models, automatic closure); and A5, direct classification of spectra via PLS-DA (bypassing fractions altogether). All regression-assisted variants achieve a broadly similar quantitative fit for clay (R² ~ 0.89) and sand (R² ~ 0.83) but a much lower fit for silt, a reflection of both errors in the evaluation procedure and the inherently weaker spectral signature of silt. Yet when the predicted fractions are classified on the USDA texture triangle, A1–A3 reach an overall accuracy of ~71 % and a kappa of ~0.62, while the direct classifier stagnates at 56 %. Approach A4 is the most economical (two instead of three models) and chemically coherent, matching A1–A3 on accuracy while ensuring closure by construction. Crucially, the neighbourhood analysis echoes Chapter 2: misclassifications fall in adjoining classes, preserving functionality. The chapter concludes that log-ratio-based modelling should be the default strategy, but that direct classifiers remain viable where fraction data are unavailable.
In Chapter 4, the focus widens to soil organic carbon (SOC), a property with a comparatively strong relation to VNIR spectra yet with immense global variability. Nine nearest-neighbour (NN) algorithms, distinguished by the distance metric (Mahalanobis, Euclidean, correlation, cosine, spectral information divergence) and by whether the distance is computed in raw or latent space, are evaluated against a traditional PLSR model on the entire OSSL. Instead of relying solely on the classical performance metrics R² and RMSE, the evaluation also uses the mean absolute error (MAE), error histograms and the less commonly used error correlation matrices to reveal structural redundancy between algorithms. An optimised PLS distance model (o_plsd) emerges as the most accurate (MAE = 1.79 % SOC, compared with 2.36 % for PLSR). Yet o_plsd proves data-intensive: shrinking the calibration set increases its MAE twice as fast as that of PLSR. Error correlation analysis further shows that several NN variants (pcad vs. plsd; o_plsd vs. o_pcad) produce essentially duplicate predictions in latent space, whereas o_plsd and PLSR are structurally independent and therefore provide complementary insights about the predictions. These results provide a data-sensitive decision pathway: prefer o_plsd when abundant, well-stratified calibration data are available; otherwise revert to PLSR. Finally, structural independence should always be verified before stacking models, bearing in mind that the choice of performance metric can mask such structural redundancies between candidate algorithms.
While Chapter 4 resolves the local vs. global debate within classical frameworks, Chapter 5 explores whether modern machine learning architectures can simultaneously raise accuracy and maintain interpretability. A one-dimensional convolutional neural network (1D-CNN), XGBoost and PLSR are trained and interpreted through SHAP (SHapley Additive exPlanations) and Variable Importance in Projection (VIP) scores. All three models clear the “excellent prediction” threshold (R² > 0.90; RPD > 3). The 1D-CNN shows the lowest absolute errors (MAE = 1.2 %; RMSE = 2.52 %), and XGBoost marginally outperforms PLSR with far fewer parameters to tune. SHAP and VIP converge on three spectral zones: 500–700 nm (organic overtones), ~1,100 nm (first overtone C–H) and ~2,200 nm (clay and mineral signatures). These analyses bridge the gap between predictive performance and interpretability, offering a clearer understanding of the underlying mechanisms in SOC prediction. The robustness evaluation reveals that XGBoost and the 1D-CNN are more accurate than PLSR when ample calibration data are available, yet they exhibit steeper declines in performance as calibration size decreases. PLSR shows stable results under reduced calibration data, making it a suitable option when generalizability is prioritized. These outcomes highlight the value of coupling advanced machine learning models with explainable-AI techniques for improved soil monitoring, providing a pathway to enhanced data-driven strategies in sustainable land management.
Collectively, the thesis advances soil spectroscopic practice by (i) integrating neighbourhood-aware accuracy metrics, compositional data treatment and structural independence tests, (ii) presenting decision pathways that align model choice with data volume and deployment constraints, and (iii) demonstrating how explainable AI tools can reconcile high accuracy with physical insight. The thesis lays methodological foundations for transferring laboratory models to forthcoming hyperspectral satellite missions (EnMAP, PRISMA, CHIME) and for deploying sparse-band instruments informed by the identified key wavelengths. Further, it advocates fusing local-similarity and deep-learning strengths in hybrid frameworks that self-adapt to calibration density. In doing so, it prepares the way for a new generation of digital soil platforms capable of delivering physically coherent, spatially extensive and decision-ready information on soil health.
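Approach A4 in the abstract relies on a log-ratio transformation so that predicted clay, silt and sand fractions sum to one by construction. The abstract does not state which log-ratio variant is used; the minimal Python sketch below assumes the additive log-ratio (ALR) with sand as the reference component. The function names, sample fractions and perturbation values are hypothetical illustrations, not the thesis's implementation or data.

```python
import numpy as np

def alr(fractions):
    """Additive log-ratio: (clay, silt, sand) -> two unconstrained coordinates."""
    f = np.asarray(fractions, dtype=float)
    # Assumption: the last column (sand) is the reference component.
    return np.log(f[..., :-1] / f[..., -1:])

def alr_inverse(coords):
    """Map two log-ratio coordinates back to three fractions that sum to 1."""
    c = np.asarray(coords, dtype=float)
    # The reference component has log-ratio 0, i.e. exp(0) = 1.
    expanded = np.concatenate([np.exp(c), np.ones(c.shape[:-1] + (1,))], axis=-1)
    return expanded / expanded.sum(axis=-1, keepdims=True)

# Hypothetical (clay, silt, sand) fractions for three samples; not thesis data.
observed = np.array([[0.20, 0.30, 0.50],
                     [0.45, 0.35, 0.20],
                     [0.10, 0.15, 0.75]])

# Two regression targets per sample instead of three.
coords = alr(observed)

# Stand-in for model error: perturb the coordinates as a regressor might.
perturbed = coords + np.array([[0.05, -0.02], [-0.03, 0.04], [0.01, 0.01]])

# Back-transformed predictions are positive and closed (rows sum to 1).
predicted = alr_inverse(perturbed)
row_sums = predicted.sum(axis=1)
print(row_sums)
```

Because the inverse transform divides each row by its total, any pair of predicted coordinates maps back to strictly positive fractions that sum to one, which is why two models suffice. Fractions of exactly zero would need separate handling before taking logarithms.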

URI

https://etd.iisc.ac.in/handle/2005/7471

Collections

Civil Engineering (CiE) [457]