Overfitting

Overfitting is a discrepancy between training accuracy, represented by prediction error on the training data set, and generalization accuracy, represented by generalization error on an independent data set. It occurs because the optimization of internal weights, structure, and parameters seeks the best possible performance on the training data. Training error consistently decreases with increases in model complexity and will typically drop to zero if complexity is increased sufficiently. This consistent decrease in bias with increased complexity occurs because the added degrees of freedom in the representation allow the model to learn to reproduce the measurement error and noise in the training data set. A training error of zero means that the model is overfit to the training data set and will typically generalize poorly.
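
As an illustration, the following minimal Python sketch (not part of the original text) fits polynomials of increasing degree to a small, noisy data set; the underlying signal, noise level, and polynomial degrees are assumptions chosen only to show training error dropping toward zero while generalization error on held-out data grows.

```python
import numpy as np

rng = np.random.default_rng(42)

# Noisy samples of a simple underlying signal (assumed for illustration only)
x_train = np.linspace(0.0, 1.0, 15)
x_test = np.linspace(0.0, 1.0, 200)
true_f = lambda x: np.sin(2.0 * np.pi * x)
y_train = true_f(x_train) + rng.normal(0.0, 0.25, x_train.size)
y_test = true_f(x_test) + rng.normal(0.0, 0.25, x_test.size)

for degree in (1, 3, 9, 14):
    # More coefficients = more degrees of freedom = more capacity to fit noise
    coeffs = np.polyfit(x_train, y_train, degree)
    rmse_train = np.sqrt(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))
    rmse_test = np.sqrt(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))
    print(f"degree {degree:2d}: training RMSE {rmse_train:.3f}, "
          f"generalization RMSE {rmse_test:.3f}")
```

With 15 training points, the degree-14 polynomial can reproduce the training data almost exactly (training error near zero) while its error on the held-out data grows, which is the behavior described above.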

AI-ML Data Uncertainty Risks and Risk Mitigation Using Data Assimilation in Water Resources Management

Artificial intelligence (AI), including machine learning (ML) and deep learning (DL), learns by training and is restricted by the amount and quality of training data. The primary AI-ML risk in water resources is that uncertain data sets will hinder statistical learning to the point where the trained AI will provide spurious predictions and thus limited decision support. Overfitting occurs when prediction error during training is significantly smaller than the trained model's generalization error for an independent validation set (one that was not part of training). Training, or statistical learning, involves a tradeoff (the bias-variance tradeoff) between prediction error (or bias) and prediction variability (or variance), which is controlled by model complexity. Increased model complexity decreases prediction bias, increases variance, and increases the possibility of overfitting. In contrast, decreased complexity increases prediction error, decreases prediction variability, and reduces tendencies toward overfitting. Better data are the way to make better AI-ML models. With uncertain water resource data sets, there is no quick way to generate improved data. Fortunately, data assimilation (DA) can mitigate data uncertainty risks. The mitigation of uncertain data risks using DA involves a modified bias-variance tradeoff that focuses on increasing solution variability at the expense of increased model bias. Conceptually, the increased variability should represent the amount of data and model uncertainty. Uncertainty propagation then produces an ensemble of models and a range of predictions with the target amount of extra variability.
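
As a conceptual sketch of this uncertainty propagation (the linear model form, noise level, and ensemble size below are assumptions, not values from the study), an ensemble can be built by refitting a model to observations perturbed by the assumed data uncertainty, so that the resulting spread of predictions reflects that uncertainty.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical uncertain observations: inputs x and noisy outputs y
x = np.linspace(0.0, 10.0, 30)
y_obs = 2.0 * x + 5.0 + rng.normal(0.0, 3.0, x.size)
obs_std = 3.0          # assumed standard deviation of data uncertainty
n_members = 200        # ensemble size

# Build an ensemble: each member is fit to observations perturbed by the
# assumed data uncertainty, trading some bias for added solution variability
slopes, intercepts = [], []
for _ in range(n_members):
    y_perturbed = y_obs + rng.normal(0.0, obs_std, y_obs.size)
    slope, intercept = np.polyfit(x, y_perturbed, 1)
    slopes.append(slope)
    intercepts.append(intercept)

# Ensemble of predictions at a new input; the spread is meant to reflect
# the amount of data uncertainty carried through to the prediction
x_new = 12.0
preds = np.array(slopes) * x_new + np.array(intercepts)
print(f"prediction at x={x_new}: mean {preds.mean():.2f}, spread (std) {preds.std():.2f}")
```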

Dynamic Integration of AI-ML Predictions with Process-Based Model Simulations

Data assimilation (DA) is used to integrate artificial intelligence, including machine learning (AI-ML), predictions with process-based model simulations to produce a dynamic operational water balance tool for groundwater management. The management tool is a three-step calculation. In the first step, a traditional process-based water budget model provides forward model predictions of aquifer storage from meteorological observations, estimates of pumping and diversion discharge, and estimates of recharge. The second step is a Kalman filter-based DA approach that generates updated storage volumes by combining forward model predictions with output from a trained AI-ML model, which provides replacement 'measurements' for missing observations. The third, 'correction' step re-simulates the forward model with modified recharge and pumping, adjusted to account for the difference between the Kalman-updated storage and the forward model predicted storage, so that the re-simulation approximates the updated storage volume. Use of modified inputs in the correction provides a mass-conservative water budget framework based on AI-ML predictions. Pumping and recharge values are uncertain and unobserved in the study region and can be adjusted without contradicting measurements.
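
A minimal scalar sketch of the three-step calculation is given below; the water budget function, the numerical values, the error variances, and the even split of the storage residual between recharge and pumping are illustrative assumptions, not the implementation used in the management tool.

```python
import numpy as np

def forward_model(storage_prev, recharge, pumping, diversion):
    """Hypothetical water budget: new storage = prior storage + inflows - outflows."""
    return storage_prev + recharge - pumping - diversion

# --- Step 1: forward simulation with uncertain inputs (all values hypothetical) ---
storage_prev = 1000.0   # prior storage volume
recharge, pumping, diversion = 50.0, 80.0, 10.0
storage_forecast = forward_model(storage_prev, recharge, pumping, diversion)

# --- Step 2: Kalman-style update using an AI-ML 'measurement' of storage ---
storage_aiml = 940.0    # trained AI-ML output standing in for a missing observation
var_forecast = 25.0**2  # assumed forward model (background) error variance
var_aiml = 15.0**2      # assumed AI-ML 'measurement' error variance
gain = var_forecast / (var_forecast + var_aiml)
storage_update = storage_forecast + gain * (storage_aiml - storage_forecast)

# --- Step 3: correction - adjust the unobserved inputs (recharge and pumping)
# so that forward model re-simulation reproduces the updated storage and the
# water budget remains mass conservative ---
residual = storage_update - storage_forecast
recharge_adj = recharge + 0.5 * residual   # split the adjustment between the two
pumping_adj = pumping - 0.5 * residual     # uncertain, unobserved inputs
storage_corrected = forward_model(storage_prev, recharge_adj, pumping_adj, diversion)

print(f"forecast {storage_forecast:.1f}, update {storage_update:.1f}, "
      f"corrected {storage_corrected:.1f}")
```

In this sketch the corrected re-simulation reproduces the Kalman-updated storage exactly, because the input adjustments sum to the storage residual, which is the mass-conservation property described above.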

An Observation Error Model for River Discharge Observed at Gauging Stations

Data assimilation (DA) produces an optimal combination of model simulation results and observed, or measured, values. Ensemble methods are a form of DA that generates multiple equally good, or equally calibrated, models using a description of model and observation uncertainty. Uncertainty is a lack of knowledge. The collection of models that are equally good in the presence of uncertainty is an ensemble of models. An observation error model provides the means to describe the amount of uncertainty in model simulation results and in observed values as part of assimilation. Model-related uncertainty comes from model representation limitations created by differences between what the model represents, or simulates, and what is measured to make an observation. Observation uncertainty comes from observation error. When an observed value is calculated or estimated, rather than measured, additional uncertainty is generated by the estimation procedure.

An observation error model is developed and presented for river discharge observations made at a stream gauging station, where a derived rating curve is used to calculate discharge from the measured water depth. A rating curve is a poor hydrodynamics model. Consequently, large estimation errors are expected for river discharge calculated using a rating curve, which generates correspondingly large amounts of observation uncertainty for assimilation. Uncertainty is propagated through DA to the spread, or variability, of model outcomes provided by the ensemble of models. When assimilating simulation results and data with significant uncertainty, the goal of assimilation is to optimize the bias-variance tradeoff and thus the spread of ensemble outcomes. Optimizing this tradeoff involves limiting the amount of uncertainty as much as possible to make informed decisions while including sufficient uncertainty to avoid overfitting. The risk from overfitting is the production of biased model outcomes and spurious decision support.
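
A minimal sketch of an observation error model of this kind is shown below; the power-law rating curve, the depth measurement error, and the 20% relative rating error are assumed values used only to illustrate how perturbed discharge 'observations' and their spread could be generated for assimilation.

```python
import numpy as np

rng = np.random.default_rng(3)

def rating_curve(depth, c=35.0, h0=0.2, b=1.6):
    """Hypothetical power-law rating curve: discharge from gauged water depth."""
    return c * np.maximum(depth - h0, 0.0) ** b

# Observed water depth at the gauge (hypothetical value, metres)
depth_obs = 1.8
q_rated = rating_curve(depth_obs)

# Observation error model: combine measurement error in depth with a larger
# relative error term for the rating-curve estimation step (assumed magnitudes)
depth_err_std = 0.02      # assumed depth measurement error, metres
rating_rel_err = 0.20     # assumed 20% relative error from the rating curve

n_members = 500
depth_pert = depth_obs + rng.normal(0.0, depth_err_std, n_members)
q_members = rating_curve(depth_pert) * (1.0 + rng.normal(0.0, rating_rel_err, n_members))

print(f"rated discharge {q_rated:.1f}, "
      f"perturbed-observation mean {q_members.mean():.1f}, std {q_members.std():.1f}")
```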

AI-ML Accounting for Uncertain Water Resources Data Sets

Artificial intelligence (AI) is the effort to automate intellectual tasks normally performed by humans, and it includes machine learning (ML) and deep learning (DL) approaches. AI-ML is based on statistical learning, which tries to learn statistics-based rules for data analyses from known examples of inputs and corresponding outcomes. Data sets that are noisy, include significant uncertainty, and have extreme values hinder statistical learning. ML and DL aquifer recharge predictors are developed to: (1) examine prediction skill when trained using noisy and uncertain data and (2) identify advantages of AI-ML relative to traditional physics- and process-based calculations. Recharge was selected as the learning outcome because it is not observed and is inherently uncertain. A common-sense baseline is developed and implemented to account for uncertainty and noise in AI-ML predictions. The baseline provides a lower goodness-of-fit threshold that identifies when trained AI-ML generates prediction skill and an upper goodness-of-fit threshold above which the AI-ML is learning to reproduce noise and bias in the training data set (and is likely overfitting). Identified advantages for AI-ML (relative to physics- or process-based calculations) are the ability to use dimensionless trends for features and to represent a complex scenario with the same level of effort as for a simple case.
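
One hypothetical construction of such thresholds is sketched below; it is not the baseline actually developed in the study. The lower threshold is the goodness of fit of a simple common-sense predictor, and the upper threshold is the fit achievable by a model that matches the underlying signal exactly, beyond which a predictor must be reproducing noise in the training targets.

```python
import numpy as np

def nse(sim, obs):
    """Nash-Sutcliffe efficiency as the goodness-of-fit measure."""
    obs = np.asarray(obs)
    return 1.0 - np.sum((np.asarray(sim) - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)

rng = np.random.default_rng(11)

# Hypothetical 'true' recharge series, noisy training targets, and AI-ML predictions
true_recharge = 40.0 + 15.0 * np.sin(np.linspace(0.0, 6.0, 120))
noise_std = 8.0                                   # assumed target uncertainty
targets = true_recharge + rng.normal(0.0, noise_std, true_recharge.size)
aiml_pred = true_recharge + rng.normal(0.0, 5.0, true_recharge.size)  # stand-in predictor

# Lower threshold: a common-sense baseline (here, simply the mean of the targets)
baseline = np.full_like(targets, targets.mean())
lower = nse(baseline, targets)

# Upper threshold: the fit obtainable by a model that matches the underlying
# signal exactly; doing better than this implies reproducing noise (overfitting)
upper = nse(true_recharge, targets)

skill = nse(aiml_pred, targets)
print(f"lower threshold {lower:.2f}, AI-ML skill {skill:.2f}, upper threshold {upper:.2f}")
```

A trained predictor whose goodness of fit falls between the two thresholds shows prediction skill without apparently reproducing noise; a fit above the upper threshold is a warning sign of overfitting under these assumptions.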