AI–ML Data Uncertainty Risks and Risk Mitigation Using Data Assimilation in Water Resources Management

Artificial intelligence (AI), including machine learning (ML) and deep learning (DL), learns by training and is restricted by the amount and quality of training data. The primary AI–ML risk in water resources is that uncertain data sets will hinder statistical learning to the point where the trained AI provides spurious predictions and thus limited decision support. Overfitting occurs when the prediction error during training is significantly smaller than the trained model's generalization error for an independent validation set that was not part of training. Training, or statistical learning, involves a tradeoff (the bias–variance tradeoff) between prediction error (bias) and prediction variability (variance) that is controlled by model complexity. Increased model complexity decreases prediction bias, increases variance, and increases the possibility of overfitting. In contrast, decreased complexity increases prediction bias, decreases prediction variability, and reduces the tendency toward overfitting. Better data are the way to make better AI–ML models, but with uncertain water resource data sets there is no quick way to generate improved data. Fortunately, data assimilation (DA) can mitigate data uncertainty risks. Mitigation of uncertain data risks using DA involves a modified bias–variance tradeoff that increases solution variability at the expense of increased model bias. Conceptually, the increased variability should represent the amount of data and model uncertainty. Uncertainty propagation then produces an ensemble of models and a range of predictions with the targeted amount of extra variability.
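The bias–variance tradeoff and overfitting diagnosis described above can be sketched numerically. The following is a minimal illustration, not taken from the paper: synthetic noisy "observations" (a hypothetical sinusoidal signal with Gaussian noise) are fit with polynomials of increasing degree, and training-set error is compared against error on an interleaved, independent validation set. Training error shrinks as complexity grows, while the gap between validation and training error signals overfitting.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Synthetic truth and noisy observations (hypothetical signal, not real data).
x = np.linspace(0.0, 1.0, 40)
truth = np.sin(2.0 * np.pi * x)
obs = truth + rng.normal(0.0, 0.25, size=x.size)

# Split into training points and an independent validation set.
train = np.arange(x.size) % 2 == 0
valid = ~train

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

# Polynomial degree is the stand-in for model complexity.
results = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x[train], obs[train], degree)
    results[degree] = (
        rmse(np.polyval(coeffs, x[train]), obs[train]),
        rmse(np.polyval(coeffs, x[valid]), obs[valid]),
    )
    print(f"degree={degree} train RMSE={results[degree][0]:.3f} "
          f"valid RMSE={results[degree][1]:.3f}")
```

With the highest-degree fit, training RMSE is smallest but validation RMSE exceeds it, which is the overfitting signature the text describes.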

Bias–Variance Tradeoff Comparison between AI–ML and Data Assimilation (DA)

Water Resources’ AI–ML Data Uncertainty Risk and Mitigation Using Data Assimilation

Abstract: Artificial intelligence (AI), including machine learning (ML) and deep learning (DL), learns by training and is restricted by the amount and quality of training data. Training involves a tradeoff between prediction bias and variance that is controlled by model complexity. Increased model complexity decreases prediction bias, increases variance, and increases the possibility of overfitting. Overfitting occurs when the training prediction error is significantly smaller than the trained model's prediction error for an independent validation set. Uncertain data generate risks for AI–ML because they increase overfitting and limit generalization ability. The uncertainty-related negative consequence is specious confidence in predictions from overfit models with limited generalization ability, leading to misguided water resource management. Improved data are the way to improve AI–ML models, but with uncertain water resource data sets, like stream discharge, there is no quick way to generate improved data. Data assimilation (DA) mitigates uncertainty risks: it describes data- and model-related uncertainty and propagates uncertainty to results using observation error models. A DA-derived mitigation example is provided that uses a common-sense baseline, derived from an observation error model, to confirm generalization ability and a threshold to identify overfitting. AI–ML models can also be incorporated into DA, either to provide additional observations for assimilation or as a forward model for prediction and inverse-style calibration or training. Mitigating uncertain data risks with DA involves a modified bias–variance tradeoff that increases solution variability at the expense of increased model bias. The increased variability portrays data and model uncertainty, and uncertainty propagation produces an ensemble of models and a range of predictions.
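The abstract's closing idea, propagating observation uncertainty to an ensemble of models and a range of predictions, can be sketched with a simple Monte Carlo perturbed-observations scheme. This is an illustrative sketch under assumed settings (a hypothetical linear discharge-like signal, independent Gaussian observation error with a known standard deviation), not the paper's own DA implementation: each ensemble member refits a deliberately simple (higher-bias, lower-complexity) linear model to a perturbed replicate of the data, and the spread of member predictions gives a range rather than a single value.

```python
import numpy as np

rng = np.random.default_rng(seed=7)

# Hypothetical noisy discharge-like observations along a linear trend.
x = np.linspace(0.0, 1.0, 30)
obs = 2.0 + 1.5 * x + rng.normal(0.0, 0.2, size=x.size)

obs_sigma = 0.2    # observation error model: independent Gaussian noise
n_ensemble = 200   # number of perturbed-observation replicates
x_new = 1.2        # prediction point beyond the observed range

# Propagate observation uncertainty: refit a simple linear model to each
# perturbed replicate, building an ensemble of models and predictions.
preds = np.empty(n_ensemble)
for i in range(n_ensemble):
    perturbed = obs + rng.normal(0.0, obs_sigma, size=obs.size)
    slope, intercept = np.polyfit(x, perturbed, 1)
    preds[i] = slope * x_new + intercept

lo, hi = np.percentile(preds, [2.5, 97.5])
print(f"95% ensemble prediction range at x={x_new}: [{lo:.2f}, {hi:.2f}]")
```

The deliberately low model complexity accepts extra bias, while the perturbation-driven ensemble supplies the extra solution variability that, per the modified bias–variance tradeoff, is meant to portray data uncertainty.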

Conceptual Comparison of Data Importance in AI–ML and Data Assimilation (DA)