AI-ML Accounting for Uncertain Water Resources Data Sets
Artificial intelligence (AI) is the effort to automate intellectual tasks normally performed by humans, and it includes machine learning (ML) and deep learning (DL) approaches. AI-ML is based on statistical learning, which tries to learn statistics-based rules for data analyses from known examples of inputs and corresponding outcomes. Data sets that are noisy, include significant uncertainty, and have extreme values hinder statistical learning. ML and DL aquifer recharge predictors are developed to: (1) examine prediction skill when trained using noisy and uncertain data and (2) identify advantages of AI-ML relative to traditional physics- and process-based calculations. Recharge was selected as the learning outcome because it is not observed and is inherently uncertain. A common-sense baseline is developed and implemented to account for uncertainty and noise in AI-ML predictions. The baseline provides a lower goodness-of-fit threshold that identifies when trained AI-ML generates prediction skill and an upper goodness-of-fit threshold above which the AI-ML is learning to reproduce noise and bias in the training data set (and is likely overfitting). Identified advantages for AI-ML (relative to physics- or process-based calculations) are the ability to use dimensionless trends for features and to represent a complex scenario with the same level of effort as for a simple case.
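The two baseline thresholds can be pictured with a small synthetic example. The sketch below is an illustration under stated assumptions, not the study's actual procedure: the goodness-of-fit metric (Nash-Sutcliffe efficiency), the naive monthly-climatology predictor used for the lower threshold, the known noise level used for the upper threshold, and all names and values are hypothetical choices made for this example.

```python
import math
import random


def nse(observed, predicted):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 matches predicting the mean."""
    mean_obs = sum(observed) / len(observed)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - sse / sst


rng = random.Random(0)
n_years, noise_sd = 20, 0.5  # assumed record length and target noise level

# Hypothetical noise-free monthly signal: a seasonal cycle plus year-to-year variation.
months = [m for _ in range(n_years) for m in range(12)]
signal = [
    2.0 + math.sin(2 * math.pi * m / 12) + 0.6 * math.cos(1.7 * (i // 12))
    for i, m in enumerate(months)
]
# Targets are the signal contaminated with noise, standing in for uncertain recharge estimates.
targets = [s + rng.gauss(0.0, noise_sd) for s in signal]

# First 15 years for training, last 5 years for validation.
split = 12 * 15
train_targets, valid_targets = targets[:split], targets[split:]

# Lower threshold (assumed form): skill of a naive predictor that returns the
# mean training target for each calendar month.
climatology = [sum(train_targets[m::12]) / len(train_targets[m::12]) for m in range(12)]
naive_prediction = [climatology[m] for m in months[split:]]
lower_threshold = nse(valid_targets, naive_prediction)

# Upper threshold (assumed form): skill of the noise-free signal scored against the
# noisy targets. A model that scores higher than this is fitting the noise.
upper_threshold = nse(valid_targets, signal[split:])

print(f"lower threshold: {lower_threshold:.2f}")
print(f"upper threshold: {upper_threshold:.2f}")
```

In this picture, a trained model whose validation score falls below the lower threshold has not beaten a trivial predictor, and one that scores above the upper threshold is probably reproducing noise. With real data the noise-free signal is unknown, so the upper threshold would have to come from an independent estimate of target uncertainty.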
Links
Statistical learning of water budget outcomes accounting for target and feature uncertainty
Abstract: Statistical learning seeks to learn statistics-based rules for data analysis tasks from known examples of inputs, or features, and corresponding outcomes; it includes machine learning (ML) and deep learning (DL) algorithms. Data sets that are noisy, include significant uncertainty, and have extreme values hinder the learning process. In this study, aquifer recharge predictors are developed using four random forest or gradient boosting ML methods and Long Short-Term Memory (LSTM) networks, a DL method, to: (1) examine predictive skill when trained using noisy and uncertain data and (2) identify advantages of statistical learning implementations for prediction of water budget outcomes relative to process-based water budget calculations. Recharge was selected as the learning outcome because it is not observed and is inherently uncertain. The features, or inputs, are precipitation, potential evapotranspiration (PET), and river discharge. They are calculated, or modelled, values rather than direct observations; consequently, they are expected to be noisy and uncertain because of contamination with measurement and model error. A common-sense baseline is developed and implemented to account for uncertainty and noise in outcomes for training and validation. The baseline delineates a lower goodness-of-fit threshold that identifies when trained ML and DL models generate prediction skill and an upper goodness-of-fit threshold above which the models are learning to reproduce noise and bias. For statistical learning regression implementations, features and outcomes need to be transformed to be Gaussian-like. Inherent variability and extreme events in precipitation, discharge, and recharge data sets require power transformation, or at least scaling of logarithms, to enhance predictive skill.
Identified advantages to statistical learning of water budget outcomes are the ability to use dimensionless trends for features and to represent a complex study site with the same level of effort as a simple site.
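The abstract's point about making features and outcomes Gaussian-like can be illustrated with a minimal sketch. It assumes a synthetic, strictly positive, right-skewed series standing in for precipitation, discharge, or recharge, and it uses a Box-Cox transform with a hand-picked exponent as one example of a power transformation. The specific transform, the exponent, and the helper names are assumptions for illustration and are not taken from the study.

```python
import math
import random
import statistics


def skewness(values):
    """Third standardized moment; near 0 for a symmetric, Gaussian-like sample."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)


def box_cox(values, lam):
    """Box-Cox power transform for strictly positive values; lam = 0 is the logarithm."""
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]


def standardize(values):
    """Scale a sample to zero mean and unit variance."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# Synthetic right-skewed series (lognormal), used only as a stand-in for real data.
rng = random.Random(42)
raw = [rng.lognormvariate(0.0, 1.0) for _ in range(5000)]

# "Scaling of logarithms": take the log, then standardize.
scaled_log = standardize(box_cox(raw, 0))
# A power transformation with an assumed exponent of 0.25, then standardize.
scaled_power = standardize(box_cox(raw, 0.25))

skew_raw = skewness(raw)
skew_log = skewness(scaled_log)
skew_power = skewness(scaled_power)

print(f"skewness raw:         {skew_raw:.2f}")
print(f"skewness scaled log:  {skew_log:.2f}")
print(f"skewness power (0.25): {skew_power:.2f}")
```

Both transforms pull the long right tail toward the bulk of the data, which is the property regression learners benefit from. Real precipitation and recharge series often contain zeros, so a small offset or a transform that accepts non-positive values would be needed before applying this sketch to them.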