Data & Research

Data on Land and Water meets Data Research

Sterle G, Perdrial JN, Li L, Adler T, Underwood K, Rizzo D, Wen H, Addor N, Newman A, Harpold A (2024) CAMELS-Chem: Stream Water Chemistry and Attributes to Facilitate Large Sample Studies. Hydrology and Earth System Sciences. https://doi.org/10.5194/hess-2022-81.

In land and water research, a dynamic cycle emerges between exploring data, curating datasets, and doing data science research. Large datasets are transforming catchment sciences, but multi-use datasets that help us diagnose the Earth's surface from head to toe are still sparse. CAMELS-Chem fills an important gap by combining existing data on catchment attributes (Addor et al. 2017) with stream water chemistry and atmospheric deposition data. It includes over 500 US catchments, 50 catchment attributes, 18 common stream water chemistry constituents, and deposition data across the contiguous United States (CONUS). Multi-use datasets of this kind are the basis for many pattern-process iterations in interdisciplinary work.

Find the full dataset on Hydroshare:
https://www.hydroshare.org/resource/841f5e85085c423f889ac809c1bed4ac/.

Ijaz Ul Haq, Byung Suk Lee, Donna M. Rizzo, J.N. Perdrial (2023). An Automated Machine Learning Approach for Detecting Anomalous Peak Patterns in Time Series Data from a Research Watershed in the Northeastern United States Critical Zone. Machine Learning with Applications. https://doi.org/10.48550/arXiv.2309.07992.

One important part of the cycle is doing research on the data itself. I'm not a data scientist, but I love working with the data team to identify issues that might have a data science solution. One example is the work led by CZNet Big Data PhD student Ijaz Ul Haq, who uses an automated machine learning system, tailored for hydrologists, to spot anomalies in sensor time series so the data can be cleaned. As domain scientists we know these issues well, but fixing them "by hand" is a time sink and leads to long delays between data collection and interpretation. Ijaz created labeled datasets by adding synthetic peak patterns to generated time series data. The system then uses automated hyperparameter optimization to select the best of five model types (Temporal Convolutional Network, InceptionTime, MiniRocket, Residual Network, and Long Short-Term Memory) based on user preferences for accuracy and computational cost.
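To make the idea of labeling by synthetic peaks concrete, here is a minimal Python sketch. It is an illustration only, not the code from the paper: the sine-plus-noise baseline, the triangular peak shape, and the names `make_series` and `inject_peaks` are all assumptions made for this example.

```python
import math
import random


def make_series(n, seed=0):
    """Generate a smooth baseline (sine wave plus small noise).

    This is a hypothetical stand-in for generated sensor data.
    """
    rng = random.Random(seed)
    return [math.sin(2 * math.pi * i / 200) + rng.gauss(0, 0.05) for i in range(n)]


def inject_peaks(series, n_peaks=5, width=5, height=3.0, seed=1):
    """Add triangular peaks at random positions.

    Returns (new_series, labels), where labels[i] is 1 for every point
    that belongs to an injected peak and 0 otherwise.
    """
    rng = random.Random(seed)
    values = list(series)
    labels = [0] * len(values)
    for _ in range(n_peaks):
        center = rng.randrange(width, len(values) - width)
        for offset in range(-width, width + 1):
            # Largest bump at the center, shrinking toward the edges.
            bump = height * (1 - abs(offset) / (width + 1))
            values[center + offset] += bump
            labels[center + offset] = 1
    return values, labels


series = make_series(1000)
noisy, labels = inject_peaks(series)
print(sum(labels), "of", len(labels), "points labeled anomalous")
```

Because the peaks are placed by the script, the labels come for free, which is what makes this kind of dataset usable for training a supervised anomaly detector.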

Our current performance evaluation with watershed data shows that the system consistently selects the model instance that best meets user preferences, which substantially supports the cleaning of time series data.
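The trade-off between accuracy and computational cost can be pictured with a small Python sketch. This is a simplified assumption about how such a preference might be expressed, not the selection procedure from the paper: the weighted score, the `Candidate` class, the `select_model` function, and all numbers below are hypothetical placeholders, not measured results.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    accuracy: float  # validation score in [0, 1], higher is better
    cost: float      # relative compute cost in [0, 1], lower is better


def select_model(candidates, accuracy_weight):
    """Return the candidate with the best weighted accuracy/cost trade-off.

    accuracy_weight = 1.0 means only accuracy matters;
    accuracy_weight = 0.0 means only low cost matters.
    """
    if not 0.0 <= accuracy_weight <= 1.0:
        raise ValueError("accuracy_weight must be between 0 and 1")

    def score(c):
        return accuracy_weight * c.accuracy - (1 - accuracy_weight) * c.cost

    return max(candidates, key=score)


# Placeholder values for illustration only; these are NOT results from the study.
candidates = [
    Candidate("model_a", accuracy=0.90, cost=0.50),
    Candidate("model_b", accuracy=0.95, cost=0.90),
    Candidate("model_c", accuracy=0.88, cost=0.10),
]

print(select_model(candidates, accuracy_weight=1.0).name)
print(select_model(candidates, accuracy_weight=0.5).name)
```

With these placeholder numbers, a user who cares only about accuracy gets the most accurate but most expensive candidate, while a user who weights cost equally gets the cheapest one.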