We leveraged publicly available land use, lake catchment and morphometry, and
climate data across a 17-state area of the Midwest and Northeast United States to
predict chloride concentrations in 49,432 lakes. Our general methodology included: 1)
Acquiring and geoprocessing lake water quality data and site characteristics. 2)
Harmonizing training datasets. 3) Building a machine learning model for chloride
prediction and calculating model fit. 4) Building a prediction dataset for 49,432 lakes.
Training Dataset
Observational chloride measurements from lakes, reservoirs, and impoundments were
downloaded from the US Water Quality Portal (WQP). All results were converted to mg L-1,
and only records with a ResultStatusIdentifier of Accepted or Final were retained. The
115,389 observations returned by the initial search were then filtered to data
collected after 1990, chloride concentrations < 10,000 mg L-1, and water samples
collected at depths of less than 10 m or with no depth listed (in which case the
sample was assumed to be an epilimnion measurement). These quality control steps were taken to
limit inclusion of historical data that may not represent current conditions, remove
naturally saline waterbodies (n = 5, adjacent/connected to the Atlantic Ocean), and
remove potentially meromictic lakes (n = 0). Multiple observations collected on the same
day were averaged. Lakes with missing watershed information were removed, resulting in
29,675 unique daily observations from 2,773 lakes. Three states (Illinois, Iowa, and
Rhode Island) had no chloride data, and three states (Pennsylvania, Connecticut, and New
Hampshire) had chloride data from only one lake. The 2,773 lakes represent 5% of the
region’s lakes.
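As an illustration only, the filtering and daily averaging steps described above could be sketched in Python as follows. The record layout and field names (lake_id, date, chloride_mg_l, depth_m, status) are hypothetical and are not WQP column names, and the sketch assumes that "after 1990" includes samples collected during 1990.

```python
from collections import defaultdict
from statistics import mean


def filter_and_average(records):
    """Apply the quality-control filters, then average same-day values per lake.

    Each record is a dict with hypothetical keys: lake_id, date (a
    datetime.date), chloride_mg_l, depth_m (None if not listed), and status.
    """
    daily = defaultdict(list)
    for r in records:
        # Keep only results flagged as Accepted or Final.
        if r["status"] not in ("Accepted", "Final"):
            continue
        # Assumption: "after 1990" is read as 1990 or later.
        if r["date"].year < 1990:
            continue
        # Drop very high concentrations (naturally saline waterbodies).
        if r["chloride_mg_l"] >= 10_000:
            continue
        # Keep samples shallower than 10 m, or with no depth listed.
        depth = r["depth_m"]
        if depth is not None and depth >= 10:
            continue
        daily[(r["lake_id"], r["date"])].append(r["chloride_mg_l"])
    # One averaged value per lake per day.
    return {key: mean(values) for key, values in daily.items()}
```

The function returns one mean concentration per (lake, date) pair, which corresponds to the "unique daily observations" counted in the text.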
WQP site identification numbers (IDs) from the dataset were linked to the
high-resolution National Hydrography Dataset (NHD) by accessing the bounding box
information of each NHD shapefile and running a spatial join. The resulting relational table
linked each chloride observation to an individual lake through an NHD ID. For every NHD
lake ID, geospatial lake data were obtained from the LAGOS-NE database (Soranno et
al. 2017), which provides watershed ecological context for all lakes greater than 4 ha
in the 17-state area. Additional site characteristics were extracted from GIS line files
of US interstates, US primary roads, and gridded winter severity data. Across all
predictor variables in the training dataset, minimum non-zero values were >= 0.01. After
converting zero values to 0.001, all data were log-transformed.
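A minimal sketch of the zero replacement and log transformation is given below. The text does not state the base of the logarithm, so the natural logarithm used here is an assumption, and the function name is illustrative.

```python
import math

ZERO_REPLACEMENT = 0.001  # replacement value stated in the text


def log_transform(values, zero_replacement=ZERO_REPLACEMENT):
    """Replace exact zeros with a small constant, then take the logarithm.

    Assumption: natural logarithm; the base is not given in the text.
    """
    return [math.log(zero_replacement if v == 0 else v) for v in values]
```

Because the replacement value lies below the smallest non-zero predictor value, zeros remain the lowest values after transformation.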
Machine Learning Model
A quantile regression forest (QRF) was used to model the relationship between
observed chloride concentrations and lake and watershed characteristics. This model was
chosen to accommodate a large number of correlated predictor variables, the presence of
non-linear responses, and the potential importance of interactions among predictor
variables. The QRF was implemented with 1,000 trees using the ranger package in R
(Wright and Ziegler 2017), with mtry (the number of candidate predictors considered at
each split) set to 4.
To avoid overfitting the QRF to lakes with a greater number of chloride
observations, we developed a customized sampling routine that constructed individual
trees using the observations from a random 95% subset of the study lakes (the in-bag
samples). Each resulting tree was used to make out-of-bag predictions on the remaining
observations, those from the excluded 5% of lakes. All predictions are reported as the median
of the terminal node values from each tree, with the corresponding 90% prediction
interval calculated from the 0.05 and 0.95 quantiles of the estimated conditional
distribution of the response variable (Meinshausen 2006). Median terminal node values
were chosen over mean values because they had superior predictive performance on
out-of-bag observations.
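The lake-level sampling and the summary of predictions can be sketched as follows. This is a simplified illustration, not the routine used with ranger: the function names are hypothetical, lakes are assumed to be drawn without replacement, and the quantile function works on an unweighted list of values, whereas a quantile regression forest weights the training responses (Meinshausen 2006).

```python
import random


def split_lakes(lake_ids, in_bag_fraction=0.95, rng=None):
    """Assign whole lakes to the in-bag or out-of-bag set for one tree.

    Sampling is done on lake IDs, so every observation from a lake falls
    on the same side of the split regardless of how many it contributes.
    """
    rng = rng or random.Random()
    unique = sorted(set(lake_ids))
    n_in_bag = round(in_bag_fraction * len(unique))
    in_bag = set(rng.sample(unique, n_in_bag))  # without replacement (assumed)
    return in_bag, set(unique) - in_bag


def quantile(values, q):
    """Quantile of a non-empty list using linear interpolation."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def summarize(values):
    """Return the median and the (0.05, 0.95) quantiles of the values."""
    return quantile(values, 0.5), (quantile(values, 0.05), quantile(values, 0.95))
```

Splitting on lake IDs rather than on individual observations is what keeps heavily sampled lakes from dominating any single tree.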
Prediction Dataset
A prediction dataset was constructed for the full LAGOS-NE dataset, which contained
51,102 lakes and reservoirs greater than 4 ha in the 17-state area. After removing lakes
with no available land-use data because the watersheds crossed the US/Canada border,
49,432 lakes remained, of which 2,773 were used for training the model. The prediction
dataset was identical in structure to the training dataset, but contained no
observational chloride data.
Wisconsin Lake Chloride Observations
Independent measurements were taken from Wisconsin lakes in the summer of 2018. Samples were collected from the surface of each lake and analyzed for chloride (and sulfate) simultaneously by ion chromatography with a hydroxide eluent. Chloride was determined on a Dionex model ICS 2100 with an electrochemical suppressor; the detection limit for chloride is approximately 0.01 ppm.
References
Meinshausen, N. 2006. Quantile Regression Forests. Journal of Machine Learning
Research 7:983–999.
Soranno, P. A., L. C. Bacon, M. Beauchene, K. E. Bednar, E. G. Bissell, C. K.
Boudreau, M. G. Boyer, M. T. Bremigan, S. R. Carpenter, J. W. Carr, K. S. Cheruvelil, S.
T. Christel, M. Claucherty, S. M. Collins, J. D. Conroy, J. A. Downing, J. Dukett, C. E.
Fergus, C. T. Filstrup, C. Funk, M. J. Gonzalez, L. T. Green, C. Gries, J. D. Halfman,
S. K. Hamilton, P. C. Hanson, E. N. Henry, E. M. Herron, C. Hockings, J. R. Jackson, K.
Jacobson-Hedin, L. L. Janus, W. W. Jones, J. R. Jones, C. M. Keson, K. B. S. King, S. A.
Kishbaugh, J.-F. Lapierre, B. Lathrop, J. A. Latimore, Y. Lee, N. R. Lottig, J. A.
Lynch, L. J. Matthews, W. H. McDowell, K. E. B. Moore, B. P. Neff, S. J. Nelson, S. K.
Oliver, M. L. Pace, D. C. Pierson, A. C. Poisson, A. I. Pollard, D. M. Post, P. O.
Reyes, D. O. Rosenberry, K. M. Roy, L. G. Rudstam, O. Sarnelle, N. J. Schuldt, C. E.
Scott, N. K. Skaff, N. J. Smith, N. R. Spinelli, J. J. Stachelek, E. H. Stanley, J. L.
Stoddard, S. B. Stopyak, C. A. Stow, J. M. Tallant, P.-N. Tan, A. P. Thorpe, M. J.
Vanni, T. Wagner, G. Watkins, K. C. Weathers, K. E. Webster, J. D. White, M. K. Wilmes,
and S. Yuan. 2017. LAGOS-NE: a multi-scaled geospatial and temporal database of lake
ecological context and water quality for thousands of US lakes. GigaScience 6:1–22.
Wright, M. N., and A. Ziegler. 2017. ranger: A Fast Implementation of Random
Forests for High Dimensional Data in C++ and R. Journal of Statistical Software 77:1–17.