We used publicly available land use, lake catchment and morphometry,
and climate data across a 17-state area of the Midwest and Northeast
United States to predict chloride concentrations in 49,432 lakes. Our
general methodology included: 1) acquiring and geoprocessing lake
water quality data and site characteristics; 2) harmonizing training
datasets; 3) building a machine learning model for chloride prediction
and calculating model fit; and 4) building a prediction dataset for
49,432 lakes.
Training Dataset
Observational chloride measurements from lakes, reservoirs, and
impoundments were downloaded from the US Water Quality Portal (WQP).
All results were converted to mg L⁻¹, and only records with a
ResultStatusIdentifier of ‘Accepted’ or ‘Final’ were retained. The
initial set of 115,389 observations was then filtered to data
collected after 1990, chloride concentrations < 10,000 mg L⁻¹, and
water samples collected at depths of less than 10 m or with no depth
listed (in which case the sample was assumed to be an epilimnion
measurement). These quality control steps were taken, respectively, to
limit inclusion of historical data that may not represent current
conditions, to remove naturally saline waterbodies (n = 5,
adjacent/connected to the Atlantic Ocean), and to remove potentially
meromictic lakes (n = 0). Multiple observations collected from the
same lake on the same day were averaged. Lakes with missing watershed
information were removed, resulting in 29,675 unique daily
observations from 2,773 lakes. Three states (Illinois, Iowa, and Rhode
Island) had no chloride data, and three states (Pennsylvania,
Connecticut, and New Hampshire) each had chloride data from only one
lake. The 2,773 lakes represent approximately 5% of the region’s lakes.
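The filtering and daily-averaging steps described above can be sketched as follows. This is a minimal illustration and not the code used in the study: the record layout, the function name `filter_and_average`, and the reading of "after 1990" as calendar years later than 1990 are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

ACCEPTED_STATUSES = {"Accepted", "Final"}  # ResultStatusIdentifier values kept
MAX_CHLORIDE_MG_L = 10_000                 # upper concentration limit (mg/L)
MAX_DEPTH_M = 10                           # samples must be shallower than this


def filter_and_average(records):
    """Apply the quality-control filters, then average same-day samples.

    `records` is an iterable of hypothetical tuples:
    (lake_id, sample_date, chloride_mg_l, depth_m or None, status).
    Returns {(lake_id, sample_date): mean chloride in mg/L}.
    """
    by_lake_day = defaultdict(list)
    for lake_id, day, chloride, depth_m, status in records:
        if status not in ACCEPTED_STATUSES:
            continue
        if day.year <= 1990:  # assumption: "after 1990" excludes 1990 itself
            continue
        if chloride >= MAX_CHLORIDE_MG_L:
            continue
        # A missing depth is treated as an epilimnion sample and kept.
        if depth_m is not None and depth_m >= MAX_DEPTH_M:
            continue
        by_lake_day[(lake_id, day)].append(chloride)
    return {key: mean(values) for key, values in by_lake_day.items()}


example = [
    ("lake_A", date(2005, 7, 1), 10.0, 1.0, "Final"),
    ("lake_A", date(2005, 7, 1), 20.0, None, "Accepted"),  # same day: averaged
    ("lake_A", date(1985, 7, 1), 5.0, 1.0, "Final"),       # too old: dropped
    ("lake_B", date(2010, 1, 2), 30.0, 15.0, "Final"),     # too deep: dropped
]
daily = filter_and_average(example)
```

Removal of lakes with missing watershed information is omitted here because it depends on the linked geospatial tables.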
WQP site identification numbers (IDs) from the dataset were linked to
the high-resolution National Hydrography Dataset (NHD) by accessing
the bounding box information of each NHD shapefile and running a
spatial join. The resulting relational table linked each chloride
observation to an individual lake through an NHD ID. For every NHD
lake ID, geospatial lake data were obtained from the LAGOS-NE database
(Soranno et al. 2017), which provides watershed ecological context for
all lakes greater than 4 ha in the 17-state area. Additional site
characteristics were extracted from GIS line files of US interstates
and US primary roads, and from gridded winter severity data. Across
all predictor variables in the training dataset, minimum non-zero
values were >= 0.01. Zero values were therefore converted to 0.001,
after which all data were log-transformed.
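The zero replacement and log transformation can be illustrated with a short sketch. The function name `log_transform` is hypothetical, and the base-10 logarithm is an assumption, since the text does not state which base was used.

```python
import math

ZERO_REPLACEMENT = 0.001  # value substituted for exact zeros, as in the text


def log_transform(values, zero_replacement=ZERO_REPLACEMENT):
    """Replace exact zeros, then log-transform each value.

    Assumption: base-10 logarithm. Negative inputs are rejected because
    they have no real logarithm.
    """
    transformed = []
    for value in values:
        if value < 0:
            raise ValueError(f"cannot log-transform negative value: {value}")
        if value == 0:
            value = zero_replacement
        transformed.append(math.log10(value))
    return transformed


# Hypothetical predictor column containing a zero.
road_density = [0, 0.01, 1, 100]
road_density_log = log_transform(road_density)
```

Because the replacement value sits one order of magnitude below the smallest non-zero value, zeros remain distinguishable from small positive values after transformation.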
Machine Learning Model
A quantile regression forest (QRF) was used to model the relationship
between observed chloride concentrations and lake and watershed
characteristics. This model was chosen to accommodate a large number
of correlated predictor variables, the presence of non-linear
responses, and the potential importance of interactions among
predictor variables. The QRF was implemented with 1,000 trees using
the ranger package in R, with mtry set to 4 (Wright and Ziegler 2017).
To avoid overfitting the QRF to lakes with a greater number of
chloride observations, we developed a customized sampling routine that
constructed individual trees using the observations from a random
subset of the study lakes (95% subset: the ‘in-bag samples’). Each
resulting tree was used to make out-of-bag predictions on the
remaining observations from the excluded 5% of lakes. All predictions
are reported as the median of the terminal node values from each tree,
with the corresponding 90% prediction interval calculated from the 0.05
and 0.95 quantiles of the estimated conditional distribution of the
response variable (Meinshausen 2006). Median terminal node values were
chosen over mean values because they had superior predictive
performance on out-of-bag observations.
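Two ideas from this section can be sketched in a few lines: sampling whole lakes (rather than individual observations) for each tree, and summarizing pooled terminal node values with a median and a 90% prediction interval. This is a simplified illustration, not the ranger-based implementation used in the study. The function names are hypothetical, and the quantiles here are unweighted with linear interpolation, whereas a full quantile regression forest weights the observations.

```python
import random
from statistics import median


def split_lakes(lake_ids, in_bag_fraction=0.95, rng=None):
    """Randomly assign whole lakes to the in-bag or out-of-bag set of one tree.

    All observations from a lake follow the lake, so heavily sampled lakes
    cannot appear on both sides of the split.
    """
    rng = rng or random.Random()
    unique_lakes = sorted(set(lake_ids))
    n_in_bag = round(in_bag_fraction * len(unique_lakes))
    in_bag = set(rng.sample(unique_lakes, n_in_bag))
    out_of_bag = set(unique_lakes) - in_bag
    return in_bag, out_of_bag


def empirical_quantile(values, q):
    """Unweighted quantile of a non-empty list, with linear interpolation."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def summarize_prediction(node_values):
    """Median prediction and 90% interval from pooled terminal node values."""
    return {
        "median": median(node_values),
        "lower": empirical_quantile(node_values, 0.05),
        "upper": empirical_quantile(node_values, 0.95),
    }


# Hypothetical lake IDs with unequal numbers of observations per lake.
observations = [f"lake_{i % 100}" for i in range(1000)]
in_bag, out_of_bag = split_lakes(observations, rng=random.Random(42))
summary = summarize_prediction(list(range(101)))
```

Splitting by lake keeps every out-of-bag prediction a prediction for a lake the tree has never seen, which matches the intended use of the model on unsampled lakes.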
Prediction Dataset
A prediction dataset was constructed for the full LAGOS-NE dataset,
which contained 51,102 lakes and reservoirs greater than 4 ha in the
17-state area. After removing lakes that lacked land-use data because
their watersheds crossed the US/Canada border, 49,432 lakes
remained, of which 2,773 were used for training the model. The
prediction dataset was identical in structure to the training dataset,
but contained no observational chloride data.
References
Meinshausen, N. 2006. Quantile Regression Forests. Journal of Machine
Learning Research 7:983–999.
Soranno, P. A., L. C. Bacon, M. Beauchene, K. E. Bednar, E. G.
Bissell, C. K. Boudreau, M. G. Boyer, M. T. Bremigan, S. R. Carpenter,
J. W. Carr, K. S. Cheruvelil, S. T. Christel, M. Claucherty, S. M.
Collins, J. D. Conroy, J. A. Downing, J. Dukett, C. E. Fergus, C. T.
Filstrup, C. Funk, M. J. Gonzalez, L. T. Green, C. Gries, J. D.
Halfman, S. K. Hamilton, P. C. Hanson, E. N. Henry, E. M. Herron, C.
Hockings, J. R. Jackson, K. Jacobson-Hedin, L. L. Janus, W. W. Jones,
J. R. Jones, C. M. Keson, K. B. S. King, S. A. Kishbaugh, J.-F.
Lapierre, B. Lathrop, J. A. Latimore, Y. Lee, N. R. Lottig, J. A.
Lynch, L. J. Matthews, W. H. McDowell, K. E. B. Moore, B. P. Neff, S.
J. Nelson, S. K. Oliver, M. L. Pace, D. C. Pierson, A. C. Poisson, A.
I. Pollard, D. M. Post, P. O. Reyes, D. O. Rosenberry, K. M. Roy, L.
G. Rudstam, O. Sarnelle, N. J. Schuldt, C. E. Scott, N. K. Skaff, N.
J. Smith, N. R. Spinelli, J. J. Stachelek, E. H. Stanley, J. L.
Stoddard, S. B. Stopyak, C. A. Stow, J. M. Tallant, P.-N. Tan, A. P.
Thorpe, M. J. Vanni, T. Wagner, G. Watkins, K. C. Weathers, K. E.
Webster, J. D. White, M. K. Wilmes, and S. Yuan. 2017. LAGOS-NE: a
multi-scaled geospatial and temporal database of lake ecological
context and water quality for thousands of US lakes. GigaScience
6:1–22.
Wright, M. N., and A. Ziegler. 2017. ranger: A Fast Implementation of
Random Forests for High Dimensional Data in C++ and R. Journal of
Statistical Software:1–17.