Data Package Metadata

Lake chloride concentrations and model predictions for 49,432 lakes in the Midwest and Northeast United States.

General Information
Data Package:
Local Identifier:edi.452.1
Title:Lake chloride concentrations and model predictions for 49,432 lakes in the Midwest and Northeast United States.
Alternate Identifier:DOI PLACE HOLDER
Abstract:

Lakes in the Midwest and Northeast United States are at risk of anthropogenic chloride contamination, but we have little knowledge of the prevalence and spatial distribution of the problem. The majority of salt pollution in north temperate regions stems from road salt application, but other chloride sources include water softeners, synthetic fertilizers, and livestock excretion. Although chloride contamination of lakes is well documented, it is unknown how many lakes are at risk of long-term salinization. We used a quantile regression forest (QRF) to leverage information from 2,773 lakes to predict the chloride concentration of all 49,432 lakes greater than 4 ha in a 17-state area. The QRF used 22 predictor variables, which included lake morphometry characteristics, watershed land use, and distance to the nearest interstate and road. Model predictions had an r² of 0.94 for all chloride observations and 0.87 for predictions of the mean chloride concentration observed at each lake.

Publication Date:2019-12-11

Time Period
Begin:1990-01-01
End:2018-12-13

People and Organizations
Contact:Dugan, Hilary A (University of Wisconsin-Madison)
Creator:Dugan, Hilary A (University of Wisconsin-Madison)
Creator:Skaff, Nicholas K (University of California, Berkeley)
Creator:Doubek, Jonathan P (Lake Superior State University)
Creator:Burke, Samantha M (University of Guelph)
Creator:Krivak-Tetley, Flora E (Dartmouth College)
Creator:Summers, Jamie C 

Data Entities
Data Table Name:chloride prediction model output
Description:chloride prediction model output
Data Table Name:chloride prediction model training data
Description:chloride prediction model training data
Other Name:QRF_script
Description:R code that builds a quantile regression forest model using observational chloride data and predictor variables found in lakeCL_trainingData.csv

Detailed Metadata

Data Entities


Data Table

Data:https://pasta-s.lternet.edu/package/data/eml/edi/452/1/de4b65f9d258bf185165717071d40127
Name:chloride prediction model output
Description:chloride prediction model output
Number of Records:49432
Number of Columns:12

Table Structure
Object Name:lakeCL_predictions.csv
Size:6947662 bytes
Authentication:4f2337aece774c6705b54878b4f01579 Calculated By MD5
Text Format:
Number of Header Lines:1
Record Delimiter:\r\n
Orientation:column
Simple Delimited:
Field Delimiter:,
Quote Character:"
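
The table structure above (MD5 checksum, one header line, comma field delimiter, double-quote character, `NA` missing value code) can be checked and parsed roughly as follows. This is a minimal sketch using only the Python standard library; the function names and the two sample rows are hypothetical and are not taken from the data package.

```python
import csv
import hashlib
import io


def md5_hex(data: bytes) -> str:
    """Return the hexadecimal MD5 digest of raw file bytes."""
    return hashlib.md5(data).hexdigest()


def parse_table(text: str) -> list:
    """Parse comma-delimited text with one header line, mapping 'NA' to None."""
    reader = csv.DictReader(io.StringIO(text, newline=""),
                            delimiter=",", quotechar='"')
    return [{key: (None if value == "NA" else value)
             for key, value in row.items()}
            for row in reader]


# Made-up sample rows (not real data), using the documented \r\n record delimiter.
sample = ('lagoslakeid,gnis_name,prediction_50\r\n'
          '1,"Lake, Example",12.5\r\n'
          '2,NA,3.1\r\n')
rows = parse_table(sample)

# For a downloaded copy of the file, the digest could be compared with the
# value listed under Authentication, for example:
#   with open("lakeCL_predictions.csv", "rb") as f:
#       ok = md5_hex(f.read()) == "4f2337aece774c6705b54878b4f01579"
```

Values are left as strings here; converting the float columns is a separate step once missing values have been mapped to None.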

Table Column Descriptions

Column Name | Definition | Storage Type | Measurement Type | Unit | Number Type | Min | Max
lagoslakeid | Unique lake identifier developed for LAGOS-NE | string | nominal | - | - | - | -
nhdid | Unique lake identifier from National Hydrography dataset | string | nominal | - | - | - | -
gnis_name | Lake Name | string | nominal | - | - | - | -
nhd_lat | Latitude | float | ratio | degree | real | 36 | 48.99
nhd_long | Longitude | float | ratio | degree | real | -97.22 | -67.09
LakeArea | Surface area of the lake | float | ratio | hectare | real | 1.39 | 11.11
WS_Area | Surface area of the watershed | float | ratio | hectare | real | -2.3 | 14.98
MaxDepth | Maximum depth of lake | float | ratio | meter | real | 0.1 | 198.4
state_name | Name of US state that lake is located in (or partially in) | string | nominal | - | - | - | -
prediction_05 | Prediction interval: 0.05 quantile | float | ratio | milligramsPerLiter | real | 0.07 | 140
prediction_50 | Median prediction | float | ratio | milligramsPerLiter | real | 0.08 | 1619
prediction_95 | Prediction interval: 0.95 quantile | float | ratio | milligramsPerLiter | real | 0.41 | 2979

Missing Value Code:NA (not available), listed for 9 of the 12 columns
Accuracy Report:                        
Accuracy Assessment:                        
Coverage:                        
Methods:                        

Data Table

Data:https://pasta-s.lternet.edu/package/data/eml/edi/452/1/a97d3a1e5e3c77fb8e35bce5641e6554
Name:chloride prediction model training data
Description:chloride prediction model training data
Number of Records:29010
Number of Columns:31

Table Structure
Object Name:lakeCL_trainingData.csv
Size:6506675 bytes
Authentication:40cdd1a28412bf9340b78b3eaf410921 Calculated By MD5
Text Format:
Number of Header Lines:1
Record Delimiter:\r\n
Orientation:column
Simple Delimited:
Field Delimiter:,
Quote Character:"

Table Column Descriptions

Column Name | Definition | Storage Type | Measurement Type | Unit or Format | Number Type | Min | Max
lagoslakeid | Unique lake identifier developed for LAGOS-NE | string | nominal | - | - | - | -
nhdid | Unique lake identifier from National Hydrography dataset | string | nominal | - | - | - | -
gnis_name | Lake Name | string | nominal | - | - | - | -
ActivityStartDate | Date of sampling | date | dateTime | Format YYYY-MM-DD | - | - | -
Chloride | Chloride concentration | float | ratio | milligramsPerLiter | real | 0.05 | 2979
nhd_lat | Latitude | float | ratio | degree | real | 36.56 | 48.72
nhd_long | Longitude | float | ratio | degree | real | -96.73 | -68.19
MaxDepth | Maximum depth of lake | float | ratio | meter | real | 0.91 | 198.4
state_name | Name of US state that lake is located in (or partially in) | string | nominal | - | - | - | -
Month | Month of sampling | float | ratio | nominalMonth | natural | (blank) | 12
LakeArea | Surface area of the lake | float | ratio | hectare | real | 4.01 | 66650.33
WS_Area | Surface area of the watershed | float | ratio | hectare | real | 0.54 | 1482384.63
WinterSeverity | Winter severity index obtained from ClearRoads (national research consortium, clearroads.org). Calculated from 2000 to 2010 as 0.50 × (average annual snowfall in inches) + 0.05 × (annual duration of snowfall in hours) + 0.05 × (annual duration of blowing snow in hours) + 0.10 × (annual duration of freezing rain in hours). | float | ratio | dimensionless | real | 7.6 | 168.09
WS_OpenWater | % land use classified as open water in the watershed. Derived from the National Land Cover Dataset (NLCD). | float | ratio | dimensionless | real | (blank) | 96.97
WS_Dev_Open | % land use classified as open space, developed in the watershed. Derived from the National Land Cover Dataset (NLCD). | float | ratio | dimensionless | real | (blank) | 87.53
WS_Dev_Low | % land use classified as developed, low intensity in the watershed. Derived from the National Land Cover Dataset (NLCD). | float | ratio | dimensionless | real | (blank) | 65.58
WS_Dev_Med | % land use classified as developed, medium intensity in the watershed. Derived from the National Land Cover Dataset (NLCD). | float | ratio | dimensionless | real | (blank) | 53.75
WS_Dev_High | % land use classified as developed, high intensity in the watershed. Derived from the National Land Cover Dataset (NLCD). | float | ratio | dimensionless | real | (blank) | 56.92
WS_Barren | % land use classified as barren/transitional in the watershed. Derived from the National Land Cover Dataset (NLCD). | float | ratio | dimensionless | real | (blank) | 29.76
WS_DeciduousForest | % land use classified as deciduous forest in the watershed. Derived from the National Land Cover Dataset (NLCD). | float | ratio | dimensionless | real | (blank) | 97.79
WS_EvergreenForest | % land use classified as evergreen forest in the watershed. Derived from the National Land Cover Dataset (NLCD). | float | ratio | dimensionless | real | (blank) | 80.6
WS_MixedForest | % land use classified as mixed forest in the watershed. Derived from the National Land Cover Dataset (NLCD). | float | ratio | dimensionless | real | (blank) | 73.55
WS_Schrub | % land use classified as shrubland in the watershed. Derived from the National Land Cover Dataset (NLCD). | float | ratio | dimensionless | real | (blank) | 54.62
WS_Grassland | % land use classified as grassland in the watershed. Derived from the National Land Cover Dataset (NLCD). | float | ratio | dimensionless | real | (blank) | 54.98
WS_PastureHay | % land use classified as pasture/hay in the watershed. Derived from the National Land Cover Dataset (NLCD). | float | ratio | dimensionless | real | (blank) | 68.65
WS_Crops | % land use classified as row crops in the watershed. Derived from the National Land Cover Dataset (NLCD). | float | ratio | dimensionless | real | (blank) | 95.04
WS_WoodyWetlands | % land use classified as woody wetlands in the watershed. Derived from the National Land Cover Dataset (NLCD). | float | ratio | dimensionless | real | (blank) | 74.5
WS_EmergentWetlands | % land use classified as herbaceous wetlands in the watershed. Derived from the National Land Cover Dataset (NLCD). | float | ratio | dimensionless | real | (blank) | 54.63
WS_RoadDensity | Road density in the watershed. Derived from the National Land Cover Dataset (NLCD). | float | ratio | metersPerHectare | real | (blank) | 216.72
InterstateDistance | Distance to the nearest interstate | float | ratio | meter | real | 0.01 | 307.26
RoadDistance | Distance to the nearest road | float | ratio | meter | real | 0.01 | 41.79

Missing Value Code:NA (not available), listed for 28 of the 31 columns
Accuracy Report:                                                              
Accuracy Assessment:                                                              
Coverage:                                                              
Methods:                                                              
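
The WinterSeverity definition in the training data table is a weighted sum of four annual quantities. The sketch below restates that formula as a function; the function and parameter names, and the input values in the example, are hypothetical and are not ClearRoads data.

```python
def winter_severity(snowfall_in: float,
                    snowfall_hr: float,
                    blowing_snow_hr: float,
                    freezing_rain_hr: float) -> float:
    """Weighted sum from the WinterSeverity column definition.

    snowfall_in:      average annual snowfall (inches)
    snowfall_hr:      annual duration of snowfall (hours)
    blowing_snow_hr:  annual duration of blowing snow (hours)
    freezing_rain_hr: annual duration of freezing rain (hours)
    """
    return (0.50 * snowfall_in
            + 0.05 * snowfall_hr
            + 0.05 * blowing_snow_hr
            + 0.10 * freezing_rain_hr)


# Made-up inputs: 40 in of snow, 200 h snowfall, 100 h blowing snow,
# 20 h freezing rain. The four terms are 20 + 10 + 5 + 2 = 37.
example_index = winter_severity(40, 200, 100, 20)
```

The snowfall depth term carries ten times the weight of either duration term for snow, so it dominates the index unless the durations are long.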

Non-Categorized Data Resource

Name:QRF_script
Entity Type:unknown
Description:R code which builds a quantile regression forest model using observational chloride data and predictor variables found in lakeCL_trainingData.csv
Physical Structure Description:
Object Name:QRF_script.R
Size:3246 bytes
Authentication:92537372927d61f2419d1f14ff16e364 Calculated By MD5
Externally Defined Format:
Format Name:unknown
Data:https://pasta-s.lternet.edu/package/data/eml/edi/452/1/09b7cf713e7780457de96b4df8dd59f1

Data Package Usage Rights

This information is released under the Creative Commons license - Attribution - CC BY (https://creativecommons.org/licenses/by/4.0/). The consumer of these data ("Data User" herein) is required to cite it appropriately in any publication that results from its use. The Data User should realize that these data may be actively used by others for ongoing research and that coordination may be necessary to prevent duplicate publication. The Data User is urged to contact the authors of these data if any questions about methodology or results occur. Where appropriate, the Data User is encouraged to consider collaboration or co-authorship with the authors. The Data User should realize that misinterpretation of data may occur if used out of context of the original study. While substantial efforts are made to ensure the accuracy of data and associated documentation, complete accuracy of data sets cannot be guaranteed. All data are made available "as is." The Data User should be aware, however, that data are updated periodically and it is the responsibility of the Data User to check for new versions of the data. The data authors and the repository where these data were obtained shall not be liable for damages resulting from any use or misinterpretation of the data. Thank you.

Keywords

By Thesaurus:
(No thesaurus) Chloride, lakes, reservoirs, LAGOS, limnology, road salt, salt, impervious surface, salinization, GLEON, prediction

Methods and Protocols

These methods, instrumentation and/or protocols apply to all data in this dataset:

Methods and protocols used in the collection of this data package
Description:

We leveraged publicly available land use, lake catchment and morphometry, and climate data across a 17-state area of the Midwest and Northeast United States to predict chloride concentrations in 49,432 lakes. Our general methodology included: 1) Acquiring and geoprocessing lake water quality data and site characteristics. 2) Harmonizing training datasets. 3) Building a machine learning model for chloride prediction and calculating model fit. 4) Building a prediction dataset for 49,432 lakes.

Training Dataset

Observational chloride measurements from lakes, reservoirs, and impoundments were downloaded from the US Water Quality Portal (WQP). All results were converted to mg L⁻¹, and only data with a ResultStatusIdentifier of ‘Accepted’ or ‘Final’ were retained. The initial search of 115,389 observations was then filtered to data collected after 1990, chloride concentrations < 10,000 mg L⁻¹, and water samples less than 10 m deep or with depth not listed (where the assumption was an epilimnion measurement). These quality control steps were taken to limit inclusion of historical data that may not represent current conditions, remove naturally saline waterbodies (n = 5, adjacent/connected to the Atlantic Ocean), and remove potentially meromictic lakes (n = 0). Multiple observations collected on the same day were averaged. Lakes with missing watershed information were removed, resulting in 29,675 unique daily observations from 2,773 lakes. Three states (Illinois, Iowa, and Rhode Island) had no chloride data, and three states (Pennsylvania, Connecticut, and New Hampshire) had chloride data from only one lake. The 2,773 lakes represent 5% of the region’s lakes.
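
The quality control filters and same-day averaging described above can be sketched as follows. This is an illustration only, not the authors' code: apart from ResultStatusIdentifier, every field name is hypothetical; "after 1990" is assumed to mean on or after 1990-01-01 (the Time Period begin date); and same-day averaging is assumed to be done per site.

```python
from collections import defaultdict
from datetime import date

ACCEPTED_STATUS = {"Accepted", "Final"}
START_DATE = date(1990, 1, 1)      # assumption: matches Time Period "Begin"
MAX_CHLORIDE_MG_L = 10_000         # drops naturally saline waterbodies
MAX_DEPTH_M = 10                   # deeper samples are excluded


def keep(obs: dict) -> bool:
    """Apply the status, date, concentration, and depth filters."""
    depth = obs["depth_m"]          # None when depth is not listed
    depth_ok = depth is None or depth < MAX_DEPTH_M
    return (obs["ResultStatusIdentifier"] in ACCEPTED_STATUS
            and obs["date"] >= START_DATE
            and obs["chloride_mg_l"] < MAX_CHLORIDE_MG_L
            and depth_ok)


def daily_means(observations: list) -> dict:
    """Average retained observations that share a site and a sampling date."""
    groups = defaultdict(list)
    for obs in filter(keep, observations):
        groups[(obs["site_id"], obs["date"])].append(obs["chloride_mg_l"])
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}


def make(site, day, cl, depth, status):
    """Build one made-up observation record."""
    return {"site_id": site, "date": day, "chloride_mg_l": cl,
            "depth_m": depth, "ResultStatusIdentifier": status}


# Made-up records: only the first two pass every filter.
sample_obs = [
    make("A", date(2001, 7, 1), 10.0, 1.0, "Final"),
    make("A", date(2001, 7, 1), 20.0, None, "Accepted"),
    make("A", date(1985, 7, 1), 5.0, 1.0, "Final"),           # too early
    make("B", date(2005, 1, 1), 15000.0, 1.0, "Final"),       # too saline
    make("B", date(2005, 1, 2), 8.0, 12.0, "Final"),          # too deep
    make("B", date(2005, 1, 3), 8.0, 2.0, "Preliminary"),     # status rejected
]
result = daily_means(sample_obs)
```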

WQP site identification numbers (IDs) from the dataset were linked to the high-resolution National Hydrography Dataset (NHD) by accessing the bounding box information of each NHD shapefile and running a spatial join. The resulting relational table linked each chloride observation to an individual lake through an NHD ID. For every NHD lake ID, geospatial lake data were obtained from the LAGOS-NE database (Soranno et al. 2017), which provides watershed ecological context for all lakes greater than 4 ha in the 17-state area. Additional site characteristics were extracted from GIS line files of US interstates, US primary roads, and gridded winter severity data. Across all predictor variables in the training dataset, minimum nonzero values were >= 0.01. After converting zero values to 0.001, all data were log-transformed.
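
The zero handling before the log transform can be illustrated with a short sketch. The source does not state the logarithm base, so the natural log used here is an assumption, and the function name is hypothetical.

```python
import math

ZERO_REPLACEMENT = 0.001  # value substituted for zeros, per the methods text


def log_transform(values):
    """Replace zeros with 0.001, then take the natural log (assumed base)."""
    return [math.log(ZERO_REPLACEMENT if v == 0 else v) for v in values]


# Made-up predictor values: a zero, the stated minimum of 0.01, and 1.0.
transformed = log_transform([0, 0.01, 1.0])
```

Because 0.001 lies below the stated minimum of 0.01, former zeros remain the smallest values after the transform, so the ordering of the data is preserved.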

Machine Learning Model

A quantile regression forest (QRF) was used to model the relationship between observed chloride concentrations and lake and watershed characteristics. This model was chosen to accommodate a large number of correlated predictor variables, the presence of non-linear responses, and the potential importance of interactions among predictor variables. The QRF was implemented with 1,000 trees using the ranger package in R, with mtry set to 4 (Wright and Ziegler 2017).

To avoid overfitting the QRF to lakes with a greater number of chloride observations, we developed a customized sampling routine that constructed individual trees using the observations from a random subset of the study lakes (95% subset: the ‘in-bag samples’). Each resulting tree was used to make out-of-bag predictions on the remaining observations from the 5% of excluded lakes. All predictions are reported as the median of the terminal node values from each tree, with the corresponding 90% prediction interval calculated from the 0.05 and 0.95 quantiles of the estimated conditional distribution of the response variable (Meinshausen 2006). Median terminal node values were chosen over mean values because they had superior predictive performance on out-of-bag observations.
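
Two ideas from the paragraph above can be sketched in isolation: sampling whole lakes rather than individual observations for each tree, and summarizing a set of terminal node values by its median and 0.05/0.95 quantiles. This is a simplified illustration in Python, not the package's R script and not a reproduction of how ranger computes quantiles; all names and data are hypothetical.

```python
import random
import statistics


def split_lakes(lake_ids, in_bag_fraction=0.95, rng=None):
    """Randomly assign whole lakes to the in-bag or out-of-bag set for one tree."""
    rng = rng or random.Random()
    unique = sorted(set(lake_ids))
    n_in_bag = round(in_bag_fraction * len(unique))
    in_bag = set(rng.sample(unique, n_in_bag))
    return in_bag, set(unique) - in_bag


def prediction_summary(node_values):
    """Median plus empirical 0.05 and 0.95 quantiles of pooled node values."""
    cuts = statistics.quantiles(node_values, n=20, method="inclusive")
    return {"q05": cuts[0],
            "median": statistics.median(node_values),
            "q95": cuts[-1]}


# Made-up example: 100 lakes with 3 observations each.
obs_lake_ids = [lake for lake in range(100) for _ in range(3)]
in_bag, out_of_bag = split_lakes(obs_lake_ids, rng=random.Random(0))

# Observations follow their lake, so no lake is split across the two sets.
in_bag_rows = [i for i, lake in enumerate(obs_lake_ids) if lake in in_bag]

summary = prediction_summary(list(range(1, 102)))
```

Splitting at the lake level means a heavily sampled lake enters or leaves a tree as a unit, which is the stated reason for the customized routine.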

Prediction Dataset

A prediction dataset was constructed for the full LAGOS-NE dataset, which contained 51,102 lakes and reservoirs greater than 4 ha in the 17-state area. After removing lakes with no available land-use data because the watersheds crossed the US/Canada border, 49,432 lakes remained, of which 2,773 were used for training the model. The prediction dataset was identical in structure to the training dataset, but contained no observational chloride data.

References

Meinshausen, N. 2006. Quantile Regression Forests. Journal of Machine Learning Research 7:983–999.

Soranno, P. A., L. C. Bacon, M. Beauchene, K. E. Bednar, E. G. Bissell, C. K. Boudreau, M. G. Boyer, M. T. Bremigan, S. R. Carpenter, J. W. Carr, K. S. Cheruvelil, S. T. Christel, M. Claucherty, S. M. Collins, J. D. Conroy, J. A. Downing, J. Dukett, C. E. Fergus, C. T. Filstrup, C. Funk, M. J. Gonzalez, L. T. Green, C. Gries, J. D. Halfman, S. K. Hamilton, P. C. Hanson, E. N. Henry, E. M. Herron, C. Hockings, J. R. Jackson, K. Jacobson-Hedin, L. L. Janus, W. W. Jones, J. R. Jones, C. M. Keson, K. B. S. King, S. A. Kishbaugh, J.-F. Lapierre, B. Lathrop, J. A. Latimore, Y. Lee, N. R. Lottig, J. A. Lynch, L. J. Matthews, W. H. McDowell, K. E. B. Moore, B. P. Neff, S. J. Nelson, S. K. Oliver, M. L. Pace, D. C. Pierson, A. C. Poisson, A. I. Pollard, D. M. Post, P. O. Reyes, D. O. Rosenberry, K. M. Roy, L. G. Rudstam, O. Sarnelle, N. J. Schuldt, C. E. Scott, N. K. Skaff, N. J. Smith, N. R. Spinelli, J. J. Stachelek, E. H. Stanley, J. L. Stoddard, S. B. Stopyak, C. A. Stow, J. M. Tallant, P.-N. Tan, A. P. Thorpe, M. J. Vanni, T. Wagner, G. Watkins, K. C. Weathers, K. E. Webster, J. D. White, M. K. Wilmes, and S. Yuan. 2017. LAGOS-NE: a multi-scaled geospatial and temporal database of lake ecological context and water quality for thousands of US lakes. GigaScience 6:1–22.

Wright, M. N., and A. Ziegler. 2017. ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R. Journal of Statistical Software 77:1–17.

People and Organizations

Creators:
Individual: Hilary A Dugan
Organization:University of Wisconsin-Madison
Email Address:hdugan@wisc.edu
Id:https://orcid.org/0000-0003-4674-1149
Individual: Nicholas K Skaff
Organization:University of California, Berkeley
Email Address:nskaff@berkeley.edu
Id:https://orcid.org/0000-0002-5929-3966
Individual: Jonathan P Doubek
Organization:Lake Superior State University
Email Address:jdoubek@lssu.edu
Id:https://orcid.org/0000-0003-2651-4715
Individual: Samantha M Burke
Organization:University of Guelph
Email Address:samantha.burke2@gmail.com
Individual: Flora E Krivak-Tetley
Organization:Dartmouth College
Email Address:fkt.gr@dartmouth.edu
Id:https://orcid.org/0000-0003-3521-2460
Individual: Jamie C Summers
Email Address:jamiecsummers@gmail.com
Id:https://orcid.org/0000-0002-7497-2326
Contacts:
Individual: Hilary A Dugan
Organization:University of Wisconsin-Madison
Email Address:hdugan@wisc.edu
Id:https://orcid.org/0000-0003-4674-1149

Temporal, Geographic and Taxonomic Coverage

Temporal, Geographic and/or Taxonomic information that applies to all data in this dataset:

Time Period
Begin:1990-01-01
End:2018-12-13
Geographic Region:
Description:Midwest and Northeast USA
Bounding Coordinates:
Northern:  49.42
Southern:  36.56
Western:  -96.73
Eastern:  -68.19

Project

Parent Project Information:

Title:Collaborative Research: Building Analytical, Synthesis, and Human Network Skills Needed for Macrosystem Science: a Next Generation Graduate Student Training Model Based on GLEON
Personnel:
Individual: Kathleen C Weathers
Id:https://orcid.org/0000-0002-3575-6508
Role:Principal Investigator
Funding: NSF: EF-1137327 and EF-1702991

Maintenance

Maintenance:
Description:completed
Frequency:

Other Metadata

Additional Metadata

additionalMetadata
Custom unit definitions (unitList):
unit id 'metersPerHectare', name 'metersPerHectare'; multiplierToSI, parentSI, unitType, and description are empty
unit id 'nominalMonth', name 'nominalMonth'; multiplierToSI, parentSI, unitType, and description are empty

EDI is a collaboration between the University of New Mexico and the University of Wisconsin – Madison, Center for Limnology.