Data Package Metadata View Summary

Lake chloride concentrations and model predictions for 49,432 lakes in the Midwest and Northeast United States.

General Information

Data Package:
Local Identifier:	edi.452.1
Title:	Lake chloride concentrations and model predictions for 49,432 lakes in the Midwest and Northeast United States.
Alternate Identifier:	DOI PLACE HOLDER
Abstract:	Lakes in the Midwest and Northeast United States are at risk of anthropogenic chloride contamination, but we have little knowledge of the prevalence and spatial distribution of the problem. The majority of salt pollution in north temperate regions stems from road salt application, but other chloride sources include water softeners, synthetic fertilizers, and livestock excretion. Although chloride contamination of lakes is well documented, it is unknown how many lakes are at risk of long-term salinization. We used a quantile regression forest (QRF) to leverage information from 2,773 lakes to predict the chloride concentration of all 49,432 lakes greater than 4 ha in a 17-state area. The QRF used 22 predictor variables, including lake morphometry characteristics, watershed land use, and distance to the nearest interstate and road. Model predictions had an R² of 0.94 for all chloride observations and 0.87 for predictions of the mean chloride concentration observed at each lake.
Publication Date:	2019-12-11

Time Period

Begin:	1990-01-01
End:	2018-12-13

People and Organizations
Contact:	Dugan, Hilary A (University of Wisconsin-Madison)
Creator:	Dugan, Hilary A (University of Wisconsin-Madison)
Creator:	Skaff, Nicholas K (University of California, Berkeley)
Creator:	Doubek, Jonathan P (Lake Superior State University)
Creator:	Burke, Samantha M (University of Guelph)
Creator:	Krivak-Tetley, Flora E (Dartmouth College)
Creator:	Summers, Jamie C

Data Entities
Data Table Name:	chloride prediction model output
Description:	chloride prediction model output
Data Table Name:	chloride prediction model training data
Description:	chloride prediction model training data
Other Name:	QRF_script
Description:	R code which builds a quantile regression forest model using observational chloride data and predictor variables found in lakeCL_trainingData.csv

Detailed Metadata

Data Entities

Data Table


Data:	https://pasta-s.lternet.edu/package/data/eml/edi/452/1/de4b65f9d258bf185165717071d40127
Name:	chloride prediction model output
Description:	chloride prediction model output
Number of Records:	49432
Number of Columns:	12

Table Structure

Object Name:	lakeCL_predictions.csv
Size:	6947662 bytes
Authentication:	4f2337aece774c6705b54878b4f01579 (calculated by MD5)

Text Format:
Number of Header Lines:
Record Delimiter:	\r\n
Orientation:	column

Simple Delimited:
Field Delimiter:	,
Quote Character:	"
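
The MD5 value listed under Authentication can be used to check a downloaded copy of lakeCL_predictions.csv. The following is a minimal sketch, assuming the file has been saved locally; the helper names md5_of_file and matches_expected are illustrative and not part of the data package.

```python
import hashlib

# Checksum published in the Authentication field for lakeCL_predictions.csv.
EXPECTED_MD5 = "4f2337aece774c6705b54878b4f01579"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_MD5):
    """True when the file's MD5 digest equals the published checksum."""
    return md5_of_file(path) == expected.lower()
```

Calling matches_expected("lakeCL_predictions.csv") on a local copy (hypothetical path) should return True only if the bytes are identical to the archived file.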

Table Column Descriptions

Column Name	Definition	Storage Type	Measurement Type	Unit	Number Type	Min	Max
lagoslakeid	Unique lake identifier developed for LAGOS-NE	string	nominal
nhdid	Unique lake identifier from National Hydrography dataset	string	nominal
gnis_name	Lake Name	string	nominal
nhd_lat	Latitude	float	ratio	degree	real	36	48.99
nhd_long	Longitude	float	ratio	degree	real	-97.22	-67.09
LakeArea	Surface area of the lake	float	ratio	hectare	real	1.39	11.11
WS_Area	Surface area of the watershed	float	ratio	hectare	real	-2.3	14.98
MaxDepth	Maximum depth of lake	float	ratio	meter	real	0.1	198.4
state_name	Name of US state that lake is located in (or partially in)	string	nominal
prediction_05	Prediction interval: 0.05 quantile	float	ratio	milligramsPerLiter	real	0.07	140
prediction_50	Median prediction	float	ratio	milligramsPerLiter	real	0.08	1619
prediction_95	Prediction interval: 0.95 quantile	float	ratio	milligramsPerLiter	real	0.41	2979

Missing Value Code:	NA (not available), listed for 9 of the 12 columns
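
The file layout described above (comma-delimited, double-quote character, \r\n records, NA as the missing value code) can be read with a standard CSV parser. The sketch below is illustrative only: it assumes a single header row (the Number of Header Lines field is empty in this record), uses a made-up two-row sample rather than real data, and the helper names are hypothetical.

```python
import csv
import io

PREDICTION_COLUMNS = ("prediction_05", "prediction_50", "prediction_95")


def parse_value(text):
    """Convert a CSV field to float, mapping the missing value code NA to None."""
    return None if text == "NA" else float(text)


def read_predictions(handle):
    """Yield one dict per lake with the three prediction columns parsed as floats."""
    reader = csv.DictReader(handle, delimiter=",", quotechar='"')
    for row in reader:
        for name in PREDICTION_COLUMNS:
            row[name] = parse_value(row[name])
        yield row


def interval_is_ordered(row):
    """Check 0.05 quantile <= median <= 0.95 quantile; False if any value is missing."""
    values = [row[name] for name in PREDICTION_COLUMNS]
    if any(value is None for value in values):
        return False
    return values[0] <= values[1] <= values[2]


# Hypothetical sample in the documented layout (not real data, subset of columns).
sample = io.StringIO(
    'lagoslakeid,gnis_name,prediction_05,prediction_50,prediction_95\r\n'
    '"1","Example Lake",1.2,4.5,20.1\r\n'
    '"2",NA,NA,3.0,9.9\r\n',
    newline="",
)
rows = list(read_predictions(sample))
```

For the real file, the same functions could be applied to open("lakeCL_predictions.csv", newline=""), where the local path is an assumption.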

Accuracy Report:

Accuracy Assessment:

Coverage:

Methods:

Data Table


Data:	https://pasta-s.lternet.edu/package/data/eml/edi/452/1/a97d3a1e5e3c77fb8e35bce5641e6554
Name:	chloride prediction model training data
Description:	chloride prediction model training data
Number of Records:	29010
Number of Columns:	31

Table Structure

Object Name:	lakeCL_trainingData.csv
Size:	6506675 bytes
Authentication:	40cdd1a28412bf9340b78b3eaf410921 (calculated by MD5)

Text Format:
Number of Header Lines:
Record Delimiter:	\r\n
Orientation:	column

Simple Delimited:
Field Delimiter:	,
Quote Character:	"

Table Column Descriptions

Column Name	Definition	Storage Type	Measurement Type	Unit or Format	Number Type	Min	Max
lagoslakeid	Unique lake identifier developed for LAGOS-NE	string	nominal
nhdid	Unique lake identifier from National Hydrography dataset	string	nominal
gnis_name	Lake Name	string	nominal
ActivityStartDate	Date of sampling	date	dateTime	YYYY-MM-DD
Chloride	Chloride concentration	float	ratio	milligramsPerLiter	real	0.05	2979
nhd_lat	Latitude	float	ratio	degree	real	36.56	48.72
nhd_long	Longitude	float	ratio	degree	real	-96.73	-68.19
MaxDepth	Maximum depth of lake	float	ratio	meter	real	0.91	198.4
state_name	Name of US state that lake is located in (or partially in)	string	nominal
Month	Month of sampling	float	ratio	nominalMonth	natural	1	12
LakeArea	Surface area of the lake	float	ratio	hectare	real	4.01	66650.33
WS_Area	Surface area of the watershed	float	ratio	hectare	real	0.54	1482384.63
WinterSeverity	Winter severity index obtained from ClearRoads (national research consortium, clearroads.org). Calculated from 2000 to 2010 as 0.50 × (average annual snowfall in inches) + 0.05 × (annual duration of snowfall in hours) + 0.05 × (annual duration of blowing snow in hours) + 0.10 × (annual duration of freezing rain in hours).	float	ratio	dimensionless	real	7.6	168.09
WS_OpenWater	% land use classified as open water in the watershed. Derived from the National Land Cover Dataset (NLCD).	float	ratio	dimensionless	real	0	96.97
WS_Dev_Open	% land use classified as developed, open space in the watershed. Derived from the National Land Cover Dataset (NLCD).	float	ratio	dimensionless	real	0	87.53
WS_Dev_Low	% land use classified as developed, low intensity in the watershed. Derived from the National Land Cover Dataset (NLCD).	float	ratio	dimensionless	real	0	65.58
WS_Dev_Med	% land use classified as developed, medium intensity in the watershed. Derived from the National Land Cover Dataset (NLCD).	float	ratio	dimensionless	real	0	53.75
WS_Dev_High	% land use classified as developed, high intensity in the watershed. Derived from the National Land Cover Dataset (NLCD).	float	ratio	dimensionless	real	0	56.92
WS_Barren	% land use classified as barren/transitional in the watershed. Derived from the National Land Cover Dataset (NLCD).	float	ratio	dimensionless	real	0	29.76
WS_DeciduousForest	% land use classified as deciduous forest in the watershed. Derived from the National Land Cover Dataset (NLCD).	float	ratio	dimensionless	real	0	97.79
WS_EvergreenForest	% land use classified as evergreen forest in the watershed. Derived from the National Land Cover Dataset (NLCD).	float	ratio	dimensionless	real	0	80.6
WS_MixedForest	% land use classified as mixed forest in the watershed. Derived from the National Land Cover Dataset (NLCD).	float	ratio	dimensionless	real	0	73.55
WS_Schrub	% land use classified as shrubland in the watershed. Derived from the National Land Cover Dataset (NLCD).	float	ratio	dimensionless	real	0	54.62
WS_Grassland	% land use classified as grassland in the watershed. Derived from the National Land Cover Dataset (NLCD).	float	ratio	dimensionless	real	0	54.98
WS_PastureHay	% land use classified as pasture/hay in the watershed. Derived from the National Land Cover Dataset (NLCD).	float	ratio	dimensionless	real	0	68.65
WS_Crops	% land use classified as row crops in the watershed. Derived from the National Land Cover Dataset (NLCD).	float	ratio	dimensionless	real	0	95.04
WS_WoodyWetlands	% land use classified as woody wetlands in the watershed. Derived from the National Land Cover Dataset (NLCD).	float	ratio	dimensionless	real	0	74.5
WS_EmergentWetlands	% land use classified as herbaceous wetlands in the watershed. Derived from the National Land Cover Dataset (NLCD).	float	ratio	dimensionless	real	0	54.63
WS_RoadDensity	Road density in the watershed. Derived from the National Land Cover Dataset (NLCD).	float	ratio	metersPerHectare	real	0	216.72
InterstateDistance	Distance to the nearest interstate	float	ratio	meter	real	0.01	307.26
RoadDistance	Distance to the nearest road	float	ratio	meter	real	0.01	41.79

Missing Value Code:	NA (not available), listed for 28 of the 31 columns
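
The WinterSeverity definition above is a weighted sum of four quantities. As a worked illustration of that formula only (the function and argument names are hypothetical, and how ClearRoads aggregates the 2000 to 2010 inputs is not described in this record), it could be computed as:

```python
def winter_severity_index(snowfall_in, snowfall_hours, blowing_snow_hours, freezing_rain_hours):
    """Weighted sum from the WinterSeverity column definition.

    snowfall_in: average annual snowfall in inches
    snowfall_hours: annual duration of snowfall in hours
    blowing_snow_hours: annual duration of blowing snow in hours
    freezing_rain_hours: annual duration of freezing rain in hours
    """
    return (
        0.50 * snowfall_in
        + 0.05 * snowfall_hours
        + 0.05 * blowing_snow_hours
        + 0.10 * freezing_rain_hours
    )


# Made-up inputs: 40 in of snow, 300 h snowfall, 100 h blowing snow, 20 h freezing rain.
example_index = winter_severity_index(40, 300, 100, 20)
```

With these made-up inputs the four terms are 20, 15, 5, and 2, so the index is approximately 42, which falls inside the 7.6 to 168.09 range reported for the column.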

Accuracy Report:

Accuracy Assessment:

Coverage:

Methods:

Non-Categorized Data Resource

Name:	QRF_script
Entity Type:	unknown
Description:	R code which builds a quantile regression forest model using observational chloride data and predictor variables found in lakeCL_trainingData.csv

Physical Structure Description:

Object Name:	QRF_script.R
Size:	3246 bytes
Authentication:	92537372927d61f2419d1f14ff16e364 (calculated by MD5)

Externally Defined Format:
Format Name:	unknown

Data:	https://pasta-s.lternet.edu/package/data/eml/edi/452/1/09b7cf713e7780457de96b4df8dd59f1

Data Package Usage Rights

This information is released under the Creative Commons license - Attribution - CC BY (https://creativecommons.org/licenses/by/4.0/). The consumer of these data ("Data User" herein) is required to cite it appropriately in any publication that results from its use. The Data User should realize that these data may be actively used by others for ongoing research and that coordination may be necessary to prevent duplicate publication. The Data User is urged to contact the authors of these data if any questions about methodology or results occur. Where appropriate, the Data User is encouraged to consider collaboration or co-authorship with the authors. The Data User should realize that misinterpretation of data may occur if used out of context of the original study. While substantial efforts are made to ensure the accuracy of data and associated documentation, complete accuracy of data sets cannot be guaranteed. All data are made available "as is." The Data User should be aware, however, that data are updated periodically and it is the responsibility of the Data User to check for new versions of the data. The data authors and the repository where these data were obtained shall not be liable for damages resulting from any use or misinterpretation of the data. Thank you.

Keywords

By Thesaurus:
(No thesaurus)	Chloride, lakes, reservoirs, LAGOS, limnology, road salt, salt, impervious surface, salinization, GLEON, prediction

Methods and Protocols

These methods, instrumentation and/or protocols apply to all data in this dataset:

Methods and protocols used in the collection of this data package

Description:

We leveraged publicly available land use, lake catchment and morphometry, and climate data across a 17-state area of the Midwest and Northeast United States to predict chloride concentrations in 49,432 lakes. Our general methodology included: 1) acquiring and geoprocessing lake water quality data and site characteristics; 2) harmonizing training datasets; 3) building a machine learning model for chloride prediction and calculating model fit; and 4) building a prediction dataset for 49,432 lakes.

Training Dataset

Observational chloride measurements from lakes, reservoirs, and impoundments were downloaded from the US Water Quality Portal (WQP). All results were converted to mg L⁻¹, and only data with a ResultStatusIdentifier of ‘Accepted’ or ‘Final’ were retained. The initial search returned 115,389 observations, which were then filtered to data collected after 1990, chloride concentrations < 10,000 mg L⁻¹, and water samples collected at less than 10 m depth or with no depth listed (assumed to be epilimnion measurements). These quality control steps were taken to limit inclusion of historical data that may not represent current conditions, remove naturally saline waterbodies (n = 5, adjacent or connected to the Atlantic Ocean), and remove potentially meromictic lakes (n = 0). Multiple observations collected on the same day were averaged. Lakes with missing watershed information were removed, resulting in 29,675 unique daily observations from 2,773 lakes. Three states (Illinois, Iowa, and Rhode Island) had no chloride data, and three states (Pennsylvania, Connecticut, and New Hampshire) had chloride data from only one lake. The 2,773 lakes represent 5% of the region’s lakes.
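
The quality control filters and same-day averaging described above can be sketched as follows. This is not the authors' code: apart from ResultStatusIdentifier, every field name (lake_id, date, chloride_mg_l, sample_depth_m) is hypothetical, "after 1990" is read here as on or after the package's begin date of 1990-01-01, and averaging is assumed to be per lake and day.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

ACCEPTED_STATUSES = {"Accepted", "Final"}
EARLIEST_DATE = date(1990, 1, 1)  # assumption: matches the package's Time Period begin date
MAX_CHLORIDE_MG_L = 10_000
MAX_SAMPLE_DEPTH_M = 10


def keep_observation(obs):
    """Apply the status, date, concentration, and depth filters to one record."""
    depth = obs["sample_depth_m"]  # None when no depth was listed
    return (
        obs["ResultStatusIdentifier"] in ACCEPTED_STATUSES
        and obs["date"] >= EARLIEST_DATE
        and obs["chloride_mg_l"] < MAX_CHLORIDE_MG_L
        and (depth is None or depth < MAX_SAMPLE_DEPTH_M)
    )


def daily_means(observations):
    """Average retained observations that share the same lake and sampling date."""
    grouped = defaultdict(list)
    for obs in observations:
        if keep_observation(obs):
            grouped[(obs["lake_id"], obs["date"])].append(obs["chloride_mg_l"])
    return {key: mean(values) for key, values in grouped.items()}
```

The removal of lakes with missing watershed information is a separate step that depends on the LAGOS-NE join and is not shown here.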

WQP site identification numbers (IDs) from the dataset were linked to the high-resolution National Hydrography Dataset (NHD) by accessing the bounding box information of each NHD shapefile and running a spatial join. The resulting relational table linked each chloride observation to an individual lake through an NHD ID. For every NHD lake ID, geospatial lake data were obtained from the LAGOS-NE database (Soranno et al. 2017), which provides watershed ecological context for all lakes greater than 4 ha in the 17-state area. Additional site characteristics were extracted from GIS line files of US interstates, US primary roads, and gridded winter severity data. Across all predictor variables in the training dataset, the smallest non-zero values were >= 0.01, so zero values were converted to 0.001 before all data were log-transformed.
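
The zero handling and log transformation at the end of the paragraph above can be illustrated with a short sketch. The logarithm base is not stated in this record; the natural log is assumed here, which is at least consistent with the LakeArea range of 1.39 to 11.11 in the prediction table matching ln(4.01) and ln(66650.33) from the training table. The function name is hypothetical.

```python
import math

ZERO_REPLACEMENT = 0.001  # value substituted for zeros before taking logs


def log_transform(value):
    """Natural-log transform a non-negative predictor, replacing 0 with 0.001."""
    if value < 0:
        raise ValueError("predictor values are expected to be non-negative")
    return math.log(ZERO_REPLACEMENT if value == 0 else value)


# Training-table LakeArea minimum and maximum (hectares) on the log scale.
lake_area_log_range = (round(log_transform(4.01), 2), round(log_transform(66650.33), 2))
```

Because 0.001 is an order of magnitude below the smallest non-zero value, replaced zeros remain the lowest values after transformation.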

Machine Learning Model

A quantile regression forest (QRF) was used to model the relationship between observed chloride concentrations and lake and watershed characteristics. This model was chosen to accommodate a large number of correlated predictor variables, the presence of non-linear responses, and the potential importance of interactions among predictor variables. The QRF was implemented with 1,000 trees using the ranger package in R, with mtry set to 4 (Wright and Ziegler 2017).

To avoid overfitting the QRF to lakes with a greater number of chloride observations, we developed a customized sampling routine that constructed individual trees using the observations from a random subset of the study lakes (95% subset: the ‘in-bag samples’). Each resulting tree was used to make out-of-bag predictions on the remaining observations from the 5% of excluded lakes. All predictions are reported as the median of the terminal node values from each tree, with the corresponding 90% prediction interval calculated from the 0.05 and 0.95 quantiles of the estimated conditional distribution of the response variable (Meinshausen 2006). Median terminal node values were chosen over mean values because they had superior predictive performance on out-of-bag observations.
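
Two ideas in the paragraph above lend themselves to a small illustration: sampling whole lakes rather than individual observations for each tree, and summarizing a set of terminal node values by its 0.05, 0.5, and 0.95 quantiles. The sketch below is a simplified Python illustration, not the R/ranger code in QRF_script.R. It assumes lakes are drawn without replacement, and it takes plain empirical quantiles of pooled values rather than reproducing how ranger or Meinshausen (2006) estimate the conditional distribution. All function names are hypothetical.

```python
import random
from statistics import quantiles


def split_lakes(lake_ids, in_bag_fraction=0.95, rng=None):
    """Assign whole lakes (not single observations) to in-bag or out-of-bag sets."""
    rng = rng or random.Random()
    unique_lakes = sorted(set(lake_ids))
    n_in_bag = round(in_bag_fraction * len(unique_lakes))
    in_bag = set(rng.sample(unique_lakes, n_in_bag))
    out_of_bag = set(unique_lakes) - in_bag
    return in_bag, out_of_bag


def rows_for_lakes(lake_ids, lakes):
    """Indices of the observations whose lake belongs to the given set."""
    return [i for i, lake in enumerate(lake_ids) if lake in lakes]


def prediction_summary(values):
    """Return the 0.05 quantile, median, and 0.95 quantile of pooled values."""
    # n=20 gives 19 cut points at 5%, 10%, ..., 95%.
    cuts = quantiles(values, n=20, method="inclusive")
    return cuts[0], cuts[9], cuts[18]
```

Repeating split_lakes once per tree gives every lake's observations the same chance of being held out, regardless of how many times that lake was sampled.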

Prediction Dataset

A prediction dataset was constructed for the full LAGOS-NE dataset, which contained 51,102 lakes and reservoirs greater than 4 ha in the 17-state area. After removing lakes with no available land-use data because the watersheds crossed the US/Canada border, 49,432 lakes remained, of which 2,773 were used for training the model. The prediction dataset was identical in structure to the training dataset, but contained no observational chloride data.
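
The lake counts quoted above imply two derived figures, shown here as a short arithmetic check using only numbers from this record (the variable names are illustrative).

```python
LAGOS_NE_LAKES = 51_102    # lakes and reservoirs > 4 ha in the full LAGOS-NE dataset
PREDICTED_LAKES = 49_432   # lakes remaining after removing those without land-use data
TRAINING_LAKES = 2_773     # lakes with chloride observations used for training

# Lakes dropped because their watersheds crossed the US/Canada border.
removed_lakes = LAGOS_NE_LAKES - PREDICTED_LAKES

# Share of the predicted lakes that also contributed training data, in percent.
training_share_pct = round(100 * TRAINING_LAKES / PREDICTED_LAKES, 1)
```

This gives 1,670 removed lakes and a training share of about 5.6% of the predicted lakes.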

References

Meinshausen, N. 2006. Quantile Regression Forests. Journal of Machine Learning Research 7:983–999.

Soranno, P. A., L. C. Bacon, M. Beauchene, K. E. Bednar, E. G. Bissell, C. K. Boudreau, M. G. Boyer, M. T. Bremigan, S. R. Carpenter, J. W. Carr, K. S. Cheruvelil, S. T. Christel, M. Claucherty, S. M. Collins, J. D. Conroy, J. A. Downing, J. Dukett, C. E. Fergus, C. T. Filstrup, C. Funk, M. J. Gonzalez, L. T. Green, C. Gries, J. D. Halfman, S. K. Hamilton, P. C. Hanson, E. N. Henry, E. M. Herron, C. Hockings, J. R. Jackson, K. Jacobson-Hedin, L. L. Janus, W. W. Jones, J. R. Jones, C. M. Keson, K. B. S. King, S. A. Kishbaugh, J.-F. Lapierre, B. Lathrop, J. A. Latimore, Y. Lee, N. R. Lottig, J. A. Lynch, L. J. Matthews, W. H. McDowell, K. E. B. Moore, B. P. Neff, S. J. Nelson, S. K. Oliver, M. L. Pace, D. C. Pierson, A. C. Poisson, A. I. Pollard, D. M. Post, P. O. Reyes, D. O. Rosenberry, K. M. Roy, L. G. Rudstam, O. Sarnelle, N. J. Schuldt, C. E. Scott, N. K. Skaff, N. J. Smith, N. R. Spinelli, J. J. Stachelek, E. H. Stanley, J. L. Stoddard, S. B. Stopyak, C. A. Stow, J. M. Tallant, P.-N. Tan, A. P. Thorpe, M. J. Vanni, T. Wagner, G. Watkins, K. C. Weathers, K. E. Webster, J. D. White, M. K. Wilmes, and S. Yuan. 2017. LAGOS-NE: a multi-scaled geospatial and temporal database of lake ecological context and water quality for thousands of US lakes. GigaScience 6:1–22.

Wright, M. N., and A. Ziegler. 2017. ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R. Journal of Statistical Software 77:1–17.

People and Organizations

Creators:

Individual:	Hilary A Dugan
Organization:	University of Wisconsin-Madison
Email Address:	hdugan@wisc.edu
Id:	https://orcid.org/0000-0003-4674-1149

Individual:	Nicholas K Skaff
Organization:	University of California, Berkeley
Email Address:	nskaff@berkeley.edu
Id:	https://orcid.org/0000-0002-5929-3966

Individual:	Jonathan P Doubek
Organization:	Lake Superior State University
Email Address:	jdoubek@lssu.edu
Id:	https://orcid.org/0000-0003-2651-4715

Individual:	Samantha M Burke
Organization:	University of Guelph
Email Address:	samantha.burke2@gmail.com

Individual:	Flora E Krivak-Tetley
Organization:	Dartmouth College
Email Address:	fkt.gr@dartmouth.edu
Id:	https://orcid.org/0000-0003-3521-2460

Individual:	Jamie C Summers
Email Address:	jamiecsummers@gmail.com
Id:	https://orcid.org/0000-0002-7497-2326

Contacts:

Individual:	Hilary A Dugan
Organization:	University of Wisconsin-Madison
Email Address:	hdugan@wisc.edu
Id:	https://orcid.org/0000-0003-4674-1149

Temporal, Geographic and Taxonomic Coverage

Temporal, Geographic and/or Taxonomic information that applies to all data in this dataset:

Time Period

Begin:	1990-01-01
End:	2018-12-13

Geographic Region:

Description:

Midwest and Northeast USA

Bounding Coordinates:

Northern:	49.42	Southern:	36.56
Western:	-96.73	Eastern:	-68.19

Project

Parent Project Information:

Title:

Collaborative Research: Building Analytical, Synthesis, and Human Network Skills Needed for Macrosystem Science: a Next Generation Graduate Student Training Model Based on GLEON

Personnel:

Individual:	Kathleen C Weathers
Id:	https://orcid.org/0000-0002-3575-6508
Role:	Principal Investigator

Funding:

NSF: EF-1137327 and EF-1702991

Maintenance

Maintenance:

Description:	completed
Frequency:

Other Metadata

Additional Metadata

Custom units defined in the unitList element:

Unit:	metersPerHectare (multiplierToSI, parentSI, unitType, and description are empty)
Unit:	nominalMonth (multiplierToSI, parentSI, unitType, and description are empty)

Copyright 2024 Environmental Data Initiative. This material is based upon work supported by the National Science Foundation under grants #2223103 and #2223104. Any opinions, findings, conclusions, or recommendations expressed in the material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

EDI is a collaboration between the University of New Mexico and the University of Wisconsin – Madison, Center for Limnology.
