Data Package Metadata View Summary

AquaMatch Chlorophyll a Data from Water Quality Portal: ~1970-2024

General Information

Data Package:

Local Identifier:

edi.1747.1

Title:

AquaMatch Chlorophyll a Data from Water Quality Portal: ~1970-2024

Alternate Identifier:

DOI PLACE HOLDER

Abstract:

This dataset, “AquaMatch Chlorophyll a Data from Water Quality Portal: ~1970-2024”, is a component of AquaSat version 2 (“v2”), a forthcoming update to AquaSat (Ross et al., 2019). The overarching purpose of AquaSat v2 is to emphasize the individual parts of the AquaSat pipeline that make up the matchups between satellite and in situ measurements. As such, we have greatly expanded and improved upon the AquaSat chlorophyll a dataset in two ways. First, we have incorporated additional recent in situ data beyond what was available at the publication of AquaSat. Second, we have created a data quality tiering system to provide end users with more guidance on data usage. This schema has three tiers: restrictive data that are verifiably self-similar across organizations and time periods and can be considered highly reliable; narrowed data that we have good reason to believe are self-similar, but for which we cannot verify full compatibility across data providers; and inclusive data, which are assumed to be reliable and are harmonized to the best of our ability given the information available from the data provider. We have also added flag columns to help users understand complexities of the available depth and field sampling data.

This dataset is a derived data product created using records downloaded from the Water Quality Portal (WQP) spanning January 6, 1970, to June 20, 2024. The WQP is a data warehouse for water-related data measured or observed within the United States and US Territories managed by the Environmental Protection Agency, United States Geological Survey, and the National Water Quality Monitoring Council. The dataset does not contain remote sensing matchups but can be paired with Landsat surface reflectances using the pipeline presented in Ross et al. (2019).

Ross, M. R. V., Topp, S. N., Appling, A. P., Yang, X., Kuhn, C., Butman, D. et al. (2019). AquaSat: A data set to enable remote sensing of water quality for inland waters. Water Resources Research, 55, 10012–10025. https://doi.org/10.1029/2019WR024883

Publication Date:

2024-08-22

For more information:
Visit:	DOI PLACE HOLDER

Time Period

Begin:

1970-01-06

End:

2024-06-20

People and Organizations
Contact:	Brousil, Matthew R (Colorado State University) [ email ]
Contact:	Meyer, Michael F (United States Geological Survey) [ email ]
Creator:	Brousil, Matthew R (Colorado State University)
Creator:	Meyer, Michael F (United States Geological Survey)
Creator:	Willi, Katie (Colorado State University)
Creator:	Steele, B G (Colorado State University)
Creator:	De La Torre, Juan (Colorado State University)
Creator:	Ross, Matthew R.V. (Colorado State University)
Organization:	Radical Open Science Syndicate
Organization:	United States Geological Survey

Data Entities
Data Table Name:	chla_harmonized_final
Description:	A derived data product created using chlorophyll a measurement records downloaded from the Water Quality Portal (WQP). Contains a subset of columns from the WQP and additional columns added through harmonization and aggregation processes after downloading.
Other Name:	chla_workflow
Description:	The R scripts and data used to produce the harmonized dataset
Other Name:	bookdown_documentation
Description:	Documentation written with the {bookdown} R package that provides detailed information on the entire process of downloading and harmonizing the chlorophyll a data.
Other Name:	README
Description:	Provides context for the entities in this project.

Detailed Metadata

Data Entities

Data Table


Data:	https://pasta-s.lternet.edu/package/data/eml/edi/1747/1/c97458622bc2126fa6d6c84e183101cf
Name:	chla_harmonized_final
Description:	A derived data product created using chlorophyll a measurement records downloaded from the Water Quality Portal (WQP). Contains a subset of columns from the WQP and additional columns added through harmonization and aggregation processes after downloading.
Number of Records:	3393022
Number of Columns:	33

Table Structure

Object Name:

chla_harmonized_final.csv

Size:

814963220 byte

Authentication:

b322ce9eaa0d4bc537466615daae21f9 Calculated By MD5

Text Format:

Number of Header Lines:

Record Delimiter:

Orientation:

column

Simple Delimited:

Field Delimiter:	,
Quote Character:	"

Table Column Descriptions

Column Name:

parameter

OrganizationIdentifier

MonitoringLocationIdentifier

MonitoringLocationTypeName

ResolvedMonitoringLocationTypeName

ActivityStartDate

ActivityStartTime.Time

ActivityStartTime.TimeZoneCode

harmonized_tz

harmonized_local_time

harmonized_utc

ActivityStartDateTime

harmonized_top_depth_value

harmonized_top_depth_unit

harmonized_bottom_depth_value

harmonized_bottom_depth_unit

harmonized_discrete_depth_value

harmonized_discrete_depth_unit

depth_flag

mdl_flag

approx_flag

greater_flag

tier

field_flag

misc_flag

subgroup_id

harmonized_row_count

harmonized_units

harmonized_value

harmonized_value_cv

lat

lon

datum

Definition:

Specifies the type of environmental measurement being recorded.

From the Water Quality Portal User Guide: A designator used to uniquely identify a unique business establishment within a context.

From the Water Quality Portal User Guide: A designator used to describe the unique name, number, or code assigned to identify the monitoring location.

From the Water Quality Portal User Guide: The descriptive name for a type of monitoring location.

A resolved version of the MonitoringLocationTypeName column.

From the Water Quality Portal User Guide: The calendar date on which the field activity is started.

From the Water Quality Portal User Guide: The time of day that is reported when the field activity began, based on a 24-hour timescale.

From the Water Quality Portal User Guide: The time zone for which the time of day is reported. Any of the longitudinal divisions of the earth's surface in which a standard time is kept.

Local time zone in GMT offset format determined either using the ActivityStartTime.TimeZoneCode or through spatial means when the ActivityStartTime.TimeZoneCode was NA.

The calendar date and time on which the field activity is started. Created using the ActivityStartDate and ActivityStartTime.Time columns from WQP using the time zone specified in harmonized_tz. Based on a 24-hour timescale.

The Coordinated Universal Time (UTC) version of the calendar date and time on which the field activity is started. Created using the ActivityStartDate and ActivityStartTime.Time columns from WQP. Based on a 24-hour timescale.

A UTC version of the calendar date and time on which the field activity is started. Created using the ActivityStartDate and ActivityStartTime.Time columns from WQP. Based on a 24-hour timescale. Differs from harmonized_utc because it was generated by the dataRetrieval R package and uses slightly different logic. Differences in some values occur because 1) ActivityStartDateTime is NA for ActivityStartTime.TimeZoneCode values of NA, "AST", "ADT", "GST", "IDLE"; or 2) harmonized_utc handles "00:00:00" values of ActivityStartTime.Time the same as NAs whereas ActivityStartDateTime does not.

A harmonized version of the WQP column, ActivityTopDepthHeightMeasure.MeasureValue. From the Water Quality Portal User Guide: A measurement of the upper vertical location of a vertical location range (measured from a reference point) at which an activity occurred.

The unit for the harmonized_top_depth_value measurement.

A harmonized version of the WQP column, ActivityBottomDepthHeightMeasure.MeasureValue. From the Water Quality Portal User Guide: A measurement of the lower vertical location of a vertical location range (measured from a reference point) at which an activity occurred.

The unit for the harmonized_bottom_depth_value measurement.

A harmonized combination of the two WQP columns, ActivityDepthHeightMeasure.MeasureValue and ResultDepthHeightMeasure.MeasureValue. From the Water Quality Portal User Guide: ActivityDepthHeightMeasure.MeasureValue: A measurement of the vertical location (measured from a reference point) at which an activity occurred. ResultDepthHeightMeasure.MeasureValue: A measurement of the vertical location (measured from a reference point) at which a result occurred. Note: Only in STORET.

The unit for the harmonized_discrete_depth_value measurement.

Depth flags are assigned using the harmonized depth columns that result from the depth column harmonization process. They indicate the type of depth data available for a given record.

Indicates whether the harmonized_value was generated using the method detection limit (MDL) approach (e.g. replaced with a random number between 0 and 0.5 * MDL).

Indicates that the harmonized_value was corrected based on language indicating that the original WQP record was an approximated value (e.g. “approx 5.0”). Note that occasionally approximate language will be used in a record but not changed or flagged. This occurs when the language is used in a comment-related column and not the result column itself, meaning that there is a usable numeric value provided (and thus doesn’t need correction).

Indicates that the harmonized_value was corrected because the original WQP record was described as "greater than" some value (e.g. "> 5.0")

Indicates the reliability and accuracy of each analytical method across data providers and throughout time.

Indicates whether the sampling method used agrees with the analytical method.

Included as a flexible flag column in order to note important information that isn’t covered by the tiering and flags defined in other columns. Some parameters, like chlorophyll a, will not use this column at all and will therefore just contain NA values in place of flags. Values and their meaning will differ by parameter and, as a result, flag values will be explained in the documentation for each parameter.

A unique group identifier used to aggregate and summarize harmonized data. We identified groups to aggregate by creating subgroups (subgroup_id) with identical values from the following columns: parameter, OrganizationIdentifier, MonitoringLocationIdentifier, MonitoringLocationTypeName, ResolvedMonitoringLocationTypeName, ActivityStartDate, ActivityStartTime.Time, ActivityStartTime.TimeZoneCode, harmonized_tz, harmonized_local_time, harmonized_utc, ActivityStartDateTime, harmonized_top_depth_value, harmonized_top_depth_unit, harmonized_bottom_depth_value, harmonized_bottom_depth_unit, harmonized_discrete_depth_value, harmonized_discrete_depth_unit, depth_flag, mdl_flag, approx_flag, greater_flag, tier, field_flag, misc_flag, harmonized_units.

The number of records contributing to the harmonized_value and harmonized_value_cv for the current subgroup_id.

Units of measurement for the harmonized_value column.

The mean chlorophyll a measurement following harmonization and aggregation to the current subgroup_id. Note that we set a threshold for realistic chlorophyll a values at 1,000 ug/L and removed records exceeding the threshold.

The coefficient of variation for harmonized chlorophyll a measurements in the current subgroup_id.

Originally LatitudeMeasure. From the Water Quality Portal User Guide: The measure of the angular distance on a meridian north or south of the equator.

Originally LongitudeMeasure. From the Water Quality Portal User Guide: The measure of the angular distance on a meridian east or west of the prime meridian.

Originally HorizontalCoordinateReferenceSystemDatumName. From the Water Quality Portal User Guide: The name that describes the reference datum used in determining latitude and longitude coordinates.

Storage Type:

string

dateTime

string

dateTime

string

float

string

float

string

float

string

float

string

float

string

Measurement Type:

nominal

dateTime

nominal

dateTime

nominal

ratio

nominal

ratio

nominal

ratio

nominal

ratio

nominal

ratio

nominal

Measurement Values Domain:

Allowed Values and Definitions

Enumerated Domain

Code Definition

Code	chlorophyll
Definition	Notes that the environmental measurements in this dataset are for chlorophyll.
Source

Definition

text

Definition

text

Allowed Values and Definitions

Enumerated Domain

Code Definition

Code	Canal Irrigation
Definition	From the EPA WQX Domain Values: Irrigation canals are the main waterways that bring irrigation water from a water source to the areas to be irrigated. They can be lined with concrete, brick, stone, or a flexible membrane to prevent seepage and erosion.
Source

Code Definition

Code	Canal Transport
Definition	From the EPA WQX Domain Values: Canals are human-made channels for water conveyance, or to service water transport vehicles. In most cases, the engineered works will have a series of dams and locks that create areas of low speed current flow. These areas are referred to as slack water levels, often just called levels.
Source

Code Definition

Code	Channelized Stream
Definition	From the EPA WQX Domain Values: The process of straightening or redirecting natural streams in an artificially modified or constructed stream bed. Channelization has been carried out for numerous reasons, most often to drain wetlands, direct water flow for agricultural use, and control flooding. While this process makes a stream more useful for human activities, it tends to interfere with natural river habitats and to destabilize stream banks by destroying riparian vegetation.
Source

Code Definition

Code	Estuary
Definition	From the EPA WQX Domain Values: A partially enclosed coastal body of brackish water with one or more rivers or streams flowing into it, and with a free connection to the open sea. Estuaries form a transition zone between river environments and maritime environments. The sea water entering the estuary is diluted by the fresh water flowing from rivers and streams.
Source

Code Definition

Code	Great Lake
Definition	From the EPA WQX Domain Values: The Great Lakes, also called the Laurentian Great Lakes and the Great Lakes of North America, are a series of interconnected freshwater lakes primarily in the upper mid-east region of North America, on the Canada–United States border, which connect to the Atlantic Ocean through the Saint Lawrence River.
Source

Code Definition

Code	Lake
Definition	From the EPA WQX Domain Values: A lake is an area filled with water, localized in a basin, that is surrounded by land, apart from any river or other outlet that serves to feed or drain the lake.
Source

Code Definition

Code	Reservoir
Definition	From the EPA WQX Domain Values: An enlarged natural or artificial lake, pond or impoundment created using a dam or lock to store water.
Source

Code Definition

Code	River/Stream
Definition	From the EPA WQX Domain Values: A body of water with surface water flowing within the bed and banks of a channel.
Source

Code Definition

Code	River/Stream Intermittent
Definition	From the EPA WQX Domain Values: Normally cease flowing for weeks or months each year.
Source

Code Definition

Code	River/Stream Perennial
Definition	From the EPA WQX Domain Values: A stream or river (channel) that has continuous flow in parts of its stream bed all year round during years of normal rainfall.
Source

Code Definition

Code	Riverine Impoundment
Definition	From the EPA WQX Domain Values: Impoundments (also known as reservoirs) are artificially created standing water bodies, produced by dams on streams or rivers.
Source

Code Definition

Code	Stream
Definition	From the EPA WQX Domain Values: A body of water with surface water flowing within the bed and banks of a channel.
Source

Code Definition

Code	Lake, Reservoir, Impoundment
Definition	From the EPA WQX Domain Values: Lake: A lake is an area filled with water, localized in a basin, that is surrounded by land, apart from any river or other outlet that serves to feed or drain the lake. Reservoir: An enlarged natural or artificial lake, pond or impoundment created using a dam or lock to store water. Riverine Impoundment: Impoundments (also known as reservoirs) are artificially created standing water bodies, produced by dams on streams or rivers.
Source

Code Definition

Code	Canal Drainage
Definition	From the EPA WQX Domain Values: As a channel drainage system it is designed to eliminate the need for further pipework systems to be installed in parallel to the drainage, reducing the environmental impact of production as well as improving water collection.
Source

Code Definition

Code	Pond-Stormwater
Definition	From the EPA WQX Domain Values: Stormwater, also spelled storm water, is water that originates during precipitation events and snow/ice melt.
Source

Code Definition

Code	River/Stream Ephemeral
Definition	From the EPA WQX Domain Values: A stream that flows only briefly during and following a period of rainfall in the immediate locality
Source

Code Definition

Code	Stream: Canal
Definition	From the EPA WQX Domain Values: River/Stream: A body of water with surface water flowing within the bed and banks of a channel.
Source

Allowed Values and Definitions

Enumerated Domain

Code Definition

Code	Lake, Reservoir, Impoundment
Definition	From the EPA WQX Domain Values: Lake: A lake is an area filled with water, localized in a basin, that is surrounded by land, apart from any river or other outlet that serves to feed or drain the lake. Reservoir: An enlarged natural or artificial lake, pond or impoundment created using a dam or lock to store water. Riverine Impoundment: Impoundments (also known as reservoirs) are artificially created standing water bodies, produced by dams on streams or rivers.
Source

Code Definition

Code	Stream
Definition	From the EPA WQX Domain Values: River/Stream: A body of water with surface water flowing within the bed and banks of a channel.
Source

Code Definition

Code	Estuary
Definition	From the EPA WQX Domain Values: A partially enclosed coastal body of brackish water with one or more rivers or streams flowing into it, and with a free connection to the open sea. Estuaries form a transition zone between river environments and maritime environments. The sea water entering the estuary is diluted by the fresh water flowing from rivers and streams.
Source

Format	YYYY-MM-DD
Precision

Format	hh:mm:ss
Precision

Allowed Values and Definitions

Enumerated Domain

Code Definition

Code	ADT
Definition	From the EPA WQX Domain Values: Atlantic Daylight Time
Source

Code Definition

Code	AKDT
Definition	From the EPA WQX Domain Values: Alaska Daylight Time
Source

Code Definition

Code	AKST
Definition	From the EPA WQX Domain Values: Alaska Standard Time
Source

Code Definition

Code	AST
Definition	From the EPA WQX Domain Values: Atlantic Standard Time
Source

Code Definition

Code	CDT
Definition	From the EPA WQX Domain Values: Central Daylight Time
Source

Code Definition

Code	CST
Definition	From the EPA WQX Domain Values: Central Standard Time
Source

Code Definition

Code	EDT
Definition	From the EPA WQX Domain Values: Eastern Daylight Time
Source

Code Definition

Code	EST
Definition	From the EPA WQX Domain Values: Eastern Standard Time
Source

Code Definition

Code	GMT
Definition	From the EPA WQX Domain Values: Greenwich Mean Time
Source

Code Definition

Code	GST
Definition	From the EPA WQX Domain Values: Guam Standard Time Zone (also Chamorro Standard Time)
Source

Code Definition

Code	HAST
Definition	From the EPA WQX Domain Values: Hawaii-Aleutian Standard Time
Source

Code Definition

Code	HST
Definition	Hawaii Standard Time (assumed; not listed in WQX)
Source

Code Definition

Code	IDLE
Definition	Definition unknown. IDLE is not listed in WQX
Source

Code Definition

Code	MDT
Definition	From the EPA WQX Domain Values: Mountain Daylight Time
Source

Code Definition

Code	MST
Definition	From the EPA WQX Domain Values: Mountain Standard Time
Source

Code Definition

Code	PDT
Definition	From the EPA WQX Domain Values: Pacific Daylight Time
Source

Code Definition

Code	PST
Definition	From the EPA WQX Domain Values: Pacific Standard Time
Source

Code Definition

Code	UTC
Definition	From the EPA WQX Domain Values: Coordinated Universal Time
Source

Allowed Values and Definitions

Enumerated Domain

Code Definition

Code	Etc/GMT+10
Definition	Time zone is 10 hours behind GMT
Source

Code Definition

Code	Etc/GMT+11
Definition	Time zone is 11 hours behind GMT
Source

Code Definition

Code	Etc/GMT+3
Definition	Time zone is 3 hours behind GMT
Source

Code Definition

Code	Etc/GMT+4
Definition	Time zone is 4 hours behind GMT
Source

Code Definition

Code	Etc/GMT+5
Definition	Time zone is 5 hours behind GMT
Source

Code Definition

Code	Etc/GMT+6
Definition	Time zone is 6 hours behind GMT
Source

Code Definition

Code	Etc/GMT+7
Definition	Time zone is 7 hours behind GMT
Source

Code Definition

Code	Etc/GMT+8
Definition	Time zone is 8 hours behind GMT
Source

Code Definition

Code	Etc/GMT+9
Definition	Time zone is 9 hours behind GMT
Source

Code Definition

Code	Etc/GMT-10
Definition	Time zone is 10 hours ahead of GMT
Source

Format	YYYY-MM-DD hh:mm:ss
Precision

Format	YYYY-MM-DDThh:mm:ssZ
Precision

Definition

text

Unit	meter
Type	real
Min	-888
Max	42.4

Allowed Values and Definitions

Enumerated Domain

Code Definition

Code	m
Definition	Meter
Source

Unit	meter
Type	real
Min	-888
Max	243

Allowed Values and Definitions

Enumerated Domain

Code Definition

Code	m
Definition	Meter
Source

Unit	meter
Type	real
Min	-9
Max	5105.1

Allowed Values and Definitions

Enumerated Domain

Code Definition

Code	m
Definition	Meter
Source

Allowed Values and Definitions

Enumerated Domain

Code Definition

Code	0
Definition	No depth data provided.
Source

Code Definition

Code	1
Definition	Only discrete depth data provided.
Source

Code Definition

Code	2
Definition	A record with a top and/or bottom depth. Indicates an integrated sample.
Source

Code Definition

Code	3
Definition	Any combination of discrete and top/bottom depths. Sample depths cannot be reconciled with certainty.
Source

Allowed Values and Definitions

Enumerated Domain

Code Definition

Code	0
Definition	No MDL-based adjustment was performed. This flag implies that the sample concentration was at or above the MDL.
Source

Code Definition

Code	1
Definition	MDL correction was performed.
Source

Code Definition

Code	2
Definition	No MDL correction was applied and the concentration is below the MDL.
Source

Allowed Values and Definitions

Enumerated Domain

Code Definition

Code	0
Definition	The original value was not approximated and did not need correction.
Source

Code Definition

Code	1
Definition	The original value required correction because its result was approximated.
Source

Allowed Values and Definitions

Enumerated Domain

Code Definition

Code	0
Definition	The original result was not described as "greater than" a value and did not need correction.
Source

Code Definition

Code	1
Definition	The original value required correction because its result was described as "greater than" a value.
Source

Allowed Values and Definitions

Enumerated Domain

Code Definition

Code	0
Definition	Restrictive: Data that are verifiably self-similar across organizations and time-periods and can be considered highly reliable and interoperable.
Source

Code Definition

Code	1
Definition	Narrowed: Data that we have good reason to believe are self-similar, but for which we can’t verify full interoperability across data providers.
Source

Code Definition

Code	2
Definition	Inclusive: Data that are assumed to be reliable and are harmonized to our best ability given the information available from the data provider. This tier includes NA or non-resolvable descriptions for the analytical method, which often make up the majority of methods descriptions for any given parameter. Because this tier represents many analytical methods, direct interoperability between samples is not guaranteed.
Source

Allowed Values and Definitions

Enumerated Domain

Code Definition

Code	0
Definition	Restrictive and narrowed analytical tiers with in vitro field methods.
Source

Code Definition

Code	1
Definition	Restrictive and narrowed analytical tiers with in situ field methods.
Source

Code Definition

Code	2
Definition	Any entry in the inclusive tier.
Source

Allowed Values and Definitions

Enumerated Domain

Code Definition

Code	NA
Definition	No Data
Source

Definition

text

Unit	number
Type	real
Min	1
Max	760

Allowed Values and Definitions

Enumerated Domain

Code Definition

Code	ug/L
Definition	microgramPerLiter
Source

Unit	microgramPerLiter
Type	real
Min	0
Max	1000

Unit	dimensionless
Type	real
Min	0
Max	7.5

Unit	degree
Type	real
Min	-14.31
Max	71.37

Unit	degree
Type	real
Min	-170.71
Max	145.71

Allowed Values and Definitions

Enumerated Domain

Code Definition

Code	AMSMA
Definition	From the EPA WQX Domain Values: American Samoa Datum
Source

Code Definition

Code	OLDHI
Definition	From the EPA WQX Domain Values: Old Hawaiian Datum
Source

Code Definition

Code	PR
Definition	From the EPA WQX Domain Values: Puerto Rico Datum
Source

Code Definition

Code	WGS72
Definition	From the EPA WQX Domain Values: World Geodetic System 1972
Source

Code Definition

Code	WGS84
Definition	From the EPA WQX Domain Values: World Geodetic System 1984
Source

Missing Value Code:

Code	NA
Expl	No Data

Code	NA
Expl	No Data

Code	NA
Expl	No Data

Code	NA
Expl	No data

Code	NA
Expl	No data

Code	NA
Expl	No data

Code	NA
Expl	No Data

Code	NA
Expl	No Data

Accuracy Report:

Accuracy Assessment:

Coverage:

Methods:

Non-Categorized Data Resource

Name:

chla_workflow

Entity Type:

application/zip

Description:

The R scripts and data used to produce the harmonized dataset

Physical Structure Description:

Object Name:

chla_workflow.zip

Size:

1040123020 byte

Authentication:

2475ebac9fdc77b6cd2dd2b51ae0f8e9 Calculated By MD5

Externally Defined Format:

Format Name:

application/zip

Data:

https://pasta-s.lternet.edu/package/data/eml/edi/1747/1/1006f1c0287ecfd7f74caf9848889f69

Non-Categorized Data Resource

Name:

bookdown_documentation

Entity Type:

application/zip

Description:

Documentation written with the {bookdown} R package that provides detailed information on the entire process of downloading and harmonizing the chlorophyll a data.

Physical Structure Description:

Object Name:

bookdown_documentation.zip

Size:

16148978 byte

Authentication:

c6a3c435762850a671c5dc0b6762adae Calculated By MD5

Externally Defined Format:

Format Name:

application/zip

Data:

https://pasta-s.lternet.edu/package/data/eml/edi/1747/1/f1258dd3f5881b1ce20a6d3c65b22eab

Non-Categorized Data Resource

Name:

README

Entity Type:

pdf

Description:

Provides context for the entities in this project.

Physical Structure Description:

Object Name:

README.pdf

Size:

70142 byte

Authentication:

fd38d63437764046d370556f8ed61553 Calculated By MD5

Externally Defined Format:

Format Name:

application/pdf

Data:

https://pasta-s.lternet.edu/package/data/eml/edi/1747/1/c47c7c7383225ab55ff591cb59c41e6b

Data Package Usage Rights

This data package is released to the "public domain" under Creative Commons CC0 1.0 "No Rights Reserved" (see: https://creativecommons.org/publicdomain/zero/1.0/). It is considered professional etiquette to provide attribution of the original work if this data package is shared in whole or by individual components. A generic citation is provided for this data package on the website https://portal.edirepository.org (herein "website") in the summary metadata page. Communication (and collaboration) with the creators of this data package is recommended to prevent duplicate research or publication. This data package (and its components) is made available "as is" and with no warranty of accuracy or fitness for use. The creators of this data package and the website shall not be liable for any damages resulting from misinterpretation or misuse of the data package or its components. Periodic updates of this data package may be available from the website. Thank you.

Keywords

By Thesaurus:
LTER Controlled Vocabulary	chlorophyll a, water quality, hydrology, water properties, water, water content, chlorophyll

Methods and Protocols

These methods, instrumentation and/or protocols apply to all data in this dataset:

Methods and protocols used in the collection of this data package

Description:

We downloaded data from the Water Quality Portal (WQP) for records with chlorophyll a related characteristicNames. The resulting data came from a variety of data providers, parameters, and methods. This inevitably results in a heterogeneous dataset that requires rigorous quality control before analytical use. As a part of this quality control we made an effort to remove personally identifying information from comment fields in the dataset immediately following the WQP download. However, as a derived data product, the harmonized chlorophyll a dataset may carry forward data quality issues in the original data from the WQP.

Instrument(s):

Description:

Pre-harmonization: Prior to data harmonization we performed several preparatory steps with the raw data:

1. Some column types were converted (e.g. result column to numeric) and duplicated to allow for harmonization while still retaining the original data

2. Site-specific metadata were joined to the dataset (e.g. ResolvedMonitoringLocationTypeName)

3. Date and time data were cleaned and used to create harmonized_tz (time zone), harmonized_local_time, and harmonized_utc columns

4. We dropped records where both chlorophyll a measurements and detection limits were missing (NA), or where there was insufficient information to resolve the missing chlorophyll measurement using other columns.

5. We filtered the ResultStatusIdentifier column to include only the following statuses: "Accepted", "Final", "Historical", "Validated", "Preliminary", NA. These statuses generally indicate that a reliable result was reached; however, we also included NA in an effort to be conservative.
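The date and time step (step 3) can be illustrated with a short Python sketch. The workflow in this package is written in R, so this is not the authors' code. It assumes that harmonized_tz holds one of the "Etc/GMT±N" names listed in this metadata, whose sign is inverted relative to the usual UTC offset ("Etc/GMT+5" is 5 hours behind GMT). The function names `etc_gmt_to_tzinfo` and `local_to_utc` are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def etc_gmt_to_tzinfo(tz_name: str) -> timezone:
    """Convert an 'Etc/GMT+N' or 'Etc/GMT-N' name to a fixed-offset tzinfo.

    The sign in Etc/GMT names is inverted: 'Etc/GMT+5' means UTC-05:00.
    """
    prefix = "Etc/GMT"
    if not tz_name.startswith(prefix):
        raise ValueError(f"unsupported time zone name: {tz_name!r}")
    suffix = tz_name[len(prefix):]
    hours = int(suffix) if suffix else 0  # plain 'Etc/GMT' is UTC itself
    return timezone(timedelta(hours=-hours))


def local_to_utc(date_str: str, time_str: str, tz_name: str) -> str:
    """Combine a local date and time and express the instant in UTC."""
    local = datetime.strptime(f"{date_str} {time_str}", "%Y-%m-%d %H:%M:%S")
    local = local.replace(tzinfo=etc_gmt_to_tzinfo(tz_name))
    return local.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


if __name__ == "__main__":
    # 10:30 local time, 5 hours behind GMT
    print(local_to_utc("2020-07-01", "10:30:00", "Etc/GMT+5"))
```

Fixed offsets are used here because the Etc/GMT names carry no daylight saving rules, so the sketch needs no time zone database.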

Instrument(s):

Description:	Steps from here forward were considered part of the harmonization process. We next ensured that the media type for all chlorophyll data was "Surface Water", "Water", "Estuary", or NA.
Instrument(s):	R
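
A minimal Python sketch of this kind of allow-list filter, in which missing values (NA) are retained rather than dropped. It is an illustration only and not the R code used in the workflow. The record key `media` is a hypothetical stand-in, because the WQP column used for media type is not named here, and NA is represented as `None`.

```python
from typing import Iterable, List, Optional

# Media types named in the description above; NA is handled separately.
ALLOWED_MEDIA = {"Surface Water", "Water", "Estuary"}


def has_allowed_media(media: Optional[str]) -> bool:
    """Return True for a missing media type (None) or an allowed one."""
    return media is None or media in ALLOWED_MEDIA


def filter_by_media(records: Iterable[dict]) -> List[dict]:
    """Keep records whose hypothetical 'media' field passes the allow-list."""
    return [r for r in records if has_allowed_media(r.get("media"))]


if __name__ == "__main__":
    sample = [
        {"id": 1, "media": "Water"},
        {"id": 2, "media": None},
        {"id": 3, "media": "Sediment"},
        {"id": 4, "media": "Estuary"},
    ]
    print([r["id"] for r in filter_by_media(sample)])
```

The explicit `None` check matters: a plain membership test would silently drop records with missing media, which the description above keeps.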

Description:

We next filtered out records based on indications that they failed data quality assurance or quality control for some reason given by the data provider (these instances are referred to here as “failures”).

After reviewing the contents of the ActivityCommentText, ResultLaboratoryCommentText, ResultCommentText, and ResultMeasureValue_original columns, we developed a list of terms that captured the majority of instances where records had failures or unacceptable measurements. We found the phrasing to be consistent across columns and so we searched for the same (case-insensitive) terms in all four locations. The terms are: “beyond accept”, “cancelled”, “contaminat”, “error”, “fail”, “improper”, “instrument down”, “interference”, “invalid”, “no result”, “no test”, “not accept”, “outside of accept”, “problem”, “QC EXCEEDED”, “questionable”, “suspect”, “unable”, “violation”, “reject”, “no data”.
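
The term search can be sketched in Python as follows. This is an illustration rather than the workflow's R code. It assumes that each term is matched as a case-insensitive substring (several terms, such as “contaminat”, read as word stems), and that a record is a dictionary keyed by the four column names above. The function name `is_failure` is hypothetical.

```python
import re

# Terms and columns as listed in the description above.
FAILURE_TERMS = [
    "beyond accept", "cancelled", "contaminat", "error", "fail", "improper",
    "instrument down", "interference", "invalid", "no result", "no test",
    "not accept", "outside of accept", "problem", "QC EXCEEDED",
    "questionable", "suspect", "unable", "violation", "reject", "no data",
]
SEARCH_COLUMNS = [
    "ActivityCommentText",
    "ResultLaboratoryCommentText",
    "ResultCommentText",
    "ResultMeasureValue_original",
]

# One alternation pattern; re.escape guards against regex metacharacters.
FAILURE_PATTERN = re.compile(
    "|".join(re.escape(term) for term in FAILURE_TERMS), re.IGNORECASE
)


def is_failure(record: dict) -> bool:
    """Return True if any searched column contains a failure term."""
    return any(
        FAILURE_PATTERN.search(str(record[col])) is not None
        for col in SEARCH_COLUMNS
        if record.get(col) is not None  # skip missing (NA) cells
    )


if __name__ == "__main__":
    print(is_failure({"ResultCommentText": "Sample contaminated in lab"}))
    print(is_failure({"ActivityCommentText": "Routine sample"}))
```

Substring matching is deliberately broad, so a term such as “error” would also flag a comment like “no errors found”; whether the original workflow guards against such cases is not stated here.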

Instrument(s):

Description:

The next harmonization step used method detection limits (MDLs) to clean up the reported result values. When a numeric value was missing for the data record (i.e., NA or text that became NA during an as.numeric call) we checked for non-detect language in the ResultLaboratoryCommentText, ResultCommentText, ResultDetectionConditionText, and ResultMeasureValue columns. This language could be "non-detect", "not detect", "non detect", "undetect", or "below".

If non-detect language existed then we used the DetectionQuantitationLimitMeasure.MeasureValue column for the MDL, otherwise if there was a < and a number in the ResultMeasureValue column we used that number instead.

We then used a random number between 0 and 0.5 * MDL as the record’s value moving forward. Once the process was complete we filtered out any negative values in the dataset.

We produced a new column, mdl_flag, from the MDL cleaning process. Records where no MDL-based adjustment was made and which were at or above the MDL were assigned a 0. Records with corrected values based on the MDL method were assigned a 1. Finally, records where no MDL-based adjustment was made and which contain a numeric value below the provided MDL were assigned a 2.
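The MDL logic above can be sketched as follows. This is a Python illustration under stated assumptions, not the pipeline's R code: records are dictionaries, the function names are hypothetical, the first number following a < is taken as the limit, and records that cannot be resolved are returned with a value of None and a flag of 0 so that later steps can still try to recover them.

```python
import math
import random
import re

NON_DETECT = re.compile(r"non-detect|not detect|non detect|undetect|below", re.IGNORECASE)
LESS_THAN = re.compile(r"<\s*(\d+\.?\d*|\.\d+)")
TEXT_COLUMNS = [
    "ResultLaboratoryCommentText", "ResultCommentText",
    "ResultDetectionConditionText", "ResultMeasureValue",
]


def to_float(text):
    """Mimic a numeric conversion that yields None (NA) for non-numeric text."""
    try:
        number = float(text)
    except (TypeError, ValueError):
        return None
    return None if math.isnan(number) else number


def clean_mdl(record, rng=random):
    """Return (value, mdl_flag) for one record."""
    value = to_float(record.get("ResultMeasureValue"))
    limit = to_float(record.get("DetectionQuantitationLimitMeasure.MeasureValue"))
    if value is not None:
        # No adjustment: flag 2 if the value sits below the provided MDL, else 0.
        return value, (2 if limit is not None and value < limit else 0)
    mdl = None
    if any(NON_DETECT.search(str(record.get(c) or "")) for c in TEXT_COLUMNS):
        mdl = limit
    else:
        match = LESS_THAN.search(str(record.get("ResultMeasureValue") or ""))
        if match:
            mdl = float(match.group(1))
    if mdl is None:
        return None, 0  # assumption: unresolved records keep flag 0
    # Replace the missing value with a random draw between 0 and half the MDL.
    return rng.uniform(0, 0.5 * mdl), 1
```

Passing a seeded generator, for example clean_mdl(record, rng=random.Random(1)), makes the random draw reproducible.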

Instrument(s):

Description:

We next cleaned approximate values using a similar process as for MDL cleaning. We flagged “approximated” values in the dataset. The ResultMeasureValue column was checked for all three of the following conditions:

1. Numeric-only version of the column was still NA after MDL cleaning

2. The original column text contained a number

3. Any of ResultLaboratoryCommentText, ResultCommentText, or ResultDetectionConditionText matched this regular expression, ignoring case: "result approx|RESULT IS APPROX|value approx"

We then used the approximate value as the record’s value moving forward.

Records with corrected values based on the above method are noted with a 1 in the approx_flag column.
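A minimal Python sketch of the three-condition check follows (the pipeline itself was run in R). The function name clean_approx is hypothetical, and taking the first number found in the original text as the approximate value is an assumption.

```python
import re

APPROX = re.compile(r"result approx|RESULT IS APPROX|value approx", re.IGNORECASE)
NUMBER = re.compile(r"\d+\.?\d*|\.\d+")
COMMENT_COLUMNS = [
    "ResultLaboratoryCommentText", "ResultCommentText",
    "ResultDetectionConditionText",
]


def clean_approx(record, current_value):
    """Return (value, approx_flag); current_value is the value after MDL cleaning."""
    if current_value is not None:
        return current_value, 0  # condition 1 fails: value is already numeric
    number = NUMBER.search(str(record.get("ResultMeasureValue") or ""))
    flagged = any(APPROX.search(str(record.get(c) or "")) for c in COMMENT_COLUMNS)
    if number and flagged:
        # Assumption: the first number in the original text is the approximate value.
        return float(number.group()), 1
    return None, 0
```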

Instrument(s):

Description:

The next step was similar to the MDL and approximate value cleaning processes, and followed the approximate cleaning process most closely. The goal was to clean up values that were entered as “greater than” some value. The ResultMeasureValue column was checked for all three of the following conditions:

1. Numeric-only version of the column was still NA after MDL & approximate cleaning

2. The original column text contained a number

3. The original column text contained a >

We then used the “greater than” value (without >) as the record’s value moving forward.

Records with corrected values based on the above method are noted with a 1 in the greater_flag column.
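The “greater than” check can be sketched in the same way. This is a Python illustration with a hypothetical function name, clean_greater; using the first number in the text as the recovered value is an assumption.

```python
import re

NUMBER = re.compile(r"\d+\.?\d*|\.\d+")


def clean_greater(original_text, current_value):
    """Return (value, greater_flag); current_value is the value after earlier cleaning."""
    if current_value is not None:
        return current_value, 0
    text = str(original_text or "")
    number = NUMBER.search(text)
    # Both a ">" and a number must be present in the original text.
    if ">" in text and number:
        return float(number.group()), 1
    return None, 0
```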

Instrument(s):

Description:	The goal of the preceding three steps was to prevent records with seemingly missing measurement data from being dropped if there was still a chance of recovering a usable value. At this point we finished with that process and we proceeded to check for remaining records with NA values in their harmonized_value column. If they existed, they were dropped.
Instrument(s):	R

Description:	The next step in chla harmonization was converting the units of WQP records. We transformed units provided in the WQP into micrograms per liter (ug/L).
Instrument(s):	R
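A unit conversion of this kind can be sketched with a lookup table of multipliers. The sketch is in Python rather than R, and both the list of source units and the function name to_ug_per_l are hypothetical, since the units actually encountered in the WQP records are not listed here.

```python
# Multipliers that convert a value to micrograms per liter (ug/L).
# Assumption: this unit list is illustrative only.
TO_UG_PER_L = {
    "ug/l": 1.0,
    "mg/l": 1000.0,
    "mg/m3": 1.0,   # 1 mg/m3 equals 1 ug/L
    "ug/ml": 1000.0,
}


def to_ug_per_l(value, unit):
    """Convert value to ug/L, or return None when the unit is not recognized."""
    factor = TO_UG_PER_L.get((unit or "").strip().lower())
    if factor is None:
        return None
    return value * factor
```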

Description:

The next harmonization step involved cleaning the four depth-related columns obtained from the WQP. There are four columns that explicitly contain depth information for a given WQP entry, all of which use a variety of measurement units: ActivityDepthHeightMeasure.MeasureValue, ResultDepthHeightMeasure.MeasureValue, ActivityTopDepthHeightMeasure.MeasureValue, ActivityBottomDepthHeightMeasure.MeasureValue.

We completed a few pre-processing steps:

1. Convert the following character values to an explicit NA: “NA”, “999”, “-999”, “9999”, “-9999”, “-99”, “99”, “NaN”

2. Convert depths in all four columns to meters from the depth unit listed by the data provider

3. Create a single “discrete” sample depth column using a combination of the ActivityDepthHeightMeasure.MeasureValue and ResultDepthHeightMeasure.MeasureValue columns. We used the ActivityDepth value when the ResultDepth value was missing, and the ResultDepth value when the ActivityDepth value was missing. If both columns had values but disagreed, we used the average of the two.

This pre-processing resulted in three “harmonized” columns reporting water sampling depth values in meters: harmonized_discrete_depth_value, harmonized_top_depth_value, harmonized_bottom_depth_value.

Sample depth flags were assigned using the harmonized depth columns that result from the pre-processing steps above. If the record had no depth listed it was assigned a depth_flag of 0. A record with only discrete depth listed in the harmonized_discrete_depth_value was given a depth_flag of 1. A record with top and/or bottom depth (harmonized_top_depth_value, harmonized_bottom_depth_value), indicating an integrated sample, was assigned a depth_flag of 2, and any combination of discrete + top and/or bottom depths was assigned a depth_flag of 3, since the sample depth(s) could not be reconciled with certainty.
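The depth pre-processing and flag rules can be sketched as below. This is a Python illustration, not the pipeline's R code; the function names are hypothetical, and depths are assumed to have already been converted to meters.

```python
# Placeholder strings that are converted to an explicit missing value.
MISSING_CODES = {"NA", "999", "-999", "9999", "-9999", "-99", "99", "NaN"}


def parse_depth(text):
    """Return a depth as float, or None for missing or placeholder values."""
    if text is None or str(text).strip() in MISSING_CODES:
        return None
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


def discrete_depth(activity_depth, result_depth):
    """Combine the Activity and Result depths into one discrete depth."""
    if activity_depth is None:
        return result_depth
    if result_depth is None:
        return activity_depth
    # Both present: the average also covers the case where they agree.
    return (activity_depth + result_depth) / 2


def depth_flag(discrete, top, bottom):
    """Assign depth_flag from the three harmonized depth values."""
    has_discrete = discrete is not None
    has_range = top is not None or bottom is not None
    if has_discrete and has_range:
        return 3
    if has_range:
        return 2
    if has_discrete:
        return 1
    return 0
```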

Description:

We next reviewed the analytical methods used in measuring chlorophyll a, primarily by classifying the text provided with each record in ResultAnalyticalMethod.MethodName. Once these methods were classified we arranged them into hierarchical tiers.

However, prior to classification we checked the ResultAnalyticalMethod.MethodName column for names that indicate non-chlorophyll a measurements. Phrases used to flag and remove unrelated methods from chlorophyll a data were: “sulfate”, “sediment”, “5310”, “counting”, “plasma”, “turbidity”, “coliform”, “carbon”, “2540”, “conductance”, “nitrate”, “nitrite”, “nitrogen”, “alkalin”, “zooplankton”, “phosphorus”, “periphyton”, “peri”, “biomass”, “temperature”, “elemental analyzer”, “2320”.

The next step towards creating tiers was to then classify the methods in ResultAnalyticalMethod.MethodName into either: HPLC methods, spectrophotometer and fluorometer methods, or methods for which a pheophytin correction is recorded as part of the methodology. These classifications were not the final tiers, but they informed the tiering in the final step of this process. The criteria for each of the above classifications were:

- HPLC: Detection of “447”, “chromatography”, or “hplc” in the ResultAnalyticalMethod.MethodName or presence of 70951 or 70953 in the USGSPCode column

- Spectro/fluoro: Detection of “445”, “fluor”, “Welshmeyer”, “fld”, “10200”, “446”, “trichromatic”, “spectrophoto”, “monochrom”, “monchrom”, or “spec” not as part of a word in ResultAnalyticalMethod.MethodName

- Pheophytin correction: Detection of “correct”, “445”, “446”, or “in presence” in ResultAnalyticalMethod.MethodName or detection of “corrected for pheophytin” or “free of pheophytin” in CharacteristicName
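The three classification criteria can be sketched as regular expressions. This Python illustration is not the pipeline's R code: the function name classify_method is hypothetical, case-insensitive matching is an assumption, and “spec” not as part of a word is interpreted here as a word-boundary match.

```python
import re

HPLC = re.compile(r"447|chromatography|hplc", re.IGNORECASE)
SPECTRO_FLUORO = re.compile(
    r"445|fluor|Welshmeyer|fld|10200|446|trichromatic|spectrophoto"
    r"|monochrom|monchrom|\bspec\b",
    re.IGNORECASE,
)
PHEO_METHOD = re.compile(r"correct|445|446|in presence", re.IGNORECASE)
PHEO_CHARACTERISTIC = re.compile(
    r"corrected for pheophytin|free of pheophytin", re.IGNORECASE
)
HPLC_PCODES = {"70951", "70953"}


def classify_method(method_name, characteristic_name="", usgs_pcode=""):
    """Return the three classification results for one record as booleans."""
    method_name = method_name or ""
    characteristic_name = characteristic_name or ""
    return {
        "hplc": bool(HPLC.search(method_name)) or str(usgs_pcode) in HPLC_PCODES,
        "spectro_fluoro": bool(SPECTRO_FLUORO.search(method_name)),
        "pheophytin_corrected": bool(
            PHEO_METHOD.search(method_name)
            or PHEO_CHARACTERISTIC.search(characteristic_name)
        ),
    }
```

A record can satisfy more than one classification, which is why the sketch returns three independent booleans rather than a single label.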

Finally, we grouped the data into three tiers:

Tier 0 - Restrictive: Data that are verifiably self-similar across organizations and time-periods and can be considered highly reliable.

Tier 1 - Narrowed: Data that we have good reason to believe are self-similar, but for which we can’t verify full compatibility across data providers.

Tier 2 - Inclusive: Data that are assumed to be reliable and are harmonized to our best ability given the information available from the data provider. This tier includes NA or non-resolvable descriptions for the analytical method, which often make up the majority of method descriptions. Because this tier represents many analytical methods, direct compatibility between samples is not guaranteed.

Note: Spectrophotometer and fluorometer methods that are labeled as pheophytin-corrected are grouped into the “Narrowed” tier. Depending on the exact implementation of EPA method 445, the correction philosophy may vary, and there is no agreed upon method to rectify inconsistencies in data entry related to these methodological differences. The final harmonization product that aggregates simultaneous records does not retain the CharacteristicName or ResultAnalyticalMethod.MethodName columns.

Description:

Next we flagged field sampling methods based primarily on the SampleCollectionMethod.MethodName column. We first classified each record into either in vitro or in situ methods (i.e., in vitro assumes a water sample was collected and taken to a lab for analysis; in situ assumes a measurement was obtained in the field).

We used the following strings to mark in vitro samples: “grab”, “bottle”, “vessel”, “bucket”, “jar”, “composite”, “integrate”, “UHL001”, “surface”, “filter”, “filtrat”, “1060B”, “kemmerer”, “collect”, “rosette”, “equal width”, “vertical”, “van dorn”, “bail”, “sample”, “sampling”, “lab” not in the middle of another word, or a “G” on its own as shorthand for “grab”. In situ samples were detected using “in situ”, “probe”, or “ctd”.

Lastly we created the field flag based on whether the sampling method used agrees with the analytical method. A flag of 0 indicates that the field sampling method is in agreement with the analytical method, and a flag of 1 indicates that the field sampling method is uncharacteristic of the analytical method. Any record in tier 2 is given a field flag of 2 due to the ambiguity associated with those observations’ analytical methods and corresponding sampling methods.

The following rules were used for chlorophyll a field sampling flags:

Flag 0: Restrictive and narrowed tiers with in vitro field methods

Flag 1: Restrictive and narrowed tiers with in situ field methods

Flag 2: Any entry in the inclusive tier
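The flag rules can be sketched as follows. This is a Python illustration with hypothetical function names, not the pipeline's R code. Several behaviors are assumptions because they are not stated above: matching ignores case, in situ strings are checked before in vitro strings, “lab” is matched as a whole word, and a restrictive or narrowed record whose sampling method matches neither list gets no flag (None).

```python
import re

IN_VITRO = re.compile(
    r"grab|bottle|vessel|bucket|jar|composite|integrate|UHL001|surface|filter"
    r"|filtrat|1060B|kemmerer|collect|rosette|equal width|vertical|van dorn"
    r"|bail|sample|sampling|\blab\b|^G$",
    re.IGNORECASE,
)
IN_SITU = re.compile(r"in situ|probe|ctd", re.IGNORECASE)


def field_method(method_name):
    """Classify a SampleCollectionMethod.MethodName string."""
    text = (method_name or "").strip()
    if IN_SITU.search(text):  # assumption: in situ takes precedence
        return "in situ"
    if IN_VITRO.search(text):
        return "in vitro"
    return None


def field_flag(tier, method_name):
    """Return the field flag for a record given its tier (0, 1, or 2)."""
    if tier == 2:
        return 2  # inclusive tier is always flagged 2
    method = field_method(method_name)
    if method == "in vitro":
        return 0
    if method == "in situ":
        return 1
    return None  # assumption: unclassified methods are left unflagged
```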

Description:

Next we added a placeholder for the miscellaneous flag column, misc_flag.

Some WQP parameters have additional flagging requirements that chlorophyll a does not, so we include this placeholder to maintain the same columns with potential future parameter data products.

Description:

Before finalizing the dataset we removed chlorophyll a values that were beyond a realistic threshold. We used 1000 ug/L as our cutoff for removal (Wetzel, 2001, Chapter 15, Figure 19).

Description:

The final step of chlorophyll a harmonization was to aggregate simultaneous observations. Any group of samples determined to be simultaneous was simplified into a single record containing the mean and coefficient of variation (CV) of the group. These can be either true duplicate entries in the WQP or records with non-identical values recorded at the same time and place and by the same organization (field and/or lab replicates/duplicates). The CV can be used to filter the dataset based on the amount of variability that is tolerable to specific use cases. Note, however, that many entries will have a CV that is NA because there are no duplicates or 0 because the records are duplicates and all entries have the same harmonized_value.

We identified simultaneous records to aggregate by creating identical subgroups (subgroup_id) from the following columns: parameter, OrganizationIdentifier, MonitoringLocationIdentifier, MonitoringLocationTypeName, ResolvedMonitoringLocationTypeName, ActivityStartDate, ActivityStartDateTime, ActivityStartTime.TimeZoneCode, harmonized_tz, harmonized_local_time, harmonized_utc, harmonized_top_depth_value, harmonized_top_depth_unit, harmonized_bottom_depth_value, harmonized_bottom_depth_unit, harmonized_discrete_depth_value, harmonized_discrete_depth_unit, depth_flag, mdl_flag, approx_flag, greater_flag, tier, field_flag, misc_flag, harmonized_units.

The final, aggregated values are presented in the harmonized_value and harmonized_value_cv columns. The number of rows used per group is recorded in the harmonized_row_count column.
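The aggregation can be sketched as a group-by over the subgroup columns. This Python illustration is not the pipeline's R code: the function name aggregate is hypothetical, the CV is assumed to be the sample standard deviation divided by the mean, and a group mean of zero is assumed to give a missing CV.

```python
from collections import defaultdict
from statistics import mean, stdev


def aggregate(records, key_columns):
    """Collapse records sharing the same key columns into one row per group."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(record.get(col) for col in key_columns)
        groups[key].append(record["harmonized_value"])

    rows = []
    for key, values in groups.items():
        avg = mean(values)
        # CV is missing (None) for single-record groups and for a zero mean.
        cv = stdev(values) / avg if len(values) > 1 and avg != 0 else None
        row = dict(zip(key_columns, key))
        row.update(
            harmonized_value=avg,
            harmonized_value_cv=cv,
            harmonized_row_count=len(values),
        )
        rows.append(row)
    return rows
```

With this definition a group of identical duplicates yields a CV of 0 and a single-record group yields a missing CV, matching the note above.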

People and Organizations

Publishers:

Organization:

Environmental Data Initiative

Email Address:

info@edirepository.org

Web Address:

https://edirepository.org

Id:

https://ror.org/0330j0z60

Creators:

Individual:

Matthew R Brousil

Organization:

Colorado State University

Email Address:

matthew.brousil@colostate.edu

Id:

https://orcid.org/0000-0001-8229-9445

Individual:

Michael F Meyer

Organization:

United States Geological Survey

Email Address:

mfmeyer@usgs.gov

Id:

https://orcid.org/0000-0001-7741-5982

Individual:

Katie Willi

Organization:

Colorado State University

Email Address:

kathryn.willi@colostate.edu

Id:

https://orcid.org/0000-0001-7163-2206

Individual:

B G Steele

Organization:

Colorado State University

Email Address:

b.steele@colostate.edu

Id:

https://orcid.org/0000-0003-4356-4103

Individual:

Juan De La Torre

Organization:

Colorado State University

Email Address:

juan.delatorre@colostate.edu

Id:

https://orcid.org/0009-0004-7541-8695

Individual:

Matthew R.V. Ross

Organization:

Colorado State University

Email Address:

matthew.ross@colostate.edu

Id:

https://orcid.org/0000-0001-9105-4255

Contacts:

Individual:

Matthew R Brousil

Organization:

Colorado State University

Email Address:

matthew.brousil@colostate.edu

Id:

https://orcid.org/0000-0001-8229-9445

Individual:

Michael F Meyer

Organization:

United States Geological Survey

Email Address:

mfmeyer@usgs.gov

Id:

https://orcid.org/0000-0001-6006-7985

Associated Parties:

Organization:

Radical Open Science Syndicate

Email Address:

matthew.ross@colostate.edu

Web Address:

https://rossyndicate.com/

Role:

Data curator

Organization:	United States Geological Survey
Id:	https://ror.org/035a68863
Role:	Funder

Metadata Providers:

Individual:

Juan De La Torre

Organization:

Colorado State University

Email Address:

juan.delatorre@colostate.edu

Id:

https://orcid.org/0009-0004-7541-8695

Individual:

Matthew R Brousil

Organization:

Colorado State University

Email Address:

mbrousil@colostate.edu

Id:

https://orcid.org/0000-0001-8229-9445

Temporal, Geographic and Taxonomic Coverage

Temporal, Geographic and/or Taxonomic information that applies to all data in this dataset:

Time Period

Begin:

1970-01-06

End:

2024-06-20

Geographic Region:

Description:

Conterminous US, Alaska, Hawaii, American Samoa, Puerto Rico, and United States Virgin Islands.

Bounding Coordinates:

Northern:	71.36658	Southern:	17.682
Western:	-163.14	Eastern:	-64.61357

Geographic Region:

Description:

Guam and Commonwealth of the Northern Mariana Islands

Bounding Coordinates:

Northern:	15.15297	Southern:	13.24574
Western:	144.6253	Eastern:	145.7105

Project

Parent Project Information:

Title:

AquaMatch

Personnel:

Individual:

Matthew R.V. Ross

Organization:

Colorado State University

Email Address:

matthew.ross@colostate.edu

Id:

https://orcid.org/0000-0001-9105-4255

Role:

Principal Investigator

Individual:

Matthew Brousil

Organization:

Colorado State University

Email Address:

matthew.brousil@colostate.edu

Id:

https://orcid.org/0000-0001-8229-9445

Role:

Key Contributor

Individual:

B G Steele

Organization:

Colorado State University

Email Address:

b.steele@colostate.edu

Id:

https://orcid.org/0000-0001-4365-4103

Role:

Key contributor

Individual:

Kathryn Willi

Organization:

Colorado State University

Email Address:

kathryn.willi@colostate.edu

Id:

https://orcid.org/0000-0001-7163-2206

Role:

Contributor

Individual:

Michael Meyer

Organization:

United States Geological Survey

Email Address:

mfmeyer@usgs.gov

Id:

https://orcid.org/0000-0001-6006-7985

Role:

Contributor

Maintenance

Maintenance:

Description:	completed
Frequency:	notPlanned

Other Metadata

Additional Metadata

additionalMetadata
        |___text '\n      '
        |___element 'metadata'
        |     |___text '\n         '
        |     |___element 'importedFromXML'
        |     |        \___attribute 'dateImported' = '2024-05-05'
        |     |        \___attribute 'filename' = 'EDI EML AquaMatch Chlorophyll a Data from Water Quality Portal_ ~1970-2023.xml'
        |     |        \___attribute 'taxonomicCoverageExempt' = 'True'
        |     |___text '\n      '
        |___text '\n   '

Additional Metadata

additionalMetadata
        |___text '\n      '
        |___element 'metadata'
        |     |___text '\n         '
        |     |___element 'emlEditor'
        |     |        \___attribute 'app' = 'ezEML'
        |     |        \___attribute 'release' = '2024.08.06'
        |     |___text '\n      '
        |___text '\n   '

Copyright 2024 Environmental Data Initiative. This material is based upon work supported by the National Science Foundation under grants #2223103 and #2223104. Any opinions, findings, conclusions, or recommendations expressed in the material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

EDI is a collaboration between the University of New Mexico and the University of Wisconsin – Madison, Center for Limnology.
