Data Package Metadata

AquaMatch Chlorophyll a Data from Water Quality Portal: ~1970-2024

General Information
Data Package:
Local Identifier:edi.1747.1
Title:AquaMatch Chlorophyll a Data from Water Quality Portal: ~1970-2024
Alternate Identifier:DOI PLACE HOLDER
Abstract:

This dataset, “AquaMatch Chlorophyll a Data from Water Quality Portal: ~1970-2024”, is a component of AquaSat version 2 (“v2”), a forthcoming update to AquaSat (Ross et al., 2019). The overarching purpose of AquaSat v2 is to emphasize the individual parts of the AquaSat pipeline that make up the matchups between satellite and in situ measurements. To that end, we have expanded and improved upon the AquaSat chlorophyll a dataset in two ways. First, we have incorporated in situ data that became available after the publication of AquaSat. Second, we have created a data quality tiering system to give end users more guidance on data usage. The system has three tiers: restrictive data, which are verifiably self-similar across organizations and time periods and can be considered highly reliable; narrowed data, which we have good reason to believe are self-similar but for which we cannot verify full compatibility across data providers; and inclusive data, which are assumed to be reliable and are harmonized to the best of our ability given the information available from the data provider. We have also added flag columns to help users understand the complexities of the available depth and field sampling data.

This dataset is a derived data product created using records downloaded from the Water Quality Portal (WQP) spanning January 6, 1970, to June 20, 2024. The WQP is a data warehouse for water-related data measured or observed within the United States and US Territories managed by the Environmental Protection Agency, United States Geological Survey, and the National Water Quality Monitoring Council. The dataset does not contain remote sensing matchups but can be paired with Landsat surface reflectances using the pipeline presented in Ross et al. (2019).

Ross, M. R. V., Topp, S. N., Appling, A. P., Yang, X., Kuhn, C., Butman, D. et al. (2019). AquaSat: A data set to enable remote sensing of water quality for inland waters. Water Resources Research, 55, 10012–10025. https://doi.org/10.1029/2019WR024883

Publication Date:2024-08-22
For more information:
Visit: DOI PLACE HOLDER

Time Period
Begin:
1970-01-06
End:
2024-06-20

People and Organizations
Contact:Brousil, Matthew R (Colorado State University)
Contact:Meyer, Michael F (United States Geological Survey)
Creator:Brousil, Matthew R (Colorado State University)
Creator:Meyer, Michael F (United States Geological Survey)
Creator:Willi, Katie (Colorado State University)
Creator:Steele, B G (Colorado State University)
Creator:De La Torre, Juan (Colorado State University)
Creator:Ross, Matthew R.V. (Colorado State University)
Organization:Radical Open Science Syndicate
Organization:United States Geological Survey

Data Entities
Data Table Name:
chla_harmonized_final
Description:
A derived data product created using chlorophyll a measurement records downloaded from the Water Quality Portal (WQP). Contains a subset of columns from the WQP and additional columns added through harmonization and aggregation processes after downloading.
Other Name:
chla_workflow
Description:
The R scripts and data used to produce the harmonized dataset
Other Name:
bookdown_documentation
Description:
Documentation written with the {bookdown} R package that provides detailed information on the entire process of downloading and harmonizing the chlorophyll a data.
Other Name:
README
Description:
Provides context for the entities in this project.
Detailed Metadata

Data Entities


Data Table

Data:https://pasta-s.lternet.edu/package/data/eml/edi/1747/1/c97458622bc2126fa6d6c84e183101cf
Name:chla_harmonized_final
Description:A derived data product created using chlorophyll a measurement records downloaded from the Water Quality Portal (WQP). Contains a subset of columns from the WQP and additional columns added through harmonization and aggregation processes after downloading.
Number of Records:3393022
Number of Columns:33

Table Structure
Object Name:chla_harmonized_final.csv
Size:814963220 byte
Authentication:b322ce9eaa0d4bc537466615daae21f9 Calculated By MD5
Text Format:
Number of Header Lines:1
Record Delimiter:\n
Orientation:column
Simple Delimited:
Field Delimiter:,
Quote Character:"
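
The MD5 value listed under Authentication can be used to confirm that a downloaded copy of the table is intact. The following is a minimal Python sketch, not part of the data package; the helper names md5_of_file and matches_published_checksum are hypothetical, and the local file path is whatever the reader chose when downloading.

```python
import hashlib

# Digest copied from the Authentication field of chla_harmonized_final.csv.
EXPECTED_MD5 = "b322ce9eaa0d4bc537466615daae21f9"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_checksum(path, expected=EXPECTED_MD5):
    """True when the local file's digest equals the published digest."""
    return md5_of_file(path) == expected.lower()
```

Reading in chunks matters here because the table is roughly 815 MB; loading it whole just to hash it would be wasteful.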

Table Column Descriptions
Column Name:parameter
Definition:Specifies the type of environmental measurement being recorded.
Storage Type:string
Measurement Type:nominal

Column Name:OrganizationIdentifier
Definition:From the Water Quality Portal User Guide: A designator used to uniquely identify a unique business establishment within a context.
Storage Type:string
Measurement Type:nominal

Column Name:MonitoringLocationIdentifier
Definition:From the Water Quality Portal User Guide: A designator used to describe the unique name, number, or code assigned to identify the monitoring location.
Storage Type:string
Measurement Type:nominal

Column Name:MonitoringLocationTypeName
Definition:From the Water Quality Portal User Guide: The descriptive name for a type of monitoring location.
Storage Type:string
Measurement Type:nominal

Column Name:ResolvedMonitoringLocationTypeName
Definition:A resolved version of the MonitoringLocationTypeName column.
Storage Type:string
Measurement Type:nominal

Column Name:ActivityStartDate
Definition:From the Water Quality Portal User Guide: The calendar date on which the field activity is started.
Storage Type:dateTime
Measurement Type:dateTime

Column Name:ActivityStartTime.Time
Definition:From the Water Quality Portal User Guide: The time of day that is reported when the field activity began, based on a 24-hour timescale.
Storage Type:dateTime
Measurement Type:dateTime

Column Name:ActivityStartTime.TimeZoneCode
Definition:From the Water Quality Portal User Guide: The time zone for which the time of day is reported. Any of the longitudinal divisions of the earth's surface in which a standard time is kept.
Storage Type:string
Measurement Type:nominal

Column Name:harmonized_tz
Definition:Local time zone in GMT offset format determined either using the ActivityStartTime.TimeZoneCode or through spatial means when the ActivityStartTime.TimeZoneCode was NA.
Storage Type:string
Measurement Type:nominal

Column Name:harmonized_local_time
Definition:The calendar date and time on which the field activity is started. Created using the ActivityStartDate and ActivityStartTime.Time columns from WQP using the time zone specified in harmonized_tz. Based on a 24-hour timescale.
Storage Type:dateTime
Measurement Type:dateTime

Column Name:harmonized_utc
Definition:The Coordinated Universal Time (UTC) version of the calendar date and time on which the field activity is started. Created using the ActivityStartDate and ActivityStartTime.Time columns from WQP. Based on a 24-hour timescale.
Storage Type:dateTime
Measurement Type:dateTime

Column Name:ActivityStartDateTime
Definition:A UTC version of the calendar date and time on which the field activity is started. Created using the ActivityStartDate and ActivityStartTime.Time columns from WQP. Based on a 24-hour timescale. Differs from harmonized_utc because it was generated by the dataRetrieval R package and uses slightly different logic. Differences in some values occur because 1) ActivityStartDateTime is NA for ActivityStartTime.TimeZoneCode values of NA, "AST", "ADT", "GST", "IDLE"; or 2) harmonized_utc handles "00:00:00" values of ActivityStartTime.Time the same as NAs whereas ActivityStartDateTime does not.
Storage Type:string
Measurement Type:nominal

Column Name:harmonized_top_depth_value
Definition:A harmonized version of the WQP column, ActivityTopDepthHeightMeasure.MeasureValue. From the Water Quality Portal User Guide: A measurement of the upper vertical location of a vertical location range (measured from a reference point) at which an activity occurred.
Storage Type:float
Measurement Type:ratio

Column Name:harmonized_top_depth_unit
Definition:The unit for the harmonized_top_depth_value measurement.
Storage Type:string
Measurement Type:nominal

Column Name:harmonized_bottom_depth_value
Definition:A harmonized version of the WQP column, ActivityBottomDepthHeightMeasure.MeasureValue. From the Water Quality Portal User Guide: A measurement of the lower vertical location of a vertical location range (measured from a reference point) at which an activity occurred.
Storage Type:float
Measurement Type:ratio

Column Name:harmonized_bottom_depth_unit
Definition:The unit for the harmonized_bottom_depth_value measurement.
Storage Type:string
Measurement Type:nominal

Column Name:harmonized_discrete_depth_value
Definition:A harmonized combination of the two WQP columns, ActivityDepthHeightMeasure.MeasureValue and ResultDepthHeightMeasure.MeasureValue. From the Water Quality Portal User Guide: ActivityDepthHeightMeasure.MeasureValue: A measurement of the vertical location (measured from a reference point) at which an activity occurred. ResultDepthHeightMeasure.MeasureValue: A measurement of the vertical location (measured from a reference point) at which a result occurred. Note: Only in STORET.
Storage Type:float
Measurement Type:ratio

Column Name:harmonized_discrete_depth_unit
Definition:The unit for the harmonized_discrete_depth_value measurement.
Storage Type:string
Measurement Type:nominal

Column Name:depth_flag
Definition:Depth flags are assigned using the harmonized depth columns that result from the depth column harmonization process. They indicate the type of depth data available for a given record.
Storage Type:string
Measurement Type:nominal

Column Name:mdl_flag
Definition:Indicates that the value was created using the method detection limit method (e.g. replaced with a random number between 0 and 0.5 * (methods detection limit “MDL”)).
Storage Type:string
Measurement Type:nominal

Column Name:approx_flag
Definition:Indicates that the harmonized_value was corrected based on language indicating that the original WQP record was an approximated value (e.g. “approx 5.0”). Note that occasionally approximate language will be used in a record but not changed or flagged. This occurs when the language is used in a comment-related column and not the result column itself, meaning that there is a usable numeric value provided (and thus doesn’t need correction).
Storage Type:string
Measurement Type:nominal

Column Name:greater_flag
Definition:Indicates that the harmonized_value was corrected because the original WQP record was described as "greater than" some value (e.g. "> 5.0").
Storage Type:string
Measurement Type:nominal

Column Name:tier
Definition:Indicates the reliability and accuracy of each analytical method across data providers and throughout time.
Storage Type:string
Measurement Type:nominal

Column Name:field_flag
Definition:Indicates whether the sampling method used agrees with the analytical method.
Storage Type:string
Measurement Type:nominal

Column Name:misc_flag
Definition:Included as a flexible flag column in order to note important information that isn’t covered by the tiering and flags defined in other columns. Some parameters, like chlorophyll a, will not use this column at all and will therefore just contain NA values in places of flags. Values and their meaning will differ by parameter and as a result, flag values will be explained in the documentation for each parameter.
Storage Type:string
Measurement Type:nominal

Column Name:subgroup_id
Definition:A unique group identifier used to aggregate and summarize harmonized data. We identified groups to aggregate by creating subgroups (subgroup_id) with identical values from the following columns: parameter, OrganizationIdentifier, MonitoringLocationIdentifier, MonitoringLocationTypeName, ResolvedMonitoringLocationTypeName, ActivityStartDate, ActivityStartTime.Time, ActivityStartTime.TimeZoneCode, harmonized_tz, harmonized_local_time, harmonized_utc, ActivityStartDateTime, harmonized_top_depth_value, harmonized_top_depth_unit, harmonized_bottom_depth_value, harmonized_bottom_depth_unit, harmonized_discrete_depth_value, harmonized_discrete_depth_unit, depth_flag, mdl_flag, approx_flag, greater_flag, tier, field_flag, misc_flag, harmonized_units.
Storage Type:string
Measurement Type:nominal

Column Name:harmonized_row_count
Definition:The number of records contributing to the harmonized_value and harmonized_value_cv for the current subgroup_id.
Storage Type:float
Measurement Type:ratio

Column Name:harmonized_units
Definition:Units of measurement for the harmonized_value column.
Storage Type:string
Measurement Type:nominal

Column Name:harmonized_value
Definition:The mean chlorophyll a measurement following harmonization and aggregation to the current subgroup_id. Note that we set a threshold for realistic chlorophyll a values at 1,000 ug/L and removed records exceeding the threshold.
Storage Type:float
Measurement Type:ratio

Column Name:harmonized_value_cv
Definition:The coefficient of variation for harmonized chlorophyll a measurements in the current subgroup_id.
Storage Type:float
Measurement Type:ratio

Column Name:lat
Definition:Originally LatitudeMeasure. From the Water Quality Portal User Guide: The measure of the angular distance on a meridian north or south of the equator.
Storage Type:float
Measurement Type:ratio

Column Name:lon
Definition:Originally LongitudeMeasure. From the Water Quality Portal User Guide: The measure of the angular distance on a meridian east or west of the prime meridian.
Storage Type:float
Measurement Type:ratio

Column Name:datum
Definition:Originally HorizontalCoordinateReferenceSystemDatumName. From the Water Quality Portal User Guide: The name that describes the reference datum used in determining latitude and longitude coordinates.
Storage Type:string
Measurement Type:nominal

Measurement Values Domain (one block per column, in the same order as the column names above):
Allowed Values and Definitions
Enumerated Domain 
Code Definition
Codechlorophyll
DefinitionNotes that the environmental measurements in this dataset are for chlorophyll.
Source
Definitiontext
Definitiontext
Allowed Values and Definitions
Enumerated Domain 
Code Definition
CodeCanal Irrigation
DefinitionFrom the EPA WQX Domain Values: Irrigation canals are the main waterways that bring irrigation water from a water source to the areas to be irrigated. They can be lined with concrete, brick, stone, or a flexible membrane to prevent seepage and erosion.
Source
Code Definition
CodeCanal Transport
DefinitionFrom the EPA WQX Domain Values: Canals are human-made channels for water conveyance, or to service water transport vehicles. In most cases, the engineered works will have a series of dams and locks that create areas of low speed current flow. These areas are referred to as slack water levels, often just called levels.
Source
Code Definition
CodeChannelized Stream
DefinitionFrom the EPA WQX Domain Values: The process of straightening or redirecting natural streams in an artificially modified or constructed stream bed. Channelization has been carried out for numerous reasons, most often to drain wetlands, direct water flow for agricultural use, and control flooding. While this process makes a stream more useful for human activities, it tends to interfere with natural river habitats and to destabilize stream banks by destroying riparian vegetation.
Source
Code Definition
CodeEstuary
DefinitionFrom the EPA WQX Domain Values: A partially enclosed coastal body of brackish water with one or more rivers or streams flowing into it, and with a free connection to the open sea. Estuaries form a transition zone between river environments and maritime environments. The sea water entering the estuary is diluted by the fresh water flowing from rivers and streams.
Source
Code Definition
CodeGreat Lake
DefinitionFrom the EPA WQX Domain Values: The Great Lakes, also called the Laurentian Great Lakes and the Great Lakes of North America, are a series of interconnected freshwater lakes primarily in the upper mid-east region of North America, on the Canada–United States border, which connect to the Atlantic Ocean through the Saint Lawrence River.
Source
Code Definition
CodeLake
DefinitionFrom the EPA WQX Domain Values: A lake is an area filled with water, localized in a basin, that is surrounded by land, apart from any river or other outlet that serves to feed or drain the lake.
Source
Code Definition
CodeReservoir
DefinitionFrom the EPA WQX Domain Values: An enlarged natural or artificial lake, pond or impoundment created using a dam or lock to store water.
Source
Code Definition
CodeRiver/Stream
DefinitionFrom the EPA WQX Domain Values: A body of water with surface water flowing within the bed and banks of a channel.
Source
Code Definition
CodeRiver/Stream Intermittent
DefinitionFrom the EPA WQX Domain Values: Normally cease flowing for weeks or months each year.
Source
Code Definition
CodeRiver/Stream Perennial
DefinitionFrom the EPA WQX Domain Values: A stream or river (channel) that has continuous flow in parts of its stream bed all year round during years of normal rainfall.
Source
Code Definition
CodeRiverine Impoundment
DefinitionFrom the EPA WQX Domain Values: Impoundments (also known as reservoirs) are artificially created standing water bodies, produced by dams on streams or rivers.
Source
Code Definition
CodeStream
DefinitionFrom the EPA WQX Domain Values: A body of water with surface water flowing within the bed and banks of a channel.
Source
Code Definition
CodeLake, Reservoir, Impoundment
DefinitionFrom the EPA WQX Domain Values: Lake: A lake is an area filled with water, localized in a basin, that is surrounded by land, apart from any river or other outlet that serves to feed or drain the lake. Reservoir: An enlarged natural or artificial lake, pond or impoundment created using a dam or lock to store water. Riverine Impoundment: Impoundments (also known as reservoirs) are artificially created standing water bodies, produced by dams on streams or rivers.
Source
Code Definition
CodeCanal Drainage
DefinitionFrom the EPA WQX Domain Values: As a channel drainage system it is designed to eliminate the need for further pipework systems to be installed in parallel to the drainage, reducing the environmental impact of production as well as improving water collection.
Source
Code Definition
CodePond-Stormwater
DefinitionFrom the EPA WQX Domain Values: Stormwater, also spelled storm water, is water that originates during precipitation events and snow/ice melt.
Source
Code Definition
CodeRiver/Stream Ephemeral
DefinitionFrom the EPA WQX Domain Values: A stream that flows only briefly during and following a period of rainfall in the immediate locality
Source
Code Definition
CodeStream: Canal
DefinitionFrom the EPA WQX Domain Values: River/Stream: A body of water with surface water flowing within the bed and banks of a channel.
Source
Allowed Values and Definitions
Enumerated Domain 
Code Definition
CodeLake, Reservoir, Impoundment
DefinitionFrom the EPA WQX Domain Values: Lake: A lake is an area filled with water, localized in a basin, that is surrounded by land, apart from any river or other outlet that serves to feed or drain the lake. Reservoir: An enlarged natural or artificial lake, pond or impoundment created using a dam or lock to store water. Riverine Impoundment: Impoundments (also known as reservoirs) are artificially created standing water bodies, produced by dams on streams or rivers.
Source
Code Definition
CodeStream
DefinitionFrom the EPA WQX Domain Values: River/Stream: A body of water with surface water flowing within the bed and banks of a channel.
Source
Code Definition
CodeEstuary
DefinitionFrom the EPA WQX Domain Values: A partially enclosed coastal body of brackish water with one or more rivers or streams flowing into it, and with a free connection to the open sea. Estuaries form a transition zone between river environments and maritime environments. The sea water entering the estuary is diluted by the fresh water flowing from rivers and streams.
Source
FormatYYYY-MM-DD
Precision
Formathh:mm:ss
Precision
Allowed Values and Definitions
Enumerated Domain 
Code Definition
CodeADT
DefinitionFrom the EPA WQX Domain Values: Atlantic Daylight Time
Source
Code Definition
CodeAKDT
DefinitionFrom the EPA WQX Domain Values: Alaska Daylight Time
Source
Code Definition
CodeAKST
DefinitionFrom the EPA WQX Domain Values: Alaska Standard Time
Source
Code Definition
CodeAST
DefinitionFrom the EPA WQX Domain Values: Atlantic Standard Time
Source
Code Definition
CodeCDT
DefinitionFrom the EPA WQX Domain Values: Central Daylight Time
Source
Code Definition
CodeCST
DefinitionFrom the EPA WQX Domain Values: Central Standard Time
Source
Code Definition
CodeEDT
DefinitionFrom the EPA WQX Domain Values: Eastern Daylight Time
Source
Code Definition
CodeEST
DefinitionFrom the EPA WQX Domain Values: Eastern Standard Time
Source
Code Definition
CodeGMT
DefinitionFrom the EPA WQX Domain Values: Greenwich Mean Time
Source
Code Definition
CodeGST
DefinitionFrom the EPA WQX Domain Values: Guam Standard Time Zone (also Chamorro Standard Time)
Source
Code Definition
CodeHAST
DefinitionFrom the EPA WQX Domain Values: Hawaii-Aleutian Standard Time
Source
Code Definition
CodeHST
DefinitionHawaii Standard Time (assumed; not listed in WQX)
Source
Code Definition
CodeIDLE
DefinitionDefinition unknown. IDLE is not listed in WQX
Source
Code Definition
CodeMDT
DefinitionFrom the EPA WQX Domain Values: Mountain Daylight Time
Source
Code Definition
CodeMST
DefinitionFrom the EPA WQX Domain Values: Mountain Standard Time
Source
Code Definition
CodePDT
DefinitionFrom the EPA WQX Domain Values: Pacific Daylight Time
Source
Code Definition
CodePST
DefinitionFrom the EPA WQX Domain Values: Pacific Standard Time
Source
Code Definition
CodeUTC
DefinitionFrom the EPA WQX Domain Values: Coordinated Universal Time
Source
Allowed Values and Definitions
Enumerated Domain 
Code Definition
CodeEtc/GMT+10
DefinitionTime zone is 10 hours behind GMT
Source
Code Definition
CodeEtc/GMT+11
DefinitionTime zone is 11 hours behind GMT
Source
Code Definition
CodeEtc/GMT+3
DefinitionTime zone is 3 hours behind GMT
Source
Code Definition
CodeEtc/GMT+4
DefinitionTime zone is 4 hours behind GMT
Source
Code Definition
CodeEtc/GMT+5
DefinitionTime zone is 5 hours behind GMT
Source
Code Definition
CodeEtc/GMT+6
DefinitionTime zone is 6 hours behind GMT
Source
Code Definition
CodeEtc/GMT+7
DefinitionTime zone is 7 hours behind GMT
Source
Code Definition
CodeEtc/GMT+8
DefinitionTime zone is 8 hours behind GMT
Source
Code Definition
CodeEtc/GMT+9
DefinitionTime zone is 9 hours behind GMT
Source
Code Definition
CodeEtc/GMT-10
DefinitionTime zone is 10 hours ahead of GMT
Source
FormatYYYY-MM-DD hh:mm:ss
Precision
FormatYYYY-MM-DDThh:mm:ssZ
Precision
Definitiontext
Unitmeter
Typereal
Min-888 
Max42.4 
Allowed Values and Definitions
Enumerated Domain 
Code Definition
Codem
DefinitionMeter
Source
Unitmeter
Typereal
Min-888 
Max243 
Allowed Values and Definitions
Enumerated Domain 
Code Definition
Codem
DefinitionMeter
Source
Unitmeter
Typereal
Min-9 
Max5105.1 
Allowed Values and Definitions
Enumerated Domain 
Code Definition
Codem
DefinitionMeter
Source
Allowed Values and Definitions
Enumerated Domain 
Code Definition
Code0
DefinitionNo depth data provided.
Source
Code Definition
Code1
DefinitionOnly discrete depth data provided.
Source
Code Definition
Code2
DefinitionA record with a top and/or bottom depth. Indicates an integrated sample.
Source
Code Definition
Code3
DefinitionAny combination of discrete and top/bottom depths. Sample depths cannot be reconciled with certainty.
Source
Allowed Values and Definitions
Enumerated Domain 
Code Definition
Code0
DefinitionNo MDL-based adjustment was performed. This flag implies that the sample concentration was at or above the MDL.
Source
Code Definition
Code1
DefinitionMDL correction was performed.
Source
Code Definition
Code2
DefinitionNo MDL correction was applied and the concentration is below the MDL.
Source
Allowed Values and Definitions
Enumerated Domain 
Code Definition
Code0
DefinitionThe original value was not approximated and did not need correction.
Source
Code Definition
Code1
DefinitionThe original value required correction because its result was approximated.
Source
Allowed Values and Definitions
Enumerated Domain 
Code Definition
Code0
DefinitionThe original result was not described as "greater than" a value and did not need correction.
Source
Code Definition
Code1
DefinitionThe original value required correction because its result was described as "greater than" a value.
Source
Allowed Values and Definitions
Enumerated Domain 
Code Definition
Code0
DefinitionRestrictive: Data that are verifiably self-similar across organizations and time-periods and can be considered highly reliable and interoperable.
Source
Code Definition
Code1
DefinitionNarrowed: Data that we have good reason to believe are self-similar, but for which we can’t verify full interoperability across data providers.
Source
Code Definition
Code2
DefinitionInclusive: Data that are assumed to be reliable and are harmonized to our best ability given the information available from the data provider. This tier includes NA or non-resolvable descriptions for the analytical method, which often make up the majority of methods descriptions for any given parameter. Because this tier represents many analytical methods, direct interoperability between samples is not guaranteed.
Source
Allowed Values and Definitions
Enumerated Domain 
Code Definition
Code0
DefinitionRestrictive and narrowed analytical tiers with in vitro field methods.
Source
Code Definition
Code1
DefinitionRestrictive and narrowed analytical tiers with in situ field methods.
Source
Code Definition
Code2
DefinitionAny entry in the inclusive tier.
Source
Allowed Values and Definitions
Enumerated Domain 
Code Definition
CodeNA
DefinitionNo Data
Source
Definitiontext
Unitnumber
Typereal
Min
Max760 
Allowed Values and Definitions
Enumerated Domain 
Code Definition
Codeug/L
DefinitionmicrogramPerLiter
Source
UnitmicrogramPerLiter
Typereal
Min
Max1000 
Unitdimensionless
Typereal
Min
Max7.5 
Unitdegree
Typereal
Min-14.31 
Max71.37 
Unitdegree
Typereal
Min-170.71 
Max145.71 
Allowed Values and Definitions
Enumerated Domain 
Code Definition
CodeAMSMA
DefinitionFrom the EPA WQX Domain Values: American Samoa Datum
Source
Code Definition
CodeOLDHI
DefinitionFrom the EPA WQX Domain Values: Old Hawaiian Datum
Source
Code Definition
CodePR
DefinitionFrom the EPA WQX Domain Values: Puerto Rico Datum
Source
Code Definition
CodeWGS72
DefinitionFrom the EPA WQX Domain Values: World Geodetic System 1972
Source
Code Definition
CodeWGS84
DefinitionFrom the EPA WQX Domain Values: World Geodetic System 1984
Source
Missing Value Code:NA (No Data), defined for 8 of the 33 columns

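The definitions of subgroup_id, harmonized_row_count, harmonized_value, and harmonized_value_cv describe a group-and-summarize step. The following Python sketch illustrates that idea under stated assumptions; it is not the authors' R code. The function name aggregate and the input field "value" are hypothetical, only three of the 26 grouping columns are used for brevity, and the coefficient of variation is assumed to be the sample standard deviation divided by the mean (the metadata does not say which form was used). Single-record groups are given no CV here, which is also an assumption.

```python
from collections import defaultdict
from statistics import mean, stdev

# Illustrative subset of the grouping columns; the metadata lists 26 in total.
KEY_COLUMNS = ("MonitoringLocationIdentifier", "ActivityStartDate", "tier")


def aggregate(records, key_columns=KEY_COLUMNS):
    """Group records on identical key values and summarize each group."""
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[c] for c in key_columns)].append(rec["value"])

    summary = []
    # Groups are numbered in order of first appearance.
    for subgroup_id, (key, values) in enumerate(groups.items(), start=1):
        avg = mean(values)
        # CV needs at least two values and a non-zero mean.
        cv = stdev(values) / avg if len(values) > 1 and avg != 0 else None
        row = dict(zip(key_columns, key))
        row.update(
            subgroup_id=subgroup_id,
            harmonized_row_count=len(values),
            harmonized_value=avg,
            harmonized_value_cv=cv,
        )
        summary.append(row)
    return summary
```

Because every flag column is part of the grouping key, two measurements from the same site and time that differ in, say, mdl_flag would fall into different subgroups and would not be averaged together.
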
Non-Categorized Data Resource

Name:chla_workflow
Entity Type:application/zip
Description:The R scripts and data used to produce the harmonized dataset
Physical Structure Description:
Object Name:chla_workflow.zip
Size:1040123020 byte
Authentication:2475ebac9fdc77b6cd2dd2b51ae0f8e9 Calculated By MD5
Externally Defined Format:
Format Name:application/zip
Data:https://pasta-s.lternet.edu/package/data/eml/edi/1747/1/1006f1c0287ecfd7f74caf9848889f69

Non-Categorized Data Resource

Name:bookdown_documentation
Entity Type:application/zip
Description:Documentation written with the {bookdown} R package that provides detailed information on the entire process of downloading and harmonizing the chlorophyll a data.
Physical Structure Description:
Object Name:bookdown_documentation.zip
Size:16148978 byte
Authentication:c6a3c435762850a671c5dc0b6762adae Calculated By MD5
Externally Defined Format:
Format Name:application/zip
Data:https://pasta-s.lternet.edu/package/data/eml/edi/1747/1/f1258dd3f5881b1ce20a6d3c65b22eab

Non-Categorized Data Resource

Name:README
Entity Type:pdf
Description:Provides context for the entities in this project.
Physical Structure Description:
Object Name:README.pdf
Size:70142 byte
Authentication:fd38d63437764046d370556f8ed61553 Calculated By MD5
Externally Defined Format:
Format Name:application/pdf
Data:https://pasta-s.lternet.edu/package/data/eml/edi/1747/1/c47c7c7383225ab55ff591cb59c41e6b

Data Package Usage Rights

This data package is released to the "public domain" under Creative Commons CC0 1.0 "No Rights Reserved" (see: https://creativecommons.org/publicdomain/zero/1.0/). It is considered professional etiquette to provide attribution of the original work if this data package is shared in whole or by individual components. A generic citation is provided for this data package on the website https://portal.edirepository.org (herein "website") in the summary metadata page. Communication (and collaboration) with the creators of this data package is recommended to prevent duplicate research or publication. This data package (and its components) is made available "as is" and with no warranty of accuracy or fitness for use. The creators of this data package and the website shall not be liable for any damages resulting from misinterpretation or misuse of the data package or its components. Periodic updates of this data package may be available from the website. Thank you.

Keywords

By Thesaurus:
LTER Controlled Vocabulary:chlorophyll a, water quality, hydrology, water properties, water, water content, chlorophyll

Methods and Protocols

These methods, instrumentation and/or protocols apply to all data in this dataset:

Methods and protocols used in the collection of this data package
Description:

We downloaded data from the Water Quality Portal (WQP) for records with chlorophyll a-related characteristicNames. The resulting data came from a variety of data providers, parameters, and methods, which inevitably produces a heterogeneous dataset that requires rigorous quality control before analytical use. As part of this quality control, we made an effort to remove personally identifying information from comment fields in the dataset immediately following the WQP download. However, because it is a derived data product, the harmonized chlorophyll a dataset may carry forward data quality issues present in the original WQP data.

Instrument(s):R
Description:

Pre-harmonization: Prior to data harmonization we performed several preparatory steps with the raw data:

1. Some column types were converted (e.g. result column to numeric) and duplicated to allow for harmonization while still retaining the original data

2. Site-specific metadata were joined to the dataset (e.g. ResolvedMonitoringLocationTypeName)

3. Date and time data were cleaned and used to create harmonized_tz (time zone), harmonized_local_time, and harmonized_utc columns

4. We dropped records where both chlorophyll a measurements and detection limits were missing (NA), or where there was insufficient information to resolve the missing chlorophyll measurement using other columns.

5. We filtered the ResultStatusIdentifier column to include only the following statuses: "Accepted", "Final", "Historical", "Validated", "Preliminary", NA. These statuses generally indicate that a reliable result was reached; we also retained NA in an effort to be conservative.
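
The date and time step above can be illustrated with a short Python sketch that turns a date, a time, and a harmonized_tz value into a UTC timestamp. It is a sketch under assumptions, not the workflow's R code: the function names are hypothetical, and the zone name is parsed by hand rather than looked up in a time zone database. Note that Etc/GMT names use an inverted sign, so Etc/GMT+5 is 5 hours behind GMT, as stated in the harmonized_tz domain. The special handling of "00:00:00" times described for harmonized_utc is not modeled.

```python
from datetime import datetime, timedelta, timezone


def offset_from_etc_name(tz_name):
    """Map e.g. 'Etc/GMT+5' to UTC-05:00 (Etc/GMT names invert the sign)."""
    prefix = "Etc/GMT"
    if not tz_name.startswith(prefix):
        raise ValueError(f"unsupported time zone name: {tz_name!r}")
    suffix = tz_name[len(prefix):]
    hours = int(suffix) if suffix else 0  # int("+5") == 5, int("-10") == -10
    return timezone(timedelta(hours=-hours))


def to_utc(date_text, time_text, tz_name):
    """Combine a local date and time with its zone; return a UTC string."""
    local = datetime.strptime(f"{date_text} {time_text}", "%Y-%m-%d %H:%M:%S")
    local = local.replace(tzinfo=offset_from_etc_name(tz_name))
    return local.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
```

For example, to_utc("2020-07-01", "13:30:00", "Etc/GMT+5") yields "2020-07-01T18:30:00Z".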

Instrument(s):R
Description:

Steps from here forward were considered part of the harmonization process.

We next ensured that the media type for all chlorophyll data was "Surface Water", "Water", "Estuary", or NA.

Instrument(s):R
Description:

We next filtered out records based on indications that they failed data quality assurance or quality control for some reason given by the data provider (these instances are referred to here as “failures”).

After reviewing the contents of the ActivityCommentText, ResultLaboratoryCommentText, ResultCommentText, and ResultMeasureValue_original columns, we developed a list of terms that captured the majority of instances where records had failures or unacceptable measurements. We found the phrasing to be consistent across columns, so we searched for the same terms (ignoring case) in all four columns. The terms are: “beyond accept”, “cancelled”, “contaminat”, “error”, “fail”, “improper”, “instrument down”, “interference”, “invalid”, “no result”, “no test”, “not accept”, “outside of accept”, “problem”, “QC EXCEEDED”, “questionable”, “suspect”, “unable”, “violation”, “reject”, “no data”.
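
A minimal Python sketch of this term search follows. It assumes plain case-insensitive substring matching, which may differ from the pattern matching used in the R workflow, and the function names has_failure_language and drop_failures are hypothetical.

```python
# Terms from the list above, lowercased so the search can ignore case.
FAILURE_TERMS = (
    "beyond accept", "cancelled", "contaminat", "error", "fail", "improper",
    "instrument down", "interference", "invalid", "no result", "no test",
    "not accept", "outside of accept", "problem", "qc exceeded",
    "questionable", "suspect", "unable", "violation", "reject", "no data",
)

# The four columns searched for failure language.
TEXT_COLUMNS = (
    "ActivityCommentText", "ResultLaboratoryCommentText",
    "ResultCommentText", "ResultMeasureValue_original",
)


def has_failure_language(record):
    """True when any searched column contains any failure term."""
    for column in TEXT_COLUMNS:
        text = record.get(column)
        if text and any(term in str(text).lower() for term in FAILURE_TERMS):
            return True
    return False


def drop_failures(records):
    """Keep only records with no failure language."""
    return [rec for rec in records if not has_failure_language(rec)]
```

Several terms are deliberately truncated stems (“contaminat”, “fail”), so one entry matches “contaminated”, “contamination”, “failed”, and “failure”.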

Instrument(s):R
Description:

The next harmonization step used method detection limits (MDLs) to clean up the reported result values. When a numeric value was missing for the data record (i.e., NA or text that became NA during an as.numeric call) we checked for non-detect language in the ResultLaboratoryCommentText, ResultCommentText, ResultDetectionConditionText, and ResultMeasureValue columns. This language could be "non-detect", "not detect", "non detect", "undetect", or "below".

If non-detect language existed then we used the DetectionQuantitationLimitMeasure.MeasureValue column for the MDL, otherwise if there was a < and a number in the ResultMeasureValue column we used that number instead.

We then used a random number between 0 and 0.5 * MDL as the record’s value moving forward. Once the process was complete we filtered out any negative values in the dataset.

We produced a new column, mdl_flag, from the MDL cleaning process. Records where no MDL-based adjustment was made and which were at or above the MDL were assigned a 0. Records with corrected values based on the MDL method were assigned a 1. Finally, records where no MDL-based adjustment was made and which contain a numeric value below the provided MDL were assigned a 2.
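The MDL logic above can be sketched in Python as follows. This is a hedged illustration, not the authors' R code: the function names, the regular expression for "a < and a number", the use of None for NA, and the choice to return no flag for records that remain unresolved are all assumptions.

```python
import math
import random
import re

NON_DETECT_RE = re.compile(r"non-detect|not detect|non detect|undetect|below", re.IGNORECASE)
LESS_THAN_RE = re.compile(r"<\s*(\d+(?:\.\d+)?)")
NON_DETECT_COLUMNS = [
    "ResultLaboratoryCommentText", "ResultCommentText",
    "ResultDetectionConditionText", "ResultMeasureValue",
]
MDL_COLUMN = "DetectionQuantitationLimitMeasure.MeasureValue"

def to_float(text):
    """Lenient numeric cast: unparseable or missing text becomes None."""
    try:
        number = float(text)
    except (TypeError, ValueError):
        return None
    return None if math.isnan(number) else number

def clean_mdl(record, rng):
    """Return (value, mdl_flag) for one record, following the rules in the text."""
    raw = record.get("ResultMeasureValue")
    value = to_float(raw)
    mdl = to_float(record.get(MDL_COLUMN))
    if value is not None:
        # No adjustment: flag 2 if below the provided MDL, otherwise 0.
        return value, (2 if mdl is not None and value < mdl else 0)
    has_language = any(
        NON_DETECT_RE.search(str(record.get(col) or "")) for col in NON_DETECT_COLUMNS
    )
    if has_language:
        limit = mdl
    else:
        match = LESS_THAN_RE.search(str(raw or ""))
        limit = float(match.group(1)) if match else None
    if limit is None:
        return None, None  # unresolved here; later steps may still recover a value
    # Replace the missing value with a random draw between 0 and half the MDL.
    return rng.uniform(0, 0.5 * limit), 1

rng = random.Random(0)  # seeded only to make the sketch repeatable
value, flag = clean_mdl({"ResultMeasureValue": "<2.0"}, rng)
```

Negative values would then be filtered out once this step has run over the whole dataset, as described above.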

Instrument(s):R
Description:

We next cleaned approximate values using a process similar to the one used for MDL cleaning. We flagged “approximated” values in the dataset. The ResultMeasureValue column was checked for all three of the following conditions:

1. Numeric-only version of the column was still NA after MDL cleaning

2. The original column text contained a number

3. Any of ResultLaboratoryCommentText, ResultCommentText, or ResultDetectionConditionText matched this regular expression, ignoring case: "result approx|RESULT IS APPROX|value approx"

We then used the approximate value as the record’s value moving forward.

Records with corrected values based on the above method are noted with a 1 in the approx_flag column.
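A minimal Python sketch of the approximate-value check follows. It is illustrative rather than the authors' R code; the function name and the pattern used to pull the first number out of the original text are assumptions.

```python
import re

# Regular expression quoted in the text, applied ignoring case.
APPROX_RE = re.compile(r"result approx|RESULT IS APPROX|value approx", re.IGNORECASE)
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")  # first number in the text (assumed pattern)
COMMENT_COLUMNS = [
    "ResultLaboratoryCommentText", "ResultCommentText", "ResultDetectionConditionText",
]

def clean_approx(record, harmonized_value):
    """Return (value, approx_flag); only acts when the value is still missing."""
    if harmonized_value is not None:
        return harmonized_value, 0
    number = NUMBER_RE.search(str(record.get("ResultMeasureValue") or ""))
    has_comment = any(
        APPROX_RE.search(str(record.get(col) or "")) for col in COMMENT_COLUMNS
    )
    if number and has_comment:
        return float(number.group()), 1
    return None, 0

example = {"ResultMeasureValue": "12.5 est", "ResultCommentText": "Result is approximate"}
result = clean_approx(example, None)
```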

Instrument(s):R
Description:

The next step was similar to the MDL and approximate value cleaning processes, and followed the approximate cleaning process most closely. The goal was to clean up values that were entered as “greater than” some value. The ResultMeasureValue column was checked for all three of the following conditions:

1. Numeric-only version of the column was still NA after MDL & approximate cleaning

2. The original column text contained a number

3. The original column text contained a >

We then used the “greater than” value (without >) as the record’s value moving forward.

Records with corrected values based on the above method are noted with a 1 in the greater_flag column.
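The “greater than” step can be sketched the same way. In this hypothetical Python version the number is assumed to follow the > sign directly, which is stricter than the three conditions listed above.

```python
import re

# Assumes the number follows ">" directly, optionally after whitespace.
GREATER_RE = re.compile(r">\s*(\d+(?:\.\d+)?)")

def clean_greater(raw_text, harmonized_value):
    """Return (value, greater_flag); only acts when the value is still missing."""
    if harmonized_value is not None:
        return harmonized_value, 0
    match = GREATER_RE.search(str(raw_text or ""))
    if match:
        return float(match.group(1)), 1
    return None, 0

result = clean_greater(">500", None)
```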

Instrument(s):R
Description:

The goal of the preceding three steps was to prevent records with seemingly missing measurement data from being dropped if there was still a chance of recovering a usable value. With that process complete, we checked for records that still had NA values in the harmonized_value column and dropped any that remained.

Instrument(s):R
Description:

The next step in chla harmonization was converting the units of WQP records. We transformed units provided in the WQP into micrograms per liter (ug/L).
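As a rough illustration of this step, the Python sketch below converts a value to ug/L with a lookup table. The units in the table are assumptions chosen for the example, since the source units found in the WQP are not listed here; only the arithmetic (for example, 1 mg/L = 1000 ug/L) is certain.

```python
# Hypothetical lookup of multipliers to micrograms per liter; the actual
# set of source units handled by the workflow is not given in the text.
TO_UG_PER_L = {
    "ug/l": 1.0,
    "mg/l": 1000.0,
    "mg/m3": 1.0,     # 1 mg per cubic meter equals 1 ug per liter
    "ug/ml": 1000.0,
}

def to_ug_per_l(value, unit):
    """Convert a value to ug/L, or return None if the unit is not recognized."""
    factor = TO_UG_PER_L.get((unit or "").strip().lower())
    return None if factor is None else value * factor

converted = to_ug_per_l(0.002, "mg/L")
```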

Instrument(s):R
Description:

The next harmonization step involved cleaning the four depth-related columns obtained from the WQP. There are four columns that explicitly contain depth information for a given WQP entry, all of which use a variety of measurement units: ActivityDepthHeightMeasure.MeasureValue, ResultDepthHeightMeasure.MeasureValue, ActivityTopDepthHeightMeasure.MeasureValue, ActivityBottomDepthHeightMeasure.MeasureValue.

We completed a few pre-processing steps:

1. Convert the following character values to an explicit NA: “NA”, “999”, “-999”, “9999”, “-9999”, “-99”, “99”, “NaN”

2. Convert depths in all four columns to meters from the depth unit listed by the data provider

3. Create a single “discrete” sample depth column using a combination of the ActivityDepthHeightMeasure.MeasureValue and ResultDepthHeightMeasure.MeasureValue columns. We used the ActivityDepthHeightMeasure value when the ResultDepthHeightMeasure value was missing, and the ResultDepthHeightMeasure value when the ActivityDepthHeightMeasure value was missing. If both columns had values that disagreed, we used the average of the two.

This pre-processing resulted in three “harmonized” columns reporting water sampling depth values in meters: harmonized_discrete_depth_value, harmonized_top_depth_value, harmonized_bottom_depth_value.

Sample depth flags were assigned using the harmonized depth columns that result from the pre-processing steps above. If the record had no depth listed it was assigned a depth_flag of 0. A record with only discrete depth listed in the harmonized_discrete_depth_value was given a depth_flag of 1. A record with top and/or bottom depth (harmonized_top_depth_value, harmonized_bottom_depth_value), indicating an integrated sample, was assigned a depth_flag of 2, and any combination of discrete + top and/or bottom depths was assigned a depth_flag of 3, since the sample depth(s) could not be reconciled with certainty.
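The depth pre-processing and flag rules above can be sketched in Python as follows. This is an illustration, not the authors' R code: the function names are hypothetical, None stands in for NA, and the conversion to meters is assumed to have happened already.

```python
# Character values treated as missing, as listed in pre-processing step 1.
SENTINELS = {"NA", "999", "-999", "9999", "-9999", "-99", "99", "NaN"}

def parse_depth(text):
    """Return a float depth, or None for missing, sentinel, or unparseable text."""
    if text is None or str(text).strip() in SENTINELS:
        return None
    try:
        return float(text)
    except ValueError:
        return None

def discrete_depth(activity, result):
    """Combine the Activity and Result depths into one discrete depth."""
    if activity is None:
        return result
    if result is None:
        return activity
    return (activity + result) / 2  # equals either value when the two agree

def depth_flag(discrete, top, bottom):
    """Assign depth_flag 0 to 3 from the harmonized depth columns."""
    has_discrete = discrete is not None
    has_range = top is not None or bottom is not None
    if has_discrete and has_range:
        return 3
    if has_range:
        return 2
    if has_discrete:
        return 1
    return 0

depth = discrete_depth(parse_depth("1.0"), parse_depth("2.0"))
```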

Description:

We next reviewed the analytical methods used in measuring chlorophyll a, primarily by classifying the text provided with each record in ResultAnalyticalMethod.MethodName. Once these methods were classified we arranged them into hierarchical tiers.

However, prior to classification we checked the ResultAnalyticalMethod.MethodName column for names that indicate non-chlorophyll a measurements. Phrases used to flag and remove unrelated methods from chlorophyll a data were: “sulfate”, “sediment”, “5310”, “counting”, “plasma”, “turbidity”, “coliform”, “carbon”, “2540”, “conductance”, “nitrate”, “nitrite”, “nitrogen”, “alkalin”, “zooplankton”, “phosphorus”, “periphyton”, “peri”, “biomass”, “temperature”, “elemental analyzer”, “2320”.

The next step towards creating tiers was to then classify the methods in ResultAnalyticalMethod.MethodName into either: HPLC methods, spectrophotometer and fluorometer methods, or methods for which a pheophytin correction is recorded as part of the methodology. These classifications were not the final tiers, but they informed the tiering in the final step of this process. The criteria for each of the above classifications were:

- HPLC: Detection of “447”, “chromatography”, or “hplc” in the ResultAnalyticalMethod.MethodName or presence of 70951 or 70953 in the USGSPCode column

- Spectro/fluoro: Detection of “445”, “fluor”, “Welshmeyer”, “fld”, “10200”, “446”, “trichromatic”, “spectrophoto”, “monochrom”, “monchrom”, or “spec” not as part of a word in ResultAnalyticalMethod.MethodName

- Pheophytin correction: Detection of “correct”, “445”, “446”, or “in presence” in ResultAnalyticalMethod.MethodName or detection of “corrected for pheophytin” or “free of pheophytin” in CharacteristicName
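The three classification criteria above could be expressed as in this Python sketch. It is hypothetical: the function name and return structure are illustrative, case-insensitive matching is an assumption, and “spec not as part of a word” is interpreted here as a match on word boundaries.

```python
import re

HPLC_RE = re.compile(r"447|chromatography|hplc", re.IGNORECASE)
SPEC_FLUOR_RE = re.compile(
    r"445|fluor|welshmeyer|fld|10200|446|trichromatic|spectrophoto|"
    r"monochrom|monchrom|\bspec\b",
    re.IGNORECASE,
)
PHEO_METHOD_RE = re.compile(r"correct|445|446|in presence", re.IGNORECASE)
PHEO_CHARACTERISTIC_RE = re.compile(
    r"corrected for pheophytin|free of pheophytin", re.IGNORECASE
)
HPLC_PCODES = {"70951", "70953"}

def classify_method(method_name, usgs_pcode=None, characteristic_name=None):
    """Return the three (non-exclusive) method classifications for one record."""
    method = method_name or ""
    characteristic = characteristic_name or ""
    return {
        "hplc": bool(HPLC_RE.search(method)) or str(usgs_pcode) in HPLC_PCODES,
        "spec_fluor": bool(SPEC_FLUOR_RE.search(method)),
        "pheo_corrected": bool(
            PHEO_METHOD_RE.search(method) or PHEO_CHARACTERISTIC_RE.search(characteristic)
        ),
    }

example = classify_method("EPA 445.0 fluorometric")
```

The classifications are not mutually exclusive in this sketch; how they combine into the final tiers is described in the text that follows and is not encoded here.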

Finally, we grouped the data into three tiers:

Tier 0 - Restrictive: Data that are verifiably self-similar across organizations and time-periods and can be considered highly reliable.

Tier 1 - Narrowed: Data that we have good reason to believe are self-similar, but for which we can’t verify full compatibility across data providers.

Tier 2 - Inclusive: Data that are assumed to be reliable and are harmonized to our best ability given the information available from the data provider. This tier includes NA or non-resolvable descriptions for the analytical method, which often make up the majority of method descriptions. Because this tier represents many analytical methods, direct compatibility between samples is not guaranteed.

Note: Spectrophotometer and fluorometer methods that are labeled as pheophytin-corrected are grouped into the “Narrowed” tier. Depending on the exact implementation of EPA method 445, the correction philosophy may vary, and there is no agreed upon method to rectify inconsistencies in data entry related to these methodological differences. The final harmonization product that aggregates simultaneous records does not retain the CharacteristicName or ResultAnalyticalMethod.MethodName columns.

Description:

Next we flagged field sampling methods based primarily on the SampleCollectionMethod.MethodName column. We first classified each record into either in vitro or in situ methods (i.e., in vitro assumes a water sample was collected and taken to a lab for analysis; in situ assumes a measurement was obtained in the field).

We used the following strings to mark in vitro samples: “grab”, “bottle”, “vessel”, “bucket”, “jar”, “composite”, “integrate”, “UHL001”, “surface”, “filter”, “filtrat”, “1060B”, “kemmerer”, “collect”, “rosette”, “equal width”, “vertical”, “van dorn”, “bail”, “sample”, “sampling”, “lab” not in the middle of another word, or a “G” on its own as shorthand for “grab”. In situ samples were detected using “in situ”, “probe”, or “ctd”.

Lastly, we created the field flag based on whether the sampling method agrees with the analytical method. A flag of 0 indicates that the field sampling method is in agreement with the analytical method, a flag of 1 indicates that the field sampling method is uncharacteristic of the analytical method, and any record in tier 2 is given a field flag of 2 because of the ambiguity associated with those observations’ analytical methods and corresponding sampling methods.

The following rules were used for chlorophyll a field sampling flags:

Flag 0: Restrictive and narrowed tiers with in vitro field methods

Flag 1: Restrictive and narrowed tiers with in situ field methods

Flag 2: Any entry in the inclusive tier
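The flag rules above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' R code: matching is assumed to ignore case, “lab” and “G” are matched as whole words or whole strings, in situ terms are checked first when both kinds of term appear, and a tier 0 or 1 record matching neither list gets None because the text does not specify that case.

```python
import re

IN_VITRO_RE = re.compile(
    r"grab|bottle|vessel|bucket|jar|composite|integrate|UHL001|surface|filter|"
    r"filtrat|1060B|kemmerer|collect|rosette|equal width|vertical|van dorn|bail|"
    r"sample|sampling|\blab\b|^\s*g\s*$",
    re.IGNORECASE,
)
IN_SITU_RE = re.compile(r"in situ|probe|ctd", re.IGNORECASE)

def sampling_class(method_name):
    """Classify SampleCollectionMethod.MethodName text as in situ or in vitro."""
    text = method_name or ""
    if IN_SITU_RE.search(text):  # precedence over in vitro is an assumption
        return "in situ"
    if IN_VITRO_RE.search(text):
        return "in vitro"
    return None

def field_flag(tier, method_name):
    """Apply the three flag rules; tier is 0 (restrictive), 1 (narrowed), or 2 (inclusive)."""
    if tier == 2:
        return 2
    sampled = sampling_class(method_name)
    if sampled == "in vitro":
        return 0
    if sampled == "in situ":
        return 1
    return None  # case not covered by the rules in the text

flag = field_flag(0, "Grab sample")
```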

Description:

Next we added a placeholder for the miscellaneous flag column, misc_flag.

Some WQP parameters have additional flagging requirements that chlorophyll a does not, so we include this placeholder to keep the columns consistent with potential future data products for other parameters.

Description:

Before finalizing the dataset we removed chlorophyll a values that were beyond a realistic threshold. We used 1000 ug/L as our cutoff for removal (Wetzel, 2001, Chapter 15, Figure 19).

Description:

The final step of chlorophyll a harmonization was to aggregate simultaneous observations. Any group of samples determined to be simultaneous was simplified into a single record containing the mean and coefficient of variation (CV) of the group. These can be either true duplicate entries in the WQP or records with non-identical values recorded at the same time and place and by the same organization (field and/or lab replicates/duplicates). The CV can be used to filter the dataset based on the amount of variability that is tolerable for a specific use case. Note, however, that many entries will have a CV that is NA, because there are no duplicates, or 0, because the records are duplicates and all entries have the same harmonized_value.

We identified simultaneous records to aggregate by creating identical subgroups (subgroup_id) from the following columns: parameter, OrganizationIdentifier, MonitoringLocationIdentifier, MonitoringLocationTypeName, ResolvedMonitoringLocationTypeName, ActivityStartDate, ActivityStartDateTime, ActivityStartTime.TimeZoneCode, harmonized_tz, harmonized_local_time, harmonized_utc, harmonized_top_depth_value, harmonized_top_depth_unit, harmonized_bottom_depth_value, harmonized_bottom_depth_unit, harmonized_discrete_depth_value, harmonized_discrete_depth_unit, depth_flag, mdl_flag, approx_flag, greater_flag, tier, field_flag, misc_flag, harmonized_units.

The final, aggregated values are presented in the harmonized_value and harmonized_value_cv columns. The number of rows used per group is recorded in the harmonized_row_count column.
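The aggregation can be illustrated with the Python sketch below. It is not the authors' R code: only a subset of the grouping columns is used to keep the example short, None stands in for NA, and computing the CV as the sample standard deviation divided by the mean is an assumption.

```python
from collections import defaultdict
from statistics import mean, stdev

# Illustrative subset of the grouping columns listed in the text.
GROUP_COLUMNS = [
    "OrganizationIdentifier", "MonitoringLocationIdentifier",
    "harmonized_utc", "harmonized_discrete_depth_value", "tier",
]

def aggregate(records):
    """Collapse simultaneous records into one row with mean, CV, and row count."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(record.get(col) for col in GROUP_COLUMNS)
        groups[key].append(record["harmonized_value"])
    rows = []
    for key, values in groups.items():
        average = mean(values)
        # CV is undefined (None) for a single record or a zero mean.
        cv = stdev(values) / average if len(values) > 1 and average != 0 else None
        row = dict(zip(GROUP_COLUMNS, key))
        row.update(
            harmonized_value=average,
            harmonized_value_cv=cv,
            harmonized_row_count=len(values),
        )
        rows.append(row)
    return rows

site = {"OrganizationIdentifier": "ORG1", "MonitoringLocationIdentifier": "SITE1",
        "harmonized_utc": "2020-06-01T15:00:00Z",
        "harmonized_discrete_depth_value": 0.5, "tier": 1}
records = [
    {**site, "harmonized_value": 2.0},
    {**site, "harmonized_value": 4.0},
    {**site, "MonitoringLocationIdentifier": "SITE2", "harmonized_value": 5.0},
]
rows = aggregate(records)
```

In this example the two SITE1 records collapse into one row with a mean of 3.0, while the single SITE2 record keeps its value and has no CV.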

People and Organizations

Publishers:
Organization:Environmental Data Initiative
Email Address:
info@edirepository.org
Web Address:
https://edirepository.org
Id:https://ror.org/0330j0z60
Creators:
Individual: Matthew R Brousil
Organization:Colorado State University
Email Address:
matthew.brousil@colostate.edu
Id:https://orcid.org/0000-0001-8229-9445
Individual: Michael F Meyer
Organization:United States Geological Survey
Email Address:
mfmeyer@usgs.gov
Id:https://orcid.org/0000-0001-6006-7985
Individual: Katie Willi
Organization:Colorado State University
Email Address:
kathryn.willi@colostate.edu
Id:https://orcid.org/0000-0001-7163-2206
Individual: B G Steele
Organization:Colorado State University
Email Address:
b.steele@colostate.edu
Id:https://orcid.org/0000-0003-4356-4103
Individual: Juan De La Torre
Organization:Colorado State University
Email Address:
juan.delatorre@colostate.edu
Id:https://orcid.org/0009-0004-7541-8695
Individual: Matthew R.V. Ross
Organization:Colorado State University
Email Address:
matthew.ross@colostate.edu
Id:https://orcid.org/0000-0001-9105-4255
Contacts:
Individual: Matthew R Brousil
Organization:Colorado State University
Email Address:
matthew.brousil@colostate.edu
Id:https://orcid.org/0000-0001-8229-9445
Individual: Michael F Meyer
Organization:United States Geological Survey
Email Address:
mfmeyer@usgs.gov
Id:https://orcid.org/0000-0001-6006-7985
Associated Parties:
Organization:Radical Open Science Syndicate
Email Address:
matthew.ross@colostate.edu
Web Address:
https://rossyndicate.com/
Role:Data curator
Organization:United States Geological Survey
Id:https://ror.org/035a68863
Role:Funder
Metadata Providers:
Individual: Juan De La Torre
Organization:Colorado State University
Email Address:
juan.delatorre@colostate.edu
Id:https://orcid.org/0009-0004-7541-8695
Individual: Matthew R Brousil
Organization:Colorado State University
Email Address:
mbrousil@colostate.edu
Id:https://orcid.org/0000-0001-8229-9445

Temporal, Geographic and Taxonomic Coverage

Temporal, Geographic and/or Taxonomic information that applies to all data in this dataset:

Time Period
Begin:
1970-01-06
End:
2024-06-20
Geographic Region:
Description:Conterminous US, Alaska, Hawaii, American Samoa, Puerto Rico, and United States Virgin Islands.
Bounding Coordinates:
Northern:  71.36658
Southern:  17.682
Western:  -163.14
Eastern:  -64.61357
Geographic Region:
Description:Guam and Commonwealth of the Northern Mariana Islands
Bounding Coordinates:
Northern:  15.15297
Southern:  13.24574
Western:  144.6253
Eastern:  145.7105

Project

Parent Project Information:

Title:AquaMatch
Personnel:
Individual: Matthew R.V. Ross
Organization:Colorado State University
Email Address:
matthew.ross@colostate.edu
Id:https://orcid.org/0000-0001-9105-4255
Role:Principal Investigator
Individual: Matthew Brousil
Organization:Colorado State University
Email Address:
matthew.brousil@colostate.edu
Id:https://orcid.org/0000-0001-8229-9445
Role:Key Contributor
Individual: B G Steele
Organization:Colorado State University
Email Address:
b.steele@colostate.edu
Id:https://orcid.org/0000-0001-4365-4103
Role:Key Contributor
Individual: Kathryn Willi
Organization:Colorado State University
Email Address:
kathryn.willi@colostate.edu
Id:https://orcid.org/0000-0001-7163-2206
Role:Contributor
Individual: Michael Meyer
Organization:United States Geological Survey
Email Address:
mfmeyer@usgs.gov
Id:https://orcid.org/0000-0001-6006-7985
Role:Contributor

Maintenance

Maintenance:
Description:

completed

Frequency:notPlanned
Other Metadata

Additional Metadata

additionalMetadata
        |___text '\n      '
        |___element 'metadata'
        |     |___text '\n         '
        |     |___element 'importedFromXML'
        |     |        \___attribute 'dateImported' = '2024-05-05'
        |     |        \___attribute 'filename' = 'EDI EML AquaMatch Chlorophyll a Data from Water Quality Portal_ ~1970-2023.xml'
        |     |        \___attribute 'taxonomicCoverageExempt' = 'True'
        |     |___text '\n      '
        |___text '\n   '

Additional Metadata

additionalMetadata
        |___text '\n      '
        |___element 'metadata'
        |     |___text '\n         '
        |     |___element 'emlEditor'
        |     |        \___attribute 'app' = 'ezEML'
        |     |        \___attribute 'release' = '2024.08.06'
        |     |___text '\n      '
        |___text '\n   '

EDI is a collaboration between the University of New Mexico and the University of Wisconsin – Madison, Center for Limnology:
