These methods, instrumentation and/or protocols apply to all data in this dataset: Methods and protocols used in the collection of this data package |
---|
Description: |
We downloaded data from the Water Quality Portal (WQP) for records with chlorophyll a-related characteristicNames. The resulting data came from a variety of data providers, parameters, and methods, which inevitably produces a heterogeneous dataset that requires rigorous quality control before analytical use. As part of this quality control, we made an effort to remove personally identifying information from comment fields in the dataset immediately following the WQP download. However, as a derived data product, the harmonized chlorophyll a dataset may carry forward data quality issues present in the original WQP data.
| Instrument(s): | R |
| Description: |
Pre-harmonization: Prior to data harmonization we performed several preparatory steps with the raw data:
1. Some column types were converted (e.g. result column to numeric) and duplicated to allow for harmonization while still retaining the original data
2. Site-specific metadata were joined to the dataset (e.g. ResolvedMonitoringLocationTypeName)
3. Date and time data were cleaned and used to create harmonized_tz (time zone), harmonized_local_time, and harmonized_utc columns
4. We dropped records where both chlorophyll a measurements and detection limits were missing (NA), or where there was insufficient information to resolve the missing chlorophyll measurement using other columns.
5. We filtered the ResultStatusIdentifier column to include only the following statuses: "Accepted", "Final", "Historical", "Validated", "Preliminary", NA. These statuses generally indicate that a reliable result was reached; we also retained NA in an effort to be conservative.
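The status filter in step 5 can be sketched as follows. This is an illustrative Python version, not the R code used to build the dataset; the dictionary record layout, the `keep_status` helper, and the example "Rejected" status are assumptions, and `None` stands in for R's NA.

```python
# Statuses retained in step 5; None stands in for a missing (NA) status.
ACCEPTED_STATUSES = {"Accepted", "Final", "Historical", "Validated", "Preliminary", None}


def keep_status(record):
    """Return True when ResultStatusIdentifier is one of the retained statuses."""
    return record.get("ResultStatusIdentifier") in ACCEPTED_STATUSES


# Hypothetical records for illustration only.
records = [
    {"id": 1, "ResultStatusIdentifier": "Final"},
    {"id": 2, "ResultStatusIdentifier": "Rejected"},
    {"id": 3, "ResultStatusIdentifier": None},
]
kept = [r for r in records if keep_status(r)]
kept_ids = [r["id"] for r in kept]
```

Here `kept_ids` holds records 1 and 3: the record with a missing status survives alongside the explicitly accepted one.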
| Instrument(s): | R |
| Description: |
Steps from here forward were considered part of the harmonization process.
We next ensured that the media type for all chlorophyll data was "Surface Water", "Water", "Estuary", or NA.
| Instrument(s): | R |
| Description: |
We next filtered out records based on indications that they failed data quality assurance or quality control for some reason given by the data provider (these instances are referred to here as “failures”).
After reviewing the contents of the ActivityCommentText, ResultLaboratoryCommentText, ResultCommentText, and ResultMeasureValue_original columns, we developed a list of terms that captured the majority of instances where records had failures or unacceptable measurements. We found the phrasing to be consistent across columns and so we searched for the same (case agnostic) terms in all four locations. The terms are: “beyond accept”, “cancelled”, “contaminat”, “error”, “fail”, “improper”, “instrument down”, “interference”, “invalid”, “no result”, “no test”, “not accept”, “outside of accept”, “problem”, “QC EXCEEDED”, “questionable”, “suspect”, “unable”, “violation”, “reject”, “no data”.
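A minimal sketch of the failure-term search, assuming each record is a dictionary keyed by WQP column name. It is written in Python for illustration rather than taken from the R pipeline, and the `is_failure` helper is a hypothetical name.

```python
import re

# Terms listed in the text; matching is case-insensitive.
FAIL_TERMS = [
    "beyond accept", "cancelled", "contaminat", "error", "fail", "improper",
    "instrument down", "interference", "invalid", "no result", "no test",
    "not accept", "outside of accept", "problem", "QC EXCEEDED", "questionable",
    "suspect", "unable", "violation", "reject", "no data",
]
FAIL_PATTERN = re.compile("|".join(re.escape(t) for t in FAIL_TERMS), re.IGNORECASE)

# The four columns searched for failure language.
FAIL_COLUMNS = [
    "ActivityCommentText",
    "ResultLaboratoryCommentText",
    "ResultCommentText",
    "ResultMeasureValue_original",
]


def is_failure(record):
    """Return True if any of the four columns contains a failure term."""
    return any(
        FAIL_PATTERN.search(record.get(col) or "") is not None
        for col in FAIL_COLUMNS
    )
```

Because the terms are substrings, "contaminat" catches both "contaminated" and "contamination". Records for which `is_failure` returns True would be the ones filtered out.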
| Instrument(s): | R |
| Description: |
The next harmonization step used method detection limits (MDLs) to clean up the reported result values. When a numeric value was missing for the data record (i.e., NA or text that became NA during an as.numeric call) we checked for non-detect language in the ResultLaboratoryCommentText, ResultCommentText, ResultDetectionConditionText, and ResultMeasureValue columns. This language could be "non-detect", "not detect", "non detect", "undetect", or "below".
If non-detect language existed then we used the DetectionQuantitationLimitMeasure.MeasureValue column for the MDL, otherwise if there was a < and a number in the ResultMeasureValue column we used that number instead.
We then used a random number between 0 and 0.5 * MDL as the record’s value moving forward. Once the process was complete we filtered out any negative values in the dataset.
We produced a new column, mdl_flag, from the MDL cleaning process. Records where no MDL-based adjustment was made and which were at or above the MDL were assigned a 0. Records with corrected values based on the MDL method were assigned a 1. Finally, records where no MDL-based adjustment was made and which contain a numeric value below the provided MDL were assigned a 2.
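The MDL logic above can be sketched for a single record as follows. This is a Python illustration under stated assumptions, not the original R code: `clean_mdl` and `to_number` are hypothetical helpers, records with no provided MDL are assumed to receive flag 0, and records that stay unresolved return `None` so that later steps can still try to recover a value.

```python
import math
import random
import re

NON_DETECT = re.compile(r"non-detect|not detect|non detect|undetect|below", re.IGNORECASE)
LESS_THAN = re.compile(r"<\s*(\d+(?:\.\d+)?)")
TEXT_COLUMNS = [
    "ResultLaboratoryCommentText",
    "ResultCommentText",
    "ResultDetectionConditionText",
    "ResultMeasureValue",
]
MDL_COLUMN = "DetectionQuantitationLimitMeasure.MeasureValue"


def to_number(text):
    """Lenient numeric conversion: return None where R's as.numeric would give NA."""
    try:
        number = float(text)
    except (TypeError, ValueError):
        return None
    return number if math.isfinite(number) else None


def clean_mdl(record, rng=random):
    """Return (harmonized_value, mdl_flag) for one record."""
    value = to_number(record.get("ResultMeasureValue"))
    mdl = to_number(record.get(MDL_COLUMN))
    if value is not None:
        # No adjustment made: flag 2 if below a provided MDL, otherwise 0.
        return value, (2 if mdl is not None and value < mdl else 0)
    has_language = any(NON_DETECT.search(record.get(c) or "") for c in TEXT_COLUMNS)
    if not has_language:
        # Fall back to a "<number" entry in the result column.
        match = LESS_THAN.search(record.get("ResultMeasureValue") or "")
        mdl = float(match.group(1)) if match else None
    if mdl is None:
        return None, 0  # assumption: unresolved records are left for later steps
    # Replace the missing value with a random draw between 0 and half the MDL.
    return rng.uniform(0, 0.5 * mdl), 1
```

A record entered as "<2" would receive a value somewhere between 0 and 1 and an mdl_flag of 1. The subsequent removal of negative values is a simple filter and is omitted here.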
| Instrument(s): | R |
| Description: |
We next cleaned approximate values using a process similar to MDL cleaning, flagging “approximated” values in the dataset. The ResultMeasureValue column was checked for all three of the following conditions:
1. Numeric-only version of the column was still NA after MDL cleaning
2. The original column text contained a number
3. Any of ResultLaboratoryCommentText, ResultCommentText, or ResultDetectionConditionText matched this regular expression, ignoring case: "result approx|RESULT IS APPROX|value approx"
We then used the approximate value as the record’s value moving forward.
Records with corrected values based on the above method are noted with a 1 in the approx_flag column.
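The three conditions can be combined as in the sketch below, written in Python for illustration rather than copied from the R pipeline. The `clean_approx` helper is hypothetical, taking the first number found in the original text is an assumption, and so is the flag of 0 for records that were not corrected.

```python
import re

# Regular expression quoted in condition 3, applied case-insensitively.
APPROX = re.compile(r"result approx|RESULT IS APPROX|value approx", re.IGNORECASE)
NUMBER = re.compile(r"\d+(?:\.\d+)?")
APPROX_COLUMNS = [
    "ResultLaboratoryCommentText",
    "ResultCommentText",
    "ResultDetectionConditionText",
]


def clean_approx(record, current_value):
    """Return (harmonized_value, approx_flag) given the value left after MDL cleaning."""
    original = record.get("ResultMeasureValue") or ""
    number = NUMBER.search(original)
    has_language = any(APPROX.search(record.get(c) or "") for c in APPROX_COLUMNS)
    if current_value is None and number and has_language:
        return float(number.group()), 1
    return current_value, 0
```

For example, a result entered as "~12.5" with the comment "Result approximate" would be recovered as 12.5 with an approx_flag of 1, while the same entry without a matching comment stays missing.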
| Instrument(s): | R |
| Description: |
The next step was similar to the MDL and approximate value cleaning processes, and followed the approximate cleaning process most closely. The goal was to clean up values that were entered as “greater than” some value. The ResultMeasureValue column was checked for all three of the following conditions:
1. Numeric-only version of the column was still NA after MDL & approximate cleaning
2. The original column text contained a number
3. The original column text contained a >
We then used the “greater than” value (without >) as the record’s value moving forward.
Records with corrected values based on the above method are noted with a 1 in the greater_flag column.
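A minimal Python sketch of the "greater than" check, for illustration only. The `clean_greater` helper is a hypothetical name, and using the first number in the text is an assumption; the source only states that the text must contain a number and a > character.

```python
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")


def clean_greater(original_text, current_value):
    """Return (harmonized_value, greater_flag) given the value left after earlier cleaning."""
    text = original_text or ""
    number = NUMBER.search(text)
    if current_value is None and number and ">" in text:
        # Keep the reported bound itself, dropping the ">" sign.
        return float(number.group()), 1
    return current_value, 0
```

An entry of ">500" with no numeric value so far becomes 500 with a greater_flag of 1; a record that already has a numeric value is returned unchanged.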
| Instrument(s): | R |
| Description: |
The goal of the preceding three steps was to prevent records with seemingly missing measurement data from being dropped if there was still a chance of recovering a usable value. With that process finished, we checked for remaining records with NA values in their harmonized_value column and dropped any that existed.
| Instrument(s): | R |
| Description: |
The next step in chlorophyll a harmonization was converting the units of WQP records. We transformed units provided in the WQP into micrograms per liter (ug/L).
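A unit conversion of this kind is typically a lookup of multiplication factors. The Python sketch below is illustrative only: the source does not list which units were encountered, so the three unit strings in the table and the `to_ug_per_l` helper are assumptions.

```python
# Hypothetical conversion table: multiply by the factor to obtain ug/L.
# 1 mg/L = 1000 ug/L, and 1 mg/m3 = 1 ug/L because 1 m3 = 1000 L.
TO_UG_PER_L = {
    "ug/l": 1.0,
    "mg/l": 1000.0,
    "mg/m3": 1.0,
}


def to_ug_per_l(value, unit):
    """Convert a value to ug/L, or return None when the unit is not recognized."""
    factor = TO_UG_PER_L.get((unit or "").strip().lower())
    if factor is None:
        return None
    return value * factor
```

Returning `None` for unrecognized units keeps unconvertible records visible instead of silently passing them through in the wrong unit.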
| Instrument(s): | R |
| Description: |
The next harmonization step involved cleaning the four depth-related columns obtained from the WQP. There are four columns that explicitly contain depth information for a given WQP entry, all of which use a variety of measurement units: ActivityDepthHeightMeasure.MeasureValue, ResultDepthHeightMeasure.MeasureValue, ActivityTopDepthHeightMeasure.MeasureValue, ActivityBottomDepthHeightMeasure.MeasureValue.
We completed a few pre-processing steps:
1. Convert the following character values to an explicit NA: “NA”, “999”, “-999”, “9999”, “-9999”, “-99”, “99”, “NaN”
2. Convert depths in all four columns to meters from the depth unit listed by the data provider
3. Create a single “discrete” sample depth column using a combination of the ActivityDepthHeightMeasure.MeasureValue and ResultDepthHeightMeasure.MeasureValue columns. We used the ActivityDepth value when the ResultDepth value was missing, and the ResultDepth value when the ActivityDepth value was missing. If both columns had values but disagreed, we used the average of the two.
This pre-processing resulted in three “harmonized” columns reporting water sampling depth values in meters: harmonized_discrete_depth_value, harmonized_top_depth_value, harmonized_bottom_depth_value.
Sample depth flags were assigned using the harmonized depth columns that result from the pre-processing steps above. If the record had no depth listed it was assigned a depth_flag of 0. A record with only discrete depth listed in the harmonized_discrete_depth_value was given a depth_flag of 1. A record with top and/or bottom depth (harmonized_top_depth_value, harmonized_bottom_depth_value), indicating an integrated sample, was assigned a depth_flag of 2, and any combination of discrete + top and/or bottom depths was assigned a depth_flag of 3, since the sample depth(s) could not be reconciled with certainty.
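The depth pre-processing and flagging rules can be sketched as below. This is a Python illustration, not the authors' R code; the helper names are hypothetical, and the conversion to meters (step 2) is assumed to have been applied already.

```python
# Character values treated as missing depths (step 1).
SENTINELS = {"NA", "999", "-999", "9999", "-9999", "-99", "99", "NaN"}


def parse_depth(text):
    """Convert depth text to a float, mapping sentinel and unparseable values to None."""
    if text is None or text.strip() in SENTINELS:
        return None
    try:
        return float(text)
    except ValueError:
        return None


def discrete_depth(activity_depth, result_depth):
    """Combine the two discrete depth columns (step 3)."""
    if activity_depth is None:
        return result_depth
    if result_depth is None:
        return activity_depth
    # Averaging also returns the shared value when the two columns agree.
    return (activity_depth + result_depth) / 2


def depth_flag(discrete, top, bottom):
    """Assign depth_flag from the three harmonized depth columns."""
    has_discrete = discrete is not None
    has_integrated = top is not None or bottom is not None
    if has_discrete and has_integrated:
        return 3
    if has_integrated:
        return 2
    if has_discrete:
        return 1
    return 0
```

Comparisons use `is not None` rather than truthiness so that a legitimate depth of 0 m still counts as a listed depth.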
|
| Description: |
We next reviewed the analytical methods used in measuring chlorophyll a, primarily by classifying the text provided with each record in ResultAnalyticalMethod.MethodName. Once these methods were classified we arranged them into hierarchical tiers.
However, prior to classification we checked the ResultAnalyticalMethod.MethodName column for names that indicate non-chlorophyll a measurements. Phrases used to flag and remove unrelated methods from chlorophyll a data were: “sulfate”, “sediment”, “5310”, “counting”, “plasma”, “turbidity”, “coliform”, “carbon”, “2540”, “conductance”, “nitrate”, “nitrite”, “nitrogen”, “alkalin”, “zooplankton”, “phosphorus”, “periphyton”, “peri”, “biomass”, “temperature”, “elemental analyzer”, “2320”.
The next step towards creating tiers was to then classify the methods in ResultAnalyticalMethod.MethodName into either: HPLC methods, spectrophotometer and fluorometer methods, or methods for which a pheophytin correction is recorded as part of the methodology. These classifications were not the final tiers, but they informed the tiering in the final step of this process. The criteria for each of the above classifications were:
- HPLC: Detection of “447”, “chromatography”, or “hplc” in the ResultAnalyticalMethod.MethodName or presence of 70951 or 70953 in the USGSPCode column
- Spectro/fluoro: Detection of “445”, “fluor”, “Welshmeyer”, “fld”, “10200”, “446”, “trichromatic”, “spectrophoto”, “monochrom”, “monchrom”, or “spec” not as part of a word in ResultAnalyticalMethod.MethodName
- Pheophytin correction: Detection of “correct”, “445”, “446”, or “in presence” in ResultAnalyticalMethod.MethodName or detection of “corrected for pheophytin” or “free of pheophytin” in CharacteristicName
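The three classification criteria above translate naturally into pattern matches. The Python sketch below is illustrative and not the original R implementation: the `classify_method` helper is hypothetical, case-insensitive matching is an assumption, and `\bspec\b` is one way to read “spec” not as part of a word.

```python
import re

HPLC = re.compile(r"447|chromatography|hplc", re.IGNORECASE)
SPEC_FLUOR = re.compile(
    r"445|fluor|Welshmeyer|fld|10200|446|trichromatic|spectrophoto|"
    r"monochrom|monchrom|\bspec\b",
    re.IGNORECASE,
)
PHEO_METHOD = re.compile(r"correct|445|446|in presence", re.IGNORECASE)
PHEO_NAME = re.compile(r"corrected for pheophytin|free of pheophytin", re.IGNORECASE)
HPLC_PCODES = {"70951", "70953"}


def classify_method(method_name, characteristic_name=None, usgs_pcode=None):
    """Return the three pre-tier classifications for one record as booleans."""
    method = method_name or ""
    return {
        "hplc": bool(HPLC.search(method)) or str(usgs_pcode) in HPLC_PCODES,
        "spec_fluor": bool(SPEC_FLUOR.search(method)),
        "pheo_corrected": bool(PHEO_METHOD.search(method))
        or bool(PHEO_NAME.search(characteristic_name or "")),
    }
```

The classifications are not mutually exclusive: a method name containing “445” is both spectro/fluoro and pheophytin-corrected, which is why they inform the tiers instead of defining them directly.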
Finally, we grouped the data into three tiers:
Tier 0 - Restrictive: Data that are verifiably self-similar across organizations and time-periods and can be considered highly reliable.
Tier 1 - Narrowed: Data that we have good reason to believe are self-similar, but for which we can’t verify full compatibility across data providers.
Tier 2 - Inclusive: Data that are assumed to be reliable and are harmonized to our best ability given the information available from the data provider. This tier includes NA or non-resolvable descriptions for the analytical method, which often make up the majority of method descriptions. Because this tier represents many analytical methods, direct compatibility between samples is not guaranteed.
Note: Spectrophotometer and fluorometer methods that are labeled as pheophytin-corrected are grouped into the “Narrowed” tier. Depending on the exact implementation of EPA method 445, the correction philosophy may vary, and there is no agreed upon method to rectify inconsistencies in data entry related to these methodological differences. The final harmonization product that aggregates simultaneous records does not retain the CharacteristicName or ResultAnalyticalMethod.MethodName columns.
|
| Description: |
Next we flagged field sampling methods based primarily on the SampleCollectionMethod.MethodName column. We first classified each record into either in vitro or in situ methods (i.e., in vitro assumes a water sample was collected and taken to a lab for analysis; in situ assumes a measurement was obtained in the field).
We used the following strings to mark in vitro samples: “grab”, “bottle”, “vessel”, “bucket”, “jar”, “composite”, “integrate”, “UHL001”, “surface”, “filter”, “filtrat”, “1060B”, “kemmerer”, “collect”, “rosette”, “equal width”, “vertical”, “van dorn”, “bail”, “sample”, “sampling”, “lab” not in the middle of another word, or a “G” on its own as shorthand for “grab”. In situ samples were detected using “in situ”, “probe”, or “ctd”.
Lastly we created the field flag based on whether the sampling method agreed with the analytical method. A flag of 0 indicates that the field sampling method is in agreement with the analytical method, and a flag of 1 indicates that the field sampling method is uncharacteristic of the analytical method. Any record in tier 2 was given a field flag of 2 because of the ambiguity associated with those observations’ analytical methods and corresponding sampling methods.
The following rules were used for chlorophyll a field sampling flags:
Flag 0: Restrictive and narrowed tiers with in vitro field methods
Flag 1: Restrictive and narrowed tiers with in situ field methods
Flag 2: Any entry in the inclusive tier
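These rules can be sketched as a small function. The Python below is illustrative and not the R code behind the dataset; the `field_flag` helper is hypothetical, tiers are assumed to be coded 0, 1, and 2, and two behaviors the source does not specify are marked as assumptions in the comments.

```python
import re

IN_VITRO = re.compile(
    r"grab|bottle|vessel|bucket|jar|composite|integrate|UHL001|surface|filter|filtrat|"
    r"1060B|kemmerer|collect|rosette|equal width|vertical|van dorn|bail|sample|sampling|"
    r"\blab\b|^\s*G\s*$",
    re.IGNORECASE,
)
IN_SITU = re.compile(r"in situ|probe|ctd", re.IGNORECASE)


def field_flag(tier, sampling_method):
    """Return the field flag for one record from its tier and sampling method text."""
    if tier == 2:
        return 2  # any entry in the inclusive tier
    method = sampling_method or ""
    if IN_SITU.search(method):
        return 1  # assumption: in situ wins if both term lists match
    if IN_VITRO.search(method):
        return 0
    return None  # assumption: the source does not say how unclassified methods are flagged
```

For instance, a tier 0 record sampled by "Grab sample" gets flag 0, a tier 1 record from a "CTD profile" gets flag 1, and any tier 2 record gets flag 2 regardless of its sampling method.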
|
| Description: |
Next we added a placeholder for the miscellaneous flag column, misc_flag.
Some WQP parameters have additional flagging requirements that chlorophyll a does not, so we included this placeholder to keep the columns consistent with potential future parameter data products.
|
| Description: |
Before finalizing the dataset we removed chlorophyll a values that were beyond a realistic threshold. We used 1000 ug/L as our cutoff for removal (Wetzel, 2001, Chapter 15, Figure 19).
|
| Description: |
The final step of chlorophyll a harmonization was to aggregate simultaneous observations. Any group of samples determined to be simultaneous was simplified into a single record containing the mean and coefficient of variation (CV) of the group. These groups can be either true duplicate entries in the WQP or records with non-identical values recorded at the same time and place and by the same organization (field and/or lab replicates/duplicates). The CV can be used to filter the dataset based on the amount of variability that is tolerable for a specific use case. Note, however, that many entries will have a CV of NA because there are no duplicates, or a CV of 0 because the records are duplicates and all entries have the same harmonized_value.
We identified simultaneous records to aggregate by creating identical subgroups (subgroup_id) from the following columns: parameter, OrganizationIdentifier, MonitoringLocationIdentifier, MonitoringLocationTypeName, ResolvedMonitoringLocationTypeName, ActivityStartDate, ActivityStartDateTime, ActivityStartTime.TimeZoneCode, harmonized_tz, harmonized_local_time, harmonized_utc, harmonized_top_depth_value, harmonized_top_depth_unit, harmonized_bottom_depth_value, harmonized_bottom_depth_unit, harmonized_discrete_depth_value, harmonized_discrete_depth_unit, depth_flag, mdl_flag, approx_flag, greater_flag, tier, field_flag, misc_flag, harmonized_units.
The final, aggregated values are presented in the harmonized_value and harmonized_value_cv columns. The number of rows used per group is recorded in the harmonized_row_count column.
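The aggregation can be sketched as a group-by over the subgroup columns. This Python version is illustrative and not the authors' R code: the `aggregate` helper is hypothetical, the CV is assumed to be the sample standard deviation divided by the mean, and a CV of `None` (NA) is assumed when the mean is 0.

```python
from collections import defaultdict
from statistics import mean, stdev


def aggregate(records, key_columns):
    """Collapse records that share all key columns into one row with mean and CV."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(record.get(col) for col in key_columns)
        groups[key].append(record["harmonized_value"])

    aggregated = []
    for key, values in groups.items():
        average = mean(values)
        # The sample standard deviation needs at least two values, so a single
        # record yields a missing CV, matching the NA case described in the text.
        cv = stdev(values) / average if len(values) > 1 and average != 0 else None
        row = dict(zip(key_columns, key))
        row["harmonized_value"] = average
        row["harmonized_value_cv"] = cv
        row["harmonized_row_count"] = len(values)
        aggregated.append(row)
    return aggregated


# Hypothetical records grouped on a reduced set of key columns.
example = aggregate(
    [
        {"MonitoringLocationIdentifier": "A", "harmonized_value": 2.0},
        {"MonitoringLocationIdentifier": "A", "harmonized_value": 4.0},
        {"MonitoringLocationIdentifier": "B", "harmonized_value": 5.0},
    ],
    ["MonitoringLocationIdentifier"],
)
```

In this example site A collapses to a mean of 3.0 with a CV of roughly 0.47 from two rows, while site B keeps its single value with a missing CV. In practice `key_columns` would be the full list of subgroup columns given above.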
|