These methods, instrumentation and/or protocols apply to all data in this dataset: Methods and protocols used in the collection of this data package |
---|
Description: | Three primary sets of data were analyzed: the first consists of the EML metadata that accompanies each data package in EDI’s main collection; the second is a summary of download events for individual data files; and the third consists of citations of data archived in the EDI repository obtained by a Google Scholar search. |
| Description: | EDI’s data collection and FAIR analysis
There is no universal definition of a data package (Lowenberg et al. 2019), and complete agreement does not exist even within a single community (Gries et al. 2021), which has ramifications for the following analyses. In environmental sciences, it is important that data packages are designed to document the context of a specific research project and data collection with metadata, data, and code. Hence, in some cases, a data package combines thematically different observations that are needed to fully comprehend the context of a particular research study (e.g., the abiotic conditions during sampling and concurrent observations of the biota). Alternatively, data may be separated into several data packages according to different aspects of a study. Following the above example, one package may contain meteorological data while a different package contains observations of the biota. In other cases, observations taken over time may be published as a single data series that is regularly updated and versioned, or as separate packages for each observation period (e.g., annually). Similarly, observations spanning more than one location may be split into different data packages along spatial criteria. High-volume data may also be separated into individual packages to simplify management, download, and processing. This heterogeneity should be considered when interpreting the following analyses, which are based on numbers of data series.
Metadata for the approximately 9,000 data series in EDI’s main collection (newest revisions only) were analyzed for specific attributes, including keywords, start and end dates of the data collection period, and the sampling locations. The R statistical programming language was used to parse the metadata and extract the attribute information. This information was then recorded into a table of key-value pairs for keyword analysis, into time-period bins for temporal analysis, or into latitude/longitude pairs for spatial analysis. These data and the R source code are published in the EDI data repository (Gries et al. 2022).
The set of metadata was then processed to determine compliance with criteria identified as being representative of FAIR data. The two sources of FAIR criteria used in this analysis are the FAIR Data Maturity Model proposed by Bahim et al. (2020) and the MetaDIG criteria (Jones and Slaughter 2019) adopted by DataONE. A detailed discussion of how FAIR criteria were mapped to EML attributes may be found in Smith (2022). In total, 46 criteria drawn from the two approaches were analyzed to determine their presence in EDI’s metadata. This analysis was also performed in R, with results recorded into criteria-based bins. |
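The attribute extraction and criteria check described above can be sketched as follows. The original analysis was done in R (Gries et al. 2022); this is only a minimal Python illustration, not the authors' code. The element names (`keyword`, `temporalCoverage`, `boundingCoordinates`, and so on) follow the EML schema, while the sample document, the function names, and the four entries in `ILLUSTRATIVE_CRITERIA` are assumptions made for this example and are not the 46 criteria used in the study.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed EML record used only for illustration.
SAMPLE_EML = """<eml:eml xmlns:eml="https://eml.ecoinformatics.org/eml-2.2.0" packageId="edi.0.1">
  <dataset>
    <title>Hypothetical lake temperature series</title>
    <keywordSet>
      <keyword>temperature</keyword>
      <keyword>lakes</keyword>
    </keywordSet>
    <coverage>
      <geographicCoverage>
        <boundingCoordinates>
          <westBoundingCoordinate>-89.5</westBoundingCoordinate>
          <eastBoundingCoordinate>-89.3</eastBoundingCoordinate>
          <northBoundingCoordinate>43.2</northBoundingCoordinate>
          <southBoundingCoordinate>43.0</southBoundingCoordinate>
        </boundingCoordinates>
      </geographicCoverage>
      <temporalCoverage>
        <rangeOfDates>
          <beginDate><calendarDate>1995-01-01</calendarDate></beginDate>
          <endDate><calendarDate>2020-12-31</calendarDate></endDate>
        </rangeOfDates>
      </temporalCoverage>
    </coverage>
  </dataset>
</eml:eml>"""

# Illustrative presence checks only (assumption): criterion name -> element path.
ILLUSTRATIVE_CRITERIA = {
    "has_title": ".//dataset/title",
    "has_keywords": ".//keyword",
    "has_temporal_coverage": ".//temporalCoverage",
    "has_license": ".//intellectualRights",
}

BOX_TAGS = (
    "westBoundingCoordinate",
    "eastBoundingCoordinate",
    "northBoundingCoordinate",
    "southBoundingCoordinate",
)


def extract_attributes(eml_text):
    """Return keywords, date range, and bounding boxes from one EML record."""
    root = ET.fromstring(eml_text)
    keywords = [k.text.strip() for k in root.iter("keyword") if k.text]
    begin = root.findtext(".//temporalCoverage/rangeOfDates/beginDate/calendarDate")
    end = root.findtext(".//temporalCoverage/rangeOfDates/endDate/calendarDate")
    boxes = [
        tuple(float(bc.findtext(tag)) for tag in BOX_TAGS)  # (west, east, north, south)
        for bc in root.iter("boundingCoordinates")
    ]
    return {"keywords": keywords, "begin": begin, "end": end, "boxes": boxes}


def check_criteria(eml_text, criteria=ILLUSTRATIVE_CRITERIA):
    """Report, per criterion, whether the mapped element is present."""
    root = ET.fromstring(eml_text)
    return {name: root.find(path) is not None for name, path in criteria.items()}
```

Running `check_criteria` over every record and summing the boolean results per criterion would give criteria-based bins of the kind described above.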
| Description: | Download Events
Download “request” events for data files were obtained from the repository audit system database. Each event is annotated with the identifier of the downloaded data file, an event date-timestamp, and the requesting HTTP User-Agent record. To analyze only user-initiated requests for data files, download events that did not contain a valid User-Agent record (i.e., the record was null or contained non-identifiable content) were excluded. The User-Agent record was used to categorize the originating actor of the request as a “robot”, “human”, or “program”. Download events identified as a “robot” (i.e., initiated by a search engine or other web crawler) were filtered out by matching the string content of the HTTP User-Agent record against known robot string patterns published by the Make Data Count project (Cousijn et al. 2019). The remaining download events were further labeled, also based on the User-Agent strings, as either “human” (i.e., initiated through a web browser) or “program” (i.e., initiated by a computer program). Human requests for data were identified by matching the User-Agent string to known web browser labels, while program requests were identified by User-Agent strings associated with the programming environment used to access the repository web-service API. This approach to identifying robots is not foolproof, but it is sufficient for the needs of this analysis.
Using the above approach, download events for 2021 were filtered and categorized. Of nearly 3 million download events, 180,000 were identified as either human- or program-initiated requests for data. Each download event record lists the data entity, which was used to identify the data package from which the data were downloaded. Once the data package was identified, its metadata were analyzed to determine the thematic classification of the data and the temporal range of the data-collection period. |
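The filtering and labeling steps above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the code used in the study: the three short pattern lists are placeholders invented for this example, whereas the actual analysis matched against the robot patterns published by the Make Data Count project and against a fuller set of browser and programming-environment labels.

```python
import re
from collections import Counter

# Placeholder patterns (assumption); the real lists are much longer.
ROBOT_PATTERNS = [r"bot", r"crawl", r"spider", r"slurp"]
PROGRAM_PATTERNS = [r"^python-requests", r"^curl/", r"libcurl", r"^Wget/", r"^Java/"]
BROWSER_PATTERNS = [r"^Mozilla/", r"^Opera/"]


def _matches(patterns, user_agent):
    return any(re.search(p, user_agent, re.IGNORECASE) for p in patterns)


def classify(user_agent):
    """Label a User-Agent as 'robot', 'program', or 'human'.

    Returns None for null, empty, or non-identifiable records, which
    are excluded from the analysis.
    """
    if user_agent is None or not user_agent.strip():
        return None
    # Robots are tested first because many crawlers also send a
    # browser-like "Mozilla/..." prefix.
    if _matches(ROBOT_PATTERNS, user_agent):
        return "robot"
    if _matches(PROGRAM_PATTERNS, user_agent):
        return "program"
    if _matches(BROWSER_PATTERNS, user_agent):
        return "human"
    return None


def count_user_requests(user_agents):
    """Count human and program requests, dropping robots and invalid records."""
    labels = (classify(ua) for ua in user_agents)
    return Counter(label for label in labels if label in ("human", "program"))
```

The order of the checks is a design choice in this sketch: a crawler that identifies itself with a browser-style prefix is still labeled a robot because the robot patterns are evaluated before the browser patterns.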
| Description: | Data Citations
Journal citations for data series were collected by using Google Scholar to search for the “shoulder” of the data package DOI, which is a unique substring found at the start of all DOIs registered to EDI. A small number of “citations” not found by Google Scholar were added based on author assurance of data package use. The set of citations was restricted to the years 2013 through 2021, and any citation of a data package not in the main collection was removed. The validity of data package citations was confirmed by accessing the publication through the University of Wisconsin library system. A total of 2,595 data package citations were confirmed. As with download events, the data package citations were summed into bins based on the data package identifier and again used as proxies for the reuse of thematic and time-span data. |
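The restriction and binning of confirmed citations can be sketched as follows. This is a minimal Python illustration, not the authors' code. The shoulder value `10.6073/pasta/`, the record fields (`doi`, `year`, `package_id`), and the package identifiers are assumptions made for this example; the Google Scholar search and the manual validation step are not modeled.

```python
from collections import Counter

# Assumed value of the DOI "shoulder" shared by DOIs registered to EDI.
EDI_SHOULDER = "10.6073/pasta/"


def bin_citations(citations, main_collection, first_year=2013, last_year=2021):
    """Count citations per data package identifier.

    Keeps only records whose DOI starts with the shoulder, whose year is
    within the inclusive range, and whose package is in the main collection.
    """
    counts = Counter()
    for record in citations:
        if not record["doi"].lower().startswith(EDI_SHOULDER):
            continue
        if not (first_year <= record["year"] <= last_year):
            continue
        if record["package_id"] not in main_collection:
            continue
        counts[record["package_id"]] += 1
    return counts


# Hypothetical records, for illustration only.
example_citations = [
    {"doi": "10.6073/pasta/aaa", "year": 2019, "package_id": "pkg-a"},
    {"doi": "10.6073/pasta/aaa", "year": 2021, "package_id": "pkg-a"},
    {"doi": "10.6073/pasta/bbb", "year": 2012, "package_id": "pkg-b"},
    {"doi": "10.6073/pasta/ccc", "year": 2020, "package_id": "pkg-c"},
    {"doi": "10.9999/other/ddd", "year": 2020, "package_id": "pkg-a"},
]
example_counts = bin_citations(example_citations, main_collection={"pkg-a", "pkg-b"})
```

In this example only the two in-range citations of `pkg-a` are counted: the `pkg-b` record is outside the year range, `pkg-c` is not in the assumed main collection, and the last record does not carry the shoulder.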