Data Package Metadata

Data and code for EDI overview paper, data collection characteristics, FAIR evaluation, downloads, and citations

General Information
Data Package:
Local Identifier:edi.1175.1
Title:Data and code for EDI overview paper, data collection characteristics, FAIR evaluation, downloads, and citations
Alternate Identifier:DOI PLACE HOLDER
Abstract:
The Environmental Data Initiative (EDI) is a trustworthy, stable data repository and data management support organization for the environmental scientist. EDI provides tools and support that allow the environmental researcher to easily integrate data publishing into the research workflow. Almost ten years after EDI went into production, these data and code were used to provide a general description of EDI’s data collection, its data management philosophy, and its placement in the repository landscape. They show how comprehensive metadata and the repository infrastructure lead to highly findable, accessible, interoperable, and reusable (FAIR) data by evaluating compliance with specific community-proposed FAIR criteria. Finally, they provide measures and patterns of data (re)use, offering assurance that EDI is fulfilling its stated premise.
Publication Date:2022-07-21
For more information:
Visit: DOI PLACE HOLDER

Time Period
Begin:
2022
End:
2022

People and Organizations
Contact:Gries, Corinna (Environmental Data Initiative)
Creator:Gries, Corinna (Environmental Data Initiative)
Creator:Servilla, Mark (Environmental Data Initiative)

Data Entities
Data Table Name:
datasetDurationKeywords
Description:
Length of observation as indicated by begin and end dates, list of keywords for every dataset in EDI
Data Table Name:
dl_cit_meta_package
Description:
numbers of downloads and citations linked to subjects and length of observation per dataset
Data Table Name:
edi_eml_content_long
Description:
FAIR criteria parsed from EML metadata for each data package
Data Table Name:
geog_distribution
Description:
bounding box and centroid information for each data package
Data Table Name:
keyword_count_word_edit
Description:
most frequently used keywords and the number of datasets they are used to describe
Data Table Name:
keyword_pairs
Description:
most commonly used keywords as they are being used together to describe datasets
Other Name:
r_code
Description:
R code scripts used for analysis
Detailed Metadata

Data Entities


Data Table

Data:https://pasta-s.lternet.edu/package/data/eml/edi/1175/1/7898abeea2417283215c7dc0aabba356
Name:datasetDurationKeywords
Description:Length of observation as indicated by begin and end dates, list of keywords for every dataset in EDI
Number of Records:8605
Number of Columns:4

Table Structure
Object Name:datasetDurationKeywords.csv
Size:1853054 byte
Authentication:08173bf6f5e0c958bd6cb0a2bda807aa Calculated By MD5
Text Format:
Number of Header Lines:1
Record Delimiter:\r\n
Orientation:column
Simple Delimited:
Field Delimiter:,
Quote Character:"
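
The Table Structure fields above (MD5 authentication, one header line, comma field delimiter, double-quote character, \r\n record delimiter) are enough to check and parse the file. The following is a minimal Python sketch, assuming the CSV has already been downloaded and is UTF-8 encoded (the encoding is not stated in this metadata). The helper names md5_of and read_table and the sample values are illustrative and not part of the data package.

```python
import csv
import hashlib
import io


def md5_of(data: bytes) -> str:
    """Return the hex MD5 digest, comparable to the Authentication field."""
    return hashlib.md5(data).hexdigest()


def read_table(data: bytes, missing: str = "NA"):
    """Parse a comma-delimited, double-quoted table with one header line.

    Cells equal to the missing value code are returned as None.
    UTF-8 is an assumption; the metadata does not state the encoding.
    """
    # newline="" lets the csv module handle the \r\n record delimiter itself.
    text = io.StringIO(data.decode("utf-8"), newline="")
    reader = csv.DictReader(text, delimiter=",", quotechar='"')
    return [
        {key: (None if value == missing else value) for key, value in row.items()}
        for row in reader
    ]


# Tiny stand-in for datasetDurationKeywords.csv (values are made up).
sample = (
    b'dl_dataset_id,duration,endYear,keywords\r\n'
    b'"edi.1",3,2020,"lakes, ice"\r\n'
    b'"edi.2",NA,NA,NA\r\n'
)
rows = read_table(sample)
digest = md5_of(sample)
```

For a real download, the value returned by md5_of would be compared with the Authentication string listed for the file, and len(rows) with Number of Records.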

Table Column Descriptions
Column Name:dl_dataset_id
Definition:Basic dataset ID from EDI without version
Storage Type:string
Measurement Type:nominal
Measurement Values Domain:Definition: Basic dataset ID from EDI without version

Column Name:duration
Definition:Difference between begin date and end date from metadata in years
Storage Type:float
Measurement Type:ratio
Measurement Values Domain:Unit: number; Type: integer
Missing Value Code:NA (not available)

Column Name:endYear
Definition:End year of observation from EML metadata
Storage Type:float
Measurement Type:ratio
Measurement Values Domain:Unit: nominalYear; Type: integer
Missing Value Code:NA (not available)

Column Name:keywords
Definition:Comma separated list of keywords from EML metadata
Storage Type:string
Measurement Type:nominal
Measurement Values Domain:Definition: Comma separated list of keywords from EML metadata
Missing Value Code:NA (not available)

Accuracy Report:
Accuracy Assessment:
Coverage:
Methods:

Data Table

Data:https://pasta-s.lternet.edu/package/data/eml/edi/1175/1/f8c2bcc6588e4ac9949249ba5c4296ad
Name:dl_cit_meta_package
Description:numbers of downloads and citations linked to subjects and length of observation per dataset
Number of Records:8605
Number of Columns:14

Table Structure
Object Name:dl_cit_meta_package.csv
Size:527196 byte
Authentication:480b81319d044d4542bc139462604fa8 Calculated By MD5
Text Format:
Number of Header Lines:1
Record Delimiter:\r\n
Orientation:column
Simple Delimited:
Field Delimiter:,
Quote Character:"

Table Column Descriptions
Column Name:scope
Definition:scope of data package in EDI
Storage Type:string
Measurement Type:nominal
Measurement Values Domain:Definition: scope of data package in EDI

Column Name:dataset_id
Definition:basic dataset ID without version
Storage Type:string
Measurement Type:nominal
Measurement Values Domain:Definition: basic dataset ID without version

Column Name:web_download
Definition:number of manual web downloads
Storage Type:float
Measurement Type:ratio
Measurement Values Domain:Unit: number; Type: integer
Missing Value Code:NA (not available)

Column Name:script_download
Definition:number of downloads initiated by a script or program
Storage Type:float
Measurement Type:ratio
Measurement Values Domain:Unit: number; Type: integer
Missing Value Code:NA (not available)

Column Name:num_citations
Definition:number of journal articles, theses, or reports citing this dataset
Storage Type:float
Measurement Type:ratio
Measurement Values Domain:Unit: number; Type: integer
Missing Value Code:NA (not available)

Column Name:duration
Definition:length of observation in years
Storage Type:float
Measurement Type:ratio
Measurement Values Domain:Unit: number; Type: integer
Missing Value Code:NA (not available)

Column Name:endYear
Definition:end year of observations from metadata
Storage Type:float
Measurement Type:ratio
Measurement Values Domain:Unit: nominalYear; Type: integer
Missing Value Code:NA (not available)

Column Name:biodiversity
Definition:whether or not a keyword in the group of biodiversity was found
Storage Type:string
Measurement Type:nominal
Measurement Values Domain:Allowed Values and Definitions (Enumerated Domain)
Code:0 Definition:biodiversity keywords were not detected
Code:1 Definition:biodiversity keywords were detected

Column Name:disturbance
Definition:whether or not a keyword in the group of disturbance was found
Storage Type:string
Measurement Type:nominal
Measurement Values Domain:Allowed Values and Definitions (Enumerated Domain)
Code:0 Definition:disturbance keywords were not detected
Code:1 Definition:disturbance keywords were detected

Column Name:primaryProd
Definition:whether or not a keyword in the group of primary production was found
Storage Type:string
Measurement Type:nominal
Measurement Values Domain:Allowed Values and Definitions (Enumerated Domain)
Code:0 Definition:primary production keywords were not detected
Code:1 Definition:primary production keywords were detected

Column Name:orgMatter
Definition:whether or not a keyword in the group of organic matter was found
Storage Type:string
Measurement Type:nominal
Measurement Values Domain:Allowed Values and Definitions (Enumerated Domain)
Code:0 Definition:organic matter keywords were not detected
Code:1 Definition:organic matter keywords were detected

Column Name:inorgNutr
Definition:whether or not a keyword in the group of inorganic nutrients was found
Storage Type:string
Measurement Type:nominal
Measurement Values Domain:Allowed Values and Definitions (Enumerated Domain)
Code:0 Definition:inorganic nutrient keywords were not detected
Code:1 Definition:inorganic nutrient keywords were detected

Column Name:abiotic
Definition:whether or not a keyword in the group of abiotic conditions was found
Storage Type:string
Measurement Type:nominal
Measurement Values Domain:Allowed Values and Definitions (Enumerated Domain)
Code:0 Definition:abiotic condition keywords were not detected
Code:1 Definition:abiotic condition keywords were detected

Column Name:allClass
Definition:whether or not the dataset was classified into the main categories
Storage Type:string
Measurement Type:nominal
Measurement Values Domain:Allowed Values and Definitions (Enumerated Domain)
Code:0 Definition:not classified
Code:1 Definition:classified

Accuracy Report:
Accuracy Assessment:
Coverage:
Methods:
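
The 0/1 class columns of dl_cit_meta_package make it possible to total downloads and citations per thematic class. The sketch below is one way to do that in Python and is not taken from the package's R code. It assumes rows have been read as dictionaries of strings, and it treats NA cells as 0, which is a simplifying assumption. The sample rows are made up.

```python
CLASS_COLUMNS = ["biodiversity", "disturbance", "primaryProd",
                 "orgMatter", "inorgNutr", "abiotic"]
MEASURES = ["web_download", "script_download", "num_citations"]


def to_number(cell):
    """Convert a CSV cell to float; NA or empty cells count as 0 (an assumption)."""
    if cell is None or cell in ("", "NA"):
        return 0.0
    return float(cell)


def totals_by_class(rows):
    """Sum each measure over the datasets flagged with 1 in each class column."""
    totals = {c: dict.fromkeys(MEASURES, 0.0) for c in CLASS_COLUMNS}
    for row in rows:
        for c in CLASS_COLUMNS:
            if str(row.get(c)) == "1":
                for m in MEASURES:
                    totals[c][m] += to_number(row.get(m))
    return totals


# Made-up rows in the shape of dl_cit_meta_package.csv.
rows = [
    {"dataset_id": "1", "web_download": "10", "script_download": "2",
     "num_citations": "1", "biodiversity": "1", "abiotic": "1"},
    {"dataset_id": "2", "web_download": "NA", "script_download": "5",
     "num_citations": "0", "biodiversity": "1", "abiotic": "0"},
]
totals = totals_by_class(rows)
```

Because a dataset can carry keywords from more than one group, the per-class totals overlap and should not be added together.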

Data Table

Data:https://pasta-s.lternet.edu/package/data/eml/edi/1175/1/66dd80f34eb2b78a57380130af7d389d
Name:edi_eml_content_long
Description:FAIR criteria parsed from EML metadata for each data package
Number of Records:395505
Number of Columns:4

Table Structure
Object Name:edi_eml_content_long.csv
Size:22795679 byte
Authentication:706a95267d94c7e62d2cd3588209a7bd Calculated By MD5
Text Format:
Number of Header Lines:1
Record Delimiter:\r\n
Orientation:column
Simple Delimited:
Field Delimiter:,
Quote Character:"

Table Column Descriptions
Column Name:scope
Definition:scope of data package in EDI
Storage Type:string
Measurement Type:nominal
Measurement Values Domain:Definition: scope of data package in EDI

Column Name:eml_id
Definition:dataset id without version
Storage Type:string
Measurement Type:nominal
Measurement Values Domain:Definition: dataset id without version

Column Name:parameter
Definition:name of the parameter evaluated
Storage Type:string
Measurement Type:nominal
Measurement Values Domain:Definition: name of the parameter evaluated

Column Name:value
Definition:number of positive detections or other measure (e.g. date) for parameter measured
Storage Type:string
Measurement Type:nominal
Measurement Values Domain:Definition: number of positive detections or other measure (e.g. date) for parameter measured
Missing Value Code:NA (not available)

Accuracy Report:
Accuracy Assessment:
Coverage:
Methods:
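
Since edi_eml_content_long stores one (parameter, value) pair per row, many analyses first need one record per data package. The following Python sketch shows a long-to-wide pivot keyed on scope and eml_id. The parameter names in the sample are hypothetical placeholders; the actual names are those found in the parameter column.

```python
from collections import defaultdict


def pivot_wide(rows):
    """Collect the long-format rows into one dict of parameters per package.

    Returns {(scope, eml_id): {parameter: value, ...}}. If a parameter
    repeats for the same package, the last value wins.
    """
    wide = defaultdict(dict)
    for row in rows:
        key = (row["scope"], row["eml_id"])
        wide[key][row["parameter"]] = row["value"]
    return dict(wide)


# Made-up rows; "param_a" and "param_b" are hypothetical parameter names.
long_rows = [
    {"scope": "edi", "eml_id": "1", "parameter": "param_a", "value": "3"},
    {"scope": "edi", "eml_id": "1", "parameter": "param_b", "value": "2020-01-01"},
    {"scope": "edi", "eml_id": "2", "parameter": "param_a", "value": "0"},
]
wide = pivot_wide(long_rows)
```

The value column is kept as a string because, per its definition, it can hold counts or other measures such as dates.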

Data Table

Data:https://pasta-s.lternet.edu/package/data/eml/edi/1175/1/20485c3ab6910023a5c345f21a165837
Name:geog_distribution
Description:bounding box and centroid information for each data package
Number of Records:28752
Number of Columns:8

Table Structure
Object Name:geog_distribution.csv
Size:2601346 byte
Authentication:472527d0b04b9ba5be45f947fa08bff1 Calculated By MD5
Text Format:
Number of Header Lines:1
Record Delimiter:\r\n
Orientation:column
Simple Delimited:
Field Delimiter:,
Quote Character:"

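Table Column Descriptions
Column Name:scope
Definition:scope of data package in EDI
Storage Type:string
Measurement Type:nominal
Measurement Values Domain:Definition: scope of data package in EDI

Column Name:eml_id
Definition:dataset id without version
Storage Type:string
Measurement Type:nominal
Measurement Values Domain:Definition: dataset id without version

Column Name:north
Definition:north bounding box
Storage Type:float
Measurement Type:ratio
Measurement Values Domain:Unit: degree; Type: real
Missing Value Code:NA (not available)

Column Name:south
Definition:south bounding box
Storage Type:float
Measurement Type:ratio
Measurement Values Domain:Unit: degree; Type: real
Missing Value Code:NA (not available)

Column Name:east
Definition:east bounding box
Storage Type:float
Measurement Type:ratio
Measurement Values Domain:Unit: degree; Type: real
Missing Value Code:NA (not available)

Column Name:west
Definition:west bounding box
Storage Type:float
Measurement Type:ratio
Measurement Values Domain:Unit: degree; Type: real
Missing Value Code:NA (not available)

Column Name:latitude
Definition:centroid latitude
Storage Type:float
Measurement Type:ratio
Measurement Values Domain:Unit: degree; Type: real
Missing Value Code:NA (not available)

Column Name:longitude
Definition:centroid longitude
Storage Type:float
Measurement Type:ratio
Measurement Values Domain:Unit: degree; Type: real
Missing Value Code:NA (not available)

Accuracy Report:
Accuracy Assessment:
Coverage:
Methods: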
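
The metadata does not say how the centroid columns were derived from the bounding box. One simple possibility, shown here only as an assumption, is the midpoint of the bounding coordinates. This sketch ignores boxes that cross the antimeridian and any map-projection effects.

```python
def centroid(north, south, east, west):
    """Midpoint of a bounding box in decimal degrees.

    Returns (latitude, longitude), or None if any bound is missing.
    Assumes the box does not cross the antimeridian.
    """
    if None in (north, south, east, west):
        return None
    return ((north + south) / 2.0, (east + west) / 2.0)


# Example with the bounding coordinates listed for this data package.
lat, lon = centroid(78.7, -85.62, 176.59, -170.36)
```

For these bounds the midpoint is roughly latitude -3.46 and longitude 3.115.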

Data Table

Data:https://pasta-s.lternet.edu/package/data/eml/edi/1175/1/bf26b71caef17af74113a5c6f3c686b0
Name:keyword_count_word_edit
Description:most frequently used keywords and the number of datasets they are used to describe
Number of Records:818
Number of Columns:2

Table Structure
Object Name:keyword_count_word_edit.csv
Size:14449 byte
Authentication:e1900ee85b89fbc8ecf1c1b66b054f75 Calculated By MD5
Text Format:
Number of Header Lines:1
Record Delimiter:\r\n
Orientation:column
Simple Delimited:
Field Delimiter:,
Quote Character:"

Table Column Descriptions
Column Name:single_kw
Definition:keyword in question
Storage Type:string
Measurement Type:nominal
Measurement Values Domain:Definition: keyword in question

Column Name:count
Definition:number of datasets using the keyword
Storage Type:float
Measurement Type:ratio
Measurement Values Domain:Unit: number; Type: integer
Missing Value Code:NA (not available)

Accuracy Report:
Accuracy Assessment:
Coverage:
Methods:

Data Table

Data:https://pasta-s.lternet.edu/package/data/eml/edi/1175/1/91c235409ce6baea8df3bb270ca2ff78
Name:keyword_pairs
Description:most commonly used keywords as they are being used together to describe datasets
Number of Records:640603
Number of Columns:2

Table Structure
Object Name:keyword_pairs.csv
Size:17498915 byte
Authentication:45b44032639a179ff55f649c61cfe83b Calculated By MD5
Text Format:
Number of Header Lines:1
Record Delimiter:\r\n
Orientation:column
Simple Delimited:
Field Delimiter:,
Quote Character:"

Table Column Descriptions
Column Name:from
Definition:keyword one
Storage Type:string
Measurement Type:nominal
Measurement Values Domain:Definition: keyword one

Column Name:to
Definition:keyword two
Storage Type:string
Measurement Type:nominal
Measurement Values Domain:Definition: keyword two

Accuracy Report:
Accuracy Assessment:
Coverage:
Methods:
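
A from/to table like keyword_pairs can be derived from a comma-separated keyword list by emitting every unordered pair of keywords that describe the same dataset. The Python sketch below illustrates the idea. Lowercasing, de-duplicating, and sorting the keywords are assumptions made here for a stable result, and may differ from what the package's R code does.

```python
from collections import Counter
from itertools import combinations


def keyword_pairs(keyword_field):
    """Return all unordered keyword pairs from one comma-separated field."""
    keywords = sorted(
        {k.strip().lower() for k in keyword_field.split(",") if k.strip()}
    )
    return list(combinations(keywords, 2))


def pair_counts(keyword_fields):
    """Count how many datasets use each keyword pair together."""
    counts = Counter()
    for field in keyword_fields:
        counts.update(keyword_pairs(field))
    return counts


# Made-up keyword fields for three datasets.
counts = pair_counts(["lakes, ice", "ice, Lakes, temperature", "soil"])
```

A dataset with n distinct keywords contributes n(n-1)/2 pairs, which explains why a pair table grows much faster than the keyword list itself.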

Non-Categorized Data Resource

Name:r_code
Entity Type:zip
Description:R code scripts used for analysis
Physical Structure Description:
Object Name:r_code.zip
Size:16736 byte
Authentication:00629c8ccc08c4137060e36c999933c4 Calculated By MD5
Externally Defined Format:
Format Name:zip
Data:https://pasta-s.lternet.edu/package/data/eml/edi/1175/1/a4ade30dfe27e12b4a9c1efab328ba41

Data Package Usage Rights

This data package is released to the "public domain" under Creative Commons CC0 1.0 "No Rights Reserved" (see: https://creativecommons.org/publicdomain/zero/1.0/). It is considered professional etiquette to provide attribution of the original work if this data package is shared in whole or by individual components. A generic citation is provided for this data package on the website https://portal.edirepository.org (herein "website") in the summary metadata page. Communication (and collaboration) with the creators of this data package is recommended to prevent duplicate research or publication. This data package (and its components) is made available "as is" and with no warranty of accuracy or fitness for use. The creators of this data package and the website shall not be liable for any damages resulting from misinterpretation or misuse of the data package or its components. Periodic updates of this data package may be available from the website. Thank you.

Keywords

By Thesaurus:
(No thesaurus): metadata, FAIR data, EDI data holdings, data reuse, open science

Methods and Protocols

These methods, instrumentation and/or protocols apply to all data in this dataset:

Methods and protocols used in the collection of this data package
Description:
Three primary sets of data were analyzed: the first consists of the EML metadata that accompanies each data package in EDI’s main collection; the second is a summary of download events for individual data files; and the third consists of citations of data archived in the EDI repository obtained by a Google Scholar search.
Description:
EDI’s data collection and FAIR analysis
There is no universal definition of a data package (Lowenberg et al. 2019), and complete agreement does not exist even within a single community (Gries et al. 2021), which has ramifications for the following analyses. In environmental sciences, it is important that data packages are designed to document the context of a specific research project and data collection with metadata, data, and code. Hence, in some cases, data encompass a combination of thematically different observations that are needed to fully comprehend the context of a particular research study (e.g., the abiotic conditions during sampling and concurrent observations of the biota). Alternatively, data may be separated into several data packages according to different aspects of a study. Following the above example, one package may contain meteorological data while a different package contains observations of the biota. In other cases, observations taken over time may be published as a single data series that is regularly updated and versioned or as separate packages for each observation period (e.g., annually). Similarly, observations spanning more than one location may be split into different data packages along spatial criteria. High-volume data may also be separated into individual packages to simplify management, download, and processing. This heterogeneity should be considered when interpreting the following analyses, which are based on numbers of data series.
Metadata for the approximately 9,000 data series in EDI’s main collection (newest revisions only) were analyzed for specific attributes, including keywords, start and end dates of the data collection period, and the sampling locations. The analysis used the R statistical programming language to parse attribute information from the metadata. This information was then recorded into a table of key-value pairs for keyword analysis, into time-period bins for temporal analysis, or into latitude/longitude pairs for spatial analysis. These data and the R source code are published in the EDI data repository (Gries et al. 2022).
The set of metadata was then processed to determine compliance with criteria identified as being representative of FAIR data. The two sources of FAIR criteria used in this analysis are the FAIR Data Maturity Model proposed by Bahim et al. (2020) and the MetaDIG criteria (Jones and Slaughter 2019) adopted by DataONE. A detailed discussion of how FAIR criteria were mapped to EML attributes may be found in Smith (2022). In total, 46 criteria combined from the two approaches were analyzed to determine their presence in EDI’s metadata. Again, this analysis was performed by using R, with results being recorded into criteria-based bins.
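
The attribute extraction described above was done in R; the following Python sketch only illustrates the same idea on a tiny hand-written EML fragment. The element names (keyword, beginDate, endDate, calendarDate, northBoundingCoordinate, and so on) are standard EML, while the packageId, the sample values, and the three presence checks at the end are hypothetical and do not correspond to any of the 46 criteria.

```python
import xml.etree.ElementTree as ET

# Hand-written EML fragment for illustration; all values are made up.
SAMPLE_EML = """<eml:eml xmlns:eml="https://eml.ecoinformatics.org/eml-2.2.0" packageId="edi.0.1">
  <dataset>
    <title>Example</title>
    <keywordSet><keyword>lakes</keyword><keyword>ice</keyword></keywordSet>
    <coverage>
      <geographicCoverage>
        <geographicDescription>Example lake</geographicDescription>
        <boundingCoordinates>
          <westBoundingCoordinate>-89.7</westBoundingCoordinate>
          <eastBoundingCoordinate>-89.6</eastBoundingCoordinate>
          <northBoundingCoordinate>43.2</northBoundingCoordinate>
          <southBoundingCoordinate>43.1</southBoundingCoordinate>
        </boundingCoordinates>
      </geographicCoverage>
      <temporalCoverage>
        <rangeOfDates>
          <beginDate><calendarDate>2001-01-01</calendarDate></beginDate>
          <endDate><calendarDate>2010-12-31</calendarDate></endDate>
        </rangeOfDates>
      </temporalCoverage>
    </coverage>
  </dataset>
</eml:eml>"""


def extract_attributes(eml_text):
    """Pull keywords, date range, and bounding box out of an EML document."""
    root = ET.fromstring(eml_text)

    def number(path):
        text = root.findtext(path)
        return float(text) if text else None

    return {
        "keywords": [k.text.strip() for k in root.iter("keyword") if k.text],
        "begin": root.findtext(".//beginDate/calendarDate"),
        "end": root.findtext(".//endDate/calendarDate"),
        "north": number(".//northBoundingCoordinate"),
        "south": number(".//southBoundingCoordinate"),
        "east": number(".//eastBoundingCoordinate"),
        "west": number(".//westBoundingCoordinate"),
    }


def presence_checks(attrs):
    """Hypothetical presence checks (1 = found, 0 = not found)."""
    return {
        "has_keywords": int(bool(attrs["keywords"])),
        "has_temporal_coverage": int(bool(attrs["begin"] and attrs["end"])),
        "has_geographic_coverage": int(attrs["north"] is not None),
    }


attrs = extract_attributes(SAMPLE_EML)
checks = presence_checks(attrs)
```

Recording each check as a (parameter, value) row per package would give a long-format table of the kind described for edi_eml_content_long.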
Description:
Download Events
Download “request” events for data files were obtained from the repository audit system database. These events are annotated with the downloaded data file identifier, an event date-timestamp, and the requesting HTTP User-Agent record. To analyze only user-initiated requests for data files, download events that did not contain a valid User-Agent record (i.e., the record was null or contained non-identifiable content) were excluded. The User-Agent record was used to categorize the originating actor of the request as either a “robot”, “human”, or “program”. Download events identified as a “robot” (i.e., initiated by a search engine or other web crawler) were filtered out by matching the string content found in the HTTP User-Agent record with known robot string patterns that are published by the Make Data Count project (Cousijn et al., 2019). The remaining download events were further labeled, also based on the User-Agent strings, as either “human” (i.e., initiated through a web browser) or “program” (i.e., initiated by a computer program). Human requests for data were identified by matching the User-Agent string to known web browser labels, while program requests were identified by User-Agent strings that are associated with the programming environment being used to access the repository web-service API. The approach used to identify robots in this research is not foolproof but does serve the needs of this analysis.
Using the above approach, download events for 2021 were filtered and categorized. Of nearly 3 million download events, 180,000 were identified as either human or program-initiated requests for data. Each download event record lists the data entity, which was used to identify the corresponding data package from which data were downloaded. Once the data package was identified, its metadata were analyzed to determine the thematic classification of the data and the temporal range of the data-collection time span.
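
The User-Agent categorization described above can be sketched as a short rule cascade. The Python example below is illustrative only: the three regular expressions are hypothetical stand-ins, not the Make Data Count robot list or the browser and program labels actually used in the analysis. Robots are tested first because many crawler User-Agent strings also contain browser tokens such as "Mozilla".

```python
import re

# Hypothetical patterns for illustration; the analysis matched against the
# robot patterns published by the Make Data Count project instead.
ROBOT = re.compile(r"bot|crawl|spider", re.IGNORECASE)
BROWSER = re.compile(r"mozilla|chrome|safari|firefox", re.IGNORECASE)
PROGRAM = re.compile(r"python|curl|wget|java", re.IGNORECASE)


def classify(user_agent):
    """Label a User-Agent as 'robot', 'human', or 'program'.

    Returns None for null, empty, or unrecognized records, which
    corresponds to events excluded from the analysis.
    """
    if not user_agent or not user_agent.strip():
        return None
    if ROBOT.search(user_agent):
        return "robot"
    if BROWSER.search(user_agent):
        return "human"
    if PROGRAM.search(user_agent):
        return "program"
    return None


labels = [
    classify("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"),
    classify("Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101 Firefox/102.0"),
    classify("python-requests/2.28.1"),
    classify(""),
]
```

Only events labeled "human" or "program" would then be counted as downloads.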
Description:
Data Citations
Journal citations for data series were collected by using Google Scholar to search for the “shoulder” of the data package DOI, which is a unique substring found at the start of all DOIs registered to EDI. A small number of “citations” not found by Google Scholar were added based on author assurance of data package use. The set of citations was restricted to the years 2013 through 2021, and any citation of a data package not in the main collection was removed. The validity of data package citations was confirmed by accessing the publication through the University of Wisconsin library system. A total of 2,595 data package citations were confirmed. Similar to download events, the data package citations were summed into bins based on the data package identifier and again used as proxies for the reuse of thematic and time-span data.
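
The binning step described above can be sketched as follows. This Python example is an illustration under stated assumptions, not the authors' procedure: the shoulder value "10.6073/pasta/" is assumed here because this metadata does not state it, the DOI suffix is assumed to be hexadecimal, and the DOI-to-dataset lookup and sample citations are made up.

```python
import re
from collections import Counter

SHOULDER = "10.6073/pasta/"  # assumed EDI DOI shoulder; not stated in this metadata
DOI_PATTERN = re.compile(re.escape(SHOULDER) + r"[0-9a-f]+", re.IGNORECASE)


def find_edi_dois(text):
    """Return lowercase DOIs in the text that start with the shoulder."""
    return [match.lower() for match in DOI_PATTERN.findall(text)]


def bin_citations(citations, doi_to_dataset, first_year=2013, last_year=2021):
    """Count citing works per dataset identifier.

    citations is a list of (year, reference_text) pairs. DOIs missing
    from doi_to_dataset (e.g. packages outside the main collection)
    are skipped, as are works outside the year range.
    """
    counts = Counter()
    for year, text in citations:
        if not first_year <= year <= last_year:
            continue
        for doi in set(find_edi_dois(text)):
            dataset = doi_to_dataset.get(doi)
            if dataset is not None:
                counts[dataset] += 1
    return counts


# Made-up lookup and citing works.
lookup = {"10.6073/pasta/abc123": "edi.1"}
citing = [
    (2020, "Data: https://doi.org/10.6073/PASTA/ABC123."),
    (2012, "doi:10.6073/pasta/abc123"),
    (2021, "doi:10.6073/pasta/ffff00"),
]
counts = bin_citations(citing, lookup)
```

The resulting per-dataset counts play the same role as the num_citations column in dl_cit_meta_package.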

People and Organizations

Publishers:
Organization:Environmental Data Initiative
Email Address:
info@edirepository.org
Web Address:
https://edirepository.org
Id:https://ror.org/0330j0z60
Creators:
Individual: Corinna Gries
Organization:Environmental Data Initiative
Email Address:
cgries@wisc.edu
Id:https://orcid.org/0000-0002-9091-6543
Individual: Mark Servilla
Organization:Environmental Data Initiative
Email Address:
mark.servilla@gmail.com
Id:https://orcid.org/0000-0002-3192-7306
Contacts:
Individual: Corinna Gries
Organization:Environmental Data Initiative
Email Address:
cgries@wisc.edu
Id:https://orcid.org/0000-0002-9091-6543

Temporal, Geographic and Taxonomic Coverage

Temporal, Geographic and/or Taxonomic information that applies to all data in this dataset:

Time Period
Begin:
2022
End:
2022
Geographic Region:
Description:EDI's data holdings describe worldwide collections
Bounding Coordinates:
Northern:78.7
Southern:-85.62
Western:-170.36
Eastern:176.59

Project

Parent Project Information:

Title:Environmental Data Initiative: Sustaining the Legacy of Scientific Data
Personnel:
Individual: Corinna Gries
Organization:Environmental Data Initiative
Email Address:
cgries@wisc.edu
Id:https://orcid.org/0000-0002-9091-6543
Role:PI
Individual: Mark Servilla
Organization:Environmental Data Initiative
Email Address:
mark.servilla@gmail.com
Id:https://orcid.org/0000-0002-3192-7306
Role:PI
Abstract:The Environmental Data Initiative (EDI) facilitates the publication of environmental data generated by publicly funded research projects. With a mission to ensure the long-term viability and legacy of publicly funded scientific data, EDI is committed to making environmental data Findable, Accessible, Interoperable, and Reusable (FAIR). EDI provides support, training, and resources to help archive and publish high-quality data and metadata, providing accountability and transparency to data providers, while opening the door to answering new questions through Big Data analyses. EDI is actively engaged in the national and international community of data curators to promote data management best practices and stewardship.
Additional Award Information:
Funder:National Science Foundation
Number:1931174
Title:Collaborative Research: Environmental Data Initiative: Sustaining the Legacy of Scientific Data
URL:https://edirepository.org/
Additional Award Information:
Funder:National Science Foundation
Number:1931143
Title:Collaborative Research: Environmental Data Initiative: Sustaining the Legacy of Scientific Data
URL:https://edirepository.org/

Maintenance

Maintenance:
Description:complete
Frequency:
Other Metadata

Additional Metadata

additionalMetadata
        |___element 'metadata'
              |___element 'emlEditor'
                       \___attribute 'app' = 'ezEML'
                       \___attribute 'release' = '2022.06.04'

EDI is a collaboration between the University of New Mexico and the University of Wisconsin – Madison, Center for Limnology.