Data Package Metadata

Results of semantic queries for "carbon cycling" for datasets in the DataONE catalog

General Information
Data Package:
Local Identifier:edi.1368.1
Title:Results of semantic queries for "carbon cycling" for datasets in the DataONE catalog
Alternate Identifier:DOI PLACE HOLDER
Abstract:

DataONE (https://www.dataone.org) is a federation of institutions involved with the earth and environmental sciences that share data through common cyberinfrastructure. In 2016, the DataONE project quantified the utility of semantic queries by measuring the precision and recall of searches for relevant datasets in its catalog. Precision is the proportion of retrieved results that are relevant, and recall is the proportion of all relevant data in the repository that is retrieved (see Methods).

This dataset contains the queries and results of that study. Four data tables are included. The first is a table of the 10 queries, which were formatted in several ways, including natural language, text strings (for plain-text searches of various parts of the metadata), and URIs for measurements in the EcoSystem Ontology (ECSO). The second table contains 994 relevant datasets in the DataONE catalog, with a column for each of the ten queries holding a boolean value that indicates whether the dataset matches that query. The remaining two tables hold the raw and summarized results of the query tests. A fifth entity contains the zipped code (R language) used to perform the queries in the DataONE system.

When the ten queries were run against approximately 1000 datasets (in October 2016), precision ranged from 0% to 50% and recall from 0% to 100%, indicating that traditional searches may sometimes be adequate to return all relevant data in a corpus, but results can be erratic and inconsistent, with potentially large returns of irrelevant data in the result set.

Publication Date:2023-03-08
For more information:
Visit: DOI PLACE HOLDER

Time Period
Date:
2016

People and Organizations
Contact:O'Brien, Margaret (University of California, Santa Barbara)
Creator:O'Brien, Margaret (University of California, Santa Barbara)
Creator:Jones, Matthew (National Center for Ecological Analysis and Synthesis, Director of Computing)
Creator:Schildhauer, Mark (National Center for Ecological Analysis and Synthesis)
Creator:Hou, Sophie (The Ronin Institute for Independent Scholarship)
Creator:Mecum, Bryce 
Creator:McCusker, Jamie (Rensselaer Polytechnic Institute)
Creator:McGuinness, Deborah (Rensselaer Polytechnic Institute)

Data Entities
Data Table Name:
carbon cycling queries
Description:
Queries related to carbon cycling, for testing semantic annotation, DataONE Semantics project, 2016 (use case 52). Contains query fragments for 3 text searches (within user-contributed metadata) and 4 methods of annotation with ontology URIs
Data Table Name:
Test corpus F ground truth, carbon flux queries
Description:
Test corpus of datasets in DataONE semantic query tests. Contains dataset ids, booleans for each of the 10 queries
Data Table Name:
Raw data, DataONE semantic PR testing
Description:
Raw data (hits) for DataONE's semantic precision and recall testing, Run 4, October 2016
Data Table Name:
Prec_Recall_Results_20161005
Description:
Summarized precision and recall results from DataONE semantic query testing, 2016
Other Name:
D1_2016_semantic_query_scripts
Description:
Query code for DataONE semantic query testing, 2016
Detailed Metadata

Data Entities


Data Table

Data:https://pasta-s.lternet.edu/package/data/eml/edi/1368/1/68bffd6d1c0092ccd341f40870f0cc1f
Name:carbon cycling queries
Description:Queries related to carbon cycling, for testing semantic annotation, DataONE Semantics project, 2016 (use case 52). Contains query fragments for 3 text searches (within user-contributed metadata) and 4 methods of annotation with ontology URIs
Number of Records:80
Number of Columns:4

Table Structure
Object Name:uc52_queries_all.csv
Size:7758 byte
Authentication:944e6e4ef3084e35e348bae034b60b3c Calculated By MD5
Text Format:
Number of Header Lines:1
Record Delimiter:\n
Orientation:column
Simple Delimited:
Field Delimiter:,
Quote Character:"
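
The table structure above (one header line, comma field delimiter, double-quote quote character, MD5 checksum) can be checked and parsed with the Python standard library. The sketch below is illustrative only: the helper names `verify_md5` and `parse_table`, the UTF-8 encoding, and the sample row are assumptions, not part of the data package.

```python
import csv
import hashlib
import io


def verify_md5(payload, expected_hex):
    """Return True when the MD5 digest of the bytes matches the published checksum."""
    return hashlib.md5(payload).hexdigest() == expected_hex.lower()


def parse_table(payload):
    """Parse a comma-delimited, double-quoted table that has one header line."""
    text = payload.decode("utf-8")  # assumption: the file is UTF-8 encoded
    reader = csv.DictReader(io.StringIO(text, newline=""), delimiter=",", quotechar='"')
    return list(reader)


# Made-up payload using the four documented column names (not real data).
sample = (
    b'Query_ID,SOLR_Index_Type,Query_Frag,Ontology_Set_ID\n'
    b'q1,full_text,"example, quoted fragment",None\n'
)
rows = parse_table(sample)
sample_md5 = hashlib.md5(sample).hexdigest()
print(rows[0]["Query_Frag"], verify_md5(sample, sample_md5))
```

For the real file, the bytes downloaded from the Data URL would be passed to `verify_md5` together with the value listed under Authentication.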

Table Column Descriptions
Column Name:Query_ID
Definition:identifier for this query
Storage Type:string
Measurement Type:nominal
Measurement Values Domain:
Definition:identifier for this query

Column Name:SOLR_Index_Type
Definition:type of query this query fragment is designed for
Storage Type:string
Measurement Type:nominal
Measurement Values Domain:
Allowed Values and Definitions (Enumerated Domain)
Code:nat_lang
Definition:A natural language version of the query, often a sentence
Code:full_text
Definition:Query text to search anywhere in the metadata (EML) record
Code:metacat_ui
Definition:Query fragment is designed to search fields found in the Metacat User Interface
Code:metacat_filtered
Definition:Query fragment is designed to search dataset column-level metadata
Code:bioportal_annot
Definition:Query will search for annotations added by the bioportal algorithm
Code:esor_annot
Definition:Query will search for annotations added by the ESOR algorithm
Code:esor_cosine
Definition:Query will search for annotations added by the ESOR cosine algorithm
Code:manual_annot
Definition:Query to search for annotations added manually

Column Name:Query_Frag
Definition:SOLR query fragment that will be searched, containing a SOLR key and value (text or URI)
Storage Type:string
Measurement Type:nominal
Measurement Values Domain:
Definition:SOLR query fragment that will be searched, containing a SOLR key and value (text or URI)

Column Name:Ontology_Set_ID
Definition:placeholder column, not used in this study
Storage Type:string
Measurement Type:nominal
Measurement Values Domain:
Allowed Values and Definitions (Enumerated Domain)
Code:None
Definition:column not used

Missing Value Code:
Code:None
Explanation:placeholder column, not used
Accuracy Report:        
Accuracy Assessment:        
Coverage:        
Methods:        

Data Table

Data:https://pasta-s.lternet.edu/package/data/eml/edi/1368/1/3f210380ebdf5d4d02b674e087ce58b8
Name:Test corpus F ground truth, carbon flux queries
Description:Test corpus of datasets in DataONE semantic query tests. Contains dataset ids, booleans for each of the 10 queries
Number of Records:925
Number of Columns:12

Table Structure
Object Name:test_corpus_f_groundtruth_carbon_flux_queries.csv
Size:92433 byte
Authentication:f546d6abc800e9c9f2291fed1a06d192 Calculated By MD5
Text Format:
Number of Header Lines:1
Record Delimiter:\n
Orientation:column
Simple Delimited:
Field Delimiter:,
Quote Character:"

Table Column Descriptions
Column Name:Dataset_ID
Definition:Identifier for the dataset
Storage Type:string
Measurement Type:nominal
Measurement Values Domain:
Definition:Identifier for the dataset

Column Name:q1, q2, q3, q4, q5, q6, q7, q8, q9, q10 (ten columns, one per query, all sharing the description below)
Definition:T/F value, for this query
Storage Type:string
Measurement Type:nominal
Measurement Values Domain:
Allowed Values and Definitions (Enumerated Domain)
Code:0
Definition:False
Code:1
Definition:True (dataset matches this query)

Column Name:formatid
Definition:Format of this EML metadata record
Storage Type:string
Measurement Type:nominal
Measurement Values Domain:
Allowed Values and Definitions (Enumerated Domain)
Code:nan
Definition:No schema specified
Code:eml://ecoinformatics.org/eml-2.0.0
Definition:EML 2.0.0
Code:eml://ecoinformatics.org/eml-2.0.1
Definition:EML 2.0.1
Code:eml://ecoinformatics.org/eml-2.1.0
Definition:EML 2.1.0
Code:eml://ecoinformatics.org/eml-2.1.1
Definition:EML 2.1.1
Missing Value Code:                        
Accuracy Report:                        
Accuracy Assessment:                        
Coverage:                        
Methods:                        
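
Because each of the q1 to q10 columns codes relevance as 0 or 1, the ground truth table can be turned into one set of relevant dataset identifiers per query. The following is a minimal sketch under that reading of the column descriptions; the function name `relevant_sets` and the sample rows are invented for illustration.

```python
import csv
import io

QUERY_COLUMNS = [f"q{i}" for i in range(1, 11)]  # q1 .. q10, as documented


def relevant_sets(csv_text):
    """Map each query column to the set of Dataset_ID values coded 1 (True)."""
    relevant = {q: set() for q in QUERY_COLUMNS}
    for row in csv.DictReader(io.StringIO(csv_text)):
        for q in QUERY_COLUMNS:
            if row[q].strip() == "1":
                relevant[q].add(row["Dataset_ID"])
    return relevant


# Made-up rows that follow the documented 12-column layout (not real data).
header = "Dataset_ID," + ",".join(QUERY_COLUMNS) + ",formatid"
sample_text = "\n".join([
    header,
    "ds-A,1,0,0,0,0,0,0,0,0,1,eml://ecoinformatics.org/eml-2.1.1",
    "ds-B,0,0,0,0,0,0,0,0,0,0,nan",
]) + "\n"
truth = relevant_sets(sample_text)
print(sorted(truth["q1"]), sorted(truth["q2"]))
```

Keeping one set per query makes the later comparison against retrieved hits a matter of set intersection and difference.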

Data Table

Data:https://pasta-s.lternet.edu/package/data/eml/edi/1368/1/04aa98db8310ae10e44d118df6d3c84d
Name:Raw data, DataONE semantic PR testing
Description:Raw data (hits) for DataONE's semantic precision and recall testing, Run 4, October 2016
Number of Records:36872
Number of Columns:7

Table Structure
Object Name:RHE85O~H.CSV
Size:4543253 byte
Authentication:0ac94273587e0b6ba5a4d33e3403fcee Calculated By MD5
Text Format:
Number of Header Lines:1
Record Delimiter:\n
Orientation:column
Simple Delimited:
Field Delimiter:,
Quote Character:"

Table Column Descriptions
Column Name:Dataset_ID
Definition:Dataset identifier, returned
Storage Type:string
Measurement Type:nominal
Measurement Values Domain:
Definition:Dataset identifier, returned

Column Name:R_Time
Definition:Run date
Storage Type:dateTime
Measurement Type:dateTime
Measurement Values Domain:
Format:YYYY-MM-DD

Column Name:D1_node
Definition:DataONE node where query was run
Storage Type:string
Measurement Type:nominal
Measurement Values Domain:
Allowed Values and Definitions (Enumerated Domain)
Code:CN-Sanbox2
Definition:Query was run in the coordinating node (CN) named Sandbox 2

Column Name:Query_ID
Definition:Identifier for the query that this dataset was returned for
Storage Type:string
Measurement Type:nominal
Measurement Values Domain:
Definition:Identifier for the query that this dataset was returned for

Column Name:SOLR_Index_Type
Definition:type of query this query fragment is designed for
Storage Type:string
Measurement Type:nominal
Measurement Values Domain:
Allowed Values and Definitions (Enumerated Domain)
Code:nat_lang
Definition:A natural language version of the query, often a sentence
Code:full_text
Definition:Query text to search anywhere in the metadata (EML) record
Code:metacat_ui
Definition:Query fragment is designed to search fields found in the Metacat User Interface
Code:metacat_filtered
Definition:Query fragment is designed to search dataset column-level metadata
Code:bioportal_annot
Definition:Query will search for annotations added by the bioportal algorithm
Code:esor_annot
Definition:Query will search for annotations added by the ESOR algorithm
Code:esor_cosine
Definition:Query will search for annotations added by the ESOR cosine algorithm
Code:manual_annot
Definition:Query to search for annotations added manually

Column Name:Run_ID
Definition:Run identifier (date and time)
Storage Type:string
Measurement Type:nominal
Measurement Values Domain:
Definition:Run identifier (date and time)

Column Name:Ontology_Set_ID
Definition:placeholder column, not used in this study
Storage Type:string
Measurement Type:nominal
Measurement Values Domain:
Definition:placeholder column, not used in this study

Missing Value Code:
Code:None
Explanation:value not used
Accuracy Report:              
Accuracy Assessment:              
Coverage:              
Methods:              
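
Each row of the raw table records one returned dataset for one query and search type, so a natural first step is to group the returned identifiers by (Query_ID, SOLR_Index_Type). The sketch below assumes rows have already been parsed into dictionaries keyed by the documented column names. The function name `group_hits` and the sample values are hypothetical, and collapsing repeated hits into a set is an assumption rather than the documented procedure.

```python
from collections import defaultdict


def group_hits(rows):
    """Collect returned Dataset_ID values per (Query_ID, SOLR_Index_Type) pair."""
    hits = defaultdict(set)
    for row in rows:
        key = (row["Query_ID"], row["SOLR_Index_Type"])
        hits[key].add(row["Dataset_ID"])  # a set counts a repeated hit once
    return dict(hits)


# Made-up hits (not real data); the last row repeats the first.
sample_hits = [
    {"Dataset_ID": "ds-A", "Query_ID": "q1", "SOLR_Index_Type": "full_text"},
    {"Dataset_ID": "ds-B", "Query_ID": "q1", "SOLR_Index_Type": "full_text"},
    {"Dataset_ID": "ds-A", "Query_ID": "q1", "SOLR_Index_Type": "manual_annot"},
    {"Dataset_ID": "ds-A", "Query_ID": "q1", "SOLR_Index_Type": "full_text"},
]
grouped = group_hits(sample_hits)
print(sorted(grouped[("q1", "full_text")]))
```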

Data Table

Data:https://pasta-s.lternet.edu/package/data/eml/edi/1368/1/101b5d042373331f8378396e90692d27
Name:Prec_Recall_Results_20161005
Description:Summarized precision and recall results from DataONE semantic query testing, 2016
Number of Records:80
Number of Columns:7

Table Structure
Object Name:Prec_Recall_Results_20161005.csv
Size:13589 byte
Authentication:2aa2c3e8da57e210b8054e31b547dac7 Calculated By MD5
Text Format:
Number of Header Lines:1
Record Delimiter:\n
Orientation:column
Simple Delimited:
Field Delimiter:,
Quote Character:"

Table Column Descriptions
Column Name:Test_Corpus_ID
Definition:name of the test corpus file
Storage Type:string
Measurement Type:nominal
Measurement Values Domain:
Allowed Values and Definitions (Enumerated Domain)
Code:~/dataone/gitcheckout/semantic-query/lib/ground_truth/test_corpus_f_groundtruth_carbon_flux_queries.csv
Definition:file location (on original server) of the test corpus and ground truth file

Column Name:Query_ID
Definition:ID for the query that was run
Storage Type:string
Measurement Type:nominal
Measurement Values Domain:
Definition:ID for the query that was run

Column Name:SOLR_Index_Type
Definition:type of query this query fragment is designed for
Storage Type:string
Measurement Type:nominal
Measurement Values Domain:
Allowed Values and Definitions (Enumerated Domain)
Code:bioportal_annot
Definition:Query will search for annotations added by the bioportal algorithm
Code:esor_annot
Definition:Query will search for annotations added by the ESOR algorithm
Code:esor_cosine
Definition:Query will search for annotations added by the ESOR cosine algorithm
Code:full_text
Definition:Query text to search anywhere in the metadata (EML) record
Code:manual_annot
Definition:Query to search for annotations added manually
Code:metacat_filtered
Definition:Query fragment is designed to search dataset column-level metadata
Code:metacat_ui
Definition:Query fragment is designed to search fields found in the Metacat User Interface
Code:nat_lang
Definition:A natural language version of the query, often a sentence

Column Name:Run_ID
Definition:Run identifier (date and time)
Storage Type:string
Measurement Type:nominal
Measurement Values Domain:
Definition:Run identifier (date and time)

Column Name:Ontology_Set_ID
Definition:placeholder column, not used in this study
Storage Type:string
Measurement Type:nominal
Measurement Values Domain:
Definition:placeholder column, not used in this study

Column Name:Precision
Definition:Precision value for this SOLR_Index_Type and Query_ID
Storage Type:float
Measurement Type:ratio
Measurement Values Domain:
Unit:percent
Type:real

Column Name:Recall
Definition:Recall value for this SOLR_Index_Type and Query_ID
Storage Type:float
Measurement Type:ratio
Measurement Values Domain:
Unit:percent
Type:real

Missing Value Code:
Code:N/A
Explanation:value not used
Code:NaN
Explanation:ontology set not used
Accuracy Report:              
Accuracy Assessment:              
Coverage:              
Methods:              

Non-Categorized Data Resource

Name:D1_2016_semantic_query_scripts
Entity Type:zip
Description:Query code for DataONE semantic query testing, 2016
Physical Structure Description:
Object Name:D1_2016_semantic_query_scripts.zip
Size:6624 byte
Authentication:fc6cb6eeb6b206aa281deea48236b013 Calculated By MD5
Externally Defined Format:
Format Name:zip
Data:https://pasta-s.lternet.edu/package/data/eml/edi/1368/1/f32bb7520f0dd0fcb2ca137007fe96be

Data Package Usage Rights

This data package is released to the "public domain" under Creative Commons CC0 1.0 "No Rights Reserved" (see: https://creativecommons.org/publicdomain/zero/1.0/). It is considered professional etiquette to provide attribution of the original work if this data package is shared in whole or by individual components. A generic citation is provided for this data package on the website https://portal.edirepository.org (herein "website") in the summary metadata page. Communication (and collaboration) with the creators of this data package is recommended to prevent duplicate research or publication. This data package (and its components) is made available "as is" and with no warranty of accuracy or fitness for use. The creators of this data package and the website shall not be liable for any damages resulting from misinterpretation or misuse of the data package or its components. Periodic updates of this data package may be available from the website. Thank you.

Keywords

By Thesaurus:
(No thesaurus) Query testing, Precision, Recall, Semantic annotation

Methods and Protocols

These methods, instrumentation and/or protocols apply to all data in this dataset:

Methods and protocols used in the collection of this data package
Description:

A field of study, “carbon cycling”, was identified as important to researchers, and particularly to synthesis scientists. A set of natural language queries related to carbon cycling was drafted based on interactions with NCEAS researchers.

Description:

A corpus of 925 datasets was identified in the DataONE catalog that had sufficient metadata for detailed queries and annotation (measurement-level). All datasets in the corpus had metadata in Ecological Metadata Language (EML), because DataONE contributors who use EML all require measurement-level metadata.

Description:

The corpus was reviewed, and datasets relevant to carbon cycling were annotated with classes from the ECSO ontology using 3 mechanisms:

a. Manual annotation (a knowledgeable curator assigned classes to specific columns of data).

b. Bioportal annotation algorithm

c. Two annotation algorithms (the original and cosine methods) from the Earth Science Ontology Repository (ESOR). RPI's annotation web service indexes a fixed set of ontologies but uses the context and entropy scoring of classes to determine the best-scoring annotation results.

Annotated datasets were cached in a DataONE coordinating node (test node), and the annotations added to the SOLR index with explicit keys.

Description:

SOLR searches were performed against the test node and “hits” tabulated. Precision and recall were summarized.

a. Definition of Recall: A/(A+B), where A = number of relevant records retrieved and B = number of relevant records NOT retrieved

b. Definition of Precision: A/(A+C), where A = number of relevant records retrieved and C = number of irrelevant records retrieved
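
The two definitions above can be sketched in Python with set operations. This is an illustration under stated assumptions, not the project's R code: the function name `precision_recall`, the choice to return `None` when a denominator is zero, and the sample identifiers are all invented here.

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) as fractions from two sets of dataset IDs."""
    a = len(retrieved & relevant)  # A: relevant records retrieved
    b = len(relevant - retrieved)  # B: relevant records NOT retrieved
    c = len(retrieved - relevant)  # C: irrelevant records retrieved
    precision = a / (a + c) if (a + c) else None  # undefined when nothing is retrieved
    recall = a / (a + b) if (a + b) else None     # undefined when nothing is relevant
    return precision, recall


# Made-up example: 4 records retrieved, 3 relevant, 2 in common.
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d1", "d2", "d5"}
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```

Here A = 2, B = 1 and C = 2, so precision is 2/4 and recall is 2/3. The summarized results table lists its unit as percent, so the fractions returned here would be multiplied by 100 to compare against it.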

Description:

Results were tabulated for 7 types of searches:

1. Full_text: Anywhere in the metadata (EML) record

2. Metacat_ui: Fields found in the Metacat User Interface

3. Metacat_filtered: Dataset column-level metadata

4. Manual_annot: Annotations added manually

5. Automated annotation - Bioportal_annot

6. Automated annotation - ESOR_annot

7. Automated annotation - ESOR_cosine

People and Organizations

Publishers:
Organization:Environmental Data Initiative
Email Address:
info@edirepository.org
Web Address:
https://edirepository.org
Id:https://ror.org/0330j0z60
Creators:
Individual: Margaret O'Brien
Organization:University of California, Santa Barbara
Address:
University of California,
SANTA BARBARA, CA 93111-1444 United States
Email Address:
margaret.obrien@ucsb.edu
Id:https://orcid.org/0000-0002-1693-8322
Individual: Matthew Jones
Organization:National Center for Ecological Analysis and Synthesis
Position:Director of Computing
Email Address:
jones@nceas.ucsb.edu
Id:https://orcid.org/0000-0003-0077-4738
Individual: Mark Schildhauer
Organization:National Center for Ecological Analysis and Synthesis
Email Address:
schild@nceas.ucsb.edu
Id:https://orcid.org/0000-0003-0632-7576
Individual: Sophie Hou
Organization:The Ronin Institute for Independent Scholarship
Email Address:
cy.sophie.hou@gmail.com
Id:https://orcid.org/0000-0002-8087-1775
Individual: Bryce Mecum
Email Address:
mecum@nceas.ucsb.edu
Id:https://orcid.org/0000-0002-0381-3766
Individual: Jamie McCusker
Organization:Rensselaer Polytechnic Institute
Email Address:
mccusj2@rpi.edu
Web Address:
https://tw.rpi.edu/person/JamieMcCusker
Id:https://orcid.org/0000-0003-1085-6059
Individual: Deborah McGuinness
Organization:Rensselaer Polytechnic Institute
Email Address:
dlm@cs.rpi.edu
Id:https://orcid.org/0000-0001-7037-4567
Contacts:
Individual: Margaret O'Brien
Organization:University of California, Santa Barbara
Address:
Marine Science Institute,
University of California,
Santa Barbara, CA 93106 United States
Phone:
8058932071 (voice)
Email Address:
margaret.obrien@ucsb.edu
Id:https://orcid.org/0000-0002-1693-8322

Temporal, Geographic and Taxonomic Coverage

Temporal, Geographic and/or Taxonomic information that applies to all data in this dataset:

Time Period
Date:
2016

Project

Parent Project Information:

Title:DataONE Semantic Query
Personnel:
Individual: Matthew Jones
Organization:National Center for Ecological Analysis and Synthesis
Position:Director of Informatics R&D
Email Address:
jones@nceas.ucsb.edu
Id:https://orcid.org/0000-0003-0077-4738
Role:Principal Investigator
Abstract:

Searching in DataONE currently focuses on fielded and full-text metadata. It does not allow precise queries of measurement types primarily because the metadata corpus contains uncontrolled descriptions of measurements (e.g., variable names, descriptions, units, etc.), making it impossible to find all datasets that use a particular measurement type. DataONE plans to extend its search system using scalable semantic annotation and inferencing. Key activities include: (a) Defining the scope of prototypes and the production framework; (b) Selecting contributing ontologies and terminologies (e.g., CF, CUAHSI, ENVO, SWEET) and extending these as needed to provide coverage of measurement types; (c) Defining the annotation representation framework, e.g., by extending PROV-O or AO to enable measurement type association with attributes of entities within a data package; (d) Implementing annotation storage framework and user interfaces to facilitate ease of manual annotation by DataONE users through investigator tools; (e) Developing data mining algorithms to infer from data and metadata the measurement types aligned with the ontology, and incorporating automated annotation capabilities into Investigator Tools; and (f) Developing extensions to the DataONE content indexing, programmatic query services, and user interfaces to enable discovery through controlled definitions of measurement types.

Additional Award Information:
Funder:National Science Foundation
Funder ID:http://dx.doi.org/10.13039/100000001
Number:1430508
Title:DataONE (Data Observation Network for Earth)
URL:https://www.nsf.gov/awardsearch/showAward?AWD_ID=1430508
Other Metadata

Additional Metadata

additionalMetadata
        |___text '\n    '
        |___element 'metadata'
        |     |___text '\n      '
        |     |___element 'fetchedFromEDI'
        |     |        \___attribute 'dateFetched' = '2023-03-08'
        |     |        \___attribute 'packageID' = 'sssss.1.1'
        |     |___text '\n    '
        |___text '\n  '

Additional Metadata

additionalMetadata
        |___text '\n    '
        |___element 'metadata'
        |     |___text '\n      '
        |     |___element 'importedFromXML'
        |     |        \___attribute 'dateImported' = '2023-03-08'
        |     |        \___attribute 'filename' = 'sssss.1.1.xml'
        |     |        \___attribute 'taxonomicCoverageExempt' = 'True'
        |     |___text '\n    '
        |___text '\n  '

Additional Metadata

additionalMetadata
        |___text '\n    '
        |___element 'metadata'
        |     |___text '\n      '
        |     |___element 'emlEditor'
        |     |        \___attribute 'app' = 'ezEML'
        |     |        \___attribute 'release' = '2023.02.19'
        |     |___text '\n    '
        |___text '\n  '

Additional Metadata

additionalMetadata
        |___text '\n    '
        |___element 'metadata'
        |     |___text '\n      '
        |     |___element 'replicationPolicy' in ns 'http://ns.dataone.org/service/types/v1' ('d1v1:replicationPolicy')
        |     |     |  \___attribute 'numberReplicas' = '1'
        |     |     |  \___attribute 'replicationAllowed' = 'true'
        |     |     |___text '\n        '
        |     |     |___element 'preferredMemberNode'
        |     |     |     |___text 'urn:node:ADC'
        |     |     |___text '\n      '
        |     |___text '\n    '
        |___text '\n  '

EDI is a collaboration between the University of New Mexico and the University of Wisconsin – Madison, Center for Limnology.