This data package was submitted to a staging environment for testing purposes only. Use of these data for anything other than testing is strongly discouraged.

Data Package Summary

  • Results of semantic queries for "carbon cycling" for datasets in the DataONE catalog
  • O'Brien, Margaret; University of California, Santa Barbara
    Jones, Matthew; Director of Computing; National Center for Ecological Analysis and Synthesis
    Schildhauer, Mark; National Center for Ecological Analysis and Synthesis
    Hou, Sophie; The Ronin Institute for Independent Scholarship
    Mecum, Bryce
    McCusker, Jamie; Rensselaer Polytechnic Institute
    McGuinness, Deborah; Rensselaer Polytechnic Institute
  • 2023-03-08
  • O'Brien, M., M. Jones, M. Schildhauer, S. Hou, B. Mecum, J. McCusker, and D. McGuinness. 2023. Results of semantic queries for "carbon cycling" for datasets in the DataONE catalog ver 1. Environmental Data Initiative. (Accessed 2024-12-27).
  • DataONE is a federation of institutions involved with the earth and environmental sciences that share data through common cyberinfrastructure. In 2016, the DataONE project quantified the utility of semantic queries by measuring the precision and recall with which relevant datasets were retrieved from that catalog. Precision is defined as the proportion of retrieved results that are relevant, and recall is the proportion of all relevant data present in the repository that is retrieved (see Methods).

    This dataset contains the queries and results of that study. Four data tables are included. The first lists the ten queries, which were formatted in several ways, including natural language, text strings (for plain-text searches of various parts of the metadata), and URIs for measurements in the EcoSystem Ontology (ECSO). The second contains 994 relevant datasets in the DataONE catalog, with a column for each of the ten queries holding a boolean value that indicates whether the dataset is a match for that query. The remaining two tables hold the raw and summarized results of the query tests. A fifth entity contains the zipped code (R language) used to perform the queries in the DataONE system.

    When the ten queries were run against approximately 1000 datasets in October 2016, precision ranged from 0 to 50% and recall from 0 to 100%. This indicates that traditional searches may sometimes be adequate to return all relevant data in a corpus, but that results can be erratic and inconsistent, with potentially large numbers of irrelevant datasets in the result set.

  • edi.1368.1  (Uploaded 2023-03-08)  
  • This data package is released to the "public domain" under Creative Commons CC0 1.0 "No Rights Reserved". It is considered professional etiquette to provide attribution of the original work if this data package is shared in whole or by individual components. A generic citation is provided for this data package on the website (herein "website") in the summary metadata page. Communication (and collaboration) with the creators of this data package is recommended to prevent duplicate research or publication. This data package (and its components) is made available "as is" and with no warranty of accuracy or fitness for use. The creators of this data package and the website shall not be liable for any damages resulting from misinterpretation or misuse of the data package or its components. Periodic updates of this data package may be available from the website. Thank you.
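
The precision and recall definitions given in the abstract can be illustrated with a minimal Python sketch. It is not the study's code (the package's own code is in R), and the dataset identifiers and the function name `precision_recall` are hypothetical, chosen only for illustration. The sketch assumes that a query with no retrieved results, or a corpus with no relevant datasets, scores 0.0 rather than being undefined.

```python
from typing import Iterable, Tuple


def precision_recall(retrieved: Iterable[str], relevant: Iterable[str]) -> Tuple[float, float]:
    """Return (precision, recall) for one query.

    precision = relevant datasets retrieved / all datasets retrieved
    recall    = relevant datasets retrieved / all relevant datasets in the corpus
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = retrieved_set & relevant_set

    # Assumption: treat an empty denominator as a score of 0.0.
    precision = len(hits) / len(retrieved_set) if retrieved_set else 0.0
    recall = len(hits) / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical identifiers: 4 datasets are relevant to the query,
# and the search returns 8 datasets, 2 of which are relevant.
relevant = {"ds-1", "ds-2", "ds-3", "ds-4"}
retrieved = {"ds-1", "ds-2", "ds-9", "ds-10", "ds-11", "ds-12", "ds-13", "ds-14"}

precision, recall = precision_recall(retrieved, relevant)
print(f"precision = {precision:.0%}, recall = {recall:.0%}")
```

In this made-up case a quarter of the returned datasets are relevant and half of the relevant datasets are found, which mirrors the kind of mixed outcome the abstract describes.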

EDI is a collaboration between the University of New Mexico and the University of Wisconsin – Madison, Center for Limnology.