DataONE is a federation of institutions involved with the earth and environmental sciences that share data through common cyberinfrastructure. In 2016, the DataONE project quantified the utility of semantic query by measuring the precision and recall with which searches returned relevant datasets from the DataONE catalog. Precision is defined as the proportion of the retrieved results that are relevant, and recall as the proportion of all relevant data in the repository that was retrieved (see Methods).
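The two definitions above can be written as a short calculation. The following is a minimal sketch in Python, not the study's R code; the function name and the dataset identifiers are hypothetical and chosen only for illustration, and returning None for an empty denominator is an assumption of this sketch.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: identifiers returned by the search
    relevant:  identifiers judged relevant across the whole corpus
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant datasets that were returned
    # A ratio with an empty denominator is undefined; report it as None.
    precision = hits / len(retrieved) if retrieved else None
    recall = hits / len(relevant) if relevant else None
    return precision, recall


# Hypothetical identifiers, for illustration only.
retrieved = ["ds1", "ds2", "ds3", "ds4"]
relevant = ["ds2", "ds4", "ds7"]
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.67
```

Here two of the four returned datasets are relevant (precision 0.50), and two of the three relevant datasets were returned (recall 0.67).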
This dataset contains the queries and results of that study. Four data tables are included. The first lists the ten queries, which were formatted in several ways, including natural language, text strings (for plain-text searches of various parts of the metadata), and URIs for measurements in the EcoSystem Ontology (ECSO). The second contains 994 relevant datasets in the DataONE catalog, with a column for each of the ten queries holding a boolean value that indicates whether the dataset is a match for that query. The remaining two tables hold the raw and summarized results of the query tests. A fifth entity contains the zipped code (R language) used to perform the queries in the DataONE system.
When the ten queries were run against approximately 1000 datasets in October 2016, precision ranged from 0-50% and recall from 0-100%. This indicates that traditional searches may sometimes be adequate to return all relevant data in a corpus, but that results can be erratic and inconsistent, with potentially large amounts of irrelevant data in the result set.