Funktionen

4.3.2.5 Controlled vocabularies and authority files

As you have seen so far, metadata standards define the categories with which data can be described in more detail. On the one hand, these include interdisciplinary categories such as title, author, date of publication, type of study, etc., but on the other hand they also include subject-specific categories such as substance temperature in chemistry or materials science. However, there is no definition or validity check of how you fill the respective categories with information. 
What date format do you use? Is the temperature given in Celsius or Fahrenheit and with “°” or “degrees”? Is it a “survey” or a “questionnaire”? These questions seem superficial at first glance, but predefined and uniform terms and formats are closely related to machine processing, search results and linkage with other research data. For example, if the date format does not correspond to the format a search system works with, the research data with the incompatible format will not be found. If questionnaires are searched for, but the term “survey” is used in the metadata, it is not certain that the associated research data will also be found.
For the purpose of linguistic standardisation in the description of metadata, so-called controlled vocabularies have been developed. In the simplest form, these can be pure word lists that regulate the use of language in the description of metadata, but also complex, structured thesauri. Thesauri are word networks that contain words and their semantic relations to other words. This makes it possible, among other things, to unambiguously resolve polysemous (ambiguous) terms (see e.g. Thesaurus of Plant Traits). A semantic thesaurus, often referred to as a semantic network, is a structured vocabulary or knowledge representation system that goes beyond traditional thesauri by capturing not only hierarchical relationships (such as broader terms and narrower terms) between concepts or terms but also semantic relationships, associations, and meaning. It is designed to provide a richer and more contextually meaningful representation of the terms or concepts within a specific domain or knowledge area. Thesauri primarily focus on managing and facilitating the retrieval of words or terms used in documents or databases. Their main purpose is to help users find synonyms, related terms, and broader or narrower terms to improve information retrieval and search. In contrast, ontologies, such as ENVO or the unit ontologies (e.g. UO) are used to represent and define complex relationships and concepts within a specific domain. They aim to capture the semantics and meaning of concepts, enabling a more comprehensive understanding within a specific domain or subject area. They serve as a formal and standardised framework for organising and categorising information, making data more accessible, interoperable, and understandable across various applications and disciplines. By providing a common vocabulary and a set of defined relationships, ontologies are invaluable tools in fields such as information retrieval, data integration, artificial intelligence, and the Semantic Web. They enable researchers and systems to achieve a shared understanding of complex domains, fostering more efficient data management and knowledge exchange.
This video[1] explains the importance of semantic search engines to find relevant data.
Ontologies and thesauri are distinct systems for structuring and categorising information. Ontologies serve to represent complex relationships and concepts within a specific domain. In contrast, thesauri primarily aid in word or term retrieval, offering synonyms, related terms, and hierarchical relationships to enhance information search and indexing. Ontologies provide a comprehensive, detailed representation of knowledge, while thesauri are more focused on managing and organising terms for search and categorization. The choice between the two depends on the specific needs and goals of a given project or application.
As a researcher or research group, how can you ensure the use of consistent terms and formats? As an individual in a scientific discipline, it is worth asking about controlled vocabularies within that discipline at the beginning of a research project. A simple search on the internet is usually enough. Even in a research group with a research project lasting several years, a controlled vocabulary should be searched for before the project begins and before the first analysis. If none can be found, it is worthwhile, depending on the number of researchers involved in the project and the number of sites involved, to create an internal project document for the uniform coordination of the terms and technical terms used, which should be used in the respective metadata categories.
GFBio provides a Terminology Service containing a collection of terminologies that are relevant in the field of biology and environmental sciences. This service acts as a resource, providing researchers with a common framework for defining and categorising various aspects of biodiversity, including species names, ecological interactions, and ecosystem processes. By utilising GFBio's Terminology Service, scientists can ensure consistency and precision in their research, promoting clarity and enabling seamless integration and comparison of biodiversity data across different studies. This unified approach to vocabulary and terminology strengthens collaboration and understanding within the scientific community, advancing our knowledge and conservation efforts in the diverse and intricate world of biodiversity.
In addition to controlled vocabularies, there are also many authority files which, in addition to uniform naming, make many entities uniquely referenceable. For example ORCID, short for Open Researcher and Contributor ID, which identifies academic and scientific authors via a unique code. The specification of such an ID distinguishes any frequently occurring and therefore ambiguous names and should hence be used as a matter of preference.
Probably the best-known standards file in Germany is the Gemeinsame Normdatei (GND, eng. Integrated Authority File), which is maintained by the Deutsche Nationalbibliothek (DNB, eng. German National Library), among others. It describes not only persons, but also “corporate bodies, conferences, geographies, subject terms and works related to cultural and scientific collections”. Each entity in the GND is given its own GND ID, which uniquely references that entity. For example, Charles Darwin has the ID 118523813 and Alfred Russel Wallace has the ID 118806009 in the GND. These IDs can be used to reference Darwin and Wallace unambiguously in metadata with reference to the GND (e.g. Darwin in the DNB, Wallace in Wikidata). Another example from research is the so-called Research Organization Registry (ROR), that provides an unique identifier for each research organisation.
GeoNames is an online encyclopaedia of places, also called a gazetteer. It contains all countries and more than 11 million place names that are assigned a unique ID. This makes it possible, for example, to directly distinguish between places with the same name without knowing the officially assigned municipality code (in Germany, the postcode). For example, Manchester in the UK (2643123), Manchester in the state of New Hampshire in the US (5089178) and Manchester in the state of Connecticut in the US (4838174) can be clearly distinguished.
In general, investigate about specific requirements as soon as you know where you want to store or publish your research data. Once you know these requirements, you can create your own metadata. When referring to specific commonly known entities, always try to use a unique ID, specifying the thesaurus used.
If you want to know whether a controlled vocabulary or ontology already exists for your scientific discipline or a specific subject area, you can carry out a search at BARTOC, the “Basic Register of Thesauri, Ontologies & Classifications” as a first step. As a second step, BioPortal and the EMBL-EBI Ontology Lookup Service are very suitable resources for our domain.

[1] (published by GFBio project, 2017)


Bisher wurde noch kein Kommentar abgegeben.