Venn and the Art of Metadata

by Mindy Carner

The relationship between metadata, controlled vocabularies and taxonomies as used in Information Management (IM) can be confusing. Each is inextricably tied to the others in an IM system’s functional requirements. This near-oneness can make their intersections seem more like redundancies than interdependencies, and can cause confusion when a company is trying to create coherent, usable business requirements. A clear understanding of the differences and connections between these concepts helps teams find the best ways to classify, describe, store and retrieve their content. In this post, we will define each concept and explain the role it plays in the navigation and search experience of information and knowledge management systems. Along the way we will illustrate how each of these elements (metadata, controlled vocabulary and taxonomy) plays a vital role in the successful implementation of the others.

 

 

 

 

 

 

 

 

 

Figure 1: An interpretative Venn Diagram that could represent the relationship between Metadata, Controlled Vocabularies, Taxonomy, and the intersections between them.

Metadata

One widely used definition describes metadata as “structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage an information resource.” In other words, metadata is the data that describes an information artefact, whether it be a book or a person. People create metadata, often about themselves, every day. You create metadata whenever you fill out a form online. Those online forms are smart, aren’t they? When you need to fill in your name, you do so in an open text field, since a name can be almost any combination of letters. When you need to input a date, that field knows that months run from 1 to 12, days from 1 to 31, and years usually from about 1900 to the present year. When you fill in a state, there is usually a dropdown list from which you select the correctly spelled form of the state’s name. These forms accommodate all of this and more because they were developed according to a metadata model.

Figure 2: A very simple metadata form in which users may fill out some basic metadata about themselves.

Metadata models define how data is structured once it has been imported into a system. Within a model, metadata is often divided into three categories: descriptive, administrative and technical. In our example above, we have descriptive metadata to describe the person filling out the form (name, age) and administrative metadata to deal with any administrative requirements (city, state and zip code). If this model also described digital artefacts, then there would also be technical metadata such as file type and file size.

Figure 3: The Metadata Model outlines the fields, and their possible values and formats, for the form in Figure 2.
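To make the idea of a metadata model more concrete, here is a minimal Python sketch of how the form's fields might be declared and checked. It is an illustration only, not the model shown in Figure 3: the `FieldSpec` class, the `validate` function, the three-state `STATES` set and the specific validation rules (for example, a five-digit zip code) are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class FieldSpec:
    """One field in a hypothetical metadata model."""
    name: str
    category: str                      # "descriptive", "administrative" or "technical"
    controlled: bool = False           # True when values come from a controlled vocabulary
    validator: Optional[Callable[[str], bool]] = None


# Illustrative subset of a state vocabulary (assumption, not a full list).
STATES = {"New York", "California", "Texas"}


def _is_number(value: str) -> bool:
    return value.isascii() and value.isdigit()


METADATA_MODEL = [
    FieldSpec("name", "descriptive", validator=lambda v: bool(v.strip())),
    FieldSpec("age", "descriptive", validator=lambda v: _is_number(v) and 0 < int(v) < 130),
    FieldSpec("city", "administrative", validator=lambda v: bool(v.strip())),
    FieldSpec("state", "administrative", controlled=True, validator=lambda v: v in STATES),
    FieldSpec("zip_code", "administrative", validator=lambda v: _is_number(v) and len(v) == 5),
]


def validate(record: Dict[str, str]) -> List[str]:
    """Return a list of problems found in a submitted form record."""
    errors = []
    for spec in METADATA_MODEL:
        value = record.get(spec.name)
        if value is None:
            errors.append(f"{spec.name}: missing")
        elif spec.validator is not None and not spec.validator(value):
            errors.append(f"{spec.name}: invalid value {value!r}")
    return errors
```

Calling `validate` on a complete, well-formed record returns an empty list, while a record whose state is entered as "NY" is reported as invalid because "NY" is not one of the values allowed for that controlled field.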

The metadata model is, importantly, the point of convergence between the metadata, controlled vocabularies and taxonomy.

Controlled Vocabularies

The Getty, one of the world's largest cultural and philanthropic organizations dedicated to the visual arts, and the owner of one of the most respected collections of digital visual artefacts, defines controlled vocabulary as “an organized arrangement of words and phrases used to index content and/or to retrieve content through browsing or searching.” Going back to our example, the dropdown list of states, as noted before, includes correctly spelled state names for people to choose from. This may seem like a simple convenience that makes form-filling easier and reduces manual errors, but controlling vocabularies like this also improves search and navigation. When the value of the state field is reliably recorded as “New York” instead of “NY” or, worse, “New Yokr,” the result is a high-quality data set that can be trusted, and a search for users who live in the Empire State will return all of the matching records.

Figure 4: A Controlled Vocabulary for the "State" field that is marked as "controlled" in the Metadata Model in Figure 3.

The metadata here is State = New York. It describes where the person who filled out the form lives. It is a piece of data in the record about that person.

The controlled vocabulary is an important data-normalizing factor ensuring reliable, predictable spelling and formatting of the data. These two elements combine in the metadata model.
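The normalizing effect described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `PREFERRED_TERMS` set, the `VARIANTS` lookup table and the `normalize_state` function are hypothetical names, and the handful of states listed stands in for a complete vocabulary.

```python
from typing import Optional

# Preferred (authorized) terms of a hypothetical "State" vocabulary.
PREFERRED_TERMS = {"New York", "New Jersey", "California"}

# Known variant spellings mapped to their preferred term (assumed entries).
VARIANTS = {
    "ny": "New York",
    "n.y.": "New York",
    "nj": "New Jersey",
    "ca": "California",
}


def normalize_state(raw: str) -> Optional[str]:
    """Map free-text input to a preferred term, or None if it is not recognized."""
    cleaned = " ".join(raw.split()).casefold()   # trim and collapse whitespace, ignore case
    for term in PREFERRED_TERMS:
        if cleaned == term.casefold():
            return term
    return VARIANTS.get(cleaned)
```

With this sketch, "NY" and "new  york" both resolve to "New York", while the misspelling "New Yokr" is rejected rather than stored. A dropdown list achieves the same outcome by never allowing the bad value to be entered in the first place.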

Taxonomy

The National Information Standards Organization defines taxonomy as “a controlled vocabulary consisting of preferred terms, all of which are connected in a hierarchy or polyhierarchy.” Taxonomies are often utilized alongside or as part of a metadata model, and though they are considered a form of controlled vocabulary, they go the extra step of applying hierarchies that group concepts into logical classifications.

While controlled vocabularies serve the primary purpose of placing authority control on the metadata to be indexed, ensuring proper spelling and format, taxonomies are more complex. A taxonomy models a domain of knowledge, creating a ‘by topic’ structure of logical concept groupings that can be used to improve a system’s global navigation and search. It enables users to browse intuitively through a system that may contain hundreds of different topics, concepts, pages, assets or items. Think of the global and left-hand navigation on most websites and systems. On a retail shoe shopping site, for example, the navigation will guide you through choices like women’s vs. men’s shoes, casual vs. business, etc. Those terms and their groupings are defined in the company’s taxonomy. This is the application of taxonomic concept grouping.
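As a rough illustration of how a hierarchy like the shoe site's could drive navigation, the Python sketch below stores each term with its narrower terms and derives a breadcrumb trail from it. The terms, the `TAXONOMY` dictionary and the `descendants` and `breadcrumb` functions are invented for this example. It also assumes a simple hierarchy in which each term has one parent; a polyhierarchy would need to return several trails.

```python
from typing import Dict, List

# Each broader term maps to its narrower terms (hypothetical shoe-site taxonomy).
TAXONOMY: Dict[str, List[str]] = {
    "Shoes": ["Women's Shoes", "Men's Shoes"],
    "Women's Shoes": ["Women's Casual", "Women's Business"],
    "Men's Shoes": ["Men's Casual", "Men's Business"],
}


def descendants(term: str) -> List[str]:
    """Return every narrower term beneath `term`, depth first."""
    result = []
    for child in TAXONOMY.get(term, []):
        result.append(child)
        result.extend(descendants(child))
    return result


def breadcrumb(term: str) -> List[str]:
    """Return the path from the top of the hierarchy down to `term`."""
    for parent, children in TAXONOMY.items():
        if term in children:
            return breadcrumb(parent) + [term]
    return [term]   # no parent found: `term` is a top-level term
```

Here `breadcrumb("Men's Business")` yields the trail Shoes, Men's Shoes, Men's Business, which is the kind of path a left-hand navigation menu exposes, and `descendants("Women's Shoes")` lists everything a user would see after choosing that branch.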

Taxonomies are also used to improve search. A very rich search experience can be created by, for example, leveraging terms from the taxonomy in a system’s faceted search interface that, again, groups the concepts into logical classes that make the search experience intuitive.
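A faceted search interface of the kind mentioned above can be approximated with the short Python sketch below. The catalogue in `ITEMS`, the facet names `audience` and `style`, and the `facet_counts` and `filter_items` functions are assumptions made for illustration; a real system would typically delegate this work to its search engine.

```python
from collections import Counter
from typing import Dict, List

# A tiny hypothetical catalogue; facet values are drawn from the taxonomy.
ITEMS: List[Dict[str, str]] = [
    {"title": "Loafer", "audience": "Men's", "style": "Business"},
    {"title": "Sneaker", "audience": "Women's", "style": "Casual"},
    {"title": "Pump", "audience": "Women's", "style": "Business"},
    {"title": "Sandal", "audience": "Men's", "style": "Casual"},
]


def facet_counts(items: List[Dict[str, str]], facet: str) -> Counter:
    """Count how many items carry each value of the given facet."""
    return Counter(item[facet] for item in items if facet in item)


def filter_items(items: List[Dict[str, str]], **selected: str) -> List[Dict[str, str]]:
    """Keep only the items that match every selected facet value."""
    return [
        item for item in items
        if all(item.get(facet) == value for facet, value in selected.items())
    ]
```

The counts are what a search page shows beside each filter option, and `filter_items(ITEMS, audience="Women's", style="Business")` narrows the results to the single matching shoe. Because the facet values come from the taxonomy, the filters stay consistent with the site's navigation.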

Conclusion

It is difficult to draw a clean line between taxonomy, metadata and controlled vocabulary. Indeed, a taxonomy is often considered a type of controlled vocabulary, and both are used to make up the actual metadata of an asset. Yet these are each distinct concepts that come together, along with a few other controls, to form a metadata model that will inform the very navigation and searchability of an information system. Understanding the difference between these concepts, and the important role that each plays in the functional requirements of any information management system, is the first step in gathering this critical information for a successful implementation.

 


 

    Mindy Carner is a Senior Associate at Optimity Advisors.

 

 

 

 
