If you still have questions about the term metadata, you’re not alone. Many data science professionals have spent countless hours studying metadata. I’m one of them, and in this article, I’ll do my best to answer the question “What is metadata?” and offer some advice on how to use and manage it effectively.

The word metadata roughly translates as “data beyond data,” given that meta (Greek μετά) means “after” or “beyond.” You’ve probably heard of a meta title tag or a meta description. Those are specific uses of the term, and I’d like to cover the topic in more depth.

As for a definition, let’s use this for the purposes of this article: 

What is Metadata?
Metadata is data about data that can be used to describe, categorize, and manage digital assets. Broadly speaking, search engines use metadata to accurately understand, index, and rank content in search results, with the goal of making each search result as relevant as possible.

Simple enough. But before we dig into its complexities and value to search engines, you need to know what data is.

How Metadata Came to Be

Let’s take a quick trip back to the earliest days of business computing. Back then, most computer usage involved recording transactions, like an order or invoice, and then fulfilling that order. While there were a few data models at the time, people typically stored the data on tape, using magnetic sections to store binary information in fixed-length chunks called records.

If you wanted to store information about where someone lived in order to send a bill, then all of the fields about the address had to be kept in the same record.

Eventually, people came up with the idea of joining rows of two tables together by using special markers called keys, where one key (the primary key) identified a row in a table, while another key (the foreign key) stored this same number as a pointer in a different field of another table.

Using records and keys, you could create a table of bills and a table of addresses, instead of each record having to store the same information over and over again. Then the bill record could just point to the address record, significantly reducing the amount of data needed to be captured — and the row in the referenced address table could then be said to be about the address being referenced.

Here you have some of the earliest examples of metadata management.

Metadata for data is like labels for labels — it’s all about organization.

In other words, the row that contained the address could be considered metadata about the address reference in the bill listing.
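The bill and address arrangement can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not a reconstruction of any historical system; the table names, column names, and sample values are all assumptions made up for the example.

```python
import sqlite3

# In-memory database; all table and column names here are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE address (
    address_id INTEGER PRIMARY KEY,          -- primary key: identifies one address row
    street     TEXT NOT NULL,
    city       TEXT NOT NULL
);
CREATE TABLE bill (
    bill_id    INTEGER PRIMARY KEY,
    amount     REAL NOT NULL,
    address_id INTEGER NOT NULL
               REFERENCES address(address_id) -- foreign key: points at an address row
);
""")

# The address is stored once, and both bills point to it.
conn.execute("INSERT INTO address VALUES (1, '12 Elm St', 'Springfield')")
conn.executemany(
    "INSERT INTO bill VALUES (?, ?, ?)",
    [(100, 45.0, 1), (101, 19.5, 1)],
)

# Joining on the key recovers the full picture for each bill.
rows = conn.execute("""
    SELECT bill.bill_id, bill.amount, address.street, address.city
    FROM bill JOIN address ON bill.address_id = address.address_id
    ORDER BY bill.bill_id
""").fetchall()
```

Each bill row carries only a small pointer, while the address row it points to supplies the information about where that bill should go.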

The implication of this is important: metadata is simply data that describes (is about) other data. Sometimes that metadata is made explicit in the data model, as in the bill and address example given above. Other times, the metadata is more subtle and implicit, representing assumed information that may or may not be captured in the model, e.g., tags.

In enterprise search, metadata is useful for organizing, grouping, and navigating all of the documents and content resources amassed by an enterprise throughout its existence. An enterprise search engine uses metadata, among other things, to create a relevant search result page for the searcher.

However, barriers can arise when different index owners or taxonomy managers use different terms to describe what are essentially the same digital assets—and this is where artificial intelligence can be helpful.

The Different Types of Metadata

Different kinds of metadata include, but certainly aren’t limited to:

Descriptive metadata

Includes descriptive information about a content resource: its category, topic, author, and type of digital asset (such as a white paper, video, or web page), along with details like tags, abstract, and creation date. It can also describe physical products. This is ideal for allowing knowledge articles to be classified.

Product metadata

Similar to descriptive metadata, product metadata lists the attributes and descriptions of a product. It can include style, brand name, occasions on which something might be used, color, size, and whether the product is sustainable. All of these descriptors can then be exposed as faceted navigation for the shopper.
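To make the link between product metadata and faceted navigation concrete, here is a small Python sketch. The product records, the field names, and the helper functions facet_counts and filter_by are all hypothetical; a real search platform would compute facets inside its index rather than in application code.

```python
from collections import Counter

# Hypothetical product metadata records.
products = [
    {"name": "Linen blouse", "brand": "Acme", "color": "white", "size": "M"},
    {"name": "Silk blouse", "brand": "Acme", "color": "blue", "size": "S"},
    {"name": "Cotton tee", "brand": "Basics", "color": "white", "size": "M"},
]

def facet_counts(items, fields):
    """Count how many items carry each value of each facet field."""
    return {
        field: Counter(item[field] for item in items if field in item)
        for field in fields
    }

def filter_by(items, **selected):
    """Keep only the items whose metadata matches every selected facet value."""
    return [
        item for item in items
        if all(item.get(field) == value for field, value in selected.items())
    ]

facets = facet_counts(products, ["brand", "color", "size"])
white_items = filter_by(products, color="white")
```

The counts in facets are what a shopper would see next to each filter option, and filter_by mimics what happens when one of those options is clicked.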

Structural metadata

Describes the structure of database objects like indexes, tables, columns, or keys.
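Most databases let you query their own structural metadata. As one illustration, the sketch below uses SQLite (through Python's sqlite3 module) and its table_info and index_list pragmas; the address table and its index are hypothetical, and other database systems expose the same kind of information through different catalogs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table and index, created only so there is something to inspect.
conn.execute(
    "CREATE TABLE address ("
    "address_id INTEGER PRIMARY KEY, street TEXT NOT NULL, city TEXT)"
)
conn.execute("CREATE INDEX idx_address_city ON address(city)")

# PRAGMA table_info yields one row per column:
# (cid, name, type, notnull, dflt_value, pk)
columns = [
    (name, col_type, bool(notnull), bool(pk))
    for _cid, name, col_type, notnull, _default, pk
    in conn.execute("PRAGMA table_info(address)")
]

# PRAGMA index_list yields one row per index; the second field is the index name.
indexes = [row[1] for row in conn.execute("PRAGMA index_list(address)")]
```

Nothing in columns or indexes is an address. It is all data about how addresses are stored.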

Administrative metadata

A type of “guide metadata” (i.e., metadata that helps humans navigate data assets) that includes information on managing a resource. Preservation metadata (information on how to save a resource) falls under this category, as does the date an asset was created or modified.

Technical metadata

Another category of guide metadata, technical metadata describes the structure of information in a data warehouse or business intelligence system.

Schematic metadata

A schema describes the allowable relationships that a metadata element can have with other elements in the data system, along with constraints on those relationships. In the case of relational databases, the schema structure is defined explicitly ahead of time, because the database needs this information to tell it how to both interpret and efficiently store certain types of data (such as numbers or dates).

Content schemas (or document metadata), such as those used to describe a webpage, Microsoft Word documents, or the puppy and kitten videos you watch during your lunch break, are usually implicit and externally applied. 

This means that the system can interpret the text even without the schema. The schema plays a more advisory role: the system can use it to accept or reject a document as valid (a process called schema validation).

Schema-less systems still have an implicit structure; it’s just that the structure is not necessarily available to the machine to use.
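Here is a rough sketch of what advisory schema validation can look like for a document. The schema format, the field names, and the validate function are assumptions invented for this example; real systems typically rely on a schema language such as XML Schema or JSON Schema instead.

```python
from datetime import date

# Hypothetical document schema: field name -> (expected type, required?)
ARTICLE_SCHEMA = {
    "title": (str, True),
    "author": (str, True),
    "created": (date, True),
    "tags": (list, False),
}

def validate(document, schema):
    """Return a list of problems; an empty list means the document conforms."""
    problems = []
    for field, (expected_type, required) in schema.items():
        if field not in document:
            if required:
                problems.append(f"missing required field: {field}")
        elif not isinstance(document[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
    return problems

good = {"title": "What is metadata?", "author": "Jane", "created": date(2023, 1, 5)}
bad = {"title": "Untitled", "created": "yesterday"}

good_problems = validate(good, ARTICLE_SCHEMA)
bad_problems = validate(bad, ARTICLE_SCHEMA)
```

Note that the "bad" document is still perfectly readable as a dictionary. The schema only reports on whether it meets expectations, which is the advisory behavior described above.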

Reifications and Annotations

Finally, in some data storage representations such as graphs, where you can create assertions (observed facts), you can create a form of metadata known as a reification. For instance, suppose that you have the following statement (an assertion): 

“Jane spent $45 for a blouse.” 

If you wanted to provide some kind of context about the statement, such as: 

“Mark reported that Jane spent $45 for a blouse.” 

This would be considered a reification — a statement about a statement.

Reifications are part of a broader class of metadata called annotations. An annotation is a comment about another statement or set of statements, and may provide additional classification, descriptions, alternative phrases, and provenance metadata (where the statement was made and who reported it). When you see activity streams on social media, for example, you’re looking at annotations about statements being made.
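One way to picture a reification in code is a statement whose object is itself a statement. The Statement class and describe function below are a hypothetical sketch in plain Python; graph standards such as RDF have their own vocabulary for this, which is not shown here.

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Statement:
    """A subject-predicate-object assertion; either end may be another Statement."""
    subject: Union[str, "Statement"]
    predicate: str
    obj: Union[str, "Statement"]

def describe(statement):
    """Render a statement, including any nested statements, as plain text."""
    def part(value):
        return describe(value) if isinstance(value, Statement) else value
    return f"{part(statement.subject)} {statement.predicate} {part(statement.obj)}"

# The original assertion.
purchase = Statement("Jane", "spent $45 for", "a blouse")

# The reification: a statement about the statement above.
report = Statement("Mark", "reported", purchase)

sentence = describe(report)
```

Because purchase is an ordinary value, any number of further annotations (who reported it, when, with what confidence) can point at it without changing the original assertion.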

Metadata and Classification

Similar metadata is involved with classification. Classification can be used to improve the efficiency of workflows, reduce the risk of errors, and make it easier for users to find the information they need. The address example above might have some kind of a classification system: a billing address, a shipping address, a forwarding address, a post office box, and so forth, each of which may require special kinds of processing.

A good database developer will turn categorizations into fields in an address_type table, with metadata elements about the distinctions between the different categories being entered in the database. Then the developer would create a category drop-down in the interface, so the user could select the right category. Not surprisingly, such categorical metadata can have a significant impact on user interface design.
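A minimal sketch of that pattern, again using Python's sqlite3 module, might look like the following. Only the address_type table name comes from the text above; the columns, descriptions, and the dropdown_options variable are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE address_type (
    type_id     INTEGER PRIMARY KEY,
    label       TEXT NOT NULL UNIQUE,
    description TEXT                 -- what distinguishes this category
);
CREATE TABLE address (
    address_id  INTEGER PRIMARY KEY,
    street      TEXT NOT NULL,
    type_id     INTEGER NOT NULL REFERENCES address_type(type_id)
);
""")
conn.executemany(
    "INSERT INTO address_type (label, description) VALUES (?, ?)",
    [
        ("billing", "Where invoices are sent"),
        ("shipping", "Where goods are delivered"),
        ("forwarding", "Where mail is redirected"),
        ("post office box", "Mail only, no courier delivery"),
    ],
)

# The interface can build its drop-down straight from the classification table.
dropdown_options = [
    label
    for (label,) in conn.execute("SELECT label FROM address_type ORDER BY label")
]

# An address with an unknown category is rejected by the foreign key.
try:
    conn.execute("INSERT INTO address (street, type_id) VALUES ('1 Main St', 99)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

Adding a new category then becomes a data change (one new row) rather than a change to the interface code.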

Dimensional Analysis as Metadata

Dimensional analysis is the practice of keeping track of the dimensions and units (length, mass, time, and so on) attached to quantities, so that values can be interpreted, compared, and converted correctly.

Consider a situation where an architect stores information about materials used in the construction of a house, such as the type of wood beams or the size of the chimney. The measurements are stored as numbers in metadata fields with names like fireplace_length, fireplace_height, and so forth, with values such as 8.25, 23.75, and 2.5.

Now, if you happen to be familiar with fireplaces or construction and grew up in the United States, you would assume that these numbers are in feet. If, however, you learned about architecture in Europe, your first assumption would likely be that these dimensions are in meters. That would make for a huge fireplace.

A computer, of course, would have no clue about what the unit of measurement was, and if not defined, a 3D printer might very well give you outputs perfect for a giant fireplace, when what you really wanted was… a normal one. 

Again, units are metadata; they are critical pieces of information about the interpretation of fields of content that aren’t necessarily stored with that content. This can be especially important with unstructured data, also referred to as unstructured content.

In a relational database, a (bad) solution would be to add a unit description to the metadata field name. This would tell the database user how to interpret the units, but the computer itself would learn nothing without parsing the name. A better solution would be to set up a table that lists the property names for each table and records the dimension and unit of each field.
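As a rough sketch of the "units table" idea, the lookup can be as simple as a mapping from field name to unit that the program consults before using a number. The FIELD_UNITS and METERS_PER_UNIT tables and the to_meters helper are hypothetical, and the choice of feet for the fireplace fields is an assumption made for the example.

```python
# Hypothetical units table: one entry per numeric field, naming its unit.
FIELD_UNITS = {
    "fireplace_length": "ft",
    "fireplace_height": "ft",
}

# Conversion factors to a common unit (meters).
METERS_PER_UNIT = {"ft": 0.3048, "m": 1.0}

def to_meters(field, value):
    """Convert a stored value to meters using the unit recorded for its field."""
    unit = FIELD_UNITS[field]  # raises KeyError if no unit was ever recorded
    return value * METERS_PER_UNIT[unit]

length_m = to_meters("fireplace_length", 8.25)
```

Letting the lookup fail for a field with no recorded unit is deliberate in this sketch: refusing to guess is safer than silently printing a giant fireplace.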

How to Optimize Metadata for Your Search Engine 

Metadata can play a critical role in the end-user experience. What your ecommerce buyer sees on a product results page, the text snippets included in site search engine results — it’s all influenced by metadata.

As such, adhere to a few best practices to ensure your search engine is as accurate and relevant as possible: 

  • Identify the most important metadata fields for your search index
  • Map metadata fields to the appropriate data types
  • Normalize metadata values to ensure consistency
  • Have a system in place for handling special cases and exceptions
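The four practices above can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the field list, the synonym table, the date format, and the normalize function are hypothetical and are not part of any particular search product.

```python
from datetime import date, datetime

# 1. The metadata fields that matter for the index, and
# 2. the data type each one should be mapped to.
IMPORTANT_FIELDS = {"title": str, "category": str, "created": date, "page_count": int}

# 3. Known variants that should collapse to one consistent value.
CATEGORY_SYNONYMS = {"whitepaper": "white paper", "white-paper": "white paper"}

def normalize(record):
    """Return (clean metadata, problems) for one raw record."""
    clean, problems = {}, []
    for field, target in IMPORTANT_FIELDS.items():
        raw = record.get(field)
        if raw is None:
            # 4. Special cases are reported, not silently guessed at.
            problems.append(f"{field}: missing")
            continue
        try:
            if target is date:
                value = datetime.strptime(str(raw).strip(), "%Y-%m-%d").date()
            elif target is int:
                value = int(raw)
            else:
                value = str(raw).strip()
        except ValueError:
            problems.append(f"{field}: cannot parse {raw!r}")
            continue
        if field == "category":
            value = CATEGORY_SYNONYMS.get(value.lower(), value.lower())
        clean[field] = value
    return clean, problems

raw_record = {
    "title": "  Metadata 101 ",
    "category": "WhitePaper",
    "created": "2023-04-01",
    "page_count": "12",
    "internal_id": "x9",  # not an important field, so it is left out of the index
}
clean, problems = normalize(raw_record)
messy_clean, messy_problems = normalize({"title": "Draft", "created": "last Tuesday"})
```

Returning the problems alongside the clean values keeps exceptions visible, so someone can decide whether to fix the source content or extend the rules.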

The importance of mapping metadata to the appropriate fields in a search index shouldn’t be understated. Check out Mapping Metadata: Tips for Best Search Results for more on how to map metadata fields with Coveo.

Maintaining Metadata

One last important point to consider: metadata typically provides context to data. It’s the information that the data itself is implicitly assuming in order to be true or meaningful. Because context often falls outside of the structure of that data, you should be careful about working with any system or metadata management tool that bills itself as a “metadata server.”

Such tools can infer the potential context of information based on what’s known about existing information, but you should make sure you have a way to validate that the contextual metadata they produce is in fact correct and consistent.

And that’s another issue. Metadata can become stale — and when you are talking about enterprise-sized repositories, maintaining metadata at scale is hard. However, good machine learning can eliminate some of the need for maintenance as it learns from users’ behaviors.

Summary

There is no question that metadata is important in computing, especially as vector search, machine learning, and natural language processing all become central to metadata management overall.

Hopefully, this brief overview provides a good reference for your own work with metadata.

Dig Deeper

Your enterprise’s metadata standard will vary depending on a number of factors, and taxonomy is one of them. But there are some myths floating around about the importance and prioritization of taxonomy when it comes to enterprise search. Learn more about the three myths slowing down your digital transformation.

