Trouble understanding metadata? In this article, we’ll dig into two questions, what metadata is and how to use it, and offer advice on managing it.
Going solely on the name, metadata is “data beyond data,” given that meta (Greek μετα) means past or beyond.
However, this definition tells you very little about what metadata is or does. Before you can explore the meaning of metadata or its value to a search engine, you need to know what data is.
The Emergence of Metadata
In the earliest days of business computing, most computer usage involved recording transactions, like an order or invoice, and then fulfilling that order. While there were a few data models at the time, typically the data was stored on tape – using magnetic sections to store binary information in fixed-length chunks called records.
If you wanted to store information about where someone lived in order to send a bill, then all of the fields about the address had to be kept in the same record.
Eventually, several people came up with the idea of joining rows of two tables together by using special markers called keys, where one key (the primary key) identified a row in a table, while another key (the foreign key) stored this same number as a pointer in a different field of another table.
This meant that, instead of each record having to store the same information over and over again, you could create a table of bills and a table of addresses. Then the bill record could just point to the address record. This significantly reduced the amount of data that needed to be captured, and the row in the address table could then be said to be about the address reference stored in the bill.
Metadata for data is like labels for labels — it’s all about organization.
In other words, the row that contained the address could be considered metadata about the address reference in the bill listing.
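The bill and address arrangement described above can be sketched with Python’s built-in sqlite3 module. This is a minimal illustration, not a reconstruction of any historical system: the table names, column names, and sample values are all assumptions made for the example.

```python
import sqlite3

# Hypothetical schema: every table, column, and value here is illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked
conn.executescript("""
    CREATE TABLE address (
        address_id INTEGER PRIMARY KEY,  -- primary key: identifies one address row
        street TEXT NOT NULL,
        city TEXT NOT NULL
    );
    CREATE TABLE bill (
        bill_id INTEGER PRIMARY KEY,
        amount REAL NOT NULL,
        -- foreign key: stores the address row's key as a pointer
        address_id INTEGER NOT NULL REFERENCES address(address_id)
    );
""")

# The address is stored once, and both bills point at it.
conn.execute("INSERT INTO address VALUES (1, '12 Elm St', 'Springfield')")
conn.executemany("INSERT INTO bill VALUES (?, ?, ?)", [(100, 45.0, 1), (101, 19.5, 1)])


def bills_with_addresses(connection):
    """Join each bill to the address row that its foreign key points at."""
    return connection.execute(
        "SELECT b.bill_id, b.amount, a.street, a.city "
        "FROM bill AS b JOIN address AS a ON a.address_id = b.address_id "
        "ORDER BY b.bill_id"
    ).fetchall()


rows = bills_with_addresses(conn)
for row in rows:
    print(row)
```

Each bill row carries only the key; the street and city live in one place and are looked up through the join.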
The implication of this is important: metadata is simply data that describes (is about) other data. Sometimes that metadata is made explicit in the data model, as in the bill and address example given above. Other times, the metadata is more subtle and implicit, representing assumed information that may or may not be captured in the model, e.g., a tag.
In enterprise search, metadata is useful for organizing, grouping, and navigating all of the documents and content resources amassed by an enterprise throughout its existence. An enterprise search engine uses metadata, among other things, to create a relevant search result page for the searcher. However, barriers can arise when different index owners or taxonomy managers use different terms to describe what are essentially the same digital assets – and this is where artificial intelligence can be helpful.
Different kinds of metadata include, but certainly aren’t limited to:
- Descriptive metadata: Information that describes a content resource or digital asset (like a white paper, video, or web page), such as a tag or abstract.
- Structural metadata: Describes the structure of database objects like indexes, tables, columns, or keys.
- Administrative metadata: A type of ‘guide metadata’ (i.e., metadata that helps humans navigate data assets) that includes information on managing a resource; preservation metadata falls under this category, meaning information on how to save a resource.
- Technical metadata: Another category of ‘guide metadata’ that describes the structure of information in a data warehouse or business intelligence system.
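To make the first two kinds in the list above concrete, here is a small sketch using Python and its built-in sqlite3 module. The white paper record, the document table, and the describe_columns helper are hypothetical; only the PRAGMA table_info call is a real SQLite feature.

```python
import sqlite3

# Descriptive metadata: information about a content resource (values are made up).
white_paper = {
    "title": "Example White Paper",
    "tags": ["search", "metadata"],
    "abstract": "A short summary of the document.",
}

# Structural metadata: the database's own description of a table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE document (doc_id INTEGER PRIMARY KEY, title TEXT NOT NULL)")


def describe_columns(connection, table):
    """Return (column name, declared type, is primary key) for each column.

    The table name is interpolated into the statement, so pass trusted names only.
    """
    # PRAGMA table_info yields rows of (cid, name, type, notnull, dflt_value, pk).
    info = connection.execute(f"PRAGMA table_info({table})").fetchall()
    return [(name, col_type, bool(pk)) for _cid, name, col_type, _notnull, _default, pk in info]


columns = describe_columns(conn, "document")
print(columns)
```

The dictionary describes what a document is about, while the column listing describes how the data is laid out; neither is the document content itself.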
Dimensional Analysis as Metadata
Consider a situation where an architect stores information about the materials used in the construction of a house, such as the type of wood beams or the size of the chimney. The measurements are stored as numbers in metadata fields with names like fireplace_length and fireplace_height, with values such as 8.25, 23.75, and 2.5.
Now, if you are familiar with fireplaces or construction and grew up in the United States, you would assume that these numbers are in feet. If, however, you learned about architecture in Europe, your first assumption would likely be that these dimensions are in meters, which would make for a huge fireplace.
A computer, of course, would have no clue what the unit of measurement was, and if the unit were not defined, a 3D printer might very well give you output fit for a giant (or a mouse).
Again, units are metadata; they are critical pieces of information about the interpretation of fields of content that aren’t necessarily stored with that content. This can be especially important with unstructured data, also referred to as unstructured content.
In a relational database, a (bad) solution would be to add a unit description to the field name itself. This tells the database user how to interpret the values, but the computer still cannot know the unit without parsing the name. A better solution would be to set up a table that lists the property names in each table and identifies the dimension of each field, so that the unit is stored as metadata for that field.
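One way the better solution might look is sketched below with Python’s built-in sqlite3 module. The field_unit table, the length_in_meters helper, and the choice of feet as the recorded unit are assumptions made for illustration; the field names and sample values come from the fireplace example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fireplace (
        fireplace_id INTEGER PRIMARY KEY,
        fireplace_length REAL,
        fireplace_height REAL
    );
    -- One row per (table, field): the unit is data, not part of the column name.
    CREATE TABLE field_unit (
        table_name TEXT NOT NULL,
        field_name TEXT NOT NULL,
        unit TEXT NOT NULL,
        PRIMARY KEY (table_name, field_name)
    );
""")
conn.execute("INSERT INTO fireplace VALUES (1, 8.25, 2.5)")
conn.executemany("INSERT INTO field_unit VALUES (?, ?, ?)", [
    ("fireplace", "fireplace_length", "ft"),  # assumed unit for this sketch
    ("fireplace", "fireplace_height", "ft"),
])

# Conversion factors to meters for the units this sketch knows about.
METERS_PER_UNIT = {"ft": 0.3048, "m": 1.0}


def length_in_meters(connection, field, value):
    """Convert a raw fireplace value to meters using the unit recorded for its field."""
    row = connection.execute(
        "SELECT unit FROM field_unit WHERE table_name = 'fireplace' AND field_name = ?",
        (field,),
    ).fetchone()
    if row is None:
        raise LookupError(f"no unit recorded for {field}")
    return value * METERS_PER_UNIT[row[0]]


print(length_in_meters(conn, "fireplace_length", 8.25))
```

Because the unit lives in a table rather than in a column name, a program can look it up directly, and a field with no recorded unit fails loudly instead of being silently misread.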
Metadata and Classification
Similar metadata is involved with classification. The addresses example might have some kind of a classification system: a billing address, a shipping address, a forwarding address, a post office box, and so forth – each of which may require special kinds of processing.
A good database developer will turn the categories into rows in an address_type table, with metadata elements describing the distinctions between the different categories entered in the database. Then the developer would create a category drop-down in the interface, so the user could select the right category. Not surprisingly, such categorical metadata can have a significant impact on user interface design.
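A rough sketch of that pattern, again using Python’s built-in sqlite3 module, follows. The address_type name comes from the text above; the column names, descriptions, and the dropdown_options helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # make SQLite enforce the reference below
conn.executescript("""
    CREATE TABLE address_type (
        type_id INTEGER PRIMARY KEY,
        label TEXT NOT NULL UNIQUE,
        description TEXT NOT NULL      -- what distinguishes this category
    );
    CREATE TABLE address (
        address_id INTEGER PRIMARY KEY,
        street TEXT NOT NULL,
        type_id INTEGER NOT NULL REFERENCES address_type(type_id)
    );
""")
conn.executemany("INSERT INTO address_type (label, description) VALUES (?, ?)", [
    ("billing", "Where invoices are sent"),
    ("shipping", "Where goods are delivered"),
    ("forwarding", "Where mail is redirected"),
    ("post office box", "A box held at a post office"),
])


def dropdown_options(connection):
    """Return the category labels an interface could offer in a drop-down."""
    cursor = connection.execute("SELECT label FROM address_type ORDER BY type_id")
    return [label for (label,) in cursor]


options = dropdown_options(conn)
print(options)
```

The interface reads its choices from the same table that constrains the data, so adding a category is a data change rather than a code change, and an address with an unknown type is rejected by the database.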
Schema is yet another form of metadata.
A schema describes the allowable relationships that a metadata element can have with other elements in the data system, along with constraints on those relationships. In the case of relational databases, the schema structure is defined explicitly ahead of time, because the database needs this information to tell it how to both interpret and efficiently store certain types of data (such as numbers or dates).
Content schemas (or document metadata), such as those used to describe a web page, a Microsoft Word document, or the puppy and kitten videos you watch during your lunch break, are usually implicit and externally applied.
This means that the system can interpret the text even without the schema. The schema is instead used in a more advisory way: the system can check a document against it and accept or reject the document as valid, a process called schema validation.
Schema-less systems still have an implicit structure; it’s just that the structure is not necessarily available to the machine to use.
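The advisory style of schema validation can be sketched in a few lines of Python. The ARTICLE_SCHEMA mapping, the validate function, and the sample document are all hypothetical; real content schemas are usually far richer than a field-to-type table.

```python
# Hypothetical content schema: field name -> required Python type.
ARTICLE_SCHEMA = {"title": str, "author": str, "word_count": int}


def validate(document, schema):
    """Return a list of problems; an empty list means the document is valid."""
    problems = []
    for field, expected in schema.items():
        if field not in document:
            problems.append(f"missing field: {field}")
        elif not isinstance(document[field], expected):
            problems.append(f"{field} should be {expected.__name__}")
    return problems


# The document is readable without the schema; validation only reports on it.
doc = {"title": "What Is Metadata?", "word_count": "1500"}
print(doc["title"])

problems = validate(doc, ARTICLE_SCHEMA)
print(problems)  # ['missing field: author', 'word_count should be int']
```

Unlike a relational schema, nothing here stops the document from being stored or read; the caller decides what to do with the reported problems.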
Reifications and Annotations
Finally, in some data storage representations such as graphs, where you can create assertions (observed facts), you can create a form of metadata known as a reification.
For instance, suppose that you have a statement (an assertion) such as “Jane spent $45 for a blouse.” If you wanted to provide some kind of context about the statement (such as “Mark reported that Jane spent $45 for a blouse.”), this would be considered a reification: a statement about a statement.
Reifications are part of a broader class of metadata called annotations. An annotation is a comment about another statement or set of statements, and may provide additional classification, descriptions, alternative phrases, and provenance (where the statement was made and who reported it).
When you see activity streams on social media, you’re looking at annotations about statements being made.
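The Jane and Mark example can be modeled in Python as one statement wrapped by another. This is a simplified sketch, not how any particular graph store represents reification; the Statement and Reification classes and the render helpers are invented for the illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    """An assertion: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str


@dataclass(frozen=True)
class Reification:
    """A statement whose object is another statement."""
    subject: str
    predicate: str
    about: Statement


def render(statement):
    """Render a plain statement as text, without a trailing period."""
    return f"{statement.subject} {statement.predicate} {statement.obj}"


def render_reification(reification):
    """Render a statement about a statement as a full sentence."""
    return f"{reification.subject} {reification.predicate} that {render(reification.about)}."


purchase = Statement("Jane", "spent", "$45 for a blouse")
report = Reification("Mark", "reported", purchase)
print(render_reification(report))  # Mark reported that Jane spent $45 for a blouse.
```

The original assertion is left untouched; the provenance (who reported it) is attached from the outside, which is what makes it metadata about the statement.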
About Metadata: A Summary
One important point to consider is that metadata typically provides context to data; it’s the information that the data itself implicitly assumes in order to be true or meaningful. Because context often falls outside of the structure of that data, you should be careful about working with any system that bills itself as a metadata server.
Such systems can determine the potential context of information based on what’s known about existing information, but you should have a way to validate that the contextual metadata they produce is in fact correct and consistent.
There is no question that metadata will become increasingly important in computing, especially as graph-oriented data systems, machine learning, and natural language processing all become central to data and metadata management.
Your enterprise’s metadata standard will vary depending on a number of factors, and taxonomy is one of them.