Metadata in Software Development

In modern programming, there’s a lot of cool stuff that can be done with metadata. Failing to consider it in one form or another is setting aside a major tool in any system architecture. Sadly, I see this from time to time, a project with a lot of information associated with it… typically the metadata ends up being expressed only as requirements… and is not applied any further in automation. My post on Edgewater’s blog highlights code generation as one of my favorite ways metadata can be leveraged using things like standardized requirements documentation as metadata.

Wikipedia argues that metadata is divided into two categories, “data about data types” such as XSDs and WSDLs, and “data about content” is only helpful in determining application. According to their current article, metadata is akin to information about a container such as a box.. how big is it? how much can it hold? Metacontent is information about what the box contains… what is it? how much does it weigh? what are its storage instructions? what are the contents of the instruction manual? The former information is often best applied at design time. The latter, is often more economically applied at runtime. I would argue that it goes deeper than that, as well. One man’s metacontent is another’s metadata. The term metadata could be used to cover:

raw data
structure/language definition (syntax) (describing how to express a schema)
structure/language expression definition (schema)
content
content description/classification

Metadata in information systems has two major aspects: expression and application. Expression is everything from the media used to the language (even characters) that are used to communicate information. Expression implies how the information can be managed and transmitted, and even how the information can be applied. Application is all about how the information is used, impacting utilization of resources requried to leverage it at the time that it is applied.

In application development, metadata can take many forms. It can look like lots of different things. Here’s a list enumerating a few forms in the development world, along with pros and cons of each:

BLOB (Fully Custom/Encoded Expression)

Notes: At the core of all other computer data expression is a specialized BLOB. I mean, on disk, there’s no visible difference between an Excel Spreadsheet and a JPEG image… it’s just a block of bytes. The meaning of each byte is interpreted by the application that uses it. (Its beauty is in the eye of the beholder.)
Pro’s: Most flexible expression. Any information that can be expressed within a computer can be encoded to live in a Binary file.
Con’s: Any “beholder” must typically be created from scratch, making it typically tightly coupled with its application. Harder to manage in terms of media & transmission.

Plain Text

Pro’s: Very flexible, customizable, can be loosely coupled to the application. Lots of options with respect to media and transmission.
Con’s: Plain text can take any form, often has to be parsed using custom tools, care has to be taken to make sure plain text is extensible, human readable, and structured enough for processing.

Pro’s: Much more structured by definition, easy to extend by definition. Many many useful tools exist to define, manage, communicate, index, and consume XML. Very loosely tied to its application(s).
Con’s: Not quite as human-readable as say, plain text might be.

JSON

Pro’s: Extensible, structured, blurs the line between code & metadata, since JSON (JavaScript Object Notation) is evaluated by runtimes as code. Meant for web client consumption.
Con’s: not nearly as broadly supported as XML, tools are not available in as many contexts. Not so well supported for consumption outside web clients.

XAML

Pro’s: XML based, and like JSON, is an object initialization notation. It is more widely known for describing UI elements in WPF and Silverlight. Directly compiles to code.
Con’s: Requires WPF or Silverlight runtimes to use, not well supported beyond client-side rich apps, media may be typically described as “embedded in binary code”.

Embedded in Programming Languages

Pro’s: typically compiled into code, metadata can be expressed as code litterals or code attribution. This means the data is almost fundamentally instantly available to the running application that consumes it. Done well, especially with code attribution, this tends to provide meaningful human-readable information to a programmer (typically more so than comments) and a runtime processor at the same time.
Con’s: embedded metadata must be processed at runtime, every time. In cases where there are large amounts of data involved, done wrong, this can be time consuming. Typcially forces metadata content management into source code control, which arguably may not be the best way to manage it. (source control of metadata is a “Pro”, but putting it in something like TFS can make it hard to reach.) Litterals and attribution can tie metadata to the code that consumes it.

Excel Spreadsheets

Pro’s: typically easily read & shared, easy to manage & communicate as a file.
Con’s: not so easy to collaborate on, typically concurrent content managers must merge changes. Automation can be klunky.

Database Tables

Pro’s: Easy medium for a developer to work with
Con’s: Not so easy a medium for anyone else, hard to manage in source control, requires custom tools for end-users to work with if they need to.

SharePoint Lists

Pro’s: Easy medium for anyone to use / collaborate, SharePoint provides content management tools. Data is easily decoupled, accessible via web (and web services), and exportable to files such as Excel spreadsheets.
Con’s: Hard to keep metadata sync’d with source code control. (If this is a requirement, the workaround involves exporting documents and checking them in.)

Metadata application, in programming terms, can be accomplished in several different places. It can be applied at:

Design Time

Common application: code generation based on metadata or metacontent information.

Run Time

Common application: runtime behavior/presentation modification and/or extension based on metacontent information. This could include interpreted language processors. One could argue that .NET Common Intermediate Language (CIL) is metadata for that reason, since a runtime interprets it and executes native code from it.

Test Time

Common application: baseline test data, test configuration. (In turn, these can be applied at “design time” such as generating unit test code, or runtime, such as gathering baseline test data to compare to results, or determining executability and/or parameters for tests.)

Deploy Time

Common application: Deployment targets/configuration based on build content metacontent.

Code generation is one of the most compelling uses for metadata in information systems, but it’s certainly not the only use. One of my other favorite topics, SharePoint, highlights this by making metacontent a cornerstone of content and document management, classification, reporting, and search indexing. Throwing a document in a plain old file share provides you with a little metadata; the folder names it its file path typically has meaning, as well as the dates, times, and even the permissions applied to it. Throwing the same document in SharePoint provides a ton more information about the document content by default, and typically indexes it, making search a far more powerful tool.

Published by Jim Wilcox (The Granite State Hacker)

Leave a Reply Cancel reply