by Jon Udell

Modeling biz docs in XML

analysis
Nov 29, 2002
6 mins

Learning XML Schema won't be easy, but don't let that stop you

THE GOOD NEWS is that Office 11 supports XML Schema. The bad news is that XML Schema has been described even by XML experts as “confusing,” “impenetrable,” “fuzzy,” and “as user-friendly as a stick in the eye.” A successor to the SGML/XML DTD (Standard Generalized Markup Language/XML document type definition), XML Schema is a language for writing rules that constrain the kinds of elements that can appear in documents and the ways in which they can be sequenced, grouped, and nested.
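To make that concrete, here is a minimal sketch of what such rules can look like. The expense-report vocabulary (`expenseReport`, `employee`, `item`, `amount`) is invented for illustration and is not drawn from Office 11. Because a schema is itself an XML document, Python's standard library can parse it and list what it declares. Validating a document against it would take a schema-aware parser, which the standard library does not include.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# Hypothetical schema: an expense report is an employee name followed by
# one or more items, each with a description and a decimal amount.
SCHEMA = """\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="expenseReport">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="employee" type="xs:string"/>
        <xs:element name="item" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="description" type="xs:string"/>
              <xs:element name="amount" type="xs:decimal"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""


def declared_elements(schema_text):
    """Return (name, type) pairs for every xs:element, in document order."""
    root = ET.fromstring(schema_text)
    return [
        (el.get("name"), el.get("type", "(anonymous complex type)"))
        for el in root.iter(f"{{{XS}}}element")
    ]


if __name__ == "__main__":
    for name, type_name in declared_elements(SCHEMA):
        print(f"{name}: {type_name}")
```

The nesting of `xs:sequence` inside `xs:complexType` is what expresses "these children, in this order," and `maxOccurs` is what expresses repetition.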

XML Schema is still a relatively new specification. The W3C Recommendation for XML Schema was published in May 2001. XML parsers that support XML Schema haven’t done so for very long, and there is not yet much experience using it. Most people who are adept at defining document structure learned how to do so by writing DTDs. Some of the allergic reaction to XML Schema can, therefore, be chalked up to normal reluctance to learn new skills.

Of course, it’s hard to work up a lot of nostalgia for the DTD legacy. Adjectives such as “confusing” and “impenetrable” were also flung at SGML DTDs. Back in the day, more than a few large document management projects (like too many modern ERP systems) produced a lot of sound and fury, signifying nothing. The fact is that, although sets of documents do exhibit databaselike properties that we can usefully formalize and exploit, this kind of information management is still in its infancy.

Boeing, one notable exception, has always understood that documentation is integral to its business. The company likes to joke that a jet is “five million parts flying in formation.” The documents that describe that inventory are themselves part of the inventory, and are engineered accordingly. Applying that same discipline to routine business documents such as résumés, expense reports, and purchase orders, though, was never a serious option. Sure, it would be nice to tag all this stuff for intelligent search, aggregation, and data mining. But there were no general-purpose tools for tagging documents that are individually low-value (albeit collectively high-value), and no business case could be made for creating special-purpose tools to do that instrumentation. Office 11, which aims to bring special-purpose capability to general-purpose tools, is arguably one of the most disruptive technologies in the pipeline.

“Got a question?” writes Phil Windley, CIO of the State of Utah, on his Weblog. “Somewhere, on some government computer, the information you need is probably available. Information you paid for and the government would gladly share with you — if only they could find it.” Upgrading the word processors and spreadsheets on those government computers to versions that not only can read and write XML, but, more crucially, can enforce rules about datatypes and structures, is part of the solution. Assuming, of course, that such rules can be written, deployed, and unobtrusively applied and maintained over time. “Therein,” observes Windley, “lies the rub.”

There is very little extant knowledge about how to model unstructured and semistructured data in XML. Unlike SGML, the XML DTD was always optional, because the framers of XML knew there was enormous value in documents that were merely well-formed, even if not valid with respect to a DTD. RSS (Rich Site Summary), for example, the wildly popular XML format for content syndication, has no DTD or schema. “Many on this list will find it shocking,” wrote XML co-inventor Tim Bray on the xml-dev mailing list ( https://lists.xml.org ), “but lots of important XML dialects don’t have any DTDs or schemas. People e-mail back and forth some examples, they cut some code, and then everything’s working and they’re too busy to go back and write a schema.” Although he doesn’t wholly approve of this practice, Bray is a realist who understands that it happens often, and it can yield good results. But if schemas don’t exist, applications can’t enforce them. So where are the schemas going to come from?

One possibility is to infer schemas from example documents. Tools can do this, but so far, not with much sophistication. Microsoft, for example, offers a .Net namespace (Microsoft.XsdInference) that will infer a schema from an XML document, and even refine that schema based on further examples. The results make a useful starting point, and inferencing is a promising technology that can and should evolve, but the fact is that modeling XML data is a complex subject that even the best human experts have yet to codify. XML Schema delivers a much richer set of modeling tools than were available to DTD authors. Learning to use them well is going to be a challenge.
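The idea of inference, and of refining the result with further examples, can be sketched in a few lines. This is a deliberately naive illustration and not how Microsoft.XsdInference works; the function names `guess_type` and `infer_model` and the three-step type ladder are assumptions made for the example. It records which children each element has been seen to contain and widens a leaf's guessed type when a later example doesn't fit.

```python
import re
import xml.etree.ElementTree as ET

ORDER = ["integer", "decimal", "string"]  # narrowest to widest


def guess_type(text):
    """Guess the narrowest of three simple types that fits a text value."""
    text = text.strip()
    if re.fullmatch(r"[+-]?\d+", text):
        return "integer"
    if re.fullmatch(r"[+-]?(\d+(\.\d*)?|\.\d+)", text):
        return "decimal"
    return "string"


def infer_model(documents, model=None):
    """Fold example documents into {tag: {"children": set, "type": str or None}}.

    Pass a previously inferred model to refine it with further examples.
    Elements are keyed by name only, so same-named elements in different
    contexts are merged: one of many simplifications in this sketch.
    """
    model = {} if model is None else model
    for doc in documents:
        for el in ET.fromstring(doc).iter():
            entry = model.setdefault(el.tag, {"children": set(), "type": None})
            entry["children"].update(child.tag for child in el)
            if len(el) == 0 and el.text and el.text.strip():
                seen = guess_type(el.text)
                current = entry["type"]
                # Widen the type when a new example doesn't fit the old guess.
                if current is None or ORDER.index(seen) > ORDER.index(current):
                    entry["type"] = seen
    return model


if __name__ == "__main__":
    first = "<report><item><amount>12</amount></item></report>"
    second = "<report><item><amount>7.50</amount><note>taxi</note></item></report>"
    model = infer_model([first])
    print(model["amount"]["type"])            # integer
    model = infer_model([second], model)
    print(model["amount"]["type"])            # decimal
    print(sorted(model["item"]["children"]))  # ['amount', 'note']
```

Even this toy shows why inferred schemas are only a starting point: nothing in two examples reveals whether `note` is optional, whether `amount` may be negative, or whether order matters. Those are modeling decisions a person still has to make.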

One of the great strengths of XML Schema, for example, is its support for regular expressions, the protean pattern-matching technology that helped Perl dominate the first-generation dynamic Web. However, what is true for Perl and other regular-expression-savvy languages will also be true for XML Schema: Although it’s tempting to use complex patterns, simple ones are best for maintenance and reuse.
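As a small illustration, the sketch below mimics how a schema's pattern constraint behaves: the expression must match the entire value, not just part of it. The part-number and postal-code formats are invented for the example, and `matches_pattern` is a hypothetical helper. Python's regular-expression dialect is not identical to XML Schema's, but these simple patterns mean the same thing in both.

```python
import re

# Hypothetical formats, invented for illustration. A schema would carry
# each expression in a pattern facet on a simple type.
PART_NUMBER = r"[A-Z]{2}-\d{4}"
ZIP_CODE = r"\d{5}(-\d{4})?"


def matches_pattern(pattern, value):
    """Mimic a schema pattern constraint, which must match the whole value."""
    return re.fullmatch(pattern, value) is not None


if __name__ == "__main__":
    print(matches_pattern(PART_NUMBER, "AB-1234"))   # True
    print(matches_pattern(PART_NUMBER, "AB-12345"))  # False
    print(matches_pattern(ZIP_CODE, "03431-1234"))   # True
```

Short, named patterns like these are easy to read a year later and easy to reuse across document types, which is the maintenance argument in a nutshell.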

RDBMS experts who approach XML Schema will need to adapt their thinking in a number of ways. For example, in XML Schema, uniqueness constraints can apply at any level of a nested structure. XPath expressions are used to bind those constraints to their targets.
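A rough emulation shows what "any level" means in practice. In the sketch below, line items must have unique SKUs within each order, yet the same SKU may appear in different orders. That is a scoped rule with no direct equivalent in a table-wide unique index. The function `unique_violations`, the order vocabulary, and the simplification of naming the field as a bare attribute are all assumptions for illustration. A real schema would declare the constraint on the `order` element using restricted XPath expressions for the selector and field.

```python
import xml.etree.ElementTree as ET


def unique_violations(root, scope, selector, field):
    """Emulate a uniqueness constraint declared on `scope` elements.

    Within each `scope` element, the nodes chosen by the `selector` path
    must have distinct values for the attribute named by `field`.
    Returns a list of (scope index, duplicated value) pairs.
    """
    violations = []
    for index, scope_el in enumerate(root.iter(scope)):
        seen = set()
        for node in scope_el.findall(selector):
            value = node.get(field)
            if value is None:
                continue  # nodes lacking the field are not constrained
            if value in seen:
                violations.append((index, value))
            seen.add(value)
    return violations


DOC = """
<orders>
  <order id="1"><line sku="A1"/><line sku="B2"/></order>
  <order id="2"><line sku="A1"/><line sku="A1"/></order>
</orders>
"""

if __name__ == "__main__":
    root = ET.fromstring(DOC)
    print(unique_violations(root, "order", "line", "sku"))  # [(1, 'A1')]
```

SKU `A1` appearing in both orders is not flagged; only the repetition inside the second order is.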

Object-oriented programmers will appreciate the way in which XML Schema permits the derivation of specific types from more general ones. But they will also find, as elsewhere, that there are limits to the use of inheritance, and that design by composition — rather than by derivation — is often the better strategy.

XML Schema arguably ought to have been simpler. James Clark, who was technical lead of the XML working group and editor of the XPath and XSLT Recommendations, clearly thinks so. He has championed an alternative schema language, RELAX NG (Regular Language Description for XML, Next Generation), which is now on the Organization for the Advancement of Structured Information Standards (OASIS) and ISO standards tracks. RELAX NG aims to simplify the description of XML structures, but relies on XML Schema for the definition of datatypes.

There is a real danger that enterprises, seeing too many approaches to XML data modeling, will wait for the dust to settle. That would be a shame. Yes, it’s a hard problem, but we’ll have to tackle it sooner or later. Web services won’t fly until we can usefully model real business documents. That’s something we can only learn by doing in a hands-on laboratory such as Office 11.