XML co-creator looks to cartography design protocols to chart a user-friendly visual browse engine

See correction below.

WHEN HE MANAGED the University of Waterloo’s New Oxford English Dictionary Project, Tim Bray developed tools for processing large quantities of structured text. He founded Open Text on the basis of that technology and later became one of XML’s co-inventors. His current venture is as CTO of Vancouver, British Columbia-based Antarctica Systems. InfoWorld Test Center lead analyst Jon Udell asked him about his quest for the ultimate search tools.

Why did you start Antarctica?

I’m an old search guy; I founded Open Text back when we had one of the early Internet search engines. Then we got into the business of intranet automation and document management. But I was left unsatisfied. I decided that Web user interfaces were causing pain. Since we rolled out the blue-underlined-text and back-button paradigm a decade ago, there hasn’t been much progress. On the desktop, your data is always presented visually. But the second you step off your desktop, you’re out in the world of typing queries, pressing Enter, and looking at large lists of results. So that’s what Antarctica does: we’ve created a visual browse engine.

Why do you use maps as the metaphor for display and navigation?

It’s based on experimentation and also on the ideas of [Yale University professor emeritus and Web design guru] Edward Tufte, who pointed out that maps win in terms of the amount of information you can convey. There are two constraints on achieving the Tufte effect. First, the design of the display is an incredibly hard and subtle thing to do well. In terms of the design, you can cheat a bit, because cartography is a well-established discipline with lots of rules about how to use color and layout. Second, categorizing the underlying data is a real challenge.

You, for example, have partnered with several companies that do autocategorization.
If there’s already a taxonomy in place — and that’s true for every research institution, a lot of places in the financial sector, and manufacturers with hierarchical parts inventories — then it’s really mind-bogglingly simple. You take the data set, load it into a relational table, and twiddle the XML until you like the way the map looks.

I sense a bit of handwaving here. What sort of schema describes the structure?

Often, it’s as simple as a hierarchical Unix pathname. You’d be astounded how often that happens. In the Medline database, each of the millions of papers has a category, and the value of the category is a Unix pathname. If you look at an 11-digit part number, there is often something that looks like a Unix pathname hiding behind it. When you consider data that is valuable enough for people to be willing to invest in navigational tools, you’ll often find they’ve already invested in some categorization work.

And when they haven’t?

The classic case is the enterprise with hundreds of thousands of Word, Excel, and PowerPoint files, and very little information aside from title and date filed. You can draw a map, but it’s not very interesting. In those situations we need to work with a partner to get some categorization done: Autonomy, Semio, Stratify, Applied Semantics, Vivisimo. They all have different sweet spots. Some claim to be able to construct a semantically rich taxonomy at runtime, just by looking at the stuff. Others require that you have a fairly carefully crafted taxonomy before you go in. Then there are interesting differences in approach. Some use just Bayesian statistical rules based on word distributions. Others have huge amounts of natural language understanding. They all export results in XML, so it’s really easy to talk to them.

Is there a relevant exchange standard here, for example, XML topic maps?

Not that I’ve seen.
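The pathname-style categories Bray describes can be folded into a nested taxonomy with very little code. A minimal sketch in Python, assuming hypothetical records of (title, category-pathname) pairs — not Antarctica’s actual data model:

```python
# Sketch: turning Unix-pathname-style category values (like the Medline
# categories Bray mentions) into a nested taxonomy tree for navigation.
# The record shape and the "_items" key are assumptions for illustration.

records = [
    ("Paper A", "/medicine/cardiology/arrhythmia"),
    ("Paper B", "/medicine/cardiology/imaging"),
    ("Paper C", "/medicine/oncology"),
]

def build_taxonomy(records):
    """Fold pathname categories into nested dicts; leaves collect titles."""
    tree = {}
    for title, path in records:
        node = tree
        for part in path.strip("/").split("/"):
            node = node.setdefault(part, {})  # descend, creating levels
        node.setdefault("_items", []).append(title)
    return tree

taxonomy = build_taxonomy(records)
```

Each level of the resulting tree is one zoom level of a map-style display: `taxonomy["medicine"]["cardiology"]` holds the two cardiology papers one drill-down apart.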
However, we got a dump from a well-known and fairly antique ERP system, for parts and inventory at a large manufacturer, and it came out as beautiful, clean, easy-to-parse XML. So we do make progress in the world.

How do you pitch Antarctica to an average CTO?

We don’t think anybody should buy technology just because it looks cool. You should require plausible demonstrations that it pays for itself. We claim three benefits. First, when people look for things, they’ll find more, based on the inherent clustering that happens in a taxonomy. Second, you control the visibility rules: you can make things come to the top because they’re recent, or accessed a lot, or expensive, so people spend less time looking. Third, when people are browsing rather than searching, they’ll get where they’re going more quickly. Consider the manufacturing inventory case. They have tons of highly detailed reports coming out of the ERP system, but they can’t see where most of the capital tied up in inventory is sitting and then drill down on it geographically, organizationally, or by product line. For that kind of thing, maps win.

Correction

In this article, the following question was originally wrongly attributed: “Second, categorizing the underlying data is a real challenge. You, for example, have partnered with several companies that do autocategorization.”