On-the-fly document clustering

reviews
Oct 29, 2004
2 mins

Vivísimo Velocity brings speed, accuracy, and flexibility to enterprise search

Even the best Web search engines deliver so many hits that users overlook relevant documents or find it too bothersome to explore a subject further. Alas, many attempts to hone search results fail because they rely on document preprocessing or manual classification that introduces inaccuracies and delays. Vivísimo’s Clustering Engine sidesteps these costly taxonomy projects by organizing the results of search and database queries into meaningful, hierarchical folders on the fly.
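To make the idea of on-the-fly folders concrete, here is a minimal sketch in Python. It is not Vivísimo's algorithm, which the vendor does not describe here; it simply groups result snippets by the term they share most often, computed at query time with no preprocessing or taxonomy. The function names, stopword list, and sample hits are all hypothetical, and the output is a flat set of folders rather than a hierarchy.

```python
from collections import Counter, defaultdict
import re

# Tiny illustrative stopword list (an assumption, not from any product).
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}

def terms(text):
    """Return the set of lowercase word tokens in text, minus stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def cluster_results(query, snippets):
    """Group snippets into folders labeled by their most widely shared term."""
    query_terms = terms(query)
    # Ignore the query's own terms: every hit contains them.
    doc_terms = [terms(s) - query_terms for s in snippets]
    # Count how many snippets in this result set contain each term.
    df = Counter(t for ts in doc_terms for t in ts)
    folders = defaultdict(list)
    for snippet, ts in zip(snippets, doc_terms):
        shared = [t for t in ts if df[t] > 1]
        # Label with the most frequent shared term; break ties alphabetically.
        label = min(shared, key=lambda t: (-df[t], t)) if shared else "other"
        folders[label].append(snippet)
    return dict(folders)

# Hypothetical hits for the query "jaguar".
hits = [
    "Jaguar car dealers and prices",
    "Jaguar car reviews",
    "Jaguar habitat in the rainforest",
    "Jaguar rainforest conservation",
    "Jaguar mascot history",
]
folders = cluster_results("jaguar", hits)
for label, members in folders.items():
    print(label, members)
```

Because the folder labels are derived from the result set itself, nothing has to be classified in advance, which is the point the paragraph above makes about avoiding taxonomy projects.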

This hot technology, which powers the company’s public Clusty.com search site, now has a corporate sibling. Vivísimo Velocity is a Linux server application, front-ended by a Web administration tool, that combines dynamic document clustering, crawls of as many as one million enterprise documents, and metasearches of an unlimited number of other search engines or documents.

I was impressed with how well Velocity federated search results from internal search projects and external sources. In a few hours, I’d combined intranet searches from Convera RetrievalWare and a Google Search Appliance, several external Web sites, plus documents on a local file server. Significantly, I did this without fiddling with settings, so I believe a full implementation shouldn’t require special IT expertise.

At the next level, you can customize crawls so there’s no need to touch original content. For example, I built an XSL style sheet to recognize the existing structure of a subscription news site, so multiple articles listed on an index page were correctly recognized as separate documents and placed in the correct clusters.
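The style sheet itself isn't reproduced in this review, but the idea of treating each article on an index page as its own document can be sketched as follows. This is an illustration only: it uses Python's standard XML parser rather than XSL, and the page markup, class names, and the `split_articles` helper are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed index page listing two articles and a sidebar.
INDEX_PAGE = """
<html><body>
  <div class="article"><h2>Search vendors consolidate</h2><p>First story text.</p></div>
  <div class="article"><h2>Clustering goes mainstream</h2><p>Second story text.</p></div>
  <div class="sidebar"><p>Subscribe now</p></div>
</body></html>
"""

def split_articles(page_xml):
    """Return one record per article block found on the index page."""
    root = ET.fromstring(page_xml.strip())
    docs = []
    for div in root.iter("div"):
        # Skip navigation, sidebars, and anything else that isn't an article.
        if div.get("class") != "article":
            continue
        docs.append({
            "title": div.findtext("h2", default="").strip(),
            "body": " ".join(p.text.strip() for p in div.findall("p") if p.text),
        })
    return docs

documents = split_articles(INDEX_PAGE)
for doc in documents:
    print(doc["title"])
```

Each record can then be indexed and clustered separately, and the original page is never modified.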

Velocity appears to live up to its name with speedy implementation while precisely integrating results in easy-to-navigate clusters. Its arguably high price is far lower than the cost of multiyear search projects that never return their investment.

Vivísimo Velocity
Vivísimo
Cost: Starts at $10,000 per year for 50,000 documents
Available: End of November