by Jon Udell

The power of voice

analysis
Dec 13, 2002

Fast-Talk Communications brings full-text search to audio recordings

Cheap storage makes it feasible to save voice recordings of many of our meetings, teleconferences, interviews, and other conversations. In some environments — call centers and certain sectors of finance and government — that already happens. But audio surveillance isn’t yet routine, and the thorny legal, social, and cultural issues it raises haven’t yet been widely debated. That’s because, until now, there was no practical way to mine voice data.

As with other forms of practical obscurity, this artificial barrier was bound to topple, and now it has. Fast-Talk Communications’ revolutionary phonetic indexing and search technology brings the magic of full-text search to the formerly opaque realms of audio recordings and video soundtracks. If you consider the way in which Google has already become everyone’s indispensable “outboard brain,” and extrapolate that to all the voice data that exists — and to the vast quantities that soon will exist — it’s hard to avoid the conclusion that Fast-Talk is one of the most disruptive technologies in the pipeline.

A phonetic search engine

What Fast-Talk sells is an engine and a software development kit, not an end-user product. The kit includes a “technology demo,” however, which is a fully functional tool that has changed how I work in a dramatic way. Though I’ve been a journalist on and off for many years, I had never integrated audio recording into my routine. Finding quotes in those recordings was a painful process, and sending them out for transcription (as my InfoWorld colleagues routinely do) incurred delay and expense. So, being a fast typist, I just captured what I needed live. That technique was stressful, not always accurate, and obviously not appropriate for most people. So when I interviewed Antarctica Systems CTO Tim Bray recently for InfoWorld’s CTO Zone (see “Mapping the future”), I used Fast-Talk to record, index, and then search the conversation.

The Fast-Talk engine can work with multiple audio formats, using pluggable “media accessors” to encapsulate them. The technology demo supports only WAV files, which it indexes to create PAT (phonetic audio track) indexes. If you want to search video, Fast-Talk recommends using VirtualDub, an open-source program, to extract the audio track as a WAV file. You can use Fast-Talk’s demo to index pre-existing WAV files or, as I did, to index a WAV file while recording. This near-real-time indexing meant I was able to begin searching the index as soon as the 45-minute conversation ended. That was true because Fast-Talk’s phonetic technology is orders of magnitude faster than the conventional alternative: speech-to-text translation followed by text indexing.

Like many great innovations, Fast-Talk is simple to describe. Phonemes are the basic units of sound in a language, and North American English has 39 of them. You can look up a word’s phonetic spelling in the Carnegie Mellon dictionary (see Kevin Lenzo’s Web site at www.speech.cs.cmu.edu/cgi-bin/cmudict). “Dictionary,” for example, works out to “D IH K SH AH N EH R IY.” Fast-Talk’s indexer recognizes phonemes and notes the time of their occurrence. The searcher converts text input to phoneme strings, looks for them, and returns their time-codes. It’s as simple — and brilliant — as that.
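To make the index-then-search idea concrete, here is a toy sketch in Python. It is not Fast-Talk's code or API. The pronunciation table is a tiny hand-written stand-in for the Carnegie Mellon dictionary, `fake_index` simulates what a recognizer would emit, and the exact-match search is an assumption made for brevity (a real engine has to tolerate recognition errors).

```python
from dataclasses import dataclass

# Hand-written stand-in for a pronunciation dictionary (CMU-style symbols,
# stress markers omitted). Hypothetical and deliberately tiny.
PRONUNCIATIONS = {
    "dictionary": ["D", "IH", "K", "SH", "AH", "N", "EH", "R", "IY"],
    "four": ["F", "AO", "R"],
    "hours": ["AW", "ER", "Z"],
}


@dataclass(frozen=True)
class TimedPhoneme:
    symbol: str      # phoneme symbol, e.g. "SH"
    seconds: float   # offset from the start of the recording


def fake_index(timed_words, phoneme_seconds=0.08):
    """Simulate an indexer: expand (start_time, word) pairs into timed phonemes."""
    track = []
    for start, word in timed_words:
        for n, symbol in enumerate(PRONUNCIATIONS[word]):
            track.append(TimedPhoneme(symbol, round(start + n * phoneme_seconds, 2)))
    return track


def text_to_phonemes(query):
    """Convert typed text to one flat phoneme list via dictionary lookup."""
    phonemes = []
    for word in query.lower().split():
        phonemes.extend(PRONUNCIATIONS[word])
    return phonemes


def search(track, query):
    """Return the time-code at which each exact match of the query begins."""
    target = text_to_phonemes(query)
    symbols = [p.symbol for p in track]
    hits = []
    for i in range(len(symbols) - len(target) + 1):
        if symbols[i:i + len(target)] == target:
            hits.append(track[i].seconds)
    return hits


# A made-up recording: "four hours" spoken at 12 s, "dictionary" at 40 s.
track = fake_index([(12.0, "four"), (12.3, "hours"), (40.0, "dictionary")])
print(search(track, "four hours"))
print(search(track, "dictionary"))
```

Note that the search never touches a transcript. It compares phoneme symbols only, which is why no speech-to-text pass is needed in this scheme.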

Fast-Talk in action

When my interview with Tim Bray was done, the first segment I looked for was the one where Bray said, “Jean Paoli spent four hours showing me XDocs.” The name “Jean Paoli” was, not surprisingly, ineffective as a search term. But “four hours” found the segment instantly, as did “fore ours” — which of course resolves to the same string of phonemes. “Zhawn Powli” also worked, illustrating what will soon become a new strategy for users of voice-aware search engines: When in doubt, spell it out phonetically. In practice, I found myself resorting to this strategy less often than I’d have expected, and it was fairly obvious when to do so. I guessed correctly that “MySQL” would not work, for example, but that “my sequel” would.
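The homophone behavior is easy to reproduce with a small sketch. The lexicon below is hypothetical and hand-written in CMU-style symbols, and treating “MySQL” as simply missing from the lexicon is an assumption about why it fails, not a description of how Fast-Talk handles unknown words.

```python
# Hypothetical mini-lexicon (CMU-style symbols, stress markers omitted).
LEXICON = {
    "four": "F AO R",
    "fore": "F AO R",
    "hours": "AW ER Z",
    "ours": "AW ER Z",
    "my": "M AY",
    "sequel": "S IY K W AH L",
}


def phoneme_string(text):
    """Return the phoneme string for text, or None if any word is unknown."""
    parts = []
    for word in text.lower().split():
        if word not in LEXICON:
            return None  # assumption: an unknown spelling cannot be searched
        parts.append(LEXICON[word])
    return " ".join(parts)


def sounds_alike(a, b):
    """True when both inputs resolve to the same phoneme string."""
    pa, pb = phoneme_string(a), phoneme_string(b)
    return pa is not None and pa == pb


print(sounds_alike("four hours", "fore ours"))
print(phoneme_string("MySQL"))
print(phoneme_string("my sequel"))
```

Under these assumptions, respelling a term phonetically works because the searcher only ever sees the resulting phonemes, never the original spelling.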

The query language is dead simple, but there’s an interesting twist on proximity. In a conventional search engine, proximity means “find a word within so many words of another word.” In Fast-Talk’s engine, it means “find a string of phonemes within so many seconds of another string of phonemes.”
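Time-based proximity can be sketched in a few lines of Python. The function name, the hit lists, and the five-second window are all made up for illustration. Fast-Talk's actual query syntax is not shown in this article, so nothing here should be read as its API.

```python
def within_seconds(hits_a, hits_b, window):
    """Pair up time-codes from two searches that fall within `window` seconds."""
    return [(a, b) for a in hits_a for b in hits_b if abs(a - b) <= window]


# Hypothetical time-codes (in seconds) returned by two separate phoneme searches.
first_term_hits = [754.2, 1310.0]
second_term_hits = [757.9, 2100.5]

print(within_seconds(first_term_hits, second_term_hits, window=5))
```

Measuring distance in seconds rather than words fits the data: a phonetic index has time-codes but no word boundaries to count.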

I was unable to find any variant of “XDocs,” but I chalk that up to the recording’s poor quality — I was testing an IP phone at the time. There were some dropouts, and “XDocs” came during one of them. The marginal recording quality was, in fact, an excellent test. Like most people, I have no special audio engineering skill and no special recording equipment. To succeed in the real world, Fast-Talk will have to work well with whatever raw material it can get — and it does. Although it is tuned for North American English, the international nature of our industry made it inevitable that I would push those limits. Sure enough, the accents I threw at it included Ximian CTO Miguel de Icaza’s (Mexican), OpenLink Software CEO Kingsley Idehen’s (Nigerian/British), and Systinet CEO Roman Stanek’s (Czech), with usable results in each case. It’s preferable, of course, to have a high-quality recording of a native speaker of North American English. When I indexed a well-modulated phone conversation that Test Center Director Steve Gillmor had with Microsoft’s Mark Lucovsky, the results were simply uncanny.

Developers will find Fast-Talk to be a clean, well-documented toolkit. The engine is packaged as a static link library for use in Microsoft’s C++ environment, and from other languages by way of a COM (Component Object Model) wrapper. (There’s not yet a managed interface for .Net, but C# or Visual Basic .Net programmers can use the COM API.) The API supports multithreading so that indexing and search tasks can be parceled out to a set of processors. Non-Windows packaging of the engine, when needed, will be straightforward to produce.

Call centers are obvious first candidates for the Fast-Talk treatment. “Think about running a support center,” says Patrick Taylor, Atlanta-based Fast-Talk’s vice president of sales and marketing. In theory, answers to hard questions are written down in a knowledge base. In practice, that rarely happens. “It’s compelling to just index everything that’s said by the best experts,” suggests Taylor, “so you can instantly find where they mention, say, NT kernel error 304.”

Clearly, that’s just the tip of the iceberg. The implications are both exhilarating and frightening. “This business of recording everything scares the bejesus out of me,” says Ray Ozzie, CEO of Groove Networks in Beverly, Mass. With entry-level deployment of Fast-Talk starting at $10,000, routine meetings and phone calls won’t be indexed anytime soon. But it’s coming, and it is scary. As always, great power brings great responsibility. The genie’s out of the lamp, though, so we’ll just have to learn to use this new power well.