Paul Krill
Editor at Large

MIT ports Tesseract OCR to JavaScript

Oct 21, 2016

Port from developers at MIT supports dozens of languages and makes it easier and cheaper to build image-processing applications

With their JavaScript port of the Tesseract optical character recognition engine, developers at MIT are looking to provide convenience and lower costs in building image-processing applications.

Tesseract.js, released this month, supports more than 60 languages, automatic text orientation, and script detection. Running in either a browser or a server via Node.js, it features a simple interface for reading paragraph, word, and character bounding boxes.

“We’ve seen people use it to build Web applications for scanning receipts, for motivational poster applications, and in general it’s useful for anything where user-supplied pictures with text on them need to be recognized or edited,” said co-developers Kevin Kwok and Guillermo Webster, students at MIT.

The developers believed there were practical reasons people might want JavaScript-based OCR. “The first reason is convenience — the C++ version of Tesseract can be tricky to install, and nearly impossible for people with rare setups or limited privileges,” the developers said. The advantage of a pure JavaScript library is it can run on pretty much any system with a JavaScript interpreter.

“The second reason is that for some applications, it’s just too expensive or painful to set up a server to offload image processing onto,” they said. “Tesseract.js lets you offload the computationally expensive task of text recognition to the client, allowing your service to scale to arbitrarily many users without having to figure out how to set up, and to pay for, compute clusters doing OCR.”

Tesseract.js is built on top of the Tesseract engine. Using the Emscripten compiler, the developers cross-compiled the Tesseract library to create tesseract.js-core and added a system to automatically download and persist language files. Computation is done in a separate thread to boost application performance.
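
The download-once idea behind the language files can be illustrated with a small sketch. This is not the library's code: the names createLanguageCache and fakeDownload are hypothetical, and the cache here lives only in memory, whereas the article describes the files being persisted.

```javascript
// Hypothetical sketch of "download once, reuse afterwards".
// `download` is any function that takes a language code and returns a promise.
function createLanguageCache(download) {
  const cache = new Map(); // language code -> promise for that language's data

  return function load(lang) {
    if (!cache.has(lang)) {
      // Store the promise itself so concurrent callers share one download.
      cache.set(lang, download(lang));
    }
    return cache.get(lang);
  };
}

// Stand-in downloader, used only so the sketch runs without a network.
let downloadCount = 0;
function fakeDownload(lang) {
  downloadCount += 1;
  return Promise.resolve('data for ' + lang);
}

const loadLanguage = createLanguageCache(fakeDownload);
loadLanguage('eng');
loadLanguage('eng'); // reuses the first request instead of downloading again
```

A persistent version would swap the Map for browser or disk storage, but the lookup-before-download logic would stay the same.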

“We tried to make the actual API layer that developers interact with as smooth and painless as possible,” the students said. “After a developer includes the script in their project, they only have to write the line: Tesseract.recognize(myImage).then(function (result) { console.log(result) }).” No boilerplate code is required for initialization, and there is no need for manual management of pointers.
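
For readers who want to see that one-liner in context, here is a minimal sketch. The helper name recognizeAndLog and the fakeOcr stand-in are hypothetical, and the shape of the result object (a text property) is an assumption, since the article does not describe it. In a page that has loaded the library, the ocr argument would be the global Tesseract object from the quote.

```javascript
// Hypothetical helper around the call quoted above. `ocr` is any object that
// exposes recognize(image) and returns a promise.
function recognizeAndLog(ocr, image) {
  return ocr.recognize(image).then(function (result) {
    console.log(result);
    return result; // pass the result on so callers can chain further steps
  });
}

// Stand-in used only to make this sketch runnable without the library.
// The { text: ... } result shape is an assumption for illustration.
const fakeOcr = {
  recognize: function (image) {
    return Promise.resolve({ text: 'recognized: ' + image });
  },
};

recognizeAndLog(fakeOcr, 'receipt.png');
```

Passing the recognizer in as an argument keeps the calling code the same whether it runs against the real library or a test double.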

The developers, though, say some users have been disappointed with Tesseract.js after a few test runs, in part because of its being geared toward use with documents and not photographs. “One of these reasons is that Tesseract was designed first and foremost for scanning documents — it really shines when it’s given high contrast, high resolution paper documents. But with photographs, it tends to get confused.” For now, it is recommended developers pre-process the images they feed into Tesseract.js to improve the contrast, scale up the resolution, and remove background noise. But the developers are looking into providing these functions as part of Tesseract.js itself, as well as adding support for more file formats.
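
As a rough illustration of that kind of pre-processing, the sketch below works on a flat RGBA pixel array, the layout used by a canvas ImageData object. The function names stretchContrast and upscale are hypothetical and are not part of Tesseract.js; nearest-neighbor scaling is used only because it is the simplest method to show, and noise removal is left out.

```javascript
// Convert RGBA pixels to grayscale and stretch the values to the full 0..255 range.
function stretchContrast(pixels) {
  const count = pixels.length / 4;
  const gray = new Uint8ClampedArray(count);
  let min = 255;
  let max = 0;
  for (let i = 0; i < count; i++) {
    // Standard luma weights for red, green, and blue.
    const y = Math.round(
      0.299 * pixels[4 * i] + 0.587 * pixels[4 * i + 1] + 0.114 * pixels[4 * i + 2]
    );
    gray[i] = y;
    if (y < min) min = y;
    if (y > max) max = y;
  }
  const range = max - min || 1; // a completely flat image maps to black
  const out = new Uint8ClampedArray(pixels.length);
  for (let i = 0; i < count; i++) {
    const v = Math.round(((gray[i] - min) * 255) / range);
    out[4 * i] = out[4 * i + 1] = out[4 * i + 2] = v;
    out[4 * i + 3] = pixels[4 * i + 3]; // keep the original alpha
  }
  return out;
}

// Enlarge an image by an integer factor using nearest-neighbor sampling.
function upscale(pixels, width, height, factor) {
  const outW = width * factor;
  const outH = height * factor;
  const out = new Uint8ClampedArray(outW * outH * 4);
  for (let y = 0; y < outH; y++) {
    for (let x = 0; x < outW; x++) {
      const src = (Math.floor(y / factor) * width + Math.floor(x / factor)) * 4;
      const dst = (y * outW + x) * 4;
      for (let c = 0; c < 4; c++) out[dst + c] = pixels[src + c];
    }
  }
  return out;
}
```

In a browser, the returned array could be wrapped in an ImageData object and drawn to a canvas before the image is handed to the recognizer.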

