Last week’s big news was Google’s acquisition of ReCaptcha. I decided to take a look at the current state of the Java captcha solutions and compare them to ReCaptcha to see what we are losing.

First of all, a JavaWorld article discusses captcha implementation in detail. In general, captchas are easy to implement: a simple one will take a skilled Java programmer just a few hours. There are two main factors to consider when installing a captcha on a web site:

1. the effort to recognize a captcha should be minimized for a human
2. the effort to recognize a captcha should be maximized for an OCR program

So I decided to research ReCaptcha and a number of other free Java captcha implementations. I installed the latest trial version of the FineReader OCR software to test the second factor, and used myself to test the first one. It should be mentioned that I used the default OCR settings; a skilled spammer would get much better results by fine-tuning the dictionaries, the character sets considered during recognition, and so on.

ReCaptcha claims to be not just spam protection but also a means of crowd computing: when you type the words, you help to OCR old issues of the New York Times. It gives you two words: one is already known to their system, and the second is one you are supposed to recognize to help them with their OCR duty.

ReCaptcha astonished me. FineReader was able to recognize about one third of their captchas. And it was able to recognize at least one of the two captcha words for almost all the samples that I took! Here are some screenshots: 1, 2 and 3.

There are several reasons for ReCaptcha’s poor performance. First of all, they use dictionary words, which greatly simplifies the OCR’s task: even if it fails to recognize a letter or two, it just looks up the most similar word in a dictionary. Another reason is the lack of color in the captchas; however, this is most probably because of color-blind people: they will sue anyone who restricts their access.

Then I assessed JCaptcha.
The first example captcha from their site was recognized pretty well: 1. Others failed, especially the colorized ones, so I think it’s pretty safe, and they look human-friendly too. Oh, probably not too friendly for the color-blind…

Another Java captcha is Simple Captcha. None of the samples from their website were recognized by the OCR, but then I installed their example web application, and some of the samples it produced were OCR-recognized, though not without human help, since I had to select the text area myself: 1.

There are a number of other small Java captcha implementations.

Kaptcha lists JCaptcha and Simple Captcha on its front page as not-so-good solutions and claims to be a better one. Its captchas turned out to be really hard for the OCR to recognize, and were human-readable too.

DuCaptcha claims to be specifically tested against OCR software, and that seems to be true: not even one letter gets recognized. However, it uses colors to disguise the characters, which yet again might not be appropriate for the color-blind.

To wrap up: there is a wide range of customizable Java captcha implementations. I would not use ReCaptcha myself, because it is hosted on third-party servers and is not a very reliable protection against OCR-armed spammers. If you really want to help them recognize the old New York Times issues, get them a FineReader license 🙂
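
A footnote on the dictionary weakness mentioned for ReCaptcha: to show why dictionary words make the OCR's job so much easier, here is a minimal sketch of the kind of lookup I had in mind. This is my own assumption of how such a correction step could work, not FineReader's actual algorithm. It simply picks the dictionary word with the smallest Levenshtein edit distance to the raw OCR output. The class name `DictionaryGuess`, the method names and the tiny word list are all made up for illustration.

```java
import java.util.List;

// Hypothetical illustration only: not taken from FineReader or any captcha library.
final class DictionaryGuess {

    // Classic Levenshtein edit distance, computed with two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning an empty prefix into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into an empty string takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            // Swap rows so that prev always holds the most recently completed row.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the dictionary word closest to the (possibly wrong) OCR output,
    // or null if the dictionary is empty. Ties go to the earlier word.
    static String closestWord(String ocrOutput, List<String> dictionary) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String word : dictionary) {
            int distance = editDistance(ocrOutput, word);
            if (distance < bestDistance) {
                bestDistance = distance;
                best = word;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up word list; a real dictionary would hold many thousands of words.
        List<String> dictionary = List.of("morning", "warning", "reading", "issue");
        // "rnorning" imitates an OCR that misread the letter "m" as "rn".
        System.out.println(closestWord("rnorning", dictionary));
    }
}
```

With a captcha made of random characters there is no dictionary to fall back on, so a single misread letter means a failed guess. That is presumably part of why the random-string captchas above held up better in my test.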