Defeating Digg.com's Captcha
I just wrote up a short blog article on how to defeat Digg.com's weak CAPTCHA. If you use a CAPTCHA, at least use a strong one. http://bhiv.com/2005/09/30/defeating-diggs-captcha/
---
While using digg, I was surprised to see such an obviously weak CAPTCHA challenge. I was able to create a script that defeats it with 88% accuracy within a couple of hours using nothing but free software. (If you are looking for code, forget it. This is almost too much information.)
Digg's CAPTCHA Weaknesses:
1. Dictionary Words
2. Same background
3. Same Font
4. No deformations
5. All lowercase letters
6. Constant colors
Tools
* gocr - a GPL Optical Character Recognition program
* ImageMagick - for command-line image editing
* Perl - to tie everything together
Sample Size
100 images with 95 different words with an average word length of 5.3 letters.
First Test
Just dumping all the images through gocr yielded 26% correct responses. Not too shabby. It yields some easily manipulated results:
* http://bhiv.com/wp-content/digg-captcha-groups.jpg = groUDS
* http://bhiv.com/wp-content/digg-captcha-single.jpg = single
* http://bhiv.com/wp-content/digg-captcha-police.jpg = t o.l,i.c,e . . ... ,
* http://bhiv.com/wp-content/digg-captcha-because.jpg = be.cause.
Looking at the results, I'm sure we could improve them with a little string manipulation.
Tweaking output
We can mess with the output to yield better results:
* Convert all output to lowercase
* Remove non-letter characters
* Spell checker
The first two yield 53% correct responses; with just this simple tweak we get more correct guesses than incorrect ones. Adding the spell checker's first suggestion bumps the accuracy to 67%.
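The three output tweaks can be sketched in a few lines of Python. This is not the script described here (that one was written in Perl and is not published); it is a minimal illustration that assumes a tiny hypothetical word list and uses `difflib` from the standard library as a stand-in for a real spell checker's first suggestion.

```python
import difflib
import re


def normalize(ocr_text):
    """Lowercase the OCR output and drop everything that is not a letter a-z."""
    return re.sub(r"[^a-z]", "", ocr_text.lower())


def correct(ocr_text, dictionary):
    """Return the closest dictionary word, or the cleaned text if none is close."""
    cleaned = normalize(ocr_text)
    if cleaned in dictionary:
        return cleaned
    # Assumption: difflib's closest match plays the role of the spell
    # checker's first guess. The 0.6 cutoff is difflib's default.
    matches = difflib.get_close_matches(cleaned, dictionary, n=1, cutoff=0.6)
    return matches[0] if matches else cleaned


# Hypothetical four-word dictionary built from the sample outputs above.
WORDS = ["groups", "single", "police", "because"]

if __name__ == "__main__":
    for raw in ["groUDS", "single", "t o.l,i.c,e . . ... ,", "be.cause."]:
        print(repr(raw), "->", correct(raw, WORDS))
```

With a real dictionary of a few thousand words the closest match will sometimes be the wrong word, which is one reason the accuracy stops at 67% instead of climbing higher.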
Tweaking input
* Removing border
* Adjusting contrast and brightness
* Using edge detection
So http://bhiv.com/wp-content/digg-captcha-groups.jpg becomes http://bhiv.com/wp-content/digg-captcha-groups2.jpg
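A rough idea of how one input variation might be produced with ImageMagick and fed to gocr is sketched below in Python. The border width, contrast value, and edge radius are guesses, not the settings used for the results here, and the file names are hypothetical. The output is written as PNM because gocr reads that format natively.

```python
import subprocess


def build_convert_command(src, dst, border_px=1, brightness=0,
                          contrast=30, edge_radius=0):
    """Build an ImageMagick command line for one input variation.

    All default values are illustrative assumptions.
    """
    cmd = [
        "convert", src,
        "-shave", f"{border_px}x{border_px}",          # trim the border
        "-brightness-contrast", f"{brightness}x{contrast}",
    ]
    if edge_radius > 0:
        cmd += ["-edge", str(edge_radius)]             # optional edge detection
    cmd.append(dst)
    return cmd


def ocr_variation(src, dst, **settings):
    """Run convert, then gocr, and return the raw OCR text.

    Requires ImageMagick and gocr to be installed and on the PATH.
    """
    subprocess.run(build_convert_command(src, dst, **settings), check=True)
    result = subprocess.run(["gocr", "-i", dst], check=True,
                            capture_output=True, text=True)
    return result.stdout.strip()


if __name__ == "__main__":
    # Only prints the command; nothing is executed here.
    print(" ".join(build_convert_command("captcha.jpg", "captcha.pnm",
                                         edge_radius=1)))
```

Keeping the command builder separate from the part that runs it makes it easy to generate a handful of variations by changing only the keyword arguments.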
Since we are already over two-thirds accurate, we don't need to adjust the input of every image, just the ones whose results aren't dictionary words. Part of the problem is that while one adjustment will improve results for one image, it will degrade the results for another. My solution was to try 10 variations, run each through the OCR, and then spell check the output. I then had the program pick the result that showed up more than once; in the case of a tie or no duplicate, I had the program pick the one with the fewest variations. This method resulted in the final accuracy of 88%.
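The selection step amounts to a simple vote. The Python sketch below shows one way it could work. The fallback rule is an interpretation: it assumes the candidates are listed from the least to the most modified image, and on a tie or with no duplicates it returns the first one.

```python
from collections import Counter


def pick_answer(candidates):
    """Pick one answer from the spell-checked OCR results of all variations.

    candidates: list of strings, assumed to be ordered from the least
    modified image to the most modified one.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    counts = Counter(candidates)
    best = max(counts.values())
    winners = [word for word in counts if counts[word] == best]
    # A single result that appears more than once wins the vote.
    if best > 1 and len(winners) == 1:
        return winners[0]
    # Tie or no duplicates: fall back to the least modified image (assumption).
    return candidates[0]


if __name__ == "__main__":
    print(pick_answer(["grouds", "groups", "groups", "groupe"]))
```

The vote helps because wrong readings tend to differ from one variation to the next, while correct readings agree with each other.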
Problems with this technique
While these quick results come close to being usable, they are still a far cry from 100% accuracy. Since digg uses a consistent font, I could train gocr on problematic letters (such as p). Also, given that 100 images contained 5 sets of duplicate words, I would estimate their dictionary is only a couple thousand words, so I could hand tweak the results.
Other resources
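The dictionary size guess can be sanity-checked with a birthday-problem style calculation. The sketch below assumes words are drawn uniformly at random and treats each of the 5 duplicate sets as a single matching pair; both are simplifications. Among n draws from N words the expected number of matching pairs is n(n-1)/(2N), so N is roughly n(n-1) divided by twice the observed pairs.

```python
def estimate_dictionary_size(samples, duplicate_pairs):
    """Rough dictionary size from the number of repeated words in a sample.

    Assumes uniform random draws and that each duplicate is one matching pair.
    """
    if duplicate_pairs <= 0:
        raise ValueError("need at least one duplicate pair to estimate")
    return samples * (samples - 1) / (2 * duplicate_pairs)


if __name__ == "__main__":
    print(estimate_dictionary_size(100, 5))  # prints 990.0
```

That lands near a thousand words, the same rough order of magnitude as the estimate above, and small enough to correct by hand.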
* PWNtcha - a project to build a captcha decoder
* Breaking a Visual CAPTCHA - the breaking of EZ-Gimpy CAPTCHAs
Disclaimer
I did contact digg last week to let them know I would be publishing this and offered them the opportunity to have it delayed while they upgraded their CAPTCHA. I haven't heard from them as of now. I still offer them the opportunity to contact me and I will temporarily remove this article.
---
If you like it you could always digg it at -> http://digg.com/security/Defeating_D...80%99s_CAPTCHA