Decoding Weak CAPTCHAs
TL;DR: Weak CAPTCHA services used on the internet can be solved programmatically with a fairly high success rate.
Problem
A lot of firms, including mine, have begun to recommend that CAPTCHAs be used on all web forms that feed into existing business processes (registration pages, contact pages, etc.). This recommendation can be a double-edged sword, because several CAPTCHA services still use weak CAPTCHAs that can be readily decoded with modern image analysis techniques. That being said, several individuals have asked whether there is a systematic way to test the strength of a given CAPTCHA to determine whether it is weak.
Solution
Two major methodologies are currently in wide use for decoding weak CAPTCHAs. The first removes the noise from a CAPTCHA image by reversing the programmatic functions used to add the visual distortions, then simply compares each character to a set of known sample characters. This method relies heavily on evaluating each weak CAPTCHA service and creating a reliable set of functions to solve that service’s CAPTCHAs. The best tool for using this technique to test for known weak CAPTCHA types is pwntcha.
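To make the first approach concrete, here is a minimal sketch of the "clean up, then compare to known samples" idea. It is not pwntcha's code: the 5x5 glyphs, the speckle filter, and the pixel-difference score are all assumptions chosen to keep the example small.

```python
# Hypothetical 5x5 reference glyphs; "1" is ink, "0" is background.
TEMPLATES = {
    "I": ["00100", "00100", "00100", "00100", "00100"],
    "L": ["10000", "10000", "10000", "10000", "11111"],
    "T": ["11111", "00100", "00100", "00100", "00100"],
}


def denoise(glyph):
    """Clear every ink pixel that has no ink pixel among its 8 neighbours."""
    height, width = len(glyph), len(glyph[0])

    def is_ink(r, c):
        return 0 <= r < height and 0 <= c < width and glyph[r][c] == "1"

    cleaned = []
    for r in range(height):
        row = ""
        for c in range(width):
            neighbours = sum(
                is_ink(r + dr, c + dc)
                for dr in (-1, 0, 1)
                for dc in (-1, 0, 1)
                if (dr, dc) != (0, 0)
            )
            row += "1" if is_ink(r, c) and neighbours > 0 else "0"
        cleaned.append(row)
    return cleaned


def distance(a, b):
    """Count the pixels that differ between two glyphs of the same size."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def classify(glyph):
    """Denoise the glyph and return the template character closest to it."""
    cleaned = denoise(glyph)
    return min(TEMPLATES, key=lambda ch: distance(cleaned, TEMPLATES[ch]))


# An "L" with one stray noise pixel at row 1, column 3.
noisy_l = ["10000", "10010", "10000", "10000", "11111"]
print(classify(noisy_l))  # L
```

A real CAPTCHA needs a filter written for that service's specific distortions, which is why this approach has to be built one service at a time.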
The second methodology uses vector-based image analysis to compare each pixel’s location to the location expected for each possible character. After consolidating all of these pixel location checks, each possible character is ranked by its probability of being correct. The success of this method relies heavily on the use of a reference font, so if the reference font and the CAPTCHA’s font are substantially different, the analysis won’t go well. The best freely available tool I’ve found that uses this technique to test the strength of CAPTCHAs is captcha-decoder.
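The ranking step can be sketched roughly as follows. This assumes each glyph is flattened into a pixel vector and scored against every reference-font glyph with cosine similarity; that scoring choice, the 3x3 glyphs, and all names here are illustrative assumptions, not captcha-decoder's actual implementation.

```python
import math

# Hypothetical 3x3 reference-font glyphs; "1" is ink, "0" is background.
REFERENCE_FONT = {
    "I": ["010", "010", "010"],
    "K": ["101", "110", "101"],
    "L": ["100", "100", "111"],
    "X": ["101", "010", "101"],
}


def to_vector(glyph):
    """Flatten a glyph (list of row strings) into a list of 0/1 ints."""
    return [int(pixel) for row in glyph for pixel in row]


def cosine_similarity(a, b):
    """Cosine of the angle between two pixel vectors (0.0 if either is empty)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_candidates(glyph):
    """Return (character, score) pairs, best match first."""
    sample = to_vector(glyph)
    scores = {
        ch: cosine_similarity(sample, to_vector(ref))
        for ch, ref in REFERENCE_FONT.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# A "K" that lost its bottom-right pixel, e.g. from thin or uneven strokes.
thin_k = ["101", "110", "100"]
for ch, score in rank_candidates(thin_k):
    print(ch, round(score, 3))
```

In this toy case "K" still ranks first, but "X" is a close second. That shows how a font that differs from the reference font can tip the ranking toward the wrong character.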
How to use
Unfortunately, almost every CAPTCHA implementation is going to be different enough to make web scraping a sample set of CAPTCHA images difficult. Thus the first step is always going to be downloading three to five CAPTCHA images for testing.
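For the simple case where a site you are authorized to test serves a fresh CAPTCHA image from a single URL on every request, collecting the samples could look something like the sketch below. That behaviour is an assumption that will not hold for many implementations, and the function names, file naming, and `.png` suffix are hypothetical.

```python
import pathlib
import urllib.request


def fetch_bytes(url):
    """Fetch the raw bytes served at url."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read()


def download_samples(url, count=5, out_dir="samples", suffix=".png", fetch=fetch_bytes):
    """Request url `count` times and save each response as a separate file.

    Assumes the endpoint returns a new CAPTCHA image on every request.
    `fetch` can be swapped out, for example to add cookies or headers.
    """
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for index in range(count):
        path = out / f"captcha_{index:02d}{suffix}"
        path.write_bytes(fetch(url))
        paths.append(path)
    return paths


# Example (placeholder URL, replace with the endpoint under test):
# download_samples("https://example.com/captcha.png", count=5)
```

When the image URL is tied to a session or a one-time token, saving the images by hand from the browser is usually quicker.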
Then we can run each image through pwntcha and see if it can identify the image as a known weak CAPTCHA type.
pwntcha <img>
Test run using PayPal’s known weak CAPTCHA samples: 100/100
Test run using vBulletin’s known weak CAPTCHA samples: 100/100
Last, we can run captcha-decoder on each of the sample images to get an idea of whether vector-based analysis is going to be successful. You will have to use your best judgment once you receive the results to determine whether the risk is high enough to raise an issue. Generally, if all the correct letters are guessed with over 70% confidence, the CAPTCHA should be considered weak. However, an organization may believe 70% is too high a bar and may have a much lower tolerance.
decaptcha <img or img url> --min 0 --max 20 --limit 5 --channels 5 --tolerance 7
Current font test image on mondor (a public API and web resource site)
In this case, the variable boldness of the letters tricked the vector analysis into thinking the K was an X and the L was an I.
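The 70% rule of thumb can be written down as a small check. This is only a sketch: the `(expected, guessed, confidence)` tuples are a hypothetical format you would fill in by hand from the tool's results, not captcha-decoder's output format, and the confidence values in the example are made up.

```python
def is_weak(results, threshold=0.70):
    """Return True if every character was guessed correctly above the threshold.

    results: list of (expected_char, guessed_char, confidence) tuples,
             with confidence expressed as a fraction between 0 and 1.
    threshold: an organization with a lower risk tolerance can pass a
               smaller value here.
    """
    return bool(results) and all(
        guessed == expected and confidence > threshold
        for expected, guessed, confidence in results
    )


# A run where "K" was read as "X" and "L" as "I" does not count as weak.
misread = [("K", "X", 0.81), ("L", "I", 0.77), ("A", "A", 0.93)]
print(is_weak(misread))  # False

# Every letter correct and above 70%: treat the CAPTCHA as weak.
all_correct = [("K", "K", 0.88), ("L", "L", 0.74), ("A", "A", 0.93)]
print(is_weak(all_correct))  # True
```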
References
Pwntcha – http://caca.zoy.org/wiki/PWNtcha
Pwntcha known compiling issues – https://blog.bmonkeys.net/2014/build-pwntcha-on-ubuntu-14-04
Note: if you have an issue with bootstrap, edit the bootstrap file to include automake versions 1.13 and 1.14.
Captcha-decoder – https://github.com/mekarpeles/captcha-decoder
Note: if you have an issue with installing, make sure the python-dev system package is installed.