We are all familiar with CAPTCHA, an acronym for “Completely Automated Public Turing test to tell Computers and Humans Apart”. A CAPTCHA is a test that tells whether the one solving it is a human or a machine. The machine in this case is computer software acting as a robot, also known as a bot. CAPTCHA prevents bots from using various types of computing services or collecting certain types of sensitive information, for example by blocking automated registration of free email addresses or automated form submission in CAPTCHA-protected polls. The assumption behind CAPTCHA is that only a human would pass the test, while bots would fail. This article introduces methods that interested parties may use to defeat CAPTCHA protection with bots at the lowest possible cost, and strategies to improve CAPTCHA difficulty to prevent abuse by such bots.

Why Would You Need an Automated CAPTCHA Solver?

There are both legitimate and illegal reasons to use automated CAPTCHA solving. I’ll start with the illegal ones. Spammers want to harvest as many email addresses as possible because they are paid based on the amount of spam they generate, and CAPTCHA gets in their way. Therefore, they need a cost-effective way to overcome the CAPTCHA protection. Another illegal use case is when a party wants to “skew” the result of an online poll to suit its needs, where the poll’s data entry is protected by CAPTCHA. As for the legal ones, a new business partner may want to automate access to a company’s service that is protected by CAPTCHA (to prevent abuse), while the service provider has yet to provide an Application Programming Interface (API) for that service, perhaps due to time or budget constraints. In this case the new business partner has no choice but to resort to automated CAPTCHA solving.

Approaches to Implement an Automated CAPTCHA Solver

There are two major approaches to implement an automated CAPTCHA solver:

  1. Using a third party CAPTCHA solving service.
  2. Creating a bot that uses Optical Character Recognition (OCR) to try solving the CAPTCHA characters.

There are several providers of third party CAPTCHA solving services at the moment, for example: Death by CAPTCHA (http://deathbyCAPTCHA.com), de-captcher (http://www.de-captcher.com/) and decaptcher2 (http://decaptcher2.com/). Most of these services work by using “human automation”, i.e. they have humans recognize the CAPTCHA characters and send the result back to you. The pros and cons of using a third party CAPTCHA solving service like these are:

  • The pros: the accuracy is higher than with an OCR approach because humans are inherently better at recognizing CAPTCHAs than machines, and the service providers usually give you an easy-to-use API to interface with their CAPTCHA solving service over the net.
  • The cons: the cost for a high number of CAPTCHAs is quite prohibitive because it adds up quickly over time, and there is the problem of latency, where the speed at which the CAPTCHA is solved doesn’t meet your solving “timeout” requirement. In the latter case, the CAPTCHA is solved correctly, but it takes so long that the session for the page containing the CAPTCHA has expired.

In my experience, CAPTCHA solving services tend to be better at solving CAPTCHAs than the OCR approach, but they have the aforementioned latency problem.

The second approach is much more complex than using a third party CAPTCHA solving service, and it is less accurate. Moreover, in many situations it cannot solve complex CAPTCHAs. For rather trivial CAPTCHAs, however, the second approach is much more cost effective and more or less usable. You might be surprised that in practice trivial CAPTCHAs are still widely used, especially on websites for very specific services. One example is mobile (cellphone) operators, usually prepaid ones, where subscribers can top up their account via the web; another is online ticketing for events. These service providers don’t get lots of hits because only those wanting to use their services go to their websites. Perhaps that’s the reason why they don’t employ sophisticated CAPTCHAs, or maybe the present (trivial) CAPTCHA is good enough for them.

The focus of this article is the second approach, i.e. using OCR to defeat the CAPTCHA. Of course this solution cannot solve even “simple” CAPTCHA one hundred percent of the time. Nonetheless, this article is only meant to be introductory material to understand the architecture of such a solution. It’s not meant to be a guide to “fight” CAPTCHA used by the big boys like Google, Facebook or Twitter. That would require far more advanced CAPTCHA solving solutions.

Implementing Our Simple CAPTCHA Solver

We are going to use a readily available OCR library to build our CAPTCHA solver bot. Details of the tools used to get the CAPTCHA images are not going to be explained here. The focus is only on building a small program that solves a CAPTCHA image which has already been obtained. Nonetheless, this article explains the generic architecture of a complete CAPTCHA solver solution.

Prerequisites

This section assumes that you are quite proficient in using a C/C++ Integrated Development Environment (IDE), or in using a C/C++ compiler directly from the command line. It also assumes that you know the basics of creating Windows DLLs and linking with them. If you are unsure about any of this, you can use your favorite search engine to look for relevant articles on the subject.

The Big Picture

Now, let’s start with the big picture. The overall architecture of a CAPTCHA solver solution is shown in Figure 1. It has two main components: the web “scraper” and the CAPTCHA solver itself.

Figure 1: CAPTCHA Solver Solution Basic Architecture

The purpose of the web scraper is to scrape the target web page, i.e. to “browse” the target web page as a human would, extract the data required to process the page and send “automated” feedback to the target web page. For example, if there is a web form on the target web page, the web scraper extracts the form entries from the page, fills in the required data and sends the “response” to the target web page, as if a human had entered the data and clicked the submit button. On a more complicated target web page, the data entry process is protected by a CAPTCHA. Therefore, the web scraper must call or implement a CAPTCHA solver to fulfill the CAPTCHA check requirement.

Let’s take a look at the solution in Figure 1 in more detail. These are the steps carried out in Figure 1:

  1. The web scraper fetches the contents of the target web page.
  2. The web scraper extracts the CAPTCHA image from the target web page.
  3. The CAPTCHA image is sent to the CAPTCHA solver.
  4. The CAPTCHA solver solves the CAPTCHA and emits CAPTCHA string as the result.
  5. The CAPTCHA string is sent back to the web scraper.
  6. The web scraper sends the feedback—including the CAPTCHA string—to the target web page URL.
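
The six steps above can be sketched as a small pipeline. This is only an illustration of the control flow under stated assumptions: the SolverPipeline structure, its member names and run_pipeline() are hypothetical and do not appear in the implementation described in this article. Each stage is a placeholder that a real scraper and solver would have to supply.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical stage signatures; none of these names come from the article.
struct SolverPipeline {
    // Step 1: fetch the contents of the target web page.
    std::function<std::string(const std::string& url)> fetch_page;
    // Step 2: extract the CAPTCHA image and return the path where it was saved.
    std::function<std::string(const std::string& html)> extract_image_path;
    // Steps 3-5: hand the image to the solver and get the CAPTCHA string back.
    std::function<std::string(const std::string& image_path)> solve;
    // Step 6: send the feedback, including the CAPTCHA string, to the target URL.
    std::function<bool(const std::string& url, const std::string& captcha)> submit;
};

// Runs the stages in order and reports whether the submission was accepted.
bool run_pipeline(const SolverPipeline& p, const std::string& url) {
    assert(p.fetch_page && p.extract_image_path && p.solve && p.submit);
    const std::string html = p.fetch_page(url);
    const std::string image_path = p.extract_image_path(html);
    const std::string captcha = p.solve(image_path);
    if (captcha.empty()) {
        return false;  // the solver produced nothing usable, so nothing is sent
    }
    return p.submit(url, captcha);
}
```

Keeping the solver behind a single string-in, string-out stage mirrors the split in the architecture: the scraper can change per web site while the solver stays the same.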

This article only focuses on the CAPTCHA solver component. As for the web scraper, it’s a completely different subject and it varies depending on the web site that’s being scraped.

Using Open Source OCR Library to Solve CAPTCHA

One of the ways to defeat CAPTCHA automatically is to use an OCR library to recognize the string in the CAPTCHA. Contrary to what you might think, an OCR library recognizes a string not just by trying to recognize individual letters (and digits) but also by using context information. For example, if you know that the string you’re trying to recognize contains only letters, you can feed that information to the OCR library to boost the recognition accuracy. Similarly, if the target string contains only digits and no letters, you can instruct the library to recognize only digits. Another possible context is the language of the string you’re trying to solve.
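
To make the idea of “context” concrete, here is a minimal sketch of how digits-only knowledge could be applied to a raw recognition result. It is not how the OCR library uses context internally; it is a hypothetical post-processing step, and both the function names and the look-alike table are assumptions made for illustration.

```cpp
#include <cassert>
#include <string>

// Hypothetical look-alike table: letters that are easily confused with digits.
// The table is an illustrative assumption, not part of any OCR library.
char to_digit_lookalike(char c) {
    switch (c) {
        case 'O': case 'o': return '0';
        case 'I': case 'l': return '1';
        case 'Z': case 'z': return '2';
        case 'S': case 's': return '5';
        case 'B':           return '8';
        default:            return c;
    }
}

// Applies the "digits only" context: look-alike letters are mapped to digits
// and every remaining non-digit character (including newlines) is dropped.
std::string apply_digit_context(const std::string& raw) {
    std::string out;
    for (char c : raw) {
        const char mapped = to_digit_lookalike(c);
        if (mapped >= '0' && mapped <= '9') {
            out.push_back(mapped);
        }
    }
    assert(out.size() <= raw.size());  // the filter never adds characters
    return out;
}
```

For example, a raw result of "97O8Z5" followed by a newline becomes "970825". Letting the OCR library itself restrict the character set is preferable when it supports that, because the restriction then influences recognition instead of only cleaning up afterwards.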

Now, let’s move to the concrete implementation. This article shows you how to implement the CAPTCHA solver by using the open source Tesseract OCR library. The library is available at https://code.google.com/p/tesseract-ocr/. Tesseract is written in C++. Therefore, the most natural way to use it is to write your CAPTCHA solver in C++ or C. You have to be aware though, that C++ uses name mangling, i.e. the function name seen on the source code is not the same as the one in the compiled object file, dll or executable produced by the compiler.

Anyway, Tesseract depends on Leptonica, another open source library that handles various image file formats. Therefore, you need to link to Leptonica as well as Tesseract in your program in order to use Tesseract OCR for CAPTCHA solving.

The implementation provided here is Windows-specific. You can download the Visual Studio 2008 code for Tesseract from this link: https://code.google.com/p/tesseract-ocr/downloads/detail?name=tesseract-ocr-3.02-vs2008.zip&can=2&q=. Additionally, you can download the Leptonica v1.68 dependency here: https://code.google.com/p/leptonica/downloads/list. For the sake of portability between different languages, the implementation here is in the form of a “plain C” Windows DLL that interfaces with the Tesseract DLL, and indirectly with the Leptonica DLL because Tesseract depends on Leptonica. I will also provide the code of a simple test application to test the DLL. If this is still confusing, Figure 2 should clarify what I mean.

Figure 2: Our CAPTCHA Solver Implementation Architecture

It is clear from Figure 2 that we have to create two things: first the Windows DLL wrapper code, and second the test application to make sure our DLL is working as intended. The Windows DLL wrapper code consists of two files: CAPTCHA_solver_dll.h and CAPTCHA_solver_dll.cpp.

Figure 2 shows the presence of Tesseract “learning” data. If you install Tesseract on your machine, this data is placed in the tessdata directory inside the Tesseract installation directory. You don’t need to install Tesseract if you want to use it in your own program. However, you need to have the Tesseract “learning” data (the tessdata directory and its contents) somewhere on the machine that runs your program, and you must set the TESSDATA_PREFIX environment variable to the absolute path of the directory containing the tessdata directory, not the path of the tessdata directory itself. You can do that via Control Panel|System|Advanced System Settings|Environment Variables|System variables. After that, it’s highly advisable to log off and log on again, or to restart the machine, because sometimes the new environment variable does not take effect otherwise. Setting the TESSDATA_PREFIX environment variable is needed because Tesseract uses it at run time to locate the “learning” data.
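
Because a wrong TESSDATA_PREFIX value is an easy mistake to make, a program could check the variable before calling into Tesseract. The following is a minimal sketch under stated assumptions: the two helper names are hypothetical, and the code only builds the expected path of the tessdata directory from the prefix; it does not verify that the directory exists.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Builds the path of the tessdata directory from a TESSDATA_PREFIX value.
// The prefix is the directory that CONTAINS tessdata, not tessdata itself.
std::string tessdata_dir_from_prefix(const std::string& prefix) {
    assert(!prefix.empty());
    const char last = prefix[prefix.size() - 1];
    const bool has_separator = (last == '\\' || last == '/');
    return prefix + (has_separator ? "" : "\\") + "tessdata";
}

// Returns true and fills `dir` if TESSDATA_PREFIX is set and non-empty.
bool tessdata_dir_from_environment(std::string& dir) {
    const char* prefix = std::getenv("TESSDATA_PREFIX");
    if (prefix == nullptr || *prefix == '\0') {
        return false;  // the variable is missing, so Tesseract cannot find its data
    }
    dir = tessdata_dir_from_prefix(prefix);
    return true;
}
```

With TESSDATA_PREFIX set to C:\ocr, the sketch yields C:\ocr\tessdata, which is where the “learning” data is expected to live.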

Now, let’s move to the details of using Tesseract in the CAPTCHA_solver_dll.cpp file. Using Tesseract is quite easy. These are the logical steps to solve a CAPTCHA image with Tesseract:

  1. Initialize the Tesseract API object to be used.
  2. Check whether the input file format is supported or not.
  3. Process the input image file to obtain the CAPTCHA string.
  4. Copy the result string to the output buffer. This is required because Tesseract uses an internal representation for strings which is not guaranteed to be compatible with the string format we want: a plain C string, i.e. a null-terminated string.

Now that the algorithm to use Tesseract is clear, I’ll show you the C++ code that implements it. Listing 1 shows the solve_CAPTCHA() function, which invokes Tesseract to “solve” (read) the string in the CAPTCHA image whose path is passed via the image_file_path input parameter. This is the only function an application needs in order to use Tesseract via our Windows DLL wrapper. Listing 1 doesn’t show the entire code in CAPTCHA_solver_dll.cpp, only the parts important to implement the very thin wrapper to Tesseract. The implementation of the steps above in Listing 1 is very straightforward.

Listing 1: solve_CAPTCHA() Function in CAPTCHA_solver_dll.cpp File

#include "stdafx.h"
#include "CAPTCHA_solver_dll.h"

...

// Variable to store the result of CAPTCHA processing
static char g_CAPTCHA_string[MAX_CAPTCHA_STRING_LENGTH + 1];

...

// This is an exported function.
///
/// This function invokes the Tesseract library to solve the CAPTCHA image
/// in the image_file_path parameter.
///
/// image_file_path: Path of the CAPTCHA image file.
/// Returns: Pointer to the string that holds the CAPTCHA string result,
///          or NULL on failure.
///
CAPTCHA_SOLVER_DLL_API char* solve_CAPTCHA(	const char* image_file_path )
{
	//
	// STEP 1: Initialize tesseract object to be used.
	//
	const char* lang = "eng";
	/* Hardcode the config file to be used to
	   "$TESSDATA_PREFIX/tessdata/configs/digits".
	   NOTE: As long as TESSDATA_PREFIX has been set as a Windows
	   environment variable, using only the word "digits" here should work. */
	char config_name[] = "digits";
	char* config_file_path = config_name;
	tesseract::TessBaseAPI  api;

	/* Init() returns 0 on success. A NULL datapath leaves it to Tesseract to
	   locate its data through the TESSDATA_PREFIX environment variable. */
	if (api.Init(NULL				/* datapath */,
		  lang					/* language */,
		  tesseract::OEM_DEFAULT	/* OcrEngineMode mode */,
		  &config_file_path		/* char **configs */,
		  1				/* configs_size -- only config_file_path */,
		  NULL				/* const GenericVector<STRING> *vars_vec */,
		  NULL				/* const GenericVector<STRING> *vars_values */,
		  false				/* bool set_only_non_debug_params */) != 0) {
		return NULL;
	}

	tesseract::PageSegMode pagesegmode = tesseract::PSM_AUTO;

	if (api.GetPageSegMode() == tesseract::PSM_SINGLE_BLOCK)
		api.SetPageSegMode(pagesegmode);

	//
	// STEP 2: Check whether the input file format is supported or not
	//
	FILE* fin = fopen(image_file_path, "rb");
	if (fin == NULL) {
		return NULL;
	}
	fclose(fin);

	PIX   *pixs;
	if ((pixs = pixRead(image_file_path)) == NULL) {
		return NULL;
	}
	pixDestroy(&pixs);

	//
	// STEP 3: Process the image.
	// The result is a STRING object pointed by text_out variable below.
	//
	STRING text_out;
	if (!api.ProcessPages(image_file_path, NULL, 0, &text_out)) {
		return NULL;
	}

	//
	// STEP 4: Copy the result string to the output buffer.
	// text_out.string() exposes the result as a plain C string. Copy at most
	// MAX_CAPTCHA_STRING_LENGTH characters so that the terminating null
	// written by memset() is never overwritten.
	//
	memset(g_CAPTCHA_string, '\0', sizeof(g_CAPTCHA_string));
	strncpy(g_CAPTCHA_string, text_out.string(), MAX_CAPTCHA_STRING_LENGTH);

	return g_CAPTCHA_string;
}

The PIX object in Listing 1 is a Leptonica object that handles the input image to be passed to Tesseract. Most of the image-related processing in Tesseract is handled by Leptonica. The CAPTCHA_SOLVER_DLL_API identifier in Listing 1 is a macro that defines the linkage type of the function. You can see the details of this identifier in Listing 2 (CAPTCHA_solver_dll.h). As Listing 2 shows, CAPTCHA_SOLVER_DLL_API resolves to __declspec(dllexport) when the CAPTCHA_SOLVER_DLL_EXPORTS constant is defined, which is the case here because the constant is defined in the preprocessor settings of the Visual Studio project containing the CAPTCHA_solver_dll.cpp file.

Listing 1 gives a “context” hint (a heuristic) to Tesseract in the form of the language setting and the configuration file setting. The language is set to English and the configuration file is set to digits only, i.e. Tesseract should interpret the input as digits only. This is done in step 1 of Listing 1. It is possible because we assume that a preliminary assessment of the target CAPTCHA has shown that the input always consists of digits.

Listing 2: CAPTCHA_solver_dll.h File
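
The copy into a fixed-size buffer in step 4 of the solve_CAPTCHA() listing is the kind of code where an off-by-one mistake easily leaves the string unterminated. Here is a minimal sketch of a more defensive copy. The helper name copy_trimmed() is hypothetical and not part of the wrapper; the trimming of trailing whitespace is an assumption that the OCR result may end with newline characters, which would otherwise end up in the returned CAPTCHA string.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstring>

// Copies `src` into `dst` (capacity `dst_size` bytes), dropping trailing
// whitespace and always null-terminating the destination.
// Returns the number of characters copied, not counting the terminator.
std::size_t copy_trimmed(char* dst, std::size_t dst_size, const char* src) {
    assert(dst != nullptr && dst_size > 0);
    if (src == nullptr) {
        dst[0] = '\0';
        return 0;
    }
    std::size_t len = std::strlen(src);
    while (len > 0 && std::isspace(static_cast<unsigned char>(src[len - 1]))) {
        --len;  // ignore trailing newlines and spaces
    }
    if (len > dst_size - 1) {
        len = dst_size - 1;  // truncate, leaving room for the terminator
    }
    std::memcpy(dst, src, len);
    dst[len] = '\0';
    return len;
}
```

Unlike a bare strncpy(), this helper terminates the destination even when the source is longer than the buffer, so the caller never has to rely on the buffer having been zeroed beforehand.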

#ifndef __CAPTCHA_SOLVER_DLL_H__
#define __CAPTCHA_SOLVER_DLL_H__

// The following ifdef block is the standard way of creating macros which make exporting
// from a DLL simpler. All files within this DLL are compiled with the
// CAPTCHA_SOLVER_DLL_EXPORTS symbol defined on the command line.
// This symbol should not be defined on any project that uses this DLL.
// This way any other project whose source files include this file see
// CAPTCHA_SOLVER_DLL_API functions as being imported from a DLL, whereas this DLL sees
// symbols defined with this macro as being exported.
#ifdef CAPTCHA_SOLVER_DLL_EXPORTS
#define CAPTCHA_SOLVER_DLL_API __declspec(dllexport)
#else
#define CAPTCHA_SOLVER_DLL_API __declspec(dllimport)
#endif

#ifndef MAX_CAPTCHA_STRING_LENGTH
#define MAX_CAPTCHA_STRING_LENGTH 256
#endif

#ifdef __cplusplus
extern "C" {
#endif

CAPTCHA_SOLVER_DLL_API char* solve_CAPTCHA(	const char* image_file_path );

#ifdef __cplusplus
}
#endif

#endif // __CAPTCHA_SOLVER_DLL_H__

With the Windows DLL wrapper completed, we can now move to the test application source code. Listing 3 shows the source code of the test application for our Tesseract wrapper library. This test application is, again, Windows-specific. If you are using Visual Studio to compile the code in Listing 3, set the character set in the project settings to Multi-Byte Character Set (MBCS) via the “Project Properties”|Configuration Properties|Project Defaults|Character Set setting. This setting instructs Visual Studio to compile the project in MBCS mode, so that _TCHAR maps to char and the string handling in the code uses plain C strings. This is important because by default Visual Studio sets the character set to Unicode, which is not compatible with the char-based interface of the Tesseract wrapper library we built earlier.

Listing 3: Test Application (CAPTCHA_solver_dll_test_app) Linked to CAPTCHA_solver_dll.dll

// CAPTCHA_solver_dll_test_app.cpp : Defines the entry point for the console application.
//

#include "stdafx.h"
#include "CAPTCHA_solver_dll.h"

int _tmain(int argc, _TCHAR* argv[])
{
	char CAPTCHA_string[MAX_CAPTCHA_STRING_LENGTH];

	/// Invocation rule: test_app [image_file_path]
	if (argc != 2) {
		printf("Error! Wrong input parameters\n");
		printf("Usage: %s [image_file_path]\n", argv[0]);
		return 1;
	}

	/// Step 1: solve CAPTCHA (solve_CAPTCHA() returns NULL on failure)
	const char* result = solve_CAPTCHA(argv[1]);
	if (result == NULL) {
		printf("Error! Unable to solve the CAPTCHA in %s\n", argv[1]);
		return 1;
	}
	memset(CAPTCHA_string, '\0', sizeof(CAPTCHA_string));
	strncpy_s(CAPTCHA_string, sizeof(CAPTCHA_string), result, _TRUNCATE);

	/// Step 2: show CAPTCHA string
	printf("CAPTCHA string = %s\n", CAPTCHA_string);

	return 0;
}

The code in Listing 3 is Windows-specific C source code because the string copy function it uses, strncpy_s(), is Windows-specific; it is a secure version of the default C string copy function. The lines in Listing 3 that invoke the solve_CAPTCHA() function in the wrapper DLL we built earlier and copy its result are:

 const char* result = solve_CAPTCHA(argv[1]);
 strncpy_s(CAPTCHA_string, sizeof(CAPTCHA_string), result, _TRUNCATE);

You can look up the details of the strncpy_s() secure string copy function at: http://msdn.microsoft.com/en-us/library/5dae5d43(v=vs.80).aspx while the _TRUNCATE constant is explained here: http://msdn.microsoft.com/en-us/library/ms175769(v=vs.80).aspx.

As you can see, using the wrapper DLL involves only one function call in the code that uses the library. Of course, you have to link against the wrapper library in your Visual Studio project or in whatever other IDE you use. Nothing is out of the ordinary in the code in Listing 3. Therefore, you should be able to grasp it right away.
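
If the test application has to be built with the Unicode character set after all, the command line argument arrives as a wide-character string and must be converted before it can be passed to a char-based function such as solve_CAPTCHA(). The following is a minimal sketch of such a conversion using the standard wcstombs() function. The helper name to_narrow() is hypothetical, and the conversion depends on the current C locale, so it is only safe to assume that it works for plain ASCII paths.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <vector>

// Converts a wide-character string to a narrow one using the current C locale.
// Returns an empty string if a character cannot be represented.
std::string to_narrow(const wchar_t* wide) {
    assert(wide != nullptr);
    // With a null destination, wcstombs() reports the number of bytes needed.
    const std::size_t needed = std::wcstombs(nullptr, wide, 0);
    if (needed == static_cast<std::size_t>(-1)) {
        return std::string();
    }
    std::vector<char> buffer(needed + 1, '\0');
    std::wcstombs(buffer.data(), wide, buffer.size());
    return std::string(buffer.data(), needed);
}
```

In a Unicode build, the call would then look like solve_CAPTCHA(to_narrow(argv[1]).c_str()). Setting the project to MBCS as described above avoids the conversion altogether.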

Testing Our CAPTCHA Solver Application

At this point, the entire CAPTCHA solver solution is complete. It’s time to put it to the test. Figure 3 shows the CAPTCHAs I used to test the CAPTCHA solver solution explained in the previous sections.

Figure 3: CAPTCHA Samples Used for Testing (lumped together into one image)

Figure 4 shows how I invoke the test application to solve the CAPTCHA strings in the images 8.jpg and 9.jpg respectively. As you can see, the test application correctly reads the CAPTCHA strings.

Figure 4: Running the CAPTCHA Solver Test Application

As mentioned in the explanation of Listing 1, the Tesseract wrapper DLL gives Tesseract the heuristics that the input consists of digits and that it should be treated as English, not as another language or script such as Chinese, Thai or Japanese. Table 1 shows the result of invoking our test application with the above input (CAPTCHA) files.

Table 1: CAPTCHA Solving Result

CAPTCHA Image   Reference (Correct) CAPTCHA String   CAPTCHA Solving Result
-------------   ----------------------------------   ----------------------------
0.jpg           159769                               Partially Correct: 159759
1.jpg           816675                               Correct: 816675
2.jpg           671684                               Partially Correct: 671584
3.jpg           321338                               Partially Correct: 321335
4.jpg           670834                               Completely False: 5193311
5.jpg           682209                               Completely False: 5822179
6.jpg           223143                               Correct: 223143
7.jpg           805928                               Partially Correct: 7805928
8.jpg           970825                               Correct: 970825
9.jpg           686608                               Correct: 686608

Table 1 shows that the accuracy of our CAPTCHA solving test application is 40% (four of the ten input CAPTCHA images). That’s not bad for a first try, is it? Moreover, another 40% are almost correct guesses, with only one wrong character or one extra character. In two cases (0.jpg and 2.jpg) Tesseract mistook the digit six for the digit five, and in 3.jpg it mistook an eight for a five.
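
The percentages can be reproduced mechanically by comparing each result with its reference string. The sketch below classifies a result as correct when it matches exactly, as a near miss when it is one edit (one substitution, insertion or deletion) away, and as wrong otherwise. The function and type names are hypothetical, and using the Levenshtein edit distance as the definition of “almost correct” is an assumption made for this illustration; the data are the ten pairs from the table above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Levenshtein edit distance: the minimum number of single-character
// insertions, deletions and substitutions that turn `a` into `b`.
std::size_t edit_distance(const std::string& a, const std::string& b) {
    std::vector<std::size_t> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const std::size_t subst = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            cur[j] = std::min({subst, prev[j] + 1, cur[j - 1] + 1});
        }
        prev.swap(cur);
    }
    return prev[b.size()];
}

struct Sample { const char* reference; const char* solved; };
struct Tally { int correct = 0; int near_miss = 0; int wrong = 0; };

// Counts exact matches, results one edit away, and everything else.
Tally tally_results(const std::vector<Sample>& samples) {
    Tally t;
    for (const Sample& s : samples) {
        const std::size_t d = edit_distance(s.reference, s.solved);
        if (d == 0) ++t.correct;
        else if (d == 1) ++t.near_miss;
        else ++t.wrong;
    }
    return t;
}

// The ten reference/result pairs from the table, 0.jpg to 9.jpg.
std::vector<Sample> table_samples() {
    return {
        {"159769", "159759"}, {"816675", "816675"}, {"671684", "671584"},
        {"321338", "321335"}, {"670834", "5193311"}, {"682209", "5822179"},
        {"223143", "223143"}, {"805928", "7805928"}, {"970825", "970825"},
        {"686608", "686608"},
    };
}
```

Calling tally_results(table_samples()) gives four correct results, four near misses and two wrong results, which matches the 40%, 40% and 20% split of the ten samples.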

Anyway, the automated CAPTCHA solver solution I presented here is very rudimentary. It doesn’t do any preprocessing of the input image, which could improve the CAPTCHA solver accuracy, albeit maybe just a little. Still, if such improvements turned every one of the 40% near misses into a correct result, the accuracy would rise to as much as 80%.
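
As an illustration of what preprocessing could mean, here is a minimal sketch of one common technique, global thresholding, which turns a grayscale image into pure black and white. It is not part of the solver presented in this article, and whether it would actually help with the test CAPTCHAs used here has not been measured. The function name and the representation of the image as a flat vector of 8-bit gray values are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Global threshold: pixels darker than `threshold` become black (0),
// all other pixels become white (255).
std::vector<std::uint8_t> binarize(const std::vector<std::uint8_t>& gray,
                                   std::uint8_t threshold) {
    std::vector<std::uint8_t> out(gray.size());
    for (std::size_t i = 0; i < gray.size(); ++i) {
        out[i] = (gray[i] < threshold) ? 0 : 255;
    }
    return out;
}
```

In a real solver the pixel data would come from the image library, and the threshold would have to be chosen per CAPTCHA style, for example by inspecting a few samples.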

Closing Thoughts

There are several possible ways to improve the CAPTCHA solver accuracy. First, we could preprocess the CAPTCHA image to make it clearer. Second, we could add one more “context” heuristic to the CAPTCHA solving solution, such as giving a hint to Tesseract that the input is always six characters long.

In the end, automated CAPTCHA solving is a gray area because its legality is unclear in many places. In Indonesia (where I live), it is legal only due to the absence of regulation at the moment, because the basic premise in Indonesian law is that something not yet regulated is deemed legal. I hope that this article opens up a new understanding of how automated CAPTCHA solving might be carried out.
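
The six-character hint can also be approximated outside Tesseract. The sketch below is a hypothetical post-check, not a Tesseract setting: it accepts a result only if it has the expected length and consists of digits only. The function name is an assumption made for illustration.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Returns true if `text` has exactly `expected_length` characters
// and every one of them is a decimal digit.
bool is_plausible_result(const std::string& text, std::size_t expected_length) {
    if (text.size() != expected_length) {
        return false;
    }
    for (char c : text) {
        if (!std::isdigit(static_cast<unsigned char>(c))) {
            return false;
        }
    }
    return true;
}
```

Applied to the test results above, this check would reject the three seven-character results (7805928, 5193311 and 5822179) before they are submitted. It cannot catch a result such as 159759, where a digit was merely misread.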
