Python Captcha Solver (2019)

If a page shows a CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart), it is basically announcing that no scrapers are welcome. This tutorial explains how to build a CAPTCHA solver using Python.

 

It also covers a deep learning implementation that solves CAPTCHAs using a convolutional neural network.

There are ways around this, however. Some sites offer a “CAPTCHA solving API” with quick solving times for a fair price. Some real-life projects have instead used OCR software such as Tesseract to build a CAPTCHA solver.
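If you want to experiment with the OCR route, a minimal sketch might look as follows. This assumes the pytesseract and Pillow libraries are installed along with a Tesseract binary, that “captcha.png” is a hypothetical file of your own, and that the CAPTCHA is simple enough for plain OCR (many are deliberately not):

import pytesseract
from PIL import Image

# Converting to grayscale often helps Tesseract
captcha = Image.open('captcha.png').convert('L')
guess = pytesseract.image_to_string(captcha)
print('Tesseract guess:', guess.strip())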

 

It’s also a good idea to verify whether the CAPTCHA appears only after a number of requests or randomly; in the former case, you can implement a back-off or cooldown strategy to simply wait for a while before trying again.
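As a rough illustration, such a back-off strategy might look like the following sketch. The 'captcha' substring check is a placeholder; adapt the detection logic (and the URL) to whatever your target site actually returns:

import time
import requests

def fetch_with_backoff(url, max_retries=5):
    for attempt in range(max_retries):
        response = requests.get(url)
        if 'captcha' not in response.text.lower():
            return response
        # Exponential cooldown: wait 1, 2, 4, 8, ... seconds before retrying
        time.sleep(2 ** attempt)
    raise RuntimeError('Still seeing a CAPTCHA after several retries')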

 

Breaking CAPTCHAs Using Deep Learning

This final example is definitely the most challenging one, as well as the one most related to “data science” rather than web scraping. In fact, we won’t use any web scraping tools here.

 

Instead, we’re going to walk through a relatively contained example to illustrate how you could incorporate a predictive model into your web scraping pipeline in order to bypass a CAPTCHA check.

 

We’re going to need to install some tools first. We’ll use “OpenCV,” an extremely thorough library for computer vision, as well as “numpy” for some basic data wrangling. Finally, we’ll use the “captcha” library to generate example images. All of these can be installed as follows:

pip install -U opencv-python
pip install -U numpy
pip install -U captcha

 

Next, create a directory somewhere in your system to contain the Python scripts we will create. The first script (“constants.py”) will contain some constants we’re going to use:

CAPTCHA_FOLDER = 'generated_images'
LETTERS_FOLDER = 'letters'
CHARACTERS = list('QWERTPASDFGHKLZXBNM')
NR_CAPTCHAS = 1000
NR_CHARACTERS = 4
MODEL_FILE = 'model.hdf5'
LABELS_FILE = 'labels.dat'
MODEL_SHAPE = (100, 100)
Another script (“generate.py”) will generate a bunch of CAPTCHA images and save them to the “generated_images” directory:
from random import choice
from captcha.image import ImageCaptcha
import os.path
from os import makedirs
from constants import *

makedirs(CAPTCHA_FOLDER)
image = ImageCaptcha()
for i in range(NR_CAPTCHAS):
    captcha = ''.join([choice(CHARACTERS) for c in range(NR_CHARACTERS)])
    filename = os.path.join(CAPTCHA_FOLDER, '{}_{}.png'.format(captcha, i))
    image.write(captcha, filename)
    print('Generated:', captcha)

After running this script, you should end up with a collection of CAPTCHA images in the “generated_images” directory.

 


 

Isn’t This Cheating?    

Of course, we’re lucky here that we are generating the CAPTCHAs ourselves and hence have the opportunity to keep the answers as well.

 

In the real world, however, CAPTCHAs do not expose their answer (that would rather defeat the point of a CAPTCHA), so we would need to figure out another way to create our training set.

 

One way is to look for the library a particular site is using to generate its CAPTCHAs and use it to collect a set of training images of your own, replicating the originals as closely as possible.

 

Another approach is to manually label the images yourself, which is as dreadful as it sounds, though you might not need to label thousands of images to get the desired result.

 

Since people make mistakes when filling in CAPTCHAs too, we have more than one chance to get the answer right and hence do not need to target a 100 percent accuracy level. Even if our predictive model is only able to get one out of ten images right, that is still sufficient to break through a CAPTCHA after some retries.
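For instance, with a solver that is right only 10 percent of the time, the chance of failing 20 attempts in a row is about 0.9 ** 20 ≈ 12 percent. A retry loop might look like the sketch below, where download_captcha, solve_captcha, and submit_answer are hypothetical stand-ins for your own pipeline:

for attempt in range(20):
    captcha_image = download_captcha()      # fetch a fresh CAPTCHA
    answer = solve_captcha(captcha_image)   # your predictive model
    if submit_answer(answer):               # True if the site accepts it
        break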

 

Next, we’re going to write another script that will cut up our images into separate pieces, one per character. We could try to construct a model that predicts the complete answer all at once, though in many cases it is much easier to perform the predictions character by character.

 

To cut up our image, we’ll need to invoke OpenCV to perform some heavy lifting for us. A complete discussion regarding OpenCV and computer vision would require a blog in itself, so we’ll stick to some basics here.

 

The main concepts we’ll use here are thresholding, opening, and contour detection. To see how this works, let’s create a small test script first to show these concepts in action:

import cv2
import numpy as np

# Change this to one of your generated images:
image_file = 'generated_images/ABQM_116.png'

image = cv2.imread(image_file)
cv2.imshow('Original image', image)

# Convert to grayscale, followed by thresholding to black and white
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
_, thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)
cv2.imshow('Black and white', thresh)

# Apply opening: "erosion" followed by "dilation"
denoised = thresh.copy()
kernel = np.ones((4, 3), np.uint8)
denoised = cv2.erode(denoised, kernel, iterations=1)
kernel = np.ones((6, 3), np.uint8)
denoised = cv2.dilate(denoised, kernel, iterations=1)
cv2.imshow('Denoised', denoised)

# Now find contours and overlay them over our original image
# (note: OpenCV 3.x returns three values here; OpenCV 4.x returns only
# two, in which case use: cnts, _ = cv2.findContours(...))
_, cnts, _ = cv2.findContours(denoised.copy(), cv2.RETR_TREE, cv2.CHAIN_APPROX_NONE)
cv2.drawContours(image, cnts, contourIdx=-1, color=(255, 0, 0), thickness=-1)
cv2.imshow('Contours', image)
cv2.waitKey(0)

 

If you run this script, you should obtain a series of preview windows. In the first two steps, we open our image with OpenCV and convert it to a simple pure black and white representation. Next, we apply an “opening” morphological transformation, which boils down to an erosion followed by a dilation.

 

The basic idea of erosion is just like soil erosion: this transformation “erodes away” the boundaries of the foreground object (which is assumed to be white) by sliding a “kernel” (a “window,” so to speak) over the image, so that a white pixel is retained only if all pixels in the surrounding kernel are white as well.

 

Otherwise, it is turned black. Dilation does the opposite: it expands the foreground by setting a pixel to white if at least one pixel in the surrounding kernel is white.

 

Applying these steps is a very common tactic to remove noise from images. The kernel sizes used in the script above are simply the result of some trial and error, and you might want to adjust these with other types of CAPTCHA images.
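To make the effect concrete, here is a toy example (not part of the tutorial’s scripts) showing that opening wipes out an isolated noise pixel while a solid foreground block survives:

import cv2
import numpy as np

img = np.zeros((7, 7), np.uint8)
img[1, 1] = 255        # an isolated white "noise" pixel
img[3:6, 2:6] = 255    # a solid 3x4 white block
kernel = np.ones((3, 3), np.uint8)

# Opening = erosion followed by dilation
opened = cv2.dilate(cv2.erode(img, kernel), kernel)
print(opened[1, 1])    # 0: the noise pixel has been eroded away
print(opened[4, 3])    # 255: the solid block survives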

 

Note that we allow some noise to remain present in the image. We don’t need to obtain a perfect image, as we trust that our predictive model will be able to “look past” this.

 

From left to right: original image, image after conversion to black and white, image after applying an opening to remove noise, and the extracted contours overlaid in blue over the original image.

 

Next, we use OpenCV’s findContours method to extract “blobs” of connected white pixels. OpenCV comes with various methods to perform this extraction and different ways to represent the result (e.g., simplifying the contours or not, constructing a hierarchy or not, and so on).

 

Finally, we use the drawContours method to draw the discovered blobs. 

The contourIdx argument of -1 here indicates that we want to draw all top-level contours, and the thickness value of -1 instructs OpenCV to fill in the contours.

 

We still need a way to use the contours to create separate images, one per character. We’ll do so using masking.

 

Note that OpenCV also allows you to fetch the “bounding rectangle” for each contour, which would make “cutting” the image much easier, though this might get us into trouble in case parts of the characters are close to each other. Instead, we’ll use the approach illustrated by the following code fragment:

import cv2
import numpy as np

image_file = 'generated_images/ABQM_116.png'

# Perform thresholding, erosion and contour finding as shown before
image = cv2.imread(image_file)
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
_, thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)
denoised = thresh.copy()
kernel = np.ones((4, 3), np.uint8)
denoised = cv2.erode(denoised, kernel, iterations=1)
kernel = np.ones((6, 3), np.uint8)
denoised = cv2.dilate(denoised, kernel, iterations=1)
_, cnts, _ = cv2.findContours(denoised.copy(), cv2.RETR_TREE, cv2.CHAIN_APPROX_NONE)

# Create a fresh 'mask' image
mask = np.ones((image.shape[0], image.shape[1]), dtype="uint8") * 0

# We'll use the first contour as an example
contour = cnts[0]

# Draw this contour over the mask
cv2.drawContours(mask, [contour], -1, (255, 255, 255), -1)

cv2.imshow('Denoised image', denoised)
cv2.imshow('Mask after drawing contour', mask)

result = cv2.bitwise_and(denoised, mask)
cv2.imshow('Result after and operation', result)

retain = result > 0
result = result[np.ix_(retain.any(1), retain.any(0))]
cv2.imshow('Final result', result)
cv2.waitKey(0)

 

First, we create a new black image with the same size as the starting, denoised image. We take one contour and draw it in white on top of this “mask.”

 

Next, the denoised image and the mask are combined in a bitwise “and” operation, which retains a white pixel if the corresponding pixels in both input images were white, and sets it to black otherwise. Finally, we apply some clever numpy slicing to crop the image.
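The cropping line deserves a word of explanation: retain.any(1) flags the rows that contain at least one white pixel, retain.any(0) flags such columns, and np.ix_ selects their intersection. A toy example:

import numpy as np

a = np.array([[0, 0, 0, 0],
              [0, 9, 8, 0],
              [0, 0, 7, 0],
              [0, 0, 0, 0]])
retain = a > 0
# Keep only the rows and columns containing at least one nonzero value
print(a[np.ix_(retain.any(1), retain.any(0))])
# [[9 8]
#  [0 7]]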

 

On the top left, the starting image is shown. On the right, a new image is created with the contour drawn in white and filled. These two images are combined in a bitwise “and” operation to obtain the image in the second row. The bottom image shows the final result after cropping.

 

This is sufficient to get started, though there is still one problem we need to solve: overlap. If characters overlap, they would be discovered as one large contour. To work around this issue, we’ll apply the following operations.

 

First, starting from a list of contours, we check whether there is a significant degree of overlap between any two distinct contours, in which case we retain only the larger one.

 

Next, we order the contours based on their size, take the first n contours, and order these on the horizontal axis, from left to right (with n being the number of characters in a CAPTCHA).

 

This still might lead to fewer contours than we need, so we then iterate over each contour and check whether its width is larger than an expected value.

 

A good heuristic for the expected value is the distance from the leftmost white pixel to the rightmost white pixel, divided by the number of characters we expect to see.

 

In case a contour is wider than we expect, we cut it up into m equal parts, with m being the width of the contour divided by the expected width, rounded up. For example, a contour 200 pixels wide with an expected width of 60 pixels is cut into m = ceil(200 / 60) = 4 parts.

 

This heuristic still might lead to some characters not being cut out perfectly (some characters are wider than others), but this is something we’ll just accept.

In case we don’t end up with the desired number of characters at the end of all this, we’ll simply skip over the given image.

 

We’ll put all of this in a separate set of functions (in a file “functions.py”):

import cv2
import numpy as np
from math import ceil, floor
from constants import *

def overlaps(contour1, contour2, threshold=0.8):
    # Check whether two contours' bounding boxes overlap
    area1 = contour1['w'] * contour1['h']
    area2 = contour2['w'] * contour2['h']
    left = max(contour1['x'], contour2['x'])
    right = min(contour1['x'] + contour1['w'], contour2['x'] + contour2['w'])
    top = max(contour1['y'], contour2['y'])
    bottom = min(contour1['y'] + contour1['h'], contour2['y'] + contour2['h'])
    if left <= right and bottom >= top:
        intArea = (right - left) * (bottom - top)
        intRatio = intArea / min(area1, area2)
        if intRatio >= threshold:
            # Return True if the second contour is larger
            return area2 > area1
    # Don't overlap or doesn't exceed threshold
    return None

def remove_overlaps(cnts):
    contours = []
    for c in cnts:
        x, y, w, h = cv2.boundingRect(c)
        new_contour = {'x': x, 'y': y, 'w': w, 'h': h, 'c': c}
        for other_contour in contours:
            overlap = overlaps(other_contour, new_contour)
            if overlap is not None:
                if overlap:
                    # Keep this one...
                    contours.remove(other_contour)
                    contours.append(new_contour)
                # ... otherwise do nothing: keep the original one
                break
        else:
            # We didn't break, so no overlap found, add the contour
            contours.append(new_contour)
    return contours

def process_image(image):
    # Perform basic pre-processing
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    _, thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)
    denoised = thresh.copy()
    kernel = np.ones((4, 3), np.uint8)
    denoised = cv2.erode(denoised, kernel, iterations=1)
    kernel = np.ones((6, 3), np.uint8)
    denoised = cv2.dilate(denoised, kernel, iterations=1)
    return denoised

def get_contours(image):
    # Retrieve contours
    _, cnts, _ = cv2.findContours(image.copy(), cv2.RETR_TREE, cv2.CHAIN_APPROX_NONE)
    # Remove overlapping contours
    contours = remove_overlaps(cnts)
    # Sort by size, keep only the first NR_CHARACTERS
    contours = sorted(contours, key=lambda x: x['w'] * x['h'],
                      reverse=True)[:NR_CHARACTERS]
    # Sort from left to right
    contours = sorted(contours, key=lambda x: x['x'], reverse=False)
    return contours

def extract_contour(image, contour, desired_width, threshold=1.7):
    mask = np.ones((image.shape[0], image.shape[1]), dtype="uint8") * 0
    cv2.drawContours(mask, [contour], -1, (255, 255, 255), -1)
    result = cv2.bitwise_and(image, mask)
    mask = result > 0
    result = result[np.ix_(mask.any(1), mask.any(0))]
    if result.shape[1] > desired_width * threshold:
        # This contour is wider than expected, split it
        amount = ceil(result.shape[1] / desired_width)
        each_width = floor(result.shape[1] / amount)
        # Note: indexing is based on im[y1:y2, x1:x2]
        results = [result[0:(result.shape[0] - 1),
                          (i * each_width):((i + 1) * each_width - 1)] \
                   for i in range(amount)]
        return results
    return [result]

def get_letters(image, contours):
    desired_size = (contours[-1]['x'] + contours[-1]['w'] - contours[0]['x']) \
                   / NR_CHARACTERS
    masks = [m for l in [extract_contour(image, contour['c'], desired_size) \
                         for contour in contours] for m in l]
    return masks
With this, we’re finally ready to write our cutting script (“cut.py”):
from os import makedirs
import os.path
from glob import glob
from functions import *
from constants import *

image_files = glob(os.path.join(CAPTCHA_FOLDER, '*.png'))

for image_file in image_files:
    print('Now doing file:', image_file)
    answer = os.path.basename(image_file).split('_')[0]
    image = cv2.imread(image_file)
    processed = process_image(image)
    contours = get_contours(processed)
    if not len(contours):
        print('[!] Could not extract contours')
        continue
    letters = get_letters(processed, contours)
    if len(letters) != NR_CHARACTERS:
        print('[!] Could not extract desired amount of characters')
        continue
    if any([l.shape[0] < 10 or l.shape[1] < 10 for l in letters]):
        print('[!] Some of the extracted characters are too small')
        continue
    for i, mask in enumerate(letters):
        letter = answer[i]
        outfile = '{}_{}.png'.format(answer, i)
        outpath = os.path.join(LETTERS_FOLDER, letter)
        if not os.path.exists(outpath):
            makedirs(outpath)
        print('[i] Saving', letter, 'as', outfile)
        cv2.imwrite(os.path.join(outpath, outfile), mask)

If you run this script, the “letters” directory should now contain a directory for each letter. We’re now ready to construct our deep learning model.

 

We’ll use a simple convolutional neural network architecture, using the “Keras” library.

pip install -U keras

 

For Keras to work, we also need to install a backend (the “engine” Keras will use, so to speak). You can use the rather limited “theano” library, Google’s “TensorFlow,” or Microsoft’s “CNTK.” We assume you’re using Windows, so CNTK is the easiest option to go with. (If not, install the “theano” library using pip instead.)

 

To install CNTK, navigate to https://docs.microsoft.com/en-us/cognitive-toolkit/setup-windows-python?tabs=cntkpy231 and look for the URL corresponding with your Python version.

 

If you have a compatible GPU in your computer, you can use the “GPU” option.

 

If this doesn’t work or you run into trouble, stick to the “CPU” option. Installation is then performed as follows (using the GPU Python 3.6 version URL):

pip install -U https://cntk.ai/PythonWheel/GPU/cntk-2.3.1-cp36-cp36m-win_amd64.whl

 

Next, we need to create a Keras configuration file. Run a Python REPL and import Keras as follows:

>>> import keras
Using TensorFlow backend.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "\site-packages\keras\__init__.py", line 3, in <module>
    from . import utils
  File "\site-packages\keras\utils\__init__.py", line 6, in <module>
    from . import conv_utils
  File "\site-packages\keras\utils\conv_utils.py", line 3, in <module>
    from .. import backend as K
  File "\site-packages\keras\backend\__init__.py", line 83, in <module>
    from .tensorflow_backend import *
  File "\site-packages\keras\backend\tensorflow_backend.py", line 1, in <module>
    import tensorflow as tf
ModuleNotFoundError: No module named 'tensorflow'

 

Keras will complain about the fact that it can’t find Tensorflow, its default backend. That’s fine; simply exit the REPL.

Next, navigate to “%USERPROFILE%\.keras” in Windows’ file explorer.

 

There should be a “keras.json” file there. Open this file using Notepad or another text editor, and replace the contents so that it reads as follows.

{
    "floatx": "float32",
    "epsilon": 1e-07,
    "backend": "cntk",
    "image_data_format": "channels_last"
}

 

Using Another Back End        

In case you’re using TensorFlow, just leave the “backend” value set to “tensorflow.” If you’re using theano, set the value to “theano.”

 

Note that in the latter case, you might also need to look for a “.theanorc.txt” file on your system and change its contents as well to get things to work. In particular, set the “device” entry to “cpu” in case theano has trouble finding your GPU.

 

Once you’ve made this change, try test-importing Keras once again into a fresh REPL session. You should now get the following:

>>> import keras
Using CNTK backend
Selected GPU[1] GeForce GTX 980M as the process wide default device.

 

Keras is now set up and recognizes our GPU. If CNTK complains, remember to try the CPU version instead, though keep in mind that training the model will take much longer in that case (the same holds for theano and TensorFlow in case you can only use CPU-based computing).

 

We can now create another Python script to train our model (“train.py”):

import cv2
import pickle
from os import listdir
import os.path
import numpy as np
from glob import glob
from sklearn.preprocessing import LabelBinarizer
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers.convolutional import Conv2D, MaxPooling2D
from keras.layers.core import Flatten, Dense
from constants import *

data = []
labels = []
nr_labels = len(listdir(LETTERS_FOLDER))

# Convert each image to a data matrix
for label in listdir(LETTERS_FOLDER):
    for image_file in glob(os.path.join(LETTERS_FOLDER, label, '*.png')):
        image = cv2.imread(image_file)
        image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        # Resize the image so all images have the same input shape
        image = cv2.resize(image, MODEL_SHAPE)
        # Expand dimensions to make Keras happy
        image = np.expand_dims(image, axis=2)
        data.append(image)
        labels.append(label)

# Normalize the data so every value lies between zero and one
data = np.array(data, dtype="float") / 255.0
labels = np.array(labels)

# Create a training-test split
(X_train, X_test, Y_train, Y_test) = train_test_split(data, labels,
    test_size=0.25, random_state=0)

# Binarize the labels
lb = LabelBinarizer().fit(Y_train)
Y_train = lb.transform(Y_train)
Y_test = lb.transform(Y_test)

# Save the binarization for later
with open(LABELS_FILE, "wb") as f:
    pickle.dump(lb, f)

# Construct the model architecture
model = Sequential()
model.add(Conv2D(20, (5, 5), padding="same",
                 input_shape=(MODEL_SHAPE[0], MODEL_SHAPE[1], 1),
                 activation="relu"))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))
model.add(Conv2D(50, (5, 5), padding="same", activation="relu"))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))
model.add(Flatten())
model.add(Dense(500, activation="relu"))
model.add(Dense(nr_labels, activation="softmax"))
model.compile(loss="categorical_crossentropy", optimizer="adam",
              metrics=["accuracy"])

# Train and save the model
model.fit(X_train, Y_train, validation_data=(X_test, Y_test),
          batch_size=32, epochs=10, verbose=1)
model.save(MODEL_FILE)

We’re doing a number of things here. First, we loop through all images we have created, resize them, and store their pixel matrix as well as their answer. Next, we normalize the data so that each value lies between zero and one, which makes things a bit easier on the neural network.

 

Next, since Keras can’t work with “Q”, “W”,… labels directly, we need to binarize them: every label is converted to an output vector with each index corresponding to one possible character, with its value set to one or zero, so that “Q” becomes “[1, 0, 0, 0,…],” “W” becomes “[0, 1, 0, 0,…],” and so on.
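As a small illustration of what LabelBinarizer does (note that it stores the classes in sorted order, not in the order they were passed in):

import numpy as np
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer().fit(['Q', 'W', 'A'])
print(lb.classes_)          # ['A' 'Q' 'W']
print(lb.transform(['Q']))  # [[0 1 0]]
# inverse_transform maps a prediction vector back to a character
print(lb.inverse_transform(np.array([[0, 0, 1]])))  # ['W']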

 

We save this conversion as we’ll also need it to perform the conversion back to characters again during application of the model.

 

Next, we construct the neural architecture (which is relatively simple, in fact), and start training the model. If you run this script, you’ll get an output as follows:

 

Using CNTK backend
Selected GPU[0] GeForce GTX 980M as the process wide default device.
Train on 1665 samples, validate on 555 samples
Epoch 1/10
C:\Users\Seppe\Anaconda3\lib\site-packages\cntk\core.py:361:
UserWarning: your data is of type "float64", but your input variable
(uid "Input4") expects "<class 'numpy.float32'>". Please convert your
data beforehand to speed up training.
  (sample.dtype, var.uid, str(var.dtype)))

 

We’re getting a 92 percent accuracy on the validation set, not bad at all!

The only thing that remains now is to show how we’d use this network to predict a CAPTCHA (“apply.py”):

from keras.models import load_model
import pickle
import os.path
from glob import glob
from random import choice
from functions import *
from constants import *

with open(LABELS_FILE, "rb") as f:
    lb = pickle.load(f)
model = load_model(MODEL_FILE)

# We simply pick a random training image here to illustrate how predictions
# work. In a real setup, you'd obviously plug this into your web scraping
# pipeline and pass a "live" captcha image
image_files = list(glob(os.path.join(CAPTCHA_FOLDER, '*.png')))
image_file = choice(image_files)
print('Testing:', image_file)

image = cv2.imread(image_file)
image = process_image(image)
contours = get_contours(image)
letters = get_letters(image, contours)
for letter in letters:
    letter = cv2.resize(letter, MODEL_SHAPE)
    letter = np.expand_dims(letter, axis=2)
    letter = np.expand_dims(letter, axis=0)
    prediction = model.predict(letter)
    predicted = lb.inverse_transform(prediction)[0]
    print(predicted)

If you run this script, you should see something like the following:

Using CNTK backend
Selected GPU[0] GeForce GTX 980M as the process wide default device.
Testing: generated_images\NHXS_322.png
N
H
X
S

 

As you can see, the network correctly predicts the sequence of characters in the CAPTCHA. This concludes our brief tour of CAPTCHA cracking. As we’ve discussed before, keep in mind that several alternative approaches exist, such as training an OCR toolkit or using a service with “human crackers” at low cost.

 

Also keep in mind that you might have to fine-tune both the OpenCV and Keras parts in case you plan to apply this idea to other CAPTCHAs, and that the CAPTCHA generator we’ve used here is still relatively “easy.”

 

Most important, however, remains the fact that CAPTCHAs signpost a warning, basically stating explicitly that web scrapers are not welcome. Keep this in mind as well before you set off cracking CAPTCHAs left and right.

 

Even a Traditional Model Might Work

As we’ve seen, it’s not that trivial to set up a deep learning pipeline.

 

In case you’re wondering whether a traditional predictive modeling technique such as random forests or support vector machines might also work (both are available in scikit-learn, for instance, and are much quicker to set up and train), the answer is yes: in some cases these might work, albeit at a heavy accuracy cost.

 

Such traditional techniques have a hard time understanding the two-dimensional structure of images, which is exactly what a convolutional neural network aims to address. This being said, we’ve set up pipelines using a random forest and about 100 manually labeled CAPTCHA images that obtained a low accuracy of about 10 percent, though enough to get the answer right after a handful of tries.
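As a minimal sketch of such a traditional pipeline (under the assumption that you have already produced the “letters” folder with cut.py; flattening each image to a one-dimensional vector is exactly where the two-dimensional structure gets lost):

import os.path
from glob import glob
from os import listdir
import cv2
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from constants import *

data, labels = [], []
for label in listdir(LETTERS_FOLDER):
    for image_file in glob(os.path.join(LETTERS_FOLDER, label, '*.png')):
        image = cv2.imread(image_file)
        image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        image = cv2.resize(image, MODEL_SHAPE)
        data.append(image.flatten() / 255.0)  # 2D image -> flat vector
        labels.append(label)

X_train, X_test, y_train, y_test = train_test_split(data, labels,
    test_size=0.25, random_state=0)
clf = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
print('Accuracy:', clf.score(X_test, y_test))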
