I’m not a human: Breaking the Google ReCaptcha

Summary

Intro
ReCaptcha
Analyzing the advanced risk analyzer
Image captcha
Image captcha breaker

Intro

The goal of Google's redesigned reCaptcha is twofold:

To minimize the effort for legitimate users

To require attackers to solve tasks that are harder for computers than text recognition

ReCaptcha is driven by an advanced risk analysis system that evaluates each request and selects the difficulty of the captcha accordingly (checkbox, similar-images challenge).

The authors succeed at:

Influencing the risk analysis system, bypassing restrictions and deploying large-scale attacks.

A low-cost, DNN-based attack that annotates images (solving 70.78% of challenges at an average of 19 seconds per challenge; accuracy on the Facebook image captcha is 83.5%)

ReCaptcha

Widget

JavaScript code is obfuscated

Collects user browser info

Series of checks to verify user's browser

Workflow

Request containing

The referrer

Sitekey

Cookie for google.com

Encrypted browser check info

Advanced risk analysis system responds with appropriate challenge

Challenge types

Harder challenges presented when

Low reputation for user

Requests multiple challenges

Wrong answer several times

Versions

No captcha reCaptcha (checkbox)

No action required if user has high reputation

Image reCaptcha

Similar images challenge

Sample image + 9 candidates for selection

Keyword describing content

2 to 4 Correct answers

Text reCaptcha

Distorted text

For users with lower reputation

Also served when the user agent fails certain browser checks

These are harder to solve, but bots can solve them too, so they are no longer relevant

Solution

The challenge must be solved within 55 seconds; otherwise a new challenge is issued

The HTML field recaptcha-token is populated with a token and submitted to the website, which in turn sends it to Google

A high-reputation user's token becomes valid on Google's side automatically

On completion of the desired action, the website submits a request containing

A shared secret

The response token

User's IP address (optional)
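
The request described above can be sketched from the website's side. This is a minimal illustration, assuming the publicly documented `siteverify` endpoint and its `secret`, `response` and `remoteip` form fields; the helper name `build_verify_request` and all values are hypothetical, and the request is only built here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Google's public verification endpoint for reCaptcha tokens.
VERIFY_URL = "https://www.google.com/recaptcha/api/siteverify"


def build_verify_request(secret, token, remote_ip=None):
    """Build (but do not send) the POST request a website issues to verify a token."""
    fields = {"secret": secret, "response": token}
    if remote_ip is not None:
        # The user's IP address is optional.
        fields["remoteip"] = remote_ip
    body = urlencode(fields).encode("ascii")
    return Request(VERIFY_URL, data=body, method="POST")


# Placeholder values for illustration only.
req = build_verify_request("my-shared-secret", "token-from-form", "203.0.113.7")
print(req.get_method(), req.full_url)
print(req.data.decode("ascii"))
```

Passing the request to `urllib.request.urlopen` would perform the actual check; Google answers with a JSON document whose `success` field tells the website whether to accept the action.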

Analyzing the advanced risk analyzer

Check how various characteristics of the user and the user's browser information affect the analysis

Aim: issue captcha requests that are considered legitimate and thus receive checkbox captchas

Browsing history

Quantify the minimum amount of browsing history the Google tracking cookie needs in order to receive a checkbox captcha

Create a lot of activity on Google sites (also with correlation between activities)

9th day from cookie creation resulted in checkbox captcha

Browser Environment

The obfuscated code reveals that Google's servers receive and process at least the following information:

Plug-ins

User-agent

Screen resolution

Execution time, timezone

Number of click/keyboard/touch actions in the iframe of the captcha

It tests the behavior of many browser-specific functions and CSS rules

It checks the rendering of canvas elements

Cookies are likely checked server-side (the code is executed on the www.google.com domain)

Canvas rendering

Creates a canvas element and draws a predefined composition

Fingerprinting the GPU essentially

Element encoded in base64 and sent back

Used to find discrepancy with user agent data

The resulting fingerprint is not unique

User Agent

Suspicious

Outdated browser/browser-engine

Mismatch (uses Firefox but reports Chrome)

Misformatted user agent

Screen resolution and mouse

No negative effect for any combination of screen resolution and mouse configuration (including the timing and patterns of movements)

Site restriction

An attacker could solve captchas on a site they control and associate the tokens with the target site, which would reduce the network cost of obtaining tokens

The redesigned reCaptcha guards against this using the Referrer and the read-only document.location.hostname

Workaround is to

Set up a virtual host on the server and set ServerName and other fields to correspond to the target site

a2ensite (serve the configured site) and run on localhost to trick reCaptcha

Also needs sitekey which can be obtained from website source

Easy captcha breaker

Create cookies that seem to originate from legitimate users in a clean virtual machine where the browser automation system stores non-account google.com cookies

Token Harvesting

No restriction on creating a large number of cookies from the same IP address

Restriction on a large number of concurrent requests (DoS)

Evaluation

Aged cookies were tested at different captcha request rates; at higher rates, requests were observed to be dropped

1,200 - 2,500 checkbox captchas per hour

Image captcha

Solution Flexibility

High number of challenges with 2 correct answers

The captcha-breaking system therefore selects 3 images

Pass

n Correct + k Wrong (k <= 1)

n - 1 Correct

Fail

n - 1 Correct + k Wrong (k > 0)
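
The acceptance rules above can be written as a small predicate. This is a sketch of the rules as recorded in these notes, not Google's actual implementation; the function name `is_accepted` is hypothetical, and treating every case the notes do not list (for example fewer than n - 1 correct selections) as a failure is an assumption.

```python
def is_accepted(selected, correct):
    """Apply the observed pass/fail rules to one image challenge.

    selected: indices of the images the solver clicked
    correct:  indices of the n images that are actually correct
    """
    selected, correct = set(selected), set(correct)
    n = len(correct)
    hits = len(selected & correct)    # correct images selected
    wrong = len(selected - correct)   # incorrect images selected

    if hits == n:
        return wrong <= 1             # n correct + at most 1 wrong passes
    if hits == n - 1:
        return wrong == 0             # n - 1 correct passes only with no wrong picks
    return False                      # assumption: cases not listed in the notes fail


correct = {1, 4}                       # a challenge with 2 correct answers
print(is_accepted({1, 4, 7}, correct)) # n correct + 1 wrong
print(is_accepted({1}, correct))       # n - 1 correct
print(is_accepted({1, 7}, correct))    # n - 1 correct + 1 wrong
```

This shows why always selecting 3 images is a reasonable strategy when most challenges have 2 correct answers: one wrong pick is tolerated as long as both correct images are among the three.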

Image repetition

Some repetition of challenges with exact same images in the same order implies a small pool of challenges with periodic update

Images are also repeated with different MD5 hashes (identified using perceptual hashes)
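
To illustrate why a perceptual hash catches repeats that MD5 misses, here is a minimal sketch. The notes do not say which perceptual hash was used; an average hash is swapped in as a simple stand-in, and the tiny 2x2 grayscale grids are made-up data standing in for downscaled images.

```python
import hashlib


def average_hash(pixels):
    """Average hash: one bit per pixel, set when the pixel is brighter than the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return [1 if p > mean else 0 for p in flat]


def hamming(a, b):
    """Number of positions at which two equal-length bit lists differ."""
    return sum(x != y for x, y in zip(a, b))


def md5_of(pixels):
    """MD5 of the raw pixel bytes; any single-byte change alters it."""
    return hashlib.md5(bytes(p for row in pixels for p in row)).hexdigest()


original = [[10, 200], [220, 30]]
recompressed = [[12, 198], [223, 29]]   # same picture, slightly different bytes
different = [[200, 10], [30, 220]]      # a different picture

print(md5_of(original) == md5_of(recompressed))                        # False
print(hamming(average_hash(original), average_hash(recompressed)))    # 0
print(hamming(average_hash(original), average_hash(different)))       # 4
```

A small Hamming distance between two perceptual hashes marks the images as the same picture even though their cryptographic hashes differ.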

Image captcha breaker

Solving using semantic similarity between sample image and candidate images. Image annotation module receives sample image, candidate images and the keyword/hint provided.

Image annotation services and libraries

GRIS (Google reverse image search)

Best guess description for candidates (non-English results are translated using Google Translate)

List of websites (page titles are useful)

Available sizes (Higher res useful for annotation)

Clarifai

Deconv network returning confidence-sorted 20 tags

Alchemy

Image recognition API returning up to 8 specific tags

TDL

Confidence-sorted 8 tags

NeuralTalk

RNN producing a free-form (sentence-like) description of the image, which is then broken down into individual words

Caffe model

Local processing model in caffe

Returns confidence-sorted 5 tags, specificity-sorted 5 tags

Tag classifier

Unsupervised machine learning classifier to guess content of image using a subset of the tags

Uses Word2Vec for cosine similarity

Compares the angle between documents and the tf-idf magnitude of <tag, hint> word-vector pairs across the candidate images

Term frequency-inverse document frequency (tf-idf) is a measure of how important a word is in a document. Magnitudes of rarer words are scaled up on a logarithmic scale

Ultimately to find subset of useful tags
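
The two ingredients named above, tf-idf weighting and cosine similarity, can be sketched in a few lines. This is a simplified bag-of-words illustration, not the paper's classifier: it does not use Word2Vec embeddings, each image's tag list is treated as one document, and the tag lists and function names are made up for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse {term: weight} vector per document (list of tags)."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Rarer terms get a larger weight through the logarithmic idf factor.
        vectors.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine of the angle between two sparse vectors (0.0 if either is all zeros)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


tags = [
    ["wine", "glass", "drink"],    # hypothetical tags for candidate 0
    ["wine", "bottle", "drink"],   # candidate 1 shares two tags with candidate 0
    ["car", "road", "vehicle"],    # candidate 2 shares none
]
v0, v1, v2 = tfidf_vectors(tags)
print(cosine(v0, v1) > cosine(v0, v2))   # True
print(cosine(v0, v2))                    # 0.0
```

A term that appears in every document gets weight 0 under this idf definition, which is one way uninformative tags drop out of the useful subset.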

History module

Labelled dataset with tags annotated with the hints

Maintain hint list

Solution

Pass all candidates through GRIS

If no hint provided, search for the sample image in dataset to obtain its hint

Search for candidates in dataset, compare current tags with annotated hints and select those that match

Remaining images are discarded if there is a match in the hint list, since that means the hint was seen before but none of the images in the dataset match this image

Rest is undecided

Then also combine the results from the best-guess and page-title modules (lowest weight)

If an adequate number of images is not found, select from the undecided list the images that have the most tags overlapping with the sample image.
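
The final fallback step, ranking undecided images by tag overlap with the sample image, can be sketched as follows. The function name `pick_by_overlap`, the image names and the tag lists are all hypothetical, and breaking ties by original order is an assumption.

```python
def pick_by_overlap(sample_tags, undecided, needed):
    """Pick the `needed` undecided images sharing the most tags with the sample.

    sample_tags: tags produced for the sample image
    undecided:   {image_name: [tags]} for images not yet accepted or discarded
    """
    sample = set(sample_tags)
    # sorted() is stable, so images with equal overlap keep their original order.
    ranked = sorted(
        undecided.items(),
        key=lambda item: len(sample & set(item[1])),
        reverse=True,
    )
    return [name for name, _ in ranked[:needed]]


undecided = {
    "img3": ["wine", "bottle"],
    "img5": ["car"],
    "img8": ["glass", "red", "wine"],
}
print(pick_by_overlap(["wine", "glass", "red"], undecided, needed=2))
# ['img8', 'img3']
```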

Sachin-A/breaking_recaptcha.md