@Sachin-A
Last active October 10, 2017 17:36
Summary of ReCaptcha paper

I’m not a human: Breaking the Google ReCaptcha

Summary

Contents

  1. Intro
  2. ReCaptcha
  3. Analyzing the advanced risk analyzer
  4. Image captcha
  5. Image captcha breaker

Intro

  • Google's redesigned reCaptcha has two goals:
    • Minimize the effort required from legitimate users
    • Require tasks from attackers that are harder for computers than text recognition
  • ReCaptcha is driven by an advanced risk analysis system that evaluates each request and selects the difficulty of the captcha accordingly (checkbox, similar images challenge)

  • The paper succeeds at

    • Influencing the risk analysis system to bypass restrictions and deploy large-scale attacks
    • A low-cost DNN-based attack that annotates images (solving 70.78% of challenges, averaging 19 seconds per challenge; 83.5% accuracy on the FB captcha)

ReCaptcha

Widget

  • JavaScript code is obfuscated
  • Collects the user's browser info
  • Runs a series of checks to verify the user's browser

Workflow

  • The widget sends a request containing
    1. The referrer
    2. Sitekey
    3. Cookie for google.com
    4. Encrypted browser check info
  • Advanced risk analysis system responds with appropriate challenge

Challenge types

  • Harder challenges are presented when
    • The user has a low reputation
    • The user requests multiple challenges
    • The user answers incorrectly several times
  • Versions
    • No captcha reCaptcha (checkbox)
      • No action required if the user has a high reputation
    • Image reCaptcha
      • Similar images challenge
      • Sample image + 9 candidates for selection
      • Keyword describing content
      • 2 to 4 correct answers
    • Text reCaptcha
      • Distorted text
      • Shown to users with lower reputation
      • Also shown when the user agent fails certain browser checks
      • Harder to solve, yet still solvable by bots, so largely irrelevant now

Solution

  • Each challenge must be solved within 55 seconds; otherwise a new challenge is issued
  • The HTML field recaptcha-token is populated with a token and submitted to the website, which in turn sends it to Google
  • A high-reputation user's token becomes valid on Google's side automatically
  • Once the desired action completes, the website submits a verification request to Google containing
    • A shared secret
    • The response token
    • User's IP address (optional)
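
The verification request described above can be sketched as follows. This is a minimal illustration, not the paper's code: the endpoint and the field names `secret`, `response` and `remoteip` come from Google's public siteverify documentation rather than from these notes, and the credentials are placeholders. The request is only built, never sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Public endpoint documented by Google for server-side token verification.
VERIFY_URL = "https://www.google.com/recaptcha/api/siteverify"


def build_verify_request(secret, response_token, remote_ip=None):
    """Build (but do not send) the POST request a website issues to Google."""
    fields = {"secret": secret, "response": response_token}
    if remote_ip is not None:
        fields["remoteip"] = remote_ip  # optional, as in the list above
    body = urlencode(fields).encode("ascii")
    return Request(VERIFY_URL, data=body, method="POST")


# Placeholder values, not real credentials.
req = build_verify_request("SHARED_SECRET", "TOKEN_FROM_FORM", "203.0.113.7")
print(req.get_method(), req.full_url)
print(req.data.decode("ascii"))
```

Passing the request to `urllib.request.urlopen` would send it; Google's documentation describes a JSON reply with a `success` field.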

Analyzing the advanced risk analyzer

  • Check how various characteristics of the user and the user's browser affect the analysis
  • Aim to issue captcha requests that are considered legitimate and thus receive checkbox captchas
  1. Browsing history
  • Quantify the minimum amount of browsing history the Google tracking cookie needs before a checkbox captcha is served
  • Create a lot of activity on Google sites (also with correlation between activities)
  • A checkbox captcha was served on the 9th day after cookie creation
  2. Browser Environment
  • The obfuscated code reveals that Google's servers receive and process at least the following information:
    • Plug-ins
    • User-agent
    • Screen resolution
    • Execution time, timezone
    • Number of click/keyboard/touch actions in the iframe of the captcha
    • Results of tests on the behavior of many browser-specific functions and CSS rules
    • Rendering of canvas elements
    • Likely cookies as well, read server-side (the code executes on the www.google.com domain)
  • Canvas rendering
    • Creates a canvas element and draws a predefined composition
    • Essentially fingerprints the GPU
    • The rendered element is encoded in base64 and sent back
    • Used to detect discrepancies with the reported user agent
    • The fingerprint is not unique
  • User Agent
    • Treated as suspicious
      • Outdated browser/browser engine
      • Mismatch (uses Firefox but reports Chrome)
      • Malformed user agent
  • Screen resolution and mouse
    • No negative effect for any combination of screen res and mouse configuration (including timing of movements and patterns)
  3. Site restriction
    • Solving captchas on an attacker-controlled site and associating the tokens with the target site could reduce the network cost of building tokens
    • The redesigned reCaptcha guards against this using the referrer and the read-only document.location.hostname
    • Workaround is to
      • Set up a virtual host on the server and set ServerName and other fields to correspond to the target site
      • Enable the configured site with a2ensite and run it on localhost to trick reCaptcha
      • The sitekey is also needed; it can be obtained from the target website's source
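
To make the user-agent checks listed under Browser Environment concrete, here is a rough sketch of the three suspicious conditions (outdated, mismatch, malformed). It is purely illustrative and not Google's logic: the function name, the version thresholds in `MIN_MAJOR`, and the `observed_engine` argument (standing in for whatever the browser checks infer) are all assumptions.

```python
import re

# Hypothetical minimum major versions; the notes give no real thresholds.
MIN_MAJOR = {"chrome": 40, "firefox": 35}


def user_agent_flags(user_agent, observed_engine):
    """Return the reasons a user-agent string looks suspicious.

    observed_engine is the browser inferred from behavioral checks
    ("chrome" or "firefox"); how it is inferred is out of scope here.
    """
    match = re.search(r"(Chrome|Firefox)/(\d+)", user_agent)
    if match is None:
        return ["malformed"]  # no recognizable browser token
    claimed, major = match.group(1).lower(), int(match.group(2))
    flags = []
    if major < MIN_MAJOR[claimed]:
        flags.append("outdated")
    if claimed != observed_engine:
        flags.append("mismatch")  # e.g. behaves like Firefox, reports Chrome
    return flags


firefox_ua = "Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0"
old_chrome_ua = ("Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 "
                 "(KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36")

print(user_agent_flags(firefox_ua, "firefox"))
print(user_agent_flags(old_chrome_ua, "firefox"))
print(user_agent_flags("not a browser", "chrome"))
```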

Easy captcha breaker

Create cookies that appear to originate from legitimate users: a browser automation system runs in a clean virtual machine and stores google.com cookies that are not tied to any account.

  1. Token Harvesting
    • No restriction on creating a large number of cookies from the same IP address
    • A large number of concurrent requests is restricted (DoS protection)
  2. Evaluation
    • Aged cookies were tested at different captcha request rates; requests were dropped at higher rates
    • 1,200 - 2,500 checkbox captchas per hour

Image captcha

  1. Solution Flexibility
    • A high number of challenges have 2 correct answers
    • The captcha-breaking system therefore selects 3 images
    • Pass
      • n Correct + k Wrong (k <= 1)
      • n - 1 Correct
    • Fail
      • n - 1 Correct + k Wrong (k > 0)
  2. Image repetition
    • Some challenges repeat with the exact same images in the same order, which implies a small pool of challenges that is updated periodically
    • Images are also repeated with different MD5 hashes (identified using perceptual hashes)
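
Both observations in this section lend themselves to a short sketch. `passes` encodes the pass/fail rules listed under Solution Flexibility; cases the notes do not cover (fewer than n - 1 correct) are assumed to fail. `average_hash` is one simple perceptual hash, used here to show how re-encoded copies can differ in MD5 yet match perceptually; the notes do not say which perceptual hash the paper used, and the 8x8 grayscale grids are toy stand-ins for decoded, downscaled images.

```python
import hashlib


def passes(selected, correct):
    """Apply the pass/fail rules from the notes to one submission."""
    selected, correct = set(selected), set(correct)
    n = len(correct)
    hits = len(selected & correct)
    wrong = len(selected - correct)
    if hits == n:
        return wrong <= 1   # n correct + k wrong, k <= 1
    if hits == n - 1:
        return wrong == 0   # n - 1 correct and nothing wrong
    return False            # not covered by the notes; assumed to fail


def average_hash(pixels):
    """64-bit average hash of an 8x8 grayscale grid (values 0-255)."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def md5_of(pixels):
    """Byte-level hash of the raw pixel values."""
    return hashlib.md5(bytes(v for row in pixels for v in row)).hexdigest()


# Toy image: left half dark, right half bright.
original = [[20] * 4 + [200] * 4 for _ in range(8)]
# Same picture with a slight brightness shift, as re-encoding might cause.
reencoded = [[v + 3 for v in row] for row in original]

print(passes({1, 2, 3}, {1, 2}))   # both correct images plus one wrong one
print(md5_of(original) == md5_of(reencoded))
print(hamming(average_hash(original), average_hash(reencoded)))
```

The tolerance in `passes` is why selecting three images is attractive when most challenges have two correct answers: a submission holding both correct images and one wrong one is still accepted.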

Image captcha breaker

Solving using semantic similarity between sample image and candidate images. Image annotation module receives sample image, candidate images and the keyword/hint provided.

  1. Image annotation services and libraries
    • GRIS (Google reverse image search)
      • Best guess description for candidates (non-English descriptions are translated using Google Translate)
      • List of websites (page titles are useful)
      • Available sizes (higher resolution is useful for annotation)
    • Clarifai
      • Deconv network returning 20 confidence-sorted tags
    • Alchemy
      • Image recognition API returning up to 8 specific tags
    • TDL
      • 8 confidence-sorted tags
    • NeuralTalk
      • RNN producing a free-form (sentence-like) description of the image, which is broken down into individual words
    • Caffe model
      • Model run locally in Caffe
      • Returns 5 confidence-sorted tags and 5 specificity-sorted tags
  2. Tag classifier
    • Unsupervised machine learning classifier that guesses the content of an image using a subset of the tags
    • Uses Word2Vec for cosine similarity
      • Uses the angle between documents and the tf-idf magnitude of <tag, hint> word-vector pairs for each of the candidate images
      • Term frequency-inverse document frequency (tf-idf) measures how important a word is in a document. Rarer words are weighted more heavily, on a logarithmic scale
      • The ultimate aim is to find the subset of useful tags
  3. History module
    • Labelled dataset with tags annotated with the hints
    • Maintain hint list
  4. Solution
    • Pass all candidates through GRIS
    • If no hint is provided, search for the sample image in the dataset to obtain its hint
    • Search for the candidates in the dataset, compare their current tags with the annotated hints, and select those that match
    • Remaining images are discarded if the hint is in the hint list, since that means the hint was seen before but none of the images in the dataset match this image
    • The rest are undecided
    • Then combine the results from the best-guess and page-title modules (lowest weight) as well
    • If an adequate number of images is not found, select from the undecided list the images whose tags overlap most with those of the sample image.
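
As a rough illustration of the tag comparison and the final fallback step, the sketch below ranks candidates by cosine similarity between tf-idf weighted tag lists. It deliberately swaps the Word2Vec embeddings mentioned above for plain bag-of-tags vectors so that it runs without a trained model; the function names, the example tags, and the smoothed idf variant are assumptions, not details from the paper.

```python
import math
from collections import Counter


def tfidf_vectors(tag_lists):
    """Map each tag list to a {tag: tf-idf weight} dict."""
    n_docs = len(tag_lists)
    doc_freq = Counter(tag for tags in tag_lists for tag in set(tags))
    vectors = []
    for tags in tag_lists:
        counts = Counter(tags)
        vectors.append({
            # Smoothed idf: rarer tags get larger weights, on a log scale.
            tag: (count / len(tags))
            * (math.log((1 + n_docs) / (1 + doc_freq[tag])) + 1)
            for tag, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(weight * v.get(tag, 0.0) for tag, weight in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def rank_candidates(sample_tags, candidate_tags, n_select):
    """Indices of the n_select candidates most similar to the sample."""
    vectors = tfidf_vectors([sample_tags] + candidate_tags)
    sample, candidates = vectors[0], vectors[1:]
    scores = [cosine(sample, c) for c in candidates]
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return order[:n_select]


# Made-up tags standing in for annotation-service output.
sample = ["wine", "glass", "drink"]
candidates = [
    ["car", "road"],
    ["wine", "bottle", "drink"],
    ["dog", "grass"],
    ["glass", "wine", "table"],
]
print(sorted(rank_candidates(sample, candidates, 2)))
```

With real embeddings, the dot product over shared tags would be replaced by similarity between word vectors, so that related but non-identical tags (for example "wine" and "merlot") also contribute.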