sebastian-nagel / saturn.pretty.wat
Last active December 17, 2024 11:41 — forked from pjox/saturn.pretty.wat
Common Crawl format example for https://en.wikipedia.org/wiki/Saturn
WARC/1.0
WARC-Type: metadata
WARC-Target-URI: https://en.wikipedia.org/wiki/Saturn
WARC-Date: 2024-12-11T20:20:04Z
WARC-Record-ID: <urn:uuid:74b1614e-97bb-4a19-b02f-defc603ab81c>
WARC-Refers-To: <urn:uuid:90f1a666-d5ba-4e8d-806d-4d848e77a0f8>
Content-Type: application/json
Content-Length: 1910
{
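The header block of the metadata record above is a version line followed by `Name: value` fields. A minimal sketch of reading those fields into a dictionary, assuming the header is already available as a string; `parse_warc_headers` and `SAMPLE` are hypothetical names used only for illustration, not part of any WARC library:

```python
def parse_warc_headers(block):
    """Split a WARC header block into (version line, dict of header fields)."""
    lines = [ln for ln in block.splitlines() if ln.strip()]
    version = lines[0].strip()
    headers = {}
    for line in lines[1:]:
        # Split on the first colon only, so URLs and URNs in the value stay intact.
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return version, headers


# Header fields copied from the record shown above.
SAMPLE = """WARC/1.0
WARC-Type: metadata
WARC-Target-URI: https://en.wikipedia.org/wiki/Saturn
WARC-Date: 2024-12-11T20:20:04Z
WARC-Record-ID: <urn:uuid:74b1614e-97bb-4a19-b02f-defc603ab81c>
WARC-Refers-To: <urn:uuid:90f1a666-d5ba-4e8d-806d-4d848e77a0f8>
Content-Type: application/json
Content-Length: 1910"""

version, headers = parse_warc_headers(SAMPLE)
payload_length = int(headers["Content-Length"])  # size of the JSON payload in bytes
```

Splitting on the first colon only is what keeps values such as the target URI and the `urn:uuid:` record IDs in one piece.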
# -*- coding: utf-8 -*-
"""
common-crawl-cdx.py
A simple example program to analyze the Common Crawl index.
This is implemented as a single stream job which accesses S3 via HTTP,
so that it can be run from any laptop, but it could easily be
converted to an EMR job which processes the 300 index files in parallel.