@sebastian-nagel
sebastian-nagel / saturn.pretty.wat
Last active December 17, 2024 11:41 — forked from pjox/saturn.pretty.wat
Common Crawl format example for https://en.wikipedia.org/wiki/Saturn
WARC/1.0
WARC-Type: metadata
WARC-Target-URI: https://en.wikipedia.org/wiki/Saturn
WARC-Date: 2024-12-11T20:20:04Z
WARC-Record-ID: <urn:uuid:74b1614e-97bb-4a19-b02f-defc603ab81c>
WARC-Refers-To: <urn:uuid:90f1a666-d5ba-4e8d-806d-4d848e77a0f8>
Content-Type: application/json
Content-Length: 1910
{
sebastian-nagel / jython_webgraph_commands.sh
Last active September 28, 2020 13:38
webgraph commands
### Jython
# install Jython (see https://www.jython.org/download)
wget https://repo1.maven.org/maven2/org/python/jython-standalone/2.7.2/jython-standalone-2.7.2.jar
# clone pywebgraph (fork with modifications)
git clone https://github.com/commoncrawl/py-web-graph.git
cd py-web-graph
# copy console.py into current working directory so that "pywebgraph" is visible as package
cp pywebgraph/console.py .
sebastian-nagel / REAMDE.md
Created October 21, 2019 13:05
character set and content language correlations
from warcio.archiveiterator import ArchiveIterator
with open('path/to/file.wet.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == 'conversion':
            url = record.rec_headers.get_header('WARC-Target-URI')
            text = record.content_stream().read().decode('utf-8')
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>
<![CDATA[ http://www.example.com/sitemap1.xml ]]>
</loc>
<lastmod>
<![CDATA[ 2018-12-12 02:06:56 ]]>
</lastmod>
</sitemap>
sebastian-nagel / watlinks.path.freq.txt
Created October 19, 2017 14:20
Link path identifiers from a single Common Crawl WAT file
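A minimal sketch of how such a frequency list could be produced, assuming each WAT record has already been parsed from JSON and that links sit under `Envelope` > `Payload-Metadata` > `HTTP-Response-Metadata` > `HTML-Metadata` > `Links`. The function name `count_link_paths` and the example record are hypothetical and not taken from the gist.

```python
from collections import Counter


def count_link_paths(wat_record):
    """Count the "path" identifiers of all links in one parsed WAT JSON record."""
    links = (
        wat_record.get("Envelope", {})
        .get("Payload-Metadata", {})
        .get("HTTP-Response-Metadata", {})
        .get("HTML-Metadata", {})
        .get("Links", [])
    )
    # Each link is expected to be a dict such as {"path": "A@/href", "url": "..."}.
    return Counter(link["path"] for link in links if "path" in link)


# Hypothetical, hand-written record for illustration only.
example_record = {
    "Envelope": {
        "Payload-Metadata": {
            "HTTP-Response-Metadata": {
                "HTML-Metadata": {
                    "Links": [
                        {"path": "A@/href", "url": "/wiki/Jupiter"},
                        {"path": "A@/href", "url": "/wiki/Titan_(moon)"},
                        {"path": "IMG@/src", "url": "/static/saturn.jpg"},
                    ]
                }
            }
        }
    }
}

for path, freq in count_link_paths(example_record).most_common():
    print(freq, path)
```

Summing the counters of all records in a WAT file would give a per-file frequency table; records without HTML metadata simply contribute nothing.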
sebastian-nagel / cdx_get_warc_record.py
Last active March 9, 2018 08:59
Python script to export Common Crawl WARC records found via CDX to a file named my.warc.gz: `zgrep '...pattern...' cdx-*.gz | python3 cdx_get_warc_record.py >my.warc.gz`
import fileinput
import sys
import boto3
import botocore
import ujson as json
no_sign_request = botocore.client.Config(
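The preview above is cut off, so the following is only a sketch of the lookup step such a script needs, not the gist's own code. It assumes each CDX line ends with a JSON object that carries `filename`, `offset` and `length`, and it ignores everything before the first `{` (SURT key, timestamp, or a file name prefix added by `zgrep`). The helper names `parse_cdx_line` and `byte_range` and the example line are hypothetical.

```python
import json


def parse_cdx_line(line):
    """Return (filename, offset, length) from one CDX index line."""
    # Assumption: the JSON part starts at the first "{" of the line.
    fields = json.loads(line[line.index("{"):])
    return fields["filename"], int(fields["offset"]), int(fields["length"])


def byte_range(offset, length):
    """Build the HTTP Range header value covering exactly one WARC record."""
    return "bytes={}-{}".format(offset, offset + length - 1)


# Hypothetical CDX line for illustration only.
example_line = (
    'cdx-00000.gz:org,example)/ 20180101000000 '
    '{"url": "http://example.org/", "filename": "path/to/example.warc.gz", '
    '"offset": "1000", "length": "500"}'
)

filename, offset, length = parse_cdx_line(example_line)
print(filename, byte_range(offset, length))
```

Because every record in a Common Crawl WARC file is compressed as its own gzip member, the bytes fetched for each range can be written to the output one after another and the result is still a readable `.warc.gz` file.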
# hanging executor on Spark 2.1.0 and Python 2.7
from pyspark import SparkContext
class BadEncodedException(Exception):
    def __init__(self, reason):
        self.msg = str(reason)
        super(BadEncodedException, self).__init__(self.msg)
sebastian-nagel / cs_despam_host_pagerank.py
Last active November 9, 2022 22:17
Simple spam detection for the Common Search host-level PageRank list: detects blocks of hosts with similar ranks and similar host names, which possibly form link farms
import fileinput
import sys
import tldextract
from collections import defaultdict
from math import log
RANK_DIVERGENCE_THR = 0.02
HOST_LENGTH_DIVERGENCE_THR = 0.15
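The preview stops at the two thresholds, so how the script applies them is not visible here. The sketch below shows one plausible reading, stated as an assumption: walk the list in rank order and collect runs of consecutive hosts whose ranks and host name lengths each differ by no more than the respective threshold. The helpers `relative_divergence` and `similar_blocks`, the `min_block_size` parameter, and the example data are hypothetical; the sketch uses neither `tldextract` nor `log`, which the original imports.

```python
RANK_DIVERGENCE_THR = 0.02
HOST_LENGTH_DIVERGENCE_THR = 0.15


def relative_divergence(a, b):
    """Relative difference of two positive numbers."""
    return abs(a - b) / max(a, b)


def similar_blocks(ranked_hosts, min_block_size=3):
    """Yield runs of consecutive (host, rank) pairs that look alike.

    Assumption: ranked_hosts is sorted by rank and all ranks are positive.
    """
    block = []
    for host, rank in ranked_hosts:
        if block:
            prev_host, prev_rank = block[-1]
            alike = (
                relative_divergence(rank, prev_rank) <= RANK_DIVERGENCE_THR
                and relative_divergence(len(host), len(prev_host))
                <= HOST_LENGTH_DIVERGENCE_THR
            )
            if not alike:
                # The run ended; report it only if it is long enough.
                if len(block) >= min_block_size:
                    yield block
                block = []
        block.append((host, rank))
    if len(block) >= min_block_size:
        yield block


# Hypothetical input for illustration only.
example_hosts = [
    ("example.org", 0.9),
    ("shop1.example.com", 0.5),
    ("shop2.example.com", 0.5),
    ("shop3.example.com", 0.499),
    ("other.net", 0.1),
]

for block in similar_blocks(example_hosts):
    print([host for host, _ in block])
```

A block found this way is only a candidate: legitimate hosts can share similar ranks and name lengths, so a real check would likely also compare the host names themselves.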