Created July 19, 2016 22:54
Extracts generator meta information from WAT files
require 'json'

# Collects the URL and <meta name="generator"> content of records in a WAT file.
class WatExtractor
  # Nested keys leading to the list of <meta> tags inside a WAT envelope.
  METAS_PATH = ['Envelope', 'Payload-Metadata', 'HTTP-Response-Metadata',
                'HTML-Metadata', 'Head', 'Metas'].freeze

  attr_reader :file

  def initialize(file_name)
    @file = File.new(file_name)
  end

  # True for the WARC header line that carries the record's URL.
  def target_uri?(line)
    line =~ /\AWARC-Target-URI:/
  end

  # True for the JSON envelope line of a record.
  def envelope?(line)
    line =~ /\A\{/
  end

  # Returns an array of { 'url' => ..., 'generator' => ... } hashes.
  def process
    generators = []
    current = {}
    # each_line stops cleanly at end of file, so no EOFError rescue is needed.
    file.each_line do |line|
      if target_uri?(line)
        # Accept both "\r\n" and "\n" line endings.
        match_data = /\AWARC-Target-URI: (?<url>.*?)\r?\n?\z/.match(line)
        current.store('url', match_data[:url]) if match_data
      end
      next unless envelope?(line) && line.include?('"generator"')

      json = JSON.parse(line)
      # dig returns nil instead of raising when a level is missing.
      meta_tags = json.dig(*METAS_PATH) || []
      generator = meta_tags.detect { |tag| tag['name'].eql?('generator') }
      current.store('generator', (generator || {}).fetch('content', ''))
      generators.push(current)
      current = {}
    end
    generators
  end
end
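The nested lookup in `process` is easier to follow against a concrete envelope. The sketch below uses a made-up sample record, trimmed to the keys the extractor reads. The helper name `generator_from` and the constant `GENERATOR_PATH` are illustrative assumptions, not part of the gist.

```ruby
require 'json'

# Made-up sample envelope, reduced to the keys the extractor walks.
envelope = {
  'Envelope' => {
    'Payload-Metadata' => {
      'HTTP-Response-Metadata' => {
        'HTML-Metadata' => {
          'Head' => {
            'Metas' => [
              { 'name' => 'viewport', 'content' => 'width=device-width' },
              { 'name' => 'generator', 'content' => 'WordPress 4.5.3' }
            ]
          }
        }
      }
    }
  }
}
envelope_line = JSON.generate(envelope)

# Key path from the top of the envelope down to the list of <meta> tags.
GENERATOR_PATH = ['Envelope', 'Payload-Metadata', 'HTTP-Response-Metadata',
                  'HTML-Metadata', 'Head', 'Metas'].freeze

# Hypothetical helper: returns the generator content for one envelope line,
# or an empty string when the tag or any level of the path is missing.
def generator_from(line)
  metas = JSON.parse(line).dig(*GENERATOR_PATH) || []
  tag = metas.detect { |meta| meta['name'] == 'generator' }
  (tag || {}).fetch('content', '')
end

puts generator_from(envelope_line)               # prints WordPress 4.5.3
puts generator_from('{"Envelope": {}}').inspect  # prints ""
```

Using `dig` means a record without HTML metadata yields an empty string rather than a `NoMethodError` on `nil`.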
TODO
- `MD5 digest` of the host in the loop for quick comparison
- `process` should be a public method
- `Queue` object and a `Thread`
- Write the extracted generators to a file so that, if it crashes, we don't have to reprocess everything
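One possible reading of the TODO items, as a rough sketch rather than the intended design: a worker `Thread` drains records from a `Queue`, skips any host whose MD5 digest it has already seen, and appends each kept record to a file as one JSON object per line, so a crash loses little work. The helper names `host_digest` and `write_unique_hosts`, the `:done` sentinel, and the JSON Lines output format are all assumptions made for illustration.

```ruby
require 'digest'
require 'json'
require 'uri'

# Hypothetical helper: MD5 hex digest of a URL's host, or nil when the URL
# has no host or cannot be parsed.
def host_digest(url)
  host = URI.parse(url.to_s).host
  host && Digest::MD5.hexdigest(host.downcase)
rescue URI::InvalidURIError
  nil
end

# Hypothetical helper: starts a thread that pops records off the queue until
# it sees :done, keeps the first record per host, and appends it to `path`.
def write_unique_hosts(queue, path)
  Thread.new do
    seen = {}
    File.open(path, 'a') do |out|
      while (record = queue.pop) != :done
        digest = host_digest(record['url'])
        next if digest.nil? || seen[digest]

        seen[digest] = true
        out.puts(JSON.generate(record))
        out.flush # push each record to disk promptly
      end
    end
  end
end
```

A caller would create a `Queue`, start the writer, push each hash returned by the extractor, push `:done`, and then `join` the thread. Resuming after a crash would additionally need the digests in the output file to be reloaded on startup, which this sketch does not do.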