@mrgordon
Created September 26, 2012 00:14
hopefully this can work
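# Tags each row of a CSV export with the words of interest found in the
# stored page data of the websites whose URL matches that row.
# Assumes a Rails environment where Website and TextProcessing are loaded.
require 'csv'
require 'json'

words_of_interest = Marshal.load(File.binread('models/tags.array'))

# Map id => url for the 7495 websites with the highest ids.
websites = {}
Website.order("id desc").limit(7495).each do |w|
  websites[w.id] = w.url
end

rows = CSV.read("/tmp/answers_from_cat_modified_utf8.csv", :headers => true)

# Block form closes output.csv even if a row raises.
CSV.open('output.csv', 'wb') do |output|
  output << rows.headers + ['tags']

  rows.each do |row|
    url = row['cf_url_verified'] || row['url']
    # Skip rows without a usable URL; an empty string would match every site.
    next if url.nil? || url.empty? || url == "(null)"

    # Substring match against the stored URLs, capped at 25 sites per row.
    website_ids = websites.select { |_id, site_url| site_url.to_s.include?(url) }.keys.take(25)
    p website_ids

    # temporary hack to skip MASSIVE site
    next if website_ids.include?(28781)

    # Concatenate the page data of every matching site, space-separated.
    text = ""
    website_ids.each do |id|
      puts "processing #{id}"
      text << (Website.find(id).data || '') << ' '
    end
    p text.size

    puts "Tagging..."
    matches = TextProcessing.nuanced_match(words_of_interest, text)
    # Keep only the entries with a positive value and serialize them as JSON.
    results = Hash[matches.select { |_k, v| v > 0 }].to_json
    puts "Tagging finished"

    output << row.fields + [results]
    output.flush
  end
end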
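
The URL lookup in the script is a plain substring match over an id => url hash, so it can be tried in isolation without the database. The sketch below is only an illustration under stated assumptions: `matching_website_ids` is a hypothetical helper that does not appear in the gist, and the sample hash stands in for the rows loaded from `Website`.

```ruby
# Hypothetical helper (not part of the original gist): returns the ids of
# websites whose stored URL contains the given URL fragment, in hash order.
def matching_website_ids(websites, url, limit = 25)
  # Treat nil, empty, and the literal "(null)" as "no URL".
  return [] if url.nil? || url.empty? || url == "(null)"

  websites.select { |_id, site_url| site_url.to_s.include?(url) }.keys.take(limit)
end

# Made-up sample data standing in for Website records, highest id first.
sample_websites = {
  3 => "http://example.com/a",
  2 => "http://example.com/b",
  1 => "http://other.org/"
}

p matching_website_ids(sample_websites, "example.com")  # [3, 2]
p matching_website_ids(sample_websites, "(null)")       # []
```

Because the match is a substring test, a short fragment such as `com` would match almost every stored URL, which may be why the script caps each row at 25 sites.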