@grangier
Forked from lusis/goose-jruby-example.rb
Created December 26, 2011 05:25
Example of using Goose from JRuby
require 'rubygems'
require 'java'
require 'chronic'
# Require every jar under lib/jars so its classes are visible to JRuby.
libs = []
libs << "lib/jars/*.jar"
libs.each do |lib|
  Dir[lib].each do |jar|
    puts "loading #{jar}"
    require jar
  end
end
# Expose the Goose Java packages as Ruby modules.
module Maverick
  include_package "com.gravity.goose"
end
module MaverickExtractors
  include_package "com.gravity.goose.extractors"
end
# Custom publish-date extractor: reads the text of the "submitted" div
# and parses it with Chronic.
class MyRubyDateExtractor < MaverickExtractors::PublishDateExtractor
  def extract(rawdoc)
    pub_date = rawdoc.select("div[class=submitted]").text
    Chronic.parse(pub_date).to_java
  end
end
@config = Maverick::Configuration.new
@config.local_storage_path = "./tmp"
@config.enable_image_fetching = false
@config.publish_date_extractor = MyRubyDateExtractor.new
url = "http://www.hollyscoop.com/paris-hilton/britney-shows-us-her-assets.html"
@goose = Maverick::Goose.new(@config)
@article = @goose.extract_content(url)
disp = <<EOT
Article title: #{@article.title}
Article pubdate: #{@article.publish_date}
Article tags: #{@article.meta_keywords}
Article:
------------------------------------------
#{@article.cleaned_article_text}
EOT
puts disp
lib/jars/
|-- akka-actor-1.1.3.jar
|-- akka-typed-actor-1.1.3.jar
|-- commons-codec-1.4.jar
|-- commons-io-2.0.1.jar
|-- commons-lang-2.6.jar
|-- commons-logging-1.1.1.jar
|-- goose-2.1.0.jar
|-- httpclient-4.1.2.jar
|-- httpcore-4.1.2.jar
|-- jsoup-1.5.2.jar
|-- log4j-1.2.16.jar
|-- scala-library-2.9.0-1.jar
|-- slf4j-api-1.6.1.jar
`-- slf4j-log4j12-1.6.1.jar
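
Since the script simply requires whatever `lib/jars/*.jar` matches, a missing jar only shows up later as a Java class-loading error. A minimal sketch of a pre-flight check in plain Ruby follows. The helper name `missing_jars`, the constant `REQUIRED_JAR_PREFIXES`, and the prefix-matching rule are assumptions made for illustration and are not part of the original gist; the prefixes are a subset taken from the listing above.

```ruby
# Hypothetical helper (not part of the original gist): report which
# expected jars are absent from a directory, matching on name prefix
# so the exact version number does not matter.
REQUIRED_JAR_PREFIXES = %w[goose jsoup scala-library httpclient httpcore].freeze

def missing_jars(dir, prefixes = REQUIRED_JAR_PREFIXES)
  # Collect the file names of all jars found directly under dir.
  present = Dir[File.join(dir, "*.jar")].map { |path| File.basename(path) }
  # Keep only the prefixes for which no jar named "<prefix>-..." exists.
  prefixes.reject do |prefix|
    present.any? { |name| name.start_with?("#{prefix}-") }
  end
end

missing = missing_jars("lib/jars")
warn "missing jars: #{missing.join(', ')}" unless missing.empty?
```

Running this before the `require` loop gives a readable warning up front; the list of prefixes would need to be extended if the remaining jars in the listing should also be checked.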