Created September 27, 2011 15:50
Example of using Goose from JRuby
require 'rubygems'
require 'java'
require 'chronic'
libs = []
libs << "lib/jars/*.jar"
libs.each do |lib|
  Dir[lib].each do |jar|
    puts "loading #{jar}"
    require jar
  end
end
module Maverick
  include_package "com.gravity.goose"
end
module MaverickExtractors
  include_package "com.gravity.goose.extractors"
end
class MyRubyDateExtractor < MaverickExtractors::PublishDateExtractor
  def extract(rawdoc)
    pub_date = rawdoc.select("div[class=submitted]").text
    Chronic.parse(pub_date).to_java
  end
end
@config = Maverick::Configuration.new
@config.local_storage_path = "./tmp"
@config.enable_image_fetching = false
@config.publish_date_extractor = MyRubyDateExtractor.new
url = "http://www.hollyscoop.com/paris-hilton/britney-shows-us-her-assets.html"
@goose = Maverick::Goose.new(@config)
@article = @goose.extract_content(url)
disp = <<EOT
Article title: #{@article.title}
Article pubdate: #{@article.publish_date}
Article tags: #{@article.meta_keywords}
Article:
------------------------------------------
#{@article.cleaned_article_text}
EOT
puts disp
lib/jars/
|-- akka-actor-1.1.3.jar
|-- akka-typed-actor-1.1.3.jar
|-- commons-codec-1.4.jar
|-- commons-io-2.0.1.jar
|-- commons-lang-2.6.jar
|-- commons-logging-1.1.1.jar
|-- goose-2.1.0.jar
|-- httpclient-4.1.2.jar
|-- httpcore-4.1.2.jar
|-- jsoup-1.5.2.jar
|-- log4j-1.2.16.jar
|-- scala-library-2.9.0-1.jar
|-- slf4j-api-1.6.1.jar
`-- slf4j-log4j12-1.6.1.jar
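The script requires whatever it finds under lib/jars, so a jar missing from the listing above would probably only show up later as a Java class-loading error. A minimal sketch of a pre-flight check follows; missing_jars is a hypothetical helper, not part of Goose or JRuby, and it matches names by prefix so that a different version of a jar still counts as present.

```ruby
# Hypothetical helper: report which expected jars are absent from dir.
# Each expected name is matched as a "<name>-" prefix, so the version
# number in the file name does not matter.
def missing_jars(dir, expected)
  present = Dir[File.join(dir, "*.jar")].map { |jar| File.basename(jar) }
  expected.reject do |name|
    present.any? { |jar| jar.start_with?("#{name}-") }
  end
end
```

Calling missing_jars("lib/jars", %w[goose jsoup scala-library]) before the require loop returns an empty array when all three are present, and the names of any that are not.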
How did you build the goose-2.1.0.jar?
I'm assuming you have Maven installed here.
Check out the Goose source tree and run mvn clean package. This will leave two jar files in the target directory: one is the sources, the other is the jar you want.
Since I originally wrote this, it's possible some of the dependencies have changed. I just did a quick check and the versions look the same. You can run mvn dependency:tree | grep compile to see what jars you'll need. If you ran the build, they'll all have been downloaded.
The best way to just grab them all is to run mvn dependency:copy-dependencies. This will shove them all in target/dependency for you.
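Once both Maven commands have run, the jars still need to land in lib/jars for the script to find them. A rough sketch of that copy step in Ruby follows. The helper collect_jars and the checkout path passed to it are hypothetical, and the "-sources.jar" suffix is assumed from Maven's usual naming rather than checked against the Goose build.

```ruby
require 'fileutils'

# Hypothetical helper: copy the built Goose jar and its dependencies
# into dest_dir. goose_dir is assumed to be a local checkout where
# `mvn clean package` and `mvn dependency:copy-dependencies` already ran.
def collect_jars(goose_dir, dest_dir = "lib/jars")
  FileUtils.mkdir_p(dest_dir)
  jars = Dir[File.join(goose_dir, "target", "*.jar")] +
         Dir[File.join(goose_dir, "target", "dependency", "*.jar")]
  # Skip the sources jar (assumes the conventional "-sources.jar" suffix).
  jars.reject! { |jar| jar.end_with?("-sources.jar") }
  jars.each { |jar| FileUtils.cp(jar, dest_dir) }
  # Return the copied file names so the caller can inspect them.
  jars.map { |jar| File.basename(jar) }.sort
end
```

For example, collect_jars("../goose") would fill lib/jars from a checkout sitting next to the script. This copies everything in target/dependency, which may include jars beyond the compile-scope ones that mvn dependency:tree | grep compile lists.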
@lusis Thank you very much for your detailed answer. Everything works as expected. Very useful gist!
So glad you got this hooked up @lusis!