More web scrapers with the upcoming scRUBYt!

Jan 13, 2009

In the previous articles I gave a brief glimpse of the upcoming scRUBYt! release. We learned how to do some basic HTML scraping, and then followed it up by scraping multiple pages. As promised, in this installment I'll go over how to get more detailed logging output to help you debug your scraper during development, and how to fill in and submit forms.

Logging Scraper Output

There's been quite a fundamental change to the way scRUBYt! works internally of late. It doesn't change the way you interface with it, but it does mean logging the output is now much easier and cleaner. Using our most basic example from the first tutorial, you just need to pass the :log_level option into the extractor:

@extractor = Scrubyt::Extractor.new(:log_level => :verbose) do
  fetch "http://www.google.com/ncr"
end

By default, the output is directed to stdout so you'd see the following on your screen:

start
fetch: http://www.google.com/ncr
end

If you're after a more complex example, here is the scraper definition from the second part of the tutorial series and the corresponding log output:

@extractor = Scrubyt::Extractor.new(:log_level => :verbose) do
  fetch "http://www.google.com/ncr"
  fill_textfield "q", "ruby"
  submit
  page_detail "//h3[@class='r']/a" do
    title "//title"
    summary "//p", :script => Proc.new{|result| result if result.match(%r{(\w+\W+){25}})}
  end
  next_page "//a[text()*='Next']", :limit => 2
end

start
fetch: http://www.google.com/ncr
textfield: 'q' = 'ruby'
with options ''
submit
next detail: 'page' = 'http://ruby-lang.org/'
with args: ''
next detail: 'page' = 'http://en.wikipedia.org/wiki/Ruby_(programming_language)'
with args: ''

etc...

next page: /search?hl=en&ie=UTF-8&q=ruby&start=10&sa=N
fetch: http://www.google.com/search?hl=en&ie=UTF-8&q=ruby&start=10&sa=N
next detail: 'page' = 'http://www.rubycentral.com/book/'
with args: ''

etc...

end

At present the valid :log_level values are :none, :critical, :error, :warn, :info, :debug, and :verbose (in increasing order of noise). If you want to direct the log output to something other than stdout, the only way at the moment is to override the Scrubyt::Logger#log method. I'm looking at ways to make it easier to substitute in a file-based or other logging approach.
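To make the "increasing order of noise" idea concrete, here is a minimal sketch of how a level threshold and a swappable output target could fit together. This is not scRUBYt!'s own code: the SketchLogger class, its log(level, message) signature, and the sample messages are all assumptions made for illustration. Only the list of level names comes from the article.

```ruby
require "stringio"

# Levels as listed in the article, in increasing order of noise.
LEVELS = [:none, :critical, :error, :warn, :info, :debug, :verbose]

# Hypothetical stand-in logger (NOT part of scRUBYt!): it writes a message
# only when the message's level is no noisier than the configured threshold,
# and it writes to whatever IO-like object it was given.
class SketchLogger
  def initialize(threshold, sink = $stdout)
    unless LEVELS.include?(threshold)
      raise ArgumentError, "unknown level: #{threshold.inspect}"
    end
    @threshold = threshold
    @sink = sink
  end

  def log(level, message)
    return unless loggable?(level)
    @sink.puts(message)
  end

  def loggable?(level)
    rank = LEVELS.index(level)
    # :none only makes sense as a threshold, never as a message level.
    !rank.nil? && rank > 0 && rank <= LEVELS.index(@threshold)
  end
end

# Capture output in memory instead of printing it to stdout.
buffer = StringIO.new
logger = SketchLogger.new(:info, buffer)
logger.log(:error, "fetch failed")                          # kept: :error is quieter than :info
logger.log(:verbose, "fetch: http://www.google.com/ncr")    # dropped: too noisy for :info
puts buffer.string
```

Because the sink only needs to respond to puts, passing an open File in place of the StringIO would give file-based logging in this sketch.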

Making Your Scraper Navigate Forms

So now that we know how to log the output, let's do something more useful with our scraper. We can't really take our Google example from previous posts any further given how simple their interface is, so let us move over to Amazon. Say I wanted to grab a list of the books on Ruby that are for sale. Sure, I could probably get this information via an Amazon API, but that's not really the point now, is it? ;)

@extractor = Scrubyt::Extractor.new(:log_level => :debug) do
  fetch "http://www.amazon.com/"
  select_option "url", "Books"
  fill_textfield "field-keywords", "ruby"
  submit
  book_detail "//td[@class='dataColumn']/table/tr/td/a" do
    title "//h1[@class='parseasinTitle']"
    price "//b[@class='priceLarge']"
    saving "//td[@class='price']"
    isbn "//li[text()*='ISBN-10:']"
  end
end

So let's run through what we've got here; hopefully some of it looks familiar from the previous examples we've gone through. First we fetch the page to start with, then in the select field named "url" we choose the option that says "Books", fill the text field named "field-keywords" with "ruby", and then submit the form. scRUBYt! will keep track of the last form you input any data to, so if there are multiple forms on the page you just need to target the appropriate input fields. From there, the submit action will work out what it needs to do.

Next we define a detail block: we give the XPath to the heading/link for each book on the page and say we want to navigate to that page and extract the title, price, saving, etc. I've been a little cheeky with the isbn definition, just asking for any LI tag that contains the string "ISBN-10:". If we were to look at the results generated, we'd see:

puts @extractor.results.inspect
[{:book=>[{:title=>"The Ruby Programming Language [ILLUSTRATED]  (Paperback)"}, {:price=>"$26.39"}, 
          {:saving=>"$13.60\n      (34%)\n    "}, 
          {:isbn=>"ISBN-10: 0596516177"}]}, 
 {:book=>[{:title=>"The Ruby Programming Language (Paperback)"}, 
          {:price=>nil}, 
          {:saving=>nil}, 
          {:isbn=>"ISBN-10: 020171096X"}]}, 
 {:book=>[{:title=>"Beginning Ruby: From Novice to Professional (Beginning from Novice to Professional) (Paperback)"}, 
          {:price=>"$26.39"}, 
          {:saving=>"$13.60\n      (34%)\n    "}, 
          {:isbn=>"ISBN-10: 1590597664"}]}, 
 {:book=>[{:title=>"Beginning Ruby: From Novice to Professional (Kindle Edition)"}, 
          {:price=>nil}, 
          {:saving=>nil}, 
          {:isbn=>nil}]}, 
 {:book=>[{:title=>"Programming Ruby: The Pragmatic Programmers' Guide, Second Edition [ILLUSTRATED]  (Paperback)"}, 
          {:price=>"$29.67"}, 
          {:saving=>"$15.28\n      (34%)\n    "}, 
          {:isbn=>"ISBN-10: 0974514055"}]}, 
 {:book=>[{:title=>"The Ruby Way, Second Edition: Solutions and Techniques in Ruby Programming (2nd Edition) (Addison-Wesley Professional Ruby Series) (Paperback)"}, 
          {:price=>"$29.69"}, 
          {:saving=>"$15.30\n      (34%)\n    "}, 
          {:isbn=>"ISBN-10: 0672328844"}]}, 
 {:book=>[{:title=>"Ruby Way, The: Solutions and Techniques in Ruby Programming (Kindle Edition)"}, 
          {:price=>nil}, 
          {:saving=>nil}, 
          {:isbn=>nil}]}, 
 {:book=>[{:title=>"Learning Ruby [ILLUSTRATED]  (Paperback)"}, 
          {:price=>"$23.09"}, 
          {:saving=>"$11.90\n      (34%)\n    "}, 
          {:isbn=>"ISBN-10: 0596529864"}]}
]

And we have a reasonable snapshot of the data. You'll see, though, that it's not perfect. Firstly, we are missing information for some results. Secondly, we've got stray spaces and newlines in the saving data, and we probably don't need the "ISBN-10:" string at the front of the ISBN result. And what if we wanted to link to the actual result so someone could actually buy the book on Amazon?
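Two of those complaints are purely cosmetic, and could be tidied up after the fact in plain Ruby. Here is a minimal sketch that works on a couple of records copied from the results above. It deliberately uses nothing from scRUBYt! itself, and the helper names (squish, bare_isbn, clean_book) are made up for this example. It does nothing about the missing values or the links.

```ruby
# Two records copied from the results above (one complete, one with gaps).
results = [
  {:book => [{:title => "Learning Ruby [ILLUSTRATED]  (Paperback)"},
             {:price => "$23.09"},
             {:saving => "$11.90\n      (34%)\n    "},
             {:isbn => "ISBN-10: 0596529864"}]},
  {:book => [{:title => "Beginning Ruby: From Novice to Professional (Kindle Edition)"},
             {:price => nil},
             {:saving => nil},
             {:isbn => nil}]}
]

# Collapse runs of whitespace (including newlines) into single spaces.
# nil is passed through untouched.
def squish(value)
  value && value.gsub(/\s+/, " ").strip
end

# Drop the leading "ISBN-10:" label. nil is passed through untouched.
def bare_isbn(value)
  value && value.sub(/\AISBN-10:\s*/, "")
end

# Merge the list of single-key hashes into one hash, then tidy each field.
def clean_book(fields)
  flat = fields.inject({}) { |memo, field| memo.merge(field) }
  {
    :title  => squish(flat[:title]),
    :price  => flat[:price],
    :saving => squish(flat[:saving]),
    :isbn   => bare_isbn(flat[:isbn])
  }
end

cleaned = results.map { |result| clean_book(result[:book]) }
cleaned.each { |book| puts book.inspect }
```

Flattening each record into a single hash also makes the fields easier to look up by name than the list of one-key hashes in the raw results.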

All good questions, and all easily solvable. I'll follow this up with a post in the next day or so and highlight some of the new ways of specifying constraints on your data in scRUBYt!. And special thanks have to go to Homeflow, who have been funding at least a day of time each week for Peter and me lately, hence the increased level of development in scRUBYt!. It's nice when you have clients who want to actively give back.

Hi, I'm Glenn! 👋 I've spent most of my career working with or at startups. I'm currently the Director of Product @ Ockam where I'm helping developers build applications and systems that are secure-by-design. It's time we started securely connecting apps, not networks.

Previously I led the Terraform product team @ HashiCorp, where we launched Terraform Cloud and set the stage for a successful IPO. Prior to that I was part of the Startup Team @ AWS, and earlier still an early employee @ Heroku. I've also invested in a couple of dozen early stage startups.