Web Scraping - The Amazon Example Finale

Jan 15, 2009

So yesterday I showed you a fairly contrived example of how to build a web scraper using scRUBYt! to get data from Amazon (they've got an API that would be much easier and more robust if you need access to this info, but that's not the point at the moment). But if you look at the results, they're not the greatest. There is too much noise in some fields, and we probably want to share that data with another system, so a Ruby Hash object isn't going to work.

Removing empty results with scRUBYt!

For various reasons, sometimes you may not get all the data you want back for every record. It's usually down to your result definition being too restrictive, or a change in format on a specific page. Maybe the price information is in a different DIV if the item is on sale. In any event, you need to decide what to do about it. In the new release of scRUBYt! there are three immediate options that come to mind, though we may well build more in if required. The first, which is what we did in yesterday's example, is to do nothing. You'll get the nil/empty result returned back to you to handle as you see fit. The second is to simply drop any fields that are nil (I've only displayed the first few results):

@extractor = Scrubyt::Extractor.new do
  fetch "http://www.amazon.com/"
  select_option "url", "Books"
  fill_textfield "field-keywords", "ruby"
  submit
  book_detail "//td[@class='dataColumn']/table/tr/td/a" do
    title "//h1[@class='parseasinTitle']"
    price "//b[@class='priceLarge']", :remove_blank => true
    saving "//td[@class='price']"
    isbn "//li[text()*='ISBN-10:']"
  end
end

puts @extractor.results.inspect
=> [{:book=>[{:title=>"The Ruby Programming Language [ILLUSTRATED]  (Paperback)"}, 
             {:price=>"$26.39"}, 
             {:saving=>"$13.60\n      (34%)\n    "}, 
             {:isbn=>"ISBN-10: 0596516177"}]}
    {:book=>[{:title=>"The Ruby Programming Language (Paperback)"}, 
             {:saving=>nil}, 
             {:isbn=>"ISBN-10: 020171096X"}]}, 
    {:book=>[{:title=>"Beginning Ruby: From Novice to Professional (Beginning from Novice to Professional) (Paperback)"}, 
             {:price=>"$26.39"}, 
             {:saving=>"$13.60\n      (34%)\n    "}, 
             {:isbn=>"ISBN-10: 1590597664"}]}
    ...]

We've set :remove_blank to true on the price field, and as a result you'll see that the second result contains no price element. Alternatively, you could drop any detail block (in this example, any single book) that is missing a required field:

@extractor = Scrubyt::Extractor.new do
  fetch "http://www.amazon.com/"
  select_option "url", "Books"
  fill_textfield "field-keywords", "ruby"
  submit
  book_detail "//td[@class='dataColumn']/table/tr/td/a" do
    title "//h1[@class='parseasinTitle']", :required => true
    price "//b[@class='priceLarge']", :required => true
    saving "//td[@class='price']"
    isbn "//li[text()*='ISBN-10:']"
  end
end

puts @extractor.results.inspect

=> [{:book=>[{:title=>"The Ruby Programming Language [ILLUSTRATED]  (Paperback)"},
             {:price=>"$26.39"},
             {:saving=>"$13.60\n      (34%)\n    "},
             {:isbn=>"ISBN-10: 0596516177"}]},
    {:book=>[{:title=>"Beginning Ruby: From Novice to Professional (Beginning from Novice to Professional) (Paperback)"},
             {:price=>"$26.39"},
             {:saving=>"$13.60\n      (34%)\n    "},
             {:isbn=>"ISBN-10: 1590597664"}]},
    {:book=>[{:title=>"Programming Ruby: The Pragmatic Programmers' Guide, Second Edition [ILLUSTRATED]  (Paperback)"},
             {:price=>"$29.67"},
             {:saving=>"$15.28\n      (34%)\n    "},
             {:isbn=>"ISBN-10: 0974514055"}]},
    ...]

This time "The Ruby Programming Language (Paperback)" isn't included in the results at all. For our purposes, though, I've decided that I only want to know about books that I have all the details for. Instead of setting :required on every field, I can specify it on the book_detail definition, which gives the same output:

@extractor = Scrubyt::Extractor.new do
  fetch "http://www.amazon.com/"
  select_option "url", "Books"
  fill_textfield "field-keywords", "ruby"
  submit
  book_detail "//td[@class='dataColumn']/table/tr/td/a", :required => :all do
    title "//h1[@class='parseasinTitle']"
    price "//b[@class='priceLarge']"
    saving "//td[@class='price']"
    isbn "//li[text()*='ISBN-10:']"
  end
end
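
As an aside, it's worth being clear on the shape of what you get back: an array of records, where each :book holds an array of single-key hashes. If you'd rather work with one flat Hash per book, a little sketch like this will do it (the books variable and the output format here are mine, not part of scRUBYt!):

books = @extractor.results.map do |record|
  # Merge the single-key hashes ({:title => ...}, {:price => ...}, etc.)
  # into one flat Hash per book
  record[:book].inject({}) { |book, field| book.merge(field) }
end

books.each do |book|
  puts "#{book[:isbn]} - #{book[:title]} at #{book[:price]}"
end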

Sanitizing scRUBYt! output

The output still isn't quite what we want. We've dumped the empty results, but we've still got that ugly "ISBN-10: " in front of the ISBN. We could clean it up later, but that's creating additional work for ourselves. And if we want this thing to scale (I've got scrapers which scrape thousands of pages off a single site), trying to keep all that data hanging around in memory isn't going to work. So let's do as much as possible within the scRUBYt! definition as we're collecting the data:

@extractor = Scrubyt::Extractor.new do
  fetch "http://www.amazon.com/"
  select_option "url", "Books"
  fill_textfield "field-keywords", "ruby"
  submit
  book_detail "//td[@class='dataColumn']/table/tr/td/a", :required => :all do
    title "//h1[@class='parseasinTitle']"
    price "//b[@class='priceLarge']"
    saving "//td[@class='price']"
    isbn "//li[text()*='ISBN-10:']", :script => Proc.new{|isbn| isbn.gsub("ISBN-10: ", "")}
  end
end
puts @extractor.results.inspect
=> [{:book=>[{:title=>"The Ruby Programming Language [ILLUSTRATED]  (Paperback)"},
             {:price=>"$26.39"},
             {:saving=>"$13.60\n      (34%)\n    "},
             {:isbn=>"0596516177"}]},
    ... ]

And now you'll see that we're getting a much cleaner ISBN result. Just create a Proc; the result will be passed into it, and you can then do as you see fit. An if statement to check it contains something you expect, a regexp, the possibilities are endless. You can also combine this with the other options like :required and :remove_blank. So let's really jazz this thing up. We'll clean up the saving and pull in the description, and that's all the data we need:

@extractor = Scrubyt::Extractor.new do
  fetch "http://www.amazon.com/"
  select_option "url", "Books"
  fill_textfield "field-keywords", "ruby"
  submit
  book_detail "//td[@class='dataColumn']/table/tr/td/a", :required => :all  do
    title "//h1[@class='parseasinTitle']"
    price "//b[@class='priceLarge']"
    saving "//td[@class='price']", :script => Proc.new{|saving| saving.match(/(\$[\d\.]*)/)[1]}
    isbn "//li[text()*='ISBN-10:']", :script => Proc.new{|isbn| isbn.gsub("ISBN-10: ","")}
    description "//div[@id='productDescription']//div[@class='content']"
  end
end
puts @extractor.results.inspect

=> [{:book=>[{:title=>"Beginning Ruby: From Novice to Professional (Beginning from Novice to Professional) (Paperback)"},
             {:price=>"$26.39"},
             {:saving=>"$13.60"},
             {:isbn=>"1590597664"}, 
             {:description=>"Product Description\n  Ruby is perhaps best known as the engine powering the..."}]}, 
    {:book=>[{:title=>"Programming Ruby: The Pragmatic Programmers' Guide, Second Edition [ILLUSTRATED]  (Paperback)"}, 
             {:price=>"$29.67"}, 
             {:saving=>"$15.28"}, 
             {:isbn=>"0974514055"}, 
             {:description=>"Product Description\n  Ruby is an increasingly popular, fully object-oriented dynamic..."}]},
    ...]
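
One more trick before we move on: a :script Proc doesn't have to just return a cleaned-up string. Here's a rough sketch of one that validates as it cleans; the regexp and the nil-on-failure behaviour are my own choices, but returning nil means it plays nicely with :remove_blank and :required as described above:

# A hypothetical validating Proc: strip the label, then only keep the
# ISBN if it looks like ten digits (the last of which may be an X);
# otherwise return nil so :remove_blank or :required can deal with it
isbn_check = Proc.new do |isbn|
  cleaned = isbn.gsub("ISBN-10: ", "")
  cleaned =~ /\A\d{9}[\dX]\z/ ? cleaned : nil
end

# Then inside the book_detail block:
isbn "//li[text()*='ISBN-10:']", :script => isbn_check, :remove_blank => true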

Outputting results to XML

Inevitably there comes a time when you want to consume this data in something other than your Ruby application. At that point, passing around a Hash is probably not the best idea. Alternatively, you might have a scraper that has to scrape hundreds or thousands of pages. Storing all the results in a Hash as you go will bring your machine to its knees. So here comes one of the largest changes to the way the new release of scRUBYt! works.

Previously, you always had results returned as a Hash and/or XML, depending on your need. Everything was held in memory until you destroyed your extractor. Now, the standard XML option is to stream the results out to a file as they are processed and remove them from memory. There is no way to retrieve the results as XML within your program; they have to be streamed out to a file (and really, why would you want XML within your app when you can have native Ruby structures instead?). So to save the output of our scraper above to an XML file, you just pass a new output format and an instance of a File to the extractor:

@file = File.new("amazon_results.xml", "w")
@extractor = Scrubyt::Extractor.new :output => :xml_file, :file => @file do
  fetch "http://www.amazon.com/"
  select_option "url", "Books"
  fill_textfield "field-keywords", "ruby"
  submit
  book_detail "//td[@class='dataColumn']/table/tr/td/a", :required => :all  do
    title "//h1[@class='parseasinTitle']"
    price "//b[@class='priceLarge']"
    saving "//td[@class='price']", :script => Proc.new{|saving| saving.match(/(\$[\d\.]*)/)[1]}
    isbn "//li[text()*='ISBN-10:']", :script => Proc.new{|isbn| isbn.gsub("ISBN-10: ","")}
    description "//div[@id='productDescription']//div[@class='content']"
  end
end
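
Once the scrape finishes, the data lives in amazon_results.xml rather than in memory. Reading it back in from another script is straightforward with REXML; note that the element layout below is an assumption on my part (one <book> element per record, with a child element per field), so check your actual file first:

require 'rexml/document'

# Assumed layout: one <book> element per record, one child element per
# field -- verify against the real output file before relying on this
doc = REXML::Document.new(File.read("amazon_results.xml"))
doc.elements.each("//book") do |book|
  isbn  = book.elements["isbn"]
  title = book.elements["title"]
  puts "#{isbn.text} - #{title.text}" if isbn && title
end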

And remember, a call to @extractor.results at the end of the scrape will return no results; everything has already been streamed out to the file. I hope that wasn't too much, and that it's given you a good view into how to create your very own web scraper. If you have any questions, head on over to the scRUBYt! forums or post them in the comments.
