Web Spider Creation with scRUBYt! - Part II

Dec 17, 2008

Continuing on from the previous post, Web Spidering and Data Extraction with scRUBYt!, this article will take you a little deeper with the scRUBYt! scraping framework: deeper in your understanding of how to use it, and deeper into the pages your crawl follows.

A quick recap on the last web spider

Last week we got as far as going off to Google, searching for the word "ruby", and then listing the link text for each of the results. Here's the code we ended up with to get that far:

@extractor = Skimr::Extractor.new(:agent => :standard) do
  fetch "http://www.google.com/ncr"
  fill_textfield "q", "ruby"
  submit
  page_title "//h3[@class='r']"
end

How to scrape deeper pages

That was a fairly contrived example, though, and it's not exactly going to save you a huge amount of time over a quick manual copy-and-paste from the results page. What if you didn't just want a list of the links, but some kind of summary or additional detail for each of them? Let's actually go to each website, see what kind of content it has, and grab something useful from it:

@extractor = Skimr::Extractor.new(:agent => :standard) do
  fetch "http://www.google.com/ncr"
  fill_textfield "q", "ruby"
  submit
  page_detail "//h3[@class='r']/a" do
    page_body "//body"
  end
end

If you give a result name ending in _detail an XPath that points at a link element, and then pass in a block, scRUBYt! will follow the link before processing the block. That means we can use this technique to follow each of the results Google gives us, and on each page return all of the text contained between the <body> tags.

Now, you could take the hash that comes back and do some post-processing on it to pull something meaningful out of the extracted text (there's a rough sketch of that approach below). But for the sake of this example, I'm going to make a big assumption: that every page is going to have at least a <title> tag, and a <p> tag with more than just a few words in it.
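
Here's a rough sketch of what that post-processing could look like. The shape of the results array and the keys used below are assumptions made for the sake of illustration (based on the result names we defined), not something the framework guarantees, and how you actually get the results out will depend on your scRUBYt! version.

# Illustrative only: assume we've pulled the results out into an array of
# hashes, one per followed link, keyed by the result names defined above.
# (Exactly how you get at the results depends on your scRUBYt! version.)
results = [
  { :page_detail => { :page_body => "Ruby is a dynamic, open source programming language with a focus on simplicity and productivity." } },
  { :page_detail => { :page_body => "404 - page not found" } }
]

# Boil each body down to a crude summary: the first 25 words, or drop the
# entry entirely if the page didn't have that much text on it.
summaries = results.map do |result|
  words = result[:page_detail][:page_body].to_s.split(/\s+/)
  words.size >= 25 ? words.first(25).join(" ") : nil
end.compact

It's usually tidier to push that kind of filtering into the extractor itself, though, so here's the version with those assumptions baked in: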

@extractor = Skimr::Extractor.new(:agent => :standard) do
  fetch "http://www.google.com/ncr"
  fill_textfield "q", "ruby"
  submit
  page_detail "//h3[@class='r']/a" do
    title "//title"
    summary "//p", :script => Proc.new{|result| result if result.match(%r{(\w+\W+){25}})}
  end
end

As you may have noticed, you can pass a proc in as a parameter to your result definition. The output of the XPath match is passed in to the proc, and whatever the proc returns becomes the final result for that definition. Make sense? If not: what I've done above is look for all the <p> tags on the page and pass each one in to my proc. The proc then runs a regexp against it to check that the <p> contains at least 25 words; if it does, the <p> content is returned, otherwise nil.
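
If that regexp looks cryptic, it's really just a crude word counter: %r{(\w+\W+){25}} only matches once there are at least 25 runs of word characters, each followed by something that isn't a word character, which works out to roughly 25 words. You can convince yourself of that outside the extractor with a couple of throwaway strings:

check = Proc.new { |result| result if result.match(%r{(\w+\W+){25}}) }

short_text = "Ruby is a dynamic, open source programming language."
long_text  = (["Ruby is a dynamic open source programming language with a focus on simplicity"] * 4).join(" ")

check.call(short_text)  # => nil, only 8 words
check.call(long_text)   # => the full string, it clears the 25 word threshold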

What if the results are paginated?

You could potentially create a highly recursive extractor to handle this, but it's such a common case that we've included a method to do it for you:

@extractor = Skimr::Extractor.new(:agent => :standard) do
  fetch "http://www.google.com/ncr"
  fill_textfield "q", "ruby"
  submit
  page_detail "//h3[@class='r']/a" do
    title "//title"
    summary "//p", :script => Proc.new{|result| result if result.match(%r{(\w+\W+){25}})}
  end
  next_page "//a[text()*='Next']", :limit => 2
end

I've used the XPath text() function here to highlight its usefulness. I use it quite a lot as a shortcut to get things working while testing, and it's been a lifesaver in scenarios where the markup is inconsistent or I want to keep the scraper definition generic. Thankfully, it works for this scenario too. Be wary of using it as-is in production, though, as it can have unexpected side effects: if one of the results that came back happened to have the word "Next" in its title, scRUBYt! would diligently follow that link and you'd end up on the wrong page.
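
If that worries you, one option is to be stricter about the match. What the underlying parser actually supports varies between versions, so treat this as a sketch to verify rather than a drop-in replacement, but matching the link text exactly rather than as a substring already makes a collision with a result title far less likely:

# Exact match on the link text rather than a substring match, so a result
# whose title merely contains "Next" no longer qualifies.
# NOTE: exact text() matching like this is an assumption about the parser
# scRUBYt! is using; check it against your setup before relying on it.
next_page "//a[text()='Next']", :limit => 2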

What's next?

In the next installment I'll briefly cover how to handle logging of the scrape to help you diagnose any problems, and how to handle more complex form completion.
