Web Spidering and Data Extraction with scRUBYt!

Dec 09, 2008

Some of you may be aware that I work (time permitting) on the scRUBYt! project with Peter Szinek. Hopefully some of you have found an excuse to use the tool; I know there are a few hundred satisfied users out there. Peter has been furiously working away at polishing up the latest release, and we've also gone back, refactored a lot of the internals, and improved the test coverage of the library. Given the gnarly levels of recursion in it, at times it was proving difficult to add the new features we wanted.

The skimr branch is our first attempt at that refactoring, but to get there we've sacrificed quite a lot: plenty of functionality is currently missing, and the syntax has changed slightly. I wouldn't yet consider this a release candidate, but it has been used successfully in production for a few months now, so I think it's worth a look.

What's new?

Well, apart from being a lot less code, it's significantly faster and requires much less RAM on larger web scrapes. This is partly because you can now stream your results out to a file, rather than holding your entire dataset in memory before dumping it. If you don't stream your results out to a file, the default is to return them as a Hash, which makes it much easier to develop custom output handlers or integrate the results into your existing Ruby code.
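As a quick illustration, here's a minimal sketch of what a custom output handler might look like. It assumes @extractor is the extractor we build below, that its results come back as a collection of Hashes (the default), and it uses the :page_title key from the example later in this post; the file name is just a placeholder:

require 'csv'

# A sketch only: write each extracted row out to a CSV file.
# Assumes @extractor.results responds to standard Enumerable iteration
# and that each row is a Hash keyed by the pattern name (:page_title).
CSV.open("results.csv", "w") do |csv|
  @extractor.results.each do |row|
    csv << [row[:page_title]]
  end
end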

Creating your own web crawler

To begin with, we'll use the tried-and-true Google/Ruby example that has served us so well in previous scRUBYt! releases. We start by defining a new Skimr extractor:

@extractor = Skimr::Extractor.new(:agent => :standard) do
end

You'll notice we pass in an agent type here. It's an optional parameter; if you leave it out, it defaults to :standard, which uses a combination of Mechanize and Hpricot to fetch and parse your results. Other agents will become available in future releases, so you'll be able to scrape AJAX-heavy sites again. Next, we tell it which page we want to start at:

@extractor = Skimr::Extractor.new(:agent => :standard) do
  fetch "http://www.google.com/ncr"
end

Now we head over to Google (the ncr bit tells Google not to redirect us to a country-specific site), because we're about to start a search:

@extractor = Skimr::Extractor.new(:agent => :standard) do
  fetch "http://www.google.com/ncr"
  fill_textfield "q", "ruby"
  submit
end

Hopefully the above is fairly obvious: we've entered the term "ruby" into the field named "q" and hit submit. Follow along in your browser so you can see what we're working with.

Extracting data from the website

Okay, so we've got the navigation part covered. Now we want to pull out a list of all the results, and it's quite simple:

@extractor = Skimr::Extractor.new(:agent => :standard) do
  fetch "http://www.google.com/ncr"
  fill_textfield "q", "ruby"
  submit
  page_title "//h3[@class='r']"
end

Just provide an XPath to the element on the page that you want, and scRUBYt! will extract every element that matches it. As each result is now available as a Hash, we can simply do the following:

>> @extractor.results.first
=> {:page_title=>"Ruby Programming Language"}

>> @extractor.results.last
=> {:page_title=>"Welcome! [Ruby-Doc.org: Documenting the Ruby Language]"}
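And because the full result set behaves like a collection of those Hashes (as the first and last calls above suggest), walking over every result is straightforward. A quick sketch, assuming results supports standard Enumerable iteration:

@extractor.results.each do |result|
  # Each result is a Hash keyed by the pattern name we defined
  puts result[:page_title]
end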

I'll follow this up in a couple of days with examples of scraping deeper pages and merging multiple result sets, a look at some of the new features we've included, and a preview of the future enhancements that are coming soon.
