Glenn Gillen

scRUBYt! Gets Plugins!

Yes! You heard right! As you may have gathered, it's been a rather frantic month of development for scRUBYt!, and this is currently the addition I'm most proud of. One of the most common requests used to be along the lines of "when do you plan to support xxx format output?". Now, scRUBYt! is oblivious to output formats. That's right, it natively supports nothing, nada, zilch. But to make it useful, we've written a Hash output plugin that ships with it.

How to use a scRUBYt! output plugin

Firstly, you need to make sure you have the plugin you require installed. At the time of writing there are only the two I've written, Hash and XmlFile. Then require the plugin in your Ruby file. As the current edge release isn't yet packaged as a gem, you'll need to test this with the GitHub checkout and reference the output plugin explicitly:

require "plugins/scrubyt_xml_file_output/scrubyt_xml_file_output"

If you've been following the tutorials talking about web scraping with the new version for the past few weeks you'll have seen how to direct output to a plugin. To request Hash output it's:

@extractor = Scrubyt::Extractor.new :output => :hash do
    fetch "http://www.google.com/search?&q=ruby"
    result "//html/body/div[5]/div/div/h2/a"
end

and for XmlFile it is:

@file = File.open("results.xml", "w")
@extractor = Scrubyt::Extractor.new :output => :xml_file, :file => @file do
    fetch "http://www.google.com/search?&q=ruby"
    result "//html/body/div[5]/div/div/h2/a"
end

The XmlFile output takes an additional parameter which is the file to stream the results out to.

Creating your own plugin

That's great for those of you that are happy with XML or Hash output, but what about if you want some other custom format? Well it's time to create your own. I'll show you the actual code that implements the XmlFile output to show you how simple it is:

require 'rexml/document'
require "#{File.dirname(__FILE__)}/inflector"
require "#{File.dirname(__FILE__)}/inflections"

class Scrubyt::Output::XmlFile < Scrubyt::Output::Plugin  
  @subscribers = {}
  on_initialize :setup_file
  before_extractor :open_root_node
  after_extractor :close_root_node
  on_save_result :save_xml


  def setup_file(args = {})
    @file = args[:file]
  end

  def open_root_node(*args)
    @file.write("<root>")
  end

  def save_xml(name, results)
    if results.is_a?(::Hash)
      @file.write results.to_xml
    else
      results.each do |result|
        @file.write result.to_xml(name)
      end
    end
  end

  def close_root_node(*args)
    @file.write("</root>")
  end
end

The require lines at the top are only needed for this output format: REXML constructs the XML tags for me, and some inflections I've put together turn the Hash and Array objects into XML. Now on to analysing the class proper.

class Scrubyt::Output::XmlFile < Scrubyt::Output::Plugin  
  @subscribers = {}

At the moment, you'll need to initialize this instance variable to an empty Hash for the events to get attached correctly. I'm looking for a way to remove it, so stay tuned. But for now you'll need to put it in.

on_initialize :setup_file
before_extractor :open_root_node
after_extractor :close_root_node
on_save_result :save_xml

Here we've got four events to listen for. The concept should be familiar if you're coming from Rails. Essentially all we are doing is saying "When we initialize, run the setup_file method. Before the extractor actually starts, run the method called open_root_node. Whenever we get a result to save, call save_xml. And finally, after the extractor finishes, run the method called close_root_node."
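
To make the idea concrete, here is a minimal sketch of how class-level event hooks like these could be wired up. It is not scRUBYt!'s actual source: SketchPlugin, LoggingOutput, subscribers and notify are names invented for illustration, and the lazy `@subscribers ||= {}` is just one possible way to give each subclass its own event table.

```ruby
# Hypothetical stand-in for a plugin base class; NOT scRUBYt!'s real code.
class SketchPlugin
  EVENTS = [:on_initialize, :before_extractor, :after_extractor, :on_save_result]

  class << self
    # Class-level instance variables are not inherited, so each subclass
    # ends up with its own table of event => [method names].
    def subscribers
      @subscribers ||= {}
    end

    EVENTS.each do |event|
      # Defines class methods such as on_initialize(:setup_file).
      define_method(event) do |method_name|
        (subscribers[event] ||= []) << method_name
      end
    end
  end

  # Call every method registered for the event, in registration order.
  def notify(event, *args)
    (self.class.subscribers[event] || []).each do |method_name|
      send(method_name, *args)
    end
  end
end

# A made-up plugin that simply records which events it saw.
class LoggingOutput < SketchPlugin
  on_initialize    :setup_log
  before_extractor :note_start
  on_save_result   :note_result
  after_extractor  :note_finish

  attr_reader :log

  def setup_log(args = {})
    @log = []
  end

  def note_start(*args)
    @log << "start"
  end

  def note_result(name, results)
    @log << "#{name}: #{results.inspect}"
  end

  def note_finish(*args)
    @log << "finish"
  end
end

# Fire the events by hand, in the order an extractor run would.
plugin = LoggingOutput.new
plugin.notify(:on_initialize, {})
plugin.notify(:before_extractor)
plugin.notify(:on_save_result, "result", ["Ruby"])
plugin.notify(:after_extractor)
puts plugin.log
```

The point of the sketch is only the shape: the class body registers method names against events, and the instance later looks them up and calls them.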

def setup_file(args = {})
  @file = args[:file]
end

This is fairly straightforward. If you've got any custom logic that needs to happen when the output plugin is initialized you can place it in here. Any parameter that is passed in to Extractor.new() is passed through for you to access here.

def open_root_node(*args)
  @file.write("<root>")
end

Next, to start the XML file and keep it somewhat consistent with the old scRUBYt! XML output, we open a root node within the file.

def save_xml(name, results)
  if results.is_a?(::Hash)
    @file.write results.to_xml
  else
    results.each do |result|
      @file.write result.to_xml(name)
    end
  end
end

Here is where the majority of the magic happens. The save_xml method will be passed the desired name for the result, and a hash of the results. This is essentially the same format you'd get if you used the Hash output format, except that it arrives for each individual detail block rather than for the entire extractor.

The reason for the if/else scenario is for when results are not part of a detail block. If you're just returning results straight (like the Google example at the top of this post) then "results" in this context will be a list/Array of all the matching results rather than a Hash.
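
As a rough illustration of the two shapes, the sketch below writes both a flat Array and a Hash to an in-memory StringIO. The to_tag helper is a hypothetical stand-in for the to_xml inflections (which aren't shown in this post), save_sketch is an invented name, and the sample values are made up.

```ruby
require 'stringio'

# Hypothetical helper standing in for the to_xml inflections.
# Escaping of special XML characters is deliberately left out.
def to_tag(name, value)
  "<#{name}>#{value}</#{name}>"
end

# Mirrors the if/else in save_xml: a Hash is treated as a detail block,
# anything else as a flat list of matches that all share one name.
def save_sketch(io, name, results)
  if results.is_a?(Hash)
    results.each { |key, value| io.write to_tag(key, value) }
  else
    results.each { |result| io.write to_tag(name, result) }
  end
end

io = StringIO.new
io.write "<root>"
save_sketch(io, :result, ["first match", "second match"])            # flat list
save_sketch(io, :detail, { :title => "a title", :price => "9.99" })  # detail block
io.write "</root>"
puts io.string
```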

Passing results back to the extractor

Not everyone is going to want to stream results out to a file though, so to deal with this you can make a results method available on the instance of your plugin. As I said earlier, even Hash operates as a plugin now, so we can see an example of how this works in the Hash output plugin:

class Scrubyt::Output::Hash < Scrubyt::Output::Plugin
  @subscribers = {}
  on_initialize :setup_results
  on_save_result :store_hash

  def setup_results(args = {})
    @results = []
  end

  def results
    @results
  end

  def store_hash(name, passed_results)
    @results << passed_results
  end
end

Here we set up a @results instance variable in the on_initialize handler, and then the on_save_result handler simply pushes the passed_results into @results. Confused yet? Hopefully the code is clear enough to make sense.

All that happens then is that back in your extractor definition the call to @extractor.results is passed through to the first output plugin it can find.
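
A minimal sketch of that hand-off might look like the following. Everything here is hypothetical (SketchExtractor, StreamingOutput, CollectingOutput and save are invented names), and it assumes "the first output plugin it can find" means the first plugin that actually responds to results.

```ruby
# Hypothetical plugin that streams elsewhere and exposes no results method.
class StreamingOutput
  def save(name, results)
    # would write to a file, socket, etc.
  end
end

# Hypothetical plugin that collects everything in memory, like Hash output.
class CollectingOutput
  attr_reader :results

  def initialize
    @results = []
  end

  def save(name, results)
    @results << results
  end
end

# Stand-in for the extractor side of the conversation.
class SketchExtractor
  def initialize(outputs)
    @outputs = outputs
  end

  # Every plugin gets to see every result.
  def save_result(name, results)
    @outputs.each { |output| output.save(name, results) }
  end

  # Delegate to the first plugin that offers a results method, if any.
  def results
    plugin = @outputs.find { |output| output.respond_to?(:results) }
    plugin && plugin.results
  end
end

extractor = SketchExtractor.new([StreamingOutput.new, CollectingOutput.new])
extractor.save_result(:result, ["first match"])
extractor.save_result(:result, ["second match"])
p extractor.results
```

If no plugin exposes results, this sketch simply returns nil rather than raising.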

Naming Conventions and Namespacing

The only additional requirement for a plugin to work in scRUBYt! is that it is correctly named and namespaced. As you may have noticed, the ones I've provided are called Scrubyt::Output::Hash and Scrubyt::Output::XmlFile, which means they can be targeted using :output => :hash and :output => :xml_file respectively. If you wanted to call your output GlennsBadExample, it would be namespaced as Scrubyt::Output::GlennsBadExample, and you'd then just need to require the appropriate file and use :output => :glenns_bad_example.
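
The symbol-to-class mapping can be pictured with a small sketch like this one. It assumes the symbol is camelized on underscores and looked up inside the output namespace; SketchOutput, camelize and output_class_for are illustrative names, not scRUBYt!'s real lookup code.

```ruby
# Hypothetical namespace standing in for Scrubyt::Output.
module SketchOutput
  class Hash; end
  class XmlFile; end
  class GlennsBadExample; end
end

# :xml_file => "XmlFile", :glenns_bad_example => "GlennsBadExample"
def camelize(name)
  name.to_s.split("_").map { |part| part.capitalize }.join
end

# Resolve an :output symbol to a class inside the namespace.
# Raises NameError if no matching class has been defined (or required).
def output_class_for(name)
  SketchOutput.const_get(camelize(name))
end

p output_class_for(:hash)
p output_class_for(:xml_file)
p output_class_for(:glenns_bad_example)
```

Under this assumption the underscores matter: a symbol with a missing underscore camelizes to a different constant name and the lookup fails.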

Oh the possibilities! So what's next?

We're only just starting to see what this will offer our extractors. It opens up the possibility of pushing results not only to a different format, but possibly to a completely different service. It's now trivial to create an output format that streams results directly into backgroundRB, a nanite worker, or a web service for further processing and data warehousing. By the time you read this, you'll also be able to pass in an array of outputs like :output => [:hash, :xml_file] and have both plugins generate the appropriate format(s). For the scraper I'm currently working on, where I have two different companies wanting the same data, this could be just the ticket for interfacing directly with their APIs as I scrape.

I'd love to hear what ideas people might have for this, or how you think it could be improved. We're really hopeful that this is the kind of thing that makes developing and extending scRUBYt! really easy for those with more complicated needs.
