scRUBYt! Gets Plugins!

Jan 16, 2009

Yes! You heard right! As you may have gathered, it's been a rather frantic month of development for scRUBYt! and currently this is the addition I'm most proud of. One of the most common requests used to be along the lines of "when do you plan to support xxx format output?". Now, scRUBYt! is oblivious to output formats. That's right, it natively supports nothing, nada, zilch. But to make it useful out of the box, we've written a Hash output plugin that ships with it.

How to use a scRUBYt! output plugin

Firstly, you need to make sure you have the plugin you require installed. At the time of writing there are only the two I've written, Hash and XmlFile. Then require the plugin in your Ruby file. As the current edge release isn't yet packaged as a gem, you'll need to test this with the GitHub checkout and reference the output plugin explicitly:

require "plugins/scrubyt_xml_file_output/scrubyt_xml_file_output"

If you've been following the tutorials talking about web scraping with the new version for the past few weeks you'll have seen how to direct output to a plugin. To request Hash output it's:

@extractor = Scrubyt::Extractor.new :output => :hash do
    fetch "http://www.google.com/search?&q=ruby"
    result "//html/body/div[5]/div/div/h2/a"
end

and for XmlFile it is:

@file = File.open("results.xml", "w")

@extractor = Scrubyt::Extractor.new :output => :xml_file, :file => @file do
    fetch "http://www.google.com/search?&q=ruby"
    result "//html/body/div[5]/div/div/h2/a"
end

The XmlFile output takes an additional parameter which is the file to stream the results out to.

Creating your own plugin

That's great for those of you that are happy with XML or Hash output, but what about if you want some other custom format? Well it's time to create your own. I'll show you the actual code that implements the XmlFile output to show you how simple it is:

require 'rexml/document'
require "#{File.dirname(__FILE__)}/inflector"
require "#{File.dirname(__FILE__)}/inflections"

class Scrubyt::Output::XmlFile < Scrubyt::Output::Plugin
  @subscribers = {}
  on_initialize :setup_file
  before_extractor :open_root_node
  after_extractor :close_root_node
  on_save_result :save_xml

  def setup_file(args = {})
    @file = args[:file]
  end

  def open_root_node(*args)
    @file.write("<root>")
  end

  def save_xml(name, results)
    if results.is_a?(::Hash)
      @file.write results.to_xml
    else
      results.each do |result|
        @file.write result.to_xml(name)
      end
    end
  end

  def close_root_node(*args)
    @file.write("</root>")
  end
end

The require lines at the top are only needed for this output format: REXML to construct the XML tags for me, and some inflections I've put together to turn the Hash and Array objects into XML. Now on to analysing the class proper.

class Scrubyt::Output::XmlFile < Scrubyt::Output::Plugin
  @subscribers = {}

At the moment, you'll need to initialize this instance variable to be an empty Hash for the events to get attached correctly. I'm looking for a way to remove it, stay tuned. But for now you'll need to put it in.
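For anyone wondering why that line is needed at all: a class-level instance variable belongs to the one class object it is set on, so subclasses don't inherit it. Here's a minimal sketch in plain Ruby, not scRUBYt!'s actual code; SketchPlugin, WithoutInit, WithInit, AutoInit and AutoChild are all made-up names. The last part shows one possible way to drop the line using Ruby's inherited hook, which is only a guess at what a fix might look like.

```ruby
# Hypothetical stand-in for a plugin base class (not scRUBYt!'s real code).
class SketchPlugin
  @subscribers = {}          # belongs to SketchPlugin only

  class << self
    attr_reader :subscribers
  end
end

# A subclass does not see the parent's class-level instance variable...
class WithoutInit < SketchPlugin; end

# ...unless it sets its own, which is what the line in the plugin does.
class WithInit < SketchPlugin
  @subscribers = {}
end

# One possible way to avoid the boilerplate: Ruby calls `inherited`
# on the parent whenever a subclass is defined.
class AutoInit
  class << self
    attr_reader :subscribers

    def inherited(subclass)
      super
      subclass.instance_variable_set(:@subscribers, {})
    end
  end
end

class AutoChild < AutoInit; end

p WithoutInit.subscribers  # nil
p WithInit.subscribers     # {}
p AutoChild.subscribers    # {}
```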

on_initialize :setup_file
before_extractor :open_root_node
after_extractor :close_root_node
on_save_result :save_xml

Here we've got four events to listen for; the concept should be familiar if you're coming from Rails. Essentially all we are doing is saying "When we initialize, run the setup_file method. Before the extractor actually starts, run the method called open_root_node. Whenever we get a result to save, call save_xml. And finally, after the extractor finishes, run the method called close_root_node."
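If you're curious how class-level declarations like these can end up calling instance methods, here's a rough sketch of the general pattern in plain Ruby. It is not scRUBYt!'s real Plugin class: SketchOutputPlugin, TraceOutput, subscribers and notify are names I've assumed for illustration, and the real implementation may well differ.

```ruby
# Hypothetical mini event system; scRUBYt!'s real Plugin class may differ.
class SketchOutputPlugin
  EVENTS = [:on_initialize, :before_extractor, :after_extractor, :on_save_result]

  class << self
    # Each subclass gets its own event => [method names] table.
    def subscribers
      @subscribers ||= Hash.new { |hash, event| hash[event] = [] }
    end

    # Define on_initialize, before_extractor, ... as class-level macros.
    EVENTS.each do |event|
      define_method(event) { |method_name| subscribers[event] << method_name }
    end
  end

  def initialize(args = {})
    notify(:on_initialize, args)
  end

  # Call every method registered for the event, in registration order.
  def notify(event, *args)
    self.class.subscribers[event].each { |method_name| send(method_name, *args) }
  end
end

class TraceOutput < SketchOutputPlugin
  on_initialize    :setup
  before_extractor :start
  on_save_result   :save
  after_extractor  :finish

  attr_reader :log

  def setup(args = {})
    @log = ["setup"]
  end

  def start(*args)
    @log << "start"
  end

  def save(name, results)
    @log << "save #{name}"
  end

  def finish(*args)
    @log << "finish"
  end
end

# Simulate the calls an extractor would make during a run.
plugin = TraceOutput.new(:file => nil)
plugin.notify(:before_extractor)
plugin.notify(:on_save_result, "result", ["Ruby"])
plugin.notify(:after_extractor)
p plugin.log  # ["setup", "start", "save result", "finish"]
```

The class-level calls only record method names; nothing runs until an instance is notified of the event.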

def setup_file(args = {})
  @file = args[:file]
end

This is fairly straightforward. If you've got any custom logic that needs to happen when the output plugin is initialized you can place it in here. Any parameter that is passed in to Extractor.new() is passed through for you to access here.

def open_root_node(*args)
  @file.write("<root>")
end

Next, to start the XML file and keep it somewhat consistent with the old scRUBYt! XML output, we open a root node in the file.

def save_xml(name, results)
  if results.is_a?(::Hash)
    @file.write results.to_xml
  else
    results.each do |result|
      @file.write result.to_xml(name)
    end
  end
end

Here is where the majority of the magic happens. The save_xml method will be passed the desired name for the result, and a hash of the results. This is essentially the same format you'd get if you used the Hash output format, except for each individual detail block rather than the entire extractor.

The reason for the if/else scenario is for when results are not part of a detail block. If you're just returning results straight (like the Google example at the top of this post) then "results" in this context will be a list/Array of all the matching results rather than a Hash.
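To make the two shapes concrete, here's a small standalone sketch of the same branch. Plain string building stands in for the to_xml inflections, so the exact tag layout is an assumption on my part, as are the helper names tag and save_xml_sketch and the sample data.

```ruby
require 'stringio'

# Hypothetical helper: wrap a value in a tag, escaping &, < and >.
def tag(name, value)
  escaped = value.to_s.gsub(/[&<>]/, "&" => "&amp;", "<" => "&lt;", ">" => "&gt;")
  "<#{name}>#{escaped}</#{name}>"
end

# Mirrors the if/else in save_xml (the tag layout here is assumed).
def save_xml_sketch(io, name, results)
  if results.is_a?(::Hash)
    # Detail block: a Hash, so write one tag per key.
    results.each { |key, value| io.write tag(key, value) }
  else
    # Flat results: an Array, so write one tag per match, named after the result.
    results.each { |result| io.write tag(name, result) }
  end
end

out = StringIO.new
save_xml_sketch(out, "book", { :title => "Ruby & Rails", :price => "10" })
save_xml_sketch(out, "link", ["Ruby", "Rails"])
puts out.string
# <title>Ruby &amp; Rails</title><price>10</price><link>Ruby</link><link>Rails</link>
```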

Passing results back to the extractor

Not everyone is going to want to stream results out to a file though, so to deal with this you can make a results method available on the instance of your plugin. As I said earlier, even Hash operates as a plugin now, so we can see an example of how this works in the Hash output plugin:

class Scrubyt::Output::Hash < Scrubyt::Output::Plugin
  @subscribers = {}
  on_initialize :setup_results
  on_save_result :store_hash

  def setup_results(args = {})
    @results = []
  end

  def results
    @results
  end

  def store_hash(name, passed_results)
    @results << passed_results
  end
end

Here we set up a @results instance variable in on_initialize, and then on_save_result simply pushes the passed_results into @results. Confused yet? Hopefully the code is clear enough to make sense.

All that happens then is that back in your extractor definition the call to @extractor.results is passed through to the first output plugin it can find.
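As a rough illustration of that pass-through, here's a sketch using stand-in classes. SketchExtractor, FileLikeOutput and MemoryOutput are hypothetical names, and picking the first plugin that responds to results is my reading of "the first output plugin it can find", not a confirmed detail of the real extractor.

```ruby
# Hypothetical stand-ins; the real extractor and plugins are more involved.
class FileLikeOutput
  # Streams results elsewhere, so it exposes no results method.
  def save(name, results); end
end

class MemoryOutput
  attr_reader :results

  def initialize
    @results = []
  end

  def save(name, results)
    @results << results
  end
end

class SketchExtractor
  def initialize(outputs)
    @outputs = outputs
  end

  # Every output plugin gets to see every result.
  def save(name, results)
    @outputs.each { |output| output.save(name, results) }
  end

  # Hand the call to the first output plugin that can answer it.
  def results
    plugin = @outputs.find { |output| output.respond_to?(:results) }
    plugin && plugin.results
  end
end

extractor = SketchExtractor.new([FileLikeOutput.new, MemoryOutput.new])
extractor.save("result", ["Ruby", "Rails"])
p extractor.results  # [["Ruby", "Rails"]]
```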

Naming Conventions and Namespacing

The only additional requirement for a plugin to work in scRUBYt! is that it is correctly named and namespaced. As you may have noticed, the ones I've provided are called Scrubyt::Output::Hash and Scrubyt::Output::XmlFile; that means they can be targeted using :output => :hash and :output => :xml_file respectively. If you wanted to call your output GlennsBadExample it would be namespaced as Scrubyt::Output::GlennsBadExample and you'd then just need to require the appropriate file and use :output => :glenns_bad_example.
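The mapping from symbol to class name can be sketched in a few lines of plain Ruby. This is only an illustration of the convention: SketchOutput is a stand-in namespace and plugin_class_for is a made-up helper, not how scRUBYt! necessarily does the lookup.

```ruby
# Stand-in namespace with made-up plugin classes, for illustration only.
module SketchOutput
  class Hash; end
  class XmlFile; end
  class GlennsBadExample; end
end

# Turn :xml_file into "XmlFile" and look the constant up in the namespace.
def plugin_class_for(output_name, namespace = SketchOutput)
  class_name = output_name.to_s.split("_").map { |part| part.capitalize }.join
  namespace.const_get(class_name)
end

p plugin_class_for(:hash)                # SketchOutput::Hash
p plugin_class_for(:xml_file)            # SketchOutput::XmlFile
p plugin_class_for(:glenns_bad_example)  # SketchOutput::GlennsBadExample
```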

Oh the possibilities! So what's next?

We're only just starting to see the possibilities that this will offer our extractors. It opens up the possibility of pushing results not only to a different format, but possibly a completely different service. It's now trivial to create an output format that streams results directly into backgroundRB, a nanite worker, or a web service for further processing and data warehousing. By the time you read this, you'll also be able to pass in an array of outputs like :output => [:hash, :xml_file] and have both plugins generate the appropriate format(s). For the scraper I'm currently working on where I have two different companies wanting the same data, this could be just the ticket for interfacing directly to their API as I scrape.

I'd love to hear what ideas people might have for this, or how you think it could be improved. We're really hopeful that this is the kind of thing that makes developing and extending scRUBYt! really easy for those with more complicated needs.

Hi, I'm Glenn! 👋 I've spent most of my career working with or at startups. I'm currently the Director of Product @ Ockam where I'm helping developers build applications and systems that are secure-by-design. It's time we started securely connecting apps, not networks.

Previously I led the Terraform product team @ HashiCorp, where we launched Terraform Cloud and set the stage for a successful IPO. Prior to that I was part of the Startup Team @ AWS, and earlier still an early employee @ Heroku. I've also invested in a couple of dozen early stage startups.