Filtering the Twitter Streaming API
Jun 30, 2010
A couple of months ago I gave a brief introduction on how I've been parsing the Twitter Streaming API in my Ruby applications. Part of the new feature set of the Streaming API is the ability to filter the stream so it only includes tweets from your preferred users, tweets matching a set of keywords, tweets from within a specific geographic location, or a combination of the above. There are limits on how many filter rules you can provide, and they vary depending on your access level, so check the documentation for more details.
Extending the Twitter class created in the last post should be fairly straightforward. First there is a new URL:
url = URI.parse("http://#{username}:#{password}@stream.twitter.com/1/statuses/filter.json")
And we have to POST to this address instead of requesting with GET:
Yajl::HttpStream.post(url, :symbolize_keys => true)
We also need to pass through the predicates we want to filter on. I've opted for building up a list of the settings, only including each one if it's been supplied. I've also given the options different names from the ones Twitter specifies, to try and prevent me confusing them in the future (does follow mean "follow this user" or "follow these keywords"? Avoid the confusion by calling them users and keywords instead):
params = []
params << "follow=#{[*filters[:users]].join(",")}" if filters[:users]
params << "track=#{[*filters[:keywords]].join(",")}" if filters[:keywords]
params << "locations=#{[*filters[:locations]].join(",")}" if filters[:locations]
You'll notice above that I'm actually splatting the value of each setting into an Array and then calling join on that. The reason is so I can pass through just a single value (:users => 12) or a list of values (:users => [12,13]) and they'll both work the same way.
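To see the splat trick in isolation, here is a minimal sketch you can paste into irb. The helper name filter_params is hypothetical and not part of the class from the last post; it just wraps the three lines above so both call styles can be compared.

```ruby
# Hypothetical helper (an assumption for illustration, not part of the
# original Twitter class): builds the POST body from the filter options.
def filter_params(filters = {})
  params = []
  # [*value] wraps a single value in an Array and leaves an Array unchanged,
  # so join works the same way for both.
  params << "follow=#{[*filters[:users]].join(",")}" if filters[:users]
  params << "track=#{[*filters[:keywords]].join(",")}" if filters[:keywords]
  params << "locations=#{[*filters[:locations]].join(",")}" if filters[:locations]
  params.join("&")
end

puts filter_params(:users => 12)                          # follow=12
puts filter_params(:users => [12, 13])                    # follow=12,13
puts filter_params(:users => [12, 13], :keywords => "ruby") # follow=12,13&track=ruby
```

Note that this sketch, like the lines above, does no URL encoding, so keywords containing spaces or ampersands would need escaping before being sent.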
All that is left is to wrap it all up in a new method, and add it to our class:
require 'uri'
require 'yajl/http_stream'

class Twitter
  MAX_ALLOWED_ERRORS = 1200

  def self.filter_stream(username, password, filters = {}, &block)
    url = URI.parse("http://#{username}:#{password}@stream.twitter.com/1/statuses/filter.json")

    # Build the POST body, skipping any filter that wasn't supplied
    params = []
    params << "follow=#{[*filters[:users]].join(",")}" if filters[:users]
    params << "track=#{[*filters[:keywords]].join(",")}" if filters[:keywords]
    params << "locations=#{[*filters[:locations]].join(",")}" if filters[:locations]

    consecutive_errors = 0
    while consecutive_errors < MAX_ALLOWED_ERRORS do
      begin
        Yajl::HttpStream.post(url, params.join("&"), :symbolize_keys => true) do |status|
          # A successfully parsed status resets the error count
          consecutive_errors = 0
          yield(status)
        end
      rescue Yajl::HttpStream::InvalidContentType
        consecutive_errors += 1
      end
      # Wait a little longer after each consecutive error before reconnecting
      sleep(0.25*consecutive_errors)
    end
  end
end
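The retry loop sleeps for 0.25 seconds multiplied by the number of consecutive errors, so the wait grows linearly rather than exponentially. A small sketch of that arithmetic follows; retry_delay is a hypothetical name used only for illustration and does not appear in the class.

```ruby
# Hypothetical helper (illustration only): the number of seconds the loop
# sleeps after the given number of consecutive errors, mirroring
# sleep(0.25*consecutive_errors).
def retry_delay(consecutive_errors)
  0.25 * consecutive_errors
end

[0, 1, 2, 4, 1200].each do |errors|
  puts "#{errors} consecutive errors -> sleep #{retry_delay(errors)}s"
end
```

With no errors the delay is zero, so a stream that closes cleanly is reconnected straight away. At the 1200th consecutive error the delay reaches 300 seconds, after which the loop gives up.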
Taking it further
As previously mentioned, I'll extend on this series to cover:
Supply additional options to the stream to filter it down to just the tweets you are interested in
Provision Amazon EC2 instances to help you deal with processing load
Get Chef involved to handle the provision and setup of your EC2 instances automatically
Use RabbitMQ to dispatch work to multiple servers
Load balance your MongoDB instances across EC2