Getting started with MongoDB
Jul 03, 2010
MongoDB is a document database, which, if you've not come across one before, is best visualised as a way of persisting a complex hash object. It differs from a key-value store (like Memcached or Redis) in that the "documents" you store aren't retrieved by their key alone; they can also be found by querying their values. Objects can have any structure, as you don't define a schema upfront, and can be nested to arbitrary depth.
The beauty of a design like this is that you get a flexible way of storing whatever you want, with dependent objects nested inside each other, so you don't need to join across multiple tables to get the information you need. In situations where the most common use-case requires more than one query or a table join to retrieve everything you want (a blog post along with its comments is a good example), this can be a real performance win: you'd store the comments within the blog post itself rather than in a separate table or collection.
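To make that concrete, here's a sketch using plain Ruby hashes (the field names are made up for illustration) of how a post and its comments might live together in a single document:

```ruby
# A hypothetical blog post document with its comments embedded directly,
# rather than split across joined tables.
post = {
  :title    => "Getting started with MongoDB",
  :body     => "MongoDB is a document database...",
  :comments => [
    { :author => "Alice", :text => "Great post!" },
    { :author => "Bob",   :text => "Thanks, very helpful." }
  ]
}

# One fetch of the post gives you the comments too -- no join required.
post[:comments].each do |comment|
  puts "#{comment[:author]}: #{comment[:text]}"
end
```

Retrieving the post retrieves everything you need to render the page in a single round trip.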
Where I've found it useful is in storing the data from the Twitter Streaming API. I've had grand plans to build an app on top of this data and API, but life continues to get in the way, and in the interim Twitter keeps adding new features and richer data. Thanks to MongoDB I don't need to update my code to stay in step with their changes; I just dump the data straight into the DB and it will automatically include the new fields. Even better, I can write queries against this new data immediately and don't have to worry about migrations.
Installing MongoDB
Installing MongoDB is easy. On OS X I just used Homebrew:
brew install mongodb
On other platforms there are binaries available to download.
By default mongod wants to store data in the /data/db directory, so you'll need to create that first:
mkdir -p /data/db
Once that's done you can start the service by just running mongod. Open another terminal session and run mongo to connect to your local server. You'll be taken to a Mongo shell session where you can start issuing JavaScript commands to talk to your database. Start by doing the following:
db.mydatabase.save({ name: "Steve" })
The command above will save the document {name: "Steve"} into a collection called mydatabase (within the shell's current database, which is test by default). But that doesn't exist, does it? Well, it does now. If you issue a command against a database or collection that doesn't exist, Mongo will create it for you and then run the command against it. If you take a look in /data/db now you should see a couple of files, one of them probably in the range of 64MB to 2GB. Hang about, what?! 2GB to store that one small document?
Mongo pre-allocates disk space for storing these objects, which means the next record you insert won't increase the size of these files. It will continue appending documents into the pre-allocated files until they are full, at which point it will pre-allocate another file of the same size and repeat the process.
Storing the Twitter Stream in MongoDB
As I said, I've been using it as a flexible store for persisting the Twitter stream. I've created a Tweet model to store each tweet, which looks like the following:
require 'mongo'
require 'twitter-text'

class Tweet
  # Insert one tweet (or an array of tweets) into the collection.
  def self.create!(tweets)
    collection.insert(tweets)
  end

  def self.establish_connection
    Mongo::Connection.new.db("twitter")
  end

  def self.db
    @db ||= establish_connection
  end

  def self.collection
    @collection ||= db.collection("tweets")
  end

  # `private` has no effect on methods defined with `def self.`,
  # so hide the helpers explicitly.
  private_class_method :establish_connection, :db, :collection
end
I'm just using the native Ruby driver rather than one of the wrapper libraries, and to be honest I have no plans to change, as the interface is so simple I don't really see the point. You just need to open a connection to a database (in the code above it's called twitter) and then identify the collection this model writes to (the equivalent of a table, for those transitioning from SQL; above it's called tweets). Remember that if the database or collection doesn't exist, Mongo will just create it for us. Using the Ruby code we've been building on from previous posts, we can take the data we've been receiving and store it with the following:
Twitter.stream("mytwittername", "secret") do |status|
  Tweet.create!(status)
end
Simple! Now you're storing the tweets as quickly as they arrive. That in itself isn't very interesting though, so let's see how we'd go about retrieving some of our saved data.
Querying in MongoDB
To do the equivalent of a SELECT * FROM ... WHERE ... in Mongo you pass a query document (a Ruby hash, when using the driver) to the find command. Something like the following, using dot notation to reach into nested fields:
collection.find("user.screen_name" => "glenngillen")
What might not be entirely obvious here is that the Twitter data comes back in a format like the following:
{ :user => { :screen_name => "glenngillen",
:profile_image_url => "http://rubypond.com/image.png",
:followers_count => 1000000 },
:text => "This is the text from my tweet"
}
And you can see from the query above, I'm able to query based on a value nested down within the :user key. Much like SQL, you can write queries to return data that is in a given list of values, is greater than or less than a certain value, or even matches a regular expression. For more examples of how to query the data, head over to the MongoDB query documentation.
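For reference, here's a sketch of what those query documents look like from the Ruby side. They're just hashes built from Mongo's standard query operators ($in, $gt) and dot notation; the field names come from the tweet example above, and the values are made up:

```ruby
# Match tweets from any of a list of screen names. Dot notation
# ("user.screen_name") reaches into the nested user sub-document.
by_names = { "user.screen_name" => { "$in" => ["glenngillen", "rubypond"] } }

# Match tweets from users with more than 1,000 followers.
popular = { "user.followers_count" => { "$gt" => 1000 } }

# Match tweet text against a regular expression.
mentions_mongo = { :text => /mongodb/i }

# Each of these would be passed straight to find, e.g.:
#   collection.find(popular)
```

Because the queries are just ordinary hashes, you can build them up programmatically before handing them to the driver.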
Taking it further
In coming posts I'll expand on the previous examples and show you how to easily:
Supply additional options to the stream to filter it down to just the tweets you are interested in
Provision Amazon EC2 instances to help you deal with processing load
Get Chef involved to handle the provisioning and setup of your EC2 instances automatically
Use RabbitMQ to dispatch work to multiple servers
Load balance your MongoDB instances across EC2