Getting started with MongoDB
Jul 03, 2010
MongoDB is a document database, which, if you've not come across one before, is best visualised as a way of persisting a complex hash object. It differs from a key-value store (like Memcached or Redis) in that the "documents" you store aren't retrieved by their key alone; they can also be found by querying their values. Objects can have any structure, as you don't define a schema upfront, and can be nested to arbitrary depth.
The beauty of a design like this is that you get a flexible way of storing whatever you want, with dependent objects nested inside each other, so you don't need to join across multiple tables to get the information you need. In situations where the most common use-case requires more than one query or a table join to retrieve everything you want (a blog post along with its comments is a good example), this can be a real performance win: you'd store the comments within the blog post itself rather than in a separate table or collection.
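To make that concrete, here's a sketch using plain Ruby hashes (the field names are made up for illustration) of how a post and its comments might live together in a single document:

```ruby
# A hypothetical blog post document with its comments embedded directly,
# rather than split across joined tables.
post = {
  :title    => "Getting started with MongoDB",
  :body     => "MongoDB is a document database...",
  :comments => [
    { :author => "Alice", :text => "Great post!" },
    { :author => "Bob",   :text => "Thanks, very helpful." }
  ]
}

# One fetch of the post gives you the comments too -- no join required.
post[:comments].each do |comment|
  puts "#{comment[:author]}: #{comment[:text]}"
end
```

Retrieving the post retrieves everything you need to render the page in a single round trip.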
Where I've found it useful is in storing the data from the Twitter Streaming API. I've had grand plans to build an app on top of this data and API, but life continues to get in the way, and in the interim Twitter keeps adding new features and richer data. Thanks to MongoDB I don't need to update my code to stay in step with their changes; I just dump the data straight into the DB and it will automatically include the new fields. Even better, I can write queries against this new data immediately and don't have to worry about migrations.
Installing MongoDB
Installing MongoDB is easy. On OS X I just used Homebrew:
brew install mongodb
On other platforms there are binaries available to download.
By default mongod wants to store data in the /data/db directory, so you'll need to create that first:
mkdir -p /data/db
Once that's done you can start the service by just running mongod. Open another terminal session and run mongo to connect to your local server. You'll be taken to a Mongo shell session where you can start issuing JavaScript commands to talk to your database. Start by doing the following:
db.mydatabase.save({ name: "Steve" })
The command above will save the document {name: "Steve"} into a collection called mydatabase (within the shell's current database, which is test by default). But that doesn't exist, does it? Well, it does now. If you issue a command against a database or collection that doesn't exist, Mongo will create it for you and then run the command against it. If you take a look in /data/db now you should see a couple of files, one of them probably in the range of 64MB to 2GB. Hang about, what?! 2GB to store that one small document?
Mongo pre-allocates disk space for storing these objects, which means the next record you insert won't increase the size of these files. It will continue appending documents into the pre-allocated files until they are full, at which point it will pre-allocate another file of the same size and repeat the process.
Storing the Twitter Stream in MongoDB
As I said, I've been using it as a flexible store for persisting the Twitter stream. I've created a Tweet model to store each tweet, which looks like the following:
require 'mongo'
require 'twitter-text'

class Tweet
  # Insert one tweet (or an array of tweets) into the collection.
  def self.create!(tweets)
    collection.insert(tweets)
  end

  def self.establish_connection
    Mongo::Connection.new.db("twitter")
  end

  def self.db
    @db ||= establish_connection
  end

  def self.collection
    @collection ||= db.collection("tweets")
  end

  # `private` has no effect on methods defined with `def self.`,
  # so hide the helpers explicitly.
  private_class_method :establish_connection, :db, :collection
end
I'm just using the native Ruby driver rather than one of the wrapper libraries, and to be honest I have no plans to change, as the interface is so simple I don't really see the point. You just need to open a connection to a database (in the code above it's called twitter) and then identify the collection this model writes to (the equivalent of a table, for those transitioning from SQL; above it's called tweets). Remember that if the database or collection doesn't exist, Mongo will just create it for us. Using the Ruby code we've been building on from previous posts, we can take the data we've been receiving and store it with the following:
Twitter.stream("mytwittername", "secret") do |status|
  Tweet.create!(status)
end
Simple! Now you're storing the tweets as quickly as they arrive. That in itself isn't very interesting though, so let's see how we'd go about retrieving some of our saved data.
Querying in MongoDB
To do the equivalent of a SELECT * FROM ... WHERE ... in Mongo you pass a query document (a Ruby hash, when using the driver) to the find command. Something like the following, using dot notation to reach into nested fields:
collection.find("user.screen_name" => "glenngillen")
What might not be entirely obvious here is that the Twitter data comes back in a format like the following:
{ :user => { :screen_name => "glenngillen",
:profile_image_url => "http://rubypond.com/image.png",
:followers_count => 1000000 },
:text => "This is the text from my tweet"
}
And you can see from the query above, I'm able to query based on a value nested down within the :user key. Much like SQL, you can write queries to return data that is in a given list of values, is greater than or less than a certain value, or even matches a regular expression. For more examples of how to query the data, head over to the MongoDB query documentation.
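For reference, here's a sketch of what those query documents look like from the Ruby side. They're just hashes built from Mongo's standard query operators ($in, $gt) and dot notation; the field names come from the tweet example above, and the values are made up:

```ruby
# Match tweets from any of a list of screen names. Dot notation
# ("user.screen_name") reaches into the nested user sub-document.
by_names = { "user.screen_name" => { "$in" => ["glenngillen", "rubypond"] } }

# Match tweets from users with more than 1,000 followers.
popular = { "user.followers_count" => { "$gt" => 1000 } }

# Match tweet text against a regular expression.
mentions_mongo = { :text => /mongodb/i }

# Each of these would be passed straight to find, e.g.:
#   collection.find(popular)
```

Because the queries are just ordinary hashes, you can build them up programmatically before handing them to the driver.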
Taking it further
In coming posts I'll expand on the previous examples and show you how to easily:
Supply additional options to the stream to filter it down to just the tweets you are interested in
Provision Amazon EC2 instances to help you deal with processing load
Get Chef involved to handle the provisioning and setup of your EC2 instances automatically
Use RabbitMQ to dispatch work to multiple servers
Load balance your MongoDB instances across EC2