Touched by god (process monitoring)

Dec 28, 2007

The tagline on the god website simply states 'Like monit, only awesome', and having played with it for a couple of days over the break, I have to agree. Monit was very handy at the time, but I grew increasingly frustrated with it when things wouldn't restart properly (stuck sockets with backgroundrb being one example), and its lack of logging only compounded the situation.

And then I found god.

All blasphemy aside, this tool really is awesome. The scripts are in Ruby so you can be as creative as you want, it has built-in support for adding your own behaviour hooks (like a clean_socket action when trying to restart backgroundrb), and best of all it provides incredibly in-depth logging, requestable on a per-process basis.

Installing god

It's a gem, so installation couldn't be easier:

$ sudo gem install god

Configuring god

The next step was to set up a config file. I called mine /etc/god.conf and set it up to dynamically include anything in /etc/god.d/ like so:

God.load "/etc/god.d/*.god" 

For another client who is running a server with multiple applications, we set it up so that the god config could be deployed with the Rails applications themselves, so the /etc/god.conf file looked like:

God.load "/var/www/apps/*/current/config/*.god" 
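
If you want to check which files a pattern like this would pick up before pointing god at it, plain Ruby can do a dry run. This is only a sketch: it assumes god expands the pattern the same way Dir.glob does, and the matching_configs helper and the "blog" and "shop" app names are made up for illustration.

```ruby
require "tmpdir"
require "fileutils"

# Hypothetical helper (not part of god): list the files a glob pattern matches.
def matching_configs(pattern)
  Dir.glob(pattern).sort
end

# Build a throwaway tree that mimics the per-application layout.
Dir.mktmpdir do |root|
  %w[blog shop].each do |app|
    config_dir = File.join(root, "apps", app, "current", "config")
    FileUtils.mkdir_p(config_dir)
    FileUtils.touch(File.join(config_dir, "#{app}.god"))
    FileUtils.touch(File.join(config_dir, "database.yml")) # not *.god, so skipped
  end

  pattern = File.join(root, "apps", "*", "current", "config", "*.god")
  # Print the matches relative to the temporary root.
  puts matching_configs(pattern).map { |path| path.sub(root, "") }
end
```

Only the .god files show up, one per application, so anything else living in config/ is left alone.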

But now onto the nitty-gritty of what is actually in those application-specific configs. Here's an example from this site:

app_root = "/var/www/apps/rubypond/current" 
%w{10000 10001}.each do |port|
  God.watch do |w|
    w.name = "rubypond-mongrel-#{port}" 
    w.group = "rubypond"
    w.uid = "mongrel" 
    w.interval = 30.seconds # default
    w.start = "mongrel_rails start -c #{app_root} -p #{port} \
      -P #{app_root}/log/mongrel.#{port}.pid  -d -e production" 
    w.stop = "mongrel_rails stop -P #{app_root}/log/mongrel.#{port}.pid" 
    w.restart = "mongrel_rails restart -P #{app_root}/log/mongrel.#{port}.pid" 
    w.start_grace = 10.seconds
    w.restart_grace = 10.seconds
    w.pid_file = File.join(app_root, "log/mongrel.#{port}.pid")
    w.behavior(:clean_pid_file)
    w.start_if do |start|
      start.condition(:process_running) do |c|
        c.interval = 5.seconds
        c.running = false
      end
    end

    w.restart_if do |restart|
      restart.condition(:memory_usage) do |c|
        c.above = 100.megabytes
        c.times = [3, 5] # 3 out of 5 intervals
      end
      restart.condition(:cpu_usage) do |c|
        c.above = 50.percent
        c.times = 5
      end
    end

    # lifecycle
    w.lifecycle do |on|
      on.condition(:flapping) do |c|
        c.to_state = [:start, :restart]
        c.times = 5
        c.within = 5.minute
        c.transition = :unmonitored
        c.retry_in = 10.minutes
        c.retry_times = 5
        c.retry_within = 2.hours
      end
    end
  end
end
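
The c.times = [3, 5] line in the memory check is worth a second look: the condition has to fail in 3 of the last 5 checks before god acts. Here is a minimal sketch of that idea in plain Ruby. The TimesWindow class is hypothetical and is not how god implements it internally; the readings are invented numbers.

```ruby
# Hypothetical helper, not god's implementation: remembers the last
# `window` check results and reports when `required` of them were breaches.
class TimesWindow
  def initialize(required, window)
    @required = required
    @window = window
    @samples = []
  end

  # Record one check (true = over the limit) and return true when enough
  # of the remembered checks were breaches.
  def push(breached)
    @samples << breached
    @samples.shift if @samples.size > @window
    @samples.count(true) >= @required
  end
end

limit_mb = 100
readings_mb = [90, 120, 95, 130, 110] # made-up memory samples
window = TimesWindow.new(3, 5)
verdicts = readings_mb.map { |mb| window.push(mb > limit_mb) }
puts verdicts.inspect # => [false, false, false, false, true]
```

A single spike doesn't trigger anything; it takes the third breach inside the five-check window to do that.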

Hopefully the intention of the config is relatively clear: for each of the given ports (10000 and 10001), create a new god watch that has these properties and these SLAs. The thing that may need clarifying is what exactly flapping is. It's the state a process is in when god tries to start your service and it terminates almost instantly, for whatever reason. In the example, if this occurs 5 times within 5 minutes then the process will be unmonitored for 10 minutes before god tries again. If the process is caught flapping 5 times within 2 hours, god finally gives up. Another cool thing is that you don't always need to explicitly provide a restart command; god is intelligent enough to assume that if one doesn't exist it should use stop and then start instead.
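
To make the flapping rule a bit more concrete, here is a rough sketch of the "N transitions within a time window" check in plain Ruby. The FlapDetector class is hypothetical and only illustrates the idea; it is not god's own code, and it ignores the retry settings.

```ruby
# Hypothetical helper, not part of god: counts transitions inside a
# sliding time window and reports when the threshold is reached.
class FlapDetector
  def initialize(times:, within:)
    @times = times    # number of transitions that counts as flapping
    @within = within  # window length in seconds
    @timeline = []    # timestamps of recent transitions
  end

  # Record a transition at `now` (in seconds) and return true when at
  # least `times` transitions fall inside the last `within` seconds.
  def record(now)
    @timeline << now
    @timeline.reject! { |t| now - t > @within }
    @timeline.size >= @times
  end
end

# Same numbers as the config: 5 times within 5 minutes.
detector = FlapDetector.new(times: 5, within: 5 * 60)
# Five start attempts, 30 seconds apart: the fifth one trips the detector.
results = [0, 30, 60, 90, 120].map { |t| detector.record(t) }
puts results.inspect # => [false, false, false, false, true]
```

Spread the same five starts over an hour and the detector never fires, which is the point: an occasional restart is fine, a tight restart loop is not.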

Monitoring MySQL with god

But hey, why constrain it to just Rails apps? Here's the config to get MySQL into the Garden of Eden:

God.watch do |w|
  w.name = 'mysql'
  w.interval = 30.seconds # default
  w.start = "cd /etc/init.d && ./mysqld start" 
  w.stop = "cd /etc/init.d && ./mysqld stop" 
  w.restart = "cd /etc/init.d && ./mysqld restart" 
  w.start_grace = 10.seconds
  w.restart_grace = 10.seconds
  w.pid_file = '/var/run/mysqld/mysqld.pid'
  w.behavior(:clean_pid_file)

  w.start_if do |start|
    start.condition(:process_running) do |c|
      c.interval = 5.seconds
      c.running = false
    end
  end

  # lifecycle
  w.lifecycle do |on|
    on.condition(:flapping) do |c|
      c.to_state = [:start, :restart]
      c.times = 5
      c.within = 5.minute
      c.transition = :unmonitored
      c.retry_in = 10.minutes
      c.retry_times = 5
      c.retry_within = 2.hours
    end
  end
end

Email and notifications

It's all well and good having god keep the system up and accessible, but it's still nice to know when things have gone wrong so you can try to prevent them from occurring again. You'll need to add some email settings to the top of /etc/god.conf; be sure to change them to something appropriate for your server:

God::Contacts::Email.message_settings = {
  :from => 'god@rubypond-example.com'
}

God::Contacts::Email.server_settings = {
  :address => "localhost",
  :port => 25,
  :domain => "rubypond.com",
  :authentication => :plain,
  :user_name => "glenn",
  :password => "password" 
}

God.contact(:email) do |c|
  c.name = 'glenn'
  c.email = 'glenn@example.com'
end

Then, inside the watch, you need to add a notification to the transition you want to be alerted about, using the name of the contact defined above:

w.transition(:up, :start) do |on|
  on.condition(:process_exits) do |c|
    c.notify = 'glenn'
  end
end

Tada! And there you have it, a whole bunch of god-like goodness. I'll follow up with another article covering some more app-specific configs and how to extend god with your own behaviours.

Hi, I'm Glenn! 👋 I've spent most of my career working with or at startups. I'm currently the Director of Product @ Ockam where I'm helping developers build applications and systems that are secure-by-design. It's time we started securely connecting apps, not networks.

Previously I led the Terraform product team @ HashiCorp, where we launched Terraform Cloud and set the stage for a successful IPO. Prior to that I was part of the Startup Team @ AWS, and earlier still an early employee @ Heroku. I've also invested in a couple of dozen early stage startups.