Touched by god (process monitoring)
Dec 28, 2007
The tagline on the god website simply states 'Like monit, only awesome', and having played with it for a couple of days over the break, I have to agree. Monit was very handy at the time, but I grew increasingly frustrated with it when things wouldn't restart properly (stuck sockets with backgroundrb being one example), and its lack of logging only compounded the situation.
And then I found god.
All blasphemy aside, this tool really is awesome. The scripts are written in Ruby so you can be as creative as you want, it has built-in support for adding your own behaviour hooks (like a clean_socket action when trying to restart backgroundrb) and, best of all, it provides incredibly in-depth logging that you can request on a per-process basis.
Installing god
It's a gem, so installation couldn't be easier:
$ sudo gem install god
Configuring god
The next step was to set up a config file. I called mine /etc/god.conf and set it up to dynamically include anything in /etc/god.d/ like so:
God.load "/etc/god.d/*.god"
For another client running a server with multiple applications, we set it up so that the god config could be deployed along with the Rails applications, so the /etc/god.conf file looked like:
God.load "/var/www/apps/*/current/config/*.god"
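Before pointing god at a glob like this, it can help to check which files the pattern actually picks up. The following is a minimal sketch in plain Ruby that assumes God.load expands the pattern much as Dir.glob does. The helper name god_configs and the temporary directory layout are made up for the illustration and are not part of god.

```ruby
require "tmpdir"
require "fileutils"

# Hypothetical helper: list the files a glob pattern would match, sorted
def god_configs(pattern)
  Dir.glob(pattern).sort
end

Dir.mktmpdir do |root|
  # Fake two deployed apps; only "blog" ships a .god file
  %w[blog shop].each do |app|
    FileUtils.mkdir_p(File.join(root, app, "current", "config"))
  end
  FileUtils.touch(File.join(root, "blog", "current", "config", "mongrel.god"))
  FileUtils.touch(File.join(root, "shop", "current", "config", "database.yml"))

  matches = god_configs(File.join(root, "*", "current", "config", "*.god"))
  # Only the blog app's mongrel.god is listed; database.yml is ignored
  puts matches.map { |f| f.sub("#{root}/", "") }
end
```

An app without a .god file in its config directory is simply skipped, which is what makes the one-line /etc/god.conf workable across many applications.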
But now on to the nitty-gritty of what is actually in those application-specific configs. Here's an example from this site:
app_root = "/var/www/apps/rubypond/current"

%w{10000 10001}.each do |port|
  God.watch do |w|
    w.name = "rubypond-mongrel-#{port}"
    w.group = "rubypond"
    w.uid = "mongrel"
    w.interval = 30.seconds # default
    w.start = "mongrel_rails start -c #{app_root} -p #{port} \
      -P #{app_root}/log/mongrel.#{port}.pid -d -e production"
    w.stop = "mongrel_rails stop -P #{app_root}/log/mongrel.#{port}.pid"
    w.restart = "mongrel_rails restart -P #{app_root}/log/mongrel.#{port}.pid"
    w.start_grace = 10.seconds
    w.restart_grace = 10.seconds
    w.pid_file = File.join(app_root, "log/mongrel.#{port}.pid")

    w.behavior(:clean_pid_file)

    w.start_if do |start|
      start.condition(:process_running) do |c|
        c.interval = 5.seconds
        c.running = false
      end
    end

    w.restart_if do |restart|
      restart.condition(:memory_usage) do |c|
        c.above = 100.megabytes
        c.times = [3, 5] # 3 out of 5 intervals
      end

      restart.condition(:cpu_usage) do |c|
        c.above = 50.percent
        c.times = 5
      end
    end

    # lifecycle
    w.lifecycle do |on|
      on.condition(:flapping) do |c|
        c.to_state = [:start, :restart]
        c.times = 5
        c.within = 5.minutes
        c.transition = :unmonitored
        c.retry_in = 10.minutes
        c.retry_times = 5
        c.retry_within = 2.hours
      end
    end
  end
end
Hopefully the intention of the config is relatively clear: for each of the given ports (10000 and 10001), create a new god watch with these properties and these SLAs. The thing that may need clarifying is what exactly flapping is. It's the state god goes into when it tries to start your service and the service instantly terminates for whatever reason. In the example, if this occurs 5 times within 5 minutes then the process will be unmonitored for a further 10 minutes before god tries again. If that cycle repeats 5 times within 2 hours, god finally gives up. Another cool thing is that you don't always need to provide an explicit restart command; if one doesn't exist, god is intelligent enough to use stop and then start instead.
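To make the "5 times within 5 minutes" rule concrete, here is a small sketch of a sliding-window counter in plain Ruby. It is an illustration of the idea only, not god's internal implementation, and the FlapDetector class and its record method are names invented for this example.

```ruby
# Illustration only: not how god implements the :flapping condition.
class FlapDetector
  def initialize(times:, within:)
    @times = times    # how many transitions count as flapping
    @within = within  # size of the sliding window, in seconds
    @events = []      # timestamps of recent start/restart transitions
  end

  # Record a transition at time `now` (in seconds) and report whether
  # enough transitions have piled up inside the window to call it flapping.
  def record(now)
    @events << now
    @events.reject! { |t| t <= now - @within }
    @events.size >= @times
  end
end

detector = FlapDetector.new(times: 5, within: 5 * 60)

# Five restarts a minute apart all fall inside the five-minute window
p [0, 60, 120, 180, 240].map { |t| detector.record(t) }
# => [false, false, false, false, true]
```

Restarts that are spread far enough apart drop out of the window before the count reaches the threshold, so an occasional crash never trips the rule.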
Monitoring MySQL with god
But hey, why constrain it to just Rails apps? Here's the config to get MySQL into the Garden of Eden:
God.watch do |w|
  w.name = 'mysql'
  w.interval = 30.seconds # default
  w.start = "cd /etc/init.d && ./mysqld start"
  w.stop = "cd /etc/init.d && ./mysqld stop"
  w.restart = "cd /etc/init.d && ./mysqld restart"
  w.start_grace = 10.seconds
  w.restart_grace = 10.seconds
  w.pid_file = '/var/run/mysqld/mysqld.pid'

  w.behavior(:clean_pid_file)

  w.start_if do |start|
    start.condition(:process_running) do |c|
      c.interval = 5.seconds
      c.running = false
    end
  end

  # lifecycle
  w.lifecycle do |on|
    on.condition(:flapping) do |c|
      c.to_state = [:start, :restart]
      c.times = 5
      c.within = 5.minutes
      c.transition = :unmonitored
      c.retry_in = 10.minutes
      c.retry_times = 5
      c.retry_within = 2.hours
    end
  end
end
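The MySQL watch repeats the start_if and lifecycle blocks from the Mongrel config word for word. Since god configs are ordinary Ruby, one option is to pull the shared part into a method. The sketch below is an assumption about how that might look rather than anything god provides: apply_standard_monitoring is a made-up name, and the durations are written as plain seconds (on the assumption that god's 5.seconds style helpers boil down to a number of seconds) so the snippet does not depend on god being loaded.

```ruby
# Hypothetical helper: apply the start and flapping rules shared by
# both watches. `w` is the watch object yielded by God.watch.
def apply_standard_monitoring(w)
  # Start the process whenever it is found not to be running
  w.start_if do |start|
    start.condition(:process_running) do |c|
      c.interval = 5               # seconds
      c.running = false
    end
  end

  # Back off when the process keeps bouncing between states
  w.lifecycle do |on|
    on.condition(:flapping) do |c|
      c.to_state = [:start, :restart]
      c.times = 5
      c.within = 5 * 60            # 5 minutes
      c.transition = :unmonitored
      c.retry_in = 10 * 60         # 10 minutes
      c.retry_times = 5
      c.retry_within = 2 * 60 * 60 # 2 hours
    end
  end
end
```

With the method defined near the top of the config, each watch would set its own name, commands and pid_file and then call apply_standard_monitoring(w), leaving only the service-specific rules (such as the memory and CPU limits for Mongrel) inside the watch itself.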
Email and notifications
It's all well and good having god keep the system up and accessible, but it's still nice to know when things have gone wrong so you can try to prevent it from happening again. You'll need to add some email settings to the top of /etc/god.conf. Be sure to change them to something appropriate for your server:
God::Contacts::Email.message_settings = {
  :from => 'god@rubypond-example.com'
}

God::Contacts::Email.server_settings = {
  :address => "localhost",
  :port => 25,
  :domain => "rubypond.com",
  :authentication => :plain,
  :user_name => "glenn",
  :password => "password"
}

God.contact(:email) do |c|
  c.name = 'glenn'
  c.email = 'glenn@example.com'
end
Then you need to add a notification to the transition you want to be alerted about, using the name of the contact you defined:
w.transition(:up, :start) do |on|
  on.condition(:process_exits) do |c|
    c.notify = 'glenn'
  end
end
Tada! And there you have it, a whole bunch of god-like goodness. I'll follow up with another article covering some more app-specific configs and how to extend god with your own behaviours.