For one of my current clients we've been making some changes to a script that runs nightly via cron: moving over to Bundler, no longer running it via script/runner (to avoid loading the full Rails stack), and a couple of other things. Together these introduce the risk that a dependency fails to load in production, or isn't available to the user or context in which the script runs. Having been caught out once, it's time to put some monitoring on the sucker to make sure it doesn't happen again.

Redirecting script output to a logfile

For most cron jobs we redirect the output to a log file somewhere to help keep our inboxes sane. It also means we can monitor the log for events and fire off an appropriate message (IM/jabber, email, SMS) depending on the severity. So we redirect STDERR to STDOUT, and append both to a file:

/path/to/myscript >> /var/log/myscript 2>&1

In addition to that though, we want to know the exit code from the last time it ran, so I'll write that to a second file at the end of the command:

/path/to/myscript >> /var/log/myscript 2>&1; echo $? > /var/log/myscript-error-code
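To see the pattern working outside cron, here is a small sketch you can run by hand. The temporary directory and the failing-job script are made up for the demonstration. Only the shape of the redirection comes from the command above.

```shell
#!/bin/bash
# Demo of the redirect-and-record pattern. The paths and the fake job are
# hypothetical; in production these would be the real script and /var/log.
WORK_DIR=$(mktemp -d)
JOB="$WORK_DIR/failing-job"

# A stand-in job that writes to both streams and exits non-zero.
cat > "$JOB" <<'EOF'
echo "starting import"
echo "could not load dependency" >&2
exit 3
EOF

# Same shape as the cron command: append both streams to the log, then
# record the job's exit status before any other command overwrites $?.
bash "$JOB" >> "$WORK_DIR/myscript.log" 2>&1; echo $? > "$WORK_DIR/myscript-error-code"

cat "$WORK_DIR/myscript-error-code"   # prints 3
```

The echo has to come straight after the job, separated only by the semicolon, because $? always holds the status of the most recently executed command.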

Returning the exit code

We're using Zabbix to monitor services, so what I need is a means of raising an error if the exit code from the last run was not zero. That could fairly trivially be done by simply doing cat /var/log/myscript-error-code, but I know we're going to want to extend that just a little, so I'm going to put it in a shell script:

#!/bin/bash

LOG_DIR=/var/log
ERROR_CODE_FILE=myscript-error-code

# Print the exit code recorded by the last cron run
cat "$LOG_DIR/$ERROR_CODE_FILE"

Checking that the script has been run in the past 24 hours

There are a number of servers this codebase is deployed to, but we only want the script to run on one of them each night. That introduces the risk that the servers won't know which one should run the script, and as a result it doesn't run at all. To mitigate that, we want to make sure the file holding the exit code is no more than about a day old:

MINS_SINCE_UPDATE=-1500

# find prints nothing unless the file changed less than 1500 minutes ago
if [[ -z $(find "$LOG_DIR" -cmin "$MINS_SINCE_UPDATE" -name "$ERROR_CODE_FILE") ]]; then
  echo "Error"
fi

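I've used the -cmin option, which works in minutes, because the day-based options such as -atime are problematic when monitoring things that should occur in less than a day: they count in whole 24-hour periods, and I want to be confident in the difference between it running 23 hours ago and 25 hours ago. The negative value means "less than 1500 minutes ago", which is 25 hours and so leaves an hour of slack on top of the nightly schedule. The -z option tests whether the string that follows is empty, which is the case when find matches nothing.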
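The freshness test is easy to try out against throwaway files. The sketch below swaps -cmin for -mmin, purely because a file's modification time can be backdated with GNU touch while its change time cannot. The is_fresh helper and the file names are hypothetical.

```shell
#!/bin/bash
# Sketch of the freshness test. Uses -mmin rather than -cmin so that a
# file's age can be faked with touch; is_fresh and the paths are made up.
is_fresh() {
  local dir=$1 name=$2 max_mins=$3
  # find prints the path only if the file was modified less than max_mins ago
  [[ -n $(find "$dir" -maxdepth 1 -name "$name" -mmin "-$max_mins") ]]
}

WORK_DIR=$(mktemp -d)
touch "$WORK_DIR/recent-code"                    # modified just now
touch -d '26 hours ago' "$WORK_DIR/stale-code"   # GNU touch: backdate mtime

is_fresh "$WORK_DIR" recent-code 1500  && echo "recent-code: ok"
is_fresh "$WORK_DIR" stale-code 1500   || echo "stale-code: Error"
is_fresh "$WORK_DIR" missing-code 1500 || echo "missing-code: Error"
```

Note that a file which is too old and a file which does not exist at all both produce empty output from find, so both end up on the error path.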

Calculating the difference between now and the file timestamp

So that I can immediately tell whether the problem lies with cron/scheduling or with the script itself, I'm going to make the monitoring send a more meaningful message when it is a scheduling error:

LAST_MODIFIED=$(stat -c %y "$LOG_DIR/$ERROR_CODE_FILE" | cut -f1 -d '.')
# Current time minus last-modified time, so the result is a positive number of hours
HOUR_SINCE=$(( ($(date +%s) - $(date -d "$LAST_MODIFIED" +%s)) / 3600 ))
echo "Import last run $LAST_MODIFIED ($HOUR_SINCE hours ago)"

Here I'm using stat to return the date the file was last modified, with cut trimming off the fractional seconds. I pass that into date to convert it to seconds since the epoch, and subtract it from the current seconds since the epoch. Divide the result by 3600 and you get how many hours ago the file was updated.
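As a variation, the same arithmetic can be sketched with epoch seconds on both sides. This assumes GNU stat, whose %Y format prints the modification time as seconds since the epoch and so avoids the round trip through date -d. The hours_since_modified helper and the demo file are hypothetical.

```shell
#!/bin/bash
# Sketch: age of a file in whole hours, assuming GNU stat and GNU touch.
# hours_since_modified is a made-up helper name.
hours_since_modified() {
  local file=$1
  local modified now
  modified=$(stat -c %Y "$file")   # mtime as seconds since the epoch
  now=$(date +%s)                  # current time in the same units
  echo $(( (now - modified) / 3600 ))
}

DEMO_FILE=$(mktemp)
touch -d '30 hours ago' "$DEMO_FILE"   # pretend the last run was 30 hours ago
echo "Import last run $(hours_since_modified "$DEMO_FILE") hours ago"
```

Shell arithmetic is integer-only, so the division truncates: a file that is 25 hours and 59 minutes old is reported as 25 hours.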

Putting it all together

So the final script looks something like this:

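#!/bin/bash

LOG_DIR=/var/log
ERROR_CODE_FILE=myscript-error-code
MINS_SINCE_UPDATE=-1500

# find prints nothing unless the file changed less than 1500 minutes ago
if [[ -z $(find "$LOG_DIR" -cmin "$MINS_SINCE_UPDATE" -name "$ERROR_CODE_FILE") ]]; then
  # Scheduling problem: the job has not run recently, so report when it last did
  echo "Error"
  LAST_MODIFIED=$(stat -c %y "$LOG_DIR/$ERROR_CODE_FILE" | cut -f1 -d '.')
  HOUR_SINCE=$(( ($(date +%s) - $(date -d "$LAST_MODIFIED" +%s)) / 3600 ))
  echo "Import last run $LAST_MODIFIED ($HOUR_SINCE hours ago)"
else
  # The job ran recently: print the exit code it recorded
  cat "$LOG_DIR/$ERROR_CODE_FILE"
fi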

All that is left is to plug it into Zabbix, and hope that it never fails again.