For one of my current clients we've been making some changes to a script that runs nightly via cron. Things like moving over to using bundler, not running it via script/runner to avoid loading the full Rails stack, and a couple of other things that have introduced the risk that a dependency may not load in production or be available in the context/user in which the script is run. Having been caught out once, it's time to put some monitoring on the sucker to make sure it doesn't happen again.
For cron jobs we pipe the output to a log file somewhere to help keep our inboxes sane. It also means we can then monitor the log for events, and fire off an appropriate message (IM/Jabber, email, SMS) depending on the severity. So we redirect STDERR to STDOUT, and then append that to a file:
/path/to/myscript >> /var/log/myscript 2>&1
In addition to that though, we want to know the exit code for the last time it ran, so I'll write that to a second file at the end of the command:
/path/to/myscript >> /var/log/myscript 2>&1; echo $? > /var/log/myscript-error-code
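If the one-liner gets unwieldy in the crontab, the same idea can live in a small wrapper. This is only a sketch: the function name run_and_record and its argument order are made up for illustration and are not part of the original setup. The one thing it relies on is that $? is read straight after the command, before anything else can overwrite it.

```shell
#!/bin/bash
# Hypothetical helper (the name and argument order are assumptions):
#   run_and_record LOG_FILE CODE_FILE COMMAND [ARGS...]
# Appends the command's stdout and stderr to LOG_FILE and writes its
# exit code to CODE_FILE, replacing whatever was recorded last time.
run_and_record() {
  local log_file code_file status
  log_file=$1
  code_file=$2
  shift 2

  # Send stdout to the log (append), then point stderr at the same place.
  "$@" >> "$log_file" 2>&1
  # Capture the exit code immediately; any later command would replace $?.
  status=$?

  echo "$status" > "$code_file"
  return "$status"
}
```

It would be called along the lines of run_and_record /var/log/myscript /var/log/myscript-error-code /path/to/myscript.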
We're using Zabbix to monitor services, so what I need is a means of raising an error if the exit code from the last run was not zero. That could fairly trivially be done by simply doing cat /var/log/myscript-error-code, but I know we're going to want to extend that just a little, so I'm going to put it in a shell script:
#!/bin/bash
LOG_DIR=/var/log
ERROR_CODE_FILE=myscript-error-code
cat "$LOG_DIR/$ERROR_CODE_FILE"
There are a number of servers this codebase is deployed to, but we only want it to run on one of them nightly. That also introduces the risk that the servers won't know which one should run the script, and as a result it doesn't run at all. To mitigate that, we want to make sure the file that holds the exit code is no more than about a day old:
MINS_SINCE_UPDATE=-1500
if [[ -z $(find $LOG_DIR -cmin $MINS_SINCE_UPDATE -name $ERROR_CODE_FILE) ]]; then
  echo "Error"
fi
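I've used the -cmin option because the day-based options such as -atime count in whole 24-hour periods, which is problematic when monitoring things that should occur about once a day, and I want to be confident in the difference between it running 23 hours ago and 25 hours ago. A negative value means "less than that many minutes ago", so -1500 gives the nightly run 25 hours before it gets flagged. The -z test is true when the string that follows, here the output of find, is empty.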
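The freshness test can be pulled out into a function so it is easy to try against a scratch directory. A rough sketch, with two deliberate departures from the snippet in this post: it uses -mmin (modification time) rather than -cmin, because the modification time can be set with touch when experimenting, and it adds -maxdepth 1 so find does not walk every subdirectory of the log directory. The name is_fresh is made up for the example, and GNU find is assumed.

```shell
#!/bin/bash
# Hypothetical helper (the name is an assumption):
#   is_fresh DIR NAME MAX_AGE_MINS
# Succeeds when DIR/NAME exists and was modified less than
# MAX_AGE_MINS minutes ago; fails otherwise.
is_fresh() {
  # "-mmin -N" matches files modified less than N minutes ago.
  # If nothing matches, find prints nothing and the -n test fails.
  [ -n "$(find "$1" -maxdepth 1 -name "$2" -mmin "-$3")" ]
}
```

For the nightly job that would be something like is_fresh /var/log myscript-error-code 1500 || echo "Error". A missing file also counts as not fresh, which is what we want when the job has never run.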
So that I can immediately tell whether the problem lies with cron/scheduling or with the script itself, I'm going to make the monitoring send a more meaningful error when it is a scheduling error:
LAST_MODIFIED=`stat -c %y $LOG_DIR/$ERROR_CODE_FILE | cut -f1 -d '.'`
HOUR_SINCE=$(( (`date +%s` - `date -d "$LAST_MODIFIED" +%s`) / 3600 ))
echo "Import last run $LAST_MODIFIED ($HOUR_SINCE hours ago)"
Here I'm using stat to return the date the file was last modified. I pass that into date to convert it to seconds since epoch, and then subtract it from the current seconds since epoch, so the result is positive. Divide those seconds by 3600 and you get how long in hours since that file was updated.
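As a quick check on the arithmetic: a file touched 90,000 seconds ago is 90000 / 3600 = 25 hours old, and the current time has to be the larger operand or the answer comes out negative. The sketch below does the same calculation but asks GNU stat for the epoch seconds directly with %Y, which skips the cut and date -d round trip. The function name hours_since_modified is invented for the example.

```shell
#!/bin/bash
# Hypothetical helper (the name is an assumption):
#   hours_since_modified FILE
# Prints the whole number of hours since FILE was last modified.
hours_since_modified() {
  local modified now
  # %Y is the modification time in seconds since the epoch (GNU stat).
  modified=$(stat -c %Y "$1") || return 1
  now=$(date +%s)
  # Current time minus file time, so the age is positive.
  # Shell integer division drops any fraction of an hour.
  echo $(( (now - modified) / 3600 ))
}
```

The human-readable timestamp for the message can still come from stat -c %y as in the snippet above; this only replaces the subtraction.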
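So the final script looks something like this:
#!/bin/bash
LOG_DIR=/var/log
ERROR_CODE_FILE=myscript-error-code
# Negative value: "changed less than 1500 minutes (25 hours) ago"
MINS_SINCE_UPDATE=-1500
if [[ -z $(find "$LOG_DIR" -cmin "$MINS_SINCE_UPDATE" -name "$ERROR_CODE_FILE") ]]; then
  # The exit code file is stale, so the job did not run on schedule
  echo "Error"
  LAST_MODIFIED=`stat -c %y "$LOG_DIR/$ERROR_CODE_FILE" | cut -f1 -d '.'`
  # Current time minus last modified time, in whole hours
  HOUR_SINCE=$(( (`date +%s` - `date -d "$LAST_MODIFIED" +%s`) / 3600 ))
  echo "Import last run $LAST_MODIFIED ($HOUR_SINCE hours ago)"
else
  # The job ran recently, so report its exit code
  cat "$LOG_DIR/$ERROR_CODE_FILE"
fi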
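One case the script does not cover is the exit code file not existing at all, for example on a freshly built server. find prints nothing, so the script takes the "Error" branch, but stat then fails and the "last run" message is not meaningful. A guard along these lines could run first. It is a sketch only: the function name check_code_file and the message wording are my own and not something Zabbix expects.

```shell
#!/bin/bash
# Hypothetical guard (the name and messages are assumptions):
#   check_code_file FILE
# Prints a distinct message and fails when FILE is missing or empty,
# so the stat/date arithmetic is only attempted on a usable file.
check_code_file() {
  if [ ! -e "$1" ]; then
    echo "Error: $1 not found, the job may never have run on this host"
    return 1
  fi
  # -s is true only when the file exists and is larger than zero bytes.
  if [ ! -s "$1" ]; then
    echo "Error: $1 is empty, no exit code was recorded"
    return 1
  fi
  return 0
}
```

In the script above it would sit before the find test, for example check_code_file "$LOG_DIR/$ERROR_CODE_FILE" || exit 0, so that the message itself is what the monitoring picks up.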
All that is left is to plug it into Zabbix, and hope that it never fails again.