At $workplace we have, like many other similar operations, a “farm” of MX (inbound) and MTA (outbound and intra-system) servers. By an accident of history we’ve been running Exim as our MTA of choice for a very long time – indeed, by another accident of history I’m one of Exim’s maintainers – so we’re very familiar with the logs.
Just like everyone else, our MX/MTA farm generates an enormous amount of log data. Over the years we’ve monitored the state and performance of our ever-growing platform using a variety of home-brewed shell or Perl scripts, SNMP, MRTG, RRDtool, Munin, Cacti, OpenNMS and probably several others that have been forgotten. Some of these have produced tables, some produced pretty graphs, others still produced reports which fed into our (now defunct) annual institutional “let’s bash the IT guys with some stats to show how poor their systems are” meetings.
In recent years the volume of email and the volume of logging have grown at exactly the same time as the number of staff responsible for the farm has shrunk, so we’ve become increasingly dependent on individuals just as their availability is reducing. As a result, I’ve been trying to put some things together to follow a “shift left” approach – that is, creating systems that less specialised staff can look at to see what’s going on.
It took ages to get such a system working – several of the above-mentioned pieces of software were (and still are) great, but just didn’t have the flexibility I needed, and would have taken a lot of wrangling to show the data I wanted to get out. Conveniently, I recently attended the OpenNMS Users’ Conference Europe – OUCE2014 – and mentioned my specific problem to one of the presenters: wrestling with millions of lines of log data per day.
The presenter (thanks Ken!) and several other attendees chimed, in unison, one word…