Here’s the science bit #3 – grok & friends

(It would be a good idea right now if you went and had a read of the excellent “Getting Started With Logstash” tutorial. Go past the basics, read the more advanced bits underneath).

So, we’ve got data coming in to logstash.

We’ve defined some patterns.

Now we need to start analysing it, and this is where logstash’s filters come in. Every log line contains a set of fields – not conveniently comma delimited, nor space delimited, but clearly defined and with a discrete meaning. Without putting a boatload of Exim log data in this post, there are effectively four:

  • receiving a message
  • delivering a message
  • completing a message
  • rejecting a message

There are actually potentially hundreds more, as we log a lot of surrounding information such as reputational data, but (for now) those are the ones we’re interested in.

There are lots that we’re not interested in, which are stored in the patterns referred to in the previous post (see the Github link):

# Strip out lines we don't need using the regexp in the patterns file
if [message] =~ /%{EXIM_EXCLUDE_TERMS}/ {
  drop { }

That’s pretty well self explanatory – match one of the terms we’ve defined as exclusions, this message (log line) gets dropped (although see the comments on this post about this section, because it’s non-functional).

We then move on to do some pattern matching using the “grok” filter.

We then wanted to do multiple matches on each message, but at the time of writing there’s a bug in grok which prevents this, so we’ve done a bit of tinkering using the “mutate” filter – we have five grok lines, so we clone the incoming message into 5 self-contained fields and work on those:

  # Really, really dirty hack to workaround bug in grok code
  # which won't handle multiple matches on the same field
  mutate {
    add_field => {
      "message_1" => "%{message}"
      "message_2" => "%{message}"
      "message_3" => "%{message}"
      "message_4" => "%{message}"
      "message_5" => "%{message}"

Don’t worry – we remove these at the end 🙂

Now we start to get a bit expression-heavy, so if that’s not your bag, skip to the next post…

If it is, we now pass the message through the grok filter and start to process it. This then extracts lots of useful stuff and populates fields for later use:

  grok {
    patterns_dir => "/etc/logstash/patterns/"
    break_on_match => false
    keep_empty_captures => true
    match => [
      "message_1", "(%{SYSLOGBASE} )(%{EXIM_DATE:exim_date} )(%{EXIM_PID:exim_pid} )(%{EXIM_MSGID:exim_msg_id} )(%{EXIM_FLAGS:exim_flags} )(%{GREEDYDATA})"
    match => [
      "message_2", "(%{EXIM_MSGID} )(<= )(%{NOTSPACE:env_sender} )(%{EXIM_REMOTE_HOST} )?(%{EXIM_INTERFACE} )?(%{EXIM_PROTOCOL} )?(X=%{NOTSPACE:tls_info} )?(%{EXIM_MSG_SIZE} )?(%{EXIM_HEADER_ID} )?(%{EXIM_SUBJECT})"
    match => [
      "message_3", "(%{EXIM_MSGID} )([=-]> )(%{NOTSPACE:env_rcpt} )(<%{NOTSPACE:env_rcpt_outer}> )?(R=%{NOTSPACE:exim_router} )(T=%{NOTSPACE:exim_transport} )(%{EXIM_REMOTE_HOST} )(X=%{NOTSPACE:tls_info} )?(QT=%{EXIM_QT:exim_qt})"
    match => [
      "message_4", "(%{SYSLOGBASE} )(%{EXIM_DATE:exim_date} )(%{EXIM_PID:exim_pid} )(%{EXIM_MSGID:exim_msg_id} )(Completed )(QT=%{EXIM_QT:exim_qt})"
    match => [
      "message_5", "(%{SYSLOGBASE} )(%{EXIM_DATE:exim_date} )(%{EXIM_PID:exim_pid} )(%{EXIM_MSGID:exim_msg_id} )?(%{EXIM_REMOTE_HOST} )?(%EXIM_INTERFACE} )?(F=<%{NOTSPACE:env_sender}> )?(.+(rejected after DATA|rejected \(but fed to sa-learn\)|rejected [A-Z]+ (or [A-Z]+ %{NOTSPACE}?|<%{NOTSPACE:env_rcpt}>)?): (?<exim_rej_reason>.+))"

Remember the patterns file? All of the %{EXIM_ bits there are predefined in that file. Everywhere there’s a :exim_ something, the expression match is passed into a field of that name. This gets us the good bits – addresses, hostnames, status and so on, all of which are useful later on.

After that, we start looking at the fields we’ve pulled data into and doing some logic on them:

  if [exim_flags] == "<=" {
    mutate {
      update => [ "exim_msg_state", "received" ] 
  } else if [exim_flags] == "=>" {
    mutate {
       update => [ "exim_msg_state", "delivered" ] 
  } else if [exim_flags] == "->" {
    mutate {
       update => [ "exim_msg_state", "delivered" ]

(snipped for brevity)

That little set defines a field called exim_msg_state, which is *really* useful as it tells us in which direction the message is going, what’s happening to it and so on – metrics which can be graphed. Dashboard detail, in fact, the like of which we’ve never really had in this way.

So that’s the grok bit explained, and a little bit of the subsequent logic…

But! (I hear you cry) How do we glue all this stuff together when Exim’s logs aren’t all neat and tidy and place the entire lifetime of an email message into a single line?

Well, logstash has the multiline filter, but that doesn’t apply here because of the way Exim processes things. An email message doesn’t simply arrive and get delivered, it can get queued too – meaning that the arrival line (marked ” <= ” in the log) and subsequent deliveries and completion (marked ” => “, ” -> ” and “Completed” respectively) can be literally tens or hundreds (or even millions) of lines apart, separated by some considerable time.

As well as pushing data into elasticsearch, logstash has an elasticsearch filter. This can query an ES backend with any query, but we use it to add metadata to the “delivered” messages:

  # Look back through ES table for the incoming message and
  # extract the envelope sender. Permits later stats creation
  # listing deliveries per env_sender
  if [exim_msg_state] == "delivered" {
    elasticsearch {
      query => 'exim_msg_id:"%{exim_msg_id}" AND exim_msg_state:"received"'
      fields => [ "env_sender", "env_sender", "remote_host", "remote_host", "remote_hostname", "remote_hostname" ]
      sort => "ignore_unmapped:true"
      fail_on_error => "false"
    mutate {
      remove_field => [ "query_failed" ]

NOTE HOWEVER: I have patched the elasticsearch filter so it doesn’t log an exception for a query which fails (details in my Github tree here). Without this, your logstash logs and possibly ES logs will be very, very, very, very noisy.

NOTE ALSO: The subsequent flush settings in logstash’s elasticsearch output can seriously affect the reliability of the query, of which more in a bit.

At this point, the filter is almost complete. Apart from a couple of other bits of logic, and this bit:

  # Remove the really, really dirty hack to workaround bug in grok code
  # which won't handle multiple matches on the same field
  mutate {
    remove_field => [ "message_1","message_2","message_3","message_4","message_5" ]

That removes those extra fields that we copied in earlier, thus making the message going into ES much smaller (and carrying far less extraneous data).

So that’s the filter section complete. Next, we push the data somewhere useful.


  1. Hi, in your complete config on git you do “if [exim_msg_state] == “completed” and [host_type] == “MX” {” where is host_type coming from?


    • Hi Paul

      Good catch! I’d somehow missed the following extra bit from the config:

      if [logsource] =~ /mx-/ {
      mutate {
      add_field => { “host_type” => “MX” }
      } else if [logsource] =~ /mta-/ {
      mutate {
      add_field => { “host_type” => “MTA” }

      I’ve chucked it in now, so you should be able to make this work. It’s basically setting the host_type field based on our internal naming convention for the VMs involved – if it starts “mx-” it’s an MX, and similarly for MTA. For clarity – this is simply to separate the flows traversing the inbound (MX) hosts, and the outbound/intra-site hosts (MTA).


  2. Hi Greame,

    I found that the EXIM_EXLUDE_FILTER is being ignored in the if [message] =~ /%{EXIM_EXCLUDE_FILTER}/ {
    The pattern file is not loaded before the grok stuff. Are you doing something different with this file?


    • reply on my own comment 🙂 anyway, one way of fixing this is

      Above your original grok add the following:

      23 grok {
      24 patterns_dir => “/home/pjanzen/logStashStuff/logstash-1.4.2/patterns/” #this should be your own pattern dir of course.
      25 match => [
      26 “message”, “(%{EXIM_EXCLUDE_TERMS})”
      27 ]
      28 add_tag => [“exclude_message”]
      29 }
      31 if “exclude_message” in [tags] {
      32 drop { }
      33 }

      Then you can remove the if underneath:
      filter {

      Hope this helps.

  3. Hi Paul

    Sorry it’s taken me a while to reply!

    Yes, that’s a hangover actually, from a previous configuration – it’s irrelevant and has been removed from my working config. You’re quite correct that it should in fact be inside a grok{} (or similar) block.

    I’ll leave it in the above post, though, because it serves to ensure that the reader is understanding things (as you demonstrate!)


  4. Thanks.

    1) Can you please post the full configuration file (input, filter & output) file ?
    2) Does these filter settings needs to be defined in both Shipper as well as in Shipper ?

    Thanks for your time.

    • Hi

      The complete configuration – logstash, elasticsearch mappings, dashboards – can be found on my Github pages, which are linked to in several parts of this blog.

      I’m not sure what you mean by your second question.


    • No. I’m shipping logs using rsyslogd (which is a whole different subject), then running Logstash+Elasticsearch+Kibana on the machine which receives them.

      So we have:

      MX/MTA farm -> rsyslog -> *network* -> rsyslog receiver -> logfile

      Logstash then listens to that logfile, and that’s where the filtering is applied before inserting the data into the local ES node.

      Hope that helps!

    • We don’t do ‘instant’ delivery with Exim.

      We have a listener that runs in “queue_only” mode using the global setting “queue_only = yes”; then we have queue runners spawn every 30 seconds. This allows us to control our processing load on the MX/MTA farm in as close to predictable a manner as possible – we also have a bunch of other settings in our Exim configuration to smooth out bursty loads.

      The net effect as the ELK end is that we always have *at least* one second delay between receipt, processing and delivery – it never happens in the same second, so we grab all the events properly – I guess because I have control of all the “moving parts”, I can make trade-offs in terms of delivery performance and monitoring so we get exactly what we want.

      I can’t think of another way to achieve what you’re after.

  5. thanks for clarifying
    I am experimenting now with ES input for logstash but with no luck so far
    Patching ES filter with big sleep delay causes logstash to stuck (i have no idea and knowledge why?)
    If u got another suggestions please PM me.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s