Here’s the science bit #4 – pumping the data into elasticsearch

If you’ve read this far through the background, the Eureka moment, the basics, the patterns and the stuff about grok, well done.

If you haven’t, now might be a good idea to do just that. Honestly. If you don’t, this post will make even less sense!

Anyway: now we’ve got our datastream from our consolidated Exim main logfile, we’ve processed it, added lots of fields, and we’re about to store it in an Elasticsearch instance. In this specific setup, the whole ELK stack (Elasticsearch, Logstash and Kibana, if you’ve forgotten) lives on the same server, but it’s quite plausible that for a monstrous datastream you could run this in a clustered environment. For ease of description, I’m not going there – so we assume it’s all in the same place.

The final block in the logstash configuration is ‘output’:

  output {
    elasticsearch {
      host => "localhost"
      index_type => "%{[exim_msg_state]}"
      index => "exim-%{+YYYY.MM.dd}"
      flush_size => 100
    }
  }

Nothing really fancy there. We’re talking to localhost, we’re creating indexes of the form exim-2014.05.09 (daily indexes), and we’re creating separate index types depending on the state of the message in the Exim logs (see the stuff about grok post). They all tie together, where relevant, via the exim_msg_id field, which is Exim’s internal identifier for any given message.
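
As a quick illustration (the index date and message ID below are entirely made up), everything Exim logged about a single message can be pulled back from that day’s index with a search on exim_msg_id, regardless of which index type each entry landed in:

# Hypothetical example – the date and message ID are invented.
curl -XGET 'http://localhost:9200/exim-2014.05.09/_search?pretty' -d '{
  "query" : {
    "match_phrase" : { "exim_msg_id" : "1Wih2j-0001yP-7d" }
  }
}'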

Observant types will have noticed the “flush_size” setting there, and will think “wow, that’s really low”. It is indeed, because of the need to search the index from the Logstash filter: if two entries arrive too close together we run the risk of losing metadata in the copy operation. At the moment we haven’t got a performance problem, but we may have to tune that up (or down!) in future to hit the sweet spot between performance and consistency. It’s early days for us yet.

All fairly simple so far – but made slightly more complex by Elasticsearch’s type mappings. By default, Logstash will create an index with dynamic type mappings based on the data that’s arriving – dates will get mapped as dates, some other stuff as strings, some as integers and so on.

Unfortunately, the dynamic mapping (and you need to read the docs to understand this) will tokenise the data that’s being shovelled into elasticsearch by breaking it on spaces, periods, dashes and the like. This is one of the extremely powerful features of the underpinning Lucene search system; but for us it can break things and break them badly. Hostnames, for example, get broken up into their dotted or dashed parts; searching later for aggregate terms (such as ‘top 5 sending hosts’) can return some very odd results!
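
If you want to see what the default analysis does to one of your values, Elasticsearch’s _analyze API will show you the tokens it produces (the hostname here is just an example):

# Hypothetical example – substitute one of your own hostnames.
curl -XGET 'http://localhost:9200/_analyze?analyzer=standard&pretty' -d 'mail-relay-01.example.ac.uk'

The response shows the value split into several tokens, and those tokens are what end up in the index – which is exactly what skews a “top 5 sending hosts” aggregation.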

But there’s an answer to this: you need to define a type mapping which tells Elasticsearch not to analyze the elements we need to keep together. Without going into too much detail (because the file is quite long), we push a static mapping into Elasticsearch using the RESTful interface:

curl -XPUT 'http://localhost:9200/_template/exim' -d '{
 "order" : 0,
 "template" : "exim*",
 "settings" : {
   "index.refresh_interval" : "5s"
 },
 "mappings" : {
   "delivered" : {
     "properties" : {
       "logsource" : {
         "type" : "string"
       },
       "env_sender" : {
         "type" : "string",
         "index" : "not_analyzed"
       },
       "env_rcpt" : {
         "type" : "string",
         "index" : "not_analyzed"
       },
       "env_rcpt_outer" : {
         "type" : "string",
         "index" : "not_analyzed"
       },
       "@version" : {
         "type" : "string"
       },
       "host_type" : {
         "type" : "string",
         "index" : "not_analyzed"
       },
       "remote_host" : {
         "type" : "string"
       },
       "timestamp" : {
         "type" : "string"
       },
       "exim_pid" : {
         "type" : "string"
       },
       "remote_hostname" : {
         "type" : "string",
         "index" : "not_analyzed"
       },
...

We created that by taking the dynamic mapping (downloaded using “curl -XGET …”) and editing it, then pushing that back using “curl -XPUT …” as demonstrated above.
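
The round trip looks roughly like this (the index name and filenames are only examples):

# Grab the dynamically-generated mapping from an existing daily index...
curl -XGET 'http://localhost:9200/exim-2014.05.09/_mapping?pretty' > exim-mapping.json

# ...edit it (adding "index" : "not_analyzed" to the fields that must stay whole),
# wrap it up as a template like the one above, then push it back:
curl -XPUT 'http://localhost:9200/_template/exim' -d @exim-template.json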

Then, when Logstash starts up and pushes data into Elasticsearch, the fields covered by the mapping are indexed “as is” rather than being broken apart. I’m sure there are downsides to this but I haven’t found any yet.

The full mapping can be found here (Github).


4 comments

  1. Thanks

    For a custom application log file,
    (1) How do we determine the various mappings and their properties from a log file?
    (2) How is each property name mapped to the real entry in the log file?

    Thanks again

    • Hi!

      I think I know what you’re asking… I think. You mean you have a logfile of the form:

      2014-12-29 11:11:11 SOMEFIELD SOMEOTHERFIELD “SOMETHING ELSE” 9999 999999 99.99

      If you know what the structure is, and you know what the fields mean (i.e. can assign them meaningful names) then you can work up a set of patterns in Logstash and map the patterns onto appropriate fields. Then, once you know what fields you have, you can create your own mapping (like I did) and import it into Elasticsearch using the PUT method.
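
      Just as a sketch (every field name below is invented – use whatever your columns actually mean), a grok stanza for that example line might look something like:

        filter {
          grok {
            # Hypothetical field names for the sample log line above.
            match => [ "message", "%{TIMESTAMP_ISO8601:timestamp} %{NOTSPACE:field_one} %{NOTSPACE:field_two} \"%{DATA:quoted_field}\" %{INT:count_one} %{INT:count_two} %{NUMBER:ratio}" ]
          }
        }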

      I suggest you go back to the beginning of this set of posts and read through sequentially, because what you’re asking is almost exactly what I went through.

      Hope that helps!

      Graeme

  2. Hello @greem!

    I would like to thank you for such a tutorial. I am pushing logs from remote machines using logstash-forwarder.
    I already have one output called elasticsearch.

    How would I tell those grok rules to push to, for example, an elasticexim output?

    I think I managed to get exim + logstash working using all the modifications you have mentioned in earlier posts.
    Just can’t get those logs to appear in Kibana.

    Thank you very much!

  3. Hello, and thanks for this tutorial. I’m currently experiencing the very field mapping explosion you’ve avoided here. Question: given the time since you’ve written this, has this continued to work well for you over time, and is there a method you’d now prefer over this?

    I’m getting ready to implement this and I appreciate the insights you’ve provided here.

