Google Drive – data sync with gsync

In my last post (some time ago now) I wrote about using the lovely Exiftool to import photos, image files & videos into a date-structured store from digital cameras.

Over the years with a digital camera I’ve never got anywhere near a professional standard with proper workflow tools which would allow me to sift out the good from the bad (and dispose of the latter). This has left me with approaching 500GB of image data from all manner of places & events: weddings, parties, trips abroad, my family growing up, aircraft, racing cars and, in more recent times, a growing set of attempts to be arty or “wildlifey”.

That’s a lot of data. And it’s growing – every photo I take ends up on there. Last time I took my camera out (to an air show) I came back with almost 3000 shots!

It’s all currently sitting on a venerable (old!) D-Link DNS323 two-drive NAS, which is now running the rather splendid Alt-F Alternative Firmware instead of the vendor’s firmware.

A few weeks ago, despite taking regular backups to an external disk, my wife and I realised that we’d really quite like to be able to back this up off-site. When I checked the SMART power-on time data of the drives it was somewhere in the region of 54000 hours each. Six and a bit years. Ouch! Can anyone smell an MTBF period approaching?

So… I could have taken my NAS to work, where we have $SILLY_GIGABITS_PER_SEC internet connectivity, but I’m loath to power it down and move it given the age and runtime of the drives. Thankfully my ISP (hi Virgin Media, I’m looking at you here) has some fairly preposterous packages available with reasonably decent upload speed, subject to usage caps – which meant that, despite the prospect of it taking days, an upload to a cloud storage provider from home was feasible.

Google’s storage pricing policy changed recently (although everyone else is playing race-to-the-bottom with them) to provide 1TB of storage for a relatively reasonable USD 9.99 per month. That brought a large amount of storage down into “accessible” territory, so I decided (having been a Google Apps for Domains user for years, since it was free) to shell out for it after doing some testing.

But… How on earth to sync data from a NAS without having an intermediary client to sync it?

My tool of choice at work for that would be rsync – I’ve shifted giant quantities of data from place to place with that but it has no native support for cloud “filesystems” (which actually aren’t filesystems, they’re almost all massive object stores with a filesystem abstraction at the client end). A bit of searching around kept pointing me back to one place, a tool called…

gsync

(there are two, actually. This one is written in Python and seems reasonably well maintained)

On a CentOS box I ran

[graeme@centos ~]$ sudo yum -y install python-pip

…and then, because it’s not just available on github but also in the Python Package Index

[graeme@centos ~]$ sudo pip install gsync

Voila! A working version – or so I thought. I did a quick test, which failed, and discovered I might need a few patches to make it all work properly. Thankfully, github being what it is, they were mainly in the issues list for the project so I could pick them out and make the app work as I expected. I’ll put some more details in here if anyone asks about them.

Having done some tests with small datasets limited to one, tens or hundreds of files, it transpired that one thing I was missing – badly – was “magic” to do with MIME types (and yes, it is called “magic” – look in /usr/share/magic or similar on a Linux system and see this Wikipedia entry). Files kept being uploaded to Google Drive over and over again, so I had to use the --debug switch to find out why, and largely found lines telling me that the local MIME type – often “application/octet-stream” – didn’t match the type that Google had recorded at the far end – say “image/x-panasonic-rw2”. If they didn’t match, the whole file got copied over again.

Digital cameras produce a plethora of different media types – JPEG, a bazillion different variations of RAW, and even more variations of video files. MPEG4, AVI, Quicktime MOV, MPEG2TS, AVCHD… loads of them, some of which are actually containers with streams of other formats inside. What a pain.

All things being equal, one of the things you get with a commercial OS such as Windows or MacOS X is the magic data already included and regularly updated, both by the OS vendor and application software vendors. The Linux world, of course, is a little different in that most distributions come with a file – /etc/magic – to which you can add your own definitions. There are many out there, which a judicious bit of searching will turn up for you. I found a few collections for digital media formats so added them to my /etc/magic, and the problem was solved.
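
To give a flavour, an /etc/magic entry for Panasonic RW2 raw files might look something like the sketch below. Treat it as illustrative rather than gospel: the magic bytes and the !:mime annotation are from memory of the published format notes, so check them against your own files (a quick xxd on a sample file will confirm the header) before relying on them.

# Panasonic RW2 raw image (illustrative entry - verify the magic bytes locally)
0	string	IIU\0		Panasonic RAW v2 (RW2) image data
!:mime	image/x-panasonic-rw2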

Then it was time for a first sync.

It took over a week!

But it worked, with a few wrinkles:

  • it doesn’t catch 500 Internal Server Errors from Google very gracefully – rather than retrying, it quits
  • predictably it can consume a lot of RAM
  • I had to stop and add a few more MIME types

So the next question was: would it run on my DNS323?

And the answer…

YES!

A huge, great, big fat hairy yes.

I had to do some more tweaking with the Alt-F Package list, installing Python and using Python to install pip so I could get all the dependencies installed (like the third-party magic.py), adding stuff to /etc/magic and making a small final tweak to the underlying gsync code to look at that file (because it wasn’t).

It’s now running – albeit at about half the speed of the 1.6GHz CentOS box it was running on, but given that the device has only 64MB RAM and a nominal 500MHz Marvell Feroceon ARM CPU (reporting 332 BogoMIPS at boot) I’m very pleasantly surprised that it even starts up. I’ve had to wrap it in a script which basically says “foreach item in a list of directories in a predictable directory structure; do gsync blah…; done” rather than “gsync $massive_heap_o’data”, to keep the memory usage down (and make the logging easier to digest) – but running it is.
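
For the curious, the wrapper amounts to something like the sketch below. This is illustrative only: the paths, the drive:// destination and the exact gsync options are assumptions, not a copy of my real script.

#!/bin/sh
# Sync one directory at a time to keep gsync's memory footprint (and the
# logs) manageable on a 64MB NAS. All paths and options are illustrative.
SRC=/mnt/imagedir
DEST=drive://Photos

for dir in "$SRC"/*/; do
    name=$(basename "$dir")
    echo "=== Syncing $name ==="
    # gsync tends to quit on a Google-side 500, so retry a few times
    for attempt in 1 2 3; do
        gsync -r -v "$dir" "$DEST/$name" && break
        echo "gsync failed (attempt $attempt), sleeping before retry" >&2
        sleep 60
    done
done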

It’ll be interesting to see how long it takes before it finishes. Or crashes. Either way I’ll come back and update this post in a day or two.


Automating image imports with exiftool

For years I’ve been using Shotwell as my image manager of choice, but in the last 18 months or so it’s become increasingly difficult (nigh on impossible) to compile on non-bleeding-edge stable Linux releases such as CentOS 6 (which happens to be my choice of desktop).

I tried, and failed, to get along with Digikam, gThumb, Darktable and a host of others – many of them unmaintained. They were almost all single-user, except for the web-based DAM solutions, which were massively overweight for my use case. For completeness, the primary use case was “allow me and my wife to dump photos from SD cards/phones/tablets etc to somewhere safe, indexed by date, and backed up”.

I’d already created a share on my home NAS into which we could both dump images, but a side-effect of long term use of Shotwell was that the images were copied over to a YYYY/mm/dd directory structure and with 500GB+ I did not want to have to reindex that if I could help it.

Google didn’t seem to want to help me, until I started searching for terms like “import photos based on EXIF data”, and then…

I found a nugget of beautiful information: ExifTool (written in Perl) could do the hard bit, importing, for me!

So to save you the trouble, here’s the code:

#!/bin/sh
# Import new files from /mnt/photodrop into a YYYY/mm/dd tree under
# /mnt/imagedir, using DateTimeOriginal where present and falling back
# to FileModifyDate (the later exiftool assignment takes precedence).

/usr/bin/exiftool -r -P -v0 -if '$filename !~ /(^\.)/' \
   "-Directory<FileModifyDate" \
   "-Directory<DateTimeOriginal" \
   -o no-date-available/ \
   -d "/mnt/imagedir/%Y/%m/%d" \
   "/mnt/photodrop"

Essentially, it boils down to:

  • ignore files beginning with a period
  • select target directory based on EXIF date, or date modified depending which is available
  • put them in a YYYY/mm/dd directory structure

And that’s it. Perfect. Dead simple, agnostic of editor, properly multi-user and indexed by date.
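
If you want the whole thing hands-off, the script sits quite happily in cron – something along these lines (the script path and schedule here are hypothetical):

# Run the importer at five past every hour and keep a log of what it did
5 * * * * /usr/local/bin/photo-import.sh >/var/log/photo-import.log 2>&1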

And finally… visualisation

Or, of course, visualization if you’re from that side of the pond 🙂

We’ve now got millions of lines of data pumping from a farm of servers across our network into a central logging server; we’re reading that data with logstash’s “file” input, we’re grokking it, applying logic, filtering, pulling data back out of Elasticsearch (if it exists) and finally dropping our grokked/filtered/analysed datastream into Elasticsearch.

Great.

Or… well, now we need the final piece of the puzzle. We could use any number of analytics tools, but those lovely Elasticsearch people have provided one already: Kibana.

After using the “default” dashboard, it was a reasonably simple step to producing a custom one. Essentially the steps are:

  • define some searches
  • use those searches in the available “panels”
  • draw pretty graphs and tables
  • profit! (Well, not strictly true, this is all in house)

We built up a simple dashboard, then added elements to it, then exported the dashboard to a local file which meant we could have a look inside it. As you may be aware, an email message can go to one or more people, so there are two distinct ways to view the data; to avoid visual overload, we decided to build two separate primary dashboards:

  • Dashboard aggregating messages
  • Dashboard aggregating recipients

We took the exported version, edited it, and saved it to two separate files in the Kibana dashboards directory (so we’re not relying on Elasticsearch to store them for us – pragmatism begins with backups).
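
If you’d rather script that than click through the UI, it’s worth knowing that Kibana 3 keeps dashboards saved from the interface in the kibana-int index in Elasticsearch, so something like this pulls one out (the dashboard name and paths are hypothetical):

# Fetch a saved dashboard from the kibana-int index and unwrap its
# "dashboard" field (a JSON string) into a file Kibana can load from
# its app/dashboards directory.
curl -s 'http://localhost:9200/kibana-int/dashboard/Exim-messages' \
  | python -c 'import json,sys; print json.load(sys.stdin)["_source"]["dashboard"]' \
  > /path/to/kibana/app/dashboards/exim-messages.json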

Here’s the finished product – I’m writing this on a Saturday morning so things are pretty quiet, but it gives an idea of what’s being collected. Also worth mentioning that I’ve blurred the addresses here, because, well, you know. Privacy. You don’t need to know them!

[Screenshot: the finished Kibana dashboard, with the email addresses blurred]

And there we are!

[The above dashboard is available here, on Github]

Data visualisation, summarised, from the generation of the data to the web dashboards. I hope that helps someone else in the same state!

Here’s the science bit #4 – pumping the data into elasticsearch

If you’ve read this far through the background, the Eureka moment, the basics, the patterns and the stuff about grok, well done.

If you haven’t, now might be a good idea to do just that. Honestly. If you don’t, this post will make even less sense!

Anyway: now we’ve got our datastream from our consolidated Exim main logfile, we’ve processed it, added lots of fields, and we’re about to store it in an Elasticsearch instance. In this specific setup the whole ELK stack (Elasticsearch, Logstash and Kibana, if you’ve forgotten) lives on the same server, but it’s quite plausible that for a monstrous datastream you could run this in a clustered environment. For ease of description I’m not going there – so we assume it’s all in the same place.

The final block in the logstash configuration is ‘output’:

  output {
    elasticsearch {
      host => localhost
      index_type => "%{[exim_msg_state]}"
      index => "exim-%{+YYYY.MM.dd}"
      flush_size => 100
    }
  }

Nothing really fancy there. We’re talking to localhost, we’re creating indexes of the form exim-2014.05.09 (daily indexes), and we’re creating separate index types dependent on the state of the message in the Exim logs (see the stuff about grok post). They all tie together, where relevant, via the exim_msg_id field which is Exim’s internal identifier for any given message.

Observant types will have noticed the “flush_size” setting there, and will think “wow, that’s really low”. It is indeed – because of the need to search the index from within the logstash filter. If a delivery line arrives before its corresponding arrival line has been flushed into the index, the lookup fails and we lose that metadata in the copy operation. At the moment we haven’t got a performance problem, but we may have to tune that up (or down!) in future to hit the sweet spot between performance and consistency. It’s early days for us yet.

All fairly simple so far – but made slightly more complex by Elasticsearch’s type mappings. By default, Logstash will create an index with dynamic type mappings based on the data that’s arriving – dates will get mapped as dates, some other stuff as strings, some as integers and so on.

Unfortunately, the dynamic mapping (and you need to read the docs to understand this) will tokenise the data that’s being shovelled into elasticsearch by breaking it on spaces, periods, dashes and the like. This is one of the extremely powerful features of the underpinning Lucene search system; but for us it can break things and break them badly. Hostnames, for example, get broken up into their dotted or dashed parts; searching later for aggregate terms (such as ‘top 5 sending hosts’) can return some very odd results!
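
A quick way to see what the default analysis will do to a given value is the _analyze API (assuming an Elasticsearch 1.x instance on localhost; the hostname below is just an example):

# Show how the standard analyzer tokenises a value before trusting
# aggregations on that field - hostnames and addresses are the usual victims.
curl -s 'http://localhost:9200/_analyze?analyzer=standard&pretty' -d 'mx-01.mail.example.org'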

But there’s an answer to this – without going into detail, you need to define a type mapping which tells ES to not analyze the elements we need to keep together. Again, without going into too much detail (because the file is quite long) we push a static mapping into Elasticsearch using the RESTful interface:

curl -XPUT 'http://localhost:9200/_template/exim' -d '{
 "order" : 0,
 "template" : "exim*",
 "settings" : {
   "index.refresh_interval" : "5s"
 },
 "mappings" : {
   "delivered" : {
     "properties" : {
       "logsource" : {
         "type" : "string"
       },
       "env_sender" : {
         "type" : "string",
         "index" : "not_analyzed"
       },
       "env_rcpt" : {
         "type" : "string",
         "index" : "not_analyzed"
       },
       "env_rcpt_outer" : {
         "type" : "string",
         "index" : "not_analyzed"
       },
       "@version" : {
         "type" : "string"
       },
       "host_type" : {
         "type" : "string",
         "index" : "not_analyzed"
       },
       "remote_host" : {
         "type" : "string"
       },
       "timestamp" : {
         "type" : "string"
       },
       "exim_pid" : {
         "type" : "string"
       },
       "remote_hostname" : {
         "type" : "string",
         "index" : "not_analyzed"
       },
...

We created that by taking the dynamic mapping (downloaded using “curl -XGET …”) and editing it, then pushing that back using “curl -XPUT …” as demonstrated above.
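
For reference, the “download” step is nothing more exotic than the mapping API run against one of the live daily indexes (index name as per the example above):

# Pull the dynamically-generated mapping for a daily index, pretty-printed,
# ready to be edited down into the static template shown above.
curl -s -XGET 'http://localhost:9200/exim-2014.05.09/_mapping?pretty' > exim-mapping.json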

Then, when logstash starts up and pushes data into Elasticsearch, those fields are no longer broken apart – they’re left “as is”. I’m sure there are downsides to this but I haven’t found any yet.

The full mapping can be found here (Github).

Here’s the science bit #3 – grok & friends

(It would be a good idea right now if you went and had a read of the excellent “Getting Started With Logstash” tutorial. Go past the basics, read the more advanced bits underneath).

So, we’ve got data coming in to logstash.

We’ve defined some patterns.

Now we need to start analysing it, and this is where logstash’s filters come in. Every log line contains a set of fields – not conveniently comma delimited, nor space delimited, but clearly defined and each with a discrete meaning. Without putting a boatload of Exim log data in this post, there are effectively four kinds of line we care about:

  • receiving a message
  • delivering a message
  • completing a message
  • rejecting a message

There are actually potentially hundreds more, as we log a lot of surrounding information such as reputational data, but (for now) those are the ones we’re interested in.
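
To give a flavour, here are some invented and heavily simplified examples of each (syslog prefix left off, addresses and hosts made up – real lines carry rather more fields):

2014-05-09 10:15:21 1Wiazq-0004Tq-HG <= alice@example.com H=mail.example.com [192.0.2.10] P=esmtp S=4523
2014-05-09 10:15:22 1Wiazq-0004Tq-HG => bob@example.org R=dnslookup T=remote_smtp H=mx.example.org [198.51.100.2]
2014-05-09 10:15:22 1Wiazq-0004Tq-HG Completed
2014-05-09 10:16:40 H=spammer.example.net [203.0.113.5] F=<junk@example.net> rejected RCPT <bob@example.org>: relay not permitted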

There are lots that we’re not interested in, which are stored in the patterns referred to in the previous post (see the Github link):

# Strip out lines we don't need using the regexp in the patterns file
if [message] =~ /%{EXIM_EXCLUDE_TERMS}/ {
  drop { }
}

That’s pretty well self-explanatory – if a line matches one of the terms we’ve defined as exclusions, that message (log line) gets dropped (although see the comments on this post about this section, because as written it’s non-functional).

We then move on to do some pattern matching using the “grok” filter.

We then wanted to do multiple matches on each message, but at the time of writing there’s a bug in grok which prevents this, so we’ve done a bit of tinkering using the “mutate” filter – we have five grok lines, so we clone the incoming message into five self-contained fields and work on those:

  # Really, really dirty hack to workaround bug in grok code
  # which won't handle multiple matches on the same field
  mutate {
    add_field => {
      "message_1" => "%{message}"
      "message_2" => "%{message}"
      "message_3" => "%{message}"
      "message_4" => "%{message}"
      "message_5" => "%{message}"
    }
  }

Don’t worry – we remove these at the end 🙂

Now we start to get a bit expression-heavy, so if that’s not your bag, skip to the next post…

If it is, we now pass the message through the grok filter and start to process it. This then extracts lots of useful stuff and populates fields for later use:

  grok {
    patterns_dir => "/etc/logstash/patterns/"
    break_on_match => false
    keep_empty_captures => true
    match => [
      "message_1", "(%{SYSLOGBASE} )(%{EXIM_DATE:exim_date} )(%{EXIM_PID:exim_pid} )(%{EXIM_MSGID:exim_msg_id} )(%{EXIM_FLAGS:exim_flags} )(%{GREEDYDATA})"
    ]
    match => [
      "message_2", "(%{EXIM_MSGID} )(<= )(%{NOTSPACE:env_sender} )(%{EXIM_REMOTE_HOST} )?(%{EXIM_INTERFACE} )?(%{EXIM_PROTOCOL} )?(X=%{NOTSPACE:tls_info} )?(%{EXIM_MSG_SIZE} )?(%{EXIM_HEADER_ID} )?(%{EXIM_SUBJECT})"
    ]
    match => [
      "message_3", "(%{EXIM_MSGID} )([=-]> )(%{NOTSPACE:env_rcpt} )(<%{NOTSPACE:env_rcpt_outer}> )?(R=%{NOTSPACE:exim_router} )(T=%{NOTSPACE:exim_transport} )(%{EXIM_REMOTE_HOST} )(X=%{NOTSPACE:tls_info} )?(QT=%{EXIM_QT:exim_qt})"
    ]
    match => [
      "message_4", "(%{SYSLOGBASE} )(%{EXIM_DATE:exim_date} )(%{EXIM_PID:exim_pid} )(%{EXIM_MSGID:exim_msg_id} )(Completed )(QT=%{EXIM_QT:exim_qt})"
    ]
    match => [
      "message_5", "(%{SYSLOGBASE} )(%{EXIM_DATE:exim_date} )(%{EXIM_PID:exim_pid} )(%{EXIM_MSGID:exim_msg_id} )?(%{EXIM_REMOTE_HOST} )?(%{EXIM_INTERFACE} )?(F=<%{NOTSPACE:env_sender}> )?(.+(rejected after DATA|rejected \(but fed to sa-learn\)|rejected [A-Z]+ (or [A-Z]+ %{NOTSPACE}?|<%{NOTSPACE:env_rcpt}>)?): (?<exim_rej_reason>.+))"
    ]
  }

Remember the patterns file? All of the %{EXIM_ bits there are predefined in that file. Everywhere there’s a :exim_ something, the expression match is passed into a field of that name. This gets us the good bits – addresses, hostnames, status and so on, all of which are useful later on.

After that, we start looking at the fields we’ve pulled data into and doing some logic on them:

  if [exim_flags] == "<=" {
    mutate {
      update => [ "exim_msg_state", "received" ] 
    }
  } else if [exim_flags] == "=>" {
    mutate {
       update => [ "exim_msg_state", "delivered" ] 
    }
  } else if [exim_flags] == "->" {
    mutate {
       update => [ "exim_msg_state", "delivered" ]
    }
  ...

(snipped for brevity)

That little set defines a field called exim_msg_state, which is *really* useful as it tells us which direction the message is going in, what’s happening to it and so on – metrics which can be graphed. Dashboard detail, in fact, the like of which we’ve never really had before.

So that’s the grok bit explained, and a little bit of the subsequent logic…

But! (I hear you cry) How do we glue all this stuff together when Exim’s logs aren’t all neat and tidy and place the entire lifetime of an email message into a single line?

Well, logstash has the multiline filter, but that doesn’t apply here because of the way Exim processes things. An email message doesn’t simply arrive and get delivered, it can get queued too – meaning that the arrival line (marked ” <= ” in the log) and subsequent deliveries and completion (marked ” => “, ” -> ” and “Completed” respectively) can be literally tens or hundreds (or even millions) of lines apart, separated by some considerable time.

As well as an elasticsearch output, logstash has an elasticsearch filter. This can run any query against an ES backend, but we use it to add metadata to the “delivered” messages:

  # Look back through ES table for the incoming message and
  # extract the envelope sender. Permits later stats creation
  # listing deliveries per env_sender
  if [exim_msg_state] == "delivered" {
    elasticsearch {
      query => 'exim_msg_id:"%{exim_msg_id}" AND exim_msg_state:"received"'
      fields => [ "env_sender", "env_sender", "remote_host", "remote_host", "remote_hostname", "remote_hostname" ]
      sort => "ignore_unmapped:true"
      fail_on_error => "false"
    }
    mutate {
      remove_field => [ "query_failed" ]
    }
  }

NOTE HOWEVER: I have patched the elasticsearch filter so it doesn’t log an exception for a query which fails (details in my Github tree here). Without this, your logstash logs and possibly ES logs will be very, very, very, very noisy.

NOTE ALSO: The subsequent flush settings in logstash’s elasticsearch output can seriously affect the reliability of the query, of which more in a bit.

At this point, the filter is almost complete. Apart from a couple of other bits of logic, and this bit:

  # Remove the really, really dirty hack to workaround bug in grok code
  # which won't handle multiple matches on the same field
  mutate {
    remove_field => [ "message_1","message_2","message_3","message_4","message_5" ]
  }

That removes those extra fields that we copied in earlier, thus making the message going into ES much smaller (and carrying far less extraneous data).

So that’s the filter section complete. Next, we push the data somewhere useful.

Here’s the science bit #2 – consistency & patterns

Essentially, you can think of logstash as a pipeline. A message arrives (input), is processed (filter), and is output to somewhere else (output).

Within the filter part of the pipeline we can match data using expressions, fiddle with it, do logic on it, replace bits of it if necessary, ignore it if it’s not what we need and much more besides.

What we had to do was:

  • take each line from the file – every line is a discrete “message” (input)
  • make sure we’re interested in them (filter)
  • match the meanings (filter)
  • extract meaningful data from them (filter)
  • do some clever looking back in the existing data (filter)
  • pump them into the elasticsearch backend (output)

Firstly, it’s absolutely essential to ensure your data is predictable and conforms to a known format. Luckily Exim’s log format is very well defined, and we made sure all of the servers in the farm were using the same format to avoid any difficulty. For us that means adding the following to our Exim configuration:

## Logging
 syslog_facility = local5
 log_file_path = :syslog
 log_selector = +incoming_interface +subject +pid +queue_time +queue_time_overall

For those not familiar with Exim, that says “log to normal files but also to syslog with facility local5, and add these extra bits of metadata to the log lines”. The local rsyslog server is configured to simply push local5.* to the central logging server I’ve already mentioned, where the rsyslog receiver puts the data into the relevant farm-wide files. See the previous posts for why it doesn’t simply get stuffed into logstash directly.
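
On each member of the farm, the rsyslog side of that boils down to a one-liner along these lines (the collector hostname, port and use of TCP are placeholders for whatever your setup uses):

# /etc/rsyslog.d/30-exim-forward.conf - ship everything Exim logs via
# facility local5 to the central collector (@@ means TCP, @ means UDP)
local5.*    @@central-syslog.example.org:514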

Now, how do we do all that stuff in the logstash filter section to make this mass of data usable?

Patterns.

logstash’s grok filter uses a regular expression engine which allows the definition – and subsequent reuse – of named patterns and “fields” inside regex patterns. After a bit of work, we conjured up a set of predefined patterns specific to Exim which also make use of logstash’s own predefined patterns. As an example:

EXIM_MSGID [0-9A-Za-z]{6}-[0-9A-Za-z]{6}-[0-9A-Za-z]{2}
EXIM_FLAGS (<=|[-=>*]>|[*]{2}|==)
EXIM_REMOTE_HOST (H=(%{NOTSPACE:remote_hostname} )?(\(%{NOTSPACE:remote_heloname}\) )?\[%{IP:remote_host}\])
EXIM_SUBJECT (T=%{QS:exim_subject})

The full list of these can be found here (Github). It may grow.

The next question: how do these get applied?

Here’s the science bit #1 – first steps

Now things get a little more complicated.

We have a farm of machines, split across two different roles – primarily MX (inbound email from the outside world), and MTA (passing email back to the outside world, but also between various systems within the organisation).

We’ve been using the venerable syslog protocol for years to aggregate Exim‘s logs from across the farm onto a central syslog platform. This is a stable system, and I didn’t want to change the configuration of rsyslog on the collector to add in logstash.

Thankfully, logstash has a number of input modules so I could simply hook it into the existing files by defining an input as follows:

input {
  file {
    path => "/path/to/exim/mainlog"
    start_position => 'end'
    sincedb_path => "/dev/null"
  }
}

Pretty simple.

At first I ran it with a simple “take everything and pretty print it to stdout” output block:

output {
  stdout {
    codec => rubydebug
  }
}

Also pretty simple. Data started flying past – bear in mind we’re talking between 5 and 10 million log lines a day on average here, so parking this on the raw input data returned a stream of beautifully formatted JSON objects that were far, far too fast to read. But that wasn’t the point – I had it reading data.

Now all I needed to do was pre-analyse the data to generate relevant fields/tags for Elasticsearch/Kibana to work with. Well, I say “all”…