Google Drive – data sync with gsync

In my last post (some time ago now) I wrote about using the lovely ExifTool to import photos, image files & videos from digital cameras into a date-structured store.

Over the years with a digital camera I’ve never got anywhere near a professional standard with proper workflow tools which would allow me to sift out the good from the bad (and dispose of the latter). This has left me with approaching 500GB of image data from all manner of places & events: weddings, parties, trips abroad, my family growing up, aircraft, racing cars and, in more recent times, a growing set of attempts to be arty or “wildlifey”.

That’s a lot of data. And it’s growing – every photo I take ends up on there. Last time I took my camera out (to an air show) I came back with almost 3000 shots!

It’s all currently sitting on a venerable (old!) D-Link DNS323 two-drive NAS, which is now running the rather splendid Alt-F Alternative Firmware instead of the vendor’s firmware.

A few weeks ago, despite taking regular backups to an external disk, my wife and I realised that we’d really quite like to be able to back this up off-site. When I checked the SMART power-on time data of the drives it was somewhere in the region of 54000 hours each. Six and a bit years. Ouch! Can anyone smell an MTBF period approaching?
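
For reference, that figure comes straight out of the drives’ SMART attributes; something like this (the device name is purely illustrative) will show it on most Linux boxes:

smartctl -A /dev/sda | grep -i Power_On_Hours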

So… I could have taken my NAS to work where we have $SILLY_GIGABITS_PER_SEC internet connectivity, but I’m loath to power it down and move it due to the age and runtime of the drives. Thankfully my ISP (hi Virgin Media, I’m looking at you here) has some fairly preposterous packages available with reasonably decent upload speed, subject to usage caps – which meant that despite the prospect of it taking days, it was possible to upload to a cloud storage provider from home.
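
To put rough numbers on that (the upstream rate here is purely illustrative): 500GB is about 4,000,000 megabits, so at a sustained 10Mbit/s upload that’s 400,000 seconds of raw transfer – call it four and a half days – before overheads, retries and those usage caps get involved.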

Google’s storage pricing changed recently (although everyone else is playing race-to-the-bottom with them) to provide 1TB of storage for a relatively reasonable USD9.99 per month. That brought a large amount of storage down into “accessible” land, so I decided (having been a Google Apps for Domains user for years, since it was free) to shell out for that after doing some testing.

But… How on earth to sync data from a NAS without having an intermediary client to sync it?

My tool of choice at work for that would be rsync – I’ve shifted giant quantities of data from place to place with that but it has no native support for cloud “filesystems” (which actually aren’t filesystems, they’re almost all massive object stores with a filesystem abstraction at the client end). A bit of searching around kept pointing me back to one place, a tool called…

gsync

(there are actually two tools by that name; this one is written in Python and seems reasonably well maintained)

On a CentOS box I ran

[graeme@centos ~]$ sudo yum -y install python-pip

…and then, because it’s not just available on github but also in the Python Package Index

[graeme@centos ~]$ sudo pip install gsync

Voila! A working version – or so I thought. I did a quick test, which failed, and discovered I might need a few patches to make it all work properly. Thankfully, github being what it is, they were mainly in the issues list for the project so I could pick them out and make the app work as I expected. I’ll put some more details in here if anyone asks about them.

Having done some tests with small datasets limited to one, tens or hundreds of files, it transpired that one thing I was missing – badly – was “magic” data for MIME types (and yes, it really is called “magic” – look in /usr/share/magic or similar on a Linux system, or see the Wikipedia entry on magic numbers). Files kept being uploaded to Google Drive over and over again, so I used the --debug switch to find out why, and largely found lines telling me that the local MIME type – often “application/octet-stream” – didn’t match the type Google had assigned at the far end – say “image/x-panasonic-rw2”. If they didn’t match, the whole file got copied over again.
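
You can see the local half of that mismatch from the shell – without a suitable magic entry, file(1) falls back to the generic type (the filename here is illustrative):

[graeme@centos ~]$ file --mime-type P1000123.RW2
P1000123.RW2: application/octet-stream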

Digital cameras produce a plethora of different media types – JPEG, a bazillion different variations of RAW, and even more variations of video files. MPEG4, AVI, Quicktime MOV, MPEG2TS, AVCHD… loads of them, some of which are actually containers with streams of other formats inside. What a pain.

All things being equal, one of the things you get with a commercial OS such as Windows or MacOS X is the magic data already included and regularly updated, both by the OS vendor and application software vendors. The Linux world, of course, is a little different in that most distributions come with a file – /etc/magic – to which you can add your own definitions. There are many out there, which a judicious bit of searching will turn up for you. I found a few collections for digital media formats so added them to my /etc/magic, and the problem was solved.
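
As a hedged example, an entry for Panasonic’s RW2 format might look something like the following – the byte pattern and the MIME line are my reconstruction, so check them against a published magic collection before trusting them:

# Panasonic RW2 raw – pattern below is an assumption, verify before use
0	string	IIU\0	Panasonic RW2 raw image data
!:mime	image/x-panasonic-rw2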

Then it was time for a first sync.

It took over a week!

But it worked, with a few wrinkles:

  • it doesn’t catch 500 Internal Server Errors from Google very gracefully – rather than retrying, it quits (a retry wrapper like the one sketched after this list works around it)
  • predictably it can consume a lot of RAM
  • I had to stop and add a few more MIME types
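
For the 500s, a dumb retry loop around the whole invocation is enough to keep an overnight run moving. This is only a sketch – the gsync arguments, paths and retry policy are illustrative assumptions, not my exact setup:

#!/bin/sh
# Re-run gsync when it quits on a transient server error.
# Source, destination and options here are illustrative.
tries=0
until gsync -r /mnt/photos/2014 "drive://Photos/2014"; do
    tries=$((tries + 1))
    if [ "$tries" -ge 5 ]; then
        echo "giving up after $tries attempts" >&2
        exit 1
    fi
    echo "gsync exited non-zero; retrying in 60 seconds"
    sleep 60
done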

So the next question was: would it run on my DNS323?

And the answer…

YES!

A huge, great, big fat hairy yes.

I had to do some more tweaking with the Alt-F Package list, installing Python and using Python to install pip so I could get all the dependencies installed (like the third-party magic.py), adding stuff to /etc/magic and making a small final tweak to the underlying gsync code to look at that file (because it wasn’t).
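
In rough outline the bootstrap looked like this – a sketch from memory rather than a transcript, and the package names (particularly python-magic for the third-party magic module) are assumptions:

# fetch and run the pip bootstrapper, then pull in gsync and friends
wget https://bootstrap.pypa.io/get-pip.py
python get-pip.py
pip install gsync python-magic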

It’s now running – albeit at about half the speed of the 1.6GHz CentOS box it was running on, but given that the device has only 64MB RAM and a nominal 500MHz Marvell Feroceon ARM CPU (reporting 332 BogoMIPS at boot) I’m very pleasantly surprised that it even starts up. I’ve had to wrap it in a script which basically says “foreach item in a list of directories in a predictable directory structure; do gsync blah…; done” rather than “gsync $massive_heap_o’data”, both to keep the memory usage down and to make the logging easier to digest – but running it is.
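
In sketch form, with a directory layout and options that are assumptions rather than my exact script:

#!/bin/sh
# Sync one top-level directory at a time: keeps gsync's memory
# footprint down and gives each run its own digestible log.
for dir in /mnt/photos/*/; do
    name=$(basename "$dir")
    gsync -r "$dir" "drive://Photos/$name" >> "/var/log/gsync-$name.log" 2>&1
done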

It’ll be interesting to see how long it takes before it finishes. Or crashes. Either way I’ll come back and update this post in a day or two.


9 comments

  1. OK, first problem: Python’s magic module isn’t, because it’s missing the binary blob (magic.so) to tie it to libmagic. And compiling it with the right toolchain on an x64 box is painful, to say the least. End result is a number of files have copied over with invalid/incorrect MIME types and don’t subsequently display correctly in Drive, but that’s almost nitpicking – because they are up there, but (essentially) tagged incorrectly. Small problem, really.

    • As long as you can get Python working (which is trivial, and likely to be installed anyway) you should be able to download Gsync and get on with it.

  2. Yay! I have just found this, and I’m glad that someone has posted about using GSync on CentOS rather than Ubuntu, Debian or some other Debian-based flavour.

    Thank you for sharing this.

    Anyway, I notice you said you had some issues with your first sync and you would post up what they were and your workarounds if anyone asked… Here’s me asking! I have CentOS 6.6 (x64, of course).

    Many thanks,
    Simon

    • Ah, yes… I did say I’d come back, didn’t I?

      Short story is that although it worked on the DNS323 I subsequently retired that box and replaced it with an HP Microserver instead. So back to CentOS 6 – on which it’s working exactly as I need it to.

      Happy to help you with any specific problems you might have – fire away 🙂

  3. Sorry, I didn’t see you had replied.

    I now have it running OK, wrapped in a few shell scripts so I can write the entries to time-based logs and see what has happened overnight.

    I can run the scripts from the bash prompt with no issue; however, when I run them via cron it doesn’t work, as I think it’s trying to use Python 2.6 instead of the 2.7.something it uses when I run it from the shell.

    Any ideas?

    My log file output is as follows:

    02042015 18:30:01 : Starting work
    DEBUG: ENoTTY(): File "/usr/bin/gsync", line 55, in <module>
        crawler = Crawler(p, dest)
      File "/usr/lib/python2.6/site-packages/libgsync/crawler.py", line 57, in __init__
        info = self._drive.stat(self._dst)
      File "/usr/lib/python2.6/site-packages/libgsync/drive/__init__.py", line 507, in stat
        ents = self._query(parentId = parentId)
      File "/usr/lib/python2.6/site-packages/libgsync/drive/__init__.py", line 766, in _query
        service = self.service()
      File "/usr/lib/python2.6/site-packages/libgsync/drive/__init__.py", line 231, in service
        credentials = self._obtainCredentials()
      File "/usr/lib/python2.6/site-packages/libgsync/drive/__init__.py", line 328, in _obtainCredentials
        raise ENoTTY

    Error:
    02042015 18:30:01 : FINISHED

    I would appreciate your advice.

    • It may surprise you to find that when you run a script from cron it doesn’t inherit your full environment unless you tell it to.
      In your case, you probably need to manipulate $PATH – either in the script, or in the crontab, and make it explicit.

      In a login shell, do

      which python

      That should tell you what you want your path to include.
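
      A hypothetical crontab to match – the script path and schedule are made up for illustration:

      PATH=/usr/local/bin:/usr/bin:/bin
      # run the nightly sync at 02:30; the PATH line above makes cron
      # find the same python 2.7 that a login shell does
      30 2 * * * /home/simon/bin/gsync-nightly.sh >> /var/log/gsync-cron.log 2>&1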

  4. Thanks for posting. One nit: the command

    sudo yum -y python-pip install gsync

    should be

    sudo pip install gsync

    or, perhaps as a one-liner,

    sudo yum -y install python-pip; sudo pip install gsync

    Cheers

  5. Pingback: Ooh. A Google Drive CLI tool that works properly! « Random Ramblings

