Smarchive – The Smart Archive

Short and sweet: GitHub, Build, Distribution

Oh, the suspense!

I recently faced an issue where I had to read an archive stream without any prior knowledge of its archive format or compression method *insert-dramatic-music*.
I mean, the stream I was given could have been read from a zip, or from a tar, or from an ar, or from anything; I just didn’t know which.

Go with the flow

I figured that the most elegant (though not the simplest) way to solve this was to auto-detect the type of archive before handling its content.
That left me facing 2 main issues:

  1. How do I detect the type of archive?
  2. How do I treat the stream as a generic archive without specifically referencing its type?

The first issue can be solved because most archive formats start with magic header bytes that hint at their type; we can read the first bytes of the stream, and they will give us a hint about the archive implementation that we should use.
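To make the idea concrete, here is a minimal sketch of signature matching on a header that has already been read into a byte array. MagicSniffer and its detect method are hypothetical names made up for this illustration, not part of Smarchive, and the table only covers a few well-known signatures: zip local file header, gzip, ar, and the "ustar" marker that POSIX tar stores at offset 257.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical helper (not part of Smarchive): guesses a format from leading bytes. */
final class MagicSniffer {

    // Well-known signatures; real-world detection has more cases than these.
    private static final byte[] ZIP = {0x50, 0x4B, 0x03, 0x04};   // "PK" 3 4
    private static final byte[] GZIP = {0x1F, (byte) 0x8B};
    private static final byte[] AR = "!<arch>\n".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] USTAR = "ustar".getBytes(StandardCharsets.US_ASCII);
    private static final int USTAR_OFFSET = 257;                  // inside the first tar block

    private MagicSniffer() {}

    /** Returns "zip", "gzip", "ar", "tar" or "unknown" for the given header bytes. */
    static String detect(byte[] header) {
        if (matchesAt(header, 0, ZIP)) return "zip";
        if (matchesAt(header, 0, GZIP)) return "gzip";
        if (matchesAt(header, 0, AR)) return "ar";
        if (matchesAt(header, USTAR_OFFSET, USTAR)) return "tar";
        return "unknown";
    }

    /** True when data contains magic starting at offset; false if data is too short. */
    private static boolean matchesAt(byte[] data, int offset, byte[] magic) {
        if (data.length < offset + magic.length) return false;
        byte[] window = Arrays.copyOfRange(data, offset, offset + magic.length);
        return Arrays.equals(window, magic);
    }
}
```

Note that tar is the awkward one here: its marker sits 257 bytes into the stream, so the header you sniff has to be at least 262 bytes long, and very old tar files carry no marker at all.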

The second issue can be solved by using the great Apache Commons Compress. The library not only provides us with implementations for all popular archive types, it also gives us an abstract interface over all of them, so that we can reference an archive without caring about the implementation.

Enter the Smarchive!

The Smarchive input stream has one method – realize:

import java.io.InputStream;
import org.apache.commons.compress.archivers.ArchiveInputStream;
import org._10ne.smarchive.SmarchiveInputStream;

...

public void readArchiveEntries(InputStream inputStream) {
    ArchiveInputStream archiveStream = SmarchiveInputStream.realize(inputStream);
    ...
}

realize reads the magic header of the provided input stream, determines the correct implementation and returns a generic commons-compress ArchiveInputStream that’s ready to use for reading entries.

Bits and bytes

It’s important to note that in the current implementation I took the easy path, and the smarts come at the price of resources.

I’m able to read the magic bytes at the beginning of the archive and keep the stream intact for extraction because I use a BufferedInputStream:

  1. Wrap original stream with a BufferedInputStream.
  2. Read magic bytes and reset buffer.
  3. Wrap the buffered stream with a specific archive implementation.
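The first two steps can be sketched with nothing but the JDK. HeaderPeek and peek are hypothetical names for this illustration and are not taken from the Smarchive source; the point is only to show how mark and reset let you look at the header and still hand an untouched stream to the archive implementation.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.util.Arrays;

/** Hypothetical helper (not part of Smarchive): peeks at a stream without consuming it. */
final class HeaderPeek {

    private HeaderPeek() {}

    /**
     * Reads up to limit bytes from the buffered stream, then rewinds it,
     * so the next reader sees the stream from its original position.
     */
    static byte[] peek(BufferedInputStream in, int limit) throws IOException {
        in.mark(limit);                       // remember the current position
        try {
            byte[] buffer = new byte[limit];
            int total = 0;
            while (total < limit) {           // a single read() may return fewer bytes
                int read = in.read(buffer, total, limit - total);
                if (read < 0) break;          // stream ended before limit was reached
                total += read;
            }
            return Arrays.copyOf(buffer, total);
        } finally {
            in.reset();                       // rewind to the marked position
        }
    }
}
```

The price mentioned above is visible here: the BufferedInputStream has to hold on to everything read since the mark, so the larger the header you need to inspect, the larger the buffer.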

In a worse scenario – the original archive is GZipped, which means that I now need another buffer level: one for the GZip filter and one for the archive:

  1. Wrap original stream with a BufferedInputStream.
  2. Read magic bytes (this is where the GZip header shows up) and reset buffer.
  3. Wrap the buffered stream with a GZIPInputStream, and the gzip stream with a second BufferedInputStream.
  4. Read magic bytes and reset buffer.
  5. Wrap the buffered stream with a specific archive implementation.
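
Here is a minimal JDK-only sketch of the two buffer levels. GzipUnwrapper and unwrap are hypothetical names for this illustration, not the actual Smarchive code. The sketch buffers the raw stream before looking for the GZip header, because GZIPInputStream itself does not support mark and reset; the returned stream is buffered again so that the archive's own magic bytes can be peeked at next.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

/** Hypothetical helper (not part of Smarchive): strips an optional GZip layer. */
final class GzipUnwrapper {

    private static final int GZIP_MAGIC_LENGTH = 2;

    private GzipUnwrapper() {}

    /**
     * Returns a mark-capable stream positioned at the start of the payload:
     * the decompressed data if the input is GZipped, the input itself otherwise.
     */
    static BufferedInputStream unwrap(InputStream original) throws IOException {
        // First buffer level: lets us look at the raw header and rewind.
        BufferedInputStream outer = new BufferedInputStream(original);
        outer.mark(GZIP_MAGIC_LENGTH);
        int first = outer.read();
        int second = outer.read();
        outer.reset();

        boolean gzipped = first == 0x1F && second == 0x8B;
        if (!gzipped) {
            return outer;
        }
        // Second buffer level: the decompressed stream needs mark/reset too,
        // so the archive type inside can be detected the same way.
        return new BufferedInputStream(new GZIPInputStream(outer));
    }
}
```

Whatever comes back from unwrap can then go through the same read-magic-and-reset routine before it is wrapped with the specific archive implementation.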