Q Ethan McCallum

quick start

forqlift helps you manage Hadoop SequenceFiles. It is released under the Apache License, Version 2.0.

For more details, read on.

introduction to forqlift

If you use Hadoop to process binary data, chances are you store that data in SequenceFile archives.

SequenceFiles are nice, but they can be unwieldy at times. I wrote forqlift to make it easier to manage SequenceFiles.

forqlift is a command-line tool that lets you:

  • create SequenceFiles from files on your local filesystem (just like creating an archive with tar or zip). Ship binary and whole-file data to a Hadoop cluster for processing.
  • set compression (none, bzip2, gzip) and value types (text or binary). Compress to save bandwidth and storage space. This is especially useful if you're shipping data back and forth to Amazon's Elastic MapReduce.
  • extract the contents of a SequenceFile back to the filesystem. Now that Hadoop has processed your data, extract the archive and see the results.
  • convert popular archive formats -- tar (including tar.bz2 and tar.gz) and zip -- to and from SequenceFile format. Already have lots of data in these more traditional archive formats, but want to work with Hadoop? No problem!
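To make the archive comparison concrete, here is a minimal Python sketch of the idea behind converting a tar archive to and from a sequence of records. It is not forqlift's code and it does not read or write Hadoop's SequenceFile format. It assumes, purely for illustration, that each file becomes one record with the file name as the key and the file's bytes as the value. The function names tar_to_records and records_to_tar are hypothetical.

```python
import io
import tarfile


def tar_to_records(tar_bytes):
    """Yield one (name, content) pair per regular file in a tar archive.

    Assumption for illustration: key = file name, value = raw file bytes.
    """
    # "r:*" lets tarfile detect plain, gzip, or bzip2 compression itself.
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:*") as archive:
        for member in archive:
            if member.isfile():
                yield member.name, archive.extractfile(member).read()


def records_to_tar(records, compression="gz"):
    """Pack (name, content) pairs into an in-memory tar archive.

    compression may be "gz", "bz2", or "" for no compression.
    """
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:" + compression) as archive:
        for name, content in records:
            info = tarfile.TarInfo(name=name)
            info.size = len(content)
            archive.addfile(info, io.BytesIO(content))
    return buffer.getvalue()
```

Round-tripping a list of pairs through records_to_tar and then tar_to_records should give back the same names and bytes, whichever compression setting is used. The same holds for any whole-file container: only the packaging changes, not the contents.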

Feel free to review the forqlift FAQ or check out some examples. Better yet, you can grab a copy and try it out for yourself!

forqlift's mission statement

forqlift’s primary goals are as follows:

  • create a SequenceFile from files on your local disk
  • extract data from a SequenceFile back to local disk
  • list the contents of SequenceFiles

Additionally, forqlift supports the following:

  • translation: convert a standard tar or zip archive into a SequenceFile, or vice-versa
  • merge: combine several SequenceFile archives into one

I try to keep forqlift as simple as possible, and only as complex as necessary. That goes for the code as well as the feature list.

licensing

forqlift is released under the popular Apache License, version 2.0.

credit where it's due

forqlift is built on quality, open-source components, including Hadoop and several Apache libraries. Also, the forqlift and forqlift.bat scripts borrow heavily from Apache Maven’s launch scripts.