forqlift helps you manage Hadoop SequenceFiles. It is released under the Apache License, Version 2.0.
For more details, read on.
introduction to forqlift
If you use Hadoop to process binary data, chances are you store that data in SequenceFile archives.
SequenceFiles are nice, but they can be unwieldy at times. I wrote forqlift to make it easier to manage SequenceFiles.
forqlift is a command-line tool that lets you:
- create SequenceFiles from files on your local filesystem (just like creating an archive with tar or zip). Ship binary and whole-file data to a Hadoop cluster for processing.
- set compression (none, bzip2, gzip) and value types (text or binary). Compress to save bandwidth and storage space. This is especially useful if you’re shipping data back and forth to Amazon’s Elastic MapReduce.
- extract the contents of a SequenceFile back to the filesystem. Now that Hadoop has processed your data, extract the archive and see the results.
- convert popular archive formats — tar (including tar.bz2 and tar.gz) and zip — to and from SequenceFile format. Already have lots of data in these more traditional archive formats, but want to work with Hadoop? No problem!
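To make the archive analogy concrete, here is a minimal conceptual sketch in Python. It is not forqlift's code and does not produce Hadoop's on-disk format. It only illustrates the idea of turning a tar archive into the stream of key/value records that a SequenceFile holds. The choice of file name as key and file contents as value is an assumption made for illustration, and the helper names `tar_to_records` and `build_sample_tar` are hypothetical.

```python
import io
import tarfile


def tar_to_records(tar_bytes):
    """Yield (name, content) pairs for each regular file in a tar archive.

    Assumption for illustration: key = entry name, value = raw file bytes.
    """
    # "r:*" lets tarfile detect plain, gzip, or bzip2 compression on its own.
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:*") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links, and other non-file entries
            yield member.name, archive.extractfile(member).read()


def build_sample_tar(files):
    """Build an in-memory gzip-compressed tar from a {name: bytes} mapping."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name, content in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(content)
            archive.addfile(info, io.BytesIO(content))
    return buffer.getvalue()


if __name__ == "__main__":
    sample = build_sample_tar({"a.txt": b"hello", "img/b.bin": b"\x00\x01"})
    for key, value in tar_to_records(sample):
        print(key, len(value), "bytes")
```

In a real conversion, each pair would be appended to a SequenceFile through Hadoop's own writer rather than collected in memory.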
forqlift’s mission statement
forqlift’s primary goals are as follows:
- create a SequenceFile from files on your local disk
- extract data from a SequenceFile back to local disk
- list the contents of SequenceFiles
Additionally, forqlift supports the following:
- translation: convert a standard tar or zip archive into a SequenceFile, or vice-versa
- merge: combine several SequenceFile archives into one
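One plausible reading of the merge operation, sketched in Python: if each archive is modelled as an ordered list of key/value records, merging amounts to concatenating those lists. This is an assumption about the semantics, not a description of forqlift's implementation, and `merge_records` is a hypothetical name.

```python
from itertools import chain


def merge_records(*archives):
    """Concatenate several record sequences into one, preserving order.

    Assumption for illustration: duplicate keys are kept as-is, since a
    sequence of records (unlike a dictionary) can hold repeated keys.
    """
    return list(chain.from_iterable(archives))


if __name__ == "__main__":
    first = [("a.txt", b"1")]
    second = [("b.txt", b"2"), ("a.txt", b"3")]
    for key, value in merge_records(first, second):
        print(key, value)
```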
I try to keep forqlift as simple as possible, and only as complex as necessary. That goes for the code as well as the feature list.
forqlift is released under the popular Apache License, version 2.0.
credit where it’s due
forqlift is built on quality, open-source components, including Hadoop and several Apache libraries. Also, the forqlift.bat scripts borrow heavily from Apache Maven’s launch scripts.