Q Ethan McCallum

what is forqlift?

forqlift is a tool for managing SequenceFiles.

If you work with Hadoop (or products built on Hadoop, such as Mahout), there’s a chance you work with SequenceFiles.

In turn, if you work with SequenceFiles a lot and want an interface similar to the familiar tar or zip commands, forqlift is for you.

what’s a SequenceFile, and why would I use one?

A SequenceFile is a Hadoop-specific archive format, similar to tar or zip.

Hadoop typically works on text data, treating each line as a record (a key/value pair). If you’re processing text line-by-line, this is pretty natural: split the line on some delimiter (say, a tab or a comma) to define “key” and “value” and you Hadoop away.
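To make the line-as-record idea concrete, here is a minimal sketch in plain Python. It does not use Hadoop's API; the function name `line_to_record` and the sample lines are made up for illustration, and the tab delimiter is just an assumed default.

```python
def line_to_record(line, delimiter="\t"):
    """Split one line of text into a (key, value) pair at the first delimiter."""
    # partition() splits only at the first occurrence, so any further
    # delimiters stay inside the value. A line with no delimiter yields
    # the whole line as the key and an empty string as the value.
    key, _, value = line.rstrip("\n").partition(delimiter)
    return key, value


# Hypothetical input: two tab-delimited lines.
sample_lines = ["apple\t3\n", "banana\t7\n"]
records = [line_to_record(line) for line in sample_lines]
```

Each entry in `records` is one key/value pair, which is roughly the shape of data a line-oriented Hadoop job works with.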

What if you want to use Hadoop’s parallel-processing muscle on binary data (images, videos, sound files, or anything else you can see as a series of bytes)? Or what if you need to treat an entire text file (such as an XML document) as a single record?

That’s when you’d use a SequenceFile. Hadoop treats the contents of a SequenceFile as records of key/value pairs. (For example: if you’re feeding Hadoop a series of images, the “key” could be the filename and the “value” could be the image’s raw bytes.)
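The filename-as-key, bytes-as-value idea can be sketched in a few lines of plain Python. This only models the records conceptually: it does not write Hadoop's actual SequenceFile binary format, and the function name `files_to_records` and the sample files are hypothetical.

```python
import tempfile
from pathlib import Path


def files_to_records(directory):
    """Yield (filename, raw_bytes) pairs, one per regular file in directory."""
    # Sort so the records come out in a predictable order.
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            # Key: the file's name. Value: the file's entire contents.
            yield path.name, path.read_bytes()


# Hypothetical demo: two small files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.png").write_bytes(b"\x89PNG")
    Path(tmp, "b.xml").write_bytes(b"<doc/>")
    records = list(files_to_records(tmp))
```

Here each whole file becomes a single record, no matter how many lines or bytes it contains.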

As an additional benefit, SequenceFiles are very efficient for Hadoop because you can ship lots of small files to and from the cluster as a single package.

where do I get forqlift?

forqlift’s homepage at http://www.qethanm.cc/go/forqlift has links for downloading.

what do I need to run forqlift?

forqlift requires a Java runtime (JRE), version 1.6.

If you’re running under Windows, forqlift also requires a cygwin install that includes the chmod command. (This is a requirement of the underlying Hadoop code, not of forqlift itself.)

Please note that forqlift does not require a Hadoop install, nor a Hadoop cluster! forqlift uses the Hadoop libraries on the backend, but those are included in the product.

what is forqlift’s license?

forqlift is released under the popular Apache License, Version 2.0.

You can see the text of this license in the LICENSE.txt file included in the distribution, or online at either the Apache Software Foundation or the Open Source Initiative.

why did you write forqlift?

In a nutshell: it all started when I was churning XML data through Mahout (Apache’s library of machine-learning algorithms). Mahout uses Hadoop on the backend, and therefore it likes to read data from SequenceFiles. Mahout’s basic tools for loading data into SequenceFiles were fine, but I wanted something with a bit more heft. So I wrote forqlift.

what’s with the name “forqlift”?

It’s a play on the term “forklift.” Forklifts, like SequenceFiles, can be used to move things in bulk.

I tossed in the “q” because that letter appears in the term “SequenceFile.”

do you have some examples of how to use forqlift?

You bet I do! Check out the examples page.

can I suggest a feature for forqlift?

Please do! forqlift is still in its early stages. (I plan to tackle some cleanup before I add new features, but feel free to share your thoughts and use cases. That will help set future direction.)

Just shoot a message to: forqlift-questions at exmachinatech.net

can I ask you a question about forqlift?

Go right ahead: forqlift-questions at exmachinatech.net