NOTE: The examples here reflect the most recent release of forqlift. If the syntax doesn’t work, please confirm that you’re using the latest version (or refer to the EXAMPLES.txt file in your distribution).

(all examples assume the forqlift command is in your PATH)

forqlift’s syntax is similar to that of svn and other tools: you specify some action, followed by that action’s flags. For example:

forqlift create [... options for "create" ...]

get help / see options

To see help for all actions, run:

forqlift --help

To see help for a specific action:

forqlift [action] --help

(e.g., forqlift create --help)

create a SequenceFile

Inside the SequenceFile, each record will use the filename for the key and the file’s contents (as a Hadoop BytesWritable type) for the value.

forqlift create --file=/some/file.seq file1 file2 file3 /path/to/file4
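If it helps to see what actually ends up in the file, here is a minimal sketch of reading it back with Hadoop’s classic SequenceFile.Reader API. This is not part of forqlift, and the class name DumpKeys is made up for the example; it just walks the records and prints each filename key along with the size of its BytesWritable value.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Hypothetical example: walk a forqlift-created SequenceFile and print
// each record's key (the original filename) and the size of its value.
public class DumpKeys {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/some/file.seq");

        SequenceFile.Reader reader =
                new SequenceFile.Reader(path.getFileSystem(conf), path, conf);
        try {
            Text key = new Text();                     // key   = filename
            BytesWritable value = new BytesWritable(); // value = file contents
            while (reader.next(key, value)) {
                System.out.println(key + " : " + value.getLength() + " bytes");
            }
        } finally {
            reader.close();
        }
    }
}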

create a SequenceFile, text data

This time, the value will be a Hadoop Text type, which means your Mapper and Reducer code can just fetch the contents as a big String. (If the value were still BytesWritable, you would have to first convert the raw bytes to text.)

forqlift create --file=/some/file.seq --data-type=text file1.txt file2.xml file3.txt /path/to/file4.xml
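As a rough illustration, here is a hypothetical mapper (the class name and the length-counting logic are invented for this example). It assumes the job reads the file through SequenceFileInputFormat, so each record arrives as a Text key (the filename) and a Text value (the file’s contents).

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: emits each filename along with its document length.
public class DocLengthMapper extends Mapper<Text, Text, Text, IntWritable> {
    @Override
    protected void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        String contents = value.toString();   // the whole file as one String
        context.write(key, new IntWritable(contents.length()));
    }
}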

create a SequenceFile, compressed text data

Text tends to compress well. This can lead to big savings on bandwidth and storage, both of which are especially important if you’re on a slow line and/or you use cloud services, such as Amazon’s S3 or Elastic MapReduce.

As before, the value will be a Hadoop Text type; the compression happens inside the SequenceFile, so your Mapper and Reducer code reads the records the same way.

forqlift create --file=/some/file.seq --data-type=text --compress=bzip2 /path/to/*.xml /another/path/*.txt

inspect a SequenceFile

Fetch the number and type of records:

forqlift inspect /some/file.seq

list the contents of a SequenceFile

forqlift list /some/file.seq

clean up filenames on extract

When extracting a SequenceFile’s contents, forqlift uses a record’s key as the file name. If you’ve created the SequenceFile using forqlift, this is not a problem.

If, on the other hand, you generated the SequenceFile elsewhere – say, your Hadoop job uses URLs for a record key, and web page content as the record value – then using forqlift to extract that content may leave you with filenames full of special characters: ampersands (&), slashes (/), and more.

Pass the --munge-names switch to the extract operation, and forqlift will replace unfriendly characters with an underscore (_).

forqlift extract --munge-names --file=/some/file.seq
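In case it helps to picture the cleanup, the idea is roughly the following. The exact character set forqlift replaces isn’t spelled out here, so treat this as an illustration rather than a specification (the whitelist below is invented for the example):

// Hypothetical illustration of "munging" a record key into a safe filename:
// anything outside a conservative whitelist becomes an underscore.
public class MungeExample {
    static String munge(String key) {
        return key.replaceAll("[^A-Za-z0-9._-]", "_");
    }

    public static void main(String[] args) {
        // prints: http___example.com_page_id_1_x_2
        System.out.println(munge("http://example.com/page?id=1&x=2"));
    }
}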

extract the contents of a SequenceFile

Extract to the current directory:

forqlift extract --file=/some/file.seq

Extract to another directory (paths will be created as needed):

forqlift extract --file=/some/file.seq --dir=/another/directory

convert a zip or tar(.bz2, .gz) file to a SequenceFile

(NOTE: This is an experimental feature!)

Note that you can also use the --data-type and --compress options, if need be.

forqlift fromarchive --file=/some/file.seq somefile.tar

You can also squeeze multiple zip or tar files into a single SequenceFile:

forqlift fromarchive --file=/some/file.seq file1.zip file2.tar.bz2 file3.tar.gz file4.tar

convert a SequenceFile into zip or tar format

(NOTE: This is an experimental feature!)

forqlift toarchive --file=/some/file.tar.bz2 file1.seq

or, create one file from several SequenceFiles:

forqlift toarchive --file=/some/file.tar.bz2 file1.seq file2.seq file3.seq

merge SequenceFiles

(NOTE: This is an experimental feature!)

Sometimes, you want to combine several SequenceFiles into one. (Hadoop may generate several SequenceFiles for a job – one for each reducer – and you may want to squeeze all of those into a single, logical unit.)

forqlift seq2seq --file=new_combined_file.seq original_file1.seq original_file2.seq original_file3.seq ...
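Conceptually, the merge just reads every record from each input file and appends it to a single output. Here is a rough sketch of that idea, not forqlift’s actual implementation: it assumes Text keys and BytesWritable values, and uses Hadoop’s classic SequenceFile.Reader/Writer API.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Hypothetical merge: args[0] is the output file, the rest are inputs.
public class NaiveMerge {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, new Path(args[0]), Text.class, BytesWritable.class);
        try {
            Text key = new Text();
            BytesWritable value = new BytesWritable();
            for (int i = 1; i < args.length; i++) {
                SequenceFile.Reader reader =
                        new SequenceFile.Reader(fs, new Path(args[i]), conf);
                try {
                    while (reader.next(key, value)) {
                        writer.append(key, value);   // copy record as-is
                    }
                } finally {
                    reader.close();
                }
            }
        } finally {
            writer.close();
        }
    }
}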

work with other data types

By default, forqlift works only with the built-in Hadoop Writable data types: Text and BytesWritable. It’s entirely possible to create a SequenceFile using other Writable types, such as those from Hive, or even your own custom classes.

To help forqlift understand these data types, drop the necessary JAR files into:

{forqlift install}/lib.ext/

forqlift will add these JARs to its runtime classpath, so it will be able to identify the other Writable types.

Please note that this only affects listing, inspecting, and extracting a SequenceFile. forqlift will still create new SequenceFiles using plain old Text or BytesWritable.
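For reference, a custom Writable is just a class that implements Hadoop’s Writable interface, which boils down to two methods: write() and readFields(). Below is a minimal, entirely hypothetical example of the kind of class you would package into a JAR and drop into lib.ext/.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

// Hypothetical custom Writable: two fields, serialized in a fixed order.
public class PageViewWritable implements Writable {
    private long timestamp;
    private int statusCode;

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(timestamp);
        out.writeInt(statusCode);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        timestamp = in.readLong();
        statusCode = in.readInt();
    }
}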

pass flags to forqlift's JVM (set memory, etc.)

Use the FORQLIFT_OPTS environment variable; its value is passed along to the JVM.

For example, to set forqlift’s JVM memory (heap size) to 512MB:

export FORQLIFT_OPTS="-Xmx512m"
forqlift .....

get information about forqlift's version

Pass the --version flag to see the project version, as well as the version of Hadoop used to build forqlift.

forqlift --version