NOTE: The examples here reflect the most recent release of forqlift. If the syntax doesn’t work, please confirm you’re using the latest version, or refer to the EXAMPLES.txt file in your distribution.
(All examples assume the forqlift command is on your PATH.)
forqlift’s syntax is similar to that of svn and other tools: you specify some action, followed by that action’s flags. For example:
forqlift create [... options for "create" ...]
get help / see options
To see help for all actions, run:
forqlift --help
To see help for a specific action:
forqlift [action] --help
(e.g., forqlift create --help)
create a SequenceFile
Inside the SequenceFile, each record will use the filename for the key and the file’s contents (as a Hadoop BytesWritable type) for the value.
forqlift create --file=/some/file.seq file1 file2 file3 /path/to/file4
create a SequenceFile, text data
This time, the value will be a Hadoop Text type, which means your Mapper and Reducer code can just fetch the contents as a big String. (If the value were still BytesWritable, you would have to first convert the raw bytes to text.)
forqlift create --file=/some/file.seq --data-type=text file1.txt file2.xml file3.txt /path/to/file4.xml
create a SequenceFile, compressed text data
Text tends to compress well. This can lead to big savings on bandwidth and storage, both of which are especially important if you’re on a slow line and/or you use cloud services, such as Amazon’s S3 or Elastic MapReduce.
As in the previous example, --data-type=text stores each value as a Hadoop Text type, so your Mapper and Reducer code can fetch the contents as a String.
forqlift create --file=/some/file.seq --data-type=text --compress=bzip2 /path/to/*.xml /another/path/*.txt
inspect a SequenceFile
Fetch the number and type of records:
forqlift inspect /some/file.seq
list the contents of a SequenceFile
forqlift list /some/file.seq
clean up filenames on extract
When extracting a SequenceFile’s contents, forqlift uses a
record’s key as the file name. If you’ve created the
SequenceFile using forqlift, this is not a problem.
If, on the other hand, you generated the SequenceFile elsewhere
(say, your Hadoop job uses URLs for a record key and web page
content as the record value), then using forqlift to extract
that content may leave you with filenames full of special
characters: ampersands (&), slashes (/), and more.
Pass the --munge-names switch to the extract operation, and
forqlift will replace unfriendly characters with an underscore
(_).
forqlift extract --munge-names --file=/some/file.seq
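To get a feel for what munged names look like, here is a rough shell approximation. It is only a sketch: the munge_name function is hypothetical, and the set of characters it treats as "friendly" (ASCII letters, digits, dot, underscore, hyphen) is an assumption. forqlift's actual rules may differ.

```shell
# Rough approximation of name munging. Assumption: forqlift's real
# definition of "unfriendly" characters may not match this one.
munge_name() {
  # Keep ASCII letters, digits, dot, underscore, and hyphen;
  # replace every other character with an underscore.
  printf '%s' "$1" | LC_ALL=C tr -c 'A-Za-z0-9._-' '_'
}

# A URL-style record key, as in the scenario described above.
munge_name 'http://example.com/page?id=1&lang=en'
echo
```

With this approximation, the URL key above becomes http___example.com_page_id_1_lang_en, which is safe to use as a file name.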
extract the contents of a SequenceFile
Extract to current directory:
forqlift extract --file=/some/file.seq
Extract to another directory (paths will be created as needed):
forqlift extract --file=/some/file.seq --dir=/another/directory
convert a zip or tar(.bz2, .gz) file to a SequenceFile
(NOTE: This is an experimental feature!)
Note that you can also use the --data-type and --compress options, if need be.
forqlift fromarchive --file=/some/file.seq somefile.tar
You can also squeeze multiple zip or tar files into a single SequenceFile:
forqlift fromarchive --file=/some/file.seq file1.zip file2.tar.bz2 file3.tar.bz file4.tar
convert a SequenceFile into zip or tar format
(NOTE: This is an experimental feature!)
forqlift toarchive --file=/some/file.tar.bz2 file1.seq
or, create one file from several SequenceFiles:
forqlift toarchive --file=/some/file.tar.bz2 file1.seq file2.seq file3.seq
merge SequenceFiles
(NOTE: This is an experimental feature!)
Sometimes, you want to combine several SequenceFiles into one.
(Hadoop may generate several SequenceFiles for a job — one for
each reducer — and you may want to squeeze all of those into a
single, logical unit.)
forqlift seq2seq --file new_combined_file.seq original_file1.seq original_file2.seq original_file3.seq ...
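If a job leaves many reducer outputs in one directory, typing each file name gets tedious. The sketch below wraps seq2seq in a small helper. It assumes the outputs have already been copied to a local directory and follow Hadoop's usual part-* naming; the merge_parts function and the DRY_RUN variable are hypothetical names used only for illustration.

```shell
# Collect reducer outputs from a local directory and pass them to seq2seq.
# Assumptions: files are local and named part-*; merge_parts and DRY_RUN
# are illustrative names, not part of forqlift.
merge_parts() {
  out=$1
  dir=$2
  # Replace the positional parameters with the matching part files.
  set -- "$dir"/part-*
  # If the glob matched nothing, the unexpanded pattern is left in $1.
  if [ ! -e "$1" ]; then
    echo "no part files found in $dir" >&2
    return 1
  fi
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Print the command instead of running it.
    echo forqlift seq2seq --file "$out" "$@"
  else
    forqlift seq2seq --file "$out" "$@"
  fi
}

# Usage (hypothetical paths):
#   merge_parts new_combined_file.seq /path/to/job-output
```

Setting DRY_RUN=1 before calling the helper prints the command that would run, which is a cheap way to check the file list first.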
work with other data types
By default, forqlift only works with built-in Hadoop Writable data
types: Text and BytesWritable. It’s entirely possible to create
a SequenceFile using other Writable types, such as those from
Hive, or even your own custom classes.
To help forqlift understand these data types, drop the necessary
JAR files into:
{forqlift install}/lib.ext/
forqlift will add these JARs to its runtime classpath, so it will
be able to identify the other Writable types.
Please note that this will only impact listing, inspecting, or
extracting a SequenceFile. forqlift will still create new
SequenceFiles using plain old Text or BytesWritable.
pass flags to forqlift’s jvm (set memory, etc)
Use the FORQLIFT_OPTS environment variable, the value of which gets passed to the JVM.
For example, to set forqlift’s JVM memory (heap size) to 512MB:
export FORQLIFT_OPTS="-Xmx512m"
forqlift .....
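Since FORQLIFT_OPTS is an ordinary environment variable, the usual shell rules apply: export keeps the value for the rest of the session, while prefixing a single command sets it for that command only. The sketch below shows the difference. It uses sh -c as a stand-in for forqlift so that it runs anywhere; the stand-in only prints the value a child process would receive and is not part of forqlift.

```shell
# `sh -c "$probe"` is a stand-in for forqlift (an assumption made so the
# sketch runs without it); it prints the FORQLIFT_OPTS value it receives.
probe='printf "%s" "${FORQLIFT_OPTS:-unset}"'

unset FORQLIFT_OPTS

# One-off: the variable is set for this single command only.
one_off=$(FORQLIFT_OPTS="-Xmx512m" sh -c "$probe")

# The current shell is unaffected afterwards.
after=$(sh -c "$probe")

# Session-wide: every later command in this shell inherits the value.
export FORQLIFT_OPTS="-Xmx512m"
exported=$(sh -c "$probe")

echo "one-off: $one_off, after: $after, exported: $exported"
```

The one-off form is handy when only one large job needs the bigger heap.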
get information about forqlift’s version
Pass the --version flag to forqlift to see the project version, as well as the version of Hadoop used to build forqlift.
forqlift --version