charcuterie examples


“net.exmachinatech.charcuterie.pig.eval.XMLExtractorMultipleValue”

extract multiple fields from an XML document, using XPath expressions

define XMLExtractorMultipleValue net.exmachinatech.charcuterie.pig.eval.XMLExtractorMultipleValue() ;

-- the UDF accepts an arbitrary number of XPath expressions.
-- Here, we extract just three fields.

abstracts = FOREACH raw_data GENERATE
   file_name ,
   FLATTEN( XMLExtractorMultipleValue(
      file_contents ,
      '/article/meta/author/name/text()' ,
      '/article/meta/abstract/text()' ,
      '/article/content'
   ) ) AS ( author , abstract , content )
;

returns: tuple, with one member for each provided XPath expression


“net.exmachinatech.charcuterie.pig.eval.SentenceCounter”

count the number of sentences in a document

define SentenceCounter net.exmachinatech.charcuterie.pig.eval.SentenceCounter() ;

sentence_counts = FOREACH document_set GENERATE
  file_name ,
  SentenceCounter( file_contents )
;

returns: integer, number of sentences in the specified document.


“net.exmachinatech.charcuterie.pig.eval.SentenceTokenizer”

tokenize a document into sentences

define SentenceTokenizer net.exmachinatech.charcuterie.pig.eval.SentenceTokenizer() ;

sentences = FOREACH document_set GENERATE
  file_name ,
  SentenceTokenizer( file_contents )
;

returns: bag of tuples, with one sentence per tuple
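
Since the UDF returns a bag, FLATTEN can turn it into one row per sentence. A minimal sketch, assuming the same "document_set" relation as above; the alias "sentence_rows" and the field name "sentence" are ours:

```pig
define SentenceTokenizer net.exmachinatech.charcuterie.pig.eval.SentenceTokenizer() ;

-- FLATTEN un-nests the bag: each sentence becomes its own
-- ( file_name , sentence ) row
sentence_rows = FOREACH document_set GENERATE
  file_name ,
  FLATTEN( SentenceTokenizer( file_contents ) ) AS sentence
;
```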


“net.exmachinatech.charcuterie.pig.eval.NGramTermTokenizer”

tokenize a document and extract ngrams

-- the default N is 2 (so, the UDF returns bigrams).  The constructor
-- also accepts an arbitrary value of N (say, to retrieve trigrams or
-- even 5-grams).
define NGramTermTokenizer net.exmachinatech.charcuterie.pig.eval.NGramTermTokenizer() ;

ngrams = FOREACH document_set GENERATE
  file_name , 
  NGramTermTokenizer( file_contents )
;

returns: bag of (term,count) tuples
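
To request a different N, pass it to the constructor. Pig only accepts string literals as constructor arguments in a DEFINE statement, so the sketch below assumes that N is passed as a single quoted value; the alias "TrigramTokenizer" is ours:

```pig
-- assumption: N is the constructor's only argument, passed as a string
define TrigramTokenizer net.exmachinatech.charcuterie.pig.eval.NGramTermTokenizer( '3' ) ;

trigrams = FOREACH document_set GENERATE
  file_name ,
  TrigramTokenizer( file_contents )
;
```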


“net.exmachinatech.charcuterie.pig.eval.TotalTermCount”

tokenize a document, and return the total number of terms therein

define TotalTermCount net.exmachinatech.charcuterie.pig.eval.TotalTermCount() ;

total_counts = FOREACH document_set GENERATE
  file_name ,
  TotalTermCount( file_contents )
;

returns: integer, total number of (stemmed) terms in the specified document.


“net.exmachinatech.charcuterie.pig.eval.XMLExtractorSingleValue”

extract a single field from an XML document, using an XPath expression

define XMLExtractorSingleValue net.exmachinatech.charcuterie.pig.eval.XMLExtractorSingleValue() ;

-- we specify the desired XML element using an XPath expression
abstracts = FOREACH raw_data GENERATE
   file_name , 
   XMLExtractorSingleValue(
      file_contents ,
      '/article/meta/abstract/text()'
   ) AS abstract
;

returns: string, the result of the supplied XPath expression


“net.exmachinatech.charcuterie.pig.eval.UniqueTermCount”

tokenize a document, and return the number of unique terms therein

define UniqueTermCount net.exmachinatech.charcuterie.pig.eval.UniqueTermCount() ;

unique_counts = FOREACH document_set GENERATE
  file_name ,
  UniqueTermCount( file_contents )
;

returns: integer, total number of unique (stemmed) terms in the specified document.
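
UniqueTermCount and TotalTermCount can be combined in a single FOREACH, for example to estimate how varied a document's vocabulary is. This is only a sketch: the alias "diversity" and the field name "type_token_ratio" are ours, and empty documents (a total of zero) are not handled specially.

```pig
define TotalTermCount net.exmachinatech.charcuterie.pig.eval.TotalTermCount() ;
define UniqueTermCount net.exmachinatech.charcuterie.pig.eval.UniqueTermCount() ;

-- both UDFs return integers, so cast to double to avoid integer division
diversity = FOREACH document_set GENERATE
  file_name ,
  (double) UniqueTermCount( file_contents ) / (double) TotalTermCount( file_contents )
    AS type_token_ratio
;
```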


“net.exmachinatech.charcuterie.pig.eval.TermFrequency”

tokenize a document, then list term frequencies

define TermFrequency net.exmachinatech.charcuterie.pig.eval.TermFrequency() ;

term_frequencies = FOREACH document_set GENERATE
  file_name ,
  TermFrequency( file_contents )
;

returns: bag, of (term,frequency) tuples
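
To work with individual terms, FLATTEN the bag so that each (term,frequency) pair becomes its own row. A minimal sketch; the alias "term_rows" and the field names are ours, and the field types are whatever the UDF reports:

```pig
define TermFrequency net.exmachinatech.charcuterie.pig.eval.TermFrequency() ;

-- one ( file_name , term , frequency ) row per term in each document
term_rows = FOREACH document_set GENERATE
  file_name ,
  FLATTEN( TermFrequency( file_contents ) ) AS ( term , frequency )
;
```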


“net.exmachinatech.charcuterie.pig.eval.EmailExtractor” (EXPERIMENTAL)

extract certain fields (from, to, CC, date, subject, body) from an e-mail
message. Please note that this UDF has not been tested on a wide variety
of e-mail formats. Be further warned that the UDF employs a rather naive
method in handling multipart messages.

-- the constructor also accepts a date pattern, suitable for
-- SimpleDateFormat; the UDF then formats the date accordingly.
define EmailExtractor net.exmachinatech.charcuterie.pig.eval.EmailExtractor() ;

-- assuming field "email_contents" contains the raw e-mail ...

emails = FOREACH raw_data GENERATE
  file_name , 
  FLATTEN( EmailExtractor( email_contents ) )
;

returns: tuple of chararray: ( from, to, cc, date, subject, body )
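
A sketch of the date-pattern variant, which also names the flattened fields. It assumes that the pattern is the constructor's only argument; the alias "EmailExtractorWithDate" and the "msg_" field names are ours, and any pattern that SimpleDateFormat accepts could be substituted:

```pig
-- assumption: the SimpleDateFormat pattern is passed as a single string
define EmailExtractorWithDate net.exmachinatech.charcuterie.pig.eval.EmailExtractor( 'yyyy-MM-dd HH:mm:ss' ) ;

-- field order follows the documented return tuple:
-- ( from, to, cc, date, subject, body )
dated_emails = FOREACH raw_data GENERATE
  file_name ,
  FLATTEN( EmailExtractorWithDate( email_contents ) )
    AS ( msg_from , msg_to , msg_cc , msg_date , msg_subject , msg_body )
;
```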


“net.exmachinatech.charcuterie.pig.eval.StripHTML”

strip HTML formatting out of the supplied text

define StripHTML net.exmachinatech.charcuterie.pig.eval.StripHTML() ;

plain_text = FOREACH raw_html GENERATE StripHTML( raw_html ) ;

returns: string, of plain text
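
Because StripHTML returns plain text, its result can presumably be fed straight into the text-oriented UDFs. The sketch below counts the terms in an HTML document after removing the markup; it reuses the "raw_html" relation and field from the example above, and the alias "html_term_counts" and the field name "total_terms" are ours:

```pig
define StripHTML net.exmachinatech.charcuterie.pig.eval.StripHTML() ;
define TotalTermCount net.exmachinatech.charcuterie.pig.eval.TotalTermCount() ;

-- strip the markup first, then count terms in the remaining text
html_term_counts = FOREACH raw_html GENERATE
  TotalTermCount( StripHTML( raw_html ) ) AS total_terms
;
```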


“net.exmachinatech.charcuterie.pig.eval.TermTokenizer”

stem and tokenize a document

define TermTokenizer net.exmachinatech.charcuterie.pig.eval.TermTokenizer() ;

terms = FOREACH document_set GENERATE
  file_name ,
  TermTokenizer( file_contents )
;

returns: bag, of (term) tuples


“net.exmachinatech.charcuterie.pig.eval.HTMLExtractor”

extract multiple fields from an HTML document, using Jsoup Selector
expressions (http://jsoup.org/apidocs/org/jsoup/select/Selector.html)

-- the constructor also accepts 'true', to strip HTML tags from the results.
define HTMLExtractor net.exmachinatech.charcuterie.pig.eval.HTMLExtractor() ;

-- The UDF uses Jsoup Selector expressions.  For details, see:
-- http://jsoup.org/apidocs/org/jsoup/select/Selector.html
--
-- Fetch the title and the content of the element with ID "interesting" ...
-- (The raw HTML is in the "html_content" field of the "raw_data" relation.)
--
-- The UDF accepts an arbitrary number of Selector expressions.
-- If an expression matches multiple HTML elements, the UDF will concatenate
-- them into a single string value.

extracted = FOREACH raw_data GENERATE
  file_name ,
  FLATTEN( HTMLExtractor( html_content , 'title' , '#interesting' ) )
    AS ( title , interesting )
;

returns: tuple of chararray, one member for each Selector expression.
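
A sketch of the tag-stripping variant mentioned above. It assumes that the flag is passed as the string 'true' (Pig constructor arguments are always string literals); the aliases "HTMLTextExtractor" and "extracted_text" are ours:

```pig
-- assumption: 'true' is the constructor's only argument
define HTMLTextExtractor net.exmachinatech.charcuterie.pig.eval.HTMLExtractor( 'true' ) ;

-- same selectors as above, but the results should come back without HTML tags
extracted_text = FOREACH raw_data GENERATE
  file_name ,
  FLATTEN( HTMLTextExtractor( html_content , 'title' , '#interesting' ) )
    AS ( title , interesting )
;
```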