“net.exmachinatech.charcuterie.pig.eval.XMLExtractorMultipleValue”
extract multiple fields from an XML document, using XPath expressions
define XMLExtractorMultipleValue net.exmachinatech.charcuterie.pig.eval.XMLExtractorMultipleValue() ;
-- the UDF accepts an arbitrary number of XPath expressions.
-- Here, we extract just three fields.
abstracts = FOREACH raw_data GENERATE
file_name ,
FLATTEN( XMLExtractorMultipleValue(
file_contents ,
'/article/meta/author/name/text()' ,
'/article/meta/abstract/text()' ,
'/article/content'
) ) AS ( author , abstract , content )
;
returns: tuple, with one member for each provided XPath expression
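To illustrate what "one member for each provided XPath expression" means, here is a minimal Java sketch that evaluates several XPath expressions against one XML string using the JDK's javax.xml.xpath API. The class and method names are hypothetical and the code is not taken from the UDF; it only assumes the UDF behaves like a plain string-valued XPath evaluation.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper (not part of charcuterie): returns one string per
// XPath expression, in the order the expressions were supplied.
final class XPathMultiExtractSketch {
    static List<String> extract(String xml, String... expressions) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        List<String> values = new ArrayList<>();
        for (String expression : expressions) {
            try {
                // A fresh InputSource per expression, because each
                // evaluation consumes the underlying reader.
                InputSource source = new InputSource(new StringReader(xml));
                values.add(xpath.evaluate(expression, source));
            } catch (XPathExpressionException e) {
                throw new IllegalArgumentException("Bad XPath or XML: " + expression, e);
            }
        }
        return values;
    }
}
```

An expression that selects an element rather than a text node (such as '/article/content') yields the concatenated text inside that element; an expression that matches nothing yields an empty string.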
“net.exmachinatech.charcuterie.pig.eval.SentenceCounter”
count the number of sentences in a document
define SentenceCounter net.exmachinatech.charcuterie.pig.eval.SentenceCounter() ;
sentence_counts = FOREACH document_set GENERATE
file_name ,
SentenceCounter( file_contents )
;
returns: integer, number of sentences in the specified document.
“net.exmachinatech.charcuterie.pig.eval.SentenceTokenizer”
tokenize a document into sentences
define SentenceTokenizer net.exmachinatech.charcuterie.pig.eval.SentenceTokenizer() ;
sentences = FOREACH document_set GENERATE
file_name ,
SentenceTokenizer( file_contents )
;
returns: bag of tuples, with one sentence per tuple
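As a rough picture of sentence tokenization (and of sentence counting, which is the size of the result), the sketch below splits text with the JDK's java.text.BreakIterator. This is an assumption made for illustration only: the source does not say how the UDFs detect sentence boundaries, and the class name is hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper (not part of charcuterie): splits text into
// sentences using the JDK's locale-aware sentence boundary rules.
final class SentenceSplitSketch {
    static List<String> sentences(String text) {
        BreakIterator boundaries = BreakIterator.getSentenceInstance(Locale.US);
        boundaries.setText(text);
        List<String> result = new ArrayList<>();
        int start = boundaries.first();
        int end = boundaries.next();
        while (end != BreakIterator.DONE) {
            // Each [start, end) span is one sentence plus trailing whitespace.
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                result.add(sentence);
            }
            start = end;
            end = boundaries.next();
        }
        return result;
    }
}
```

Calling sentences("Pigs eat. Do they fly? No!") returns three entries, which corresponds to one sentence per tuple in the bag and to a sentence count of 3.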
“net.exmachinatech.charcuterie.pig.eval.NGramTermTokenizer”
tokenize a document and extract ngrams
-- the default N is 2 (so, the UDF returns bigrams). The constructor
-- also accepts an arbitrary value of N (say, to retrieve trigrams or
-- even 5-grams).
define NGramTermTokenizer net.exmachinatech.charcuterie.pig.eval.NGramTermTokenizer() ;
ngrams = FOREACH document_set GENERATE
file_name ,
NGramTermTokenizer( file_contents )
;
returns: bag of (term,count) tuples
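The (term,count) output can be pictured with the following Java sketch, which slides a window of N tokens over the text and counts each n-gram. It assumes lowercase, whitespace-delimited tokens and no stemming; the UDF's actual tokenizer is not described here, so treat the class and its behavior as hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper (not part of charcuterie): counts n-grams over
// whitespace-delimited, lowercased tokens.
final class NGramSketch {
    static Map<String, Integer> ngramCounts(String text, int n) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        String normalized = text.trim().toLowerCase(Locale.ROOT);
        if (normalized.isEmpty() || n < 1) {
            return counts;
        }
        String[] tokens = normalized.split("\\s+");
        // Slide a window of n tokens across the document.
        for (int i = 0; i + n <= tokens.length; i++) {
            String gram = String.join(" ", Arrays.copyOfRange(tokens, i, i + n));
            counts.merge(gram, 1, Integer::sum);
        }
        return counts;
    }
}
```

With N = 2, the text "the cat and the cat" produces "the cat" twice and "cat and" and "and the" once each; with N = 3 every trigram occurs once.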
“net.exmachinatech.charcuterie.pig.eval.TotalTermCount”
tokenize a document, and return the total number of terms therein
define TotalTermCount net.exmachinatech.charcuterie.pig.eval.TotalTermCount() ;
term_counts = FOREACH document_set GENERATE
file_name ,
TotalTermCount( file_contents )
;
returns: integer, total number of (stemmed) terms in the specified document.
“net.exmachinatech.charcuterie.pig.eval.XMLExtractorSingleValue”
extract a single field from an XML document, using an XPath expression
define XMLExtractorSingleValue net.exmachinatech.charcuterie.pig.eval.XMLExtractorSingleValue() ;
-- we specify the desired XML element using an XPath expression
abstracts = FOREACH raw_data GENERATE
file_name ,
XMLExtractorSingleValue(
file_contents ,
'/article/meta/abstract/text()'
) AS abstract
;
returns: string, the result of the supplied XPath expression
“net.exmachinatech.charcuterie.pig.eval.UniqueTermCount”
tokenize a document, and return the number of unique terms therein
define UniqueTermCount net.exmachinatech.charcuterie.pig.eval.UniqueTermCount() ;
unique_term_counts = FOREACH document_set GENERATE
file_name ,
UniqueTermCount( file_contents )
;
returns: integer, total number of unique (stemmed) terms in the specified document.
“net.exmachinatech.charcuterie.pig.eval.TermFrequency”
tokenize a document, then list term frequencies
define TermFrequency net.exmachinatech.charcuterie.pig.eval.TermFrequency() ;
term_frequencies = FOREACH document_set GENERATE
file_name ,
TermFrequency( file_contents )
;
returns: bag of (term,frequency) tuples
“net.exmachinatech.charcuterie.pig.eval.EmailExtractor” (EXPERIMENTAL)
extract certain fields (from, to, CC, date, subject, body) from an e-mail
message. Note that this UDF has not been tested on a wide variety of
e-mail formats, and that it handles multipart messages in a rather naive
way.
-- You can also pass a date pattern suitable for SimpleDateFormat;
-- the UDF will then format the date accordingly.
define EmailExtractor net.exmachinatech.charcuterie.pig.eval.EmailExtractor() ;
-- assuming field "email_contents" contains the raw e-mail ...
emails = FOREACH raw_data GENERATE
file_name ,
FLATTEN( EmailExtractor( email_contents ) )
;
returns: tuple of chararray: ( from, to, cc, date, subject, body )
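For readers unfamiliar with SimpleDateFormat patterns, the sketch below shows how such a pattern reshapes an e-mail date header. It is only an illustration: the input pattern, the fixed UTC output zone, and the class name are assumptions, and the UDF's own date handling may differ.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper (not part of charcuterie): parses an RFC 822 style
// date header and re-renders it with a caller-supplied pattern.
final class DatePatternSketch {
    static String reformat(String headerDate, String outputPattern) {
        // Assumed input shape, e.g. "Tue, 3 Jun 2008 11:05:30 +0000".
        SimpleDateFormat input = new SimpleDateFormat("EEE, d MMM yyyy HH:mm:ss Z", Locale.US);
        SimpleDateFormat output = new SimpleDateFormat(outputPattern, Locale.US);
        // Render in UTC so the result does not depend on the machine's zone.
        output.setTimeZone(TimeZone.getTimeZone("UTC"));
        try {
            Date parsed = input.parse(headerDate);
            return output.format(parsed);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Unparseable date: " + headerDate, e);
        }
    }
}
```

For example, the pattern "yyyy-MM-dd HH:mm" turns "Tue, 3 Jun 2008 11:05:30 +0000" into "2008-06-03 11:05".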
“net.exmachinatech.charcuterie.pig.eval.StripHTML”
strip HTML formatting out of the supplied text
define StripHTML net.exmachinatech.charcuterie.pig.eval.StripHTML() ;
plain_text = FOREACH raw_data GENERATE StripHTML( html_content ) ;
returns: string of plain text
“net.exmachinatech.charcuterie.pig.eval.TermTokenizer”
stem and tokenize a document
define TermTokenizer net.exmachinatech.charcuterie.pig.eval.TermTokenizer() ;
terms = FOREACH document_set GENERATE
file_name ,
TermTokenizer( file_contents )
;
returns: bag of (term) tuples
“net.exmachinatech.charcuterie.pig.eval.HTMLExtractor”
extract multiple fields from an HTML document, using Jsoup Selector
expressions (http://jsoup.org/apidocs/org/jsoup/select/Selector.html)
-- You can also pass 'true' to the constructor to strip HTML tags from the results.
define HTMLExtractor net.exmachinatech.charcuterie.pig.eval.HTMLExtractor() ;
-- The UDF uses Jsoup Selector expressions. For details, see:
-- http://jsoup.org/apidocs/org/jsoup/select/Selector.html
--
-- Fetch the title and the content of the element with ID "interesting" ...
-- (The raw HTML is in the "html_content" field of "raw_data")
--
-- The UDF accepts an arbitrary number of Selector expressions.
-- If an expression matches multiple HTML elements, the UDF will concatenate
-- them into a single string value.
extracted = FOREACH raw_data GENERATE
file_name ,
FLATTEN( HTMLExtractor( html_content , 'title' , '#interesting' ) )
AS ( title , interesting )
;
returns: tuple of chararray, one member for each Selector expression.