Posts

Showing posts from February, 2012

Set operations in Apache Pig

Simple set operation examples

While writing Apache Pig scripts, I realized that in many cases the result I was after was attainable through a series of set operations performed on various relations. It’s not very clear from the documentation how to perform these operations. I googled a bit and found this PDF from Tufts University on ‘Advanced Pig’. In a nutshell, COGROUP is your friend. Here are some simple examples that show how you can perform set operations in Pig using COGROUP.

Let’s assume we have 2 relations TEST1 and TEST2. We load TEST1 from test1.txt containing:

1,aaaa
2,bbbb
3,cccc
4,dddd
5,eeee
6,ffff
7,gggg
8,hhhh
9,iiii

TEST1 = LOAD 's3://mybucket/test1.txt' USING PigStorage(',') as (
id: chararray,
value: chararray);
We load TEST2 from test2.txt containing:

7,ggggggg
8,hhhhhhh
9,iiiiiii
10,jjjjjjj
11,kkkkkkk

TEST2 = LOAD 's3://mybucket/test2.txt' USING PigStorage(',') as (
id: chararray,
value: chararray);

We use COGROUP to generate a new relation. COGROUP is sim…

Handling date/time in Apache Pig

A common usage scenario for Apache Pig is to analyze log files. Most log files contain a timestamp of some sort -- hence the need to handle dates and times in your Pig scripts. I'll present here a few techniques you can use.

Mail server logs

The first example I have is a Pig script which analyzes the time it takes for a mail server to send a message. The script is available here as a gist.

We start by registering the piggybank jar and defining the functions we'll need. I ran this using Elastic MapReduce, and all these functions are available in the piggybank that ships with EMR.

REGISTER file:/home/hadoop/lib/pig/piggybank.jar; DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT();             DEFINE CustomFormatToISO org.apache.pig.piggybank.evaluation.datetime.convert.CustomFormatToISO(); DEFINE ISOToUnix org.apache.pig.piggybank.evaluation.datetime.convert.ISOToUnix(); DEFINE DATE_TIME org.apache.pig.piggybank.evaluation.datetime.DATE_TIME(); DEFINE FORMAT_DT o…