The following are the prerequisites for setting up Pig and running Pig scripts.
- You should have the latest stable build of Hadoop up and running. To install Hadoop, see my previous blog article on Hadoop Setup.
Setting up Pig
1. Download a stable version of Pig from the Apache download mirrors. This tutorial uses pig-0.11.1; this release works with Hadoop 0.20.X, 1.X, 0.23.X and 2.X.
2. Copy the Pig tarball into the /usr/local/pig directory, creating the directory first if it does not exist.
sudo mkdir -p /usr/local/pig
sudo cp pig-0.11.1.tar.gz /usr/local/pig
3. Change to the /usr/local/pig directory by using this command:
cd /usr/local/pig
4. Unpack the compressed Pig archive in the /usr/local/pig directory.
sudo tar xvzf pig-0.11.1.tar.gz
5. Set PIG_HOME in $HOME/.bashrc so that it is set every time you log in. Add the following lines to it.
export PIG_HOME=<path_to_pig_home_directory>
e.g.
export PIG_HOME='/usr/local/pig/pig-0.11.1'
export PATH=$HADOOP_HOME/bin:$PIG_HOME/bin:$JAVA_HOME/bin:$PATH
6. Set the environment variable JAVA_HOME to point to the Java installation directory, which Pig uses internally.
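The unpack and environment steps above can be sketched as two small shell helpers. This is only an illustration: `unpack_pig` and `check_pig_env` are hypothetical names, not commands shipped with Pig, and the sketch assumes the tarball unpacks into a directory named after the archive (as pig-0.11.1.tar.gz does).

```shell
#!/bin/sh
# Sketch only: unpack_pig and check_pig_env are hypothetical helper names,
# not commands shipped with Pig.

# Unpack a Pig release tarball into a target directory and print the
# directory that would become PIG_HOME. Assumes the archive unpacks into
# a directory named after the tarball, minus the .tar.gz suffix.
unpack_pig() (
  tarball="$1"
  dest="$2"
  mkdir -p "$dest" || exit 1
  tar xzf "$tarball" -C "$dest" || exit 1
  printf '%s/%s\n' "$dest" "$(basename "$tarball" .tar.gz)"
)

# Report whether PIG_HOME and JAVA_HOME point at existing directories.
# Exits with status 0 when both are fine, 1 otherwise.
check_pig_env() (
  status=0
  if [ ! -d "${PIG_HOME:-}" ]; then
    echo "PIG_HOME is unset or not a directory"
    status=1
  fi
  if [ ! -d "${JAVA_HOME:-}" ]; then
    echo "JAVA_HOME is unset or not a directory"
    status=1
  fi
  exit "$status"
)
```

For the layout used in this tutorial, `unpack_pig pig-0.11.1.tar.gz /usr/local/pig` (run with sufficient permissions) would print /usr/local/pig/pig-0.11.1, which is the value to export as PIG_HOME. Running `check_pig_env` in a new login shell is a quick way to confirm that the .bashrc changes took effect.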
Pig has two modes of execution – local mode and MapReduce mode.
Local mode is usually used to verify and debug Pig queries and/or scripts on smaller datasets that a single machine can handle. It runs in a single JVM and accesses the local filesystem.
To run in local mode, use the following command:
$ pig -x local
grunt>
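Local mode can also run a script file instead of opening the Grunt shell. The sketch below writes a three-line Pig Latin script that lists the user names in /etc/passwd and then runs it with `pig -x local`. The helper names `write_id_script` and `run_local` and the file name id.pig are illustrative assumptions, not part of Pig.

```shell
#!/bin/sh
# Sketch only: write_id_script and run_local are hypothetical helper names.

# Write a small Pig Latin script to the given path. It loads a
# colon-delimited file from the local filesystem and dumps the first field.
write_id_script() {
  # The quoted heredoc delimiter keeps $0 literal; in Pig Latin, $0 refers
  # to the first field of each record.
  cat > "$1" <<'EOF'
A = LOAD '/etc/passwd' USING PigStorage(':');
B = FOREACH A GENERATE $0 AS id;
DUMP B;
EOF
}

# Run a script in local mode, but only if the pig launcher is on PATH.
run_local() {
  if command -v pig >/dev/null 2>&1; then
    pig -x local "$1"
  else
    echo "pig not found on PATH; check PIG_HOME and PATH" >&2
    return 127
  fi
}
```

A typical use would be `write_id_script id.pig` followed by `run_local id.pig`. Because local mode reads the local filesystem, the LOAD path refers to the machine's own /etc/passwd rather than a file in HDFS.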
MapReduce mode is the default mode. Pig translates the queries into MapReduce jobs, which requires access to a Hadoop cluster. To run in MapReduce mode, start Pig without a mode option:
$ pig
2013-10-28 11:39:44,767 [main] INFO org.apache.pig.Main - Apache Pig version 0.11.1 (r1459641) compiled Mar 22 2013, 02:13:53
2013-10-28 11:39:44,767 [main] INFO org.apache.pig.Main - Logging error messages to: /home/hduser/pig_1382985584762.log
2013-10-28 11:39:44,797 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /home/hduser/.pigbootup not found
2013-10-28 11:39:45,094 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://Hadoopmaster:54310
2013-10-28 11:39:45,592 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: Hadoopmaster:54311
The log output from Pig shows the filesystem and JobTracker it connected to. Grunt is an interactive shell for your Pig queries.
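If you want to confirm which cluster Pig attached to without reading the whole log, the two "Connecting to" lines can be filtered out. The following is a minimal sketch that assumes the message wording shown in the log above; `pig_endpoints` is a hypothetical helper name, and other Pig versions may word these messages differently.

```shell
#!/bin/sh
# Sketch only: pig_endpoints is a hypothetical helper name.
# Reads Pig startup log lines on stdin and prints the filesystem and
# JobTracker addresses, assuming the message text shown in this article.
pig_endpoints() {
  sed -n \
    -e 's/.*Connecting to hadoop file system at: */filesystem=/p' \
    -e 's/.*Connecting to map-reduce job tracker at: */jobtracker=/p'
}
```

For example, `pig_endpoints < /home/hduser/pig_startup.log` would print one `filesystem=` line and one `jobtracker=` line, where pig_startup.log is a placeholder for a file in which you saved the startup output.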