Installing single node Hadoop 2.4.1 on Ubuntu

In this tutorial you will learn, step by step, how to set up a Hadoop single-node cluster.

Steps:

  1. Installing Java.
  2. Adding dedicated Hadoop system user.
  3. Configuring SSH access.
  4. Disabling IPv6.
  5. Hadoop Installation.

1. Installing Java:

Hadoop requires Java 1.7 or later.

a. Download the latest Oracle Java (JDK) for Linux from the Oracle website.

b. Unpack the compressed Java binaries, using the command:

    sudo tar xvzf jdk-7u67-linux-x64.tar.gz

c. Create a Java directory under /usr/local/ using mkdir and change into /usr/local/Java with these commands:

    sudo mkdir -p /usr/local/Java
    cd /usr/local/Java

d. Copy the unpacked Oracle Java binaries into the /usr/local/Java directory (run this from the directory where you extracted the archive).

    sudo cp -r jdk1.7.0_67 /usr/local/Java

e. Edit the system PATH file /etc/profile and add the following system variables to your system path.

    sudo nano /etc/profile

f. Scroll down to the end of the file using your arrow keys and add the following lines below to the end of your /etc/profile file:

    JAVA_HOME=/usr/local/Java/jdk1.7.0_67
    PATH=$PATH:$HOME/bin:$JAVA_HOME/bin
    export JAVA_HOME
    export PATH

g. Inform your Ubuntu Linux system where your Oracle Java JDK/JRE is located. This will tell the system that the new Oracle Java version is available for use.

    sudo update-alternatives --install "/usr/bin/javac" "javac" "/usr/local/Java/jdk1.7.0_67/bin/javac" 1
    sudo update-alternatives --set javac /usr/local/Java/jdk1.7.0_67/bin/javac
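The commands above register only the javac compiler. If you also want the java launcher itself to come from the new JDK, the same update-alternatives pattern can be applied (paths assumed to match the layout used above):

    sudo update-alternatives --install "/usr/bin/java" "java" "/usr/local/Java/jdk1.7.0_67/bin/java" 1
    sudo update-alternatives --set java /usr/local/Java/jdk1.7.0_67/bin/java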

h. Reload your system-wide /etc/profile by typing the following command:

    . /etc/profile

Test whether Oracle Java was installed correctly on your system:

    java -version
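If the installation succeeded, the output reports the installed version; it should look roughly like the following (the exact build details will differ):

    java version "1.7.0_67"
    Java(TM) SE Runtime Environment (build ...)
    Java HotSpot(TM) 64-Bit Server VM (build ..., mixed mode)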

2. Adding dedicated Hadoop system user.

We will use a dedicated Hadoop user account for running Hadoop. While this is not required, it is recommended because it helps separate the Hadoop installation from other software applications and user accounts running on the same machine.

a. Adding group:

    sudo addgroup hadoop

b. Creating a user and adding the user to a group:

    sudo adduser --ingroup hadoop hduser
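You can confirm that the account was created and placed in the hadoop group with:

    id hduser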

3. Configuring SSH access:

Hadoop uses SSH key-based authentication so that the master node can log in to the slave nodes (and the secondary node) to start and stop them; the same applies to the local machine if you want to run Hadoop on it. For our single-node setup, we therefore need to configure SSH access to localhost for the hduser user we created in the previous section.

    sudo apt-get install openssh-server

This installs the OpenSSH server. Before continuing, make sure SSH is up and running on your machine and configured to allow public key authentication.

Generate an SSH key for the hduser user:
a. Log in as hduser (for example, sudo su - hduser).
b. Run this key generation command:

    ssh-keygen -t rsa -P ""

c. It will ask for the file name in which to save the key; just press Enter so that the key is generated under /home/hduser/.ssh.

d. Enable SSH access to your local machine with this newly created key.

    cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

e. The final step is to test the SSH setup by connecting to your local machine with the hduser user.

    ssh hduser@localhost

4. Disabling IPv6.

We need to disable IPv6 because Hadoop binds to the 0.0.0.0 address in various configurations, and on Ubuntu this can resolve to IPv6 addresses and cause problems. You will need to edit the sysctl configuration with root privileges:

    sudo gedit /etc/sysctl.conf

Add the following lines to the end of the file:

    #disable ipv6
    net.ipv6.conf.all.disable_ipv6 = 1
    net.ipv6.conf.default.disable_ipv6 = 1
    net.ipv6.conf.lo.disable_ipv6 = 1

Then reboot your computer so the change takes effect.
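After the reboot you can verify that IPv6 is disabled by checking the kernel setting (a value of 1 means disabled):

    cat /proc/sys/net/ipv6/conf/all/disable_ipv6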

5. Hadoop Installation.

a. Go to the Apache Hadoop downloads page and download Hadoop 2.4.1 (or another stable release).

b. Unpack the compressed Hadoop file using this command:

    tar -xvzf hadoop-2.4.1.tar.gz

c. Move hadoop-2.4.1 to the /usr/local/hadoop directory using the following command:

    sudo mv hadoop-2.4.1 /usr/local/hadoop

d. Make sure to change the owner of all the files to the hduser user and hadoop group by using this command:

    sudo chown -R hduser:hadoop /usr/local/hadoop
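To confirm the ownership change, list the directory; the owner and group columns should now show hduser and hadoop:

    ls -ld /usr/local/hadoop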

Configuring Hadoop:

The following files need to be edited to configure the single-node Hadoop cluster:

a. yarn-site.xml
b. core-site.xml
c. mapred-site.xml
d. hdfs-site.xml
e. Update $HOME/.bashrc

These files are located in the Hadoop configuration directory:

    cd /usr/local/hadoop/etc/hadoop

a. yarn-site.xml:

<configuration>
  <!-- Site specific YARN configuration properties -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>

b. core-site.xml:

    vi core-site.xml

Add the following entry to the file, then save and quit it:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

c. mapred-site.xml:
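In Hadoop 2.4.1 this file does not exist by default; it is usually created by copying the bundled template first (file name taken from the standard distribution layout):

    cp mapred-site.xml.template mapred-site.xml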

    vi mapred-site.xml

Add the following entry to the file, then save and quit it:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

d. hdfs-site.xml:

    vi hdfs-site.xml

Create two directories to be used by the NameNode and DataNode (if HADOOP_HOME is not set in your shell yet, substitute /usr/local/hadoop):

    mkdir -p $HADOOP_HOME/yarn_data/hdfs/namenode
    mkdir -p $HADOOP_HOME/yarn_data/hdfs/datanode

Add the following entry to the file, then save and quit it:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/local/hadoop/yarn_data/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/usr/local/hadoop/yarn_data/hdfs/datanode</value>
  </property>
</configuration>

e. Update $HOME/.bashrc

Go back to the hduser home directory and edit the .bashrc file:

    vi .bashrc

Add the following lines to the end of the file.

# Set Hadoop-related environment variables
export HADOOP_PREFIX=/usr/local/hadoop
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=${HADOOP_HOME}
export HADOOP_COMMON_HOME=${HADOOP_HOME}
export HADOOP_HDFS_HOME=${HADOOP_HOME}
export YARN_HOME=${HADOOP_HOME}
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
# Native Path
export HADOOP_COMMON_LIB_NATIVE_DIR=${HADOOP_PREFIX}/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_PREFIX/lib"
#Java path
export JAVA_HOME=/usr/local/Java/jdk1.7.0_67
# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin:$JAVA_HOME/bin:$HADOOP_HOME/sbin
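
Reload .bashrc so the new variables take effect in the current shell:

    source ~/.bashrc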

Formatting and Starting/Stopping the HDFS filesystem via the NameNode:

The first step in starting up your Hadoop installation is formatting the Hadoop filesystem, which is implemented on top of the local filesystem of your cluster. You need to do this the first time you set up a Hadoop cluster. Do not format a running Hadoop filesystem, as you will lose all the data currently in the cluster (in HDFS). To format the filesystem (which simply initializes the directory specified by the dfs.namenode.name.dir property), run the following command:

    hadoop namenode -format
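If the format succeeds, the log output typically ends with a message like the following (path taken from the dfs.namenode.name.dir value configured above; exact wording can vary):

    INFO common.Storage: Storage directory /usr/local/hadoop/yarn_data/hdfs/namenode has been successfully formatted.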

Start HDFS by running the following command:

    start-dfs.sh

Stop HDFS by running the following command:

    stop-dfs.sh
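
Since yarn-site.xml was configured above, the YARN daemons can be started and stopped in the same way, and jps (shipped with the JDK) can be used to verify which daemons are running; the daemon list in the comment below is what a typical single-node setup runs:

    start-yarn.sh
    jps
    # typically lists: NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager, Jps
    stop-yarn.sh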

Hadoop Web Interfaces:

    HDFS NameNode status and health: http://localhost:50070

    HDFS Secondary NameNode status: http://localhost:50090
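
If the YARN daemons are running, the ResourceManager web UI is typically available as well:

    YARN ResourceManager status: http://localhost:8088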
