Setting up Hadoop: MapReduce, HDFS and YARN in standalone and pseudo-distributed mode

Nidhin Mahesh
Jul 7, 2017

In this story, I will introduce you to:

1. Configuring Hadoop on your Linux system

2. Prerequisites

2.1. Standalone mode

2.2. Pseudo-Distributed mode

1. Configuring Hadoop on your Linux system

The best way to learn is by actually implementing things yourself. Hadoop can be installed in three different modes: Standalone mode, Pseudo-Distributed mode and Fully-Distributed mode. Standalone mode is the default mode of operation of Hadoop and runs on a single node (a node is your machine); HDFS and YARN do not run in standalone mode. Pseudo-Distributed mode sits between standalone mode and a fully distributed, production-level cluster. It is used to simulate an actual cluster: it simulates two nodes, a master and a slave, by running separate JVM processes, which gives you a fully fledged test environment. HDFS is used for storage, taking some portion of your disk space, and YARN runs to manage resources on this Hadoop installation. Fully-Distributed mode runs on a cluster of machines, and lots of configuration parameters have to be set up for a production system.

2. Prerequisites — Java installed on your system

Hadoop is built using Java, and Java is required to run MapReduce code. You can check whether you have Java installed on your machine. The Java version should be Java 7 (1.7) or higher for the latest Hadoop releases.

Open terminal and fire:

$ java -version

If you have already installed Java, move to 2.1. If not, follow the steps:

$ sudo apt-get update
$ sudo apt-get install default-jdk

Now check the version once again

$ java -version
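If the install succeeded, the output should look roughly like this (the exact version and build strings depend on which JDK package your distribution installed; the lines below are only an illustration):

openjdk version "1.8.0_131"
OpenJDK Runtime Environment (build 1.8.0_131-b11)
OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)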

2.1. Standalone mode

Download the latest version of Hadoop from the Apache Hadoop releases page (hadoop.apache.org). In this tutorial I use Hadoop 2.7.3.

Go to the directory where you downloaded the compressed Hadoop file and extract it using the terminal:

$ tar -xzvf hadoop-2.7.3.tar.gz

Change the version number if needed to match the Hadoop version you downloaded. Now we move the extracted files to /usr/local, a suitable location for local installs.

$ sudo mv hadoop-2.7.3 /usr/local/hadoop

Now go to the Hadoop distribution directory using terminal

$ cd /usr/local/hadoop

Let's see what's inside the Hadoop folder:

etc — holds the configuration files for the Hadoop environment.

bin — contains useful executables, such as the hadoop command.

share — has the JARs required when you write MapReduce jobs; it contains the Hadoop libraries.

The hadoop command in the bin folder is used to run jobs in Hadoop:

$ bin/hadoop 

The jar subcommand is used to run MapReduce jobs on the Hadoop cluster:

$ bin/hadoop jar

Now we will run an example MapReduce job to ensure that our standalone install works.

Create an input directory to place the input files, and we will run the MapReduce command on it. The Hadoop configuration and command files ship with the install, and we will use them as text-file input for our MapReduce job.

$ mkdir input
$ cp etc/hadoop/* input/

The job is run using the bin/hadoop command. The jar subcommand indicates that the MapReduce operation is specified in a Java archive. Here we will use the hadoop-mapreduce-examples JAR file, which comes with the installation; the JAR name differs based on the version you installed. Now move to your Hadoop install directory and type:

$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar

If you are not in the correct directory, you will get an error saying "Not a valid JAR". If the issue persists, check that the location of the JAR file is correct for your system.

Running the example checks that standalone mode is working.

This shows a MapReduce job that ran successfully on the standalone setup.
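To go one step further and run an actual job end to end, you can try the grep example from the standard Hadoop single-node setup guide: it reads the files in the input directory, extracts every string matching the given regular expression and writes the results into an output directory (remove output first if it already exists):

$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar grep input output 'dfs[a-z.]+'
$ cat output/*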

2.2. Pseudo-Distributed mode.

Prerequisites

Now we will set up Hadoop in a Pseudo-Distributed configuration. This is the same as running Hadoop on a real cluster. At least 20% of your hard-disk space must be free for Hadoop to run properly. Also, SSH (Secure Shell) must be installed. We can check whether it is set up by running:

$ ssh localhost

The master node communicates with the slave nodes very frequently over the SSH protocol. In Pseudo-Distributed mode, only one node exists (your machine) and the master-slave interaction is simulated by JVM processes. Since communication is so frequent, SSH should be passwordless, with authentication done using a public key. The command above may not work if SSH is not installed on your machine; use this command to install it:

$ sudo apt-get install openssh-server
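You will also need a passphrase-less key pair so that ssh localhost works without prompting for a password. The usual OpenSSH commands for this (as in the standard Hadoop single-node setup) are:

$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys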

To disable password authentication, open the SSH daemon configuration (you need root privileges to edit this file):

$ sudo nano /etc/ssh/sshd_config

Edit the following line to "no" as shown:

PasswordAuthentication no

Restart SSH to apply the settings (on Ubuntu and Debian the service is called ssh, not sshd):

$ sudo /etc/init.d/ssh restart

Configuration:

Note that all of the configuration files edited below are in /usr/local/hadoop/etc/hadoop.

1. hadoop-env.sh

Hadoop needs to know where Java is installed on your machine before it can function. To find the Java path on your machine:

$ readlink -f /usr/bin/java | sed "s:bin/java::"

We will set JAVA_HOME to point to our Java install directory, along with another environment variable for Hadoop. Go to the directory /usr/local/hadoop/etc/hadoop.

Open hadoop-env.sh using the nano editor in the terminal:

$ nano hadoop-env.sh

Add this environment variable to the file

export JAVA_HOME=/<the path you found from above command>/

Also add the Hadoop prefix after JAVA_HOME. This variable is required by Hadoop to start up and run in this mode.

export HADOOP_PREFIX=/usr/local/hadoop/
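As a concrete illustration, on a typical Ubuntu machine with the default OpenJDK 8 package the two added lines could look like this (your Java path may differ; use whatever the readlink command above printed):

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
export HADOOP_PREFIX=/usr/local/hadoop/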

Now save and exit nano using Ctrl+X, then enter Y to save the buffer under the existing name.

2. core-site.xml

Now we will edit core-site.xml in the same directory. In the terminal

$ nano core-site.xml

add the following inside the <configuration> </configuration> tag

<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
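For reference, after this edit the body of core-site.xml should look like the following (the empty <configuration> element already exists in the file; you are only adding the property inside it):

<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>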

3. hdfs-site.xml

Now we will configure HDFS using the file hdfs-site.xml (open it with nano in the same way). Paste the following inside the <configuration> </configuration> tag:

<property>
<name>dfs.replication</name>
<value>1</value>
</property>

dfs.replication controls how many copies of each block HDFS keeps. We set it to 1 because there is only one node here and nothing to replicate to; on a real cluster the default is 3.

4. mapred-site.xml

This file does not exist by default, so we create it in the same directory (in Hadoop 2.7.x the distribution ships a mapred-site.xml.template you can copy as a starting point):

$ nano mapred-site.xml

Paste the following lines into mapred-site.xml:

<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>

With this property, we configure YARN as the resource negotiator for MapReduce.

5. yarn-site.xml

Finally, we configure YARN, the resource negotiator, itself.

$ nano yarn-site.xml

Paste the following inside the <configuration> </configuration> tag

<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>

Now we are ready to get started.

Format NameNode

From the Hadoop install directory (/usr/local/hadoop), format the NameNode using the command below. The NameNode is the master node in any cluster and keeps track of all the other nodes in the cluster where the processes run; we format it to have a fresh start. A NameNode is to a cluster what a table of contents is to a book!

$ bin/hdfs namenode -format

If there are any errors in the output while executing the command, read through them, fix the problem and run the command again.

Start the master and slave nodes

Once the above step is successful, move to the /usr/local/hadoop/sbin directory and fire the following command:

$ ./start-dfs.sh

You can check whether the NameNode is running in your browser at

localhost:50070

The jps command lists the running Java processes. This is used to check whether HDFS is running.

$ jps
NameNode, SecondaryNameNode and DataNode are the processes associated with HDFS.
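As an optional sanity check that HDFS is usable, you can create your user directory and list it from the Hadoop install directory, following the standard single-node setup flow (replace <username> with your own login name):

$ bin/hdfs dfs -mkdir -p /user/<username>
$ bin/hdfs dfs -ls /user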

Start YARN

YARN is our resource negotiator here. We can start it by running this command from the same sbin directory:

$ ./start-yarn.sh

You can recheck whether YARN is running properly by running the jps command again.

$ jps
NodeManager and ResourceManager are the two new processes, associated with resource management on the node.
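Like the NameNode, the ResourceManager also exposes a web UI, which in Hadoop 2.x listens on port 8088 by default; you can open it in your browser at

localhost:8088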

Congrats! You have successfully set up Hadoop in Pseudo-Distributed mode.

3. Create a simple Java program to run on Hadoop and MapReduce

to be continued…
