Setting up Hadoop: MapReduce, HDFS and YARN in Standalone and Pseudo-Distributed mode
In this story, I will introduce you to:
1. Configuring Hadoop on your Linux system
2. Prerequisites
2.1. Standalone mode
2.2. Pseudo-Distributed mode
3. Creating a simple Java program to run on Hadoop and MapReduce

1. Configuring Hadoop on your Linux system
The best way to learn is by actually implementing things on your own. Hadoop can be installed in three different modes: Standalone mode, Pseudo-Distributed mode and Fully-Distributed mode. Standalone mode is Hadoop's default mode of operation and runs on a single node (a node is your machine); HDFS and YARN do not run in standalone mode. Pseudo-Distributed mode sits between standalone mode and a fully distributed production cluster. It is used to simulate an actual cluster: it simulates two nodes, a master and a slave, by running separate JVM processes, giving you a fully-fledged test environment. HDFS handles storage using a portion of your disk space, and YARN runs to manage resources on this Hadoop installation. Fully-Distributed mode runs on a cluster of machines, and many configuration parameters have to be set up for a production system.
2. Prerequisites: Java installed on your system
Hadoop is built with Java, and Java is required to run MapReduce code. You can check whether you already have Java installed on your machine. The Java version should be Java 7 (i.e. Java 1.7) or higher for recent Hadoop releases.
Open a terminal and fire:
$ java -version
If you have already installed Java, move to 2.1. If not, follow the steps:
$ sudo apt-get update
$ sudo apt-get install default-jdk
Now check the version once again
$ java -version
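If Java is installed, you should see version information along these lines; the exact version and vendor strings will vary with your distribution, so treat this only as an illustration:
openjdk version "1.8.0_292"
OpenJDK Runtime Environment (build ...)
OpenJDK 64-Bit Server VM (build ..., mixed mode)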
2.1. in Standalone mode.
Download the latest version of Hadoop from the Apache Hadoop releases page (https://hadoop.apache.org/releases.html). In this tutorial I use Hadoop 2.7.3.
Go to the directory where you downloaded the compressed Hadoop file and unzip it using the terminal:
$ tar -xzvf hadoop-2.7.3.tar.gz
Change the version number if needed to match the Hadoop version you downloaded. Now we move the extracted files to /usr/local, a suitable location for local installs.
$ sudo mv hadoop-2.7.3 /usr/local/hadoop
Now go to the Hadoop distribution directory in your terminal:
$ cd /usr/local/hadoop
Let's see what's inside the Hadoop folder.
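A quick listing of the top-level directories (exact contents may vary slightly between releases):
$ ls
bin  etc  include  lib  libexec  LICENSE.txt  NOTICE.txt  README.txt  sbin  share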
etc - has the configuration files for the Hadoop environment.
bin - includes various useful commands, such as the hadoop command itself.
share - has the jars you need when you write MapReduce jobs; it holds the Hadoop libraries.
The hadoop command in the bin folder is used to run jobs in Hadoop:
$ bin/hadoop
The jar subcommand is used to run MapReduce jobs on the Hadoop cluster:
$ bin/hadoop jar
Now we will run an example MapReduce job to ensure that our standalone install works.
Create an input directory to hold the input files; we will run the MapReduce command on it. The Hadoop configuration and command files make convenient text input for our MapReduce job:
$ mkdir input
$ cp etc/hadoop/* input/
The job is run using the bin/hadoop command. The jar argument indicates that the MapReduce operation is packaged in a Java archive. Here we use the hadoop-mapreduce-examples jar file that comes with the installation; the jar name differs based on the version you are installing. Now move to your Hadoop install directory and type:
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar
If you are not in the correct directory, you will get an error saying "Not a valid JAR". If this issue persists, check that the location of the jar file is correct for your system.
Run without arguments, the examples jar lists the example programs bundled with it, which confirms that MapReduce works on our standalone setup.
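To actually run one of those examples end to end, try the grep example from the standard Hadoop quickstart: it scans the input files for strings matching a regular expression and writes the matches to an output directory (which must not exist yet). From the install directory:
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar grep input output 'dfs[a-z.]+'
$ cat output/*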
2.2. Pseudo-Distributed mode.
Prerequisite
Now we will set up Hadoop in Pseudo-Distributed mode. This closely simulates running Hadoop on a real cluster. At least 20% of your hard disk space must be free for Hadoop to run properly. Also, SSH (Secure Shell) must be installed. We can check it with:
$ ssh localhost
The master node communicates with the slave nodes very frequently over the SSH protocol. In Pseudo-Distributed mode only one node exists (your machine), and the master-slave interaction is simulated by JVM processes. Since communication is so frequent, SSH should be passwordless, with authentication done using a public key. The command above may not work if SSH is not installed on your machine; use this command to install it:
$ sudo apt-get install openssh-server
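Next, generate a key pair and authorize it for your own account so that ssh localhost works without a password. These are the standard commands for a passwordless-SSH setup (the empty passphrase passed via -P '' is what makes the login passwordless):
$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys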
To disable password authentication, open the SSH daemon configuration:
$ sudo nano /etc/ssh/sshd_config
Edit the following line so that it reads:
PasswordAuthentication no
Restart SSH to apply the settings:
$ sudo service ssh restart
Configuration:
Note that most of the configuration files are in etc/hadoop.
1. hadoop-env.sh
Hadoop needs to know where Java is installed on your machine before it can function. To find the Java path on your machine:
$ readlink -f /usr/bin/java | sed "s:bin/java::"
We will set JAVA_HOME to point to our Java install directory, along with other environment variables Hadoop needs. Go to the directory /usr/local/hadoop/etc/hadoop and open hadoop-env.sh using the nano editor in the terminal:
$ nano hadoop-env.sh
Add this environment variable to the file
export JAVA_HOME=/<the path you found from above command>/
Below JAVA_HOME, also add the Hadoop prefix. Hadoop requires this variable to start up and run in this mode:
export HADOOP_PREFIX=/usr/local/hadoop/
Now save and exit nano using Ctrl+X, then enter Y to save the buffer under the existing name.
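For reference, on a typical Ubuntu box with OpenJDK 8 the two lines might look like the following; treat the Java path as an illustration and use whatever the readlink command printed on your machine:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/
export HADOOP_PREFIX=/usr/local/hadoop/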
2. core-site.xml
Now we will edit core-site.xml in the same directory. In the terminal:
$ nano core-site.xml
Add the following inside the <configuration> </configuration> tag. It sets HDFS on localhost, port 9000, as the default filesystem:
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
3. hdfs-site.xml
Now we will configure HDFS using the file hdfs-site.xml. Open it with nano and paste the following inside the <configuration> </configuration> tag:
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
The dfs.replication property controls how many copies of each HDFS block are kept. The default is 3, but since this cluster has only one node, we set it to 1: every block is stored exactly once.
4. mapred-site.xml
We create a new file in the same directory using the command:
$ nano mapred-site.xml
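Note that Hadoop 2.x ships a template for this file in the same directory; if you prefer, copy it instead of starting from an empty file:
$ cp mapred-site.xml.template mapred-site.xml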
Paste the following lines into mapred-site.xml:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
This property tells MapReduce to run on YARN, the resource negotiator.
5. yarn-site.xml
Finally, we configure the resource negotiator itself, YARN.
$ nano yarn-site.xml
Paste the following inside the <configuration> </configuration> tag. The mapreduce_shuffle auxiliary service lets the NodeManager serve map outputs to the reduce tasks during the shuffle phase:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
Now we are ready to get started.
Format NameNode
The NameNode is the master node in a cluster: it keeps track of all the other nodes in the cluster where the processes run. A NameNode is to a cluster what a table of contents is to a book! We format it to have a fresh start. From the Hadoop install directory, run:
$ bin/hdfs namenode -format
If the command prints any errors, read through the message, fix the problem, and run it again.
Start the master and slave nodes
Once the above step is successful, move to the /usr/local/hadoop/sbin directory and fire the following command:
$ ./start-dfs.sh
You can check that the NameNode is running by pointing your browser at:
localhost:50070
The jps command lists the running Java processes; we use it to check that HDFS is up:
$ jps
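If HDFS started cleanly, the list should include the NameNode, DataNode and SecondaryNameNode daemons, roughly like this (the process IDs will differ on your machine):
4209 NameNode
4351 DataNode
4529 SecondaryNameNode
4645 Jps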
Start YARN
YARN is our resource negotiator here. We can start it by running this command in the same sbin directory:
$ ./start-yarn.sh
You can recheck that YARN is running properly by running the jps command again; ResourceManager and NodeManager should now appear in the list. The YARN web UI is also available at localhost:8088.
$ jps
Congrats! You have successfully set up Hadoop in Pseudo-Distributed mode.
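As a quick end-to-end test, you can rerun the grep example from the standalone section, this time reading from and writing to HDFS. The commands below follow the standard Hadoop quickstart; replace <username> with your own user name and run them from /usr/local/hadoop:
$ bin/hdfs dfs -mkdir -p /user/<username>/input
$ bin/hdfs dfs -put etc/hadoop/*.xml input
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar grep input output 'dfs[a-z.]+'
$ bin/hdfs dfs -cat output/*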
3. Create a simple Java program to run on Hadoop and MapReduce.
to be continued…