How to Install and Configure Apache Hadoop on Debian 11

Apache Hadoop is an open-source, Java-based software platform that manages data processing and storage for big data applications. Hadoop works by distributing large data sets and analytics jobs across nodes in a computing cluster, breaking them down into smaller workloads that can be run in parallel. Hadoop can process structured and unstructured data and scale up reliably from a single server to thousands of machines.

Update the System

Update the system packages to their latest versions with the following commands, then reboot the system once the upgrade completes.

apt-get update -y
apt-get upgrade -y
reboot

Install Java

Apache Hadoop is a Java-based application, so we need to install Java with the following command.

apt-get install default-jdk default-jre -y

Verify the JAVA version once the installation is done.

java --version

Output:

root@server:~# java --version
openjdk 11.0.12 2021-07-20
OpenJDK Runtime Environment (build 11.0.12+7-post-Debian-2)
OpenJDK 64-Bit Server VM (build 11.0.12+7-post-Debian-2, mixed mode, sharing)

Create a Hadoop User

Create a dedicated Hadoop user and set up passwordless SSH for it. Run the following command to create the user.

adduser hadoop

Output:

root@server:~# adduser hadoop
Adding user `hadoop' ...
Adding new group `hadoop' (1001) ...
Adding new user `hadoop' (1001) with group `hadoop' ...
Creating home directory `/home/hadoop' ...
Copying files from `/etc/skel' ...

Switch to the hadoop user once it has been created.

su - hadoop

Run the following command to generate the SSH key.

ssh-keygen -t rsa

Output:

hadoop@server:~$ ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa): /home/hadoop/.ssh/id_rsa
Created directory '/home/hadoop/.ssh'.
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /home/hadoop/.ssh/id_rsa
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub
The key fingerprint is:
SHA256:2/sLwfVmXrgJmGHiijhZnu6mR6p3P3lLSGgq9RtksU8 hadoop@server
The key's randomart image is:
+---[RSA 3072]----+
|                 |
|                 |
|    .   . o .    |
|     + . + = . . |
|  . B E S = . = .|
| . @.* o o . = + |
|. =o* +.o o   +  |
| .oo+oo..  o     |
|.o.*+..o....o.   |
+----[SHA256]-----+

Next, append the public key to the hadoop user's authorized_keys file and set the correct permissions on it.

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys

Verify the passwordless SSH connection with the following command.

ssh your-server-ip
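On a single-node setup, Hadoop's start scripts also connect to localhost over SSH, so it is worth confirming that this connection works without a password as well (exit returns you to the original shell).

ssh localhost
exit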

Install Hadoop

Switch to the hadoop user and download Hadoop 3.3.1 using the following "wget" command.

su - hadoop

wget https://downloads.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz

Extract the downloaded "tar" file with the following command.

tar -xvzf hadoop-3.3.1.tar.gz 

Next, switch back to the root user for the commands below, and move the extracted files to a dedicated directory.

su root
cd /home/hadoop
mv hadoop-3.3.1 /usr/local/hadoop

The /home/hadoop path will differ if you created the user with a different name.

Create a log directory to store the Apache Hadoop logs.

mkdir /usr/local/hadoop/logs

Change the ownership of the /usr/local/hadoop directory to the hadoop user and switch back to it.

chown -R hadoop:hadoop /usr/local/hadoop
su - hadoop

Next, open the hadoop user's ".bashrc" file to define the Hadoop environment variables.

nano ~/.bashrc

And add the following configuration to the end of the file.

export HADOOP_HOME=/usr/local/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"

Run the following command to activate the added environment variables.

source ~/.bashrc
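As an optional sanity check, confirm that the new variables are active in the current shell.

echo $HADOOP_HOME
which hadoop

The first command should print /usr/local/hadoop and the second /usr/local/hadoop/bin/hadoop.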

Configure Hadoop

If you are new to Hadoop and want to explore basic commands or test applications, you can configure Hadoop on a single node.

Configure Java Environment Variables

Next, you will need to define Java environment variables in hadoop-env.sh to configure YARN, HDFS, MapReduce, and Hadoop-related project settings.

Locate the correct path of Java by using the following command.

which javac

Output:

hadoop@server:~$ which javac
/bin/javac

Next, find the OpenJDK directory with the following command.

readlink -f /usr/bin/javac

Output:

hadoop@server:~$ readlink -f /usr/bin/javac
/usr/lib/jvm/java-11-openjdk-amd64/bin/javac

Next, edit the hadoop-env.sh file and define the Java path.

nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh

And add the following configuration to the end of the file.

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64 
export HADOOP_CLASSPATH+=" $HADOOP_HOME/lib/*.jar"

Next, download the javax.activation API jar into Hadoop's lib directory by running the following commands. (As of Java 11, this API is no longer bundled with the JDK, so Hadoop needs it supplied separately.)

cd /usr/local/hadoop/lib
wget https://jcenter.bintray.com/javax/activation/javax.activation-api/1.2.0/javax.activation-api-1.2.0.jar

Output:

hadoop@server:/usr/local/hadoop/lib$ wget https://jcenter.bintray.com/javax/activation/javax.activation-api/1.2.0/javax.activation-api-1.2.0.jar
--2021-08-09 15:45:15--  https://jcenter.bintray.com/javax/activation/javax.activation-api/1.2.0/javax.activation-api-1.2.0.jar
Resolving jcenter.bintray.com (jcenter.bintray.com)... 34.95.74.180
Connecting to jcenter.bintray.com (jcenter.bintray.com)|34.95.74.180|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 56674 (55K) [application/java-archive]
Saving to: ‘javax.activation-api-1.2.0.jar’

javax.activati 100%  55.35K  --.-KB/s    in 0.003s        

2021-08-09 15:45:15 (19.1 MB/s) - ‘javax.activation-api-1.2.0.jar’ saved [56674/56674]
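Note that jcenter.bintray.com has since been sunset and may no longer serve downloads. If the command above fails, the same artifact is published on Maven Central and can be fetched from there instead.

wget https://repo1.maven.org/maven2/javax/activation/javax.activation-api/1.2.0/javax.activation-api-1.2.0.jar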

Next, verify the Hadoop version.

hadoop version

Output:

hadoop@server:~$ hadoop version
Hadoop 3.3.1
Source code repository https://github.com/apache/hadoop.git -r a3b9c37a397ad4188041dd80621bdeefc46885f2
Compiled by ubuntu on 2021-06-15T05:13Z
Compiled with protoc 3.7.1
From source with checksum 88a4ddb2299aca054416d6b7f81ca55
This command was run using /usr/local/hadoop/share/hadoop/common/hadoop-common-3.3.1.jar

Configure core-site.xml File

To set up Hadoop, you need to specify the URL of your NameNode as follows.

vi $HADOOP_HOME/etc/hadoop/core-site.xml

And add the following configuration to the end of the file.

<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://0.0.0.0:9000</value>
        <description>The default file system URI</description>
    </property>
</configuration>
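For reference, fs.default.name is the legacy name of this property. Hadoop 3.x still accepts it as a deprecated alias, but the current equivalent is:

<property>
    <name>fs.defaultFS</name>
    <value>hdfs://0.0.0.0:9000</value>
</property>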

Configure hdfs-site.xml File

You need to define where HDFS stores the node metadata, the fsimage file, and the edit log. Configure the file by defining the NameNode and DataNode storage directories.

Before editing the configuration, create the directories that will hold this data.

mkdir -p /home/hadoop/hdfs/{namenode,datanode}
chown -R hadoop:hadoop /home/hadoop/hdfs

Edit the hdfs-site.xml file and define the location of the directory as follows.

vi $HADOOP_HOME/etc/hadoop/hdfs-site.xml

And add the following configuration to the end of the file.

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>

    <property>
        <name>dfs.name.dir</name>
        <value>file:///home/hadoop/hdfs/namenode</value>
    </property>

    <property>
        <name>dfs.data.dir</name>
        <value>file:///home/hadoop/hdfs/datanode</value>
    </property>
</configuration>
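A replication factor of 1 suits this single-node setup, since there are no other DataNodes to hold additional copies of each block. Also note that, as with fs.default.name above, dfs.name.dir and dfs.data.dir are deprecated aliases that Hadoop 3.x still honors; the current property names are used the same way:

<property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///home/hadoop/hdfs/namenode</value>
</property>

<property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///home/hadoop/hdfs/datanode</value>
</property>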

Configure mapred-site.xml File

Use the following command to access the mapred-site.xml file and define MapReduce values.

vi $HADOOP_HOME/etc/hadoop/mapred-site.xml

And add the following configuration to the end of the file.

<configuration>
 <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
 </property>
</configuration>

Configure yarn-site.xml File

Next, edit the yarn-site.xml file and define the YARN-related settings.

vi $HADOOP_HOME/etc/hadoop/yarn-site.xml

And add the following configuration to the end of the file.

<configuration>
 <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
 </property>
</configuration>

Format HDFS NameNode

It is important to format the NameNode before starting Hadoop services for the first time.

hdfs namenode -format
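The format step prints a long log. A successful run should include a line similar to the following, with the path matching the dfs.name.dir value configured earlier.

INFO common.Storage: Storage directory /home/hadoop/hdfs/namenode has been successfully formatted.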

Start the Hadoop Cluster

First, start the NameNode and DataNode with the following command.

start-dfs.sh

Output:

hadoop@server:~$ start-dfs.sh
Starting namenodes on [server]
Starting datanodes
localhost: Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.

Next, start the YARN ResourceManager and NodeManagers with the following command.

start-yarn.sh

Output:

hadoop@server:~$ start-yarn.sh
Starting resourcemanagers on []
Starting nodemanagers

Verify that all the daemons are active and running as Java processes.

jps

Output:

hadoop@server:~$ jps
58000 NameNode
54897 DataNode
55265 ResourceManager
55043 SecondaryNameNode
58506 Jps
55355 NodeManager
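With all daemons running, you can perform a quick HDFS smoke test; the /test directory below is arbitrary and used only for this check.

hdfs dfs -mkdir /test
hdfs dfs -ls /

The second command should list the /test directory that was just created.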

Access Hadoop Web Interface

Navigate to http://your-server-ip:9870 in your browser to access the Hadoop NameNode web UI.


Navigate to http://your-server-ip:9864 to access an individual DataNode.


Navigate to http://your-server-ip:8088 to access the YARN Resource Manager.

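When you are finished, you can stop all Hadoop services with the matching stop scripts.

stop-yarn.sh
stop-dfs.sh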

Done!