How to Install and Configure Apache Hadoop on Ubuntu 20.04

Apache Hadoop is an open source, Java-based software platform that manages data processing and storage for big data applications. Hadoop works by distributing large data sets and analytics jobs across nodes in a computing cluster, breaking them down into smaller workloads that can be run in parallel. Hadoop can process structured and unstructured data and scale up reliably from a single server to thousands of machines.

Update the system

Update the system packages with latest version with follwing command and reboot the system once updated.

apt-get update -y

Installing Java

Apache Hadoop ia the application based on JAVA programming, need to install JAVA with following command.

apt-get install default-jdk default-jre -y


root@vps:~# apt-get install default-jdk default-jre -y
Reading package lists... Done
Building dependency tree... 50%
Building dependency tree
Reading state information... Done
The following additional packages will be installed:
    at-spi2-core ca-certificates-java default-jdk-headless 

Verify the JAVA version once installation is done.

java -version


root@vps:~# java -version
openjdk version "11.0.8" 2020-07-14
OpenJDK Runtime Environment (build 11.0.8+10-post-Ubuntu-0ubuntu120.04)
OpenJDK 64-Bit Server VM (build 11.0.8+10-post-Ubuntu-0ubuntu120.04, mixed mode, sharing)

Creating hadoop user

Create Hadoop User and Setup Passwordless SSH for Hadoop user run the follwing command to create Hadoop user.

adduser hadoop


root@vps:~# adduser hadoop
Adding user `hadoop' ...
Adding new group `hadoop' (1000) ...
Adding new user `hadoop' (1000) with group `hadoop' ...
Creating home directory `/home/hadoop' ...
Copying files from `/etc/skel' ...
New password:

Swith to Hadoop user once the user has been created.

su - hadoop


root@vps:~# su - hadoop
To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.

Run the following command to generate SSH key.

ssh-keygen -t rsa


hadoop@vps:~$ ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
Created directory '/home/hadoop/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/hadoop/.ssh/id_rsa
Your public key has been saved in /home/hadoop/.ssh/
The key fingerprint is:
The key's randomart image is:
+---[RSA 3072]----+
|     .OB*o.      |
|    ...+o* .     |
|   . o+o+ = .    |
| ..o.Eo+ @ .     |
|o.+ o . S O o    |
|.o . . + @ =     |
|.       = +      |
|         .       |
|                 |

You have to add the public key of your computer to the authorized_keys file of the computer also give the permission to the authorized_keys file.

cat ~/.ssh/ >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys

Verify the passwordless SSH connection with following command.

ssh Server's_IP_Address

Install Hadoop

Switch hadoop user and download the latest version of Hadoop using follwing "wget" command.

su - hadoop



hadoop@vps:~$ wget
--2020-10-05 14:40:38--
Resolving (, 2a01:4f8:10a:201a::2
Connecting to (||:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 359196911 (343M) [application/x-gzip]
Saving to: ‘hadoop-3.2.1.tar.gz’

hadoop-3.2.1.tar.gz                       100%[=====================================================================================>] 342.56M  18.1MB/s    in 20s

2020-10-05 14:40:59 (17.5 MB/s) - ‘hadoop-3.2.1.tar.gz’ saved [359196911/359196911]

Extract the downloaded "tar" file with following command.

tar -xvzf hadoop-3.2.1.tar.gz

Next, switch back to root user for the below commands. We will move the extracted files to a specific directory.

su root
cd /home/hadoop
mv hadoop-3.3.0 /usr/local/hadoop

/home/hadoop path will differ in case you have a different username.

Create the log directory to stotre the "Apache Hadoop" logs.

mkdir /usr/local/hadoop/logs

Change the ownership of /usr/local/hadoop directory to hadoop and switch back to hadoop user.

chown -R hadoop:hadoop /usr/local/hadoop
su hadoop

Enter the edit mode to ".bashrc" and define the Hadoop environment variables by adding the following content to the end of the file.

vi ~/.bashrc

And add the follwing configuration to the end of the file.

export HADOOP_HOME=/usr/local/hadoop
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"

Run the follwing command to activate the added environment variables.

source ~/.bashrc

Configure Hadoop

Next, switch back to Hadoop user. If you are new to Hadoop and want to explore basic commands or test applications, you can configure Hadoop on a single node. Configure Java Environment Variables.

Next, you will need to define Java environment variables in to configure YARN, HDFS, MapReduce, and Hadoop-related project settings.

To locate the correct path of Java by using the following command.

which javac


hadoop@vps:~$ which javac

Next, find the OpenJDK directory with the following command.

readlink -f /usr/bin/javac


hadoop@vps:~$ readlink -f /usr/bin/javac

Next, edit the file and define the Java path.

vi $HADOOP_HOME/etc/hadoop/

And add the follwing configuration to the end of the file.

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64 
export HADOOP_CLASSPATH+=" $HADOOP_HOME/lib/*.jar"

Need to download the Javax activation file by running the following command.

cd /usr/local/hadoop/lib
sudo wget


hadoop@vps:/usr/local/hadoop/lib$ sudo wget
--2020-10-05 14:55:51--
Resolving (,,, ...
Connecting to (||:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 56674 (55K) [application/content-stream]
Saving to: ‘javax.activation-api-1.2.0.jar’

javax.activation-api-1.2.0.jar            100%[=====================================================================================>]  55.35K  --.-KB/s    in 0.05s

2020-10-05 14:55:51 (1021 KB/s) - ‘javax.activation-api-1.2.0.jar’ saved [56674/56674]

Next, Verify the hadoop version.

hadoop version


hadoop@vps:/usr/local/hadoop/lib$ hadoop version
Hadoop 3.2.1
Source code repository -r b3cbbb467e22ea829b3808f4b7b01d07e0bf3842
Compiled by rohithsharmaks on 2019-09-10T15:56Z
Compiled with protoc 2.5.0
From source with checksum 776eaf9eee9c0ffc370bcbc1888737
This command was run using /usr/local/hadoop/share/hadoop/common/hadoop-common-3.2.1.jar

Configure core-site.xml File

To set up Hadoop you need to specify the URL for your NameNode as following.

vi $HADOOP_HOME/etc/hadoop/core-site.xml

And add the follwing configuration to the end of the file.

            <description>The default file system URI</description>

Configure hdfs-site.xml File

Need to define location for storing node metadata, fsimage file, and edit log file. Configure the file by defining the NameNode and DataNode storage directories.

Before configure create a directory for storing node metadata.

mkdir -p /home/hadoop/hdfs/{namenode,datanode}
chown -R hadoop:hadoop /home/hadoop/hdfs

Edit the hdfs-site.xml file and define the location of the directory as follows.

vi $HADOOP_HOME/etc/hadoop/hdfs-site.xml

And add the follwing configuration to the end of the file.




Configure mapred-site.xml File

Use the following command to access the mapred-site.xml file and define MapReduce values.

vi $HADOOP_HOME/etc/hadoop/mapred-site.xml

And add the follwing configuration to the end of the file.


Configure yarn-site.xml File

You would need to edit the yarn-site.xml file and define YARN related settings.

vi $HADOOP_HOME/etc/hadoop/yarn-site.xml

And add the follwing configuration to the end of the file.


Format HDFS NameNode

It is important to format the NameNode before starting Hadoop services for the first time.

hdfs namenode -format


2020-10-05 15:12:51,795 INFO snapshot.SnapshotManager: Loaded config captureOpenFiles: false, skipCaptureAccessTimeOnlyChange: false, snapshotDiffAllowSnapRootDescendant: true, maxSnapshotLimit: 65536
2020-10-05 15:12:51,803 INFO snapshot.SnapshotManager: SkipList is disabled
2020-10-05 15:12:51,814 INFO util.GSet: Computing capacity for map cachedBlocks
2020-10-05 15:12:51,815 INFO util.GSet: VM type       = 64-bit
2020-10-05 15:12:51,815 INFO util.GSet: 0.25% max memory 498 MB = 1.2 MB
2020-10-05 15:12:51,815 INFO util.GSet: capacity      = 2^17 = 131072 entries
2020-10-05 15:12:51,848 INFO metrics.TopMetrics: NNTop conf: = 10
2020-10-05 15:12:51,852 INFO metrics.TopMetrics: NNTop conf: = 10
2020-10-05 15:12:51,852 INFO metrics.TopMetrics: NNTop conf: = 1,5,25
2020-10-05 15:12:51,863 INFO namenode.FSNamesystem: Retry cache on namenode is enabled
2020-10-05 15:12:51,863 INFO namenode.FSNamesystem: Retry cache will use 0.03 of total heap and retry cache entry expiry time is 600000 millis
2020-10-05 15:12:51,871 INFO util.GSet: Computing capacity for map NameNodeRetryCache
2020-10-05 15:12:51,872 INFO util.GSet: VM type       = 64-bit
2020-10-05 15:12:51,873 INFO util.GSet: 0.029999999329447746% max memory 498 MB = 153.0 KB
2020-10-05 15:12:51,874 INFO util.GSet: capacity      = 2^14 = 16384 entries
2020-10-05 15:12:51,937 INFO namenode.FSImage: Allocated new BlockPoolId: BP-809721395-
2020-10-05 15:12:52,059 INFO common.Storage: Storage directory /home/hadoop/hdfs/namenode has been successfully formatted.
2020-10-05 15:12:52,161 INFO namenode.FSImageFormatProtobuf: Saving image file /home/hadoop/hdfs/namenode/current/fsimage.ckpt_0000000000000000000 using no compression
2020-10-05 15:12:52,447 INFO namenode.FSImageFormatProtobuf: Image file /home/hadoop/hdfs/namenode/current/fsimage.ckpt_0000000000000000000 of size 401 bytes saved in 0 seconds .
2020-10-05 15:12:52,486 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
2020-10-05 15:12:52,515 INFO namenode.FSImage: FSImageSaver clean checkpoint: txid=0 when meet shutdown.
2020-10-05 15:12:52,516 INFO namenode.NameNode: SHUTDOWN_MSG:
SHUTDOWN_MSG: Shutting down NameNode at

Start the Hadoop Cluster

First, start the NameNode and DataNode with the following command.


Starting namenodes on [] Warning: Permanently added '' (ECDSA) to the list of known hosts.
Starting datanodes
Starting secondary namenodes [] Warning: Permanently added '' (ECDSA) to the list of known hosts.

Next, start the YARN resource and nodemanagers by typing.


Starting resourcemanager
Starting nodemanagers

Verify if all the daemons are active and running as Java processes.



hadoop@vps:~/hdfs$ jps
63187 DataNode
64227 Jps
63699 ResourceManager
63868 NodeManager
63437 SecondaryNameNode
63038 NameNode

Access Hadoop Web Interface

Navigate your localhost URL or IP to access Hadoop NameNode : http://your-server-ip:9870


Navigate your localhost URL or IP to access individual DataNodes : http://your-server-ip:9864


Navigate your localhost URL or IP to access the YARN Resource Manager: http://your-server-ip:8088