Install Hadoop on Windows — Step by step guide

A guide to setting up Apache Hadoop on Windows 10 and running the HDFS and YARN services.

Apache Hadoop is an open-source framework for distributed storage and processing that companies use to set up data lakes. There are plenty of resources to help you set up Hadoop on Linux, but very few offer help in setting it up on Windows. In this tutorial I will explain how to configure Hadoop to run on Windows in pseudo-distributed (single-node) mode.

It is recommended to first understand big data before moving on to the Hadoop setup.

Download the tar file

Download the binary .tar.gz archive from Hadoop’s website (https://hadoop.apache.org/releases.html).

At the time of writing this article, Hadoop 3.3.1 is the most stable version, so this article will follow its setup.

Create a destination folder

You need a separate directory to extract the Hadoop files into. In my case, I will unpack all my files in the following directory.

D:\Data_Lake\

Extract

Place the .tar.gz file in your destination directory and run the following command in your command prompt.

tar -xvzf hadoop-3.3.1.tar.gz

Hadoop ships with a lot of setup files, so the extraction can take quite some time.

While extracting, you may see the following error:

tar: Error exit delayed from previous errors.

This is usually caused by symbolic links inside the archive, which tar cannot create on Windows without elevated privileges; it can safely be ignored.

JDK Installation

A huge portion of the Hadoop backend is written in Java, so in order to run Hadoop you need to install the Java Development Kit (JDK). It can be downloaded from Oracle’s website.

Make sure you download JDK version 8. Later versions of Java are not supported by Hadoop and will throw exceptions when you try to run it.

I have the JDK installed in the folder “D:\java\”. After installation you need to set this path in the environment variables.

Environment variables let the command prompt know which file to run when you type its keyword in the terminal window.

To add or edit an environment variable, type “Environment Variables” in your Windows search bar and the relevant setting should pop up.

[Screenshot: Step 1, editing environment variables]

Click on the “Environment Variables…” button (highlighted in the screenshot above).

Then click on “New…” under User variables.

[Screenshot: Step 2, adding the variable name and path]

Set the variable name to: JAVA_HOME

For the value, browse to the directory where you installed the JDK. In my case it is: “D:\java\”

Your path will likely be different from mine.

Now under System variables, select Path and click Edit. Add a new entry to this Path variable:

%JAVA_HOME%\bin
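
To confirm the JDK is reachable from the command line, you can open a new command prompt and run a quick sanity check (the exact version string depends on your JDK build, but it should report a 1.8.x version for JDK 8):

java -version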

Native IO files and Hadoop compilation

For working on Windows, Hadoop requires some native IO files. These native libraries implement certain components for better performance, or for functionality that has no Java implementation; on Windows the key pieces are winutils.exe and hadoop.dll.

Originally you would have had to compile and build the native IO files yourself, but some good folks on the internet have uploaded pre-built binaries to GitHub. For our particular version of Hadoop, the cdarlint/winutils repository is a commonly used source of these binaries.

Look for Hadoop version 3.3.1 for our current installation.

Download the bin folder and copy its contents to the following path.

<your Hadoop directory>\bin (for me it is: “D:\Data_Lake\hadoop-3.3.1\bin”)

Our native files are all set.

Configure Environment Variables for Hadoop

We will take similar steps here as we did for the Java path.

Go to Environment variables and under user variables, create a new variable with the name ‘HADOOP_HOME’ and set the path to your Hadoop home directory.

We add this environment variable the same way we did for Java. In my case the variable looks like this:

[Screenshot: the HADOOP_HOME variable set]

Name: HADOOP_HOME

Value: D:\Data_Lake\hadoop-3.3.1

User Path variable for Hadoop

Now under the user variables, find the PATH variable and click on edit.

Click ‘New’ and add the following path:

%HADOOP_HOME%\bin

Your final Path variable will look like the one below:

[Screenshot: the Path entries added]

Click ‘OK’ on all open windows to save the changes.
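
To confirm both variables were picked up, open a new command prompt (already-open ones will not see the change) and echo them; each should print the directory you set above:

echo %JAVA_HOME%
echo %HADOOP_HOME%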

Verify Setup

Now open a command prompt and run the following command. Since %HADOOP_HOME%\bin has been added to your Path, it can be run from any folder.

winutils.exe

You should see an output like the following, confirming that all the above steps were carried out correctly.

[Screenshot: winutils.exe printing its usage information]

Configuring Hadoop files

Hadoop has some configuration files which you need to edit manually in order to get Hadoop running.

1. core-site.xml

The first file to edit is the core-site.xml. 

This file is located at the following path:

%HADOOP_HOME%\etc\hadoop\core-site.xml

This file contains information about the NameNode. It tells the Hadoop daemon where in the cluster the NameNode runs. It also binds the NameNode to a particular IP and Port.

At the bottom of the XML file you will find a <configuration> tag.

Replace this tag with the following text. (Note: fs.default.name is the older, deprecated name for fs.defaultFS; both still work in Hadoop 3.x.)

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://0.0.0.0:10000</value>
  </property>
</configuration>

2. hdfs-site.xml

This file is present in the same folder as the previous one. However, before editing it you need to understand two core concepts of the Hadoop Distributed File System (HDFS): the NameNode and the DataNode. The ‘NameNode’ holds the metadata for everything stored in HDFS, whereas the ‘DataNode’ holds the actual data blocks.

The hdfs-site.xml file contains information (file paths) for these two nodes.

Before adding the two paths you actually need to create the two folders. In my case I created them at the following paths.

D:\Data_Lake\data\dfs\namenode
D:\Data_Lake\data\dfs\data
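
If you prefer the command line, the same two folders can be created from the command prompt; these paths match my setup, so substitute your own:

mkdir D:\Data_Lake\data\dfs\namenode
mkdir D:\Data_Lake\data\dfs\data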

Now navigate back to the directory ‘%HADOOP_HOME%\etc\hadoop\’ and find the hdfs-site.xml file. As with the previous file, you will find a <configuration> tag at the bottom. Replace it with the following text.

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///D:/Data_Lake/data/dfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///D:/Data_Lake/data/dfs/data</value>
  </property>
</configuration>
Note: be careful to replace the paths in the above text with your own.

The first property, ‘dfs.replication’, sets the default replication factor for data blocks; a value of 1 is appropriate for a single-node setup. The replication factor can also be set per file at creation time.

3. mapred-site.xml

This file configures the MapReduce framework that Hadoop uses. The steps remain the same: replace the <configuration> tag with the following text.

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.application.classpath</name>
    <value>%HADOOP_HOME%/share/hadoop/mapreduce/*,%HADOOP_HOME%/share/hadoop/mapreduce/lib/*,%HADOOP_HOME%/share/hadoop/common/*,%HADOOP_HOME%/share/hadoop/common/lib/*,%HADOOP_HOME%/share/hadoop/yarn/*,%HADOOP_HOME%/share/hadoop/yarn/lib/*,%HADOOP_HOME%/share/hadoop/hdfs/*,%HADOOP_HOME%/share/hadoop/hdfs/lib/*</value>
  </property>
</configuration>

4. yarn-site.xml

Finally, we edit the ‘yarn-site.xml’ file, again in the same folder.

Replace the <configuration> tag with the following text.

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.env-whitelist</name>
    <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
  </property>
</configuration>

Starting HDFS

We’re all done with the configurations and are now ready to start the file system server.

Run the following command in the command prompt.

hdfs namenode -format

This will format any data present in the NameNode (none for now). If this command executes without errors, our HDFS setup is good to go.

Then start the NameNode and DataNode with the following command.

%HADOOP_HOME%\sbin\start-dfs.cmd

Two new terminal windows should pop up, and the Hadoop node services (NameNode and DataNode) should be up and running.

[Screenshot: the Hadoop nodes running in the terminal]
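
As an optional smoke test (not part of the original setup steps), you can create a directory in HDFS and list the root to confirm the file system responds; the listing should show the new /test directory:

hadoop fs -mkdir /test
hadoop fs -ls /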

Starting YARN

Type the following into the command prompt.

%HADOOP_HOME%\sbin\start-yarn.cmd

Possible Errors:

YARN might fail to start for two reasons:

1. No administrative privileges.

Solution: Run the command prompt as administrator.

2. Java error.

Solution: Check which version of Java you have; Hadoop is not compatible with all of the newest versions, so you may have to roll back to an older one. (For this tutorial, it is important that you have JDK 8 installed.)

If everything is settled, you will see the following prompts pop up.

[Screenshot: YARN running in the terminal]

To verify it is running, go to http://localhost:8088/cluster in your browser.

You should see the YARN ResourceManager homepage, with information about all the nodes and the applications running on the cluster.
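
You can also verify the daemons from the command line with the JDK’s jps tool, which lists running Java processes. With both HDFS and YARN started, you should see NameNode, DataNode, ResourceManager and NodeManager in its output:

jps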

That’s all.
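
One last tip: when you want to shut the cluster down, the same sbin folder ships matching stop scripts:

%HADOOP_HOME%\sbin\stop-yarn.cmd
%HADOOP_HOME%\sbin\stop-dfs.cmd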

