Apache Hadoop is an open-source framework for distributed storage and processing of large datasets, commonly used by companies to build data lakes. There are plenty of resources to help you set up Hadoop on Linux, but very few offer help in setting it up on Windows. In this tutorial I will explain how to configure Hadoop to run on Windows in pseudo-distributed mode (single node).
It is recommended to first understand big data concepts before moving on to the Hadoop setup.

Download the tar file
Download the .tar.gz archive from Hadoop’s website.
At the time of writing, Hadoop 3.3.1 is the most stable version, so this article will follow its setup.
Create a destination folder
You need a separate directory to extract the Hadoop files into. In my case, I will unpack all my files into the following directory.
D:/Data_Lake/
Unzip
Place the .tar.gz file in your destination directory and run the following command in Command Prompt.
tar -xvzf hadoop-3.3.1.tar.gz
Hadoop ships with a lot of setup files, so the extraction can take some time.
While extracting, the following error may show up.
tar: Error exit delayed from previous errors.
You can safely ignore it.
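Once extraction finishes, you can sanity-check the layout from the same directory (the folder name below assumes the 3.3.1 archive; adjust if yours differs):

rem you should see bin, etc, sbin, and share inside
dir hadoop-3.3.1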
JDK Installation
A huge portion of the Hadoop backend is written in Java, so in order to run Hadoop you need to install the Java Development Kit (JDK). It can be downloaded from the following link.

MAKE SURE YOU DOWNLOAD JDK VERSION 8. Later versions of Java are not fully supported by Hadoop and will throw exceptions when you try to run it.
I have the JDK installed in the folder “D:/java/”. After installation you need to set this path in the environment variables.
Environment variables let the command prompt know which executable to run when you type its keyword in the terminal window.
To add or edit an environment variable, type “Environment Variables” in your Windows search bar and the relevant setting should pop up.

Click on the “Environment Variables…” button (highlighted in the picture above).
Then click on “New” under User variables.

Set the variable name as: JAVA_HOME
For the value, browse to the directory where you have installed the JDK. In my case it is: “D:/java/“
Your path will be different from mine.
Now under System variables, find the Path variable and click on Edit. Add a new value to this Path variable:
%JAVA_HOME%\bin
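To confirm the Java variables took effect, open a new Command Prompt window (already-open windows do not pick up new environment variables) and run:

rem both should resolve if JAVA_HOME and Path were set correctly
echo %JAVA_HOME%
java -version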
Native IO files and Hadoop compilation
For working on Windows, Hadoop requires some native IO files. These native libraries implement certain components for better performance and provide functionality for which no Java implementation is available.
Originally you would have had to compile and build the native IO files yourself, but some good folks on the internet have uploaded pre-built binaries to their GitHub repo. For our particular version of Hadoop, go to the following link to get the binaries.
Look for Hadoop version 3.3.1 for our current installation.
Download the bin folder and copy its contents to the following path.
“<your Hadoop directory>/bin” (For me it is: “D:/Data_lake/hadoop-3.3.1/bin“)
Our native files are all set.
Configure Environment Variables for Hadoop
We will follow similar steps as we did for the Java path.
Go to Environment Variables and, under User variables, create a new variable named ‘HADOOP_HOME’ and set its value to your Hadoop home directory. In my case the variable looks like this:

Name: HADOOP_HOME
Value: D:\Data_lake\hadoop-3.3.1
User Path variable for Hadoop
Now under User variables, find the Path variable and click on Edit.
Click ‘New’ and add the following path:
%HADOOP_HOME%\bin
Your final Path will look like the one below.

Click ‘OK’ on all the open windows to save the variables.
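As a quick sanity check, you can confirm in a fresh Command Prompt that the variable resolves and that Hadoop is on the Path; the second command should print the Hadoop version:

echo %HADOOP_HOME%
hadoop version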
Verify Setup
Now in Command Prompt, navigate to the bin folder of your Hadoop installation and run the following command.
winutils.exe
You should see usage output like the following, confirming that all the above steps were carried out correctly.


Configuring Hadoop files
Hadoop has some configuration files which you need to edit manually in order to get it running.
1. core-site.xml
The first file to edit is core-site.xml.
This file is located at the following path:
%HADOOP_HOME%\etc\hadoop\core-site.xml
This file contains information about the NameNode. It tells the Hadoop daemons where the NameNode runs in the cluster, and it binds the NameNode to a particular IP and port.
At the bottom of the XML you will find a <configuration> tag.
Replace the contents of this tag with the following text.
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://0.0.0.0:10000</value>
  </property>
</configuration>
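A side note: fs.default.name is the legacy name for this property; recent Hadoop releases prefer fs.defaultFS, though the old name still works as a deprecated alias. Once the cluster is up, you should be able to confirm the filesystem URI in effect with:

hdfs getconf -confKey fs.defaultFS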
2. hdfs-site.xml
This file is present in the same folder as the previous one. However, before editing it you need to understand two core concepts of the Hadoop Distributed File System (HDFS): the NameNode and the DataNode. The ‘NameNode’ holds the metadata of everything imported into Hadoop, whereas the ‘DataNode’ holds the actual data.
The hdfs-site.xml file contains the file paths for these two nodes.
Before adding the two paths you actually need to create the two folders. In my case I created them at the following paths.
D:/data-lake/data/dfs/namenode
D:/data-lake/data/dfs/data
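You can create both folders straight from Command Prompt (these are my paths; substitute your own):

mkdir D:\data-lake\data\dfs\namenode
mkdir D:\data-lake\data\dfs\data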
Now navigate back to the directory ‘%HADOOP_HOME%\etc\hadoop\’ and find the hdfs-site.xml file. As with the previous file, you will find a <configuration> tag at the bottom. Replace it with the following text.
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///D:/data-lake/data/dfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///D:/data-lake/data/dfs/data</value>
  </property>
</configuration>
*BE CAREFUL TO REPLACE THE PATHS IN THE ABOVE TEXT WITH YOUR OWN*
The first property, ‘dfs.replication’, sets the default replication factor for data stored in HDFS. It can also be set per file at creation time.
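For example (the file and directory names here are hypothetical), you can set the replication factor while copying a file into HDFS, or change it on an existing file:

rem write a file with replication factor 1
hadoop fs -D dfs.replication=1 -put localfile.txt /data/localfile.txt
rem change the replication factor of an existing file
hdfs dfs -setrep 1 /data/localfile.txt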
3. mapred-site.xml
This file configures the MapReduce framework that Hadoop uses. The steps remain the same: replace the <configuration> tag with the following text.
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.application.classpath</name>
    <value>%HADOOP_HOME%/share/hadoop/mapreduce/*,%HADOOP_HOME%/share/hadoop/mapreduce/lib/*,%HADOOP_HOME%/share/hadoop/common/*,%HADOOP_HOME%/share/hadoop/common/lib/*,%HADOOP_HOME%/share/hadoop/yarn/*,%HADOOP_HOME%/share/hadoop/yarn/lib/*,%HADOOP_HOME%/share/hadoop/hdfs/*,%HADOOP_HOME%/share/hadoop/hdfs/lib/*</value>
  </property>
</configuration>
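The mapreduce.application.classpath value tells MapReduce jobs where to find Hadoop’s jars. If a job later fails with ClassNotFoundException errors, a useful first check is to print the classpath that Hadoop itself resolves:

hadoop classpath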
4. yarn-site.xml
Finally, we edit the ‘yarn-site.xml’ file, which is again in the same folder.
Replace the <configuration> tag with the following text.
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.env-whitelist</name>
    <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
  </property>
</configuration>
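For context: the mapreduce_shuffle auxiliary service is what lets each NodeManager serve map outputs to reducers during the shuffle phase, and the env-whitelist entry controls which environment variables YARN containers are allowed to inherit.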
Starting HDFS
We’re all done with the configurations and now ready to start the file system server.
Run the following command in the command prompt.
hdfs namenode -format
This formats the NameNode’s storage directory, erasing any data present in it (none for now). If this command executes without errors, our HDFS setup is good to go.
Then start the NameNode and DataNode with the following command.
%HADOOP_HOME%\sbin\start-dfs.cmd
Two new terminal windows should pop up, and the NameNode and DataNode services should be up and running.
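At this point you can run a small smoke test against HDFS (the directory name is just an example); in Hadoop 3.x you can also browse the NameNode web UI at localhost:9870.

hdfs dfs -mkdir /test
hdfs dfs -ls /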

Starting YARN
Type the following into the command prompt.
%HADOOP_HOME%\sbin\start-yarn.cmd
Possible Errors:
YARN might not start for one of two reasons.
1. No administrative privileges.
Solution: Run CMD as administrator.
2. Java error.
Solution: Check which version of Java you have; Hadoop is not compatible with all of the latest versions, so you might have to roll back to an older one. (For this tutorial it is important that you have JDK 8 installed.)
If all is settled, you will see the following prompts pop up.

To verify it is running, go to
localhost:8088/cluster
in your browser.
You should see the YARN ResourceManager page with information about all the nodes and running applications.
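As an optional end-to-end test, you can run one of the example jobs bundled with Hadoop (the jar name below matches version 3.3.1; adjust it for other versions). If it finishes and prints an estimate of pi, then HDFS, YARN, and MapReduce are all working together:

hadoop jar %HADOOP_HOME%\share\hadoop\mapreduce\hadoop-mapreduce-examples-3.3.1.jar pi 2 10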
That’s all.