Hadoop Installation Steps
And then choose one of the mirror links. The page lists the mirrors closest to you based on your location. I am choosing the following mirror link:
http://apache.mirror.digitalpacific.com.au/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz
$dest_dir="F:\big-data"
$url = "http://apache.mirror.digitalpacific.com.au/hadoop/common/hadoop-3.2.1/hadoop-
3.2.1.tar.gz"
$client = new-object System.Net.WebClient
$client.DownloadFile($url,$dest_dir+"\hadoop-3.2.1.tar.gz")
It may take a few minutes to download.
Once the download completes, you can verify it:
PS F:\big-data> cd $dest_dir
PS F:\big-data> ls
Directory: F:\big-data
PS F:\big-data>
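Beyond listing the directory, you can optionally verify the integrity of the download with the built-in Get-FileHash cmdlet and compare the result with the .sha512 checksum file published alongside the package on the mirror (a quick sketch, reusing $dest_dir from above):
Get-FileHash "$dest_dir\hadoop-3.2.1.tar.gz" -Algorithm SHA512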
You can also directly download the package through your web browser and save it to
the destination directory.
Now we need to unpack the downloaded package using a GUI tool (like 7-Zip) or the command line. I will use Git Bash to unpack it.
Open Git Bash and change the directory to the destination folder:
cd F:/big-data
And then run the following command to unzip:
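Assuming the archive name from the download step, the usual Git Bash command is:
tar -xvzf hadoop-3.2.1.tar.gz
Unpacking may also take a few minutes.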
Next, we need the Hadoop native IO binaries for Windows (winutils), which are not included in the package we just unpacked. I also published another article with very detailed steps about how to compile and build native Hadoop on Windows: Compile and Build Hadoop 3.2.1 on Windows 10 Guide.
The build may take about one hour, and to save our time, we can just download the binary package from GitHub:
https://github.com/cdarlint/winutils/tree/master/hadoop-3.2.1/bin
Alternatively, you can run the following commands in the previous PowerShell window to download:
$client.DownloadFile("https://github.com/cdarlint/winutils/raw/master/hadoop-3.2.1/bin/hadoop.dll",$dest_dir+"\hadoop-3.2.1\bin\"+"hadoop.dll")
$client.DownloadFile("https://github.com/cdarlint/winutils/raw/master/hadoop-3.2.1/bin/hadoop.exp",$dest_dir+"\hadoop-3.2.1\bin\"+"hadoop.exp")
$client.DownloadFile("https://github.com/cdarlint/winutils/raw/master/hadoop-3.2.1/bin/hadoop.lib",$dest_dir+"\hadoop-3.2.1\bin\"+"hadoop.lib")
$client.DownloadFile("https://github.com/cdarlint/winutils/raw/master/hadoop-3.2.1/bin/hadoop.pdb",$dest_dir+"\hadoop-3.2.1\bin\"+"hadoop.pdb")
$client.DownloadFile("https://github.com/cdarlint/winutils/raw/master/hadoop-3.2.1/bin/libwinutils.lib",$dest_dir+"\hadoop-3.2.1\bin\"+"libwinutils.lib")
$client.DownloadFile("https://github.com/cdarlint/winutils/raw/master/hadoop-3.2.1/bin/winutils.exe",$dest_dir+"\hadoop-3.2.1\bin\"+"winutils.exe")
$client.DownloadFile("https://github.com/cdarlint/winutils/raw/master/hadoop-3.2.1/bin/winutils.pdb",$dest_dir+"\hadoop-3.2.1\bin\"+"winutils.pdb")
After this, the bin folder looks like the following:
A Java JDK is required to run Hadoop. If you have not installed the Java JDK yet, please install it first.
Once you complete the installation, please run the following command in PowerShell
or Git Bash to verify:
$ java -version
java version "1.8.0_161"
Java(TM) SE Runtime Environment (build 1.8.0_161-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.161-b12, mixed mode)
If you get an error like 'cannot find java command or executable', don't worry; we will resolve it in the following step.
Now that we've downloaded and unpacked all the artefacts, we need to configure two important environment variables.
First, we need to find out the location of the Java JDK. On my system, the path is: D:\Java\jdk1.8.0_161.
Your location can be different depending on where you installed your JDK.
Then run the commands to set JAVA_HOME and HADOOP_HOME. If you used PowerShell to download and the window is still open, you can simply run them there.
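For example, a minimal sketch using SETX (the JDK path is the one from my system above, and the Hadoop path assumes the destination folder from the download step; adjust both to your system). Note that SETX persists the variables for new windows but does not change the current session:
SETX JAVA_HOME "D:\Java\jdk1.8.0_161"
SETX HADOOP_HOME "F:\big-data\hadoop-3.2.1"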
Once we finish setting up the above two environment variables, we need to add
the bin folders to the PATH environment variable.
If the PATH environment variable already exists in your system, you can also manually add the following two paths to it:
%JAVA_HOME%/bin
%HADOOP_HOME%/bin
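For example, the following PowerShell sketch appends both folders to the user PATH; the literal paths assume the same install locations as above, so adjust them to your system:
$userPath = [Environment]::GetEnvironmentVariable("Path", "User")
[Environment]::SetEnvironmentVariable("Path", "$userPath;D:\Java\jdk1.8.0_161\bin;F:\big-data\hadoop-3.2.1\bin", "User")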
Close the PowerShell window and open a new one, then type winutils.exe directly to verify that our above steps completed successfully:
You should also be able to run the following command:
hadoop -version
java version "1.8.0_161"
Java(TM) SE Runtime Environment (build 1.8.0_161-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.161-b12, mixed mode)
Step 6 - Configure Hadoop
Now we are ready to configure the most important part - the Hadoop configuration, which involves the Core, YARN, MapReduce and HDFS settings.
Configure core site
Edit file core-site.xml in %HADOOP_HOME%\etc\hadoop folder.
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://0.0.0.0:19000</value>
  </property>
</configuration>
Configure HDFS
Edit file hdfs-site.xml in %HADOOP_HOME%\etc\hadoop folder.
Before editing, please create two folders in your system: one for the namenode directory and another for the data directory. On my system, I created the following two sub folders:
F:\big-data\data\dfs\namespace_logs
F:\big-data\data\dfs\data
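For example, you can create both folders in PowerShell (same paths as above; adjust to your system):
New-Item -ItemType Directory -Force -Path "F:\big-data\data\dfs\namespace_logs"
New-Item -ItemType Directory -Force -Path "F:\big-data\data\dfs\data"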
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///F:/big-data/data/dfs/namespace_logs</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///F:/big-data/data/dfs/data</value>
  </property>
</configuration>
In Hadoop 3, the property names are slightly different from previous versions. Refer to the official documentation to learn more about the configuration properties.
We configure dfs.replication as 1 because we are setting up just a single node; by default the value is 3.
The directory configurations are not mandatory; by default Hadoop will use its temporary folder. For the purposes of this tutorial, I recommend customising the values.
Configure MapReduce and YARN site
Edit file mapred-site.xml in %HADOOP_HOME%\etc\hadoop folder.
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.application.classpath</name>
    <value>%HADOOP_HOME%/share/hadoop/mapreduce/*,%HADOOP_HOME%/share/hadoop/mapreduce/lib/*,%HADOOP_HOME%/share/hadoop/common/*,%HADOOP_HOME%/share/hadoop/common/lib/*,%HADOOP_HOME%/share/hadoop/yarn/*,%HADOOP_HOME%/share/hadoop/yarn/lib/*,%HADOOP_HOME%/share/hadoop/hdfs/*,%HADOOP_HOME%/share/hadoop/hdfs/lib/*</value>
  </property>
</configuration>
Edit file yarn-site.xml in %HADOOP_HOME%\etc\hadoop folder.
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.env-whitelist</name>
    <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
  </property>
</configuration>
Step 7 - Initialise HDFS & bug fix
Run the following command to initialise (format) the namenode:
hdfs namenode -format
With Hadoop 3.2.1, this command fails on Windows. Refer to the following sub section (About 3.2.1 HDFS bug on Windows) for the details of fixing this problem.
Once this is fixed, the format command (hdfs namenode -format) will show something like the following:
About 3.2.1 HDFS bug on Windows
https://issues.apache.org/jira/browse/HDFS-14890
I've done the following to get this temporarily fixed before 3.2.2/3.3.0 is released:
if (permission != null) {
  try {
    Set<PosixFilePermission> permissions =
        PosixFilePermissions.fromString(permission.toString());
    Files.setPosixFilePermissions(curDir.toPath(), permissions);
  } catch (UnsupportedOperationException uoe) {
    // Default to FileUtil for non-POSIX file systems (e.g. Windows)
    FileUtil.setPermission(curDir, permission);
  }
}
Fix bug HDFS-14890
I've uploaded the patched JAR file to the following location. Please download it from this link:
https://github.com/FahaoTang/big-data/blob/master/hadoop-hdfs-3.2.1.jar
This is just a temporary fix before the official improvement is published. I publish it purely so that we can complete the whole installation process, and there is no guarantee this temporary fix won't cause any new issues.
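To apply the fix, replace the existing HDFS jar with the patched one. A minimal PowerShell sketch, assuming you saved the downloaded jar to $dest_dir and installed Hadoop in the location used earlier:
# Back up the original jar before overwriting it
Copy-Item "$dest_dir\hadoop-3.2.1\share\hadoop\hdfs\hadoop-hdfs-3.2.1.jar" "$dest_dir\hadoop-3.2.1\share\hadoop\hdfs\hadoop-hdfs-3.2.1.jar.bak"
# Replace it with the patched jar
Copy-Item "$dest_dir\hadoop-hdfs-3.2.1.jar" "$dest_dir\hadoop-3.2.1\share\hadoop\hdfs\hadoop-hdfs-3.2.1.jar" -Force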
Refer to this article for more details about how to build a native Windows
Hadoop: Compile and Build Hadoop 3.2.1 on Windows 10 Guide.
Step 8 - Start HDFS daemons
Run the following command to start the HDFS daemons:
%HADOOP_HOME%\sbin\start-dfs.cmd
Two Command Prompt windows will open: one for datanode and another for
namenode as the following screenshot shows:
Step 9 - Start YARN daemons
You may encounter permission issues if you start the YARN daemons as a normal user. To make sure you don't run into any issues, please open a Command Prompt window using Run as administrator.
Alternatively, you can follow this comment on this page, which doesn't require Administrator permission and uses a local Windows account:
https://kontext.tech/column/hadoop/377/latest-hadoop-321-installation-on-windows-10-step-by-step-guide#comment314
Run the following command in an elevated Command Prompt window (Run as
administrator) to start YARN daemons:
%HADOOP_HOME%\sbin\start-yarn.cmd
Similarly two Command Prompt windows will open: one for resource manager and
another for node manager as the following screenshot shows:
Step 10 - Useful Web portals exploration
The daemons also host websites that provide useful information about the cluster.
HDFS Namenode information UI
http://localhost:9870/dfshealth.html#tab-overview
The website looks like the following screenshot:
HDFS Datanode information UI
http://localhost:9864/datanode.html
The website looks like the following screenshot:
YARN resource manager UI
http://localhost:8088
The website looks like the following screenshot:
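Beyond the web portals, you can also verify the cluster from the command line. For example, the standard HDFS admin report prints the configured capacity and the list of live datanodes:
hdfs dfsadmin -report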
You don't need to keep the services running all the time. You can stop them by
running the following commands one by one:
%HADOOP_HOME%\sbin\stop-yarn.cmd
%HADOOP_HOME%\sbin\stop-dfs.cmd
Congratulations! You've successfully completed the installation of Hadoop 3.2.1 on
Windows 10.
Let me know if you encounter any issues. Enjoy your latest Hadoop on Windows 10!