Introduction To HDFS
Contents
Lab 1  Exploring Hadoop Distributed File System ........ 4
    1.1  Getting Started ........ 5
    1.2  Exploring Hadoop Distributed File System (Terminal) ........ 9
        1.2.1  Using the command line Interface ........ 9
    1.3  Exploring Hadoop Distributed File System (Web Console) ........ 15
        1.3.1  Using the Web Console ........ 15
        1.3.2  Working with the Welcome page ........ 16
        1.3.3  Administering BigInsights ........ 18
        1.3.4  Inspecting the status of your cluster ........ 18
        1.3.5  Starting and stopping a component ........ 19
        1.3.6  Working with Files ........ 20
    1.4  Summary ........ 24
IBM Software
Technically, Hadoop consists of two key services: data storage using the Hadoop Distributed File System
(HDFS) and large-scale parallel data processing using a technique called MapReduce.
This version of the lab was designed using the InfoSphere BigInsights 2.1 Quick Start Edition.
Throughout this lab you will use the following account login information:

Username: biadmin
Password: biadmin
__1. Start the VMware image by clicking the Play virtual machine button in VMware Player, if it is not already running.
__2. Log in to the VMware virtual machine using the following credentials.
User: biadmin
Password: biadmin
Hands-on-Lab Page 5
__3. After you log in, your screen should look similar to the one below.
Before we can start working with the Hadoop Distributed File System, we must first start all the BigInsights components. There are two ways of doing this: through the terminal, or by simply double-clicking an icon. Both methods are shown in the following steps.
__4. Now open the terminal by double clicking the BigInsights Shell icon.
__6. Once the terminal has been opened, change to the $BIGINSIGHTS_HOME/bin directory (where $BIGINSIGHTS_HOME by default is /opt/ibm/biginsights):
cd $BIGINSIGHTS_HOME/bin
or
cd /opt/ibm/biginsights/bin
__7. Start the Hadoop components (daemons) on the BigInsights server. You can start all components with the command below. Please note that it will take a few minutes to run.
./start-all.sh
__8. Sometimes certain Hadoop components may fail to start. You can start and stop the failed components one at a time by using start.sh and stop.sh, respectively. For example, to start and stop Hive use:
./start.sh hive
./stop.sh hive
Notice that since Hive did not initially fail, the terminal is telling us that Hive is already running.
__9. Once all components have started successfully you may move on.
__10. If you would like to stop all components, execute the command below. However, for this lab, please leave all components started.
./stop-all.sh
Next, let us look at how to start all the components by double-clicking an icon.
__11. Double-clicking the Start BigInsights icon executes a script that performs the steps described above. Once all components are started, the terminal exits and you are set. Simple.
__12. We can stop the components in a similar manner, by double-clicking the Stop BigInsights icon (to the right of the Start BigInsights icon).
Now that all components are started, you may move on to the next section.
1.2 Exploring Hadoop Distributed File System (Terminal)
1. You can use the command-line approach and invoke the FileSystem (fs) shell using the format:
hadoop fs <args>
2. You can also manipulate HDFS using the BigInsights Web Console.
We will start with the hadoop fs -ls command, which returns the list of files and directories with
permission information.
Ensure the Hadoop components are all started, and from the same terminal window as before (logged in as biadmin), follow these instructions:
hadoop fs -ls
or
hadoop fs -ls /user/biadmin
Note that in the first command no directory was referenced, but it is equivalent to the second command, where /user/biadmin is explicitly specified. Each user gets their own home directory under /user. For example, in the case of user biadmin, the home directory is /user/biadmin. Any command with no explicit directory specified is interpreted relative to the user's home directory. User space in the native file system (Linux) is generally found under /home/biadmin or /usr/biadmin, but in HDFS user space is /user/biadmin (spelled "user" rather than "usr").
__3. To create the directory test you can issue the following command:
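The command itself appeared as a screenshot in the original lab; the standard form, assuming the directory is created under your HDFS home directory (/user/biadmin), is:

```shell
# Create a directory named "test"; the relative path resolves
# against the user's HDFS home directory, /user/biadmin
hadoop fs -mkdir test

# List the home directory to confirm the new "test" entry
hadoop fs -ls
```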
The result of ls here is similar to that found with Linux, except for the second column (in this case
either “1” or “-“). The “1” indicates the replication factor (generally “1” for pseudo-distributed
clusters and “3” for distributed clusters); directory information is kept in the namenode and thus
not subject to replication (hence “-“).
To use HDFS commands recursively, you generally add an "r" to the HDFS command.
__5. For example, to do a recursive listing, use the -lsr command rather than just -ls, as in the example below.
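A recursive listing of the home directory would then look like this (note that in later Hadoop releases -lsr is deprecated in favor of -ls -R):

```shell
# Recursively list all files and directories under /user/biadmin
hadoop fs -lsr
```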
__6. You can pipe (using the | character) any HDFS command to be used with the Linux shell. For
example, you can easily use grep with HDFS by doing the following.
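A minimal sketch of such a pipeline, filtering the listing for entries containing "test":

```shell
# Pipe the HDFS listing through the Linux grep command,
# keeping only the lines that contain the string "test"
hadoop fs -ls | grep test
```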
As you can see, grep returned only the lines containing "test" (removing the "Found x items" line and the other directories from the listing).
__7. To move files between your regular Linux file system and HDFS, you can use the put and get commands. For example, move the text file README to the Hadoop file system:
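The exact paths were shown in the original screenshot; assuming README sits in biadmin's Linux home directory, the command would look like:

```shell
# Copy the local file README into HDFS; the bare destination name
# places it in the user's HDFS home directory, /user/biadmin
hadoop fs -put /home/biadmin/README README
```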
You should now see a new file called /user/biadmin/README listed as shown above.
__8. In order to view the contents of this file use the -cat command as follows:
You should see the output of the README file (that is stored in HDFS). We can also use the Linux diff command to check whether the file we put in HDFS is actually the same as the original on the local filesystem.
cd /home/biadmin/
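The diff invocation itself was not preserved in this copy; one common form streams the HDFS copy and compares it with the local original, using bash process substitution:

```shell
# Compare the HDFS copy of README with the local original;
# no output means the two files are identical
diff <(hadoop fs -cat README) README
```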
Since the diff command produces no output, we know the files are the same (diff prints the lines in the files that differ).
To find the size of files you need to use the -du or -dus commands. Keep in mind that these
commands return the file size in bytes.
__10. To find the size of the README file use the following command.
__11. To find the size of all files individually in the /user/biadmin directory use the following command:
__12. To find the size of all files in total of the /user/biadmin directory use the following command.
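The commands for steps 10 through 12 follow the same pattern (sizes are reported in bytes):

```shell
# Step 10: size of the single README file
hadoop fs -du README

# Step 11: size of each file under /user/biadmin, listed individually
hadoop fs -du /user/biadmin

# Step 12: combined total size of everything under /user/biadmin
hadoop fs -dus /user/biadmin
```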
__13. If you would like to get more information about hadoop fs commands, invoke -help as follows.
hadoop fs -help
__14. For specific help on a command, add the command name after -help. For example, to get help on the dus command you'd do the following.
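For example, help for the dus command:

```shell
# Show usage information for the dus command only
hadoop fs -help dus
```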
We are now done with the terminal section; you may close the terminal.
1.3 Exploring Hadoop Distributed File System (Web Console)
__1. Start the Web Console by double-clicking on the BigInsights WebConsole icon.
__2. Verify that your Web console appears similar to this, and note each section:
Tasks: quick access to popular BigInsights tasks.
Quick Links: internal and external links and downloads to enhance your environment.
Learn More: online resources available to learn more about BigInsights.
This section introduces you to the Web console's main page displayed through the Welcome tab. The
Welcome page features links to common tasks, many of which can also be launched from other areas of
the console. In addition, the Welcome page includes links to popular external resources, such as the
BigInsights InfoCenter (product documentation) and community forum. You'll explore several aspects of
this page.
__3. In the Welcome tab, the Tasks pane allows you to quickly access common tasks. Select the View, start or stop a service task. If necessary, scroll down.
__4. This takes you to the Cluster Status tab. Here you can stop and start Hadoop services, as well as gain additional information, as shown in the next section.
__5. Click on the Welcome tab to return back to the main page.
__6. Inspect the Quick Links pane at top right and use its vertical scroll bar (if necessary) to become
familiar with the various resources accessible through this pane. The first several links simply
activate different tabs in the Web console, while subsequent links enable you to perform set-up
functions, such as adding BigInsights plug-ins to your Eclipse development environment.
__7. Inspect the Learn More pane at lower right. Links in this area access external Web resources that you may find useful, such as the Accelerator demos and documentation, the BigInsights InfoCenter, a public discussion forum, IBM support, and IBM's BigInsights product site. If desired, click on one or more of these links to see what's available to you.
The Web console allows administrators to inspect the overall health of the system as well as perform
basic functions, such as starting and stopping specific servers (or components), adding nodes to the
cluster, and so on. You’ll explore a subset of these capabilities here.
__8. Click on the Cluster Status tab at the top of the page to return to the Cluster Status window.
__9. Inspect the overall status of your cluster. The figure below was taken on a single-node cluster that had several services running. One service, Monitoring, was unavailable. (If you installed and started all BigInsights services on your cluster, your display will show all services running.)
__10. Click on the Hive service and note the detailed information provided for this service in the pane at right. From here, you can start or stop the Hive service (or any service you select) depending on your needs. For example, you can see the URL for Hive's Web interface and its process ID.
__11. Optionally, cut-and-paste the URL for Hive’s Web interface into a new tab of your browser.
You'll see an open source tool provided with Hive for administration purposes, as shown below.
__12. Close this tab and return to the Cluster Status section of the BigInsights Web console.
__14. In the pane at right (which displays the Hive status), click the red Stop button to stop the service.
__15. When prompted to confirm that you want to stop the Hive service, click OK and wait for the operation to complete. The right pane should appear similar to the following image.
__16. Restart the Hive service by clicking on the green arrow just beneath the Hive Status heading.
(See the previous figure.) When the operation completes, the Web console will indicate that
Hive is running again, likely under a process ID that differs from the earlier Hive process ID
shown at the beginning of this lab module. (You may need to use the Refresh button of your
Web browser to reload information displayed in the left pane.)
The Files tab of the console enables you to explore the contents of your file system, create new
subdirectories, upload small files for test purposes, and perform other file-related functions. In this
module, you’ll learn how to perform such tasks against the Hadoop Distributed File System (HDFS) of
BigInsights.
__17. Click on the Files tab of the console to begin exploring your distributed file system.
__18. Expand the directory tree shown in the pane at left (/user/biadmin). If you already uploaded
files to HDFS, you’ll be able to navigate through the directory to locate them.
__19. Become familiar with the functions provided through the icons at the top of this pane, as we'll refer to some of them in subsequent sections of this module. Simply point your cursor at an icon to learn its function. From left to right, the icons enable you to copy a file or directory, move a file, create a directory, rename, upload a file to HDFS, download a file from HDFS to your local file system, delete a file from HDFS, set permissions, open a command window to launch HDFS shell commands, and refresh the Web console page.
__20. Position your cursor on the /user/biadmin directory and click the Create Directory icon to create a subdirectory for test purposes.
__21. When a pop-up window appears prompting you for a directory name, enter ConsoleLab and click OK.
__22. Expand the directory hierarchy to verify that your new subdirectory was created.
__25. Click the Move icon. When the pop-up Move screen appears, select the ConsoleLab directory and click OK.
__26. Using the Set Permissions icon, you can change the permission settings for your directory. When finished, click OK.
__27. With the ConsoleLabTest2 folder highlighted, select the Remove icon and delete the directory.
__28. Remain in the ConsoleLab directory, and click the Upload icon to upload a small sample file for
test purposes.
__29. When the pop-up window appears, click the Browse button to browse your local file system for a
sample file.
__30. Navigate through your local file system to the directory where BigInsights was installed. For the IBM-provided VMware image, BigInsights is installed under /opt/ibm/biginsights. Locate the …/IHC subdirectory and select the CHANGES.txt file. Click Open.
__31. Verify that the window displays the name of this file. Note that you can continue to browse for additional files to upload, and that you can remove files from the displayed upload list. However, for this exercise, simply click OK.
__32. When the upload completes, verify that the CHANGES.txt file appears in the directory tree at left. If it is not immediately visible, click the refresh button. On the right, you should see a subset of the file's contents displayed in text format.
__33. Highlight the CHANGES.txt file in your ConsoleLab directory and click the Download button.
__34. When prompted, click the Save File button. Then select OK.
__35. If Firefox is set as the default browser, the file will be saved to your user's Downloads directory. For this exercise, the default location is fine.
1.4 Summary
Congratulations! You're now familiar with the Hadoop Distributed File System. You now know how to manipulate files within it using both the terminal and the BigInsights Web Console. You may move on to the next unit.