Labs Hadoop1
Labs, Lecture 1
General Notes on Homework Labs
Homework for this course is completed using the course Virtual Machine (VM),
which runs the CentOS 6.3 Linux distribution. This VM has CDH (Cloudera's
Distribution, including Apache Hadoop) installed in pseudo-distributed mode.
Pseudo-distributed mode is a method of running Hadoop whereby all Hadoop
daemons run on the same machine. It is, essentially, a cluster consisting of a single
machine. It works just like a larger Hadoop cluster; the key difference (apart from
speed, of course!) is that the block replication factor is set to one, since there is only
a single DataNode available.
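If you are curious, you can see this setting on the VM itself. Assuming the standard
CDH configuration directory (the exact path can differ between CDH versions), the
replication factor appears in hdfs-site.xml:
$ grep -A 1 dfs.replication /etc/hadoop/conf/hdfs-site.xml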
Getting Started
1. The VM is set to automatically log in as the user training. Should you log out
at any time, you can log back in as the user training with the password
training.
2. In some command-line steps in the labs, you will see lines like this:
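$ hadoop fs -put shakespeare \
/user/training/shakespeare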
The dollar sign ($) at the beginning of each line indicates the Linux shell prompt.
The actual prompt will include additional information (e.g.,
[training@localhost workspace]$), but this is omitted from these
instructions for brevity.
The backslash (\) at the end of the first line signifies that the command is not
complete and continues on the next line. You can enter the code exactly as
shown (on two lines), or you can enter it on a single line. If you do the latter, you
should not type in the backslash.
3. Although many students are comfortable using UNIX text editors like vi or
emacs, some might prefer a graphical text editor. To invoke the graphical editor
from the command line, type gedit followed by the path of the file you wish to
edit. Appending & to the command allows you to type additional commands
while the editor is still open. Here is an example of how to edit a file named
myfile.txt:
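$ gedit myfile.txt &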
Lab: Using HDFS
Files Used in This Exercise:
Data files (local):
~/training_materials/developer/data/shakespeare.tar.gz
~/training_materials/developer/data/access_log.gz
In this lab you will begin to get acquainted with the Hadoop tools. You will
manipulate files in HDFS, the Hadoop Distributed File System.
If you have not already done so, run the course setup script:
$ ~/scripts/developer/training_setup_dev.sh
Hadoop
Hadoop is already installed, configured, and running on your virtual machine.
Most of your interaction with the system will be through a command-line wrapper
called hadoop. If you run this program with no arguments, it prints a help message.
To try this, run the following command in a terminal window:
$ hadoop
The hadoop command is subdivided into several subsystems. For example, there is
a subsystem for working with files in HDFS and another for launching and managing
MapReduce processing jobs.
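For example, hadoop fs invokes the filesystem subsystem (used throughout this
lab), while hadoop jar submits a MapReduce job packaged as a JAR file. The JAR
and class names below are placeholders:
$ hadoop fs -ls /
$ hadoop jar myjob.jar MyDriver input output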
Step 1: Exploring HDFS
The subsystem associated with HDFS in the Hadoop wrapper program is called
FsShell. This subsystem can be invoked with the command hadoop fs.
1. Open a terminal window (if one is not already open) by double-clicking the
Terminal icon on the desktop.
2. Enter:
$ hadoop fs
You see a help message describing all the commands associated with the
FsShell subsystem.
3. Enter:
$ hadoop fs -ls /
This shows you the contents of the root directory in HDFS. There will be
multiple entries, one of which is /user. Individual users have a “home”
directory under this directory, named after their username; your username in
this course is training, therefore your home directory is /user/training.
4. List the contents of your home directory:
$ hadoop fs -ls /user/training
There are no files yet, so the command silently exits. This is different from
running hadoop fs -ls /foo, which refers to a directory that doesn't exist;
in that case, an error message would be displayed.
Note that the directory structure in HDFS has nothing to do with the directory
structure of the local filesystem; they are completely separate namespaces.
Step 2: Uploading Files
1. Change directories to the local filesystem directory containing the sample data
we will be using in the homework labs.
$ cd ~/training_materials/developer/data
If you perform a regular Linux ls command in this directory, you will see a few
files, including two named shakespeare.tar.gz and
shakespeare-stream.tar.gz. Both contain the complete works of Shakespeare
in text format, but they are packaged and organized differently. For now we will
work with shakespeare.tar.gz.
2. Unzip shakespeare.tar.gz:
$ tar zxvf shakespeare.tar.gz
This creates a directory named shakespeare/ containing several files on your
local filesystem.
3. Insert this directory into HDFS:
$ hadoop fs -put shakespeare /user/training/shakespeare
This copies the local shakespeare directory and its contents into a remote
HDFS directory named /user/training/shakespeare.
4. List the contents of your HDFS home directory now:
$ hadoop fs -ls /user/training
You should see an entry for the shakespeare directory.
5. Now try the same fs -ls command but without a path argument:
$ hadoop fs -ls
You should see the same results. If you don’t pass a directory name to the -ls
command, it assumes you mean your home directory, i.e. /user/training.
Relative paths
If you pass any relative (non-absolute) paths to FsShell commands (or use
relative paths in MapReduce programs), they are considered relative to your
home directory.
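For example, because your home directory is /user/training, the following
two commands are equivalent:
$ hadoop fs -ls shakespeare
$ hadoop fs -ls /user/training/shakespeare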
6. We will also need a sample web server log file, which we will put into HDFS for
use in future labs. This file is currently compressed using GZip. Rather than
extract the file to the local disk and then upload it, we will extract and upload in
one step. First, create a directory in HDFS in which to store it:
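$ hadoop fs -mkdir weblog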
7. Now, extract and upload the file in one step. The -c option to gunzip
uncompresses to standard output, and the dash (-) in the hadoop fs -put
command takes whatever is being sent to its standard input and places that data
in HDFS.
$ gunzip -c access_log.gz \
| hadoop fs -put - weblog/access_log
8. Run the hadoop fs -ls command to verify that the log file is in your HDFS
home directory.
9. The access log file is quite large – around 500 MB. Create a smaller version of
this file, consisting only of its first 5000 lines, and store the smaller version in
HDFS. You can use the smaller version for testing in subsequent labs.
$ hadoop fs -mkdir testlog
$ gunzip -c access_log.gz | head -n 5000 \
| hadoop fs -put - testlog/test_access_log
Step 3: Viewing and Manipulating Files
1. Enter:
$ hadoop fs -ls shakespeare
2. The glossary file included in the compressed file you began with is not
strictly a work of Shakespeare, so let’s remove it:
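$ hadoop fs -rm shakespeare/glossary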
Note that you could leave this file in place if you so wished. If you did, then it
would be included in subsequent computations across the works of
Shakespeare, and would skew your results slightly. As with many real-world big
data problems, you make trade-offs between the labor to purify your input data
and the precision of your results.
3. Enter:
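$ hadoop fs -cat shakespeare/histories | tail -n 50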
This prints the last 50 lines of Henry IV, Part 1 to your terminal. This command
is handy for viewing the output of MapReduce programs. Very often, an
individual output file of a MapReduce program is very large, making it
inconvenient to view the entire file in the terminal. For this reason, it’s often a
good idea to pipe the output of the fs -cat command into head, tail, more,
or less.
4. To download a file to work with on the local filesystem use the fs -get
command. This command takes two arguments: an HDFS path and a local path.
It copies the HDFS contents into the local filesystem:
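$ hadoop fs -get shakespeare/poems ~/shakepoems.txt
$ tail ~/shakepoems.txt
(The file and destination names here are just examples; any HDFS path and local
path will do.)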
Other Commands
There are several other operations available with the hadoop fs command to
perform the most common filesystem manipulations: mv, cp, mkdir, etc.
1. Enter:
$ hadoop fs
This displays a brief usage report of the commands available within FsShell.
Try playing around with a few of these commands if you like.
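For example (using illustrative names):
$ hadoop fs -cp shakespeare/poems poems-copy
$ hadoop fs -mv poems-copy poems-renamed
$ hadoop fs -rm poems-renamed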