Module 2 Cont.

HDFS User Commands

The preferred way to interact with HDFS in Hadoop version 2 is through the hdfs command. Previously, in
version 1 and subsequently in many Hadoop examples, the hadoop dfs command was used to manage files
in HDFS. Running hdfs without any arguments prints the full range of options that are available for
the hdfs command.

General HDFS Commands


The version of HDFS can be found by using the version option.
$ hdfs version
Hadoop 2.6.0.2.2.4.2-2

HDFS provides a series of commands similar to those found in a standard POSIX file system. A list of these commands can be obtained by issuing the following command:

$ hdfs dfs

Several of these commands will be highlighted here under the user account hdfs.

 List Files in HDFS


To list the files in the root HDFS directory, enter the following command:
$ hdfs dfs -ls /

Found 2 items
drwxrwxrwx - yarn hadoop 0 2015-04-29 16:52 /app-logs
drwxr-xr-x - hdfs hdfs 0 2015-04-21 14:28 /apps
To list files in your home directory, enter the following command:
$ hdfs dfs -ls

Found 3 items
drwx------ - hdfs hdfs 0 2015-05-27 20:00 .Trash
drwx------ - hdfs hdfs 0 2015-05-26 15:43 .staging
drwxr-xr-x - hdfs hdfs 0 2015-05-28 13:03 DistributedShell

The same result can be obtained by issuing the following command:


$ hdfs dfs -ls /user/hdfs

 Make a Directory in HDFS


To make a directory in HDFS, use the following command. As with the -ls
command, when no path is supplied, the user’s home directory is used (e.g.,
/user/hdfs).
$ hdfs dfs -mkdir stuff

 Copy Files to HDFS


To copy a file from your current local directory into HDFS, use the following
command. If a full path is not supplied, your home directory is assumed. In this case,
the file test is placed in the directory stuff that was created previously.
$ hdfs dfs -put test stuff

The file transfer can be confirmed by using the -ls command:


$ hdfs dfs -ls stuff

Found 1 items
-rw-r--r-- 2 hdfs hdfs 12857 2015-05-29 13:12 stuff/test

 Copy Files from HDFS


Files can be copied back to your local file system using the following command. In
this case, the file we copied into HDFS, test, will be copied back to the current
local directory with the name test-local.

$ hdfs dfs -get stuff/test test-local

 Copy Files within HDFS


The following command will copy a file in HDFS:
$ hdfs dfs -cp stuff/test test.hdfs
 Delete a File within HDFS
The following command will delete the HDFS file test.hdfs that was created
previously:
$ hdfs dfs -rm test.hdfs

Moved: 'hdfs://limulus:8020/user/hdfs/stuff/test' to trash at: hdfs://limulus:8020/user/hdfs/.Trash/Current

Note that when the fs.trash.interval option is set to a non-zero value in


core-site.xml, all deleted files are moved to the user’s .Trash directory.
This can be avoided by including the -skipTrash option.
$ hdfs dfs -rm -skipTrash stuff/test

Deleted stuff/test
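
The trash interval that is in effect on a particular cluster can also be checked from the command line instead of opening core-site.xml. The following is a quick sketch; the value shown (360 minutes) is only an assumed example, and a value of 0 means deleted files bypass .Trash entirely:

$ hdfs getconf -confKey fs.trash.interval
360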

 Delete a Directory in HDFS


The following command will delete the HDFS directory stuff and all its contents:
$ hdfs dfs -rm -r -skipTrash stuff

Deleted stuff

 Get an HDFS Status Report


Regular users can get an abbreviated HDFS status report using the following
command. Those with HDFS administrator privileges will get a full (and potentially
long) report. Also, this command uses dfsadmin instead of dfs to invoke
administrative commands.
$ hdfs dfsadmin -report

Configured Capacity: 1503409881088 (1.37 TB)


Present Capacity: 1407945981952 (1.28 TB)
DFS Remaining: 1255510564864 (1.14 TB)
DFS Used: 152435417088 (141.97 GB)
DFS Used%: 10.83%
Under replicated blocks: 54
Blocks with corrupt replicas: 0
Missing blocks: 0

Hadoop MapReduce Framework


 The MapReduce programming model is conceptually simple.
 Based on two simple steps—applying a mapping process and then reducing
(condensing/collecting) the results—it can be applied to many real-world
problems.
The MapReduce Model
 There are two stages: a mapping stage and a reducing stage.
 In the mapping stage, a mapping procedure is applied to input data.
 The map is usually some kind of filter or sorting process.
 For instance, assume you need to count how many times the name "Kutuzov"
appears in the novel War and Peace. One solution is to gather 20 friends and
give them each a section of the book to search. This step is the map stage.
 The reduce phase happens when everyone is done counting and you sum the total
as your friends tell you their counts.
 Now consider how this same process could be accomplished using simple
*nix command-line tools.

 The following grep command applies a specific map to a text file:


$ grep " Kutuzov " war-and-peace.txt

 This command searches for the word Kutuzov (with leading and trailing
spaces) in a text file called war-and-peace.txt.
 Each match is reported as a single line of text that contains the search term.
 The search term, Kutuzov, is a character in the book.
 Though not strictly a MapReduce process, this idea is quite similar to and
much faster than the manual process of counting the instances of Kutuzov in the
printed book.
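 As a rough shell analogy for the reduce step, the matching lines can be summed by piping the grep output to wc -l. Note that this counts matching lines rather than individual word occurrences, so the result can differ slightly from the token-based scripts shown below:

$ grep " Kutuzov " war-and-peace.txt | wc -l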
 The analogy can be taken a bit further by using the two simple (and naive) shell scripts shown in Listings 5.1 and 5.2.
 We can perform the same operation (much more slowly) and tokenize both the
Kutuzov and Petersburg strings in the text:

$ cat war-and-peace.txt | ./mapper.sh | ./reducer.sh

Kutuzov,315
Petersburg,128

Listing 5.1 Simple Mapper Script

#!/bin/bash
while read line ; do
    for token in $line; do
        if [ "$token" = "Kutuzov" ] ; then
            echo "Kutuzov,1"
        elif [ "$token" = "Petersburg" ] ; then
            echo "Petersburg,1"
        fi
    done
done
Listing 5.2 Simple Reducer Script

#!/bin/bash
kcount=0
pcount=0
while read line ; do
    if [ "$line" = "Kutuzov,1" ] ; then
        let kcount=kcount+1
    elif [ "$line" = "Petersburg,1" ] ; then
        let pcount=pcount+1
    fi
done
echo "Kutuzov,$kcount"
echo "Petersburg,$pcount"

 Formally, the MapReduce process can be described as follows.


 The mapper and reducer functions are both defined with respect to data structured
in (key, value) pairs.
 The mapper takes one pair of data with a type in one data domain, and returns a
list of pairs in a different domain:
Map(key1,value1) → list(key2,value2)

 The reducer function is then applied to each key and its associated list of values, which in turn produces a collection of values in the same domain:
Reduce(key2, list (value2)) → list(value3)

 Each reducer call typically produces either one value (value3) or an empty
response.
 Thus, the MapReduce framework transforms a list of (key, value) pairs into a list
of values.
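
To make these signatures concrete, consider the Kutuzov count again (this worked instance is purely illustrative; the total of 315 is the one reported by the scripts above). The mapper takes a (line number, line of text) pair and emits a (Kutuzov, 1) pair for each matching word, and the reducer sums the list of ones collected for that key:

Map(n, "... Kutuzov ...") → list((Kutuzov, 1))
Reduce(Kutuzov, list(1, 1, ..., 1)) → list(315)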
The functional nature of MapReduce has some important properties:
 Data flow is in one direction (map to reduce). It is possible to use the output of a reduce step as the input to another MapReduce process.
 As with functional programming, the input data are not changed. By applying the mapping and reduction functions to the input data, new data are produced.
 Because there is no dependency on how the mapping and reducing functions are applied to the data, the mapper and reducer data flow can be implemented in any number of ways to provide better performance.
 Distributed (parallel) implementations of MapReduce enable large amounts of data
to be analyzed quickly.
 Hadoop accomplishes parallelism by using a distributed file system (HDFS) to
slice and spread data over multiple servers.
 Apache Hadoop MapReduce will try to move the mapping tasks to the server
that contains the data slice.
 Results from each data slice are then combined in the reducer step.
MapReduce Parallel Data Flow
Parallel execution of MapReduce requires other steps in addition to the mapper and
reducer processes.
The basic steps are as follows:
1. Input Splits. As mentioned, HDFS distributes and replicates data over multiple
servers. The default data chunk or block size is 64MB. Thus, a 500MB file would
be broken into 8 blocks and written to different machines in the cluster. The data
are also replicated on multiple machines (typically three machines). These data
slices are physical boundaries determined by HDFS. The input splits used by
MapReduce are logical boundaries based on the input data. For example, the split
size can be based on the number of records in a file or an actual size in bytes.
Splits are almost always smaller than the HDFS block size. The number of splits
corresponds to the number of mapping processes used in the map stage.
2. Map Step. The mapping process is where the parallel nature of Hadoop comes into
play. For large amounts of data, many mappers can be operating at the same
time. The user provides the specific mapping process. MapReduce will try to
execute the mapper on the machines where the block resides. Because the file is
replicated in HDFS, the least busy node with the data will be chosen. If all
nodes holding the data are too busy, MapReduce will try to pick a node that is
closest to the node that hosts the data block. The last choice is any node in the
cluster that has access to HDFS.
3. Combiner Step. It is possible to provide an optimization or pre-reduction as
part of the map stage where key–value pairs are combined prior to the next stage.
The combiner stage is optional.
4. Shuffle Step. Before the parallel reduction stage can complete, all similar keys
must be combined and counted by the same reducer process. Therefore, results
of the map stage must be collected by key–value pairs and shuffled to the same
reducer process. If only a single reducer process is used, the shuffle stage is not
needed.
5. Reduce Step. The final step is the actual reduction. In this stage, the data reduction
is performed as per the programmer’s design. The reduce step is also optional. The
results are written to HDFS. Each reducer will write an output file.
Figure 5.1 is an example of a simple Hadoop MapReduce data flow for a word
count program.
 The map process counts the words in the split, and the reduce process calculates
the total for each word.
 The MapReduce data flow shown in Figure 5.1 is the same regardless of the
specific map and reduce tasks.
Figure 5.1 Apache Hadoop parallel MapReduce data flow
 The input to the MapReduce application is the following file in HDFS with three
lines of text.
 The goal is to count the number of times each word is used.
see spot run
run spot run
see the cat

 The first thing MapReduce will do is create the data splits. For simplicity, each
line will be one split.
 Since each split will require a map task, there are three mapper processes that
count the number of words in the split.
 On a cluster, the results of each map task are written to local disk and not to HDFS.
 Next, similar keys need to be collected and sent to a reducer process.
 The shuffle step requires data movement and can be expensive in terms of
processing time. Depending on the nature of the application, the amount of data
that must be shuffled throughout the cluster can vary from small to large.
 Once the data have been collected and sorted by key, the reduction step can begin
(even if only partial results are available).
 It is not necessary—and not normally recommended—to have a reducer for each
key–value pair as shown in Figure 5.1. In some cases, a single reducer will
provide adequate performance; in other cases, multiple reducers may be required
to speed up the reduce phase. The number of reducers is a tunable option for
many applications.
 The final step is to write the output to HDFS.
 As mentioned, a combiner step enables some pre-reduction of the map
output data. For instance, in the previous example, one map produced the
following counts:
(run,1)
(spot,1)
(run,1)

As shown in Figure 5.2, the count for run can be combined into
(run,2) before the shuffle. This optimization can help minimize the
amount of data transfer needed for the shuffle phase.

Figure 5.2 Adding a combiner process to the map step in MapReduce
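
The effect of a combiner on one mapper's output can be imitated with ordinary shell tools. This is only a loose analogy (Hadoop performs combining inside the framework, not with a pipeline), but it shows the same partial aggregation of repeated keys:

$ printf 'run,1\nspot,1\nrun,1\n' | sort | uniq -c
      2 run,1
      1 spot,1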


 The Hadoop YARN resource manager and the MapReduce framework
determine the actual placement of mappers and reducers.
 As mentioned earlier, the MapReduce framework will try to place the
map task as close to the data as possible.
 It will request the placement from the YARN scheduler but may not get
the best placement due to the load on the cluster.
 In general, nodes can run both mapper and reducer tasks.
 Figure 5.3 shows a simple three-node MapReduce process.
 Once the mapping is complete, the same nodes begin the reduce process.
 The shuffle stage makes sure the necessary data are sent to each reducer.
 Also note that there is no requirement that all the mappers complete at the
same time or that the mapper on a specific node be complete before a
reducer is started. Reducers can be set to start shuffling based on a
threshold of percentage of mappers that have finished.

Figure 5.3 Process placement during MapReduce (Adapted from Yahoo Hadoop Documentation)

USING APACHE PIG


Apache Pig is a high-level language that enables programmers to write complex
MapReduce transformations using a simple scripting language. Pig Latin (the actual
language) defines a set of transformations on a data set such as aggregate, join, and
sort.
Apache Pig has several usage modes.
• The first is a local mode in which all processing is done on the local machine.
• The non-local (cluster) modes are MapReduce and Tez. These modes execute
the job on the cluster using either the MapReduce engine or the optimized Tez
engine.
There are also interactive and batch modes available; they enable Pig applications to be
developed locally in interactive modes, using small amounts of data, and then run at
scale on the cluster in a production mode. The modes are summarized in Table 7.1.

Table 7.1 Apache Pig Usage Modes

Pig Example Walk-Through


In this simple example, Pig is used to extract user names from the passwd file. The example assumes the user is hdfs, but any valid user with access to HDFS can run it.
 To begin the example, copy the passwd file (a text file in Linux that stores user account information) to a working directory for local Pig operation:
$ cp /etc/passwd .

 Next, copy the data file into HDFS for Hadoop MapReduce operation:

$ hdfs dfs -put passwd passwd

 You can confirm the file is in HDFS by entering the following command:
$ hdfs dfs -ls passwd
-rw-r--r-- 2 hdfs hdfs 2526 2015-03-17 11:08 passwd
 In the following example of local Pig operation, all processing is done on the
local machine (Hadoop is not used). First, the interactive command line is started:
$ pig -x local

 If Pig starts correctly, you will see a grunt> prompt. Next, enter the following
commands to load the passwd file and then grab the user name and dump it to the
terminal. Note that Pig commands must end with a semicolon (;).
grunt> A = load 'passwd' using PigStorage(':');
grunt> B = foreach A generate $0 as id;
grunt> dump B;

[Explanation: 1. grunt> A = load 'passwd' using PigStorage(':');


 Loads the passwd file (like /etc/passwd) into a Pig relation (like a table).

 Uses colon (:) as the delimiter (since /etc/passwd fields are colon-separated).

 Now A holds all the lines from the file, split into fields.

2.grunt> B = foreach A generate $0 as id;


 For each line (record) in A, this keeps only the first field ($0) and renames it
as id.
 In /etc/passwd, the first field is the username.

3. grunt> dump B;
Prints the contents of B (just the usernames) to the screen.]
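
As a purely illustrative note (not output captured from the original session), dump prints each record as a parenthesized tuple, so the listing produced in the next step usually starts with the familiar system accounts:

(root)
(bin)
(daemon)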

 The processing will start and a list of user names will be printed to the screen. To exit
the interactive session, enter the command quit.

grunt> quit

 To use Hadoop MapReduce, start Pig as follows (or just enter pig):

$ pig -x mapreduce

If you are using the Hortonworks HDP distribution with Tez installed, the Tez engine can be used as follows:
$ pig -x tez

Pig can also be run from a script. This script, which is repeated here, is designed to do
the same things as the interactive version:

/* id.pig */
A = load 'passwd' using PigStorage(':');  -- load the passwd file into A
B = foreach A generate $0 as id;          -- extract the user IDs
dump B;                                   -- display the results on the terminal
store B into 'id.out';                    -- write the results to a directory named id.out

Comments are delineated by /* */ and -- at the end of a line. First, ensure that the id.out
directory is not in your local directory, and then start Pig with the script on the
command line:

$ /bin/rm -r id.out/
$ pig -x local id.pig

If the script worked correctly, you should see at least one data file with the results and a
zero- length file with the name _SUCCESS. To run the MapReduce version, use the
same procedure; the only difference is that now all reading and writing takes place in
HDFS.

$ hdfs dfs -rm -r id.out


$ pig id.pig
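
In either mode, the results end up in a directory named id.out. The exact part-file name inside it depends on how the job ran, so a glob is the simplest way to inspect the output; both commands below are sketches under that assumption:

$ ls id.out/                          # local mode: files on the local disk
$ hdfs dfs -cat id.out/part-* | head  # MapReduce mode: files in HDFS
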
USING APACHE HIVE

Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data
summarization, ad hoc queries, and the analysis of large data sets using a SQL-like
language called HiveQL.
Hive offers the following features:
 Tools to enable easy data extraction, transformation, and loading (ETL)
 A mechanism to impose structure on a variety of data formats
 Access to files stored either directly in HDFS or in other data storage systems
such as HBase
 Query execution via MapReduce and Tez (optimized MapReduce)
Hive Example Walk-Through

To start Hive, simply enter the hive command. If Hive starts correctly, you should get a hive>
prompt.

$ hive
(some messages may show up here)
hive>

As a simple test, create and drop a table. Note that Hive commands must end with a
semicolon (;).

hive> CREATE TABLE pokes (foo INT, bar STRING);


OK
Time taken: 1.705 seconds

hive> SHOW TABLES;


OK
pokes
Time taken: 0.174 seconds, Fetched: 1 row(s)

hive> DROP TABLE pokes;


OK
Time taken: 4.038 seconds

A more detailed example can be developed using a web server log file to summarize message
types. First, create a table using the following command:

hive> CREATE TABLE logs(t1 string, t2 string, t3 string, t4 string, t5 string, t6 string, t7 string) ROW
FORMAT DELIMITED FIELDS TERMINATED BY ' ';
OK
Time taken: 0.129 seconds
[ CREATE TABLE logs(...)
➜ Creates a new table in Hive called logs.

 t1, t2, ..., t7 string


➜ The table has 7 columns (t1 to t7), and all of them will store text (string) data.

 ROW FORMAT DELIMITED


➜ This tells Hive the table data is plain text, not in a special format like JSON or ORC.

 FIELDS TERMINATED BY ' '

➜ Columns are separated by spaces in the data file.
]
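
If you want to confirm how Hive recorded the table layout, the schema can be echoed back with DESCRIBE. The invocation below uses hive -e, which runs a single quoted statement and exits; the output simply restates the seven string columns from the CREATE TABLE statement (informational messages omitted):

$ hive -e "DESCRIBE logs;"
t1      string
t2      string
t3      string
t4      string
t5      string
t6      string
t7      string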

Next, load the data—in this case, from the sample.log file. Note that the file is found in the
local directory and not in HDFS.

hive> LOAD DATA LOCAL INPATH 'sample.log' OVERWRITE INTO TABLE logs;

Output:

Loading data to table default.logs


Table default.logs stats: [numFiles=1, numRows=0, totalSize=99271, rawDataSize=0]
OK
Time taken: 0.953 seconds

[explanation:

 LOAD DATA ➜ Tells Hive to load a data file into a table.

 LOCAL INPATH 'sample.log' ➜ The file sample.log is on your local machine, not in HDFS.
 OVERWRITE INTO TABLE logs ➜ Replaces (overwrites) any existing data in the logs table with the contents of sample.log.
]
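
Before running the aggregation, it can be reassuring to confirm that rows actually arrived. The two quick checks below are optional additions, not part of the original walk-through, and again use hive -e so they can be run straight from the shell:

$ hive -e "SELECT * FROM logs LIMIT 3;"
$ hive -e "SELECT COUNT(*) FROM logs;"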

Finally, apply the select step to the file. Note that this invokes a Hadoop MapReduce
operation. The results appear at the end of the output (e.g., totals for the message
types DEBUG, ERROR, and so on).

hive> SELECT t4 AS sev, COUNT(*) AS cnt FROM logs WHERE t4 LIKE '[%' GROUP BY t4;

output:

Query ID = hdfs_20150327130000_d1e1a265-a5d7-4ed8-b785-2c6569791368
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1427397392757_0001, Tracking URL = http://norbert:8088/proxy/
application_1427397392757_0001/
Kill Command = /opt/hadoop-2.6.0/bin/hadoop job -kill job_1427397392757_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2015-03-27 13:00:17,399 Stage-1 map = 0%, reduce = 0%
2015-03-27 13:00:26,100 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.14 sec
2015-03-27 13:00:34,979 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.07 sec
MapReduce Total cumulative CPU time: 4 seconds 70 msec
Ended Job = job_1427397392757_0001
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 4.07 sec HDFS Read: 106384
HDFS Write: 63 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 70 msec
OK
[DEBUG] 434
[ERROR] 3
[FATAL] 1
[INFO] 96
[TRACE] 816
[WARN] 4
Time taken: 32.624 seconds, Fetched: 6 row(s)

[explanation:

 SELECT t4 AS sev
➜ You are selecting column t4 and calling it sev (probably stands for "severity").

 COUNT(*) AS cnt
➜ You're counting how many times each value of t4 appears. The result is labeled as cnt (count).

 FROM logs
➜ This data is coming from your logs table.

 WHERE t4 LIKE '[%'


➜ Filters rows where t4 starts with a [ character (e.g., [INFO], [ERROR], etc.).

 GROUP BY t4
➜ Groups all rows that have the same t4 value, so you can count how many times each group appears.

]
To exit Hive, simply type exit;

hive> exit;
