Module 2 Cont.
The preferred way to interact with HDFS in Hadoop version 2 is through the hdfs command. Previously, in
version 1 and subsequently in many Hadoop examples, the hadoop dfs command was used to manage files
in HDFS. The listings below show typical output from the hdfs dfs -ls command, which lists files and
directories in HDFS.
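The listings were produced with hdfs dfs -ls commands along the lines of the following (a sketch; a stuff directory holding a file named test is assumed to already exist in the user's HDFS home directory):

$ hdfs dfs -ls          # list the current user's HDFS home directory (first listing)
$ hdfs dfs -ls stuff    # list the stuff directory (second listing)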
Found 3 items
drwx------   - hdfs hdfs          0 2015-05-27 20:00 .Trash
drwx------   - hdfs hdfs          0 2015-05-26 15:43 .staging
drwxr-xr-x   - hdfs hdfs          0 2015-05-28 13:03 DistributedShell

Found 1 items
-rw-r--r--   2 hdfs hdfs      12857 2015-05-29 13:12 stuff/test
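A Unix grep command can serve as a simple stand-in for a mapper (a sketch; the exact command used here is assumed):

$ grep " Kutuzov " war-and-peace.txt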
This command searches for the word Kutuzov (with leading and trailing
spaces) in a text file called war-and-peace.txt.
Each match is reported as a single line of text that contains the search term.
The search term, Kutuzov, is a character in the book.
Though not strictly a MapReduce process, this idea is quite similar to and
much faster than the manual process of counting the instances of Kutuzov in the
printed book.
The analogy can be taken a bit further by using the two simple (and naive) shell scripts shown in
Listings 5.1 and 5.2. We can perform the same operation (much more slowly) and tokenize both the
Kutuzov and Petersburg strings in the text:
Listing 5.1 Simple Mapper Script
#!/bin/bash
# Read lines from stdin; emit a "key,1" pair for each occurrence
# of Kutuzov or Petersburg.
while read line ; do
  for token in $line ; do
    if [ "$token" = "Kutuzov" ] ; then
      echo "Kutuzov,1"
    elif [ "$token" = "Petersburg" ] ; then
      echo "Petersburg,1"
    fi
  done
done
Listing 5.2 Simple Reducer Script
#!/bin/bash
# Read "key,1" pairs from stdin and total the counts for each key.
kcount=0
pcount=0
while read line ; do
  if [ "$line" = "Kutuzov,1" ] ; then
    let kcount=kcount+1
  elif [ "$line" = "Petersburg,1" ] ; then
    let pcount=pcount+1
  fi
done
echo "Kutuzov,$kcount"
echo "Petersburg,$pcount"
The reducer function is then applied to each key–value pair, which in turn
produces a collection of values in the same domain:
Reduce(key2, list(value2)) → list(value3)
Each reducer call typically produces either one value (value3) or an empty
response.
Thus, the MapReduce framework transforms a list of (key, value) pairs into a list
of values.
The functional nature of MapReduce has some important properties:
Data flow is in one direction (map to reduce). It is possible to use the output of
a reduce step as the input to another MapReduce process.
As with functional programming, the input data are not changed. By applying the
mapping and reduction functions to the input data, new data are produced.
Because there is no dependency on how the mapping and reducing functions are
applied to the data, the mapper and reducer data flow can be implemented in any
number of ways to provide better performance.
Distributed (parallel) implementations of MapReduce enable large amounts of data
to be analyzed quickly.
Hadoop accomplishes parallelism by using a distributed file system (HDFS) to
slice and spread data over multiple servers.
Apache Hadoop MapReduce will try to move the mapping tasks to the server
that contains the data slice.
Results from each data slice are then combined in the reducer step.
MapReduce Parallel Data Flow
Parallel execution of MapReduce requires other steps in addition to the mapper and
reducer processes.
The basic steps are as follows:
1. Input Splits. As mentioned, HDFS distributes and replicates data over multiple
servers. The default data chunk or block size is 64MB. Thus, a 500MB file would
be broken into 8 blocks and written to different machines in the cluster. The data
are also replicated on multiple machines (typically three machines). These data
slices are physical boundaries determined by HDFS. The input splits used by
MapReduce are logical boundaries based on the input data. For example, the split
size can be based on the number of records in a file or an actual size in bytes.
Splits are almost always smaller than the HDFS block size. The number of splits
corresponds to the number of mapping processes used in the map stage.
2. Map Step. The mapping process is where the parallel nature of Hadoop comes into
play. For large amounts of data, many mappers can be operating at the same
time. The user provides the specific mapping process. MapReduce will try to
execute the mapper on the machines where the block resides. Because the file is
replicated in HDFS, the least busy node with the data will be chosen. If all
nodes holding the data are too busy, MapReduce will try to pick a node that is
closest to the node that hosts the data block. The last choice is any node in the
cluster that has access to HDFS.
3. Combiner Step. It is possible to provide an optimization or pre-reduction as
part of the map stage where key–value pairs are combined prior to the next stage.
The combiner stage is optional.
4. Shuffle Step. Before the parallel reduction stage can complete, all similar keys
must be combined and counted by the same reducer process. Therefore, results
of the map stage must be collected by key–value pairs and shuffled to the same
reducer process. If only a single reducer process is used, the shuffle stage is not
needed.
5. Reduce Step. The final step is the actual reduction. In this stage, the data reduction
is performed as per the programmer’s design. The reduce step is also optional. The
results are written to HDFS. Each reducer will write an output file. (A rough
command-line analogy of these steps is sketched below.)
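The following is a rough command-line analogy of these steps (a sketch only, not how Hadoop actually executes a job): the mapper and reducer scripts from Listings 5.1 and 5.2 are run over explicit splits of the input, with sort standing in for the shuffle (the optional combiner of step 3 is skipped):

$ split -n l/3 war-and-peace.txt chunk-                              # step 1: three input splits (GNU split)
$ for f in chunk-??; do ./mapper.sh < "$f" > "$f.map" & done; wait   # step 2: run the mappers in parallel
$ sort chunk-*.map > shuffled.txt                                    # step 4: shuffle/sort groups identical keys
$ ./reducer.sh < shuffled.txt                                        # step 5: reduce the collected key,1 pairs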
Figure 5.1 is an example of a simple Hadoop MapReduce data flow for a word
count program.
The map process counts the words in the split, and the reduce process calculates
the total for each word.
The MapReduce data flow shown in Figure 5.1 is the same regardless of the
specific map and reduce tasks.
Figure 5.1 Apache Hadoop parallel MapReduce data flow
The input to the MapReduce application is the following file in HDFS with three
lines of text.
The goal is to count the number of times each word is used.
see spot run
run spot run
see the cat
The first thing MapReduce will do is create the data splits. For simplicity, each
line will be one split.
Since each split will require a map task, there are three mapper processes that
count the number of words in the split.
On a cluster, the results of each map task are written to local disk and not to HDFS.
Next, similar keys need to be collected and sent to a reducer process.
The shuffle step requires data movement and can be expensive in terms of
processing time. Depending on the nature of the application, the amount of data
that must be shuffled throughout the cluster can vary from small to large.
Once the data have been collected and sorted by key, the reduction step can begin
(even if only partial results are available).
It is not necessary—and not normally recommended—to have a reducer for each
key–value pair as shown in Figure 5.1. In some cases, a single reducer will
provide adequate performance; in other cases, multiple reducers may be required
to speed up the reduce phase. The number of reducers is a tunable option for
many applications.
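For example, when the job driver uses Hadoop's generic options (as the bundled example and streaming jobs do via ToolRunner), the number of reducers can be set on the command line; the jar, class, and path names below are placeholders:

$ hadoop jar wordcount.jar WordCount -D mapreduce.job.reduces=4 input output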
The final step is to write the output to HDFS.
As mentioned, a combiner step enables some pre-reduction of the map
output data. For instance, in the previous example, one map produced the
following counts:
(run,1)
(spot,1)
(run,1)
As shown in Figure 5.2, the count for run can be combined into
(run,2) before the shuffle. This optimization can help minimize the
amount of data transfer needed for the shuffle phase.
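The effect of a combiner can be imitated on the command line with a short awk filter that sums the counts per key (a sketch; note that the naive reducer in Listing 5.2 only recognizes lines ending in ",1", so a combiner-aware reducer would need to add the incoming counts rather than increment by one):

$ printf "run,1\nspot,1\nrun,1\n" | awk -F, '{sum[$1]+=$2} END {for (k in sum) print k "," sum[k]}' | sort
run,2
spot,1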
Pig Example Walk-Through
To begin the Pig example, copy the data file (a copy of the /etc/passwd file) into HDFS for Hadoop MapReduce operation:
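A typical form of this copy step (the exact commands are assumed here) is:

$ cp /etc/passwd .                 # make a local working copy
$ hdfs dfs -put passwd passwd      # copy it into the user's HDFS home directory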
You can confirm the file is in HDFS by entering the following command:
hdfs dfs -ls passwd
-rw-r--r-- 2 hdfs hdfs 2526 2015-03-17 11:08 passwd
In the following example of local Pig operation, all processing is done on the
local machine (Hadoop is not used). First, the interactive command line is started:
$ pig -x local
If Pig starts correctly, you will see a grunt> prompt. Next, enter the following
commands to load the passwd file and then grab the user name and dump it to the
terminal. Note that Pig commands must end with a semicolon (;).
grunt> A = load 'passwd' using PigStorage(':');
grunt> B = foreach A generate $0 as id;
grunt> dump B;
The load statement uses a colon (:) as the delimiter because /etc/passwd fields are colon-separated,
so A holds all the lines of the file split into fields. The foreach statement extracts the first
field ($0) of each row as the user name, and dump B requests that the contents of B (just the user
names) be displayed.
The processing will start and a list of user names will be printed to the screen. To exit
the interactive session, enter the command quit.
grunt> quit
To use Hadoop MapReduce, start Pig as follows (or just enter pig):
$ pig -x mapreduce
If you are using the Hortonworks HDP distribution with Apache Tez installed, the Tez engine can
be used as follows:
$ pig -x tez
Pig can also be run from a script. The following script (id.pig) is designed to do
the same things as the interactive version:
/* id.pig */
A = load 'passwd' using PigStorage(':');  -- load the passwd file into A
B = foreach A generate $0 as id;          -- extract the user IDs
dump B;                                   -- print B to the screen
store B into 'id.out';                    -- write the results to a directory named id.out
Comments are delineated by /* */ and -- at the end of a line. First, ensure that the id.out
directory is not in your local directory, and then start Pig with the script on the
command line:
$ /bin/rm -r id.out/
$ pig -x local id.pig
If the script worked correctly, you should see at least one data file with the results and a
zero-length file with the name _SUCCESS. To run the MapReduce version, use the
same procedure; the only difference is that now all reading and writing takes place in
HDFS.
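For example (a sketch consistent with the earlier commands, with the id.out cleanup now done in HDFS):

$ hdfs dfs -rm -r -f id.out
$ pig -x mapreduce id.pig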
Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data
summarization, ad hoc queries, and the analysis of large data sets using a SQL-like
language called HiveQL.
Hive offers the following features:
Tools to enable easy data extraction, transformation, and loading (ETL)
A mechanism to impose structure on a variety of data formats
Access to files stored either directly in HDFS or in other data storage systems
such as HBase
Query execution via MapReduce and Tez (optimized MapReduce)
Hive Example Walk-Through
To start Hive, simply enter the hive command. If Hive starts correctly, you should get a hive>
prompt.
$ hive
(some messages may show up here)
hive>
As a simple test, create and drop a table. Note that Hive commands must end with a
semicolon (;).
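A minimal sketch of such a test (the table name and columns are arbitrary placeholders):

hive> CREATE TABLE pokes (foo INT, bar STRING);
hive> SHOW TABLES;
hive> DROP TABLE pokes;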
A more detailed example can be developed using a web server log file to summarize message
types. First, create a table using the following command:
hive> CREATE TABLE logs(t1 string, t2 string, t3 string, t4 string, t5 string, t6 string, t7 string) ROW
FORMAT DELIMITED FIELDS TERMINATED BY ' ';
OK
Time taken: 0.129 seconds
The CREATE TABLE statement creates a new table in Hive called logs with seven string columns, and the
ROW FORMAT clause tells Hive that fields in the data file are separated by single spaces.
Next, load the data—in this case, from the sample.log file. Note that the file is found in the
local directory and not in HDFS.
hive> LOAD DATA LOCAL INPATH 'sample.log' OVERWRITE INTO TABLE logs;
In this command, LOCAL INPATH 'sample.log' indicates that sample.log is read from the local file
system rather than from HDFS, and OVERWRITE INTO TABLE logs replaces any existing data in the logs
table with the contents of sample.log.
Finally, apply the select step to the file. Note that this invokes a Hadoop MapReduce
operation. The results appear at the end of the output (e.g., totals for the message
types DEBUG, ERROR, and so on).
hive> SELECT t4 AS sev, COUNT(*) AS cnt FROM logs WHERE t4 LIKE '[%' GROUP BY t4;
Output:
Query ID = hdfs_20150327130000_d1e1a265-a5d7-4ed8-b785-2c6569791368
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1427397392757_0001, Tracking URL = http://norbert:8088/proxy/application_1427397392757_0001/
Kill Command = /opt/hadoop-2.6.0/bin/hadoop job -kill job_1427397392757_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2015-03-27 13:00:17,399 Stage-1 map = 0%, reduce = 0%
2015-03-27 13:00:26,100 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.14 sec
2015-03-27 13:00:34,979 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.07 sec
MapReduce Total cumulative CPU time: 4 seconds 70 msec
Ended Job = job_1427397392757_0001
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 4.07 sec HDFS Read: 106384
HDFS Write: 63 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 70 msec
OK
[DEBUG] 434
[ERROR] 3
[FATAL] 1
[INFO] 96
[TRACE] 816
[WARN] 4
Time taken: 32.624 seconds, Fetched: 6 row(s)
In this query, t4 is selected and renamed sev (short for severity); COUNT(*) AS cnt counts how many
rows fall into each group; FROM logs reads the data from the logs table; WHERE t4 LIKE '[%' keeps
only rows whose fourth field begins with a bracket (the bracketed message type); and GROUP BY t4
groups rows with the same t4 value so that each message type is counted separately.
To exit Hive, simply type exit;
hive> exit;