Module 2 Cont.
The preferred way to interact with HDFS in Hadoop version 2 is through the hdfs command. Previously, in
version 1 and subsequently in many Hadoop examples, the hadoop dfs command was used to manage files
in HDFS. The listings below show typical output from the hdfs dfs -ls command, which lists files and
directories in HDFS.
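The listings were produced with hdfs dfs -ls commands along the lines of the following (a sketch; a stuff directory holding a file named test is assumed to already exist in the user's HDFS home directory):

$ hdfs dfs -ls          # list the current user's HDFS home directory (first listing)
$ hdfs dfs -ls stuff    # list the stuff directory (second listing)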
Found 3 items
drwx------   - hdfs hdfs          0 2015-05-27 20:00 .Trash
drwx------   - hdfs hdfs          0 2015-05-26 15:43 .staging
drwxr-xr-x   - hdfs hdfs          0 2015-05-28 13:03 DistributedShell

Found 1 items
-rw-r--r--   2 hdfs hdfs      12857 2015-05-29 13:12 stuff/test
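A Unix grep command can serve as a simple stand-in for a mapper (a sketch; the exact command used here is assumed):

$ grep " Kutuzov " war-and-peace.txt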
This command searches for the word Kutuzov (with leading and trailing
spaces) in a text file called war-and-peace.txt.
Each match is reported as a single line of text that contains the search term.
The search term, Kutuzov, is a character in the book.
Though not strictly a MapReduce process, this idea is quite similar to and
much faster than the manual process of counting the instances of Kutuzov in the
printed book.
The analogy can be taken a bit further by using the two simple (and naive) shell scripts shown in
Listings 5.1 and 5.2. We can perform the same operation (much more slowly) and tokenize both the
Kutuzov and Petersburg strings in the text:
Listing 5.1 Simple Mapper Script
#!/bin/bash
# Read lines from stdin; emit a "key,1" pair for each occurrence
# of Kutuzov or Petersburg.
while read line ; do
  for token in $line ; do
    if [ "$token" = "Kutuzov" ] ; then
      echo "Kutuzov,1"
    elif [ "$token" = "Petersburg" ] ; then
      echo "Petersburg,1"
    fi
  done
done
Listing 5.2 Simple Reducer Script
#!/bin/bash
# Read "key,1" pairs from stdin and total the counts for each key.
kcount=0
pcount=0
while read line ; do
  if [ "$line" = "Kutuzov,1" ] ; then
    let kcount=kcount+1
  elif [ "$line" = "Petersburg,1" ] ; then
    let pcount=pcount+1
  fi
done
echo "Kutuzov,$kcount"
echo "Petersburg,$pcount"
The reducer function is then applied to each key–value pair, which in turn
produces a collection of values in the same domain:
Reduce(key2, list(value2)) → list(value3)
Each reducer call typically produces either one value (value3) or an empty
response.
Thus, the MapReduce framework transforms a list of (key, value) pairs into a list
of values.
The functional nature of MapReduce has some important properties:
Data flow is in one direction (map to reduce). It is possible to use the output of
a reduce step as the input to another MapReduce process.
As with functional programming, the input data are not changed. By applying the
mapping and reduction functions to the input data, new data are produced.
Because there is no dependency on how the mapping and reducing functions are
applied to the data, the mapper and reducer data flow can be implemented in any
number of ways to provide better performance.
Distributed (parallel) implementations of MapReduce enable large amounts of data
to be analyzed quickly.
Hadoop accomplishes parallelism by using a distributed file system (HDFS) to
slice and spread data over multiple servers.
Apache Hadoop MapReduce will try to move the mapping tasks to the server
that contains the data slice.
Results from each data slice are then combined in the reducer step.
MapReduce Parallel Data Flow
Parallel execution of MapReduce requires other steps in addition to the mapper and
reducer processes.
The basic steps are as follows:
1. Input Splits. As mentioned, HDFS distributes and replicates data over multiple
servers. The default data chunk or block size is 64MB. Thus, a 500MB file would
be broken into 8 blocks and written to different machines in the cluster. The data
are also replicated on multiple machines (typically three machines). These data
slices are physical boundaries determined by HDFS. The input splits used by
MapReduce are logical boundaries based on the input data. For example, the split
size can be based on the number of records in a file or an actual size in bytes.
Splits are almost always smaller than the HDFS block size. The number of splits
corresponds to the number of mapping processes used in the map stage.
2. Map Step. The mapping process is where the parallel nature of Hadoop comes into
play. For large amounts of data, many mappers can be operating at the same
time. The user provides the specific mapping process. MapReduce will try to
execute the mapper on the machines where the block resides. Because the file is
replicated in HDFS, the least busy node with the data will be chosen. If all
nodes holding the data are too busy, MapReduce will try to pick a node that is
closest to the node that hosts the data block. The last choice is any node in the
cluster that has access to HDFS.
3. Combiner Step. It is possible to provide an optimization or pre-reduction as
part of the map stage where key–value pairs are combined prior to the next stage.
The combiner stage is optional.
4. Shuffle Step. Before the parallel reduction stage can complete, all similar keys
must be combined and counted by the same reducer process. Therefore, results
of the map stage must be collected by key–value pairs and shuffled to the same
reducer process. If only a single reducer process is used, the shuffle stage is not
needed.
5. Reduce Step. The final step is the actual reduction. In this stage, the data reduction
is performed as per the programmer’s design. The reduce step is also optional. The
results are written to HDFS. Each reducer will write an output file. (A rough
command-line analogy of these steps is sketched below.)
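The following is a rough command-line analogy of these steps (a sketch only, not how Hadoop actually executes a job): the mapper and reducer scripts from Listings 5.1 and 5.2 are run over explicit splits of the input, with sort standing in for the shuffle (the optional combiner of step 3 is skipped):

$ split -n l/3 war-and-peace.txt chunk-                              # step 1: three input splits (GNU split)
$ for f in chunk-??; do ./mapper.sh < "$f" > "$f.map" & done; wait   # step 2: run the mappers in parallel
$ sort chunk-*.map > shuffled.txt                                    # step 4: shuffle/sort groups identical keys
$ ./reducer.sh < shuffled.txt                                        # step 5: reduce the collected key,1 pairs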
Figure 5.1 is an example of a simple Hadoop MapReduce data flow for a word
count program.
The map process counts the words in the split, and the reduce process calculates
the total for each word.
The MapReduce data flow shown in Figure 5.1 is the same regardless of the
specific map and reduce tasks.
Figure 5.1 Apache Hadoop parallel MapReduce data flow
The input to the MapReduce application is the following file in HDFS with three
lines of text.
The goal is to count the number of times each word is used.
see spot run
run spot run
see the cat
The first thing MapReduce will do is create the data splits. For simplicity, each
line will be one split.
Since each split will require a map task, there are three mapper processes that
count the number of words in the split.
On a cluster, the results of each map task are written to local disk and not to HDFS.
Next, similar keys need to be collected and sent to a reducer process.
The shuffle step requires data movement and can be expensive in terms of
processing time. Depending on the nature of the application, the amount of data
that must be shuffled throughout the cluster can vary from small to large.
Once the data have been collected and sorted by key, the reduction step can begin
(even if only partial results are available).
It is not necessary—and not normally recommended—to have a reducer for each
key–value pair as shown in Figure 5.1. In some cases, a single reducer will
provide adequate performance; in other cases, multiple reducers may be required
to speed up the reduce phase. The number of reducers is a tunable option for
many applications.
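For example, when the job driver uses Hadoop's generic options (as the bundled example and streaming jobs do via ToolRunner), the number of reducers can be set on the command line; the jar, class, and path names below are placeholders:

$ hadoop jar wordcount.jar WordCount -D mapreduce.job.reduces=4 input output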
The final step is to write the output to HDFS.
As mentioned, a combiner step enables some pre-reduction of the map
output data. For instance, in the previous example, one map produced the
following counts:
(run,1)
(spot,1)
(run,1)
As shown in Figure 5.2, the count for run can be combined into
(run,2) before the shuffle. This optimization can help minimize the
amount of data transfer needed for the shuffle phase.
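The effect of a combiner can be imitated on the command line with a short awk filter that sums the counts per key (a sketch; note that the naive reducer in Listing 5.2 only recognizes lines ending in ",1", so a combiner-aware reducer would need to add the incoming counts rather than increment by one):

$ printf "run,1\nspot,1\nrun,1\n" | awk -F, '{sum[$1]+=$2} END {for (k in sum) print k "," sum[k]}' | sort
run,2
spot,1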
Pig Example Walk-Through
To begin the Pig example, copy the data file (a copy of the /etc/passwd file) into HDFS for Hadoop MapReduce operation:
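A typical form of this copy step (the exact commands are assumed here) is:

$ cp /etc/passwd .                 # make a local working copy
$ hdfs dfs -put passwd passwd      # copy it into the user's HDFS home directory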
You can confirm the file is in HDFS by entering the following command:
hdfs dfs -ls passwd
-rw-r--r-- 2 hdfs hdfs 2526 2015-03-17 11:08 passwd
In the following example of local Pig operation, all processing is done on the
local machine (Hadoop is not used). First, the interactive command line is started:
$ pig -x local
If Pig starts correctly, you will see a grunt> prompt. Next, enter the following
commands to load the passwd file and then grab the user name and dump it to the
terminal. Note that Pig commands must end with a semicolon (;).
grunt> A = load 'passwd' using PigStorage(':');
grunt> B = foreach A generate $0 as id;
grunt> dump B;
The load statement uses a colon (:) as the delimiter because /etc/passwd fields are colon-separated,
so A holds all the lines of the file split into fields. The foreach statement extracts the first
field ($0) of each row as the user name, and dump B requests that the contents of B (just the user
names) be displayed.
The processing will start and a list of user names will be printed to the screen. To exit
the interactive session, enter the command quit.
grunt> quit
To use Hadoop MapReduce, start Pig as follows (or just enter pig):
$ pig -x mapreduce
If you are using the Hortonworks HDP distribution with Apache Tez installed, the Tez engine can
be used as follows:
$ pig -x tez
Pig can also be run from a script. The following script (id.pig) is designed to do
the same things as the interactive version:
/* id.pig */
A = load 'passwd' using PigStorage(':');  -- load the passwd file into A
B = foreach A generate $0 as id;          -- extract the user IDs
dump B;                                   -- print B to the screen
store B into 'id.out';                    -- write the results to a directory named id.out
Comments are delineated by /* */ and -- at the end of a line. First, ensure that the id.out
directory is not in your local directory, and then start Pig with the script on the
command line:
$ /bin/rm -r id.out/
$ pig -x local id.pig
If the script worked correctly, you should see at least one data file with the results and a
zero-length file with the name _SUCCESS. To run the MapReduce version, use the
same procedure; the only difference is that now all reading and writing takes place in
HDFS.
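For example (a sketch consistent with the earlier commands, with the id.out cleanup now done in HDFS):

$ hdfs dfs -rm -r -f id.out
$ pig -x mapreduce id.pig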
Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data
summarization, ad hoc queries, and the analysis of large data sets using a SQL-like
language called HiveQL.
Hive offers the following features:
Tools to enable easy data extraction, transformation, and loading (ETL)
A mechanism to impose structure on a variety of data formats
Access to files stored either directly in HDFS or in other data storage systems
such as HBase
Query execution via MapReduce and Tez (optimized MapReduce)
Hive Example Walk-Through
To start Hive, simply enter the hive command. If Hive starts correctly, you should get a hive>
prompt.
$ hive
(some messages may show up here)
hive>
As a simple test, create and drop a table. Note that Hive commands must end with a
semicolon (;).
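A minimal sketch of such a test (the table name and columns are arbitrary placeholders):

hive> CREATE TABLE pokes (foo INT, bar STRING);
hive> SHOW TABLES;
hive> DROP TABLE pokes;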
A more detailed example can be developed using a web server log file to summarize message
types. First, create a table using the following command:
hive> CREATE TABLE logs(t1 string, t2 string, t3 string, t4 string, t5 string, t6 string, t7 string) ROW
FORMAT DELIMITED FIELDS TERMINATED BY ' ';
OK
Time taken: 0.129 seconds
The CREATE TABLE statement creates a new table in Hive called logs with seven string columns, and the
ROW FORMAT clause tells Hive that fields in the data file are separated by single spaces.
Next, load the data—in this case, from the sample.log file. Note that the file is found in the
local directory and not in HDFS.
hive> LOAD DATA LOCAL INPATH 'sample.log' OVERWRITE INTO TABLE logs;
In this command, LOCAL INPATH 'sample.log' indicates that sample.log is read from the local file
system rather than from HDFS, and OVERWRITE INTO TABLE logs replaces any existing data in the logs
table with the contents of sample.log.
Finally, apply the select step to the file. Note that this invokes a Hadoop MapReduce
operation. The results appear at the end of the output (e.g., totals for the message
types DEBUG, ERROR, and so on).
hive> SELECT t4 AS sev, COUNT(*) AS cnt FROM logs WHERE t4 LIKE '[%' GROUP BY t4;
Output:
Query ID = hdfs_20150327130000_d1e1a265-a5d7-4ed8-b785-2c6569791368
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1427397392757_0001, Tracking URL = http://norbert:8088/proxy/application_1427397392757_0001/
Kill Command = /opt/hadoop-2.6.0/bin/hadoop job -kill job_1427397392757_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2015-03-27 13:00:17,399 Stage-1 map = 0%, reduce = 0%
2015-03-27 13:00:26,100 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.14 sec
2015-03-27 13:00:34,979 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.07 sec
MapReduce Total cumulative CPU time: 4 seconds 70 msec
Ended Job = job_1427397392757_0001
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 4.07 sec HDFS Read: 106384
HDFS Write: 63 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 70 msec
OK
[DEBUG] 434
[ERROR] 3
[FATAL] 1
[INFO] 96
[TRACE] 816
[WARN] 4
Time taken: 32.624 seconds, Fetched: 6 row(s)
In this query, t4 is selected and renamed sev (short for severity); COUNT(*) AS cnt counts how many
rows fall into each group; FROM logs reads the data from the logs table; WHERE t4 LIKE '[%' keeps
only rows whose fourth field begins with a bracket (the bracketed message type); and GROUP BY t4
groups rows with the same t4 value so that each message type is counted separately.
To exit Hive, simply type exit;
hive> exit;