Big Data Cloudera TP
Hadoop Distribution
• Hadoop distribution: Cloudera QuickStart
• Platform: VirtualBox
• System Requirements
– 64-bit host OS and virtualization software that supports a 64-bit guest OS
– RAM for VM: 4 GB
– HDD: 20 GB
Installing Cloudera QuickStart
• Download size: ~5.5 GB
• Download links
– https://www.virtualbox.org/wiki/Downloads
Select the package corresponding to your host system
– https://downloads.cloudera.com/demo_vm/virtualbox/cloudera-quickstart-vm-5.13.0-0-virtualbox.zip
VirtualBox Download
Installing Cloudera QuickStart
• Install VirtualBox
• Unzip Cloudera VM
• Start VirtualBox
• Import Appliance (Virtual Machine)
• Launch Cloudera VM
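The import can also be done from the command line with VBoxManage (the .ovf file name below assumes the default name found inside the downloaded zip):
VBoxManage import cloudera-quickstart-vm-5.13.0-0-virtualbox.ovf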
Start VirtualBox
Import Appliance
Setting Up the VM
Select Bidirectional to share the clipboard
Setting Up the VM
8 GB of RAM is recommended
Setting Up the VM
At least 2 CPUs are recommended
Launch Cloudera VM
Troubleshooting
• The VM does not start, with the error:
AMD-V is disabled in the BIOS (or by the host OS) (VERR_SVM_DISABLED)
– Fix: enable hardware virtualization (AMD-V, or VT-x on Intel hosts) in the host machine's BIOS/UEFI settings
Let’s check if we can run Hadoop
• Open terminal
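A quick sanity check, for example (the exact version string depends on the VM's Hadoop build):
hadoop version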
Download and Save
• Open web browser
Download and Save
• After the page has loaded, save the file
• The default destination is ~/Downloads
Let’s count the words
• Open a terminal and type
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount big.txt out
• It will fail with
InvalidInputException: Input path does not exist:
because the job looks for big.txt in HDFS, not on the local file system
Copy the data into HDFS
• Open the terminal and go to the Downloads directory
cd Downloads/
Copy the data into HDFS
• Copy the file from local file system to HDFS
hadoop fs -copyFromLocal big.txt
Other HDFS Command Options
• List the files in the current directory
hadoop fs -ls
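A few more standard HDFS shell commands, for reference (the directory and file names below are just examples):
• Create a directory in HDFS
hadoop fs -mkdir mydir
• Show the contents of a file
hadoop fs -cat big.txt
• Remove a file, or a directory recursively
hadoop fs -rm big.txt
hadoop fs -rm -r out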
Copy the result to local FS
• The output is stored in the directory out in HDFS
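For example, to bring the result back and inspect it (the part-r-00000 file name assumes the job ran with the default single reducer):
hadoop fs -copyToLocal out
less out/part-r-00000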
Prepare Compiling Environment
• Most of the environment parameters are already set in Cloudera QuickStart; to check, type:
printenv
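If the compile step on the next slide complains that com.sun.tools.javac.Main cannot be found, the usual fix from the standard Hadoop tutorial is to point HADOOP_CLASSPATH at the JDK's tools.jar (the JDK path below is an example; adjust it to the JDK installed in the VM):
export JAVA_HOME=/usr/java/default
export HADOOP_CLASSPATH=$JAVA_HOME/lib/tools.jar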
Compiling Word Count
• To compile:
hadoop com.sun.tools.javac.Main WordCount.java
• The result will be multiple class files
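Before running the job as on the next slide, package the class files into wc.jar (this is the packaging step from the standard Hadoop tutorial):
jar cf wc.jar WordCount*.class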
Running Word Count
• Counting words in the big.txt file
hadoop jar wc.jar WordCount big.txt out2
• You should get the same result as in the previous example
• The result is stored in the out2 directory
• Let's copy it to the local file system
hadoop fs -copyToLocal out2
Hadoop Jobs
• A Hadoop MapReduce run is organized as a job
• A job consists of tasks
– Map tasks
– Reduce tasks
– Tasks are scheduled by YARN
– If a task fails, it is automatically re-scheduled on another node
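To see where these pieces live in code, here is the WordCount example from the standard Apache Hadoop MapReduce tutorial (the WordCount.java compiled earlier is assumed to be essentially this code): the mapper and reducer classes become the map and reduce tasks, and main() configures and submits the job.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map task: reads one input split and emits (word, 1) for every word it sees.
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce task: receives all the counts for a given word and sums them.
  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Job driver: YARN schedules the map and reduce tasks defined above.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. big.txt in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. out2
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}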
Input Splits
• MapReduce separates the input data into smaller chunks, or splits, which are fed into map tasks (and their results later into reduce tasks)
• Splits allow the tasks to be distributed among nodes
• The best size for a split is the size of an HDFS block; see the illustration below
– Too small: too much scheduling overhead
– Too large: one split spans several blocks, possibly on different nodes
• Hadoop tries to assign each map task to the node where its data already resides
– locality optimization
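As a rough illustration (the numbers are hypothetical): with the default 128 MB block size, a 600 MB input file is divided into about 5 splits and therefore about 5 map tasks. The block size configured on the VM can be checked with:
hdfs getconf -confKey dfs.blocksize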
Distributed and Combining Tasks
• A job is split into tasks, and the tasks are distributed to map nodes
– Tasks are processed in parallel
• When the map tasks are done, their results are sent to the reducer(s)
– There can be more than one reducer
– There can also be zero reducers if the work is simple enough to be done entirely in the map tasks
• If there is more than one reducer, the map tasks must partition their outputs (see the sketch below)
– Partition (divide) the outputs by key
– Send different keys to different reducers
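A minimal sketch of how keys are routed to reducers, assuming the Text/IntWritable types of the WordCount example above. It mirrors what Hadoop's default HashPartitioner already does; a custom class such as this hypothetical WordPartitioner is only needed when the default routing is not suitable.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sketch: decides which reduce task receives a given (word, count) pair.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    // The same key always goes to the same reducer; keys are spread by hash.
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}

In the driver this would be enabled with job.setNumReduceTasks(2) and job.setPartitionerClass(WordPartitioner.class).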