Big Data Cloudera TP
Hadoop Distribution
• Hadoop distribution: Cloudera QuickStart
• Platform: VirtualBox
• System Requirements
– 64-bit host OS and virtualization software that supports a 64-bit guest OS
– RAM for VM: 4 GB
– HDD: 20 GB
Installing Cloudera QuickStart
• Download size: ~5.5 GB
• Download links
– https://www.virtualbox.org/wiki/Downloads
Select the package corresponding to your host system
– https://downloads.cloudera.com/demo_vm/virtualbox/cloudera-quickstart-vm-5.13.0-0-virtualbox.zip
VirtualBox Download
Installing Cloudera QuickStart
• Install VirtualBox
• Unzip Cloudera VM
• Start VirtualBox
• Import Appliance (Virtual Machine)
• Launch Cloudera VM
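The import can also be done from the command line with VBoxManage (the .ovf file name below assumes the default name found inside the downloaded zip):
VBoxManage import cloudera-quickstart-vm-5.13.0-0-virtualbox.ovf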
Start VirtualBox
Import Appliance
Setting Up the VM
Select Bidirectional to share the clipboard
Setting Up the VM
8 GB of RAM is recommended
Setting Up the VM
At least 2 CPUs are recommended
Launch Cloudera VM
Troubleshooting
• The VM does not start, with the error:
AMD-V is disabled in the BIOS (or by the host OS) (VERR_SVM_DISABLED)
– Fix: enable hardware virtualization (AMD-V, or VT-x on Intel hosts) in the host machine's BIOS/UEFI settings
Let’s check if we can run Hadoop
• Open terminal
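A quick sanity check, for example (the exact version string depends on the VM's Hadoop build):
hadoop version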
Download and Save
• Open web browser
Download and Save
• After the page has loaded, save the file
• The default destination is ~/Downloads
Let’s count the words
• Open a terminal and type
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount big.txt out
• It will fail with
InvalidInputException: Input path does not exist:
because the job looks for big.txt in HDFS, not on the local file system
Copy the data into HDFS
• Open the terminal and go to the Downloads directory
cd Downloads/
Copy the data into HDFS
• Copy the file from local file system to HDFS
hadoop fs -copyFromLocal big.txt
Other HDFS Command Options
• List the files in the current directory
hadoop fs -ls
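A few more standard HDFS shell commands, for reference (the directory and file names below are just examples):
• Create a directory in HDFS
hadoop fs -mkdir mydir
• Show the contents of a file
hadoop fs -cat big.txt
• Remove a file, or a directory recursively
hadoop fs -rm big.txt
hadoop fs -rm -r out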
Copy the result to local FS
• The output is stored in the directory out in HDFS
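For example, to bring the result back and inspect it (the part-r-00000 file name assumes the job ran with the default single reducer):
hadoop fs -copyToLocal out
less out/part-r-00000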
Prepare Compiling Environment
• Most of the environment parameters are already set in Cloudera QuickStart; to check, type:
printenv
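If the compile step on the next slide complains that com.sun.tools.javac.Main cannot be found, the usual fix from the standard Hadoop tutorial is to point HADOOP_CLASSPATH at the JDK's tools.jar (the JDK path below is an example; adjust it to the JDK installed in the VM):
export JAVA_HOME=/usr/java/default
export HADOOP_CLASSPATH=$JAVA_HOME/lib/tools.jar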
Compiling Word Count
• To compile:
hadoop com.sun.tools.javac.Main WordCount.java
• The result will be multiple class files
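Before running the job as on the next slide, package the class files into wc.jar (this is the packaging step from the standard Hadoop tutorial):
jar cf wc.jar WordCount*.class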
Running Word Count
• Counting words in the big.txt file
hadoop jar wc.jar WordCount big.txt out2
• You should get the same result as in the previous example
• The result is stored in the out2 directory
• Let's copy it to the local file system
hadoop fs -copyToLocal out2
Hadoop Jobs
• A Hadoop MapReduce run is organized as a job
• A job consists of tasks
– Map tasks
– Reduce tasks
– Tasks are scheduled by YARN
– If a task fails, it is automatically re-scheduled on another node
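To see where these pieces live in code, here is the WordCount example from the standard Apache Hadoop MapReduce tutorial (the WordCount.java compiled earlier is assumed to be essentially this code): the mapper and reducer classes become the map and reduce tasks, and main() configures and submits the job.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map task: reads one input split and emits (word, 1) for every word it sees.
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce task: receives all the counts for a given word and sums them.
  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Job driver: YARN schedules the map and reduce tasks defined above.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. big.txt in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. out2
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}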
Input Splits
• MapReduce separates the input data into smaller chunks, or splits, which are fed into map tasks (and their results later into reduce tasks)
• Splits allow the tasks to be distributed among nodes
• The best size for a split is the size of an HDFS block; see the illustration below
– Too small: too much scheduling overhead
– Too large: one split spans several blocks, possibly on different nodes
• Hadoop tries to assign each map task to the node where its data already resides
– locality optimization
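As a rough illustration (the numbers are hypothetical): with the default 128 MB block size, a 600 MB input file is divided into about 5 splits and therefore about 5 map tasks. The block size configured on the VM can be checked with:
hdfs getconf -confKey dfs.blocksize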
Distributed and Combining Tasks
• A job is split into tasks, and the tasks are distributed to map nodes
– Tasks are processed in parallel
• When the map tasks are done, their results are sent to the reducer(s)
– There can be more than one reducer
– There can also be zero reducers if the work is simple enough to be done entirely in the map tasks
• If there is more than one reducer, the map tasks must partition their outputs (see the sketch below)
– Partition (divide) the outputs by key
– Send different keys to different reducers
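A minimal sketch of how keys are routed to reducers, assuming the Text/IntWritable types of the WordCount example above. It mirrors what Hadoop's default HashPartitioner already does; a custom class such as this hypothetical WordPartitioner is only needed when the default routing is not suitable.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sketch: decides which reduce task receives a given (word, count) pair.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    // The same key always goes to the same reducer; keys are spread by hash.
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}

In the driver this would be enabled with job.setNumReduceTasks(2) and job.setPartitionerClass(WordPartitioner.class).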