
DSCI 5350 – Big Data Analytics

Lecture 2 - Introduction to Hadoop

Kashif Saeed
Lecture Outline

• Big Data Deployment Scenarios
• History: Distributed Systems
• What is Hadoop
• Hadoop Architecture

Why Do Organizations Need Big Data?

• Handling the volume and variety of data
  - IoT, streaming data, weblogs, sensor/machine data
  - Unstructured data: PDFs, voice, text, email, social media
• Cheap storage
  - Cost per TB in Hadoop is significantly lower than in traditional databases
• Need for advanced analytics
  - More business decisions are data-driven
  - More data is needed to run sophisticated and accurate models

Big Data != Hadoop
  - Big Data is a concept; Hadoop is one implementation of that concept

Big Data Deployment Scenarios

• On-premises scenarios
  - Hadoop with MapReduce as the processing engine – already legacy
  - Hadoop with Spark as the processing engine
  - Spark with NoSQL (or other data stores)
• Cloud scenarios
  - Persistent cluster on the cloud
  - Non-persistent cluster on the cloud
  - Managed cluster on the cloud

Big Data Cloud Deployment – Concepts

Non-Persistent Cluster
  - Pay as you go
  - No investment in hardware, setup, or maintenance
  - Set up the cluster, process the data, and terminate the cluster
  - The entire cycle – setup, processing, and termination – can be automated (see the sketch below)
  - Since you terminate the cluster, you should not store the results on the cluster itself; AWS uses S3 for storage, GCP uses Cloud Storage

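As a concrete sketch of the non-persistent pattern on GCP (the cluster name, region, bucket, and script below are hypothetical placeholders), the whole setup-process-terminate cycle can be scripted with the gcloud CLI:

    # create a short-lived Dataproc cluster
    gcloud dataproc clusters create demo-cluster --region=us-central1

    # run the actual work; input and output live in Cloud Storage, not on the cluster
    gcloud dataproc jobs submit pyspark gs://my-bucket/jobs/etl.py \
        --cluster=demo-cluster --region=us-central1

    # terminate the cluster once the job finishes
    gcloud dataproc clusters delete demo-cluster --region=us-central1 --quiet
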
What Will Be Covered in This Class

• We will use the Cloudera VM to learn and practice HDFS, Hadoop ecosystem tools (Hive, Sqoop), and Spark
  - The learning applies to both on-premises and cloud deployments
  - Spark coding and practice
• Spark standalone deployment overview (can be used for practicing Spark)
• Overview of AWS Cloud
• Overview of GCP (with hands-on)
• GCP Big Data and ML (with hands-on)

The Origin of Hadoop

Distributed Systems

• A distributed system is a cluster of multiple machines working together at the software level to store and process data
• Multiple machines = more processing power
• The software enables recovery from failures and exceptions, and handles storage and the reads/writes to/from storage
• Traditional distributed systems work with central storage
  - All machines read from and write to the same storage

Challenges with Distributed Systems

• Programming complexity
  - The software needs to be sophisticated enough to handle everything that can go wrong (exception and error handling)
• Centralized data storage
  - With hundreds of computers reading from and writing to the same storage, the read/write speed of that storage becomes the bottleneck
• Data processing and storage at different layers
  - With data centralized and processing localized, you have to bring the data to the processing engine by copying it
  - Replicating/moving data to the processing engine is not feasible when dealing with petabytes or exabytes of data

Figure 1: Distributed System
Google’s Solution for Distributed Systems

• GFS (Google File System) modified the distributed-system design to address the storage bottleneck by decentralizing data storage
  - Distributed storage – data is processed where it is stored
  - GFS can store very large volumes of data because data does not have to be moved to a central processing engine
• Distributed processing of data – MapReduce
• Hadoop is based on the solutions Google developed internally and published in 2003-2004 (the GFS and MapReduce papers)
• Doug Cutting was working on an open-source project called Nutch and used Google's publications to solve the same problems; that work eventually became Hadoop

What is Hadoop

What is Hadoop

• Hadoop is infrastructure software for processing, storing, and analyzing large volumes of data
  - The software handles the distribution of data, failure recovery, etc., while the hardware provides the storage and processing power
• Hadoop is distributed – a Hadoop cluster consists of several machines
• Hadoop is scalable – you can add more machines to the cluster, which proportionally adds capacity
• Hadoop is fault-tolerant – it can recover from hardware failures
  - The master reassigns work
  - Data is replicated on 3 machines by default
  - Nodes that recover rejoin the cluster automatically
• Hadoop is open source
  - Overseen by the Apache Software Foundation
  - Close to 100 committers from companies like Cloudera, Hortonworks, Facebook, LinkedIn, Yahoo, and Google contribute to the ecosystem

Differences between RDBMS & Hadoop

• RDBMS is good for point queries and updates, where the dataset has been indexed to deliver low-latency retrieval
• Hadoop suits applications where the data is written once and read many times, whereas a relational database is good for datasets that are continually updated

                Traditional RDBMS            Hadoop
  Data size     Gigabytes/Terabytes          Petabytes+
  Access        Interactive and batch        Batch
  Updates       Read and write many times    Write once, read many times
  Structure     Static schema                Dynamic schema
  Scaling       Vertical                     Horizontal

Hadoop Distributions

• Hadoop is open source and can be downloaded from Apache's website
• What is a distribution?
  - A distribution is an enterprise-ready version of an open-source software package
  - It is tested thoroughly, supported, and integrates well with other software
  - Most companies prefer distributions over the raw open-source releases
• Cloudera, Hortonworks, and MapR are the most widely used enterprise-ready Hadoop distributions
  - We will use Cloudera's distribution for parts of this class
  - Cloudera and Hortonworks merged in 2018 – see the link below

https://www.cloudera.com/about/news-and-blogs/press-releases/2018-10-03-cloudera-and-hortonworks-announce-merger-to-create-worlds-leading-next-generation-data-platform-and-deliver-industrys-first-enterprise-data-cloud.html

Hadoop Architecture

Hadoop Cluster Terminologies

• Cluster – a group of computers working together
• Node – an individual computer in the cluster
  - There are two kinds of nodes: master nodes (NameNodes) and worker nodes (DataNodes)
• Master node – a node that manages the distribution of work across worker nodes and the distribution/retrieval of data to/from worker nodes
• Daemon – a program running on a node
• Commodity hardware – a term for affordable hardware that generally lacks bells and whistles like RAID or hot-swappable CPUs
  - Applies mainly to data nodes; name nodes should be highly reliable

Master Node (Name Node)

• The master node manages the work
  - It coordinates the work and the data storage on the cluster
  - Daemons on master nodes manage the entire cluster
  - A failed daemon on a master node may make the entire cluster unavailable
• Master nodes are configured for high availability in an active-passive mode
• Master nodes use carrier-class hardware
  - Dual power supplies and dual Ethernet cards
  - Hard drives use RAID (redundant array of inexpensive disks) to protect against data loss
• Reasonable RAM and CPUs are required
  - 64 GB for 20 nodes or fewer
  - 96 GB for up to 300 nodes

Worker Nodes (Data Nodes)

• Worker nodes perform the work
  - They can be scaled horizontally
  - Daemons on worker nodes handle the data processing on that node
  - A failed worker node does not bring the cluster down, thanks to data replication and high availability
• Worker node hardware
  - Midrange CPUs are okay
  - The more RAM, the better
    - Memory-intensive processing frameworks and tools (e.g., Spark and Impala) are increasingly used
    - HDFS caching can take advantage of extra RAM
    - 512 GB of RAM per node or better is not uncommon
  - More disk is needed
    - By default, HDFS data is replicated 3 times
    - 20-30% of cluster capacity is needed for temporary raw storage
    - 4x your data storage need is a good number for estimation (see the worked example below)
    - A good practical maximum is 36 TB per worker node (12 x 3 TB drives)
    - 7200 RPM SATA/SATA II drives are fine; no need to buy 15000 RPM drives

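A quick worked example of the 4x estimation rule (the 100 TB figure is hypothetical): to hold 100 TB of raw data with 3x replication you need 300 TB, and reserving 20-30% of total capacity for temporary storage pushes that to roughly 375-430 TB – about 4x the raw data. At the practical maximum of 36 TB per worker node, that works out to roughly 11-12 worker nodes for storage alone.
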
NameNode High Availability

• In Hadoop versions prior to 2.0, the NameNode was a single point of failure
• Hadoop 2.0 introduced two additional configurations: High Availability (HA) and Federated NameNodes (FNN)
• HA allows using two NNs in an active-passive relationship (see the command sketch below). The two NNs rely on a shared information space using either shared storage (NFS) or the Quorum Journal Manager
• FNN introduced the concept of a namespace composed of a NN and its dependent data nodes; for example, NN1 manages DN1-3, NN2 manages DN4-6, and NN3 manages DN7-9. HA and FNN are not incompatible, so it is possible to have, for example, three active NNs and three standby NNs managing three namespaces
• Hadoop version 3 supports more than one standby NN in an HA configuration

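In an HA deployment you can ask each NameNode for its current state from the command line; a small sketch (the service IDs nn1 and nn2 are whatever names the cluster's configuration defines, assumed here):

    # query the state of each configured NameNode (IDs are cluster-specific)
    hdfs haadmin -getServiceState nn1    # prints "active" or "standby"
    hdfs haadmin -getServiceState nn2
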
Hadoop Core Components

• HDFS for storage
• Processing frameworks
  - MapReduce
  - Spark
  - Impala (on Cloudera distributions only)
• Resource manager
  - YARN

Hadoop Distributed File System (HDFS)

• HDFS (the storage layer for Hadoop) is a filesystem written in Java that sits on top of the native filesystem
  - The directory structure in HDFS has nothing to do with the directory structure of the local filesystem
• It can store any type of data, including text files, PDFs, images, video, etc.
• Files in HDFS are immutable
  - New data can be appended at the end of a file, but existing data cannot be modified (see the sketch below)

[Figure omitted. Image source: Cloudera Academic Alliance]

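A small illustration of immutability from the command line (all paths are hypothetical): appending is allowed, editing in place is not:

    # append local records to an existing HDFS file (allowed)
    hdfs dfs -appendToFile new_records.log /user/training/logs/app.log

    # there is no edit-in-place; to change content you replace the file
    hdfs dfs -rm /user/training/logs/app.log
    hdfs dfs -put fixed.log /user/training/logs/app.log
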
Hadoop Distributed File System (HDFS)

• HDFS works better with a few very large files than with many small files
  - Fewer files mean less time spent seeking to data locations
  - Seeks are expensive operations, needed only when you analyze a subset of the data
  - Large files minimize seeks
  - HDFS is best suited for sequential access and streaming rather than random access, because sequential access requires fewer seeks
• Files in HDFS can be hundreds of MB or even GB in size
• When importing data into Hadoop, it is better to combine the data and ingest one large file than to ingest 50 smaller files

HDFS - Blocks

• HDFS uses blocks to store a file or part of a file
  - The default block size is 64 MB in older Hadoop versions (128 MB in Hadoop 2.x and later); 128 MB is the recommended size
  - Compare this to 4 KB in a UNIX filesystem
  - One HDFS block comprises many OS blocks
• Blocks only use the space needed to store a file
  - E.g., a 280 MB file uses 2 x 128 MB blocks plus 24 MB of a 3rd block, rather than consuming the whole 3rd block
• Blocks are replicated to multiple nodes, which makes the system fault-tolerant if a node fails
  - The replication factor is configurable and determines how many copies of each block are kept (see the command sketch below)
  - The default is 3 copies of each block, on different nodes

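A sketch of inspecting blocks and replication from the command line (the path is hypothetical):

    # show a file's blocks, their sizes, and which DataNodes hold the replicas
    hdfs fsck /user/training/weblogs/big_file.log -files -blocks -locations

    # change the replication factor for one file; -w waits until it takes effect
    hdfs dfs -setrep -w 2 /user/training/weblogs/big_file.log
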
[Figure omitted. Image source: Cloudera Academic Alliance]

Processing Framework: MapReduce

• MapReduce was originally the only processing framework for Hadoop
• It is still available as a processing engine, though most clients already consider it legacy
• MapReduce is:
  - Platform- and language-independent
  - Record-oriented data processing (key-value pairs)
  - Able to distribute tasks across multiple nodes
  - Typically written in Java (see the runnable example below)

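To get a feel for MapReduce without writing Java, the examples jar that ships with Hadoop can run a word count from the shell (the jar path is typical of CDH installs but varies by distribution; input/output paths are hypothetical):

    # run the bundled WordCount job; the output directory must not already exist
    hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
        wordcount /user/training/input /user/training/wc_output

    # reducer output lands in part-r-* files
    hdfs dfs -cat /user/training/wc_output/part-r-00000
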
Processing Framework: Others

• Multiple frameworks may exist on a Hadoop cluster
  - MapReduce
  - Spark
  - Impala
• Each framework is designed to consume all of the cluster's resources
  - Different frameworks have no knowledge of the other frameworks active on the cluster
  - Different frameworks compete for resources
• YARN (Yet Another Resource Negotiator) was developed to manage resources across frameworks
  - It allocates resources to the different frameworks based on system settings and demand (see the command sketch below)

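A quick sketch of watching YARN from the shell (the application ID is a made-up placeholder):

    # list applications currently running under YARN, with queue and progress
    yarn application -list

    # show the status of one application
    yarn application -status application_1526900291234_0001
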
[Figures omitted. Image source: Cloudera Academic Alliance]

Getting Data In and Out of Hadoop

Storing and Retrieving Files

[Figures omitted. Image source: Cloudera Academic Alliance]

Note: the data itself is NEVER retrieved via the NameNode – clients read file blocks directly from the DataNodes; the NameNode only supplies block locations.

Accessing Hadoop System

• From the command line
  - hdfs dfs (examples below)
• Using Spark
  - HDFS is accessed via URL (hdfs://...)
• Using ecosystem projects
  - Flume – data from websites, system logs
  - Sqoop – transfers between HDFS and RDBMSs
  - Hue – web interface for analyzing data
• Using BI tools

[Figure omitted. Image source: Cloudera Academic Alliance]

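A few common hdfs dfs commands as a sketch (paths and filenames are hypothetical); they deliberately mirror the familiar UNIX file commands:

    hdfs dfs -ls /user/training                               # list a directory
    hdfs dfs -mkdir /user/training/weblogs                    # create a directory
    hdfs dfs -put access.log /user/training/weblogs/          # copy a local file into HDFS
    hdfs dfs -cat /user/training/weblogs/access.log | head    # peek at the contents
    hdfs dfs -get /user/training/weblogs/access.log copy.log  # copy back to local disk
    hdfs dfs -rm -r /user/training/weblogs                    # remove recursively
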
The Hadoop Ecosystem
and additional tools integrated in CDH

Data Storage: Apache HBase

• Apache HBase – the Hadoop NoSQL database (see the shell sketch below)
  - A NoSQL distributed database built on HDFS
  - Designed for applications requiring fast, random access to large volumes of data
  - Scales to support very large amounts of data and high throughput
  - Can handle tables with millions of columns and billions of rows
  - Modeled after Google's BigTable

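A minimal sketch of HBase's shell driven from the command line (the table, column family, and values are hypothetical):

    # create a table with one column family, insert a cell, then read the row back
    echo "create 'users', 'info'"                    | hbase shell
    echo "put 'users', 'row1', 'info:name', 'Ana'"   | hbase shell
    echo "get 'users', 'row1'"                       | hbase shell
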
Apache Sqoop

• High-speed import from an RDBMS to HDFS (and vice versa)
• Supports many databases, such as Netezza, MongoDB, MySQL, Teradata, and Oracle
• Covered later in this course (see the sketch below)

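As a sketch, a typical Sqoop import looks like this (host, database, credentials, and table are hypothetical placeholders):

    # pull one table from MySQL into HDFS using 2 parallel map tasks
    sqoop import \
        --connect jdbc:mysql://dbhost/loudacre \
        --username training --password training \
        --table accounts \
        --target-dir /user/training/accounts \
        --num-mappers 2
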
Streaming Systems: Flume, Kafka, Flink, and Others
• Apache Flume
  - A distributed service for ingesting streaming data
  - Ideally suited for event data from multiple systems – for example, log files
  - Covered later in this course
• Apache Kafka
  - A high-throughput, scalable messaging system
  - A distributed, reliable publish-subscribe system
  - Integrates with Flume and Spark Streaming
  - Use cases: large-scale messaging, log aggregation, customer activity tracking, etc. (see the console sketch below)

[Figure omitted. Image source: Cloudera Academic Alliance]

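A sketch of Kafka's console tools (the broker address and topic are hypothetical; on some installs the scripts carry a .sh suffix, and older versions create topics via --zookeeper rather than --bootstrap-server):

    # create a topic, type messages into it, and read them back
    kafka-topics --create --topic weblogs --partitions 1 --replication-factor 1 \
        --bootstrap-server localhost:9092
    kafka-console-producer --broker-list localhost:9092 --topic weblogs
    kafka-console-consumer --bootstrap-server localhost:9092 --topic weblogs --from-beginning
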
Data Processing: Apache Spark

• Spark is a large-scale data processing engine
  - It can load datasets into the memory of the data nodes
  - Spark provides a high-level programming API that lets programmers focus on logic as opposed to plumbing
  - Spark programs work with cluster resource management frameworks/tools
  - Runs on the cluster against data stored in HDFS
• Supports a wide range of workloads
  - Machine learning
  - Business intelligence
  - Streaming
  - Batch processing
• Will be covered in detail later in the course (a quick shell sketch follows)

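On the VM, Spark is usually reached from the shell in one of two ways (the script name and HDFS path are hypothetical):

    # interactive exploration: pyspark for the Python API, spark-shell for Scala
    pyspark

    # run a standalone application on the cluster under YARN
    spark-submit --master yarn wordcount.py hdfs:///user/training/weblogs
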
Data Processing: MapReduce

• MapReduce is the original processing framework in Hadoop
  - Java-based
  - It was the core processing engine before Spark was introduced
  - Many existing tools are built on MapReduce; however, it is not used much in industry anymore

Data Processing: Apache Pig

• Apache Pig is a scripting platform built on Hadoop for high-level data processing
  - An alternative to writing MapReduce code
  - Especially good for joining and transforming data
• The Pig interpreter runs on the client machine
  - It turns Pig Latin scripts into MapReduce or Spark jobs (see the sketch below)

[Figure omitted. Image source: Cloudera Academic Alliance]

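A minimal, hypothetical Pig Latin snippet run from the shell (the input path, field layout, and filter are made up for illustration):

    # keep only server-error records from a tab-delimited web log
    pig -e "
      logs = LOAD '/user/training/weblogs' USING PigStorage('\t')
             AS (ip:chararray, url:chararray, status:int);
      errs = FILTER logs BY status >= 500;
      STORE errs INTO '/user/training/error_logs';
    "
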
High Performance SQL: Cloudera Impala

• Impala is a high-performance SQL engine
  - Developed by Cloudera
  - Open source, released under the Apache license
  - Inspired by Google's Dremel project
  - Used for interactive analysis (see the sketch below)
• Impala supports SQL (Impala SQL)
  - Data is stored in HDFS as database tables

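A sketch of an interactive query through the impala-shell CLI (the table is hypothetical):

    # run one query and exit; add -i <host> if the impalad is not local
    impala-shell -q "SELECT status, COUNT(*) FROM weblogs GROUP BY status"
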
SQL on MapReduce: Apache Hive

• Hive is an abstraction layer on top of Hadoop
  - It uses a SQL-like language called HiveQL, similar to Impala SQL
  - Useful for data processing and ETL
• Hive executes queries using MapReduce (see the sketch below)
  - Hive on Spark is available as well (Power Hive)

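A sketch of running HiveQL from the shell (the table and columns are hypothetical; beeline is the newer client, but the classic hive CLI on the VM works the same way):

    # Hive compiles this into MapReduce (or Spark) jobs behind the scenes
    hive -e "SELECT url, COUNT(*) AS hits FROM weblogs GROUP BY url ORDER BY hits DESC LIMIT 10"
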
User Interface: Hue

• Hue = Hadoop User Experience
• Provides a web front-end to Hadoop
  - Upload and browse data
  - Query tables in Impala and Hive
  - Run Spark and Pig jobs
  - Search
• Created by Cloudera; now a 100% open-source project released under the Apache license

Workflow Management: Apache Oozie

• Oozie is the workflow engine for Hadoop
  - Defines dependencies between jobs
  - Provides an ETL-like interface for workflow management
  - The Oozie server submits the jobs to the cluster in the correct sequence (see the CLI sketch below)

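A sketch of submitting a predefined workflow with the Oozie CLI (the server URL, properties file, and job ID are hypothetical; the workflow itself is an XML definition stored in HDFS):

    # submit and start a workflow whose parameters live in job.properties
    oozie job -oozie http://localhost:11000/oozie -config job.properties -run

    # check on it later, using the job ID printed by the previous command
    oozie job -oozie http://localhost:11000/oozie -info 0000001-200101000000000-oozie-oozi-W
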
Apache Incubator Projects

• The Apache Incubator contains Apache projects that are still under development
  - It contains both Hadoop and non-Hadoop projects
  - Graduates from the Incubator are added to the corresponding stack
• http://incubator.apache.org/projects/#current

Cloudera Labs

• The class will use a VM provided by Cloudera
• The class will use labs provided by the Cloudera Academic Alliance
• All labs refer to a fictitious company – Loudacre Mobile
• All files needed for the labs are available on the VM

Cloudera Labs - continued

• Your virtual machine
  - Log in as user training (password: training)
  - Pre-installed and configured with CDH and Spark

• Training material
  - ~/training_materials/dev1 folder on the VM

• Course data
  - ~/training_materials/data