02 HadoopIntroEcosystem

Uploaded by

balakrishna bobbili

Introduction to Hadoop and the Hadoop Ecosystem
Chapter 2

201509

Course Chapters

Course Introduction
1 Introduction

Introduction to Hadoop
2 Introduction to Hadoop and the Hadoop Ecosystem
3 Hadoop Architecture and HDFS

Importing and Modeling Structured Data
4 Importing Relational Data with Apache Sqoop
5 Introduction to Impala and Hive
6 Modeling and Managing Data with Impala and Hive
7 Data Formats
8 Data File Partitioning

Ingesting Streaming Data
9 Capturing Data with Apache Flume

Distributed Data Processing with Spark
10 Spark Basics
11 Working with RDDs in Spark
12 Aggregating Data with Pair RDDs
13 Writing and Deploying Spark Applications
14 Parallel Processing in Spark
15 Spark RDD Persistence
16 Common Patterns in Spark Data Processing
17 Spark SQL and DataFrames

Course Conclusion
18 Conclusion

© Copyright 2010-2015 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera.

Introduction to Hadoop and the Hadoop Ecosystem

In this chapter you will learn

§ What Hadoop is and how it addresses big data challenges
§ The guiding principles behind Hadoop
§ The major components of the Hadoop Ecosystem
§ The tools you will be using in the Homework Labs

Chapter Topics

Introduction to Hadoop and the Hadoop Ecosystem
Introduction to Hadoop

§ Problems with Traditional Large-scale Systems
§ Hadoop!
§ Data Storage and Ingest
§ Data Processing
§ Data Analysis and Exploration
§ Other Ecosystem Tools
§ Introduction to Homework Labs
§ Conclusion

Traditional Large-Scale Computation

§ Traditionally, computation has been processor-bound
– Relatively small amounts of data
– Lots of complex processing
§ The early solution: bigger computers
– Faster processor, more memory
– But even this couldn’t keep up

Distributed Systems

§ The better solution: more computers
– Distributed systems – use multiple machines for a single job

“In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log, they didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers, but for more systems of computers.”
– Grace Hopper

Challenges with Distributed Systems

§ Challenges with distributed systems
– Programming complexity
– Keeping data and processes in sync
– Finite bandwidth
– Partial failures
§ The solution?
– Hadoop!

What is Apache Hadoop?

§ Scalable and economical data storage, processing and analysis
– Distributed and fault-tolerant
– Harnesses the power of industry standard hardware
§ Heavily inspired by technical documents published by Google

[Figure: Hadoop platform layers. Applications: Batch Processing, Search Engine, Analytic SQL, Machine Learning, Stream Processing, Other Applications. Workload Management. Data Storage: Filesystem, Online NoSQL. Data Integration.]

Common Hadoop Use Cases

§ Extract/Transform/Load (ETL)
§ Text mining
§ Index building
§ Graph creation and analysis
§ Pattern recognition
§ Collaborative filtering
§ Prediction models
§ Sentiment analysis
§ Risk assessment

§ What do these workloads have in common? Nature of the data…
– Volume
– Velocity
– Variety

Distributed Systems: The Data Bottleneck (1)

§ Traditionally, data is stored in a central location
§ Data is copied to processors at runtime
§ Fine for limited amounts of data

Distributed Systems: The Data Bottleneck (2)

§ Modern systems have much more data
– terabytes+ a day
– petabytes+ total
§ We need a new approach…

Big Data Processing with Hadoop

§ Hadoop introduced a radical new approach:
– Bring the program to the data rather than the data to the program
§ Based on two key concepts
– Distribute data when the data is stored
– Run computation where the data resides

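The second concept, running computation where the data resides, can be pictured with a toy scheduler. The sketch below is only an illustration in plain Python, not Hadoop's scheduler: the function `assign_tasks`, the node names, and the slot counts are all hypothetical. It gives each block's task to a node that already stores the block when one has capacity, and falls back to a remote node otherwise.

```python
def assign_tasks(block_locations, free_slots):
    """Assign one task per block, preferring a node that stores the block.

    block_locations maps block id -> list of nodes holding a copy.
    free_slots maps node -> number of tasks it can still accept.
    Returns block id -> (node, is_local).
    """
    slots = dict(free_slots)  # work on a copy, leave the caller's dict alone
    assignment = {}
    for block_id in sorted(block_locations):
        # First choice: a node that already holds the block and has capacity.
        local = [n for n in block_locations[block_id] if slots.get(n, 0) > 0]
        if local:
            node, is_local = local[0], True
        else:
            # Fallback: any node with capacity; the block must be copied to it.
            remote = sorted(n for n, free in slots.items() if free > 0)
            if not remote:
                raise RuntimeError("no free task slots left")
            node, is_local = remote[0], False
        slots[node] -= 1
        assignment[block_id] = (node, is_local)
    return assignment


# Hypothetical three-node cluster; every block has copies on two nodes.
blocks = {"b1": ["node1", "node2"], "b2": ["node2", "node3"], "b3": ["node1", "node3"]}
plan = assign_tasks(blocks, {"node1": 1, "node2": 1, "node3": 1})
for block_id, (node, is_local) in plan.items():
    print(block_id, node, "local" if is_local else "remote")
```

With one free slot per node, every task in this example lands on a node that already holds its block, so no data needs to cross the network before processing starts.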
Core Hadoop

§ Processing
– Spark
– MapReduce
§ Resource Management
– YARN
§ Storage
– HDFS

Big Data Processing

[Figure: the four stages (1. Ingest, 2. Process, 3. Analyze, 4. Access) across Data Sources; Data Storage (Hadoop Distributed File System (HDFS), HBase); Data Processing (Spark, Hadoop MapReduce, Pig); Data Analysis and Exploration (Impala, Hive, Search).]

Data Ingest and Storage

§ Hadoop typically ingests data from many sources and in many formats
– Traditional data management systems, e.g. databases
– Logs and other machine generated data (event data)
– Imported files

[Figure: 1. Ingest. Data Sources feed Data Storage: Hadoop Distributed File System (HDFS) and HBase.]

Data Storage

§ Hadoop Distributed File System (HDFS)
– HDFS is the storage layer for Hadoop
– Provides inexpensive reliable storage for massive amounts of data on industry-standard hardware
– Data is distributed when stored
– Covered later in this course
§ Apache HBase: The Hadoop Database
– A NoSQL distributed database built on HDFS
– Scales to support very large amounts of data and high throughput
– A table can have thousands of columns
– Covered in depth in Cloudera Training for Apache HBase

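To make "data is distributed when stored" concrete, here is a minimal sketch of cutting a file into fixed-size blocks and placing copies of each block on several nodes. It is an assumption-laden simplification: the helper names are hypothetical, the 128 MB block size and three copies are used only as example values, and the round-robin placement stands in for HDFS's real, rack-aware placement policy.

```python
def split_into_blocks(file_size, block_size):
    """Return the sizes of the blocks a file of file_size units is cut into."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size > 0")
    full, rest = divmod(file_size, block_size)
    # Every block is full-sized except possibly the last one.
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks, nodes, replication):
    """Round-robin placement: block i is copied to `replication` consecutive nodes."""
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    return {
        i: [nodes[(i + r) % len(nodes)] for r in range(replication)]
        for i in range(num_blocks)
    }


# A 300 MB file with 128 MB blocks (sizes in MB; example values only).
sizes = split_into_blocks(300, 128)
placement = place_replicas(len(sizes), ["n1", "n2", "n3", "n4"], replication=3)
print(sizes)
print(placement)
```

Because each block lives on several nodes, losing one machine does not lose data, and several machines can read different blocks of the same file at once.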
Data Ingest Tools (1)

§ HDFS
– Direct file transfer
§ Apache Sqoop
– High speed import to HDFS from relational databases (and vice versa)
– Supports many data storage systems
– e.g. Netezza, MongoDB, MySQL, Teradata, Oracle
– Covered later in this course

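The idea of a parallel import from a relational table can be sketched without Sqoop itself. The example below is not Sqoop's API: it uses Python's built-in `sqlite3` as a stand-in database, and the function `import_table`, the `customers` table, and its rows are all made up for illustration. It divides the key range into equal slices and writes each slice out as comma-delimited text lines, roughly the way separate import tasks would each produce one output file.

```python
import sqlite3


def import_table(conn, table, split_col, num_splits):
    """Read `table` in `num_splits` key-range slices; return one list of CSV lines per slice.

    The table and column names are interpolated directly into the SQL, so they
    must come from trusted code. This is fine for a sketch, not for user input.
    """
    lo, hi = conn.execute(
        f"SELECT MIN({split_col}), MAX({split_col}) FROM {table}"
    ).fetchone()
    if lo is None:  # empty table
        return []
    step = -(-(hi - lo + 1) // num_splits)  # ceiling division
    parts = []
    for i in range(num_splits):
        start = lo + i * step
        rows = conn.execute(
            f"SELECT * FROM {table} WHERE {split_col} >= ? AND {split_col} < ? "
            f"ORDER BY {split_col}",
            (start, start + step),
        ).fetchall()
        parts.append([",".join(str(value) for value in row) for row in rows])
    return parts


# Stand-in source database with a hypothetical customers table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (cust_id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Ann"), (2, "Bob"), (3, "Cy"), (4, "Dee"), (5, "Eve")],
)
parts = import_table(conn, "customers", "cust_id", num_splits=2)
print(parts)
```

Slicing on a key column lets each slice be fetched independently, which is what makes the import parallel.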
Data Ingest Tools (2)

§ Apache Flume
– Distributed service for ingesting streaming data
– Ideally suited for event data from multiple systems
– For example, log files
– Covered later in this course
§ Apache Kafka
– A high throughput, scalable messaging system
– Distributed, reliable publish-subscribe system
– Integrates with Flume and Spark Streaming

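A publish-subscribe system can be pictured as an append-only log that each subscriber reads at its own pace. The class below is a single-process toy, not the Kafka client API; the `Topic` class and its method names are hypothetical and chosen only to illustrate the pattern.

```python
class Topic:
    """An append-only message log; each subscriber tracks its own read position."""

    def __init__(self):
        self._log = []      # every published message, in arrival order
        self._offsets = {}  # subscriber name -> index of the next message to read

    def publish(self, message):
        self._log.append(message)

    def subscribe(self, name):
        # New subscribers start reading from the beginning of the log.
        self._offsets[name] = 0

    def poll(self, name, max_messages=10):
        """Return up to max_messages unread messages and advance the offset."""
        start = self._offsets[name]
        batch = self._log[start:start + max_messages]
        self._offsets[name] = start + len(batch)
        return batch


topic = Topic()
topic.subscribe("alerts")
topic.subscribe("archiver")
for event in ["e1", "e2", "e3"]:
    topic.publish(event)

print(topic.poll("alerts", max_messages=2))  # first two events
print(topic.poll("archiver"))                # all three events
print(topic.poll("alerts"))                  # the remaining event
```

Publishers never need to know who is reading, and a slow subscriber does not hold back a fast one, because each keeps its own offset into the same log.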
Apache Spark: An Engine For Large-scale Data Processing

§ Spark is a large-scale data processing engine
– General purpose
– Runs on Hadoop clusters and processes data in HDFS
§ Supports a wide range of workloads
– Machine learning
– Business intelligence
– Streaming
– Batch Processing
§ This course uses Spark for data processing

Hadoop MapReduce: The Original Hadoop Processing Engine

§ Hadoop MapReduce is the original Hadoop processing framework
– Primarily Java based
§ Based on the MapReduce programming model
§ The core Hadoop processing engine before Spark was introduced
§ Still the dominant technology
– But losing ground to Spark fast
§ Many existing tools are still built using MapReduce code
§ Has extensive and mature fault tolerance built into the framework

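The MapReduce programming model is easiest to see in the classic word count. The sketch below runs the three phases in a single Python process; it is not Hadoop's Java API, and the function names are chosen only for illustration. On a cluster, the map and reduce calls would run in parallel on many machines and the shuffle would move data between them.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: collect all values that share a key into one list."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


print(word_count(["the cat", "The dog the"]))
```

Each map call looks at one line and each reduce call looks at one key, which is what lets the framework spread the work across a cluster.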
Apache Pig: Scripting for MapReduce

§ Apache Pig builds on Hadoop to offer high-level data processing
– This is an alternative to writing low-level MapReduce code
– Pig is especially good at joining and transforming data
§ The Pig interpreter runs on the client machine
– Turns Pig Latin scripts into MapReduce or Spark jobs
– Submits those jobs to a Hadoop cluster
– Covered in Cloudera Data Analyst Training

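-- Load customers and orders from HDFS (fields are untyped here)
people = LOAD '/user/training/customers' AS (cust_id, name);
orders = LOAD '/user/training/orders' AS (ord_id, cust_id, cost);
-- Group orders by customer and total the cost of each group
groups = GROUP orders BY cust_id;
totals = FOREACH groups GENERATE group, SUM(orders.cost) AS t;
-- Join each total to its customer record and print the result
result = JOIN totals BY group, people BY cust_id;
DUMP result;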
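
For readers new to Pig Latin, the same group, sum, and join steps can be traced on a few in-memory rows. The Python below is only a walk-through of the logic on made-up sample data; the names and rows are hypothetical, and nothing here runs on a cluster.

```python
from collections import defaultdict

# Made-up sample rows standing in for the two HDFS inputs.
people = [(1, "Alice"), (2, "Bob"), (3, "Cara")]          # (cust_id, name)
orders = [(100, 1, 25.0), (101, 2, 10.0), (102, 1, 5.0)]  # (ord_id, cust_id, cost)


def totals_by_customer(orders):
    """GROUP orders BY cust_id, then SUM the cost of each group."""
    totals = defaultdict(float)
    for _ord_id, cust_id, cost in orders:
        totals[cust_id] += cost
    return dict(totals)


def join_totals(totals, people):
    """Inner JOIN of totals and people on the customer id."""
    return [
        (cust_id, totals[cust_id], cust_id, name)
        for cust_id, name in sorted(people)
        if cust_id in totals
    ]


result = join_totals(totals_by_customer(orders), people)
print(result)
```

Cara has no orders, so the inner join leaves her out, as the Pig JOIN would.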

Cloudera Impala: High Performance SQL

§ Impala is a high-performance SQL engine
– Runs on Hadoop clusters
– Data stored in HDFS files
– Inspired by Google’s Dremel project
– Very low latency – measured in milliseconds
– Ideal for interactive analysis
§ Impala supports a dialect of SQL (Impala SQL)
– Data in HDFS modeled as database tables
§ Impala was developed by Cloudera
– 100% open source, released under the Apache software license
§ Impala is used for data analysis in this course

§ Hive is an abstraction layer on top of Hadoop
– Hive uses a SQL-like language called HiveQL
– Similar to Impala SQL
– Useful for data processing and ETL
– Impala is preferred for ad hoc analytics
§ Hive executes queries using MapReduce
– Hive on Spark is available for early adopters; not yet recommended for production
§ Hive can optionally be used for data analysis in this course

§ Interactive full-text search for data in a Hadoop cluster
§ Allows non-technical users to access your data
– Nearly everyone can use a search engine
§ Cloudera Search enhances Apache Solr
– Integrates Solr with HDFS, MapReduce, HBase, and Flume
– Supports file formats widely used with Hadoop
– Dynamic Web-based dashboard interface with Hue
– Apache Sentry based security
§ Cloudera Search is 100% open source


§ Hue = Hadoop User Experience
§ Hue provides a Web front-end to Hadoop
– Upload and browse data
– Query tables in Impala and Hive
– Run Spark and Pig jobs and workflows
– Search
– And much more
§ Makes Hadoop easier to use
§ Hue is 100% open-source
§ Created by Cloudera
– Open source, released under Apache license
§ Hue is used throughout this course

§ Oozie
– Workflow engine for Hadoop jobs
– Defines dependencies between jobs
§ The Oozie server submits the jobs to the cluster in the correct sequence
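
Submitting jobs "in the correct sequence" amounts to ordering them so that every job runs after the jobs it depends on. The sketch below shows that ordering step in plain Python. It is not Oozie's own workflow format or API; the function `run_order` and the job names are hypothetical.

```python
def run_order(dependencies):
    """Return a job order in which every job follows all of its prerequisites.

    dependencies maps job name -> set of jobs that must finish first.
    Raises ValueError if the dependencies contain a cycle.
    """
    remaining = {job: set(deps) for job, deps in dependencies.items()}
    # Jobs that appear only as prerequisites have no prerequisites of their own.
    for deps in dependencies.values():
        for dep in deps:
            remaining.setdefault(dep, set())

    order = []
    while remaining:
        # Jobs whose prerequisites have all run; sorted for a repeatable order.
        ready = sorted(job for job, deps in remaining.items() if not deps)
        if not ready:
            raise ValueError("cyclic dependency between jobs")
        for job in ready:
            order.append(job)
            del remaining[job]
        for deps in remaining.values():
            deps.difference_update(ready)
    return order


workflow = {
    "import_accounts": set(),
    "import_logs": set(),
    "join": {"import_accounts", "import_logs"},
    "report": {"join"},
}
print(run_order(workflow))
```

The two import jobs have no prerequisites, so they come first; the join waits for both, and the report waits for the join.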

§ Sentry provides fine-grained access control (authorization) to various Hadoop ecosystem components
– Impala
– Hive
– Cloudera Search
– HDFS
§ In conjunction with Kerberos authentication, Sentry authorization provides a complete cluster security solution
§ Created by Cloudera
– Now an open-source Apache project
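
Fine-grained authorization can be pictured as a lookup from a user's groups to roles, and from roles to specific privileges on specific objects. The sketch below assumes that role-based model for illustration only; it is not Sentry's API, and the function `is_allowed`, the group, role, and table names are all hypothetical.

```python
def is_allowed(user_groups, group_roles, role_privileges, action, resource):
    """Return True if any role granted to any of the user's groups permits the action.

    group_roles maps group -> set of role names.
    role_privileges maps role -> set of (action, resource) pairs.
    """
    for group in user_groups:
        for role in group_roles.get(group, ()):
            if (action, resource) in role_privileges.get(role, ()):
                return True
    return False  # deny by default


group_roles = {"analysts": {"read_sales"}}
role_privileges = {"read_sales": {("SELECT", "sales.orders")}}

print(is_allowed(["analysts"], group_roles, role_privileges, "SELECT", "sales.orders"))
print(is_allowed(["analysts"], group_roles, role_privileges, "INSERT", "sales.orders"))
```

Authentication answers who the user is; a check like this answers what that user may do, which is why the two are paired for cluster security.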


§ The best way to learn is to do!
§ Most topics in this course have a corresponding lab for practicing the skills you have learned in lecture

§ The Homework Labs are based on a hypothetical scenario
– However, the concepts apply to nearly any organization
§ Loudacre Mobile is a (fictional) fast-growing wireless carrier
– Provides mobile service to customers throughout western USA

§ Loudacre needs to migrate their existing infrastructure to Hadoop
– The size and velocity of their data have exceeded their ability to process and analyze it
§ Loudacre data sources
– MySQL database – customer account data (name, address, phone numbers, devices)
– Apache web server logs from Customer Service site
– HTML files – Knowledge base articles
– XML files – Device activation records
– Real-time device status logs
– Base stations – cell tower locations

§ Instructions are in the Homework Labs
§ Start with
– General Notes
– Setting Up
– Run setup script for the course

§ Your virtual machine
– Log in as user training (password training)
– Pre-installed and configured with
– Spark and CDH (Cloudera’s Distribution, including Apache Hadoop)
– Various tools including Firefox, gedit, Emacs, Eclipse, and Maven
§ Training materials: ~/training_materials/dev1 folder on the VM
– exercises – one folder per homework
– scripts – course setup scripts
§ Course data: ~/training_materials/data


§ Hadoop is a framework for distributed storage and processing
§ Core Hadoop includes HDFS for storage and YARN for cluster resource management
§ The Hadoop ecosystem includes many components for
– Ingesting data (Flume, Sqoop, Kafka)
– Storing data (HDFS, HBase)
– Processing data (Spark, Hadoop MapReduce, Pig)
– Modeling data as tables for SQL access (Impala, Hive)
– Exploring data (Hue, Search)
– Protecting Data (Sentry)
§ This course introduces most of the key Hadoop infrastructure
§ Homework Labs let you practice and refine your Hadoop skills!

The following offer more information on topics discussed in this chapter
§ Hadoop: The Definitive Guide (published by O’Reilly)
– http://tiny.cloudera.com/hadooptdg
§ Cloudera Essentials for Apache Hadoop – free online training
– http://tiny.cloudera.com/esscourse
