BigData Nov2019

Uploaded by

Muhammad

Big Data and Apache Hadoop

November 2019
Contents
1. 5Vs of Big Data

2. Types of Data

3. Introduction to Apache Hadoop

4. Principles of Hadoop

5. Hadoop Ecosystem

6. Use Cases of Apache Spark

5Vs of Big Data
5Vs of Big Data
Definition
Types of Data
Types of Data
Definition

Structured
§ Stored in databases.
§ Organized in rows and columns.
§ Example: Data received from web logs and sensors.

Semi-Structured
§ Data that is not stored in a traditional database but is still organized in a defined way.
§ Example: NoSQL documents.

Unstructured
§ Data that does not have a clear format in storage.
§ Example: Pictures uploaded online, YouTube videos, text messages sent to social media.
Introduction to Apache Hadoop
Apache Hadoop
Introduction

Overview
A collection of open-source programs and frameworks that
can serve as the backbone of big data operations.

Advantages
§ Scalability
§ Reliability
§ Flexibility
Apache Hadoop
Introduction

Initiative
Reading the data takes longer as a single physical
storage device becomes bigger.

Many smaller storage devices working in parallel are
therefore more efficient than one large storage device.
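
The arithmetic behind this point can be sketched as follows. The figures are illustrative assumptions, not measurements from the slides: 1 TB of data, 100 MB/s of sequential throughput per device, and data spread evenly over the devices. The function name `read_time_seconds` is hypothetical.

```python
def read_time_seconds(data_bytes, throughput_bytes_per_s, num_devices=1):
    """Idealized scan time when data is spread evenly over devices read in parallel."""
    if num_devices < 1:
        raise ValueError("num_devices must be at least 1")
    # Aggregate throughput grows linearly with the number of devices (idealized).
    return data_bytes / (throughput_bytes_per_s * num_devices)


TB = 10**12  # assumed data size: 1 terabyte
MB = 10**6   # assumed per-device throughput unit: megabytes per second

single_device = read_time_seconds(1 * TB, 100 * MB)                   # 10000.0 seconds
hundred_devices = read_time_seconds(1 * TB, 100 * MB, num_devices=100)  # 100.0 seconds
```

Under these assumptions a full scan drops from roughly 2.8 hours to under 2 minutes. Real clusters see less than this ideal speedup because of network and coordination overhead.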
Principles of Hadoop
Principles of Hadoop
Hadoop Distributed File System (HDFS)

Overview

§ Reliable architecture for storing very large files in a Hadoop cluster.

§ Works best with a small number of large files rather than a huge number of small files.

§ Fault tolerance.

§ High throughput by providing data access in parallel.

Principles of Hadoop
Hadoop Distributed File System (HDFS)

HDFS Architecture
Principles of Hadoop
MapReduce
Principles of Hadoop
MapReduce

Terminologies
§ Job: The complete process from input to final output.
§ Task: A part of the job executed on a slice of the data.
§ JobTracker: Master node that manages the jobs and resources.
§ TaskTracker: Agent deployed on each machine to run MapReduce tasks.
Principles of Hadoop
MapReduce

Task of Mapper
§ The input is mapped into key-value (KV) pairs.

§ For example, <🍎, 1> is in the format <key, value>.

Intermediate Process
§ The mapper output undergoes shuffling and sorting.

§ The intermediate data is stored in the local file system and is not replicated to other nodes.
Principles of Hadoop
MapReduce

Task of Reducer

§ Starts only after all the mappers have completed their operations.

§ Performs mathematical operations (such as aggregation/summation).

§ Users can define their own functions to implement custom business logic.

§ The output of the Reducer is stored in HDFS.
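
The mapper, intermediate, and reducer steps above can be sketched as a single-process word count. This is a minimal illustration in plain Python, not Hadoop's API: the function names are hypothetical, and a real job would run mappers and reducers on different machines and write the result to HDFS.

```python
from collections import defaultdict


def mapper(line):
    # Map: emit a <word, 1> key-value pair for every word in the input line.
    return [(word.lower(), 1) for word in line.split()]


def shuffle_and_sort(pairs):
    # Intermediate step: group all values by key, then order the groups by key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reducer(key, values):
    # Reduce: aggregate (here, sum) all values collected for one key.
    return key, sum(values)


def word_count(lines):
    # The reducers run only after every line has been mapped.
    mapped = [pair for line in lines for pair in mapper(line)]
    return [reducer(key, values) for key, values in shuffle_and_sort(mapped)]
```

For example, `word_count(["apple banana apple", "banana cherry"])` counts two occurrences each of "apple" and "banana" and one of "cherry". Swapping `sum` for another function in `reducer` is where custom business logic would go.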

Principles of Hadoop
YARN
Hadoop Ecosystem
Hadoop Ecosystem
Apache Spark

Overview

§ Open-source cluster computing framework that is suitable for large-scale data processing.

§ Supports machine learning, batch processing, near real-time processing, and graph analysis.
Hadoop Ecosystem
Apache Spark

Application

§ Suitable for large-scale data science use cases.

§ Able to run on Hadoop and on the AWS cloud, and to work with different databases such as Cassandra and Amazon DynamoDB.
Hadoop Ecosystem
Apache Hive

Overview

§ Data warehouse software that provides a MapReduce-based SQL engine running on top of Hadoop.

§ Apache Hive employs HiveQL (a SQL-like query language) to access files stored in HDFS or in other data storage systems such as Apache HBase.
Hadoop Ecosystem
Apache Hive

Application
§ Suitable for building a data warehouse without requiring programmers to write complex MapReduce code.

§ A real-world application is the friend recommendation system on Facebook. A recommendation system has two characteristics:
  § It requires a high volume of input data.
  § Its outputs (recommendations) do not change frequently.
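
To illustrate why a SQL-like layer saves programmers from writing MapReduce code, the sketch below expresses a grouped count as one declarative query. It uses Python's built-in `sqlite3` module purely as a stand-in: SQLite is not Hive, and the table `page_likes` and its columns are hypothetical. On a cluster, Hive would accept a similar HiveQL statement and translate it into distributed jobs.

```python
import sqlite3


def likes_per_page(rows):
    """Count likes per page with a single GROUP BY query (stand-in for a HiveQL query)."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE page_likes (page TEXT, user_id INTEGER)")
        conn.executemany("INSERT INTO page_likes VALUES (?, ?)", rows)
        # One declarative statement replaces hand-written map, shuffle, and reduce code.
        cursor = conn.execute(
            "SELECT page, COUNT(*) AS likes FROM page_likes GROUP BY page ORDER BY page"
        )
        return cursor.fetchall()
    finally:
        conn.close()
```

Calling `likes_per_page([("news", 1), ("sports", 2), ("news", 3)])` returns one `(page, likes)` row per page.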
Hadoop Ecosystem
Apache HBase

Overview

§ Distributed, scalable, and multi-level big data store on top of Hadoop and HDFS.

§ NoSQL database used for real-time data streaming because of two advantages:
  § Able to provide fast, random reads and writes.
  § Able to work well with sparse data thanks to its column-oriented design.
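
The sparse-data advantage can be sketched with a toy in-memory table in which only the cells that were actually written take up space. This is a simplified model, not the HBase client API: the class `SparseTable`, its methods, and the `family:qualifier` column names are hypothetical.

```python
from collections import defaultdict


class SparseTable:
    """Toy row-key -> {column: value} store; absent cells cost nothing."""

    def __init__(self):
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        # Random write: address a single cell by row key and column name.
        self._rows[row_key][column] = value

    def get(self, row_key, column, default=None):
        # Random read: look up one cell without scanning other rows.
        return self._rows.get(row_key, {}).get(column, default)

    def stored_cells(self):
        # Only written cells are counted; missing columns are simply absent.
        return sum(len(columns) for columns in self._rows.values())
```

Two rows that each fill a different column store two cells in total, rather than a full rows-by-columns grid padded with nulls.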
Hadoop Ecosystem
Apache HBase

Application
§ Suitable for random, real-time read/write access to big data.

§ Real-world applications of Apache HBase:

  § Helps Facebook perform real-time analytics, such as counting Facebook likes, and supports messaging.
  § Helps the Financial Industry Regulatory Authority (FINRA) and Pinterest store graphs.
  § Helps Flipboard personalize the content feed for its users.
Hadoop Ecosystem
Apache Pig

Overview
§ Platform for analyzing large data sets.

§ Apache Pig employs Pig Latin for queries and data manipulation.

§ Apache Pig is well suited to more complex data manipulation queries because it provides:
  § Nested data types such as Maps, Tuples, and Bags.
  § Support for major data operations such as Ordering, Filters, and Joins.
Hadoop Ecosystem
Apache Pig

Application
§ Suitable for constructing scheduled jobs. Hence, it is appropriate for automated batch jobs that move data between HDFS and other systems.

§ Suitable for reading data that resides in Hadoop but is not structured with Apache Hive metadata schemas.
Hadoop Ecosystem
Apache Sqoop

Overview

§ A tool for efficiently automating the transfer of bulk data between Apache Hadoop and structured datastores.

§ Able to execute the data transfer in parallel.
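
One common way to parallelize a bulk transfer is to divide a numeric key range into contiguous slices and give each slice to a separate worker. The sketch below illustrates that idea only; it is not Sqoop's actual implementation, and the function name and parameters are hypothetical.

```python
def split_key_range(min_id, max_id, num_workers):
    """Divide the inclusive range [min_id, max_id] into contiguous slices, one per worker."""
    total = max_id - min_id + 1
    if total <= 0 or num_workers <= 0:
        return []
    workers = min(num_workers, total)   # never create empty slices
    base, extra = divmod(total, workers)
    slices = []
    start = min_id
    for i in range(workers):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first slices
        slices.append((start, start + size - 1))
        start += size
    return slices
```

For ids 1 to 10 and four workers, this yields the slices (1, 3), (4, 6), (7, 8), and (9, 10), which can then be copied independently and in parallel.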

Hadoop Ecosystem
Apache Flume

Overview

§ A tool for collecting, aggregating, and transporting large amounts of streaming data from a variety of sources, in both real-time and batch modes, to a centralized data store (e.g. HDFS).
Hadoop Ecosystem
Apache Flume

Application

§ Suitable for importing, in real time, the huge volumes of event data generated by websites such as Facebook, Twitter, Amazon, and Flipkart.
Hadoop Ecosystem
Apache Flink

Overview

§ A framework for data streaming and processing applications that excels at stateful streaming applications at any scale.

§ Supports real-time processing, machine learning, batch processing, and graph analysis.
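
"Stateful" streaming means the computation remembers something between events. A minimal sketch in plain Python, not the Flink API, is a running count per key that is updated as each event arrives; the function name `running_counts` is hypothetical.

```python
from collections import defaultdict


def running_counts(events):
    """Yield (key, count_so_far) for each incoming event, keeping per-key state."""
    state = defaultdict(int)  # the state that survives from one event to the next
    for key in events:
        state[key] += 1
        yield key, state[key]
```

Because it is a generator, results are emitted one event at a time rather than after the whole input has been read. A production system would also checkpoint `state` so that it survives failures.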
Hadoop Ecosystem
Apache Flink

Application
§ Apache Flink is able to work with third-party data sources such as Amazon Kinesis Streams, Elasticsearch, Cassandra, and the Twitter Streaming API.

§ The introduction of ACID into the data Artisans platform reinforces the position of Apache Flink as the real-time integration hub for large financial and e-commerce organizations.
Hadoop Ecosystem
Apache Impala

Overview

§ A low-latency, high-performance SQL-like query engine for querying data stored on Hadoop clusters (e.g. HDFS, Apache HBase) in real time.

§ Impala shares the same SQL syntax (Hive SQL), ODBC driver, metadata, and user interface (Hue Beeswax) as Apache Hive. Hive users can therefore use Impala with little setup overhead.
Hadoop Ecosystem
Apache Impala

Application

§ Suitable for interactive applications that need complicated queries to respond relatively quickly. It allows users to obtain answers to unexpected questions (complicated queries) in seconds or, at most, a few minutes.
Hadoop Ecosystem
Apache Kafka

Overview

§ Designed to serve as the central data backbone of a large organization, with a single cluster ingesting and processing data in real time.

§ A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients.
Hadoop Ecosystem
Apache Kafka

Application
§ Suitable for publish-subscribe messaging. Users can publish and subscribe to information as and when it occurs.

§ Suitable for managing the variety of use cases commonly required for a data lake.

§ Able to deliver streaming data through a combination of Apache HBase, Apache Storm, and Apache Spark systems.
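
The publish-subscribe pattern can be sketched with a toy in-memory broker: publishers append messages to a per-topic log, and each subscriber tracks its own read position, so every subscriber sees every message independently. This is a simplified model, not the Kafka client API; the class `TinyBroker` and its methods are hypothetical.

```python
from collections import defaultdict


class TinyBroker:
    """Toy publish-subscribe broker with an append-only log per topic."""

    def __init__(self):
        self._log = defaultdict(list)     # topic -> ordered list of messages
        self._offsets = defaultdict(int)  # (topic, subscriber) -> next index to read

    def publish(self, topic, message):
        # Publishers only append; existing messages are never modified.
        self._log[topic].append(message)

    def poll(self, topic, subscriber):
        # Return the messages this subscriber has not seen yet, then advance its offset.
        start = self._offsets[(topic, subscriber)]
        messages = self._log[topic][start:]
        self._offsets[(topic, subscriber)] = len(self._log[topic])
        return messages
```

A subscriber that joins late still receives the full history of the topic, while one that has already polled receives only new messages.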
Hadoop Ecosystem
Apache Storm

Overview

§ A real-time computation system for processing high-volume data arriving at high velocity, possibly from various sources.

§ Easy to implement, and can be integrated with any programming language.
Hadoop Ecosystem
Apache Storm

Application
§ Suitable for applications that primarily focus on stream processing and complex event processing (CEP).

§ Apache Storm has the advantage of broader language support over Apache Spark.
Hadoop Ecosystem
Apache Drill

Overview

§ A schema-free SQL query engine for Hadoop, NoSQL, and cloud storage.

§ Does not depend on Hadoop, because Drill does not use MapReduce jobs internally. Drill has its own distributed processing service called Drillbit.
Hadoop Ecosystem
Apache Drill

Application
§ Can be used to join data from multiple datastores with just a single query.

§ Can be used to connect standard BI/analytics tools to non-relational datastores by leveraging Apache Drill’s JDBC and ODBC drivers.
Hadoop Ecosystem
Apache Arrow

Overview
§ Built by the lead developers of many Apache projects.

§ A component used to exchange data with low overhead, thereby accelerating data analytics.

§ Apache Arrow is extremely important for the Python and R communities because it provides data interoperability between those communities and big data systems (which largely run on the JVM).
Hadoop Ecosystem
Apache Arrow

Application

§ Used to reduce the time spent gathering and processing data. For example:
  § PySpark: IBM measured a 53x speedup in data processing by Python and Spark with the support of Apache Arrow in PySpark.
Hadoop Ecosystem
Big Data Technology Stack from Tesla
Use Cases of Apache Spark
INDUSTRY:
Web Services Provider

USE CASE:
News Pages Personalization

CHALLENGES
§ Analyze users’ preferences and events happening in the outside world.
§ Relate users to the events or news that interest them, accurately and promptly.

SOLUTION
§ Yahoo uses Machine Learning (ML) algorithms running on Spark to analyze users’ preferences and categorize news stories based on the types of users who would be interested in reading them.

INDUSTRY:
Web Services Provider

USE CASE:
Advertisement Analytics with Existing BI Tools

CHALLENGES
§ Make Spark compatible with existing BI tools to view and query the advertising analytics data stored in Hadoop.

SOLUTION
§ Shark, which runs on Spark, is compatible with the standard Hive server API, so it works without issue with tools that plug into Hive (e.g. Tableau).

§ With this compatibility, Yahoo is able to query its advertisement visit data interactively.

INDUSTRY:
Online Video Streaming Provider

USE CASE:
Online video optimization and online video analytics

CHALLENGES
§ Users expect good video quality without much delay.
§ Highly sophisticated behind-the-scenes technology is required to ensure a high quality of service by avoiding dreaded screen buffering.

SOLUTION
§ Conviva deploys Spark Streaming to analyze network traffic in real time. The results are then fed directly into the video player (e.g. Flash player) to optimize the speeds.

INDUSTRY:
Data Intelligence Company

USE CASE:
Internal and External Data Harmonization

CHALLENGES
§ A platform is required to integrate internal data sources with external sources (e.g. social media traffic and public data feeds) for business users without using complex data modeling.

SOLUTION
§ ClearStory uses both Hadoop and Apache Spark for its service. It stores data uploaded by users on the Hadoop Distributed File System (HDFS). It then uses Spark’s core in-memory query-optimization engine, which allows fast data preparation, data blending, and iterative analysis.
Thank you.

thecads.org
[email protected]
The Center of Applied Data Science
thecads.org
thecadsmalaysia
