BigData Nov2019

Uploaded by

Muhammad
Copyright
© All Rights Reserved
Big Data and Apache Hadoop

November 2019
Contents
1. 5Vs of Big Data

2. Types of Data

3. Introduction to Apache Hadoop

4. Principles of Hadoop

5. Hadoop Ecosystem

6. Use Cases of Apache Spark


5Vs of Big Data
5Vs of Big Data
Definition
Types of Data
Types of Data
Definition

Structured
§ Stored in databases.
§ Organized in rows and columns.
§ Example: Data received from web logs and sensors.

Semi-Structured
§ Data that is not stored in a traditional database but is still stored in an organized way.
§ Example: NoSQL documents.

Unstructured
§ Data that does not have a clear format in storage.
§ Example: Pictures uploaded online, YouTube videos, and text messages sent to social media.
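
As a rough illustration of the three categories, the sketch below holds a small sample in each form. The sample records are invented for illustration and do not come from the slides.

```python
import csv
import io
import json

# Structured: fixed rows and columns, as in a database table (sample values are made up).
structured = io.StringIO("sensor_id,temperature\ns1,21.5\ns2,19.0\n")
rows = list(csv.DictReader(structured))

# Semi-structured: no fixed table, but fields are still labeled (a NoSQL-style document).
document = json.loads('{"user": "amy", "tags": ["hadoop", "spark"], "profile": {"city": "KL"}}')

# Unstructured: raw bytes with no self-describing fields (stand-in for a picture or video).
blob = bytes([0x89, 0x50, 0x4E, 0x47])

print(rows[0]["temperature"])        # a column can be addressed by name
print(document["profile"]["city"])   # a nested field can be addressed by key
print(len(blob))                     # only the size is known without further decoding
```

The structured and semi-structured samples can be queried by name or key, while the unstructured sample needs a dedicated decoder before anything can be extracted from it.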
Introduction to Apache Hadoop
Apache Hadoop
Introduction

Overview
A set of open source programs and frameworks that can be used as the backbone of big data operations.

Advantages
§ Scalability
§ Reliability
§ Flexibility
Apache Hadoop
Introduction

Motivation
The larger a physical storage device becomes, the longer it takes to read all the data on it.

Many smaller storage devices working in parallel are therefore more efficient than a single large storage device.
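
The arithmetic behind this point can be sketched as follows. The figures (1 TB of data, 100 MB/s per drive, 100 drives) are assumptions chosen for illustration, not numbers from the slides, and the model ignores network and coordination overhead.

```python
def read_time_seconds(data_mb: float, drive_mb_per_s: float, drives: int = 1) -> float:
    """Idealized time to read data split evenly across drives that are read in parallel."""
    return data_mb / (drive_mb_per_s * drives)

DATA_MB = 1_000_000      # assumed: 1 TB expressed in MB
DRIVE_SPEED = 100.0      # assumed: 100 MB/s per drive

single = read_time_seconds(DATA_MB, DRIVE_SPEED)           # one large drive
parallel = read_time_seconds(DATA_MB, DRIVE_SPEED, 100)    # 100 smaller drives

print(f"one drive: {single / 60:.0f} min, 100 drives: {parallel / 60:.1f} min")
```

Under these assumptions the read time drops from hours to minutes, which is the effect the slide describes.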
Principles of Hadoop
Principles of Hadoop
Hadoop Distributed File System (HDFS)

Overview

§ Reliable architecture for storing very large files in a Hadoop cluster.
§ Works better with a small number of large files than with a huge number of small files.
§ Fault tolerance.
§ High throughput by providing data access in parallel.
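
A minimal sketch of how a large file could be split into blocks and replicated across nodes, which is what makes fault tolerance and parallel access possible. The 128 MB block size and replication factor of 3 are common HDFS defaults but are not stated in the slides. The round-robin placement and the node names are simplifying assumptions and do not reproduce the actual HDFS placement policy.

```python
import math
from typing import Dict, List

BLOCK_SIZE_MB = 128   # assumed block size (common HDFS default)
REPLICATION = 3       # assumed replication factor (common HDFS default)

def place_blocks(file_size_mb: int, nodes: List[str]) -> Dict[int, List[str]]:
    """Map each block index to the nodes holding a replica (simple round-robin)."""
    n_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    copies = min(REPLICATION, len(nodes))
    return {
        b: [nodes[(b + r) % len(nodes)] for r in range(copies)]
        for b in range(n_blocks)
    }

def readable_after_failure(placement: Dict[int, List[str]], failed: str) -> bool:
    """A file stays readable if every block has a replica on a surviving node."""
    return all(any(n != failed for n in replicas) for replicas in placement.values())

placement = place_blocks(1000, ["node1", "node2", "node3", "node4"])
print(len(placement))                               # number of blocks
print(placement[0])                                 # nodes holding block 0
print(readable_after_failure(placement, "node2"))   # survives one node failure?
```

Because every block lives on several nodes, different blocks can be read from different machines at the same time, and losing one machine does not lose the file.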


Principles of Hadoop
Hadoop Distributed File System (HDFS)

HDFS Architecture
Principles of Hadoop
MapReduce
Principles of Hadoop
MapReduce

Terminologies
Job: Complete process from input to final output.
Task: A part of the job executed on a slice of data.
JobTracker: Master node that manages the jobs and resources.
TaskTracker: Agent deployed on each machine to run MapReduce tasks.
Principles of Hadoop
MapReduce

Task of Mapper
§ The input is mapped into Key-Value (KV) pairs.
§ For example, <🍎, 1> is in the format <key, value>.

Intermediate Process
§ Mapper output undergoes shuffling and sorting.
§ The intermediate data is stored in the local file system and is not replicated to other nodes.
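
The map and shuffle-and-sort steps can be sketched in plain Python as below. This is a single-process illustration, not Hadoop's API; the function names and the sample input are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple

def mapper(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a <key, value> pair of <word, 1> for every word in the line."""
    for word in line.split():
        yield (word, 1)

def shuffle_and_sort(pairs: List[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group values by key and order the keys, as happens between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))

lines = ["apple banana apple", "banana cherry"]   # made-up input
pairs = [kv for line in lines for kv in mapper(line)]
print(pairs)
print(shuffle_and_sort(pairs))
```

After shuffling, all values for the same key sit together, so a later step can process one key at a time.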
Principles of Hadoop
MapReduce

Task of Reducer

§ Starts only after all the mappers have completed their operations.
§ Performs aggregate operations (such as summation).
§ Users can define their own functions to implement custom business logic.
§ The output of the Reducer is stored in HDFS.
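
A reducer with user-defined logic might look like the sketch below, which takes grouped mapper output and applies a function to the values of each key. The names and sample data are hypothetical, and writing to HDFS is replaced by returning a dictionary.

```python
from typing import Callable, Dict, List

def run_reducer(grouped: Dict[str, List[int]],
                reduce_fn: Callable[[List[int]], int]) -> Dict[str, int]:
    """Apply reduce_fn to the list of values collected for each key."""
    return {key: reduce_fn(values) for key, values in grouped.items()}

# Grouped mapper output after shuffle and sort (made-up sample).
grouped = {"apple": [1, 1, 1], "banana": [1, 1], "cherry": [1]}

totals = run_reducer(grouped, sum)                           # aggregation: summation
flags = run_reducer(grouped, lambda v: int(sum(v) >= 2))     # custom rule: seen at least twice

print(totals)
print(flags)
```

Passing the function in as a parameter mirrors the idea that the framework handles grouping while the user supplies the business logic.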


Principles of Hadoop
YARN
Hadoop Ecosystem
Hadoop Ecosystem
Apache Spark

Overview

§ Open source cluster computing framework that is suitable for large-scale data processing.
§ Supports machine learning, batch processing, near real-time processing, and graph analysis.
Hadoop Ecosystem
Apache Spark

Application

§ Suitable for large-scale Data Science use cases.
§ Able to run on Hadoop and the Amazon AWS cloud, and to work with different databases such as Cassandra and Amazon DynamoDB.
Hadoop Ecosystem
Apache Hive

Overview

§ Data warehouse software that provides a MapReduce-based SQL engine running on top of Hadoop.
§ Apache Hive employs HiveQL (a SQL-like query language) to access files stored in HDFS or in other data storage systems such as Apache HBase.
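
To illustrate why a SQL-like language is convenient, the sketch below expresses an aggregation as a single declarative query. It uses Python's built-in sqlite3 module purely as a stand-in: it is not Hive, and HiveQL syntax and execution differ. The table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user_id TEXT, page TEXT)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("u1", "home"), ("u2", "home"), ("u1", "news"), ("u3", "home")],  # made-up rows
)

# One declarative query takes the place of hand-written map, shuffle, and reduce code.
query = """
    SELECT page, COUNT(*) AS views
    FROM page_views
    GROUP BY page
    ORDER BY views DESC, page
"""
result = conn.execute(query).fetchall()
print(result)
conn.close()
```

The GROUP BY clause plays the role of the shuffle step and COUNT(*) plays the role of the reducer, without the author of the query having to write either.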
Hadoop Ecosystem
Apache Hive

Application
§ Suitable for building a data warehouse without requiring programmers to write complex MapReduce code.
§ A real-world application is the friend recommendation system on Facebook. A recommendation system has two characteristics:
    § It requires a high volume of input data.
    § Its outputs (recommendations) do not change frequently.
Hadoop Ecosystem
Apache HBase

Overview

§ Distributed, scalable, and multi-level big data store on top of Hadoop and HDFS.
§ NoSQL database used for real-time data streaming because of two advantages:
    § It provides fast, random reads and writes.
    § It works well with sparse data because of its column-oriented design.
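
The sparse-data point can be illustrated with a toy model in which each row stores only the columns it actually has. This is a plain Python dictionary sketch, not the HBase client API; the row keys and column names are hypothetical.

```python
from typing import Dict, Optional

class SparseTable:
    """Toy row-key -> {column: value} store; absent columns take no space."""

    def __init__(self) -> None:
        self._rows: Dict[str, Dict[str, str]] = {}

    def put(self, row_key: str, column: str, value: str) -> None:
        self._rows.setdefault(row_key, {})[column] = value

    def get(self, row_key: str, column: str) -> Optional[str]:
        # Random read by row key and column; a missing cell simply returns None.
        return self._rows.get(row_key, {}).get(column)

    def stored_cells(self) -> int:
        return sum(len(cols) for cols in self._rows.values())

table = SparseTable()
table.put("user1", "info:name", "Amy")
table.put("user1", "info:email", "amy@example.com")
table.put("user2", "info:name", "Ben")      # user2 has no email cell at all

print(table.get("user2", "info:email"))
print(table.stored_cells())
```

In a fixed row-and-column layout the missing email would still occupy an empty cell; here nothing is stored for it.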
Hadoop Ecosystem
Apache HBase

Application
§ Suitable for random, real-time read/write access to big data.
§ Real-world applications of Apache HBase:
    § Helps Facebook to perform real-time analytics, such as counting Facebook likes, and to support messaging.
    § Helps the Financial Industry Regulatory Authority (FINRA) and Pinterest to store graphs.
    § Helps Flipboard to personalize the content feed for its users.
Hadoop Ecosystem
Apache Pig

Overview
§ Platform for analyzing large data sets.
§ Apache Pig employs Pig Latin for queries and data manipulation.
§ Apache Pig is well suited to more complex data manipulation queries because it provides:
    § Nested data types like Maps, Tuples, and Bags.
    § Support for major data operations like Ordering, Filters, and Joins.
Hadoop Ecosystem
Apache Pig

Application
§ Suitable for constructing scheduled jobs. Hence, it is appropriate for automated batch jobs that move data between HDFS and other systems.
§ Suitable for reading data from databases residing in Hadoop that are not structured with Apache Hive metadata schemas.
Hadoop Ecosystem
Apache Sqoop

Overview

§ A tool for efficiently automating the transfer of bulk data between Apache Hadoop and structured datastores.
§ Able to execute the data transfer in parallel.
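
One way a bulk transfer can be parallelized is by dividing the range of a numeric key into contiguous slices and handing each slice to a separate worker. The sketch below only computes such slices; the function name and the ID range are hypothetical, and it does not reproduce Sqoop's actual implementation.

```python
from typing import List, Tuple

def split_key_range(min_id: int, max_id: int, workers: int) -> List[Tuple[int, int]]:
    """Divide the inclusive range [min_id, max_id] into near-equal inclusive slices."""
    total = max_id - min_id + 1
    workers = max(1, min(workers, total))
    base, extra = divmod(total, workers)
    slices = []
    start = min_id
    for i in range(workers):
        size = base + (1 if i < extra else 0)   # spread the remainder over the first slices
        slices.append((start, start + size - 1))
        start += size
    return slices

# Each slice could back one query such as: WHERE id BETWEEN lo AND hi
print(split_key_range(1, 10, 4))
```

Because the slices do not overlap and together cover the whole range, the workers can run at the same time without copying any row twice.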


Hadoop Ecosystem
Apache Flume

Overview

§ A tool to collect, aggregate, and transport large amounts of streaming data from a variety of sources, in both real-time and batch mode, to a centralized data store (e.g. HDFS).
Hadoop Ecosystem
Apache Flume

Application

§ Suitable for importing, in real time, huge volumes of event data generated by websites such as Facebook, Twitter, Amazon, and Flipkart.
Hadoop Ecosystem
Apache Flink

Overview

§ A tool for data streaming and processing applications that excels at stateful streaming applications at any scale.
§ Supports real-time processing, machine learning, batch processing, and graph analysis.
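
Stateful stream processing means that each incoming event is handled using state remembered from earlier events. A minimal single-process sketch with a running count per key is shown below; it is plain Python rather than the Flink API, and the event data is made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

def running_counts(events: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """For each event key, emit the key with the number of times it has been seen so far."""
    state: Dict[str, int] = defaultdict(int)   # per-key state kept between events
    for key in events:
        state[key] += 1
        yield (key, state[key])

stream = iter(["click", "view", "click", "click"])   # stand-in for an unbounded stream
for update in running_counts(stream):
    print(update)
```

The generator emits a result for every event as it arrives instead of waiting for the input to end, which is the main difference from a batch job.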
Hadoop Ecosystem
Apache Flink

Application
§ Apache Flink is able to work with third-party data sources such as Amazon Kinesis Streams, Elasticsearch, Cassandra, and the Twitter Streaming API.
§ The introduction of ACID into the data Artisans platform reinforces the position of Apache Flink as the real-time integration hub for large financial and eCommerce organizations.
Hadoop Ecosystem
Apache Impala

Overview

§ A low-latency, high-performance SQL-like query engine for querying data stored on Hadoop clusters (e.g. HDFS, Apache HBase) in real time.
§ Impala shares the same SQL syntax (Hive SQL), ODBC driver, metadata, and user interface (Hue Beeswax) as Apache Hive. Hive users can therefore use Impala with little setup overhead.
Hadoop Ecosystem
Apache Impala

Application

§ Suitable for interactive applications that need complicated queries to return relatively fast. It allows users to obtain answers to unexpected questions (complicated queries) in seconds or, at most, a few minutes.
Hadoop Ecosystem
Apache Kafka

Overview

§ Designed to let a single cluster serve as the central data backbone of a large organization, ingesting and processing data in real time.
§ A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients.
Hadoop Ecosystem
Apache Kafka

Application
§ Suitable for publish-subscribe messaging. Users can publish and subscribe to information as and when it occurs.
§ Suitable for managing the variety of use cases commonly required for a Data Lake.
§ Able to handle streaming data in combination with Apache HBase, Apache Storm, and Apache Spark systems.
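
The publish-subscribe pattern can be sketched with a tiny in-memory broker in which subscribers register a callback per topic. This models the idea only; it is not the Kafka client API, and the class, topic, and message names are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List

class ToyBroker:
    """In-memory registry of topic -> subscriber callbacks."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callable[[str], None]) -> None:
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, message: str) -> int:
        """Deliver the message to every subscriber of the topic; return the delivery count."""
        for callback in self._subscribers[topic]:
            callback(message)
        return len(self._subscribers[topic])

broker = ToyBroker()
received: List[str] = []
broker.subscribe("orders", received.append)
broker.subscribe("orders", lambda m: print("audit:", m))

delivered = broker.publish("orders", "order-1 created")
print(delivered, received)
```

The publisher does not need to know who the subscribers are, which is what lets many systems share one data backbone. Unlike Kafka, this toy keeps no stored log of messages, so a subscriber that joins late misses everything published earlier.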
Hadoop Ecosystem
Apache Storm

Overview

§ A real-time computation system for processing high-volume data arriving at high velocity, possibly from various sources.
§ Easy to implement and can be integrated with any programming language.
Hadoop Ecosystem
Apache Storm

Application
§ Suitable for applications that focus primarily on stream processing and complex event processing (CEP).
§ Apache Storm has the advantage of broader language support over Apache Spark.
Hadoop Ecosystem
Apache Drill

Overview

§ A schema-free SQL query engine for Hadoop, NoSQL, and cloud storage.
§ Does not depend on Hadoop, because Drill does not use MapReduce jobs internally. Drill has its own distributed processing service called the Drillbit.
Hadoop Ecosystem
Apache Drill

Application
§ Can be used to join data from multiple datastores with just a single query.
§ Can be used to connect standard BI/analytics tools to non-relational datastores by leveraging Apache Drill’s JDBC and ODBC drivers.
Hadoop Ecosystem
Apache Arrow

Overview
§ Built by the lead developers of many Apache projects.
§ A component used to exchange data with low overhead, which accelerates data analytics.
§ Apache Arrow is extremely important for the Python and R communities because it provides data interoperability between the two communities and big data systems (which largely run on the JVM).
Hadoop Ecosystem
Apache Arrow

Application

§ Used to reduce the time spent gathering and processing data. For example:
    § PySpark: IBM measured a 53x speedup in data processing by Python and Spark with the support of Apache Arrow in PySpark.
Hadoop Ecosystem
Big Data Technology Stack from Tesla
Use Cases of Apache Spark
INDUSTRY: Web Services Provider
USE CASE: News Pages Personalization

CHALLENGES
§ Analyze users’ preferences and events happening in the outside world.
§ Match users accurately and promptly to the events or news that interest them.

SOLUTION
§ Yahoo uses Machine Learning (ML) algorithms running on Spark to analyze users’ preferences and categorize news stories based on the types of users who would be interested in reading them.

INDUSTRY: Web Services Provider
USE CASE: Advertisement Analytics with Existing BI Tools

CHALLENGES
§ Make Spark compatible with existing BI tools to view and query the advertising analytics data stored in Hadoop.

SOLUTION
§ Shark, the SQL engine built on Spark, is compatible with the standard Hive server API, so it works without issue with tools that plug into Hive (e.g. Tableau).
§ With this compatibility, Yahoo is able to query its advertisement visit data interactively.

INDUSTRY: Online Video Streaming Provider
USE CASE: Online video optimization and online video analytics

CHALLENGES
§ Users expect good video quality without much delay.
§ Highly sophisticated behind-the-scenes technology is required to ensure a high quality of service by avoiding the dreaded screen buffering.

SOLUTION
§ Conviva deploys Spark Streaming to analyze network traffic in real time. The results are then fed directly into the video player (e.g. Flash player) to optimize the speeds.

INDUSTRY: Data Intelligence Company
USE CASE: Internal and External Data Harmonization

CHALLENGES
§ A platform is required to integrate internal data sources with external sources (e.g. social media traffic and public data feeds) for business users without using complex data modeling.

SOLUTION
§ ClearStory uses both Hadoop and Apache Spark for its service. It stores data uploaded by users on the Hadoop Distributed File System (HDFS). It then uses Spark’s core in-memory query-optimization engine, which allows fast data preparation, data blending, and iterative analysis.

Thank you.

The Center of Applied Data Science
thecads.org
[email protected]
thecadsmalaysia