BigData Nov2019
November 2019
Contents
1. 5Vs of Big Data
2. Types of Data
4. Principles of Hadoop
5. Hadoop Ecosystem
Structured
Semi-Structured
Unstructured
Overview
Open-source programs and frameworks that can serve
as the backbone of big data operations.
Advantages
§ Scalability
§ Reliability
§ Flexibility
Apache Hadoop
Introduction
Initiative
As physical storage devices grow in capacity,
reading all the data on them takes longer.
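The point can be illustrated with simple arithmetic: a full scan takes roughly the capacity divided by the transfer rate. The sketch below uses hypothetical figures (a 1 TB drive read at 100 MB/s; these numbers are assumptions for illustration, not taken from the slides) and compares one drive against the same data split evenly across 100 drives read in parallel. It ignores seek time and coordination overhead.

```python
def full_scan_seconds(capacity_mb: float, rate_mb_per_s: float, drives: int = 1) -> float:
    """Approximate time to read all data when it is split evenly across
    `drives` drives that are read in parallel (overheads are ignored)."""
    if drives < 1:
        raise ValueError("drives must be at least 1")
    return capacity_mb / (rate_mb_per_s * drives)


# Hypothetical figures, chosen only for illustration.
ONE_TB_IN_MB = 1_000_000
RATE_MB_PER_S = 100

single = full_scan_seconds(ONE_TB_IN_MB, RATE_MB_PER_S)
parallel = full_scan_seconds(ONE_TB_IN_MB, RATE_MB_PER_S, drives=100)

print(f"1 drive:    {single / 3600:.1f} hours")
print(f"100 drives: {parallel / 60:.1f} minutes")
```

Under these assumed figures the scan time falls in proportion to the number of drives, which is why distributing data across many machines is attractive.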
Overview
§ Fault tolerance.
HDFS Architecture
Principles of Hadoop
MapReduce
Terminologies
Job: the complete process from input to final output
Task of Mapper
§ The input is mapped into Key-Value (KV) pairs.
Intermediate Process
§ The mapper output undergoes shuffling and sorting.
Task of Reducer
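The stages listed above (map to KV pairs, shuffle and sort, reduce) can be sketched as a single-process word count. This is a minimal illustration in plain Python, not the Hadoop API; the function names are hypothetical, and the reducer's summing step is an assumption based on the usual word-count example, since the slide's reducer description is not available here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def mapper(line: str) -> List[Tuple[str, int]]:
    """Map one input line into (word, 1) key-value pairs."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_and_sort(pairs: Iterable[Tuple[str, int]]) -> List[Tuple[str, List[int]]]:
    """Group values by key, then sort by key (the intermediate process)."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reducer(key: str, values: List[int]) -> Tuple[str, int]:
    """Aggregate all values for one key (assumed here to be a sum)."""
    return key, sum(values)


def run_job(lines: Iterable[str]) -> Dict[str, int]:
    """A 'job': the complete process from input to final output."""
    mapped = [pair for line in lines for pair in mapper(line)]
    return dict(reducer(key, values) for key, values in shuffle_and_sort(mapped))


counts = run_job(["big data big ideas", "data moves fast"])
print(counts)
```

In a real cluster the mappers and reducers run on different machines and the shuffle moves data across the network; here everything runs in one process to keep the data flow visible.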
Overview
Application
Overview
Application
§ Suitable for building a data warehouse without requiring
programmers to write complex MapReduce code.
Overview
Application
§ Suitable for random, real-time read/write access to
big data.
Overview
§ Platform for analyzing large data sets.
Application
§ Suitable for constructing scheduled jobs. Hence, it is
appropriate for automated batch jobs that move data
between HDFS and other systems.
Overview
Overview
Application
Overview
Application
§ Apache Flink can work with third-party data
sources such as Amazon Kinesis Streams,
Elasticsearch, Cassandra, and the Twitter Streaming API.
Overview
Application
Overview
Application
§ Suitable for publish-subscribe messaging. Users can
publish messages and subscribe to them as and when
they occur.
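The publish-subscribe pattern itself can be sketched with a small in-memory broker. The `Broker` class and the topic names below are hypothetical and exist only for illustration; this is not the API of any real messaging system, and it omits persistence, partitioning, and network delivery.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List


class Broker:
    """Toy in-memory broker: routes each published message to every
    handler subscribed to the same topic."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[str], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: str) -> int:
        """Deliver the message and return how many subscribers received it."""
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(message)
        return len(handlers)


received: List[str] = []
broker = Broker()
broker.subscribe("news", received.append)
broker.subscribe("news", lambda m: received.append(m.upper()))

delivered = broker.publish("news", "hello")
ignored = broker.publish("sports", "no one listening")
print(received, delivered, ignored)
```

The publisher never refers to a subscriber directly; it only names a topic, which is what lets producers and consumers be added or removed independently.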
Overview
Application
§ Suitable for applications that are primarily focused on
stream processing and complex event processing (CEP).
Overview
Application
§ Can be used to join data from multiple datastores
with just a single query.
Apache Arrow
Overview
§ Built by the lead developers of many Apache
projects.
§ A component used to exchange data with low
overhead, thereby accelerating data analytics.
Application
INDUSTRY: Web Services Provider
USE CASE: News Pages Personalization
SOLUTION
§ Yahoo uses Machine Learning (ML) algorithms running on
Spark to analyze users’ preferences and categorize news
stories based on the types of users who would be
interested in reading them.
INDUSTRY: Web Services Provider
USE CASE: Advertisement Analytics with Existing BI Tools
CHALLENGES
§ To make Spark compatible with existing BI tools to view and
query the advertising analytics data stored in Hadoop.
SOLUTION
§ Spark Shark is compatible with the standard Hive server
API, so it works without issue with tools that
plug into Hive (e.g. Tableau).
USE CASE: Online video optimization and online video analytics
SOLUTION
§ Conviva deploys Spark Streaming to analyze network
traffic in real time. The results are then fed
directly into the video player (e.g. Flash player) to optimize
streaming speed.
INDUSTRY: Data Intelligence Company
USE CASE: Internal and External Data Harmonization
CHALLENGES
§ Requires a platform that integrates internal data sources
with external sources (e.g. social media traffic and public
data feeds) for business users without complex data
modeling.
SOLUTION
§ ClearStory uses both Hadoop and Apache Spark for its
service. It stores data uploaded by users on a Hadoop
Distributed File System (HDFS). It then uses
Spark’s core in-memory query-optimization engine, which
allows fast data preparation, data blending, and iterative
analysis.
Thank you.
thecads.org
[email protected]
The Center of Applied Data Science
thecadsmalaysia