BDS Session 1
Topics for today
• Motivation
✓ Why do modern Enterprises need to work with data
✓ What is Big Data and data classification
✓ Scaling RDBMS
• What is a Big Data System
✓ Characteristics
✓ Design challenges
• Architecture
✓ High level architecture of Big Data solutions
✓ Technology ecosystem
✓ Case studies
• Why locality of reference and storage organisation matter
Example of a data-driven Enterprise: A large online retailer (1)
Example of a data-driven Enterprise: A large online retailer (2)
Data volume growth
Data usage pattern
Data classes: structured, semi-structured, and unstructured
✓ analysis of social media content for sentiment analysis
✓ analysis of unstructured text content by search engines on the web as well as within the enterprise
✓ analysis of support emails and calls for customer satisfaction
✓ NLP / Conversational AI for answering user questions from backend data
Structured Data
Unstructured Data (1)
Unstructured Data (2)
Define Big Data
Isn’t a traditional RDBMS good enough?
Example Web Analytics Application
Scaling with Database Partitions
Using Database shards
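The slides reference sharding without code; here is a minimal sketch of hash-based shard routing, assuming a fixed shard count and a string user id as the shard key (the names and the FNV-1a hash are illustrative, not from the slides):

#include <stdint.h>
#include <stdio.h>

#define NUM_SHARDS 4   /* illustrative fixed shard count */

/* FNV-1a hash: spreads string keys evenly across shards */
static uint32_t fnv1a(const char *key) {
    uint32_t h = 2166136261u;
    while (*key) { h ^= (uint8_t)*key++; h *= 16777619u; }
    return h;
}

/* Route a row to one of NUM_SHARDS database servers */
static int shard_for(const char *user_id) {
    return (int)(fnv1a(user_id) % NUM_SHARDS);
}

int main(void) {
    const char *users[] = {"alice", "bob", "carol"};
    for (int i = 0; i < 3; i++)
        printf("%s -> shard %d\n", users[i], shard_for(users[i]));
    return 0;
}

Note the catch this setup carries into the next slides: with modulo placement, growing the shard count remaps most keys, so rebalancing a sharded RDBMS means moving data.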
Issues with RDBMS (1)
Issues with RDBMS (2)
(Figure: replicas of shards in a social-site DB, with separate write and read paths)
• Not all Big Data use cases need strong ACID semantics, especially Systems of Engagement
✓ Strong consistency becomes a bottleneck with many replicas and many attributes - the need is to optimise for fast writes and reads with few updates
Topics for today
• Motivation
✓ Why do modern Enterprises need to work with data
✓ What is Big Data and data classification
✓ Scaling RDBMS
• What is a Big Data System
✓ Characteristics
✓ Design challenges
• Architecture
✓ High level architecture of Big Data solutions
✓ Technology ecosystem
✓ Case studies
• Why locality of reference and storage organisation matter
Characteristics of Big Data Systems (1)
Challenges in Big Data Systems (1)
• Latency issues in algorithms and data storage working with large data
sets (S1*-LoR)
• Basic design considerations of Distributed and Parallel systems -
reliability, availability, consistency (S2, S3, S4)
• What data to keep and for how long - depends on analysis use case
• Cleaning / Curation of data (S4-Lifecycle)
• Overall orchestration involving large volumes of data (S9)
• Choosing the right technologies, from the many options including open
source, to build the Big Data System for the use cases (S6 onwards)
Topics for today
• Motivation
✓ Why do modern Enterprises need to work with data
✓ What is Big Data and data classification
✓ Scaling RDBMS
• What is a Big Data System
✓ Characteristics
✓ Design challenges
• Architecture
✓ High level architecture of Big Data solutions
✓ Technology ecosystem
✓ Case studies
• Why locality of reference and storage organisation matter
• Summary
Types of Big Data solutions
1. Batch processing of big data sources at rest
✓ Building ML models, statistical aggregates
✓ “What percentage of users in the US last year watched shows starring Kevin Spacey and completed a season within 4 weeks or a movie within 4 hours” (a toy version of this aggregate is sketched below)
✓ “Predict the number of US family users aged 30-40 who will buy a Kellogg's cereal if they purchase milk”
• Designed to handle the ingestion, processing, and analysis of data that is too large or complex
for traditional database systems.
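A toy version of such a batch aggregate, assuming an invented viewing-record layout (none of these fields come from the slides):

#include <stdio.h>

/* Invented record layout, for illustration only */
struct view_record {
    int user_id;
    int completed_season;    /* 1 if the user finished a season */
    int days_to_complete;
};

int main(void) {
    struct view_record views[] = {
        {1, 1, 20}, {2, 0, 0}, {3, 1, 35}, {4, 1, 10},
    };
    int n = sizeof views / sizeof views[0];
    int matched = 0;
    /* Batch pass: count users who completed a season within 4 weeks */
    for (int i = 0; i < n; i++)
        if (views[i].completed_season && views[i].days_to_complete <= 28)
            matched++;
    printf("%.1f%% of users\n", 100.0 * matched / n);
    return 0;
}

A real system runs the same per-record logic in parallel over billions of records spread across many machines; only the scale differs.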
Big Data Systems Components (1)
1. Data sources
✓ One or more data sources like databases, docs, files, IoT devices, images, video etc.
2. Data storage
✓ Data for batch processing operations is typically stored in a distributed file store that can hold high volumes of large files in various formats.
✓ Data can also be stored in key-value stores.
3. Batch processing
✓ Process data files using long-running parallel batch jobs to filter, sort, aggregate or prepare the data for analysis, e.g. search scans on unindexed docs.
✓ Usually these jobs involve reading source files, processing them, and writing the output to new files (a miniature version is sketched after this list).
5. Stream processing
✓ Real-time in-memory filtering, aggregating or preparing of the data for further analysis, e.g. fraud detection logic. The processed stream data is then written to an output sink: files, a database, or an API. These are mainly in-memory systems.
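To make the read-process-write shape of a batch job concrete, a miniature single-task version (file names and the "ERROR" filter are placeholders; a real job runs many such tasks in parallel over a distributed file store):

#include <stdio.h>
#include <string.h>

/* Miniature batch task: read a source file, filter records,
   write the result to a new output file. */
int main(void) {
    FILE *in  = fopen("events.log", "r");   /* placeholder input */
    FILE *out = fopen("errors.out", "w");   /* placeholder output */
    if (!in || !out) { perror("open"); return 1; }

    char line[1024];
    while (fgets(line, sizeof line, in))
        if (strstr(line, "ERROR"))          /* the "filter" step */
            fputs(line, out);

    fclose(in);
    fclose(out);
    return 0;
}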
Technology Ecosystem (showing mostly Apache projects)
(Diagram: layered stack - ingest, storage, resource management, and processing / indexing / workflow layers)
• In-memory processing (for parallel tasks): Spark
• SQL over Hadoop / DWH: Hive
• Machine Learning: SparkMLlib
• NoSQL / key-value stores: HBASE, MongoDB
• Search / indexed data: Solr
• ETL / ingest of streaming data: Flume, Sqoop
• Scripting: Pig
• Scheduler (manage complex processing workflows): Oozie
• Coordination (keep meta-data for parallel tasks): Zookeeper
• Resource management and basic map-reduce: Yarn for Hadoop* nodes
• Storage: HDFS
* nodes run map-reduce jobs (more on this later)  ** we’ll cover technologies in bold
Case Study: Netflix
How Netflix uses Big Data for exponential user growth and retention
(reference: https://www.clickz.com/how-netflix-uses-big-data-content/228201/)
Data collected from 150+ million users
Recommendation System
Big Data System
(Diagram: Netflix big data stack)
• Netflix Genie: overall Hadoop cluster assignment and management for end users
• SQL over Hadoop / DWH: Hive
• Scripting / ETL: Pig
• Resource Management: Yarn for Hadoop nodes
• Storage: Amazon S3
IT Operations Analytics
(Diagram: IT Operations Analytics stack)
• In-memory processing: Spark
• Stream processing: Spark Streaming
• ETL / data ingest: Kafka, Logstash
• Machine Learning: SparkMLlib
• NoSQL: Cassandra
• SQL over Hadoop: Hive
• Scripting: Pig
• Search: Solr
• Coordination: Zookeeper
• Scheduler: Custom
• Resource management and basic map-reduce: Yarn for Hadoop nodes
• Storage: HDFS
Where will you apply this architecture style?
✓ Store and process data in volumes too large for a traditional database
✓ Handle semi-structured data or data with an evolving structure - e.g. demographic data with hundreds of attributes
✓ Transform unstructured data for analysis and reporting
✓ Capture, process, and analyze unbounded streams of data with low latency
✓ Capture, process, and analyze unbounded historical data cost effectively
Big data architecture benefits
• Technology choices
✓ A variety of technology options is available, both open source and from vendors
• Performance through parallelism
✓ Big data solutions take advantage of data or task parallelism, enabling high-performance
solutions that scale to large volumes of data.
• Elastic scale
✓ All of the components in the big data architecture support scale-out provisioning, so that you
can adjust your solution to small or large workloads, and pay only for the resources that you use.
• Flexibility with consistency semantics (more in CAP theorem)
✓ E.g. Cassandra or MongoDB can permit inconsistent reads in exchange for better scale and fault tolerance
• Good cost/performance ratio
✓ Ability to reduce cost at the expense of performance, e.g. long-term data storage on commodity HDFS nodes.
• Interoperability with existing solutions
✓ The components of the big data architecture are also used for IoT processing and enterprise BI
solutions, enabling you to create an integrated solution across data workloads. e.g. Hadoop can
work with data in Amazon S3.
Big data architecture challenges
• Complexity
✓ Big data solutions can be extremely complex, with numerous components to
handle data ingestion from multiple data sources. It can be challenging to build,
test, and troubleshoot big data processes.
• Skillset
✓ Many big data technologies are highly specialized, and use frameworks and
languages that are not typical of more general application architectures. On the
other hand, big data technologies are evolving new APIs that build on more
established languages.
• Technology maturity
✓ Many of the technologies used in big data are evolving. While core Hadoop
technologies such as Hive and Pig have stabilized, emerging technologies such as
Spark introduce extensive changes and enhancements with each new release.
Topics for today
• Motivation
✓ Why do modern Enterprises need to work with data
✓ What is Big Data and data classification
✓ Scaling RDBMS
• What is a Big Data System
✓ Characteristics
✓ Design challenges
• Architecture
✓ High level architecture of Big Data solutions
✓ Technology ecosystem
✓ Case studies
• Why locality of reference and storage organisation matter
Locality of Reference (LoR)
Spatial and temporal locality principles for processing and storage
Context
Levels of storage
Cost of access: Memory vs. Storage vs. Network
Reference: http://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf
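As a rough sense of scale from the referenced talk: an L1 cache reference costs about 0.5 ns, a main-memory reference about 100 ns, a round trip within a datacenter about 500 µs, and a disk seek about 10 ms - each step down the hierarchy is orders of magnitude slower.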
Memory hierarchy – motivation
• A memory hierarchy amortizes cost in computer architecture by combining:
✓ fast (and therefore costly) but small-sized memory with
✓ large-sized but slow (and therefore cheap) memory
• In a Big Data search, recently indexed Solr data can act as a cache for long-term HDFS data (sketched below).
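The Solr-over-HDFS idea in miniature: check a small fast store first and fall back to the slow tier on a miss, caching the result. Everything here (names, sizes, the direct-mapped scheme) is an illustrative assumption:

#include <stdio.h>
#include <string.h>

#define CACHE_SLOTS 8

/* Tiny direct-mapped cache in front of a slow store */
static struct { char key[32]; char val[32]; int used; } cache[CACHE_SLOTS];

static unsigned slot(const char *key) {
    unsigned h = 0;
    while (*key) h = h * 31 + (unsigned char)*key++;
    return h % CACHE_SLOTS;
}

/* Stand-in for the slow tier (e.g. long-term data on HDFS) */
static const char *slow_store_get(const char *key) {
    (void)key;
    return "value-from-long-term-store";
}

static const char *lookup(const char *key) {
    unsigned s = slot(key);
    if (cache[s].used && strcmp(cache[s].key, key) == 0)
        return cache[s].val;                  /* hit: served fast */
    const char *v = slow_store_get(key);      /* miss: go to the slow tier */
    snprintf(cache[s].key, sizeof cache[s].key, "%s", key);
    snprintf(cache[s].val, sizeof cache[s].val, "%s", v);
    cache[s].used = 1;
    return cache[s].val;
}

int main(void) {
    printf("%s\n", lookup("recent-index-entry"));  /* miss, then cached */
    printf("%s\n", lookup("recent-index-entry"));  /* hit */
    return 0;
}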
Memory hierarchy - access
Memory hierarchy - cache hit/miss
Cache performance
• Hit ratio: the fraction of memory references served by the cache - the basic measure of cache performance (computed in the sketch below)
• Cache hit
✓ The CPU refers to memory and finds the data or instruction in the cache
• Cache miss
✓ The desired data or instruction is not in the cache, so the CPU goes to main memory to find it
• One can generalise this to any form of cache, e.g. a movie request from a user to the nearest Netflix content cache
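A worked example of the two numbers above, with illustrative access counts and latencies (1 ns cache, 100 ns main memory; none of these values are from the slides):

#include <stdio.h>

int main(void) {
    double hits = 950, misses = 50;       /* illustrative access counts */
    double t_cache = 1.0, t_mem = 100.0;  /* illustrative latencies in ns */

    double h = hits / (hits + misses);    /* hit ratio */
    /* Average access time: hits are served at cache speed,
       misses pay the main-memory latency. */
    double avg = h * t_cache + (1.0 - h) * t_mem;

    printf("hit ratio = %.2f, average access = %.2f ns\n", h, avg);
    return 0;
}

Even a 95% hit ratio leaves the average dominated by the miss penalty (5.95 ns here), which is why hit ratios must be very close to 1 to hide a slow tier.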
Access time of memories
Data reference empirical evidence
• Programs tend to reuse data and instructions they have used recently.
• Implication:
✓ Predict with reasonable accuracy what instructions and data a
program will use in the near future based on the recent past
Principle of Locality of Reference (LoR)
Memory hierarchy and locality of reference (1)
Memory hierarchy and locality of reference (2)
LoR and data structure choices
• Analysis: sequential access over arrays is faster than over linked lists on many machines, because arrays have very good spatial locality of reference and thus make good use of data caching (see the sketch below).
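A sketch of why the array wins (illustrative code, not from the slides): array elements are contiguous, so every cache line fetched is fully used, while list nodes can be scattered across the heap:

#include <stdio.h>

struct node { int value; struct node *next; };

/* Contiguous layout: one cache line holds many elements and the
   hardware prefetcher streams in the next lines. */
long sum_array(const int *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++) s += a[i];
    return s;
}

/* Pointer chasing: each step can be a fresh cache miss because
   nodes need not be adjacent in memory. */
long sum_list(const struct node *p) {
    long s = 0;
    for (; p; p = p->next) s += p->value;
    return s;
}

int main(void) {
    int a[3] = {1, 2, 3};
    struct node n3 = {3, NULL}, n2 = {2, &n3}, n1 = {1, &n2};
    printf("%ld %ld\n", sum_array(a, 3), sum_list(&n1));
    return 0;
}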
Locality Example: Matrix addition
• Consider N×N matrices and an operation such as the addition of two matrices:
• Elements M[i,j] may be accessed either row by row or column by column
• Option 1: Add columns within each row (Row Major)
for (i=0; i<N; i++)
    for (j=0; j<N; j++)
        M3[i][j] = M2[i][j] + M1[i][j];
Locality Analysis (Column major)
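Option 2, assuming the same row-major array layout C uses for Option 1: swapping the loops walks each column first, so successive accesses are N elements apart and land in different cache lines.

• Option 2: Add rows within each column (Column Major)
for (j=0; j<N; j++)
    for (i=0; i<N; i++)
        M3[i][j] = M2[i][j] + M1[i][j];

With large N this stride-N pattern defeats spatial locality, so the same arithmetic runs noticeably slower than the row-major loop.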
Big Data: Storage organisation matters
Example
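One hedged illustration of the point (the record layout is invented): scanning a single column touches far fewer bytes when that column's values are stored contiguously, which is the idea behind columnar file formats in big data systems.

#include <stdio.h>

#define N 4

/* Row-oriented: all fields of a record are stored together */
struct row { int user_id; int age; double spend; };

int main(void) {
    struct row rows[N] = {{1,25,9.5},{2,34,3.0},{3,41,7.25},{4,30,1.0}};
    /* Column-oriented: one column stored contiguously */
    double spend_col[N] = {9.5, 3.0, 7.25, 1.0};

    /* Row layout: summing one column drags every record's other
       fields through the cache as well. */
    double total_row = 0;
    for (int i = 0; i < N; i++) total_row += rows[i].spend;

    /* Column layout: the same scan reads only the bytes it needs. */
    double total_col = 0;
    for (int i = 0; i < N; i++) total_col += spend_col[i];

    printf("%.2f %.2f\n", total_row, total_col);
    return 0;
}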
Next Session:
Parallel and Distributed Processing