
DSECL ZG 522: Big Data Systems

Session 1: Introduction to Big Data, Locality of Reference

Dr. Anindya Neogi


Associate Professor
[email protected]
Course Outline

• S1: Introduction to Big Data and data locality


• S2: Parallel and Distributed Processing
• S3: Big Data Analytics and Big Data Systems
• S4: Consistency, Availability, Partition tolerance and Data Lifecycle
• S5: Distributed Systems Programming
• S6-S9: Hadoop ecosystem technologies
• S10: NoSQL Databases
• S11: Big Data on Cloud
• S12: Amazon storage services
• S13-16: In-memory and streaming - Spark
Books

T1 Seema Acharya and Subhashini Chellappan. Big Data and Analytics. Wiley India Pvt. Ltd. Second Edition.
R1 DT Editorial Services. Big Data - Black Book. DreamTech Press. 2016.
R2 Kai Hwang, Jack Dongarra, and Geoffrey C. Fox. Distributed and Cloud Computing: From Parallel Processing to the Internet of Things. Morgan Kaufmann. 2011.
AR Additional Reading (as per topic)
Topics for today

• Motivation
✓ Why do modern Enterprises need to work with data
✓ What is Big Data and data classification
✓ Scaling RDBMS
• What is a Big Data System
✓ Characteristics
✓ Design challenges
• Architecture
✓ High level architecture of Big Data solutions
✓ Technology ecosystem
✓ Case studies
• Why locality of reference and storage organisation matter
Example of a data-driven Enterprise: A large online retailer (1)

• What data is collected


✓ Millions of transactions and browsing clicks per day across products, users
✓ Delivery tracking
✓ Reviews on multiple channels - website, social media, customer support
✓ Support emails, logged calls
✓ Ad click and browsing data
✓…
• Data is a mix of metrics, natural language text, logs, events, videos, images etc.

Example of a data-driven Enterprise: A large online retailer (2)

• What is this data used for


✓ User profiling for better shopping experience
✓ Operations efficiency metrics
✓ Improve customer support experience, support training
✓ Demand forecasting
✓ Product marketing
✓…
• Data is the only way to create competitive differentiators, retain
customers and ensure growth

Data volume growth

• Facebook: 500+ TB/day of comments, images, videos etc.
• NYSE: 1 TB/day of trading data
• A jet engine: 20 TB/hour of sensor / log data

Source : What is big data?


Variety of data sources

Source : What is Big Data?


Data classification

• Categories: structured, semi-structured, unstructured
✓ e.g. databases are structured; XML and web pages are semi-structured; images are unstructured
• Structured data is metrics and events that can be put in an RDBMS with a fixed schema
• Semi-structured data is XML, JSON structures, where traditional RDBMS have support with varying efficiency, but which need a new kind of NoSQL database
• New applications produce unstructured data, which could be natural language text and multimedia content
Data usage pattern

• Higher demand now for analysing unstructured data (relative to structured and semi-structured data) to glean insights
• Examples
✓ analysis of social media content for sentiment analysis
✓ analysis of unstructured text content by search engines on the web as well as within the enterprise
✓ analysis of support emails, calls for customer satisfaction
✓ NLP / Conversational AI for answering user questions from backend data
Structured Data

• Data is transformed and stored as per a pre-defined schema
• Traditionally stored in RDBMS
• CRUD operations on records
• ACID semantics (Atomicity, Consistency, Isolation, Durability)
• Fine grain security and authorisation
• Known techniques on scaling RDBMS - more on this later
• Typically used by Systems of Record, e.g. OLTP systems, with strong consistency requirements and read / write workloads

Example schema:

    CREATE TABLE Employee (
      emp_id      int PRIMARY KEY,
      name        varchar(50),
      designation varchar(25),
      salary      int,
      dept_code   int REFERENCES Department(dept_code)  -- FOREIGN KEY
    );
Semi-Structured Data

• No explicit data and schema separation
• Models real life situations better because attributes for every record could be different
• Easy to add new attributes
• XML, JSON structures
• Databases typically support flexible ACID properties, esp. consistency of replicas
• Typically used by Systems of Engagement, e.g. social media

Example JSON record:

    {
      "title": "Sweet fresh strawberry",
      "type": "fruit",
      "description": "Sweet fresh strawberry",
      "image": "1.jpg",
      "weight": 250,
      "expiry": "30/5/2021",
      "price": 29.45,
      "avg_rating": 4,
      "reviews": [
        { "user": "p1", "rating": 2, "review": " ..... " }, ...
      ]
    }
Unstructured Data (1)

• More real-life data


✓ video, voice, text, emails, chats, comments,
reviews, blogs …
• There is some structure that is typically extracted
from the data depending on the use case
✓ image adjustments at pixel structure level
✓ face recognition from video
✓ tagging of features extracted in image / video
✓ annotation of text

Unstructured Data (2)

• What can we do with it ?


✓ Data mining
• Association rule mining, e.g. market basket or affinity analysis
• Regression, e.g. predict dependent variable from independent variables
• Collaborative filtering, e.g. predict a user preference from group
preferences
✓ NLP - e.g. Human to Machine interaction, conversational systems
✓ Text Analytics - e.g. sentiment analysis, search
✓ Noisy text analytics - e.g. spell correction, speech to text

Define Big Data

From patient records to social media
From modelling jobs to fraud detection
From clickstream analysis to sentiment analysis
Isn’t a traditional RDBMS good enough ?
Example Web Analytics Application

• Designing an application to monitor the page hits for a portal
• Every time a user visits a portal page in a browser, the server side keeps track of that visit
• Maintains a simple database table that holds information about each page hit
• If a user visits the same page again, the page hit count is increased by one
• Uses this information for doing analysis of popular pages among the users

Source : Adapted from Big Data by Nathan Marz
Scaling with intermediate layer
Using a queue
• Portal is very popular, with lots of users visiting it
✓ Many users are concurrently visiting the pages of the portal
✓ Every time a page is visited, the database needs to be updated to keep track of this visit
✓ A database write is a heavy operation
✓ Database writes are now a bottleneck!
• Solution
✓ Use an intermediate queue between the web server and the database
✓ The queue will hold messages until the database catches up
✓ Messages will not be lost (a minimal sketch follows)
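A minimal sketch of this decoupling, assuming an in-process blocking queue and a hypothetical incrementHitCount database call (a production system would use a durable message broker such as Kafka so messages survive a crash):

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    public class PageHitBuffer {
        private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

        // Called from web server threads: enqueueing is cheap, so the
        // request completes without waiting for a database write.
        public void recordHit(String pageUrl) throws InterruptedException {
            queue.put(pageUrl);
        }

        // Single background consumer: drains hits at the rate the
        // database can sustain.
        public void consumeLoop() throws InterruptedException {
            while (true) {
                String pageUrl = queue.take();
                incrementHitCount(pageUrl); // hypothetical DB update
            }
        }

        private void incrementHitCount(String pageUrl) {
            // e.g. UPDATE page_hits SET count = count + 1 WHERE url = ?
        }
    }

The web server thread returns as soon as the hit is enqueued; the consumer applies writes at the database's own pace.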
Scaling with Database Partitions
Using Database shards

• Application is too popular
✓ Users are using it very heavily, increasing the load on the application
✓ Maintaining the page view count is becoming difficult even with a queue
• Solution
✓ Use database partitions
✓ Data is divided into partitions (shards) which are hosted on multiple machines
✓ Database writes are parallelized
✓ Scalability increases
✓ But complexity increases too! (see the routing sketch below)
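A minimal sketch of the routing logic the application now has to carry, assuming simple hash-based sharding (ShardRouter and shardFor are illustrative names, not any specific library's API):

    public class ShardRouter {
        private final int numShards;

        public ShardRouter(int numShards) {
            this.numShards = numShards;
        }

        // Deterministically map a page URL to one of numShards databases.
        // floorMod keeps the index non-negative even for negative hash codes.
        public int shardFor(String pageUrl) {
            return Math.floorMod(pageUrl.hashCode(), numShards);
        }
    }

Note that changing numShards remaps almost every key, which is one reason the re-sharding mentioned on the next slide is so painful.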
Issues with RDBMS (1)

• With too many shards some disk is bound to fail
• Fault tolerance needs shard replicas - so more things to manage
• Complex logic to read / write because of the need to locate the right shard - human errors can be devastating
• Keep re-sharding and balancing as data grows or load increases
• What is the consistency semantics of updating replicas? Should a read on a replica be allowed before it is updated?
• Is it optimised when data is written once and read many times, or vice versa?
Issues with RDBMS (2)
(Think of replicas of shards in a social site DB)
• Not all Big Data use cases need strong ACID semantics, esp. Systems of Engagement
✓ ACID becomes a bottleneck with many replicas and many attributes - need to optimise for fast writes and reads with few updates
• A fixed schema is not sufficient: as the application becomes popular, more attributes need to be captured (e.g. a new field applicable only to some products) and DB modelling becomes an issue
✓ Which attributes are used depends on the use case
Issues with RDBMS (3)

• Very wide de-normalized attribute sets, e.g. demographic records with 1000+ attributes for millions of users
• Data layout formats - column or row major - depend on the use case
✓ What if we query only a few columns? Do I need to touch the entire row in the storage layer?
• Expensive to retain and query long term data - need a low cost solution
Topics for today

• Motivation
✓ Why do modern Enterprises need to work with data
✓ What is Big Data and data classification
✓ Scaling RDBMS
• What is a Big Data System
✓ Characteristics
✓ Design challenges
• Architecture
✓ High level architecture of Big Data solutions
✓ Technology ecosystem
✓ Case studies
• Why locality of reference and storage organisation matter
Characteristics of Big Data Systems (1)

• Application does not need to bother about common issues like sharding and replication (storage is replicated / partitioned underneath)
✓ Developers are more focused on application logic rather than data management
• Easier to model data with a flexible schema (key-value, document, graph)
✓ Not necessary that every record has the same set of attributes
• If possible, treat data as immutable
✓ Keep adding timestamped versions of data values: (t1, k), (t2, k), (t3, k), …
✓ Avoid human errors by not destroying a good copy
Characteristics of Big Data Systems (2)

• Application specific consistency models
✓ a reader may read a replica that has not been updated yet, as in the "read preference" options in MongoDB (see the sketch below)
✓ e.g. comments on social media
• Handles high data volume, at a very fast rate, coming from a variety of sources, because immutable writes are faster with flexible consistency models
✓ Keep adding data versions with timestamps
✓ Replica updates can keep happening in the background

Cassandra is a NoSQL database for write-heavy workloads and eventual consistency
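As a concrete illustration of an application-chosen consistency model, here is a minimal sketch using the MongoDB Java driver: reads are directed to a secondary replica when one is available, accepting possibly stale data in exchange for spreading load (the database and collection names are made up for the example):

    import com.mongodb.ReadPreference;
    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.MongoDatabase;
    import org.bson.Document;

    public class ReadPreferenceExample {
        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                // secondaryPreferred: read from a replica if possible, which
                // may return data that has not yet caught up with the primary.
                MongoDatabase db = client.getDatabase("social")
                        .withReadPreference(ReadPreference.secondaryPreferred());
                MongoCollection<Document> comments = db.getCollection("comments");
                System.out.println(comments.find().first());
            }
        }
    }

For social media comments this trade-off is usually acceptable; for a bank balance it would not be.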
Characteristics of Big Data Systems (3)

• Built as distributed and incrementally scalable systems


✓ add new nodes to scale as in a Hadoop cluster
• Options to have cheaper long term data retention
✓ long term data reads can have more latency and can be less
expensive to store on commodity hardware, e.g. Hadoop file
system (HDFS)
• Generalized programming models that work close to the data
✓ e.g. Hadoop map-reduce that runs tasks on data nodes (a small word-count sketch follows)

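A minimal, single-machine sketch of the map-reduce idea for word counting - map each word to a count of 1, then reduce by summing per key. Real Hadoop MapReduce (covered from S6 onwards) runs these phases as distributed tasks on the data nodes; this sketch only illustrates the shape of the computation:

    import java.util.HashMap;
    import java.util.Map;

    public class WordCountSketch {
        // "Map" and "reduce" collapsed into one pass for illustration:
        // emit (word, 1) for each word, then sum the counts per key.
        public static Map<String, Integer> wordCount(Iterable<String> lines) {
            Map<String, Integer> counts = new HashMap<>();
            for (String line : lines) {
                for (String word : line.toLowerCase().split("\\s+")) {
                    if (!word.isEmpty()) {
                        counts.merge(word, 1, Integer::sum); // reduce step
                    }
                }
            }
            return counts;
        }
    }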
Challenges in Big Data Systems (1)

• Latency issues in algorithms and data storage working with large data
sets (S1*-LoR)
• Basic design considerations of Distributed and Parallel systems -
reliability, availability, consistency (S2, S3, S4)
• What data to keep and for how long - depends on analysis use case
• Cleaning / Curation of data (S4-Lifecycle)
• Overall orchestration involving large volumes of data (S9)
• Choose the right technologies from many options, including open
source, to build the Big Data System for the use cases (S6 onwards)

* Si: Topic is discussed in session i


Challenges in Big Data Systems (2)

• Programming models for analysis (S5, S6 - MapReduce, S13 - Spark)
• Scale out for high volume (S6-Hadoop, S10-NoSQL, S13-Spark)
• Search (S10-NoSQL)
• Cloud is the cost effective way long term - but need to host Big
Data outside the Enterprise (S11-Cloud)
• Data privacy and governance
• Skilled coordinated teams to build/maintain Big Data Systems
and analyse data
Topics for today

• Motivation
✓ Why do modern Enterprises need to work with data
✓ What is Big Data and data classification
✓ Scaling RDBMS
• What is a Big Data System
✓ Characteristics
✓ Design challenges
• Architecture
✓ High level architecture of Big Data solutions
✓ Technology ecosystem
✓ Case studies
• Why locality of reference and storage organisation matter
• Summary
Types of Big Data solutions
1. Batch processing of big data sources at rest
✓ Building ML models, statistical aggregates
✓ “What percentage of users in US last year watched shows starring Kevin
Spacey and completed a season within 4 weeks or a movie within 4 hours”
✓ “Predict number of US family users in age 30-40 who will buy a Kelloggs
cereal if they purchase milk”

2. Real-time processing of big data in motion


✓ Fraud detection from real-time financial transaction data
✓ Detect fake news on social media platforms

3. Interactive exploration with ad-hoc queries


✓ Which region and product has least sales growth in last quarter
Big data architecture style

• Designed to handle the ingestion, processing, and analysis of data that is too large or complex
for traditional database systems.

Source : Microsoft Big Data Architecture

Big Data Systems Components (1)

1. Data sources
✓ One or more data sources like
databases, docs, files, IoT
devices, images, video etc.

2. Data Storage
✓ Data for batch processing
operations is typically stored in a
distributed file store that can hold
high volumes of large files in
various formats. 
✓ Data can also be stored in key-
value stores.

e.g. social data, medical images
Big Data Systems Components (2)

3. Batch processing
✓ Process data files using long-
running parallel batch jobs to filter,
sort, aggregate or prepare the data
for analysis.
✓ Usually these jobs involve reading
source files, processing them, and
writing the output to new files.
e.g. search scans on unindexed docs

4. Real-time message ingestion


✓ Capture data from real-time
sources and integrate with stream
processing. Typically these are in-
memory systems with optional
storage backup for resiliency.
e.g. credit card transactions for fraud detection
Big Data Systems Components (3)

5. Stream processing
Real-time in-memory filtering, aggregating or preparing the data for further analysis. The processed stream data is then written to an output sink. These are mainly in-memory systems. Data can be written to files or a database, or integrated with an API.
e.g. fraud detection logic

6. Analytical data store
Real-time or batch processing can be used to prepare the data for further analysis. The processed data is stored in a structured format to be queried using analytical tools. The analytical data store used to serve these queries can be a Kimball-style relational data warehouse or a Big Data warehouse like Hive. There may also be NoSQL stores such as MongoDB and HBase.
e.g. financial transaction history across clients for spend analysis
Big Data Systems Components (4)

7. Analysis and reporting
The goal of most big data solutions is to provide insights into the data through analysis and reporting. These can be various OLAP, search and reporting tools.
e.g. weekly management report

8. Orchestration and ETL
Most big data solutions consist of repeated data processing operations, encapsulated in workflows, that transform source data, move data between multiple sources and sinks, load the processed data into an analytical data store, or push the results straight to a report or dashboard. To automate these workflows, you can use an orchestration technology such as Azure Data Factory, Apache Oozie or Sqoop.
Technology Ecosystem (showing mostly Apache projects)

• In-memory processing: Spark
• Stream processing (processing of streaming data): Storm, Kafka, Spark-S
• NoSQL / key-value stores: HBase, MongoDB
• SQL over Hadoop / DWH: Hive
• Machine Learning: SparkMLlib
• Data ingest / ETL: Flume, Sqoop
• Coordination (keep meta-data for parallel tasks): Zookeeper
• Scripting: Pig
• Search / indexed data: Solr
• Scheduler (manage workflows, map-reduce jobs): Oozie
• Resource management and basic map-reduce: Yarn for Hadoop* nodes
• Storage: HDFS

* nodes run map-reduce jobs (more on this later) ** we'll cover technologies in bold
Case Study: Netflix
How Netflix uses Big Data for exponential user growth and retention
(reference: https://www.clickz.com/how-netflix-uses-big-data-content/228201/)
Data collected from 150+ million users

• Date content watched


• The device on which the content was watched
• How the nature of the content watched varied based on the device
• Searches on its platform
• Portions of content that got re-watched
• Whether content was paused
• User location data
• Time of the day and week in which content was watched and how it
influences the kind of content watched
• Metadata from third parties like Nielsen
• Social media data from Facebook and Twitter

Recommendation System

• Personalized context ranking


• Content promotion influenced by personal interests and not just
what’s popular or needs promotion
• Intelligent prioritization of viewed content that users are expected to
watch / re-watch - never bore the user
• What type of content is likely to be popular, to decide new shows
✓ E.g. "House of Cards" was contracted for 2 seasons without a pilot, because analysis of historical viewership data across regions, by cast and crew, predicted the show would be popular
Big Data System

Netflix Genie: overall Hadoop cluster assignment and management for end users

• SQL over Hadoop: Hive
• Scripting: Pig
• ETL: Pig
• Resource Management: Yarn for Hadoop nodes
• Storage: Amazon S3
• Amazon EMR (Elastic Map Reduce) - Hadoop on Cloud


Case Study: IT Ops
Using Big Data tools and architecture for managing IT
IT Operations Analytics

• IT systems generate large volumes of monitoring, logging and event data
• Can we use this data to proactively look for anomalous patterns and predict an issue?
• Can we localise possible symptoms and possible root causes?
• Can we help an engineer quickly explore the data to confirm the specific root cause?

Real-time streaming analysis of metrics
- in-memory fast compute
- real-time model updates and model lookups
Interactive time-sensitive search and exploration of log and metric data
- older data may take time (has this happened earlier in the last month?)
- but new data should be fast (what happened a few minutes back?)
Big Data Platform

Needs mapped to components:
• Integrate with metric and log sources (ingest): Kafka, Logstash
• Store metrics as key-value pairs (NoSQL): Cassandra
• Detect metric anomalies and normal ranges in real time: Spark streaming
• Build models on metrics, modelling logic for metric dependencies: SparkMLlib, Spark
• Search on short-term and long-term logs: Solr
• SQL over Hadoop: Hive
• Scripting: Pig
• Coordination: Zookeeper
• Scheduler: custom
• Resource management and basic map-reduce: Yarn for Hadoop nodes
• Storage: HDFS
Where will you apply this architecture style

• Consider this architecture style when you need to:

✓ Store and process data in volumes too large for a traditional database
✓ Handle semi-structured data, or data with an evolving structure - e.g. demographic data with hundreds of attributes
✓ Transform unstructured data for analysis and reporting
✓ Capture, process, and analyze unbounded streams of data with low latency
✓ Capture, process, and analyze unbounded historical data cost effectively

Big data architecture benefits
• Technology choices
✓ Variety of technology options in open source and from vendors are available
• Performance through parallelism
✓ Big data solutions take advantage of data or task parallelism, enabling high-performance
solutions that scale to large volumes of data.
• Elastic scale
✓ All of the components in the big data architecture support scale-out provisioning, so that you
can adjust your solution to small or large workloads, and pay only for the resources that you use.
• Flexibility with consistency semantics (more in CAP theorem)
✓ E.g. Cassandra or MongoDB can make inconsistent reads for better scale and fault tolerance
• Good cost performance ratio
✓ Ability to reduce cost at the expense of performance. E.g. long term data storage in commodity
HDFS nodes.
• Interoperability with existing solutions
✓ The components of the big data architecture are also used for IoT processing and enterprise BI
solutions, enabling you to create an integrated solution across data workloads. e.g. Hadoop can
work with data in Amazon S3.
Big data architecture challenges

• Complexity
✓ Big data solutions can be extremely complex, with numerous components to
handle data ingestion from multiple data sources. It can be challenging to build,
test, and troubleshoot big data processes.

• Skillset
✓ Many big data technologies are highly specialized, and use frameworks and
languages that are not typical of more general application architectures. On the
other hand, big data technologies are evolving new APIs that build on more
established languages.

• Technology maturity
✓ Many of the technologies used in big data are evolving. While core Hadoop
technologies such as Hive and Pig have stabilized, emerging technologies such as
Spark introduce extensive changes and enhancements with each new release.
Topics for today

• Motivation
✓ Why do modern Enterprises need to work with data
✓ What is Big Data and data classification
✓ Scaling RDBMS
• What is a Big Data System
✓ Characteristics
✓ Design challenges
• Architecture
✓ High level architecture of Big Data solutions
✓ Technology ecosystem
✓ Case studies
• Why locality of reference and storage organisation matter
Locality of Reference (LoR)
Spatial and temporal locality principles for processing and storage
Context

• Big Data Systems need to move large volumes of data
✓ e.g. browse and stream content on Netflix, or answer analytical queries on e-commerce transaction data in Amazon

• Is there a way to reduce the latency of data requests using locality of reference in the data and the request origin?
Levels of storage

Data Location – Memory vs Storage vs Network

• Computational data is stored in primary memory, i.e. main memory
• Persistent data is stored in secondary memory, i.e. storage
• Remote data is accessed from another computer's memory or storage over the network

Cost of access: Memory vs. Storage vs. Network

Reference: http://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf
Memory hierarchy – motivation
• A memory hierarchy amortizes cost in computer architecture:
✓ fast (and therefore costly) but small-sized memory, backed by
✓ large-sized but slow (and therefore cheap) memory
• In a Big Data search, recently indexed Solr data can act as a cache for long term HDFS data.
Memory hierarchy - access

Memory hierarchy - cache hit/miss

Cache performance
Hit Ratio - The performance of the cache

• Cache hit
✓ When the CPU refers to memory and finds the data or instruction within the cache memory
• Cache miss
✓ If the desired data or instruction is not found in the cache memory, the CPU refers to the main memory to find that data or instruction

Hit + Miss = Total CPU References
Hit Ratio h = Hit / ( Hit + Miss )

• One can generalise this to any form of cache, e.g. a movie request from a user to the nearest Netflix content cache
Access time of memories

• Average access time of any memory system consists of two levels:
✓ Cache Memory
✓ Main Memory

• If Tc is the time to access cache memory, Tm is the time to access main memory, and h is the cache hit ratio, then the average time to access memory is

Tavg = h * Tc + ( 1 - h ) * ( Tm + Tc )
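A worked example with assumed values: if Tc = 10 ns, Tm = 100 ns and h = 0.9, then

Tavg = 0.9 * 10 + 0.1 * ( 100 + 10 ) = 20 ns

so the hierarchy is more than five times faster than always going to main memory (Tm + Tc = 110 ns), even though 10% of the accesses still pay the full miss penalty.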
Data reference empirical evidence

• Programs tend to reuse data and instructions they have used recently.

• 90/10 rule comes from empirical observation:


"A program spends 90% of its time in 10% of its code“

• Implication:
✓ Predict with reasonable accuracy what instructions and data a
program will use in the near future based on the recent past

Principle of Locality of Reference (LoR)

1. The locus of data access - and hence that of memory references - is small at any point during program execution
2. It is the tendency of a processor to access the same set of memory locations repetitively over a short period of time
3. Locality is a type of predictable behavior that occurs in computer systems
4. Systems that exhibit strong locality of reference are great candidates for performance optimization through techniques such as caching, prefetching for memory, and advanced branch predictors at the pipelining stage of a processor core
Locality of Reference - Temporal locality

• Data that is accessed (at a point in program execution) is likely to be accessed again in the near future:
i.e. data is likely to be repeatedly accessed in a short span of time during execution

• Examples
1. Instructions in the body of a loop
2. Parameters / local variables of a function / procedure
3. Data (or a variable) that is computed iteratively, e.g. a cumulative sum or product
4. Another user on Netflix in the same region will watch the same episode soon
5. A recent social media post will be viewed soon by other users

    void swap(int x, int y)
    {
      int t = x;   /* locals like t are reused within a short span: temporal locality */
      x = y;
      y = t;
    }
Locality of Reference - Spatial locality

• Data accessed (at a point in program execution) is likely located adjacent to data that is to be accessed in the near future:
✓ data accessed in a short span during execution is likely to be within a small region (in memory)

• Examples
1. A linear sequence of instructions
2. Elements of an array (accessed sequentially)
3. Another user's query will reuse part of the data file brought in for the current user's query

    int sum_array_rows(int marks[8]) {
      int i, sum = 0;
      for (i = 0; i < 8; i++)      /* sequential accesses stay within the same cache lines */
        sum = sum + marks[i];
      return sum;
    }
Memory hierarchy and locality of reference (1)

• A memory hierarchy is effective only due to the locality exhibited by programs (and the data they access)
• The longer a program runs, the larger the locus of its data accesses
• Need cache replacement strategies, because simply increasing cache size would increase access time and cost
✓ Least Recently Used (LRU) is a typical cache entry replacement strategy
• LRU is an example of exploiting temporal locality
Memory hierarchy and locality of reference (2)

• LOR leads to memory hierarchy at two main interface levels:


✓ Processor - Main memory
• Introduction of caches
• Unit of data transfer is blocks (# of cache lines)
✓ Main memory - Secondary memory (storage)
• Virtual memory (paging systems)
• Unit of data transfer is pages (page size is 4 or 8 KB)

• Fetching larger chunks of data enables spatial locality

LoR and data structure choices

• In Java, the following 2 classes are available in the util package
✓ LinkedList<Integer> QuizMarks - a dynamic list of objects
✓ ArrayList<Integer> QuizMarks - a dynamic array of objects
✓ Which is the better data structure for sequential access performance?
• Test
✓ Append a single element in a loop N times to the end of a LinkedList. Repeat with an ArrayList.
✓ Take the average over 100 runs.
✓ The time complexity of the operation on both collections is the same.

• Which one works faster?

https://dzone.com/articles/performance-of-array-vs-linked-list-on-modern-comp#:~:text=But%20when%20we%20need%20to,have%20better%20performance%20than%20array.
LoR and data structure choices (2)

• Analysis - sequential access in arrays is faster than in linked lists on many machines, because arrays have very good spatial locality of reference and thus make good use of data caching (see the timing sketch below).
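A minimal timing sketch of the sequential-access comparison (illustrative only; a careful benchmark would use a harness such as JMH and discard JIT warm-up runs):

    import java.util.ArrayList;
    import java.util.LinkedList;
    import java.util.List;

    public class ListLocalityBench {
        // Sequential traversal: ArrayList walks contiguous memory,
        // LinkedList chases node pointers scattered across the heap.
        static long sum(List<Integer> list) {
            long s = 0;
            for (int v : list) s += v;
            return s;
        }

        public static void main(String[] args) {
            final int N = 1_000_000;
            List<Integer> arrayList = new ArrayList<>();
            List<Integer> linkedList = new LinkedList<>();
            for (int i = 0; i < N; i++) { arrayList.add(i); linkedList.add(i); }

            long t0 = System.nanoTime();
            sum(arrayList);
            long t1 = System.nanoTime();
            sum(linkedList);
            long t2 = System.nanoTime();
            System.out.printf("ArrayList: %.1f ms, LinkedList: %.1f ms%n",
                    (t1 - t0) / 1e6, (t2 - t1) / 1e6);
        }
    }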
Locality Example : Matrices addition
• Consider matrices and an operation such as the addition of two matrices
• Elements M[i][j] may be accessed either row by row or column by column

• Option 1: Add columns within each row (row-major traversal)

    for (i = 0; i < N; i++)
      for (j = 0; j < N; j++)
        M3[i][j] = M2[i][j] + M1[i][j];

• Option 2: Add rows within each column (column-major traversal)

    for (j = 0; j < N; j++)
      for (i = 0; i < N; i++)
        M3[i][j] = M2[i][j] + M1[i][j];

Both options have the same time complexity. Why is one faster?
Locality Analysis (Row major)

Assuming row-major storage and a cache line holding 4 elements (W[0]..W[15] are consecutive memory words), each row starts with one miss that loads the line, and the next three accesses hit:

M[I][J]   J=0         J=1         J=2         J=3
I=0       W[0]  miss  W[1]  hit   W[2]  hit   W[3]  hit
I=1       W[4]  miss  W[5]  hit   W[6]  hit   W[7]  hit
I=2       W[8]  miss  W[9]  hit   W[10] hit   W[11] hit
I=3       W[12] miss  W[13] hit   W[14] hit   W[15] hit
Locality Analysis (Column major)

With the same layout, a column-by-column traversal touches a different cache line on every access; for matrices larger than the cache, each line is evicted before it can be reused, so every access misses:

M[I][J]   J=0         J=1         J=2         J=3
I=0       W[0]  miss  W[1]  miss  W[2]  miss  W[3]  miss
I=1       W[4]  miss  W[5]  miss  W[6]  miss  W[7]  miss
I=2       W[8]  miss  W[9]  miss  W[10] miss  W[11] miss
I=3       W[12] miss  W[13] miss  W[14] miss  W[15] miss
Big Data: Storage organisation matters

Example

• We need to build a prediction model of sales for various regions
• There are many attributes for a region and "total_sales" is the metric used
• Suppose the database is "columnar" (column major data organisation)
✓ Will exhibit high spatial locality and a high hit rate, because data block fetches will bring in blocks of the total_sales column and not other unnecessary columns
✓ Will improve the speed of the modelling logic
• Columnar storage is common in most Big Data Systems to run analysis and queries that focus on specific attributes at a time, for searching, aggregating, modelling etc. (a small sketch follows)
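A minimal sketch of why the layout matters, modelling a table as Java arrays (illustrative, not any particular engine's on-disk format): summing total_sales from a row-oriented layout strides over every attribute of every record, while the columnar layout scans one contiguous array.

    public class ColumnarScan {
        // Row-oriented: rows[r] holds all attributes of record r;
        // reading one attribute still pulls whole rows through the cache.
        static double sumRowOriented(double[][] rows, int totalSalesCol) {
            double s = 0;
            for (double[] row : rows) s += row[totalSalesCol]; // strided access
            return s;
        }

        // Column-oriented: one contiguous array per attribute;
        // a scan of total_sales touches only that column's blocks.
        static double sumColumnar(double[] totalSales) {
            double s = 0;
            for (double v : totalSales) s += v; // sequential access
            return s;
        }
    }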
Summary

• Why modern Enterprises and new age applications are data-centric


• Challenges with existing data systems
• Advantages and challenges with Big Data systems
• High level architecture and technology ecosystem
• Some real applications using the tech stack
• Data locality of reference and why it is useful for Big Data Systems

Next Session:
Parallel and Distributed Processing
