Lecture 10


Module 4. Utilize Big Data Storage and Processing Techniques.
Introduce Hadoop.
Hadoop is an open-source framework for large-scale data storage and data
processing that is compatible with commodity hardware. The Hadoop
framework has established itself as a platform for contemporary Big Data
solutions. It can be used as an ETL engine or as an analytics engine for
processing large amounts of structured, semi-structured and unstructured data.
Figure 6.3 illustrates some of Hadoop’s features.

Figure 6.3 Hadoop is a versatile framework that provides both processing and storage capabilities.
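Hadoop's processing side is often illustrated with a word count. Below is a minimal sketch in the style of Hadoop Streaming, which lets scripts act as the map and reduce stages by reading stdin and writing stdout. The script names (mapper.py, reducer.py) and the paths in the usage note are illustrative assumptions, not part of the lecture material.

# mapper.py -- a minimal Hadoop Streaming mapper (illustrative sketch).
# Hadoop feeds input lines on stdin; we emit tab-separated (word, 1) pairs.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

# reducer.py -- a minimal Hadoop Streaming reducer (illustrative sketch).
# Hadoop sorts mapper output by key, so all counts for a word arrive together.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t", 1)
    if word != current_word and current_word is not None:
        print(f"{current_word}\t{count}")  # flush the finished word
        count = 0
    current_word = word
    count += int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")

A job would then be submitted with the hadoop-streaming jar, roughly: hadoop jar hadoop-streaming.jar -input /data/in -output /data/out -mapper mapper.py -reducer reducer.py (paths assumed). The same scripts can be smoke-tested locally with: cat sample.txt | python3 mapper.py | sort | python3 reducer.py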

Process Data in Batch and Real-time Mode


A processing workload in Big Data is defined as the amount and nature of data
that is processed within a certain amount of time. Workloads are usually divided
into two types:
• batch
• transactional

Batch
Batch processing, also known as offline processing, involves processing data in
batches and usually imposes delays, which in turn results in high-latency
responses. Batch workloads typically involve large quantities of data with
sequential read/writes and comprise groups of read or write queries.

Queries can be complex and involve multiple joins. OLAP systems commonly
process workloads in batches. Strategic BI and analytics are batch-oriented as
they are highly read-intensive tasks involving large volumes of data. As shown
in Figure 6.4, a batch workload comprises grouped read/writes that have a large
data footprint and may contain complex joins and provide high-latency
responses.

Figure 6.4 A batch workload can include grouped read/writes to INSERT, SELECT, UPDATE and DELETE.
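As an illustration, the sketch below uses Python's built-in sqlite3 module as a stand-in for an analytical store; the sales and products tables and their contents are invented for the example. It shows the batch pattern described above: one large, sequential read query with a join and aggregation, rather than many small interactive queries.

import sqlite3

# Stand-in warehouse: in a real OLAP system this would be a large fact table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE sales (product_id INTEGER, amount REAL, sold_on TEXT);
""")
conn.executemany("INSERT INTO products VALUES (?, ?)",
                 [(1, "books"), (2, "games")])
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [(1, 9.99, "2024-01-02"), (2, 59.99, "2024-01-02"),
                  (1, 14.99, "2024-01-03")])

# A batch-style read: a join plus aggregation over the full dataset,
# issued as one sequential scan rather than many small point queries.
for row in conn.execute("""
        SELECT p.category, COUNT(*) AS orders, SUM(s.amount) AS revenue
        FROM sales s JOIN products p ON p.id = s.product_id
        GROUP BY p.category
        ORDER BY revenue DESC"""):
    print(row)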

Transactional
Transactional processing is also known as online processing. Transactional
workload processing follows an approach whereby data is processed
interactively without delay, resulting in low-latency responses. Transactional workloads involve small amounts of data with random reads and writes.

OLTP and operational systems, which are generally write-intensive, fall within
this category. Although these workloads contain a mix of read/write queries,
they are generally more write-intensive than read-intensive.
Transactional workloads comprise random reads/writes that involve fewer joins
than business intelligence and reporting workloads. Given their online nature
and operational significance to the enterprise, they require low-latency
responses with a smaller data footprint, as shown in Figure 6.5.

Figure 6.5 Transactional workloads have few joins and lower latency responses
than batch workloads.
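By contrast, the sketch below (again using sqlite3 purely as a stand-in, with an invented accounts table) shows the transactional pattern: a small atomic unit of work touching individual rows, followed by a low-latency read by primary key.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100.0), (2, 50.0)])

# An OLTP-style unit of work: two small random writes touching single rows.
# sqlite3's "with conn:" commits on success and rolls back on exception,
# so either both balances change or neither does.
with conn:
    conn.execute("UPDATE accounts SET balance = balance - 25 WHERE id = 1")
    conn.execute("UPDATE accounts SET balance = balance + 25 WHERE id = 2")

# A small random read by primary key -- the typical low-latency access pattern.
print(conn.execute("SELECT balance FROM accounts WHERE id = 2").fetchone())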

Utilize On-Disk Storage Devices


On-disk storage generally utilizes low-cost hard-disk drives for long-term
storage. On-disk storage can be implemented via a distributed file system or a
database as shown in Figure 7.1.

Figure 7.1 On-disk storage can be implemented with a distributed file system or
a database.
Distributed File Systems
Distributed file systems, like any file system, are agnostic to the data being
stored and therefore support schema-less data storage. In general, a distributed
file system storage device provides out-of-the-box redundancy and high availability
by copying data to multiple locations via replication.
A storage device that is implemented with a distributed file system provides
simple, fast-access data storage that is capable of storing large datasets that are
non-relational in nature, such as semi-structured and unstructured data.
Although based on straightforward file locking mechanisms for concurrency
control, it provides fast read/write capability, which addresses the velocity
characteristic of Big Data.

A distributed file system is not ideal for datasets comprising a large number of
small files as this creates excessive disk-seek activity, slowing down the overall
data access. There is also more overhead involved in processing multiple
smaller files, as dedicated processes are generally spawned by the processing
engine at runtime to process each file before the results are synchronized across the cluster.

Due to these limitations, distributed file systems work best with fewer but larger
files accessed in a sequential manner. Multiple smaller files are generally
combined into a single file to enable optimum storage and processing. This
allows distributed file systems to achieve increased performance when data
must be accessed in streaming mode with no random reads and writes (Figure
7.2).

Figure 7.2 A distributed file system accessing data in streaming mode with no
random reads and writes.
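A simple way to apply this advice is to consolidate small files before loading them into the distributed file system. The sketch below does this on a local staging directory; the paths and the .log extension are assumptions for the example.

from pathlib import Path

# Combine many small log files into one large file before loading it into a
# distributed file system, so the cluster reads one sequential stream instead
# of spawning a process per tiny file. Paths are illustrative.
small_files = sorted(Path("logs/incoming").glob("*.log"))

with open("logs/combined.log", "wb") as merged:
    for path in small_files:
        merged.write(path.read_bytes())
        merged.write(b"\n")  # keep a record boundary between source files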

A distributed file system storage device is suitable when large datasets of raw
data are to be stored or when archiving of datasets is required. In addition, it
provides an inexpensive storage option for storing large amounts of data over a
long period of time that needs to remain online. This is because more disks can
simply be added to the cluster without needing to offload the data to offline data
storage, such as tapes. It should be noted that distributed file systems do not
provide the ability to search the contents of files as a standard out-of-the-box
capability.

RDBMS Databases
Relational database management systems (RDBMSs) are good for handling
transactional workloads involving small amounts of data with random
read/write properties. RDBMSs are generally restricted to a single node. For this
reason, RDBMSs do not provide out-of-the-box redundancy and fault tolerance.

To handle large volumes of data arriving at a fast pace, relational databases generally need to scale. RDBMSs employ vertical scaling rather than horizontal scaling, a strategy that is more costly and disruptive. This makes
RDBMSs less than ideal for long-term storage of data that accumulates over
time. Note that some relational databases are capable of being run on clusters
(Figure 7.3). However, these database clusters still use shared storage that can
act as a single point of failure.

Figure 7.3 A clustered relational database uses a shared storage architecture, which is a potential single point of failure that affects the availability of the database.
Relational databases need to be manually sharded, mostly using application
logic. This means that the application logic needs to know which shard to query
in order to get the required data. This further complicates data processing when
data from multiple shards is required.

The following steps are shown in Figure 7.4:


1. A user writes a record (id = 2).
2. The application logic determines which shard it should be written to.
3. The record is sent to the shard determined by the application logic.
4. The user reads a record (id = 4), and the application logic determines which
shard contains the data.
5. The data is read and returned to the application.
6. The application then returns the record to the user.

Figure 7.4 A relational database is manually sharded using application logic.
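The routing logic in these steps can be expressed in a few lines of application code. The sketch below models two shards as separate in-memory SQLite databases and uses an assumed even/odd rule on the record id; a real deployment would connect to independent database servers and choose its own shard key.

import sqlite3

# Two shards, modeled here as separate SQLite databases (illustrative setup).
shards = {
    "shard_a": sqlite3.connect(":memory:"),
    "shard_b": sqlite3.connect(":memory:"),
}
for conn in shards.values():
    conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, payload TEXT)")

def shard_for(record_id: int) -> sqlite3.Connection:
    # The application logic, not the database, decides where a row lives.
    # Here: even ids go to shard_a, odd ids to shard_b (an assumed rule).
    return shards["shard_a"] if record_id % 2 == 0 else shards["shard_b"]

def write_record(record_id: int, payload: str) -> None:
    conn = shard_for(record_id)          # step 2: pick the shard
    with conn:                           # step 3: send the write to it
        conn.execute("INSERT INTO records VALUES (?, ?)", (record_id, payload))

def read_record(record_id: int):
    conn = shard_for(record_id)          # step 4: locate the shard
    return conn.execute("SELECT payload FROM records WHERE id = ?",
                        (record_id,)).fetchone()  # steps 5-6: read and return

write_record(2, "hello")                 # step 1: user writes record id = 2
print(read_record(2))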

The following steps are shown in Figure 7.5:


1. A user requests multiple records (id = 1, 3) and the application logic is used
to determine which shards need to be read.
2. The application logic determines that both Shard A and Shard B need to be read.
3. The data is read and joined by the application.
4. Finally, the data is returned to the user.
Figure 7.5 An example of the use of the application logic to join data retrieved
from multiple shards.
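A corresponding sketch of the multi-shard read follows. It is self-contained but mirrors the setup above; here the requested rows happen to live on different shards, so the application must query both and merge the results itself.

import sqlite3

# Two shards as in the previous sketch (illustrative in-memory stand-ins).
shards = {
    "shard_a": sqlite3.connect(":memory:"),
    "shard_b": sqlite3.connect(":memory:"),
}
for conn in shards.values():
    conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, payload TEXT)")
shards["shard_a"].execute("INSERT INTO records VALUES (1, 'first')")
shards["shard_b"].execute("INSERT INTO records VALUES (3, 'third')")

def read_many(record_ids):
    # Steps 1-2: the application decides that several shards may hold the
    # requested ids, so it queries each candidate shard.
    placeholders = ",".join("?" * len(record_ids))
    results = []
    for conn in shards.values():
        rows = conn.execute(
            f"SELECT id, payload FROM records WHERE id IN ({placeholders})",
            record_ids).fetchall()
        results.extend(rows)      # step 3: join/merge in the application
    return sorted(results)        # step 4: hand the combined rows back

print(read_many([1, 3]))          # -> [(1, 'first'), (3, 'third')]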

Relational databases generally require data to adhere to a schema. As a result,


storage of semi-structured and unstructured data whose schemas are non-
relational is not directly supported. Furthermore, with a relational database, schema conformance is validated at the time of data insert or update by checking the data against the constraints of the schema, as sketched below.
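The sketch below demonstrates this behavior with sqlite3 and an invented readings table: a row that conforms to the schema is accepted, while a row that violates a CHECK constraint is rejected at insert time, which is exactly where the validation cost is paid.

import sqlite3

conn = sqlite3.connect(":memory:")
# The schema carries the constraints the database must check on every write.
conn.execute("""
    CREATE TABLE readings (
        sensor_id INTEGER NOT NULL,
        value     REAL    NOT NULL CHECK (value >= 0)
    )""")

conn.execute("INSERT INTO readings VALUES (1, 21.5)")    # conforms: accepted

try:
    conn.execute("INSERT INTO readings VALUES (1, -4)")  # violates the CHECK
except sqlite3.IntegrityError as err:
    print("rejected at insert time:", err)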

This validation introduces overhead that creates latency, making relational databases a less than ideal choice for storing high-velocity data that needs a highly available database storage device with fast data write capability.
As a result of its shortcomings, a traditional RDBMS is generally not useful as
the primary storage device in a Big Data solution environment.
