Lecture 10


Module 4. Utilize Big Data Storage and Processing Techniques.
Introduce Hadoop.
Hadoop is an open-source framework for large-scale data storage and data
processing that is compatible with commodity hardware. The Hadoop
framework has established itself as a platform for contemporary Big Data
solutions. It can be used as an ETL engine or as an analytics engine for
processing large amounts of structured, semi-structured and unstructured data.
Figure 6.3 illustrates some of Hadoop’s features.

Figure 6.3 Hadoop is a versatile framework that provides both processing and storage capabilities.
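Hadoop's processing side is often illustrated with a word count. Below is a minimal sketch in the style of Hadoop Streaming, which lets scripts act as the map and reduce stages by reading stdin and writing stdout. The script names (mapper.py, reducer.py) and the paths in the usage note are illustrative assumptions, not part of the lecture material.

# mapper.py -- a minimal Hadoop Streaming mapper (illustrative sketch).
# Hadoop feeds input lines on stdin; we emit tab-separated (word, 1) pairs.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

# reducer.py -- a minimal Hadoop Streaming reducer (illustrative sketch).
# Hadoop sorts mapper output by key, so all counts for a word arrive together.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t", 1)
    if word != current_word and current_word is not None:
        print(f"{current_word}\t{count}")  # flush the finished word
        count = 0
    current_word = word
    count += int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")

A job would then be submitted with the hadoop-streaming jar, roughly: hadoop jar hadoop-streaming.jar -input /data/in -output /data/out -mapper mapper.py -reducer reducer.py (paths assumed). The same scripts can be smoke-tested locally with: cat sample.txt | python3 mapper.py | sort | python3 reducer.py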

Process Data in Batch and Real-time Mode


A processing workload in Big Data is defined as the amount and nature of data
that is processed within a certain amount of time. Workloads are usually divided
into two types:
• batch
• transactional

Batch
Batch processing, also known as offline processing, involves processing data in
batches and usually imposes delays, which in turn results in high-latency
responses. Batch workloads typically involve large quantities of data with
sequential read/writes and comprise groups of read or write queries.

Queries can be complex and involve multiple joins. OLAP systems commonly
process workloads in batches. Strategic BI and analytics are batch-oriented as
they are highly read-intensive tasks involving large volumes of data. As shown
in Figure 6.4, a batch workload comprises grouped read/writes that have a large
data footprint and may contain complex joins and provide high-latency
responses.

Figure 6.4 A batch workload can include grouped read/writes to INSERT, SELECT, UPDATE and DELETE.
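As an illustration, the sketch below uses Python's built-in sqlite3 module as a stand-in for an analytical store; the sales and products tables and their contents are invented for the example. It shows the batch pattern described above: one large, sequential read query with a join and aggregation, rather than many small interactive queries.

import sqlite3

# Stand-in warehouse: in a real OLAP system this would be a large fact table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE sales (product_id INTEGER, amount REAL, sold_on TEXT);
""")
conn.executemany("INSERT INTO products VALUES (?, ?)",
                 [(1, "books"), (2, "games")])
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [(1, 9.99, "2024-01-02"), (2, 59.99, "2024-01-02"),
                  (1, 14.99, "2024-01-03")])

# A batch-style read: a join plus aggregation over the full dataset,
# issued as one sequential scan rather than many small point queries.
for row in conn.execute("""
        SELECT p.category, COUNT(*) AS orders, SUM(s.amount) AS revenue
        FROM sales s JOIN products p ON p.id = s.product_id
        GROUP BY p.category
        ORDER BY revenue DESC"""):
    print(row)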

Transactional
Transactional processing is also known as online processing. Transactional
workload processing follows an approach whereby data is processed
interactively without delay, resulting in low-latency responses. Transactional workloads involve small amounts of data with random reads and writes.

OLTP and operational systems, which are generally write-intensive, fall within
this category. Although these workloads contain a mix of read/write queries,
they are generally more write-intensive than read-intensive.
Transactional workloads comprise random reads/writes that involve fewer joins
than business intelligence and reporting workloads. Given their online nature
and operational significance to the enterprise, they require low-latency
responses with a smaller data footprint, as shown in Figure 6.5.

Figure 6.5 Transactional workloads have few joins and lower latency responses
than batch workloads.
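By contrast, the sketch below (again using sqlite3 purely as a stand-in, with an invented accounts table) shows the transactional pattern: a small atomic unit of work touching individual rows, followed by a low-latency read by primary key.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100.0), (2, 50.0)])

# An OLTP-style unit of work: two small random writes touching single rows.
# sqlite3's "with conn:" commits on success and rolls back on exception,
# so either both balances change or neither does.
with conn:
    conn.execute("UPDATE accounts SET balance = balance - 25 WHERE id = 1")
    conn.execute("UPDATE accounts SET balance = balance + 25 WHERE id = 2")

# A small random read by primary key -- the typical low-latency access pattern.
print(conn.execute("SELECT balance FROM accounts WHERE id = 2").fetchone())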

Utilize On-Disk Storage Devices


On-disk storage generally utilizes low-cost hard-disk drives for long-term
storage. On-disk storage can be implemented via a distributed file system or a
database as shown in Figure 7.1.

Figure 7.1 On-disk storage can be implemented with a distributed file system or
a database.
Distributed File Systems
Distributed file systems, like any file system, are agnostic to the data being
stored and therefore support schema-less data storage. In general, a distributed
file system storage device provides out-of-the-box redundancy and high availability
by copying data to multiple locations via replication.
A storage device that is implemented with a distributed file system provides
simple, fast-access data storage that is capable of storing large datasets that are
non-relational in nature, such as semi-structured and unstructured data.
Although based on straightforward file locking mechanisms for concurrency
control, it provides fast read/write capability, which addresses the velocity
characteristic of Big Data.

A distributed file system is not ideal for datasets comprising a large number of
small files as this creates excessive disk-seek activity, slowing down the overall
data access. There is also more overhead involved in processing multiple
smaller files, as dedicated processes are generally spawned by the processing
engine at runtime to process each file before the results are synchronized across the cluster.

Due to these limitations, distributed file systems work best with fewer but larger
files accessed in a sequential manner. Multiple smaller files are generally
combined into a single file to enable optimum storage and processing. This
allows distributed file systems to achieve increased performance when data
must be accessed in streaming mode with no random reads and writes (Figure
7.2).

Figure 7.2 A distributed file system accessing data in streaming mode with no
random reads and writes.
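A simple way to apply this advice is to consolidate small files before loading them into the distributed file system. The sketch below does this on a local staging directory; the paths and the .log extension are assumptions for the example.

from pathlib import Path

# Combine many small log files into one large file before loading it into a
# distributed file system, so the cluster reads one sequential stream instead
# of spawning a process per tiny file. Paths are illustrative.
small_files = sorted(Path("logs/incoming").glob("*.log"))

with open("logs/combined.log", "wb") as merged:
    for path in small_files:
        merged.write(path.read_bytes())
        merged.write(b"\n")  # keep a record boundary between source files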

A distributed file system storage device is suitable when large datasets of raw
data are to be stored or when archiving of datasets is required. In addition, it
provides an inexpensive storage option for storing large amounts of data over a
long period of time that needs to remain online. This is because more disks can
simply be added to the cluster without needing to offload the data to offline data
storage, such as tapes. It should be noted that distributed file systems do not
provide the ability to search the contents of files as a standard out-of-the-box
capability.

RDBMS Databases
Relational database management systems (RDBMSs) are good for handling
transactional workloads involving small amounts of data with random
read/write properties. RDBMSs are generally restricted to a single node. For this
reason, RDBMSs do not provide out-of-the-box redundancy and fault tolerance.

To handle large volumes of data arriving at a fast pace, relational databases generally need to scale. RDBMSs employ vertical scaling rather than horizontal scaling, a strategy that is more costly and disruptive. This makes
RDBMSs less than ideal for long-term storage of data that accumulates over
time. Note that some relational databases are capable of being run on clusters
(Figure 7.3). However, these database clusters still use shared storage that can
act as a single point of failure.

Figure 7.3 A clustered relational database uses a shared storage architecture, which is a potential single point of failure that affects the availability of the database.
Relational databases need to be manually sharded, mostly using application
logic. This means that the application logic needs to know which shard to query
in order to get the required data. This further complicates data processing when
data from multiple shards is required.

The following steps are shown in Figure 7.4:


1. A user writes a record (id = 2).
2. The application logic determines which shard it should be written to.
3. The record is sent to the shard determined by the application logic.
4. The user reads a record (id = 4), and the application logic determines which
shard contains the data.
5. The data is read and returned to the application.
6. The application then returns the record to the user.

Figure 7.4 A relational database is manually sharded using application logic.
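The routing logic in these steps can be expressed in a few lines of application code. The sketch below models two shards as separate in-memory SQLite databases and uses an assumed even/odd rule on the record id; a real deployment would connect to independent database servers and choose its own shard key.

import sqlite3

# Two shards, modeled here as separate SQLite databases (illustrative setup).
shards = {
    "shard_a": sqlite3.connect(":memory:"),
    "shard_b": sqlite3.connect(":memory:"),
}
for conn in shards.values():
    conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, payload TEXT)")

def shard_for(record_id: int) -> sqlite3.Connection:
    # The application logic, not the database, decides where a row lives.
    # Here: even ids go to shard_a, odd ids to shard_b (an assumed rule).
    return shards["shard_a"] if record_id % 2 == 0 else shards["shard_b"]

def write_record(record_id: int, payload: str) -> None:
    conn = shard_for(record_id)          # step 2: pick the shard
    with conn:                           # step 3: send the write to it
        conn.execute("INSERT INTO records VALUES (?, ?)", (record_id, payload))

def read_record(record_id: int):
    conn = shard_for(record_id)          # step 4: locate the shard
    return conn.execute("SELECT payload FROM records WHERE id = ?",
                        (record_id,)).fetchone()  # steps 5-6: read and return

write_record(2, "hello")                 # step 1: user writes record id = 2
print(read_record(2))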

The following steps are shown in Figure 7.5:


1. A user requests multiple records (id = 1, 3) and the application logic is used
to determine which shards need to be read.
2. The application logic determines that both Shard A and Shard B need to be read.
3. The data is read and joined by the application.
4. Finally, the data is returned to the user.
Figure 7.5 An example of the use of the application logic to join data retrieved
from multiple shards.
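A corresponding sketch of the multi-shard read follows. It is self-contained but mirrors the setup above; here the requested rows happen to live on different shards, so the application must query both and merge the results itself.

import sqlite3

# Two shards as in the previous sketch (illustrative in-memory stand-ins).
shards = {
    "shard_a": sqlite3.connect(":memory:"),
    "shard_b": sqlite3.connect(":memory:"),
}
for conn in shards.values():
    conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, payload TEXT)")
shards["shard_a"].execute("INSERT INTO records VALUES (1, 'first')")
shards["shard_b"].execute("INSERT INTO records VALUES (3, 'third')")

def read_many(record_ids):
    # Steps 1-2: the application decides that several shards may hold the
    # requested ids, so it queries each candidate shard.
    placeholders = ",".join("?" * len(record_ids))
    results = []
    for conn in shards.values():
        rows = conn.execute(
            f"SELECT id, payload FROM records WHERE id IN ({placeholders})",
            record_ids).fetchall()
        results.extend(rows)      # step 3: join/merge in the application
    return sorted(results)        # step 4: hand the combined rows back

print(read_many([1, 3]))          # -> [(1, 'first'), (3, 'third')]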

Relational databases generally require data to adhere to a schema. As a result,


storage of semi-structured and unstructured data whose schemas are non-
relational is not directly supported. Furthermore, with a relational database, schema conformance is validated at the time of data insert or update by checking the data against the constraints of the schema, as sketched below.
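The sketch below demonstrates this behavior with sqlite3 and an invented readings table: a row that conforms to the schema is accepted, while a row that violates a CHECK constraint is rejected at insert time, which is exactly where the validation cost is paid.

import sqlite3

conn = sqlite3.connect(":memory:")
# The schema carries the constraints the database must check on every write.
conn.execute("""
    CREATE TABLE readings (
        sensor_id INTEGER NOT NULL,
        value     REAL    NOT NULL CHECK (value >= 0)
    )""")

conn.execute("INSERT INTO readings VALUES (1, 21.5)")    # conforms: accepted

try:
    conn.execute("INSERT INTO readings VALUES (1, -4)")  # violates the CHECK
except sqlite3.IntegrityError as err:
    print("rejected at insert time:", err)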

This validation introduces overhead that creates latency, making relational databases a less than ideal choice for storing high-velocity data that needs a highly available database storage device with fast data write capability.
As a result of its shortcomings, a traditional RDBMS is generally not useful as
the primary storage device in a Big Data solution environment.
