Lecture 10
Batch
Batch processing, also known as offline processing, involves processing data in
batches and usually imposes delays, which in turn results in high-latency
responses. Batch workloads typically involve large quantities of data with
sequential read/writes and comprise of groups of read or write queries.
Queries can be complex and involve multiple joins. OLAP systems commonly
process workloads in batches. Strategic BI and analytics are batch-oriented as
they are highly read-intensive tasks involving large volumes of data. As shown
in Figure 6.4, a batch workload comprises grouped reads/writes with a large
data footprint, may contain complex joins, and produces high-latency
responses.
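To make the contrast concrete, the following sketch runs a batch-style analytical query against a small, hypothetical star schema (the table and column names are illustrative, not from the text) using Python's built-in sqlite3 module: a grouped read over the whole dataset with multiple joins.

```python
import sqlite3

# Hypothetical star-schema tables, used only to illustrate the workload shape.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (product_id INTEGER, region_id INTEGER, amount REAL);
    CREATE TABLE products (product_id INTEGER, category TEXT);
    CREATE TABLE regions (region_id INTEGER, name TEXT);
    INSERT INTO products VALUES (1, 'hardware'), (2, 'software');
    INSERT INTO regions VALUES (10, 'east'), (20, 'west');
    INSERT INTO sales VALUES (1, 10, 100.0), (2, 10, 250.0), (1, 20, 75.0);
""")

# A batch-style query: grouped reads over the full dataset with multiple joins,
# the kind of complex, read-intensive work an OLAP system handles offline.
rows = conn.execute("""
    SELECT p.category, r.name, SUM(s.amount)
    FROM sales s
    JOIN products p ON s.product_id = p.product_id
    JOIN regions r ON s.region_id = r.region_id
    GROUP BY p.category, r.name
""").fetchall()
print(rows)
```

On real batch systems the same query shape scans far larger tables, which is why responses arrive with high latency.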
Transactional
Transactional processing, also known as online processing, handles data
interactively and without delay, resulting in low-latency responses.
Transactional workloads involve small amounts of data with random reads and
writes.
OLTP and operational systems, which are generally write-intensive, fall within
this category. Although these workloads contain a mix of read/write queries,
they are generally more write-intensive than read-intensive.
Transactional workloads comprise random reads/writes that involve fewer joins
than business intelligence and reporting workloads. Given their online nature
and operational significance to the enterprise, they require low-latency
responses with a smaller data footprint, as shown in Figure 6.5.
Figure 6.5 Transactional workloads have few joins and lower latency responses
than batch workloads.
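A transactional workload can be sketched in the same terms: short transactions that touch individual rows by key. The account table and transfer below are hypothetical examples, again using Python's built-in sqlite3 module.

```python
import sqlite3

# A minimal sketch of a transactional (OLTP-style) workload: small,
# random reads and writes wrapped in short, atomic transactions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.execute("INSERT INTO accounts VALUES (1, 500.0), (2, 300.0)")

# A transfer: two point writes committed atomically.
with conn:  # commits on success, rolls back on error
    conn.execute("UPDATE accounts SET balance = balance - 50 WHERE id = 1")
    conn.execute("UPDATE accounts SET balance = balance + 50 WHERE id = 2")

# A point read: touches a single row via its primary key, no joins.
balance = conn.execute("SELECT balance FROM accounts WHERE id = 2").fetchone()[0]
print(balance)  # 350.0
```

Note the small data footprint per operation: each statement reads or writes one row, which is what allows low-latency responses.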
Figure 7.1 On-disk storage can be implemented with a distributed file system or
a database.
Distributed File Systems
Distributed file systems, like any file system, are agnostic to the data being
stored and therefore support schema-less data storage. In general, a distributed
file system storage device provides out-of-the-box redundancy and high
availability by copying data to multiple locations via replication.
A storage device implemented with a distributed file system provides
simple, fast-access data storage capable of holding large datasets that are
non-relational in nature, such as semi-structured and unstructured data.
Although it relies on straightforward file-locking mechanisms for concurrency
control, it provides fast read/write capability, which addresses the velocity
characteristic of Big Data.
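The replication idea can be illustrated with a toy sketch that copies each write to several local directories standing in for cluster nodes; real distributed file systems such as HDFS replicate blocks across machines, and the helper names below are invented for illustration only.

```python
import pathlib
import tempfile

# Toy stand-in for a cluster: three local directories acting as "nodes".
root = pathlib.Path(tempfile.mkdtemp())
replicas = [root / f"node{i}" for i in range(3)]
for r in replicas:
    r.mkdir()

def replicated_write(name: str, data: bytes) -> None:
    # Redundancy via replication: the same bytes land on every node.
    for r in replicas:
        (r / name).write_bytes(data)

def read_with_failover(name: str) -> bytes:
    # High availability: if one copy is gone, fall back to the next.
    for r in replicas:
        path = r / name
        if path.exists():
            return path.read_bytes()
    raise FileNotFoundError(name)

replicated_write("log.dat", b"event-1\n")
(replicas[0] / "log.dat").unlink()  # simulate losing one replica
print(read_with_failover("log.dat"))
```

The sketch shows why replication yields both redundancy and availability: losing one copy does not make the data unreadable.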
A distributed file system is not ideal for datasets comprising a large number of
small files as this creates excessive disk-seek activity, slowing down the overall
data access. There is also more overhead involved in processing multiple
smaller files, as dedicated processes are generally spawned by the processing
engine at runtime for processing each file before the results are synchronized
from across the cluster.
Due to these limitations, distributed file systems work best with fewer but
larger files accessed sequentially. Multiple smaller files are generally
combined into a single file to enable optimum storage and processing. This
allows distributed file systems to achieve higher performance when data is
accessed in streaming mode with no random reads and writes (Figure 7.2).
Figure 7.2 A distributed file system accessing data in streaming mode with no
random reads and writes.
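The small-files optimization described above can be sketched as packing many small files into one container file that is then read back sequentially; the length-prefixed record format here is purely illustrative.

```python
import pathlib
import tempfile

# Many small files: each one would otherwise cost its own disk seek
# and its own per-file processing task on the cluster.
src = pathlib.Path(tempfile.mkdtemp())
for i in range(5):
    (src / f"part-{i}.txt").write_bytes(f"record {i}".encode())

# Pack them into a single larger file of length-prefixed records.
combined = src / "combined.dat"
with combined.open("wb") as out:
    for small in sorted(src.glob("part-*.txt")):
        data = small.read_bytes()
        out.write(len(data).to_bytes(4, "big"))  # 4-byte length prefix
        out.write(data)

# Sequential streaming read of the combined file: no per-file seeks.
records = []
with combined.open("rb") as f:
    while header := f.read(4):
        records.append(f.read(int.from_bytes(header, "big")).decode())
print(records)
```

Hadoop-style systems apply the same idea with container formats (e.g. sequence files), but the principle is identical: one large file, read as a stream.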
A distributed file system storage device is suitable when large datasets of raw
data are to be stored or when archiving of datasets is required. In addition, it
provides an inexpensive storage option for storing large amounts of data over a
long period of time that needs to remain online. This is because more disks can
simply be added to the cluster without needing to offload the data to offline data
storage, such as tapes. It should be noted that distributed file systems do not
provide the ability to search the contents of files as a standard,
out-of-the-box capability.
RDBMS Databases
Relational database management systems (RDBMSs) are good for handling
transactional workloads involving small amounts of data with random
read/write properties. RDBMSs are generally restricted to a single node and
therefore do not provide out-of-the-box redundancy and fault tolerance;
achieving these capabilities typically requires additional, manually configured
replication or clustering software, which introduces overhead that creates
latency. This latency makes relational databases a less-than-ideal choice for
storing high-velocity data that needs a highly available database storage
device with fast data write capability.
As a result of these shortcomings, a traditional RDBMS is generally not useful
as the primary storage device in a Big Data solution environment.