
Chap 2.

Emerging Database Landscape

Scale-Out Architecture, RDBMS vs. Non-Relational Workloads and Their Characteristics, Implications of Big Data Scale on Data Processing

Emerging Database Landscape

In the new data management paradigm, and especially under the influence of big data, IT solutions and enterprise infrastructure landscapes may encompass many technologies working together. Figuring out which of these technologies are relevant for you is not a trivial matter. In this chapter we will discuss several of them and share best practices: which data management approach is best for which kind of data-related challenge?
The ongoing explosion of data challenges businesses today. Organizations capture, track, analyze, and store everything from mass quantities of transactional, online, and mobile data to growing amounts of machine-generated data. In fact, machine-generated data, from sources ranging from web, telecom network, and call-detail records to online gaming, social networks, sensors, computer logs, satellites, financial transaction feeds, and more, represents the fastest-growing category of big data. High-volume websites can generate billions of data entries every month.
Extracting useful intelligence from current data volumes of mostly structured data was already a challenge; imagine the situation when you deal with data at big data scale!
To solve data-volume-related challenges, architects have traditionally applied the typical approaches listed below, but each one has its own implications:

 Tuning or upgrading the existing database, resulting in significantly increased costs, whether through administration effort or licensing fees
 Upgrading hardware processing capabilities, increasing the enterprise's overall total cost of ownership (TCO) through additional hardware costs and subsequent annual maintenance fees
 Increasing storage capacity, which sets off a recurring pattern: more storage is added in direct proportion to the growth of data, incurring additional costs each time
 Implementing a data archiving policy wherein old data is periodically moved onto lower-cost storage. While this is a sensible approach, it also constrains data usage and analysis: less data is available to your analysts and business users at any one time, which may result in less comprehensive analysis of usage patterns and can greatly impact analytic conclusions
 Upgrading the network infrastructure, leading to both increased costs and, potentially, more complex network configurations.

From the above-mentioned arguments, it is clear that throwing money at


your database problem doesn’t really solve the issue. Are there any
alternative approaches? Before we dive deep into alternative solutions and
architectural strategies, let us first understand how databases have evolved
over the past decade or so.

The Database Evolution

It is widely accepted that innovation in database technologies began with the appearance of the relational database management system (RDBMS) and its associated database access mechanism, the structured query language (SQL). The RDBMS was primarily designed to handle both online transaction processing (OLTP) workloads and business intelligence (BI) workloads. In addition, a plethora of products and add-on utilities was quickly developed to augment the RDBMS's capabilities, creating a rich ecosystem of software products that depended on its SQL interface and fulfilled many business needs.
Database engines were primarily built to access data held on spinning disks. The data access operations used system memory to cache data and were largely dependent on the available CPU power. Over time, innovations in efficient memory usage and faster CPU cycle speeds significantly improved data access and usage patterns. Databases also began to explore options for parallel processing of workloads. In the early days, the typical RDBMS installation was a large symmetric multiprocessing (SMP) server; later, individual servers were clustered with interconnects between two or more machines so that they appeared as a single logical database server. These cluster-based architectures significantly improved parallelism and provided high performance and failover capabilities.

The Scale-Out Architecture

As data volumes grew exponentially and there was an increasing need to integrate and leverage a vast array of data sources, a new generation of database products began to emerge. These were labeled Not Only SQL (NoSQL) products. They were designed to cater to distributed architecture styles, enabling high concurrency and partition tolerance to manage data volumes up to the petabyte range.

Fig 1. Scale-Out Architecture

Fig 1 illustrates a scale-out database architecture. You can see the design philosophy: data from several sources is acquired and then distributed across multiple nodes, so the full database is spread across multiple computers. In the earlier versions of NoSQL databases there was a constraint that the data for a transaction or query be limited to a single node.
The concept of a multi-node database with transactions or queries isolated to individual nodes was a design choice to support the transactional workloads of large websites. Due to this limitation, the back-end database infrastructure required manual partitioning of the data into identical schemas across nodes. The local database running on each node held a portion of the total data, a technique referred to as sharding: breaking the database up into shards. Queries are broken into sub-queries, which are then applied to specific nodes in a server cluster, and the results from each of these sub-queries are aggregated to get the final answer. All nodes work in parallel. To improve performance or cater to larger data volumes, more nodes are added to the cluster as and when needed.
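As a concrete illustration of this scatter-gather pattern, here is a minimal sketch in Python; the cluster contents and helper functions are hypothetical and not taken from any particular NoSQL product. Each node runs the sub-query against its own shard, and the partial results are merged into the final answer.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "nodes"; each dictionary stands in for one shard of the data.
cluster = [
    {"alice": 120, "bob": 80},    # node 0
    {"carol": 200, "dave": 40},   # node 1
    {"erin": 75},                 # node 2
]

def sub_query(node, predicate):
    """Run the sub-query locally against a single node's shard."""
    return {key: value for key, value in node.items() if predicate(value)}

def scatter_gather(predicate):
    """Send the sub-query to every node in parallel, then merge the partial results."""
    merged = {}
    with ThreadPoolExecutor(max_workers=len(cluster)) as pool:
        for partial in pool.map(lambda node: sub_query(node, predicate), cluster):
            merged.update(partial)
    return merged

# Example: find all entries with a value greater than 70.
print(scatter_gather(lambda value: value > 70))
# {'alice': 120, 'bob': 80, 'carol': 200, 'erin': 75}
```

Adding another node to the cluster list is all it takes to spread the same query over more hardware, which is exactly the scale-out property described above.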
Most NoSQL databases have a scale-out architecture and can be distributed
across many server nodes. How they handle data distribution, data
compression, and node failure varies from product to product, but the
general architecture is similar. They are usually built in a shared-nothing
manner so that no node has to know much about what’s happening on other
nodes.
The scale-out architecture brings to light two interesting features, and both
of these features focus on the ability to distribute data over a cluster of
servers.
Replication: This is all about taking the same data and copying it over
multiple nodes. There are two types of replication strategies.

 Master-Slave
 Peer-To-Peer

In the Master-Slave approach, you replicate data across multiple nodes. One node acts as the designated master, and the rest are slave nodes keeping copies of the entire data set, thereby providing resilience to node failures. The master node is the most up-to-date and accurate source for the data and is responsible for managing consistency. Periodically, the slaves synchronize their content with the master.
Master-Slave replication is most helpful for scaling when you have a read-intensive data set. You can scale horizontally to handle more read requests by adding more slave nodes and ensuring all read requests are routed to the slaves. However, this approach hits a major bottleneck when the workload is both read- and write-intensive: the master has to juggle the updates and pass them on to the slave nodes to keep the data consistent everywhere.
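A minimal sketch of this read-scaling idea, assuming hypothetical Node objects rather than a real database driver: writes go to the designated master, which propagates them to the slaves, while reads are spread across the slaves in round-robin fashion.

```python
import itertools

class Node:
    """A stand-in for a connection to one replica."""
    def __init__(self, name):
        self.name = name
        self.data = {}

    def write(self, key, value):
        self.data[key] = value

    def read(self, key):
        return self.data.get(key)

master = Node("master")
slaves = [Node(f"slave-{i}") for i in range(3)]
next_slave = itertools.cycle(slaves)

def write(key, value):
    # All writes go to the master; here the sync to the slaves is done inline,
    # whereas a real system would replicate asynchronously.
    master.write(key, value)
    for slave in slaves:
        slave.write(key, value)

def read(key):
    # Reads are routed to the slaves in round-robin fashion to spread the load.
    return next(next_slave).read(key)

write("order:42", {"status": "shipped"})
print(read("order:42"))
```

Scaling reads then amounts to adding more entries to the slaves list, while every write still funnels through the single master, which is the bottleneck described above for write-heavy workloads.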
While the Master-Slave approach provides read scalability, it severely lacks write scalability. The Peer-to-Peer replication approach addresses this issue by doing away with the master node altogether. All replica nodes have equal weight: they all accept write requests, and the loss of any node does not prevent access to the data store, because the remaining nodes are accessible and hold copies of the same data, although those copies may not be the most up to date.
The main concern in this approach is data consistency across the nodes: when you allow writes on two different nodes for the same data set, you run the risk of two different users attempting to update the same record at the same time, introducing a write-write conflict. Such write-write conflicts are managed through a concept called "serialization", wherein you decide to apply the write operations one after another. Serialization is applied in either a pessimistic or an optimistic mode. The pessimistic mode works by preventing conflicts from occurring: all write operations are performed sequentially, and only when they are all done is the data set made available. The optimistic mode lets conflicts occur but detects them and later takes corrective action to sort them out, making all the write operations eventually consistent.
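To illustrate the optimistic mode, the sketch below (hypothetical, not tied to any particular product) attaches a version number to every record; a write succeeds only if the version the writer originally read is still current, otherwise the write-write conflict is detected and can be resolved afterwards, for example by retrying.

```python
class ConflictError(Exception):
    """Raised when a write-write conflict is detected."""

class OptimisticStore:
    """Toy key-value store that detects write-write conflicts via record versions."""
    def __init__(self):
        self.records = {}  # key -> (version, value)

    def read(self, key):
        return self.records.get(key, (0, None))

    def write(self, key, value, expected_version):
        current_version, _ = self.records.get(key, (0, None))
        if current_version != expected_version:
            # Someone else wrote this record since we read it: conflict detected.
            raise ConflictError(f"{key}: expected v{expected_version}, found v{current_version}")
        self.records[key] = (current_version + 1, value)

store = OptimisticStore()
version, _ = store.read("profile:7")
store.write("profile:7", {"city": "Pune"}, expected_version=version)        # succeeds
try:
    store.write("profile:7", {"city": "Mumbai"}, expected_version=version)  # stale version
except ConflictError as error:
    print("write-write conflict:", error)
```

A pessimistic implementation would instead take a lock before the first read, so the conflicting write simply waits rather than failing.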

Sharding: This is all about selectively organizing a particular set of data on


different nodes. Once you have data in your data store, different applications
and data analysts access different parts of the data set. In such situations,
you can introduce horizontal scalability by selectively putting different parts
of the data set onto different servers. When the user accesses specific data
elements, their queries hit only the designated server. As a result, they get
rapid responses!
However, there is one drawback to this approach. If your query consists of
data sets distributed over several nodes, how do you aggregate these
different data sets? This is a design consideration you need to acknowledge
while distributing data over several nodes.
You need to understand the query patterns first and then design the data distribution so that data that is commonly accessed together is kept on a single node. This helps improve query performance.
For example, if you know that most accesses of certain data sets are based on physical location, you can place that data close to the location where it is accessed. Or, if you see that most query patterns revolve around customers' surnames, you might put all customers with surnames starting with A to E on one node, F to J on another node, and so on.
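A minimal sketch of the surname-range idea, with a hypothetical routing function that maps a customer's surname to the node holding that shard:

```python
import bisect

# Upper bound of each surname range and the node that owns it (hypothetical layout).
RANGE_BOUNDS = ["E", "J", "O", "T", "Z"]  # A-E, F-J, K-O, P-T, U-Z
NODES = ["node-1", "node-2", "node-3", "node-4", "node-5"]

def shard_for(surname: str) -> str:
    """Route a customer record to a node based on the first letter of the surname."""
    first_letter = surname[0].upper()
    index = bisect.bisect_left(RANGE_BOUNDS, first_letter)
    return NODES[index]

print(shard_for("Jadhav"))  # node-2 (F-J range)
print(shard_for("Patil"))   # node-4 (P-T range)
```

Queries that filter on surname touch exactly one node, which is where the rapid response comes from; queries on any other attribute would still have to fan out to every shard.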
Sharding greatly improves read and write performance; however, it does little to improve resilience when used alone. Because the data sits on different nodes, a node failure makes that part of the data unavailable; only the users of the data on that shard will have issues, while the rest of the users are not impacted.
Combining Sharding with Replication: Replication and sharding are orthogonal techniques for data distribution, which means that in your data design you can use either approach or both. If you use both, you essentially take the sharding approach but appoint a master node for each shard (thus ensuring write consistency); the rest are slaves holding copies of the shard's data (thus ensuring scalable read operations).
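A small sketch of that combined layout (the shard ranges and node names are hypothetical): every shard has its own master plus slave copies, writes are routed to the shard's master, and reads can be served by any slave of that shard.

```python
import random

# Each shard is a small replica set: one master plus read-only copies.
cluster = {
    "A-M": {"master": "node-1", "slaves": ["node-2", "node-3"]},
    "N-Z": {"master": "node-4", "slaves": ["node-5", "node-6"]},
}

def shard_for(surname: str) -> str:
    """Pick the shard by surname range."""
    return "A-M" if surname[0].upper() <= "M" else "N-Z"

def route(surname: str, operation: str) -> str:
    """Writes go to the shard's master; reads can hit any slave in that shard."""
    shard = cluster[shard_for(surname)]
    if operation == "write":
        return shard["master"]
    return random.choice(shard["slaves"])

print(route("Jadhav", "write"))  # node-1: master of shard A-M keeps writes consistent
print(route("Patil", "read"))    # node-5 or node-6: any slave of shard N-Z can serve reads
```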

The Relational Database and the Non-Relational Database

On a broad level, we can assume that there are two specific kinds of
databases: the relational database and the “non-relational” database. There
are several definitions and interpretations of what the characteristics of
these two types of databases are.
Let's first define what structured data is and what unstructured data is; these definitions weigh heavily on the characteristics of RDBMS and non-RDBMS systems.
Structured Data: Structured data carries an explicit structure for its data elements. In other words, metadata exists for every data element, and how the data is stored and accessed, whether through SQL-based commands or other programming constructs, is clearly defined.
Unstructured Data: Unstructured data constitutes all other data that fall
outside the definition of structured data. Its structure is not explicitly
declared in a schema. In some cases, as with natural language, the structure
may need to be discovered.
The Relational Database (RDBMS): A relational database stores data in tables and predominantly uses SQL-based commands to access the data. Mostly, the data structures and resulting data models take a third normal form (3NF) structure. In practice, the data model is a set of tables and the relationships between them, expressed in terms of keys and integrity constraints across related tables, such as foreign keys. A row of any table consists of columns of structured data, and the database as a whole contains only structured data. The logical model of the data held in the database is based on tables and relationships.
For example, for an Employee table we can define the columns
as Employee_ID, First_Name, Initial, Last_Name, Address_Line_1,
Address_Line_2, City, State, Zip_Code, Home_Tel_No, Cell_Tel_No. In the
database schema, we further define the data types for each one of these
columns: integer, char, varchar, etc. These column names feature in the
SQL queries as data of interest for the user. We call this structured data
because the data held in the database is represented in a tabular fashion
and is known in advance and recorded in a schema.
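The same Employee structure expressed as an actual schema; the sketch below uses Python's built-in sqlite3 module purely as a stand-in for a full RDBMS, and the sample row is invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The schema (the metadata) is declared up front; every column has a defined type.
conn.execute("""
    CREATE TABLE Employee (
        Employee_ID    INTEGER PRIMARY KEY,
        First_Name     VARCHAR(50),
        Initial        CHAR(1),
        Last_Name      VARCHAR(50),
        Address_Line_1 VARCHAR(100),
        Address_Line_2 VARCHAR(100),
        City           VARCHAR(50),
        State          VARCHAR(50),
        Zip_Code       VARCHAR(10),
        Home_Tel_No    VARCHAR(15),
        Cell_Tel_No    VARCHAR(15)
    )
""")

conn.execute(
    "INSERT INTO Employee (Employee_ID, First_Name, Last_Name, City) VALUES (?, ?, ?, ?)",
    (1, "Asha", "Patil", "Pune"),
)

# The column names declared in the schema are exactly what SQL queries reference.
for row in conn.execute("SELECT First_Name, Last_Name FROM Employee WHERE City = ?", ("Pune",)):
    print(row)
```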
The Non-Relational Database: Since an RDBMS is confined to representing data as related tables of rows and columns, it does not easily accommodate data with nested or hierarchical structures, such as a bill of materials or a complex document. Non-relational databases cater to a wider variety of data structures than just tables (older mainframe data structures, object and object-relational structures, document and XML structures, graph structures, etc.). What we have defined here is an "everything else" bucket that includes all databases that are not purely relational.
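For contrast, a nested structure such as a simple bill of materials maps naturally onto a document-style representation. The sketch below is a hypothetical example written as a Python dictionary, equivalent to a JSON document in a document database:

```python
import json

# A bill of materials nests sub-assemblies inside the parent product --
# a shape that maps awkwardly onto flat relational rows and foreign keys.
bicycle = {
    "product": "bicycle",
    "components": [
        {"part": "frame", "material": "aluminium"},
        {
            "part": "wheel",
            "quantity": 2,
            "components": [
                {"part": "rim"},
                {"part": "spoke", "quantity": 32},
                {"part": "tyre"},
            ],
        },
    ],
}

# In a document database the whole nested structure is stored and retrieved as one unit.
print(json.dumps(bicycle, indent=2))
```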

Database Workload

The table above summarizes the database landscape that has emerged. The traditional databases, including open-source ones like MySQL and PostgreSQL, are of course suited to the traditional OLTP, data mart, and data warehouse workloads. There are also databases like Aerospike and VoltDB that specialize in extremely high volumes of OLTP transactions. This category is very close, but not identical, to in-memory databases like SAP's HANA or Kognitio, which simply focus on speed and response time.

The final category consists of databases whose common characteristic is that they are built to run on Hadoop's HDFS file system. Currently, all of them seem to target the workloads of the traditional RDBMS, but with the nuance that they scale out better. They are unlikely to ever challenge either the analytical or the high-volume OLTP databases with respect to scale and capability, but as they mature, they may become attractive alternatives to the traditional RDBMS.

For businesses that are selecting database products for a specific type of application, our advice is to determine which category of database they need before deciding which products to investigate. While we can expect some rationalization among these database categories as time passes, we expect most of them to persist, with two or three products dominating each category. This is because the categories have been derived from different types of workload, and we do not expect a database engine that is excellent in one of these categories to perform particularly well in the others.

OldSQL, NewSQL, and the Emerging NoSQL

The relational database was driven by the idea of database standardization: a generally applicable structure in which to store the data, and a universally accepted interface like SQL with which to query it. We will refer to the traditional RDBMS systems as OldSQL databases. These technologies have proven to be excellent for most transactional data and also for querying and analyzing broad collections of corporate data. These databases are characterized by the use of SQL as the primary means of data access, although they may have other data access features.
There is also a relatively new category of relational databases that, although they adhere to the traditional RDBMS philosophy, are designed differently, extending the relational model. A key offering of these databases is new architectures that improve performance, scalability, and, most commonly, scale-out. They include such products as Infobright, SAP Sybase IQ, Greenplum, ParAccel, SAND Technologies, Teradata, Vertica, Vectorwise, and others. We categorize these as NewSQL databases, since they employ SQL as their primary means of access and fundamentally deal with structured data only.

There is also an emerging set of databases specifically designed to provide non-SQL modes of data access. These are commonly categorized as NoSQL databases, a label variously read as "not only SQL" or "no SQL at all." These NoSQL databases exhibit a wide range of characteristics and design philosophies.
Fig 2 illustrates the areas of applicability of OldSQL, NewSQL, and NoSQL.

Fig 2. Applicability of OldSQL, NewSQL, and NoSQL

The vertical axis in Fig 2 indicates the complexity of the data structure. A single table is less complex than the star schema and snowflake schema structures one often sees in data warehouses, and these in turn are simpler than a third normal form (3NF) relational schema. Nested data, graph data, and other forms of complex data represent increasing levels of structural complexity.
It is easy to place OldSQL and NewSQL databases on this diagram. Both cater to all of the data structures up to the snowflake schema models. The distinction between the two categories of product lies simply in their ability to scale up to very high volumes of data. The OldSQL databases, built for single-server or clustered environments, have a limit to their scalability. Most NewSQL databases, designed for queries over high data volumes, provide little or no support for OLTP, but their scale-out architectures offer good support for data volumes up to the petabyte level.
As soon as we enter the more diverse schema models, NoSQL databases come into the picture. This category includes products such as key-value databases, graph databases, document databases, and so on. Such databases are built to support extremely large sparse tables, and the JOIN operation is superfluous to their intended workloads.
Implications of Big Data Scale on Data Processing

The following list shows a few examples of machine-generated unstructured data:

 Satellite images: This includes weather data or the data that the government captures in its satellite surveillance imagery. Just think about Google Earth, and you get the picture (pun intended).
 Scientific data: This includes seismic imagery, atmospheric data, and high-energy physics.
 Photographs and video: This includes security, surveillance, and traffic video.
 Radar or sonar data: This includes vehicular, meteorological, and oceanographic seismic profiles.

The following list shows a few examples of human-generated unstructured data:

 Text internal to your company: Think of all the text within documents, logs, survey results, and e-mails. Enterprise information actually represents a large percentage of the text information in the world today.
 Social media data: This data is generated from social media platforms such as YouTube, Facebook, Twitter, LinkedIn, and Flickr.
 Mobile data: This includes data such as text messages and location information.
 Website content: This comes from any site delivering unstructured content.