Understanding Databases
Many of the designations used by manufacturers and sellers to distinguish their products are
claimed as trademarks. Where those designations appear in the book, and Manning
Publications was aware of a trademark claim, the designations have been printed in initial caps
or all caps.
Recognizing the importance of preserving what has been written, it is Manning’s policy to have
the books we publish printed on acid-free paper, and we exert our best efforts to that end.
Recognizing also our responsibility to conserve the resources of our planet, Manning books
are printed on paper that is at least 15 percent recycled and processed without the use of
elemental chlorine.
ISBN: 9781617297663
Printed in the United States of America
1 2 3 4 5 6 7 8 9 10 - EBM - 24 23 22 21 20 19
Managing databases
Chapter 4 from Learn SQL Server Administration in a Month of Lunches

Databases on AWS
Chapter 4 from Learn Amazon Web Services in a Month of Lunches
Managing databases
Databases are the basic unit of management and work within SQL Server. A data-
base is an almost entirely self-contained package that bundles security settings, con-
figuration settings, your actual data, and much more. That makes databases a good
place to start your education in SQL Server maintenance. Some of the things you’ll
look at in this chapter will need a much more complete explanation, which will
come in later chapters; the goal right now is to focus on the database container
itself. We’ll be covering database configuration options and some basics of how to
manage databases.
Figure 4.1 The General tab includes basic information about the database.
Collation defaults to the collation setting of the SQL Server instance (and you
incur a performance hit if you select a collation other than that). It refers to the
way SQL Server treats data for sorting and comparing purposes. For example, in
Lithuania the letter “Y” comes between “I” and “J,” so SQL Server would need to
know if you wanted to use those rules. The “Latin” collations work well for
English and most Romance languages.
Recovery model determines how the database can be recovered. Production data-
bases should use the Full model; nonproduction databases, or databases that
are primarily read-only, may use Simple. You’ll learn more about these in the
next chapter.
Compatibility level can be used to change the behavior of the database to corre-
spond with a previous version of SQL Server. This is mainly used when you
migrate a database from an older version to a newer version.
Containment type specifies exactly how standalone the database is. Normally,
databases have a few dependencies on certain instance-wide objects, meaning
you can’t easily move only the database to a different instance. A contained data-
base has fewer, or zero, dependencies, making it more standalone—but less
centrally managed. Chapter 6 will go into some of these details.
Auto Close should, in general, be False. This specifies that SQL Server should
close the physical database files when the database isn’t being used. When
someone does try to use it, there will be a delay as SQL Server opens the file. For
a large database, or on a busy server, that delay can be significant.
Auto Create Statistics should usually be True. This tells SQL Server to automati-
cally create certain statistical information used to optimize query performance.
You’ll learn a lot more about statistics in several upcoming chapters, and you’ll
also learn about the Auto Update Statistics and Auto Update Statistics Asyn-
chronously options.
Auto Shrink is typically set to False. Setting it to True causes SQL Server to peri-
odically try to make the database files smaller by returning unused space to the
OS. Later in this chapter, you’ll learn why that’s often a bad idea.
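If you prefer a script to the dialog, these same options can be set with T-SQL. Here's a minimal sketch, assuming a hypothetical database named SalesDB (the compatibility level shown, 110, corresponds to SQL Server 2012; match it to your own source version):

-- SalesDB is a hypothetical name; substitute your own database.
ALTER DATABASE SalesDB COLLATE Latin1_General_CI_AS;   -- collation
ALTER DATABASE SalesDB SET RECOVERY FULL;              -- recovery model
ALTER DATABASE SalesDB SET COMPATIBILITY_LEVEL = 110;  -- SQL Server 2012 behavior
ALTER DATABASE SalesDB SET AUTO_CLOSE OFF;
ALTER DATABASE SalesDB SET AUTO_CREATE_STATISTICS ON;
ALTER DATABASE SalesDB SET AUTO_SHRINK OFF;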
Those are the main settings that you need to focus on for now. There are some others
that will come up in discussion in upcoming chapters, so just remember where you
found the database properties, and how to get back to that dialog.
You should already have seen how to perform an attach operation; if not, turn to the
appendix and follow the lab setup procedure.
Colocated log and data files. For both performance and reliability reasons, you
want your data (.MDF and .NDF) files separated from your log (.LDF) files. Sepa-
rated ideally means on different physical disks.
Unnecessary secondary files. If your database is split into one .MDF and one or
more .NDF files, and they’re all located on the same storage volume, then those
secondary files may be unnecessary. On a huge database, the .NDF files may
serve to break up the backup tasks into more manageable chunks; on smaller
databases, there’s no reason to have .NDF files if they’re going to live on the
same disk volume as the .MDF file.
Heavy disk fragmentation. Use Windows’ disk defragmentation tools to look at the
level of physical disk fragmentation. Ideally, you want SQL Server’s database files
to come from a single contiguous block of space on whatever disk volumes they
live on. Anything else slows down SQL Server. Note that you can’t defragment
these files without either detaching the database or shutting down the SQL
Server instance.
The last thing to look at is a bit more of a judgment call, and that’s how your database
uses any .NDF files it may have. When you create a .NDF file, you choose which data-
base objects are moved to that file. Similarly, when a developer creates a database
object, they pick which file it lives in. Those objects can include tables, which house all
of your actual data, and programming objects like views, stored procedures, and the
like. If your goal is to improve performance, or to split up a large database to make
backups more manageable, then you can’t just move any old objects into a .NDF: you
have to be strategic.
Let’s say your database includes two heavily used tables, A and B. There are a
bunch of other tables, but those two get the majority of the traffic. Moving A and B to
separate .NDFs, while leaving everything else in the main .MDF, can be a good way to
improve performance if—and only if—the three files live on separate disk volumes,
and ideally if those disk volumes are accessed by different controllers (either actual
disk controller cards for local storage, or SAN adapters for SAN storage). That way, SQL
Server can access each of the three without the other two getting in the way.
Imagine that you run a freight train. You have three boxcars full of merchandise
that you need to get to three different destinations. One way to do so would be to tack
all three boxcars behind a single engine, and run it to each of the three destinations.
That’d be pretty slow, and it’s what SQL Server does when all of your data is in a single
.MDF file.
Another approach would be to attach each boxcar to a different engine. If they’re
all sharing track for a portion of the journey, you’re really not getting much benefit.
This is like having multiple .NDF files, but only one controller channel to the storage.
The best approach would be to have three boxcars, three engines, and three sepa-
rate tracks. Everyone can go their own way, without worrying about what the others
are doing. This is the same as having multiple .NDFs, with different controller chan-
nels to each storage volume.
Modern SANs can help alleviate the need to have multiple .NDFs. SANs are fast, so
controller channel contention may not be a problem. SANs inherently spread your
data across many different physical disks, further reducing contention and improving
performance. With a fast, well-built SAN, you may not need .NDFs for performance
purposes.
Keep an eye on how much of each database file is being used, and monitor that figure. Get a feel for how much larger a
database grows every day, for example, and you’ll be able to predict when it will need
more room. You’ll also know about how much room to give it to have it last for a speci-
fied number of additional days. To perform that monitoring, you can pop into SQL
Server Management Studio every few days and look at the databases’ properties. You’ll
see the file sizes, and, if you like, you can record them in an application such as Micro-
soft Office Excel. That way, you can have the spreadsheet generate charts that show
you database growth over time. Charts help me visualize when the database size is
going to become too large for the disk it’s on, for example.
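Rather than clicking through Management Studio every time, a short query can collect the same numbers for your spreadsheet. A minimal sketch (sys.database_files counts sizes in 8 KB pages, so dividing by 128 converts to megabytes):

-- Run in the context of the database you're monitoring.
SELECT name,
       size / 128.0                            AS file_size_mb,
       FILEPROPERTY(name, 'SpaceUsed') / 128.0 AS space_used_mb
FROM sys.database_files;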
4.4.3 Filegroups
We haven’t yet discussed filegroups. If your database consists of a single .MDF file (and a
.LDF file), you won’t use filegroups. Filegroups come into play when you have one or
more .NDF files. In that scenario, filegroups act purely as a convenience. They let you
target specific operations—primarily backups and restores—to a single filegroup,
thereby affecting a bunch of files.
Let’s say that, for performance reasons, you’ve split your database across three
.NDF files in addition to the main .MDF file. Each of those four files is located on a
separate disk volume, and each volume is accessed by a different disk controller.
That’s great for parallel operations, and it should help performance in the system. But
now you have to back up and recover four files, making a lot more work for yourself.
Instead, you can group those into a single filegroup, and then run backups against
that filegroup, grabbing all four files at once.
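Here's a hedged T-SQL sketch of that arrangement (the database name, file names, and drive letters are all hypothetical):

ALTER DATABASE SalesDB ADD FILEGROUP SalesGroup;
ALTER DATABASE SalesDB
    ADD FILE (NAME = SalesData2, FILENAME = 'E:\Data\SalesData2.ndf'),
             (NAME = SalesData3, FILENAME = 'F:\Data\SalesData3.ndf')
    TO FILEGROUP SalesGroup;
-- New objects can be placed in the filegroup as they're created
-- (run this one from within SalesDB).
CREATE TABLE dbo.TableA (Id INT PRIMARY KEY) ON SalesGroup;
-- One backup operation now grabs every file in the filegroup.
BACKUP DATABASE SalesDB
    FILEGROUP = 'SalesGroup'
    TO DISK = 'G:\Backups\SalesDB_SalesGroup.bak';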
On a huge database, you might have many different .NDF files to spread out the
data. If, taken altogether, they’re too large to back up in a single maintenance window,
you might group them into different filegroups. That way, you can back them up sepa-
rately. Of course, recovering becomes a lot more complicated, since you won’t have a
single point-in-time backup that includes everything. You’ll dive more into this topic
in the next chapter.
A storage area network (SAN) is, in essence, a big box of individual disks. Rather than plugging in disks using a copper cable,
you connect a server to the SAN over a network of some kind. Most SQL Server
instances keep their data on a SAN, because the SAN itself can provide a high level of
both performance and reliability. SANs are commonly managed by specialized admin-
istrators, so as a SQL Server administrator you might have to work with them to under-
stand how your servers’ data is being physically stored.
This chapter explores how and why to move your database away from the
WordPress instance to run independently in its own environment. You’ll also
learn about relational databases, NoSQL databases, and planning your infra-
structure design.
Databases on AWS
As you saw in the previous chapter, WordPress stores in a database all the bits and
pieces that make up your website. But of course, this approach isn’t limited to
WordPress: it would be hard to imagine any public-facing application of even mini-
mal complexity that didn’t rely on structured data of one sort or another. Working
on an application? Learn to love databases. The coming pages explore how to
choose a database architecture and how (and why) to move your database away
from the WordPress instance to run independently in its own environment.
Here, the database has records identified by the numbers 1 and 2, and each record
contains fields made up of a name, an address, and a number of purchases.
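In SQL terms, that record layout might look like this minimal sketch (the table name, column types, and sample rows are illustrative assumptions, not taken from this chapter):

CREATE TABLE customers (
    id        INT PRIMARY KEY,   -- the record identifiers 1 and 2
    name      VARCHAR(100),
    address   VARCHAR(200),
    purchases INT                -- number of purchases
);

INSERT INTO customers (id, name, address, purchases)
VALUES (1, 'Pat Smith', '123 Main St.', 4),
       (2, 'Sam Jones', '456 Oak Ave.', 7);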
Perhaps the key benefit of this kind of strong structure is that it allows high levels
of predictability and reliability, because carefully defined rules can be applied to all
transactions affecting your data. You can, for example, apply constraints to the way
users of an application can access the database, to ensure that two users aren’t trying
to write changes to a single record at one time (which could lead to data corruption).
NoSQL databases, with their looser structure, allow simpler integration of data stored
across multiple clients and in different formats. This makes it possible to easily
accommodate fast-growing data sources.
Despite what you may think, some people argue that NoSQL stands not for No SQL or
Not SQL, but rather for Not Only SQL. That’s because these databases can sometimes sup-
port SQL-like operations. In other words, you can sometimes trick a NoSQL database
into providing functionality similar to what you might expect from a relational database.
If you’d like more complete insights into NoSQL and how it fits into the larger
spectrum of database models, the AWS document “What Is NoSQL?” should be help-
ful: http://aws.amazon.com/nosql.
Data accessibility —It’s common to launch more than one server as part of a sin-
gle application. This can be because each provides a unique service, but it’s usu-
ally either so you can duplicate your content to protect against the failure of any
single server, or to accommodate growing user demand. In any case, when you
have multiple application servers using the same data, it’s often a good idea to
keep your database separate.
Hardware —Web or application servers often consume compute resources differ-
ently than databases. The former may rely heavily on the power of a strong,
multicore CPU, whereas the latter may thrive on super-fast, solid-state drives. It’s
always nice to let everyone play with their favorite toys.
Software —Suppose you have an application that requires a Windows server, but
you want your data kept on a Linux machine. Even if, technically, it can be done
using the magic of virtualization, you may want to avoid the extra complications
involved in running both OSs on a single server.
AWS RDS —We’re back at last to the AWS in Learn AWS in a Month of Lunches.
Amazon Relational Database Service (RDS) provides a fully managed database
solution that can be easily integrated into EC2-based applications. Managed
means that Amazon takes care of all the hardware and administrative worries
and gives you a single internet address (called an endpoint) through which to
access the resource. AWS provides you with a database (MySQL, Oracle, Aurora,
and so on) that’s guaranteed to be available, replicated (backed up to protect
against data loss through failure), and patched (the software is the latest and
greatest available). You can only get these features by off-loading your database.
Figure 4.1 is a simple illustration of how an on-instance versus a managed RDS
arrangement might work.
Figure 4.1 WordPress running on an EC2 instance and accessing a database either on-instance or
from a managed RDS instance
Figure 4.2 The RDS section of the AWS Simple Monthly Calculator lets you quickly try out alternative profiles.
The calculator generates quick cost estimates for a variety of profile options. In particular, note the amount
listed as the title of the Estimate of Your Monthly Bill, which updates with every
change you make on the Services tab.
Try it now
Navigate to the AWS Simple Monthly Calculator in your browser, click the Amazon
RDS tab on the left, and see how much a couple of RDS on-demand DB instances
would cost you. Play around with a range of estimates for the Data Transfer fields,
and see what kind of difference that can make.
To give you a sense of the instance you should choose, AWS offers three type families,
each with its own set of member instance classes. Burst capable (db.t2) instances are
the cheapest; despite their relatively weak specs, they can provide brief bursts of
higher performance, which makes them sensible for applications that face only inter-
mittent spikes in usage. Standard (db.m4) instances are cost effective for sustained
usage, but their balance of resources may not give you enough to hold up against
extreme demand. And I’ll bet you can figure out the use case for memory optimized
(db.r3) instances all on your own.
I’ll talk more about the Simple Monthly Calculator later, in chapter 9.
NOTE If your database is currently active and you don’t want any ongoing
transactions to be lost, then you’ll need to add some careful preparation to
this process—which would go far beyond the scope of this book. One excel-
lent tool to consider using to help ease the transition is the AWS Database
Migration Service (https://aws.amazon.com/dms/).
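The dump command itself looks something like this (a reconstruction based on the description that follows; run it from a command line on your WordPress instance):

mysqldump -u wpuser -p wordpressdb > mybackup.sql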
This example assumes that the name of your MySQL user (identified by -u) is wpuser,
you want to be prompted for that user’s password (-p), the name of the database to be
dumped is wordpressdb, and you want the output saved to a file called mybackup.sql
in your user’s home directory. We’ll come back to this file once you’ve got an RDS
instance to host it.
TIP Forgot your username and password? This time you’re in luck. Both should
be available in plain text in the wp-config.php file you created in chapter 3.
NOTE If you’re working with the instance from chapter 3 after having
stopped and restarted it, it may have a new IP address. This may require you
to update values pointing to your site’s old IP address in the wp_options table
of your MySQL database. WordPress has directions at https://codex.wordpress.org/Changing_The_Site_URL. If you’re not comfortable working with
databases, these edits might cost you more time and trouble than firing up a
brand-new instance.
Because this is Amazon, you’re given one more chance to select its Aurora engine. If
you’re trying RDS for the first time, you should definitely choose the test version,
because by default, the profile you’re given is available under the Free Tier. As you’ll
soon see, you’re given only 5 GB of storage, but that’s usually more than enough for a
test database.
Figure 4.3 AWS currently supports these six powerful relational database engines.
Figure 4.5 shows the database details page. I settled on the latest stable version of
MySQL—but you might want to use an older version if you’re having compatibility
problems with existing data. I went with a Free Tier db.t2.micro instance class
because, well, it’s free.
Figure 4.5 Defining your database configuration (release version, instance class, Multi-AZ, authentication, and so on)
I said “no” to Multi-AZ Deployment, but only because this is a test deployment and
the option carries extra cost. Normally, Multi-AZ (availability zone) is an excellent
choice: it replicates your instance in a second AZ so that even if one Amazon data cen-
ter suffers a catastrophic failure (like a total power blackout), your database will sur-
vive thanks to the instance running in the second. Because AWS manages the service,
the transition or failover between a dying instance and its live replacement is pretty
much instantaneous, and your users will probably never know anything happened.
A provisioned IOPS (input/output operations per second) drive—available
through the Storage Type drop-down—may be an attractive option if you’re looking
for extra-fast access to your data. You can enter the maximum amount of allocated
storage you’ll need, in gigabytes. For this example, your needs are minimal, so the
General Purpose option is fine.
The name you choose for DB Instance Identifier in the Settings section will be the
official name given to the database; you’ll use this database name later, in the wp-con-
fig.php file, when you’re ready to connect. You don’t have to use the same database
name you used in your original, local configuration, but you may find it convenient.
Master Username and Password, on the other hand, must match the values you used in
the wp-config.php file on your EC2 instance.
The next screen (figure 4.6) contains some important settings that define the envi-
ronment you’ll give your database instance. The instance is currently set to launch in
my account’s default virtual private cloud (VPC), which is exactly what I want, because
that network is also hosting my EC2 server. Having them both in the same VPC allows
for faster and (more important) more-secure communications. I’ll talk much more
about VPCs later in the book.
Figure 4.6 Defining the launch environment details (network, security group, port, and so on)
In the case of the MySQL database on your WordPress site—whose only legitimate
user is the WordPress instance—there’s no reason to provide access to your database
to anyone or anything besides your EC2 server. So, set Publicly Accessible to No. One
of the cardinal rules of IT administration security is to open the fewest possible system
resources to the fewest possible users. That way, you limit your exposure to risk.
RDS instances have network-traffic-filtering security groups, just as EC2 instances
do. Choose Create New Security Group; later, you’ll edit this to permit traffic between
the WordPress instance and the database. Provide a name for an empty database that
RDS will create on your new instance (an RDS instance can have more than one data-
base, by the way). This isn’t the same as the instance identifier you created on the pre-
vious screen, which is used to identify the RDS instance. For simplicity, you can give
this database the same name you gave the original database on your EC2 instance.
The remaining settings on this page are well documented, so I’ll leave you to
explore them on your own. For now, click Launch DB Instance, and wait for the
instance to come to life.
Figure 4.7 Allow network traffic between your database and EC2 instances.
With that done, return to the RDS dashboard. Click the Instances link in the left
panel, and details about your now-running database instance will be displayed. The
one that interests you the most right now is the endpoint (see figure 4.8).
Copy that endpoint, leaving out the :3306 at the end; MySQL won’t require that when
you connect to the database from EC2.
Figure 4.8 The instance details dashboard for your MySQL database, including the database endpoint
The next command connects to the RDS MySQL database at the host address, logs in
using the wpuser name and password, and then, using the < character, streams the con-
tents of the mybackup.sql file into the wordpressdb database:
mysql -u wpuser -p --database=wordpressdb \
--host=wpdatabase.co7swtzbtfg6.us-east-1.rds.amazonaws.com \
< mybackup.sql
Troubleshooting
That’s a long command, and any out-of-place character will throw the whole thing off.
Adding a space after a single or double dash, using two dashes rather than one or
one instead of two, getting the username or password wrong, adding a space after
the backslash line-break character, or even forgetting to substitute your RDS end-
point for my example as the value for host will cause the command to fail.
Didn’t work on your first try? You’re in good company. Carefully go through the com-
mand again from scratch. And don’t forget to do an internet search for the error mes-
sage you receive—you’ll be surprised how many others have already been there.
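Once the command goes through, a quick check that the WordPress tables actually arrived can save you grief later (a hedged sketch; substitute your own endpoint):

mysql -u wpuser -p --database=wordpressdb \
--host=wpdatabase.co7swtzbtfg6.us-east-1.rds.amazonaws.com \
-e "SHOW TABLES;"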
All that’s left is to enter the new endpoint into the WordPress wp-config .php file in
the /var/www/html/ directory on your EC2 server:
sudo nano /var/www/html/wp-config.php
Check to be sure the DB_USER and DB_PASSWORD values are correct, and then edit the
hostname DB_HOST value to equal your RDS endpoint. In my case, it’s
wpdatabase.co7swtzbtfg6.us-east-1.rds.amazonaws.com.
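For reference, the relevant lines in wp-config.php look something like this sketch (the values shown are from my example; yours will reflect your own names and password):

define('DB_NAME', 'wordpressdb');
define('DB_USER', 'wpuser');
define('DB_PASSWORD', 'your-password-here');
define('DB_HOST', 'wpdatabase.co7swtzbtfg6.us-east-1.rds.amazonaws.com');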
One more thing: spelling counts. While setting up this environment to make sure
everything I’m teaching you works (did you think Manning would sell you flaky
goods?), I found that I couldn’t connect to the RDS instance. Turns out I had acciden-
tally set the username as wbuser rather than wpuser. For a moment or two, I couldn’t
figure out why I was getting this error:
ERROR 1045 (28000): Access denied for user
'wpuser'@'172.31.59.16' (using password: YES)
For those of you keeping score at home, your WordPress site is still happily running on
its own EC2 Free Tier instance, but now it’s using a MySQL database that’s hosted on an
RDS-managed DB instance (that happens to be equally Free Tier). If you have an
instance running on your own AWS account, you may want to keep it for now—it will be
useful in chapter 6 (although not in chapter 5). Of course, you can always terminate it
and quickly re-create it later, giving you the added benefit of a helpful review.
4.9 Lab
Launch your RDS MySQL database, and then connect to it using a MySQL client on a
remote computer. It may take a few attempts before you get the syntax quite right, but
it’s worth it to see the triumphant smile on your face when you succeed. Even if you
don’t see the smile (after all, your eyes aren’t well positioned for the job), feel free to
send me a picture.
Definitions
Relational database—Highly structured database that uses the Structured
Query Language (SQL) standard.
Relational database elements—Tables, records, and fields.
NoSQL—Not Only SQL. Databases that feature less structure but, in some
cases, far greater speed than SQL.
Managed services—Services provided by AWS (including RDS and Dynamo-DB),
for which it takes full responsibility for the underlying hardware, leaving users
to focus on their data.
RDS instance class types—Burst capable, standard, and memory optimized.
Database dump—Making an accurate copy of a database and its contents to
allow it to be archived or migrated.
Multi-AZ—An RDS deployment replicated over more than one availability zone
to increase reliability.
Endpoint—Public address through which an RDS database can be accessed.
In this chapter, we’ll learn what NoSQL is (and what it isn’t), and we’ll explore
NoSQL business drivers as well as some interesting real-world case studies.
The complexity for minimum component costs has increased at a rate of roughly a
factor of two per year...Certainly over the short term this rate can be expected to
continue, if not to increase.
—Gordon Moore, 1965
…Then you better start swimmin’…Or you’ll sink like a stone…For the times they are
a-changin’.
—Bob Dylan
In writing this book we have two goals: first, to describe NoSQL databases, and sec-
ond, to show how NoSQL systems can be used as standalone solutions or to aug-
ment current SQL systems to solve business problems. Though we invite anyone
who has an interest in NoSQL to use this as a guide, the information, examples,
and case studies are targeted toward technical managers, solution architects, and data
architects who are interested in learning about NoSQL.
This material will help you objectively evaluate SQL and NoSQL database systems
to see which business problems they solve. If you’re looking for a programming guide
for a particular product, you’ve come to the wrong place. Here you’ll find informa-
tion about the motivations behind NoSQL, as well as related terminology and con-
cepts. There may be sections and chapters of this book that cover topics you already
understand; feel free to skim or skip over them and focus on the unknown.
Finally, we feel strongly about and focus on standards. The standards associated
with SQL systems allow applications to be ported between databases using a common
language. Unfortunately, NoSQL systems can’t yet make this claim. In time, NoSQL
application vendors will pressure NoSQL database vendors to adopt a set of standards
to make them as portable as SQL.
In this chapter, we’ll begin by giving a definition of NoSQL. We’ll talk about the
business drivers and motivations that make NoSQL so intriguing to and popular with
organizations today. Finally, we’ll look at five case studies where organizations have
successfully implemented NoSQL to solve a particular business problem.
It works on many processors —NoSQL systems allow you to store your database on
multiple processors and maintain high-speed performance.
It uses shared-nothing commodity computers —Most (but not all) NoSQL systems
leverage low-cost commodity processors that have separate RAM and disk.
It supports linear scalability —When you add more processors, you get a consistent
increase in performance.
It’s innovative —NoSQL offers alternatives to a single way of storing, retrieving,
and manipulating data. NoSQL supporters (also known as NoSQLers) have an inclu-
sive attitude about NoSQL and recognize SQL solutions as viable options. To
the NoSQL community, NoSQL means “Not only SQL.”
Equally important is what NoSQL is not:
It’s not about the SQL language —The definition of NoSQL isn’t an application
that uses a language other than SQL. SQL as well as other query languages are
used with NoSQL databases.
It’s not only open source —Although many NoSQL systems have an open source
model, commercial products use NoSQL concepts as well as open source initia-
tives. You can still have an innovative approach to problem solving with a com-
mercial product.
It’s not only big data —Many, but not all, NoSQL applications are driven by the
inability of a current application to efficiently scale when big data is an issue.
Though volume and velocity are important, NoSQL also focuses on variability
and agility.
It’s not about cloud computing —Many NoSQL systems reside in the cloud to take
advantage of its ability to rapidly scale when the situation dictates. NoSQL sys-
tems can run in the cloud as well as in your corporate data center.
It’s not about a clever use of RAM and SSD —Many NoSQL systems focus on the effi-
cient use of RAM or solid state disks to increase performance. Though this is
important, NoSQL systems can run on standard hardware.
It’s not an elite group of products —NoSQL isn’t an exclusive club with a few prod-
ucts. There are no membership dues or tests required to join. To be considered
a NoSQLer, you only need to convince others that you have innovative solutions
to their business problems.
NoSQL applications use a variety of data store types (databases). From the simple key-
value store that associates a unique key with a value, to graph stores used to associate
relationships, to document stores used for variable data, each NoSQL type of data
store has unique attributes and uses as identified in table 1.1.
Table 1.1 Types of NoSQL data stores—the four main categories of NoSQL systems, and sample
products for each data store type

Column family store—A sparse matrix system that uses a row and a column as keys.
Typical uses: web crawler results, and big data problems that can relax consistency rules.
Sample products: Apache HBase, Apache Cassandra, Hypertable, Apache Accumulo.
NoSQL systems have unique characteristics and capabilities that can be used alone or
in conjunction with your existing systems. Many organizations considering NoSQL sys-
tems do so to overcome common issues such as volume, velocity, variability, and agility,
the business drivers behind the NoSQL movement.
1.2.1 Volume
Without a doubt, the key factor pushing organizations to look at alternatives to their
current RDBMSs is a need to query big data using clusters of commodity processors.
Until around 2005, performance concerns were resolved by purchasing faster proces-
sors. In time, the ability to increase processing speed was no longer an option. As chip
density increased, heat could no longer dissipate fast enough without chip overheat-
ing. This phenomenon, known as the power wall, forced systems designers to shift
their focus from increasing speed on a single chip to using more processors working
together. The need to scale out (also known as horizontal scaling), rather than scale up
(faster processors), moved organizations from serial to parallel processing where data
problems are split into separate paths and sent to separate processors to divide and
conquer the work.
1.2.2 Velocity
Though big data problems are a consideration for many organizations moving away
from RDBMSs, the ability of a single processor system to rapidly read and write data is
also key. Many single-processor RDBMSs are unable to keep up with the demands of
real-time inserts and online queries to the database made by public-facing websites.
RDBMSs frequently index many columns of every new row, a process which decreases
system performance. When single-processor RDBMSs are used as a back end to a web
store front, the random bursts in web traffic slow down response for everyone, and tun-
ing these systems can be costly when both high read and write throughput is desired.
1.2.3 Variability
Companies that want to capture and report on exception data struggle when attempt-
ing to use rigid database schema structures imposed by RDBMSs. For example, if a
business unit wants to capture a few custom fields for a particular customer, all cus-
tomer rows within the database need to store this information even though it doesn’t
apply. Adding new columns to an RDBMS requires that the system be shut down and
ALTER TABLE commands be run. When a database is large, this process can impact
system availability, costing time and money.
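As a hedged illustration (the table and column names are hypothetical), capturing one custom field for a handful of customers still means altering the table that holds every customer:

-- Every existing row now carries the new column,
-- whether or not it applies to that customer.
ALTER TABLE customers ADD COLUMN preferred_contact_time VARCHAR(40);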
1.2.4 Agility
The most complex part of building applications using RDBMSs is the process of putting
data into and getting data out of the database. If your data has nested and repeated
subgroups of data structures, you need to include an object-relational mapping layer.
The responsibility of this layer is to generate the correct combination of INSERT,
UPDATE, DELETE, and SELECT SQL statements to move object data to and from the
RDBMS persistence layer. This process isn’t simple and is associated with the largest bar-
rier to rapid change when developing new or modifying existing applications.
Generally, object-relational mapping requires experienced software developers
who are familiar with object-relational frameworks such as Java Hibernate (or NHiber-
nate for .Net systems). Even with experienced staff, small change requests can cause
slowdowns in development and testing schedules.
You can see how velocity, volume, variability, and agility are the high-level drivers
most frequently associated with the NoSQL movement. Now that you’re familiar with
these drivers, you can look at your organization to see how NoSQL solutions might
impact these drivers in a positive way to help your business meet the changing
demands of today’s competitive marketplace.
Table 1.2 The key case studies associated with the NoSQL movement—the name of the case
study/standard, the business drivers, and the results (findings) of the selected solutions

LiveJournal’s Memcache—Driver: need to increase performance of database queries.
Finding: by using hashing and caching, data in RAM can be shared, which cuts down the
number of read requests sent to the database, increasing performance.

Google’s MapReduce—Driver: need to index billions of web pages for search using
low-cost hardware. Finding: by using parallel processing, indexing billions of web pages
can be done quickly with a large number of commodity processors.

Google’s Bigtable—Driver: need to flexibly store tabular data in a distributed system.
Finding: by using a sparse matrix approach, users can think of all data as being stored in
a single table with billions of rows and millions of columns, without the need for
up-front data modeling.

Amazon’s Dynamo—Driver: need to accept a web order 24 hours a day, 7 days a week.
Finding: a key-value store with a simple interface can be replicated even when there are
large volumes of data to be processed.
Figure 1.2 The map and reduce functions are ways of partitioning large datasets into
smaller chunks that can be transformed on isolated and independent transformation
systems. The key is isolating each function so that it can be scaled onto many servers.
The use of MapReduce inspired engineers from Yahoo! and other organizations to
create open source versions of Google’s MapReduce. It fostered a growing awareness
of the limitations of traditional procedural programming and encouraged others to
use functional programming systems.
Many in the NoSQL community cite Amazon’s Dynamo paper as a significant turn-
ing point in the movement. At a time when relational models were still used, it chal-
lenged the status quo and current best practices. Amazon found that because key-
value stores had a simple interface, the data was easier to replicate and the system
more reliable. In the end, Amazon used a key-value store to build a turnkey system that was reli-
able, extensible, and able to support their 24/7 business model, making them one of
the most successful online retailers in the world.
Now let’s see how Sally applies her knowledge in two examples. In the first exam-
ple, a group that needed to track equipment warranties of hardware purchases came
to Sally for advice. Since the hardware information was already in an RDBMS and the
team had experience with SQL, Sally recommended they extend the RDBMS to
include warranty information and create reports using joins. In this case, it was clear
that SQL was appropriate.
In the second example, a group that was in charge of storing digital image infor-
mation within a relational database approached Sally because the performance of the
database was negatively impacting their web application’s page rendering. In this case,
Sally recommended moving all images to a key-value store, which referenced each
image with a URL. A key-value store is optimized for read-intensive applications and
works with content distribution networks. After removing the image management
load from the RDBMS, the web application as well as other applications saw an
improvement in performance.
Note that Sally doesn’t see her job as a black-and-white, RDBMS versus NoSQL
selection process. Sometimes the best solution involves using hybrid approaches.
Summary
This chapter began with an introduction to the concept of NoSQL and reviewed the
core business drivers behind the NoSQL movement. We then showed how the power
wall forced systems designers to use highly parallel processing designs and required a
new type of thinking for managing data. You also saw that traditional systems that use
object-middle tiers and RDBMS databases require the use of complex object-relational
mapping systems to manipulate the data. These layers often get in the way of an orga-
nization’s ability to react quickly to changes (agility).
When we venture into any new technology, it’s critical to understand that each
area has its own patterns of problem solving. These patterns vary dramatically from
technology to technology. Making the transition from SQL to NoSQL is no different.
NoSQL is a new paradigm and requires a new set of pattern recognition skills, new
ways of thinking, and new ways of solving problems. It requires a new cognitive style.
Opting to use NoSQL technologies can help organizations gain a competitive edge
in their market, making them more agile and better equipped to adapt to changing
business conditions. NoSQL approaches that leverage large numbers of commodity
processors save companies time and money and increase service reliability.
As you’ve seen in the case studies, these changes impacted more than early tech-
nology adopters: engineers around the world realize there are alternatives to the
RDBMS-as-our-only-option mantra. New companies focused on new thinking, technol-
ogies, and architectures have emerged not as a lark, but as a necessity to solving real
business problems that don’t fit into a relational mold. As organizations continue to
change and move into global economies, this trend will continue to expand.
As we move into our next chapter, we’ll begin looking at the core concepts and
technologies associated with NoSQL. We’ll talk about simplicity of design and see how
it’s fundamental to creating NoSQL systems that are modular, scalable, and ultimately
lower-cost to you and your organization.