
Save 50% on these books and videos – eBook, pBook, and MEAP.
Enter meub50 in the Promotional Code box when you check out. Only at manning.com.

Making Sense of NoSQL
by Daniel G. McCreary and Ann M. Kelly
ISBN 9781617291074
312 pages
$27.99

Learn SQL Server Administration in a Month of Lunches
by Don Jones
ISBN 9781617292132
256 pages
$27.99

Learn Amazon Web Services in a Month of Lunches
by David Clinton
ISBN 9781617294440
328 pages
$31.99



Understanding Databases
Edited by David Clinton

Manning Author Picks

Copyright 2019 Manning Publications


To pre-order or learn more about these books go to www.manning.com



For online information and ordering of these and other Manning books, please visit
www.manning.com. The publisher offers discounts on these books when ordered in quantity.

For more information, please contact

Special Sales Department
Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964
Email: Erin Twohey, [email protected]

©2019 by Manning Publications Co. All rights reserved.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in
any form or by means electronic, mechanical, photocopying, or otherwise, without prior written
permission of the publisher.

Many of the designations used by manufacturers and sellers to distinguish their products are
claimed as trademarks. Where those designations appear in the book, and Manning
Publications was aware of a trademark claim, the designations have been printed in initial caps
or all caps.

Recognizing the importance of preserving what has been written, it is Manning’s policy to have
the books we publish printed on acid-free paper, and we exert our best efforts to that end.
Recognizing also our responsibility to conserve the resources of our planet, Manning books
are printed on paper that is at least 15 percent recycled and processed without the use of
elemental chlorine.

Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964

Cover designer: Leslie Haimes

ISBN: 9781617297663
Printed in the United States of America
1 2 3 4 5 6 7 8 9 10 - EBM - 24 23 22 21 20 19



contents
introduction iv

MANAGING DATABASES 1
Managing databases
Chapter 4 from Learn SQL Server Administration in a Month of Lunches 2

DATABASES ON AWS 12
Databases on AWS
Chapter 4 from Learn Amazon Web Services in a Month of Lunches 13

NOSQL: IT’S ABOUT MAKING INTELLIGENT CHOICES 27
NoSQL: It’s about making intelligent choices
Chapter 1 from Making Sense of NoSQL 28
index 39



introduction
More data is being collected, stored, and analyzed these days than ever before, and the
trend shows no sign of slowing down: businesses rely on customer data to glean valuable
insights that enable them to tailor product recommendations and predict market
trends; healthcare organizations need to keep important information like patient
health history and insurance data; the financial sector, whether it be your local bank or
the stock market, tracks millions of transactions daily; social media platforms continu-
ously cross-reference user data to customize news feeds and suggest friends, products,
and businesses. And those are just a handful of the ways data is being used globally!
Managing all of this data requires powerful databases, and choosing the right one
for your task is vital. This book combines three chapters from Manning books to help
you understand the different kinds of databases available, their capabilities and limita-
tions, and which option best suits your needs.
Chapter 4 from Learn SQL Server Administration in a Month of Lunches is all about
“Managing Databases.” Author and Microsoft MVP Don Jones explains how to config-
ure database options, how to detach and attach databases when you need to move or
copy your database, and how to assess database storage. In this chapter, you’ll also
learn about the four types of system databases in SQL Server.
Chapter 4, “Databases on AWS”, from my own book Learn Amazon Web Services in a
Month of Lunches explores how to choose a database architecture with an emphasis on
how and why to move your database away from a WordPress instance to run inde-
pendently in its own environment. You’ll take a look at relational databases, NoSQL
databases, infrastructure design, and the factors to consider when making your archi-
tecture choice.
Chapter 1 from Making Sense of NoSQL by Daniel G. McCreary and Ann M. Kelly
showcases NoSQL databases: what they are, what they are not, and how the demands
of volume, velocity, variability, and agility drive NoSQL business solutions. It also
includes interesting real-world case studies illuminating quick, effective, and econom-
ical out-of-the-box NoSQL solutions.


In these chapters, you’ll find straightforward examples that highlight important
database aspects as well as many clearly illustrated hands-on exercises that provide
valuable firsthand experience. This quick-start guide is an excellent way to begin
learning the essentials for understanding databases, how to choose them, and how to
use them!



Managing databases

In this chapter, we’ll cover the database container itself, configuration
options, and some basics of how to manage databases.



Chapter 4
Chapter 4 from Learn SQL Server Administration in a Month of Lunches by Don Jones

Managing databases

Databases are the basic unit of management and work within SQL Server. A data-
base is an almost entirely self-contained package that bundles security settings, con-
figuration settings, your actual data, and much more. That makes databases a good
place to start your education in SQL Server maintenance. Some of the things you’ll
look at in this chapter will need a much more complete explanation, which will
come in later chapters; the goal right now is to focus on the database container
itself. We’ll be covering database configuration options and some basics of how to
manage databases.

4.1 Configuring database options


If you don’t have it open, get SQL Server Management Studio on your screen.
Right-click a database, such as AdventureWorks2012, and select Properties. You
should be looking at the General page of the Database Properties dialog, as shown
in figure 4.1.
There’s not really anything to change on this tab, but it does give you a quick
glance at who owns it, when it was last backed up, and how big it is—all useful infor-
mation. The next page is Files, and this is where we can start assessing database
storage. You’ll do that in an upcoming section of this chapter, so just remember
where you saw the information. That goes for the Filegroups page as well.
The other main page we need to look at right now is the Options page. Here’s
what you should see:


Figure 4.1 The General tab includes basic information about the database.

 Collation defaults to the collation setting of the SQL Server instance (and you
incur a performance hit if you select a collation other than that). It refers to the
way SQL Server treats data for sorting and comparing purposes. For example, in
Lithuania the letter “Y” comes between “I” and “J,” so SQL Server would need to
know if you wanted to use those rules. The “Latin” collations work well for
English and most Romance languages.
 Recovery model determines how the database can be recovered. Production data-
bases should use the Full model; nonproduction databases, or databases that
are primarily read-only, may use Simple. You’ll learn more about these in the
next chapter.
 Compatibility level can be used to change the behavior of the database to corre-
spond with a previous version of SQL Server. This is mainly used when you
migrate a database from an older version to a newer version.


 Containment type specifies exactly how standalone the database is. Normally,
databases have a few dependencies on certain instance-wide objects, meaning
you can’t easily move only the database to a different instance. A contained data-
base has fewer, or zero, dependencies, making it more standalone—but less
centrally managed. Chapter 6 will go into some of these details.
 Auto Close should, in general, be False. This specifies that SQL Server should
close the physical database files when the database isn’t being used. When
someone does try to use it, there will be a delay as SQL Server opens the file. For
a large database, or on a busy server, that delay can be significant.
 Auto Create Statistics should usually be True. This tells SQL Server to automati-
cally create certain statistical information used to optimize query performance.
You’ll learn a lot more about statistics in several upcoming chapters, and you’ll
also learn about the Auto Update Statistics and Auto Update Statistics Asyn-
chronously options.
 Auto Shrink is typically set to False. Setting it to True causes SQL Server to peri-
odically try to make the database files smaller by returning unused space to the
OS. Later in this chapter, you’ll learn why that’s often a bad idea.

Those are the main settings that you need to focus on for now. There are some others
that will come up in discussion in upcoming chapters, so just remember where you
found the database properties, and how to get back to that dialog.
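
If you prefer a query window to the dialog, the same options can be inspected and changed with T-SQL. Here’s a minimal sketch using the AdventureWorks2012 database from the examples above; these are standard sys.databases columns and ALTER DATABASE options, but confirm them against your SQL Server version’s documentation:

-- Inspect the current settings for one database
SELECT name, collation_name, recovery_model_desc, compatibility_level,
       is_auto_close_on, is_auto_create_stats_on, is_auto_shrink_on
FROM sys.databases
WHERE name = 'AdventureWorks2012';

-- Apply the recommendations discussed above
ALTER DATABASE AdventureWorks2012 SET RECOVERY FULL;
ALTER DATABASE AdventureWorks2012 SET AUTO_CLOSE OFF;
ALTER DATABASE AdventureWorks2012 SET AUTO_SHRINK OFF;
ALTER DATABASE AdventureWorks2012 SET AUTO_CREATE_STATISTICS ON;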

4.2 Detaching and attaching databases


When you need to move a database between instances, or even copy a database—per-
haps to a test environment—detaching and attaching can be an easy way to do it.
Detaching tells SQL Server to close the database and forget that it exists. At that point,
the database is just some files on disk. You can copy them, move them, or do whatever
else you want.
Attaching is the opposite—and you’ve already done it if you set up your lab envi-
ronment, since you attached the AdventureWorks database. Attaching “connects” the
database to a SQL Server instance, so that administrators and users can work with the
database.
To detach a database, right-click it, select Tasks, then select Detach…. As shown in
figure 4.2, you have two options.
Drop Connections, the first option, will forcibly disconnect anyone using the data-
base. That could result in a user being upset, so you don’t want to do that on a produc-
tion database in the middle of the workday. Update Statistics, the second option, tells
SQL Server to make one last update of the database’s performance statistics before
detaching. That way, when the database is attached to a new instance, it’s ready to go
right away—provided the new instance is the same version of SQL Server; statistics
aren’t compatible across versions.


Figure 4.2 You have two options when detaching a database.

You should already have seen how to perform an attach operation; if not, turn to the
appendix and follow the lab setup procedure.
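
Both operations can also be scripted. The chapter itself uses the SSMS dialogs, so treat this as a sketch; the file paths are placeholders you’d replace with your database’s real locations:

-- Detach; @skipchecks = 'false' performs the same last-minute
-- statistics update as the Update Statistics checkbox
EXEC sp_detach_db @dbname = 'AdventureWorks2012', @skipchecks = 'false';

-- Attach the files to an instance
CREATE DATABASE AdventureWorks2012
ON (FILENAME = 'C:\Data\AdventureWorks2012_Data.mdf'),
   (FILENAME = 'C:\Data\AdventureWorks2012_Log.ldf')
FOR ATTACH;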

ABOVE AND BEYOND


There’s another way to copy a database that you should be aware of, although we’re
not going to use this technique in this book. It’s called the Copy Database Wizard,
and you start it by right-clicking the database, choosing Tasks, then selecting Copy
Database….
This Wizard is capable of copying a database to a different instance, with the advan-
tage of not taking the database offline as a Detach would do. The Wizard is also capa-
ble of copying the instance-level dependencies that a database may need to function
properly.


4.3 Assessing database storage


Go back to the Files page in the database properties. Every SQL Server database has at
least two, and possibly many more, files:
 Files with a .MDF filename extension are a database’s primary files. This is where
everything in the database is normally stored. Every database has a .MDF file. In
cases where you have multiple database files (see .NDF files in the third item of
this list), the .MDF file will always store the database’s internal configuration
information and other internal system data.
 A .LDF filename extension indicates the database’s transaction log file. This file
serves several important operational and recovery purposes, and you’ll learn
about it in the next chapter. Every database has one of these, and only one,
although databases in the Simple recovery model automatically empty the log
so it’ll usually never be very large. We’re going to discuss the Simple model in
the next chapter, and I’ll share more details about it then.
 Databases can also have zero or more .NDF, or secondary data, files. These are
usually created as a tactic to improve performance or to make backing up a
large database more manageable.
Start your assessment by figuring out what files you have, and where they are physically
stored. If possible, work with a server administrator to understand how the server is
accessing the files. For example, if .LDF files are stored on one Storage Area Network
(SAN), and .MDF files are on another, try to discover if the server is accessing both
SANs by means of a single network adapter, or if it’s using different adapters for each.
Those answers will become important in the next section.
Also try to learn more about what type of storage is being used. Are files being writ-
ten to a single local disk? A local array, or mirror? If they’re being written to a SAN,
how is the SAN configured? Are there any solid-state drives (SSDs) in use? Again, these
are all important answers as you try to identify any potential concerns with the storage
subsystem.
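
A quick way to start that inventory is to query the instance directly. As a sketch, sys.master_files is the standard catalog view for this; note that size is reported in 8 KB pages:

-- List every file for every database, with type, location, and size
SELECT d.name AS database_name,
       f.name AS logical_name,
       f.type_desc,               -- ROWS (.MDF/.NDF) or LOG (.LDF)
       f.physical_name,
       f.size * 8 / 1024 AS size_mb
FROM sys.master_files AS f
JOIN sys.databases AS d ON d.database_id = f.database_id
ORDER BY d.name, f.type_desc;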

4.4 Identifying potential performance concerns in storage


Storage is the absolute single biggest bottleneck for SQL Server performance. As such,
one of your biggest concerns should be whether or not the storage layout is optimized
for performance. Storage is also one of the biggest potential failure points, since disk
drives do indeed fail every now and then. That means reliability and recoverability will
be another concern. Your final concern relates to whether or not you’re able to effi-
ciently back up and restore the database, should the need for restoration arise. How
your database storage is laid out can affect that significantly.

4.4.1 Problems with file layout


Let’s start with a few major problem areas. You should make sure that, unless there’s a
good reason, none of your databases have any of the following situations:


 Colocated log and data files. For both performance and reliability reasons, you
want your data (.MDF and .NDF) files separated from your log (.LDF) files. Sepa-
rated ideally means on different physical disks.
 Unnecessary secondary files. If your database is split into one .MDF and one or
more .NDF files, and they’re all located on the same storage volume, then those
secondary files may be unnecessary. On a huge database, the .NDF files may
serve to break up the backup tasks into more manageable chunks; on smaller
databases, there’s no reason to have .NDF files if they’re going to live on the
same disk volume as the .MDF file.
 Heavy disk fragmentation. Use Windows’ disk defragmentation tools to look at the
level of physical disk fragmentation. Ideally, you want SQL Server’s database files
to come from a single contiguous block of space on whatever disk volumes they
live on. Anything else slows down SQL Server. Note that you can’t defragment
these files without either detaching the database or shutting down the SQL
Server instance.
The last thing to look at is a bit more of a judgment call, and that’s how your database
uses any .NDF files it may have. When you create a .NDF file, you choose which data-
base objects are moved to that file. Similarly, when a developer creates a database
object, they pick which file it lives in. Those objects can include tables, which house all
of your actual data, and programming objects like views, stored procedures, and the
like. If your goal is to improve performance, or to split up a large database to make
backups more manageable, then you can’t just move any old files into a .NDF: you
have to be strategic.
Let’s say your database includes two heavily used tables, A and B. There are a
bunch of other tables, but those two get the majority of the traffic. Moving A and B to
separate .NDFs, while leaving everything else in the main .MDF, can be a good way to
improve performance if—and only if—the three files live on separate disk volumes,
and ideally if those disk volumes are accessed by different controllers (either actual
disk controller cards for local storage, or SAN adapters for SAN storage). That way, SQL
Server can access each of the three without the other two getting in the way.
Imagine that you run a freight train. You have three boxcars full of merchandise
that you need to get to three different destinations. One way to do so would be to tack
all three boxcars behind a single engine, and run it to each of the three destinations.
That’d be pretty slow, and it’s what SQL Server does when all of your data is in a single
.MDF file.
Another approach would be to attach each boxcar to a different engine. If they’re
all sharing track for a portion of the journey, you’re really not getting much benefit.
This is like having multiple .NDF files, but only one controller channel to the storage.
The best approach would be to have three boxcars, three engines, and three sepa-
rate tracks. Everyone can go their own way, without worrying about what the others
are doing. This is the same as having multiple .NDFs, with different controller chan-
nels to each storage volume.


Modern SANs can help alleviate the need to have multiple .NDFs. SANs are fast, so
controller channel contention may not be a problem. SANs inherently spread your
data across many different physical disks, further reducing contention and improving
performance. With a fast, well-built SAN, you may not need .NDFs for performance
purposes.

4.4.2 Problems with file size


Sizing is also something to look at. Go back to the database properties dialog’s Files
tab and look at the properties for each file. You’ll notice that each has a current size,
as well as a growth option. Most folks leave their files set to autogrow by some size or
percentage. Autogrowth is a bad thing, although you may want to leave it on for spe-
cific reasons. Here’s why it’s bad:
 Autogrowth puts a significant burden on a server, and when it happens during
production hours your users will be impacted.
 Autogrowth results in the physical fragmentation of the database files, which
slows down database access. This is because disk space is requested from the OS
in chunks (set by the autogrowth size you’ve configured), and those chunks
won’t be all lined up in a contiguous space.
 Autogrowth isn’t managed. Either you’ll allow unlimited growth, which will
eventually consume available disk space, or you’ll cap the growth, forcing you to
deal with the capacity problem anyway.
Without autogrowth, won’t your database eventually fill up and stop working? No—not
if you’re monitoring it. Growth is fine. Databases grow. When a database starts to need
more room, you should plan to manually increase its size during a scheduled mainte-
nance window, to avoid the downsides of autogrowth. That maintenance window can
involve taking the database (or the entire SQL Server instance) offline, so that Win-
dows can defragment the physical layout of the database file. In other words, manage
your database sizes. You can do that right from the properties dialog box. Some
administrators prefer to leave autogrowth turned on as a kind of last-ditch safety net—
it’s a way to ensure the application doesn’t stop working unexpectedly. I personally
don’t like to rely on that safety net, because, in my experience, autogrowth has caused
me a lot of grief over the years.
Autogrowth on a log file is also bad. Typically, your log files should remain at a
more or less constant size. The backup process should empty out the log, leaving it
ready for new entries. Your log may need to grow if there’s an increase in overall work-
load, meaning the log has more to keep track of between backup operations, but once
it hits its new “comfort zone” size, the growth should stop. Again, that growth should,
as much as possible, be managed manually.
Autogrowth is a good last-ditch emergency measure (which is why some folks leave
it on), but your goal should be for it to never happen. So how can you properly size
your database? You do so through monitoring. Pay attention to how much of the
database file is being used, and monitor that figure. Get a feel for how much larger a
database grows every day, for example, and you’ll be able to predict when it will need
more room. You’ll also know about how much room to give it to have it last for a speci-
fied number of additional days. To perform that monitoring, you can pop into SQL
Server Management Studio every few days and look at the databases’ properties. You’ll
see the file sizes, and, if you like, you can record them in an application such as Micro-
soft Office Excel. That way, you can have the spreadsheet generate charts that show
you database growth over time. Charts help me visualize when the database size is
going to become too large for the disk it’s on, for example.
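
If you’d rather collect those numbers with a query than from the properties dialog, here’s a minimal sketch; FILEPROPERTY reports used space in 8 KB pages, and the logical file name in the ALTER DATABASE statement is a placeholder:

-- How full is each file in the current database?
SELECT name,
       size * 8 / 1024 AS size_mb,
       FILEPROPERTY(name, 'SpaceUsed') * 8 / 1024 AS used_mb
FROM sys.database_files;

-- Grow a file manually during a maintenance window,
-- instead of waiting for autogrowth to kick in
ALTER DATABASE AdventureWorks2012
MODIFY FILE (NAME = 'AdventureWorks2012_Data', SIZE = 4096MB);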

4.4.3 Filegroups
We haven’t yet discussed filegroups. If your database consists of a single .MDF file (and a
.LDF file), you won’t use filegroups. Filegroups come into play when you have one or
more .NDF files. In that scenario, filegroups act purely as a convenience. They let you
target specific operations—primarily backups and restores—to a single filegroup,
thereby affecting a bunch of files.
Let’s say that, for performance reasons, you’ve split your database across three
.NDF files in addition to the main .MDF file. Each of those four files is located on a
separate disk volume, and each volume is accessed by a different disk controller.
That’s great for parallel operations, and it should help performance in the system. But
now you have to back up and recover four files, making a lot more work for yourself.
Instead, you can group those into a single filegroup, and then run backups against
that filegroup, grabbing all four files at once.
On a huge database, you might have many different .NDF files to spread out the
data. If, taken altogether, they’re too large to back up in a single maintenance window,
you might group them into different filegroups. That way, you can back them up sepa-
rately. Of course, recovering becomes a lot more complicated, since you won’t have a
single point-in-time backup that includes everything. You’ll dive more into this topic
in the next chapter.
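
To make the mechanics concrete, here’s a hedged sketch of creating a filegroup, placing a secondary file in it, and backing up just that filegroup; every name and path below is a placeholder:

-- Create the filegroup and add a .NDF file to it
ALTER DATABASE AdventureWorks2012 ADD FILEGROUP FastStorage;

ALTER DATABASE AdventureWorks2012
ADD FILE (NAME = 'AW2012_Fast1',
          FILENAME = 'E:\Data\AW2012_Fast1.ndf',
          SIZE = 1024MB)
TO FILEGROUP FastStorage;

-- Back up only that filegroup
BACKUP DATABASE AdventureWorks2012
FILEGROUP = 'FastStorage'
TO DISK = 'F:\Backups\AW2012_FastStorage.bak';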

4.5 System databases


We should spend a few minutes discussing the four system databases in SQL Server.
 Master is where SQL Server stores its own configuration information. This data-
base doesn’t change often—mainly when you add or remove a database, or
make a server-wide configuration change—but you should back it up regularly.
There are special procedures for restoring this database, and for moving it to a
different location. Consult SQL Server’s documentation for instructions.
 Model is a template from which all new databases are created. Unless you’ve
made changes to it (which is rare), there’s no real need to back it up or move it.
 TempDb is a temporary database. This should be located on a fast storage vol-
ume, ideally separate from the volumes used by your databases. It doesn’t need
to be backed up (SQL Server makes a new one each time it starts, anyway), and

Licensed to Alexander Reyes <[email protected]>


10 CHAPTER 4 Managing databases

its disk volume doesn’t need to be redundant (meaning it doesn’t necessarily


need to be on a disk array). Many organizations install a local SSD (often chosen
because they’re fast) and put TempDB on it.
 MSDB is technically a system database, but it’s really used just like any of your
databases. SQL Server Agent uses it for all of its tasks, which you’ll learn about
in several upcoming chapters. MSDB should be backed up, and treated more or
less like any of your own databases.
TempDB is worth discussing a bit more. There are a number of different query opera-
tions that might need to use some space in TempDB, and there are certain administra-
tive options that use it as well. Developers can explicitly put things into TempDB as
part of their code. It’s used for lots of things. In terms of properly maintaining SQL
Server, you don’t need to worry so much about what it’s used for as you do about how
to position it properly for maximum performance. That’s something we’ll touch on in
several upcoming chapters.
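
As a preview of what positioning TempDB properly can look like, its files can be relocated to a faster volume with ALTER DATABASE; the change takes effect at the next instance restart. This is only a sketch: tempdev and templog are the default logical file names, and the target paths are placeholders:

ALTER DATABASE tempdb
MODIFY FILE (NAME = tempdev, FILENAME = 'S:\TempDB\tempdb.mdf');

ALTER DATABASE tempdb
MODIFY FILE (NAME = templog, FILENAME = 'S:\TempDB\templog.ldf');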

4.6 An overview of storage hardware


It’s almost impossible to talk about SQL Server performance without talking about
disk storage, because the disk system contributes most of SQL Server’s performance
overhead. Since I brought up the topic of arrays and SSDs, I will quickly review some of
those options for folks that might not be familiar with them.
A disk is a basic unit of storage hardware. SQL Server is completely capable of run-
ning itself from a single disk, provided it’s big enough to store the data you’ll be keep-
ing. But a single disk can only do so much in a given period of time, and it can quickly
become a performance bottleneck. Imagine asking someone to read the contents of
this book aloud: their mouth is a single output channel, and it’s going to take some
time for them to stream all of those words out into the air.
An array is a group of disks that work together. There are many different types of
arrays. A mirror, for example, is one or more disks that retain an exact copy of another
set of disks. Mirrors provide redundancy, meaning if the first group of disks dies, you’ve
got a live backup ready to go. Stripe arrays take data and spread it across multiple disks,
reducing the bottleneck that a single disk would create. So instead of having one disk
reading and writing your data, you’ve got several disks working together to do so.
Some of those arrays can include parity information, which helps provide redundancy
for the data on the array. With parity, you could have one or more disks (depending
on how the array is set up) fail, and not lose any data.
An SSD is a solid-state disk, which stores data in microchips rather than on mag-
netic platters. They’re generally much faster than disks, because they don’t have
mechanical bits that have to spin and move.
A SAN is a big box full of disk arrays that sits on the network. Servers are assigned
pieces of the SAN’s total storage, and they use their pieces just like they’d use a locally
attached set of disks. SANs are a way of centralizing disks for management purposes.
SANs are nearly always configured as large arrays, providing resilience against the failure
of individual disks inside the SAN. Rather than plugging in disks using a copper cable,
you connect a server to the SAN over a network of some kind. Most SQL Server
instances keep their data on a SAN, because the SAN itself can provide a high level of
both performance and reliability. SANs are commonly managed by specialized admin-
istrators, so as a SQL Server administrator you might have to work with them to under-
stand how your servers’ data is being physically stored.

4.7 Hands-on lab


Let’s practice what you’ve just read about in this chapter. Go to MoreLunches.com,
click this book’s cover image, and find the Database Inventory Sheet download. Com-
plete one sheet for each database on one of your SQL Server instances. If you don’t
have a production environment to inventory, complete the worksheet for your Adven-
tureWorks database instead.



Databases on AWS

This chapter explores how and why to move your database away from the
WordPress instance to run independently in its own environment. You’ll also
learn about relational databases, NoSQL databases, and planning your
infrastructure design.



Chapter 4
Chapter 4 from Learn Amazon Web Services in a Month of Lunches by David Clinton

Databases on AWS

As you saw in the previous chapter, WordPress stores in a database all the bits and
pieces that make up your website. But of course, this approach isn’t limited to
WordPress: it would be hard to imagine any public-facing application of even mini-
mal complexity that didn’t rely on structured data of one sort or another. Working
on an application? Learn to love databases. The coming pages explore how to
choose a database architecture and how (and why) to move your database away
from the WordPress instance to run independently in its own environment.

4.1 The database


Just in case you haven’t yet been formally introduced, I’ll take a moment to explain
what a database does. A database is software that’s good at reading digital informa-
tion and then reliably storing it in a structured format so that the information can
later be efficiently retrieved in useful formats and combinations.
Imagine that your business keeps records of all your customers, including their
names, addresses, and previous purchases. From time to time, you’ll probably want
access to that information. Perhaps you need an address so you can mail an invoice,
or maybe you’d like to analyze your data to look for correlations between street
addresses and purchasing patterns.


4.2 Choosing the right database model


The data world can get really complex, really fast. In practical terms, though, it’s fair
to say that most projects can be successfully served by one of two database models:
relational (SQL) and NoSQL.

4.2.1 Relational databases


If you need your data organized in ways that carefully define how various categories of
information relate to each other, then a relational database may be what you’re after.
Think of it in terms of a business that, say, has to manage its employees in the context
of the jobs they do, the way they’re paid, and their health insurance status. Data
related to each employee appears in each of those categories but, at the same time,
may not be accessible to other users beyond what’s individually necessary.
Relational databases are often managed by one flavor or another of the SQL stan-
dard. SQL stands for Structured Query Language, and the “structured” part of that tells
most of the story. An SQL-type database (leading examples of which include Oracle,
MySQL, PostgreSQL, Microsoft’s SQL Server, and, more recently, Amazon’s Aurora) is
made up of tables, which, in turn, contain records (or, as some call them, rows). Records
are made up of individual values known as fields. Thus, the contents of a database of
customer information could be represented this way:

ID   Name         Address         City       # of purchases
1    John Doe     123 Any St.     Yourtown   5
2    Jane Smith   321 Yna Ave.    Hertown    2

Here, the database has records identified by the numbers 1 and 2, and each record
contains fields made up of a name, an address, and a number of purchases.
Perhaps the key benefit of this kind of strong structure is that it allows high levels
of predictability and reliability, because carefully defined rules can be applied to all
transactions affecting your data. You can, for example, apply constraints to the way
users of an application can access the database, to ensure that two users aren’t trying
to write changes to a single record at one time (which could lead to data corruption).
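
To make that concrete, here’s what the customer table above might look like in SQL. The schema is purely illustrative; the PRIMARY KEY constraint is one example of the carefully defined rules just mentioned:

CREATE TABLE customers (
    id            INT          NOT NULL PRIMARY KEY,
    name          VARCHAR(100) NOT NULL,
    address       VARCHAR(200),
    city          VARCHAR(100),
    num_purchases INT          DEFAULT 0
);

INSERT INTO customers (id, name, address, city, num_purchases)
VALUES (1, 'John Doe',   '123 Any St.',  'Yourtown', 5),
       (2, 'Jane Smith', '321 Yna Ave.', 'Hertown',  2);

-- Structured data supports structured questions
SELECT name, city FROM customers WHERE num_purchases > 2;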

4.2.2 NoSQL databases


Times change. Across all segments of the IT world, data is being produced, consumed,
and analyzed in volumes and at speeds that weren’t anticipated when the first relational
database was designed more than a generation ago. Imagine, for instance, that you’re
building a wholesale business that needs to handle constantly changing product descrip-
tions and inventory information for tens of thousands of items; that data, in turn, must
be integrated with sales, shipping, and customer service operations.
In such a case, you’ll likely want to use a NoSQL database (such as AWS’s Dyna-
moDB): highly flexible relationships between NoSQL data elements allow much simpler
integration of data stored across multiple clients and in different formats. This
makes it possible to easily accommodate fast-growing data sources.
Despite what you may think, some people argue that NoSQL stands not for No SQL or
Not SQL, but rather for Not Only SQL. That’s because these databases can sometimes sup-
port SQL-like operations. In other words, you can sometimes trick a NoSQL database
into providing functionality similar to what you might expect from a relational database.
If you’d like more complete insights into NoSQL and how it fits into the larger
spectrum of database models, the AWS document “What Is NoSQL?” should be help-
ful: http://aws.amazon.com/nosql.

4.2.3 How to choose


The database architecture you choose for a project will often depend on the specific
needs of your application. For instance, if you’re running financial transactions, and,
because of the overriding need for absolute accuracy and consistency, it’s critical that
a single record can never have more than one value, you’ll probably opt for a rela-
tional platform. Just imagine the chaos that would result if all the money in a particu-
lar account was withdrawn by two concurrent client sessions.
On the other hand, suppose you host a popular online multiplayer game. If being
able to quickly update data points can make all the difference for player experience,
and the occasional write failure won’t cause a zombie apocalypse, you’ll definitely
want to consider NoSQL.

4.3 Infrastructure design: where does your database belong?
The WordPress project you created in chapter 3 was built on top of a LAMP server. As
you no doubt remember, that means WordPress stored all of its data on the MySQL
database you installed on the same EC2 instance you used for WordPress itself. That’s a
common way to do things for a lighter, less mission-critical deployment like a tempo-
rary demo website or a hobby blog. But although such single-machine arrangements
can be simpler and inexpensive, simplicity and short-term cost savings aren’t always
the only goals.
Here are some reasons you might want to install and run a database off-instance (on
its own dedicated server):
 Security —Although you usually want the contents of a website or application to
be open to anyone out for a stroll on the internet, that won’t be the case for
your database. Wherever possible, it should be protected from outside access.
Think what might happen if everything Google knows about its billions of cus-
tomers was exposed to the public eye (it’s scary enough that Google knows
those things). Isolating your various resources from each other on completely
separate machines can make it a lot easier to open up what needs to be open
and close off the rest. Databases usually have significantly different access pro-
files than applications, so they’re perfect candidates for this kind of separation.


 Data accessibility —It’s common to launch more than one server as part of a sin-
gle application. This can be because each provides a unique service, but it’s usu-
ally either so you can duplicate your content to protect against the failure of any
single server, or to accommodate growing user demand. In any case, when you
have multiple application servers using the same data, it’s often a good idea to
keep your database separate.
 Hardware —Web or application servers often consume compute resources differ-
ently than databases. The former may rely heavily on the power of a strong,
multicore CPU, whereas the latter may thrive on super-fast, solid-state drives. It’s
always nice to let everyone play with their favorite toys.
 Software —Suppose you have an application that requires a Windows server, but
you want your data kept on a Linux machine. Even if, technically, it can be done
using the magic of virtualization, you may want to avoid the extra complications
involved in running both OSs on a single server.
 AWS RDS —We’re back at last to the AWS in Learn AWS in a Month of Lunches.
Amazon Relational Database Service (RDS) provides a fully managed database
solution that can be easily integrated into EC2-based applications. Managed
means that Amazon takes care of all the hardware and administrative worries
and gives you a single internet address (called an endpoint) through which to
access the resource. AWS provides you with a database (MySQL, Oracle, Aurora,
and so on) that’s guaranteed to be available, replicated (backed up to protect
against data loss through failure), and patched (the software is the latest and
greatest available). You can only get these features by off-loading your database.
Figure 4.1 is a simple illustration of how an on-instance versus a managed RDS
arrangement might work.
Figure 4.1 WordPress running on an EC2 instance and accessing a database either on-instance or
from a managed RDS instance


4.4 Estimating costs


Before I show you how to migrate the MySQL database of your existing WordPress
deployment to RDS, we should have a frank discussion about money. Running your data-
base on RDS rather than sharing your application instance will, obviously, incur some
extra costs. In fact, even selecting a light database instance like db.t2.small that will be
used 100% of the time will come with a monthly bill of around $25 US at current rates.
Swapping that for a db.m3.xlarge instance (which you’ll need to handle higher demand
loads) will increase those costs by about 10 times. Running a second instance to greatly
enhance reliability using Amazon’s Multi-AZ feature will, of course, double your costs.
On the other hand, maintaining a busy database on your EC2 instance may require
more computing horsepower to keep up without placing a possible drain on applica-
tion performance. Moving from a t2.medium EC2 instance to m4.large, for instance,
could add nearly $50 per month to your bill.
When planning your deployment, your job is to accurately estimate the kind of
usage you’ll encounter and what that will cost you on RDS, and then weigh the security
and performance benefits RDS can offer as they apply to you. To produce a decent
estimate, you’ll need to have an idea of what class of database instance you’ll be using,
how many hours of demand you’ll face each month, how much data you’ll need to
store, and how much data will be transferred across the internet or between your AWS
resources. As you can see from figure 4.2, the AWS Simple Monthly Calculator
(http://calculator.s3.amazonaws.com/index.html) is an excellent tool for producing
quick cost estimates for a variety of profile options. In particular, note the amount
listed as the title of the Estimate of Your Monthly Bill, which updates with every
change you make on the Services tab.

Figure 4.2 The RDS section of the AWS Simple Monthly Calculator lets you quickly try out alternative profiles.

Try it now
Navigate to the AWS Simple Monthly Calculator in your browser, click the Amazon
RDS tab on the left, and see how much a couple of RDS on-demand DB instances
would cost you. Play around with a range of estimates for the Data Transfer fields,
and see what kind of difference that can make.

To give you a sense of the instance class you should choose, AWS offers three type
families, each with its own set of member instance classes. Burst capable (db.t2)
instances are the cheapest; despite their relatively weak specs, they can provide brief
bursts of higher performance, which makes them a sensible fit for applications that
face only intermittent spikes in usage. Standard (db.m4) instances are cost effective
for sustained usage, but their balance of resources may not give you enough to hold up
against extreme demand. And I’ll bet you can figure out the use case for memory
optimized (db.r3) instances all on your own.
I’ll talk more about the Simple Monthly Calculator later, in chapter 9.

4.5 Migrating your database to RDS


Assuming you’ve decided a managed RDS-based relational database is what you’ve always
wanted for your birthday, you have to figure out how to build one—or, in this case, how
to take the database you’ve already built for your WordPress project and migrate it to RDS
without having to rebuild it. Here’s what you’re going to do to make this work:
1 Make a usable copy of your existing database, which, by default, is already popu-
lated and active. This is known as a database dump.
2 Head over to the AWS Console to create an RDS instance. Make sure a secure
connection between it and your EC2 instance is possible.
3 Upload your saved database dump to the RDS database server.

NOTE If your database is currently active and you don’t want any ongoing
transactions to be lost, then you’ll need to add some careful preparation to
this process—which would go far beyond the scope of this book. One excel-
lent tool to consider using to help ease the transition is the AWS Database
Migration Service (https://aws.amazon.com/dms/).

4.5.1 Creating a MySQL dump


You’re ready to start the dump of the database. From an SSH session in your EC2
instance, which you set up according to the instructions in chapter 3, run this single
command:
mysqldump -u wpuser -p wordpressdb > /home/ubuntu/mybackup.sql


This example assumes that the name of your MySQL user (identified by -u) is wpuser,
you want to be prompted for that user’s password (-p), the name of the database to be
dumped is wordpressdb, and you want the output saved to a file called mybackup.sql
in your user’s home directory. We’ll come back to this file once you’ve got an RDS
instance to host it.

TIP Forgot your username and password? This time you’re in luck. Both should
be available in plain text in the wp-config.php file you created in chapter 3.

NOTE If you’re working with the instance from chapter 3 after having
stopped and restarted it, it may have a new IP address. This may require you
to update values pointing to your site’s old IP address in the wp_options table
of your MySQL database. WordPress has directions at
https://codex.wordpress.org/Changing_The_Site_URL. If you’re not comfortable working with
databases, these edits might cost you more time and trouble than firing up a
brand-new instance.

4.6 Building an Amazon RDS instance


For now, it’s back to the AWS Console (http://console.aws.amazon.com) to launch an
RDS instance. As long as your account is eligible for the Free Tier, this won’t cost you
anything. Click the RDS link in the Database part of the Console page, and you’ll be
taken to the RDS dashboard. Assuming you don’t already have an RDS instance run-
ning, you’ll see a blue Get Started Now button.
Go ahead—click it. You know you want to.
As you can see in figure 4.3, you need to select a database engine. Your choice will,
to some degree, depend on your specific needs. For instance, older companies with
existing Oracle or Microsoft SQL Server databases will probably want to go with those
engines for the sake of compatibility. MariaDB is a community version of MySQL that,
by some accounts, boasts stronger security and far better support. And Aurora was
built by AWS and optimized for use in that environment. It’s your call; but considering
that your original EC2-based database uses MySQL, going with MySQL here should
keep things as simple as possible.
Once you’ve selected your database engine, the next screen (see figure 4.4) allows
you to choose an environment that best fits your needs:
 Production —An instance that will handle real customers or users
 Dev/Test —Something lighter-weight, to let you try out a possible setup

Because this is Amazon, you’re given one more chance to select its Aurora engine. If
you’re trying RDS for the first time, you should definitely choose the test version,
because by default, the profile you’re given is available under the Free Tier. As you’ll
soon see, you’re given only 5 GB of storage, but that’s usually more than enough for a
test database.


Figure 4.3 AWS currently supports these six powerful relational database engines.

Figure 4.4 Matching your project to the right environment


Figure 4.5 shows the database details page. I settled for the latest stable version of
MySQL—but you might want to use an older version if you’re having compatibility
problems with existing data. I went with a Free Tier db.t2.micro instance class
because, well, it’s free.

Figure 4.5 Defining your database configuration (release version, instance class, Multi-AZ, authentication, and so on)

I said “no” to Multi-AZ Deployment, but only because this is a test deployment and
the option carries extra cost. Normally, Multi-AZ (availability zone) is an excellent
choice: it replicates your instance in a second AZ so that even if one Amazon data cen-
ter suffers a catastrophic failure (like a total power blackout), your database will sur-
vive thanks to the instance running in the second. Because AWS manages the service,
the transition or failover between a dying instance and its live replacement is pretty
much instantaneous, and your users will probably never know anything happened.
A provisioned IOPS (input/output operations per second) drive—available
through the Storage Type drop-down—may be an attractive option if you’re looking
for extra-fast access to your data. You can enter the maximum amount of allocated
storage you’ll need, in gigabytes. For this example, your needs are minimal, so the
General Purpose option is fine.


The name you choose for DB Instance Identifier in the Settings section will be the
official name given to the database; you’ll use this database name later, in the wp-con-
fig.php file, when you’re ready to connect. You don’t have to use the same database
name you used in your original, local configuration, but you may find it convenient.
Master Username and Password, on the other hand, must match the values you used in
the wp-config.php file on your EC2 instance.
The next screen (figure 4.6) contains some important settings that define the envi-
ronment you’ll give your database instance. The instance is currently set to launch in
my account’s default virtual private cloud (VPC), which is exactly what I want, because
that network is also hosting my EC2 server. Having them both in the same VPC allows
for faster and (more important) more-secure communications. I’ll talk much more
about VPCs later in the book.

Figure 4.6 Defining the launch environment details (network, security group, port, and so on)

In the case of the MySQL database on your WordPress site—whose only legitimate
user is the WordPress instance—there’s no reason to provide access to your database
to anyone or anything besides your EC2 server. So, set Publicly Accessible to No. One
of the cardinal rules of IT administration security is to open the fewest possible system
resources to the fewest possible users. That way, you limit your exposure to risk.
RDS instances have network-traffic-filtering security groups, just as EC2 instances
do. Choose Create New Security Group; later, you’ll edit this to permit traffic between
the WordPress instance and the database. Provide a name for an empty database that
RDS will create on your new instance (an RDS instance can have more than one data-
base, by the way). This isn’t the same as the instance identifier you created on the pre-
vious screen, which is used to identify the RDS instance. For simplicity, you can give
this database the same name you gave the original database on your EC2 instance.
The remaining settings on this page are well documented, so I’ll leave you to
explore them on your own. For now, click Launch DB Instance, and wait for the
instance to come to life.

4.7 Configuring security group settings


In the meantime, you can return to the EC2 dashboard to edit your security group to
allow traffic between your EC2 and RDS instances. Click Security Groups in the left
panel. Depending on how you’ve been using your AWS account, you may see many
groups, but in any case you should see at least three: Default VPC Security Group, the
group being used by your EC2 instance, and one whose description is Created from
the RDS Management Console. The last choice is the one used by your RDS instance,
so select it and then click its Inbound tab.
If you’ve already been experimenting with RDS instances, you may see more than
one Created from the RDS Management Console security group. How do you know
which one you’re after? Here are three possible approaches:
 If you expand the Description column associated with a security group—you
may need to scroll to the right and then drag the vertical column separator over
a bit—there might be a Created On date stamp. Go with the date that makes
the most sense.
 Visit the RDS dashboard, select your instance, and note the ID of the security
group that’s displayed on the bottom half of the page.
 Give your security groups (and other AWS resources) descriptive tags. See chap-
ter 10 for more on tagging.
Now that you’re sure you’ve got the right security group, click Edit. As shown in figure
4.7, you should see a single rule of the MYSQL/Aurora type that uses port 3306 (the
default MySQL port) and allows traffic from the IP address listed in the Source box.
That’s probably the public IP address of your local computer, which AWS assumes is
the one you want to use. Instead, because this is an AWS deployment, you’re going to
connect directly to the Security Group of your EC2 instance: select Custom from the
drop-down menu, and type the letter s in the box. AWS will understand that s is the
start of a security group ID (they always begin with the letters sg) and offer you a
choice of all the groups currently available in your account. One of those is the group
used by the EC2 WordPress server, which you can confirm by checking the EC2
instance dashboard. Select it, and then save the rule.


Figure 4.7 Allow network traffic between your database and EC2 instances.

With that done, return to the RDS dashboard. Click the Instances link in the left
panel, and details about your now-running database instance will be displayed. The
one that interests you the most right now is the instance endpoint (see figure 4.8).
Copy that endpoint, leaving out the :3306 at the end; MySQL won’t require that when
you connect to the database from EC2.

Figure 4.8 The instance details dashboard for your MySQL database, including the database endpoint


4.8 Populating the new database


Now, from the SSH session on the EC2 server, move to the directory to which you saved
your MySQL dump:
cd /home/ubuntu

The next command connects to the RDS MySQL database at the host address, logs in
using the wpuser name and password, and then, using the < character, streams the con-
tents of the mybackup.sql file into the wordpressdb database:
mysql -u wpuser -p --database=wordpressdb \
--host=wpdatabase.co7swtzbtfg6.us-east-1.rds.amazonaws.com \
< mybackup.sql

Troubleshooting
That’s a long command, and any out-of-place character will throw the whole thing off.
Adding a space after a single or double dash, using two dashes rather than one or
one instead of two, getting the username or password wrong, adding a space after
the backslash line-break character, or even forgetting to substitute your RDS end-
point for my example as the value for host will cause the command to fail.
Didn’t work on your first try? You’re in good company. Carefully go through the com-
mand again from scratch. And don’t forget to do an internet search for the error mes-
sage you receive—you’ll be surprised how many others have already been there.

All that’s left is to enter the new endpoint into the WordPress wp-config.php file in
the /var/www/html/ directory on your EC2 server:
sudo nano /var/www/html/wp-config.php

Check to be sure the DB_USER and DB_PASSWORD values are correct, and then edit the
hostname DB_HOST value to equal your RDS endpoint. In my case, it’s wpdata-
base.co7swtzbtfg6.us-east-1.rds.amazonaws.com.
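
Before moving on, it’s worth confirming that the import actually landed. From a mysql session connected to the RDS endpoint, a couple of quick checks will do; wp_options and the siteurl option are part of the standard WordPress schema, though your table prefix may differ:

SHOW TABLES;                   -- should list the wp_* tables

SELECT option_value
FROM wp_options
WHERE option_name = 'siteurl'; -- confirms the WordPress data arrived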
One more thing: spelling counts. While setting up this environment to make sure
everything I’m teaching you works (did you think Manning would sell you flaky
goods?), I found that I couldn’t connect to the RDS instance. Turns out I had acciden-
tally set the username as wbuser rather than wpuser. For a moment or two, I couldn’t
figure out why I was getting this error:
ERROR 1045 (28000): Access denied for user
'wpuser'@'172.31.59.16' (using password: YES)

For those of you keeping score at home, your WordPress site is still happily running on
its own EC2 Free Tier instance, but now it’s using a MySQL database that’s hosted on an
RDS-managed DB instance (that happens to be equally Free Tier). If you have an
instance running on your own AWS account, you may want to keep it for now—it will be
useful in chapter 6 (although not in chapter 5). Of course, you can always terminate it
and quickly re-create it later, giving you the added benefit of a helpful review.


4.9 Lab
Launch your RDS MySQL database, and then connect to it using a MySQL client on a
remote computer. It may take a few attempts before you get the syntax quite right, but
it’s worth it to see the triumphant smile on your face when you succeed. Even if you
don’t see the smile (after all, your eyes aren’t well positioned for the job), feel free to
send me a picture.

Definitions
 Relational database—Highly structured database that uses the Structured
Query Language (SQL) standard.
 Relational database elements—Tables, records, and fields.
 NoSQL—Not Only SQL. Databases that feature less structure but, in some
cases, far greater speed than SQL.
 Managed services—Services provided by AWS (including RDS and DynamoDB),
for which it takes full responsibility for the underlying hardware, leaving users
to focus on their data.
 RDS instance class types—Burst capable, standard, and memory optimized.
 Database dump—Making an accurate copy of a database and its contents to
allow it to be archived or migrated.
 Multi-AZ—An RDS deployment replicated over more than one availability zone
to increase reliability.
 Endpoint—Public address through which an RDS database can be accessed.

NoSQL: It’s about making intelligent choices

In this chapter, we’ll learn what NoSQL is (and what it isn’t), and we’ll explore NoSQL business drivers as well as some interesting real-world case studies.

Chapter 1 from Making Sense of NoSQL by Daniel G. McCreary and Ann M. Kelly

This chapter covers
 What’s NoSQL?
 NoSQL business drivers
 NoSQL case studies

The complexity for minimum component costs has increased at a rate of roughly a
factor of two per year...Certainly over the short term this rate can be expected to
continue, if not to increase.
—Gordon Moore, 1965
…Then you better start swimmin’…Or you’ll sink like a stone…For the times they are
a-changin’.
—Bob Dylan

In writing this book we have two goals: first, to describe NoSQL databases, and sec-
ond, to show how NoSQL systems can be used as standalone solutions or to aug-
ment current SQL systems to solve business problems. Though we invite anyone
who has an interest in NoSQL to use this as a guide, the information, examples, and case studies are targeted toward technical managers, solution architects, and data architects who are interested in learning about NoSQL.
This material will help you objectively evaluate SQL and NoSQL database systems
to see which business problems they solve. If you’re looking for a programming guide
for a particular product, you’ve come to the wrong place. Here you’ll find informa-
tion about the motivations behind NoSQL, as well as related terminology and con-
cepts. There may be sections and chapters of this book that cover topics you already
understand; feel free to skim or skip over them and focus on the unknown.
Finally, we feel strongly about and focus on standards. The standards associated
with SQL systems allow applications to be ported between databases using a common
language. Unfortunately, NoSQL systems can’t yet make this claim. In time, NoSQL
application vendors will pressure NoSQL database vendors to adopt a set of standards
to make them as portable as SQL.
In this chapter, we’ll begin by giving a definition of NoSQL. We’ll talk about the
business drivers and motivations that make NoSQL so intriguing to and popular with
organizations today. Finally, we’ll look at five case studies where organizations have
successfully implemented NoSQL to solve a particular business problem.

1.1 What is NoSQL?


One of the challenges with NoSQL is defining it. The term NoSQL is problematic since
it doesn’t really describe the core themes in the NoSQL movement. The term origi-
nated from a group in the Bay Area who met regularly to talk about common con-
cerns and issues surrounding scalable open source databases, and it stuck. Descriptive
or not, it seems to be everywhere: in trade press, product descriptions, and confer-
ences. We’ll use the term NoSQL in this book as a way of differentiating a system from
a traditional relational database management system (RDBMS).
For our purpose, we define NoSQL in the following way:
NoSQL is a set of concepts that allows the rapid and efficient processing of data sets with
a focus on performance, reliability, and agility.
Seems like a broad definition, right? It doesn’t exclude SQL or RDBMS systems, right?
That’s not a mistake. What’s important is that we identify the core themes behind
NoSQL, what it is, and most importantly what it isn’t.
So what is NoSQL?
 It’s more than rows in tables —NoSQL systems store and retrieve data from many
formats: key-value stores, graph databases, column-family (Bigtable) stores, doc-
ument stores, and even rows in tables.
 It’s free of joins —NoSQL systems allow you to extract your data using simple
interfaces without joins.
 It’s schema-free —NoSQL systems allow you to drag-and-drop your data into a
folder and then query it without creating an entity-relational model.
 It works on many processors —NoSQL systems allow you to store your database on
multiple processors and maintain high-speed performance.
 It uses shared-nothing commodity computers —Most (but not all) NoSQL systems
leverage low-cost commodity processors that have separate RAM and disk.
 It supports linear scalability —When you add more processors, you get a consistent
increase in performance.
 It’s innovative —NoSQL offers options to a single way of storing, retrieving, and
manipulating data. NoSQL supporters (also known as NoSQLers) have an inclu-
sive attitude about NoSQL and recognize SQL solutions as viable options. To
the NoSQL community, NoSQL means “Not only SQL.”
Equally important is what NoSQL is not:
 It’s not about the SQL language —The definition of NoSQL isn’t an application
that uses a language other than SQL. SQL as well as other query languages are
used with NoSQL databases.
 It’s not only open source —Although many NoSQL systems have an open source model, commercial products use NoSQL concepts as well as open source initiatives. You can still have an innovative approach to problem solving with a commercial product.
 It’s not only big data —Many, but not all, NoSQL applications are driven by the
inability of a current application to efficiently scale when big data is an issue.
Though volume and velocity are important, NoSQL also focuses on variability
and agility.
 It’s not about cloud computing —Many NoSQL systems reside in the cloud to take
advantage of its ability to rapidly scale when the situation dictates. NoSQL sys-
tems can run in the cloud as well as in your corporate data center.
 It’s not about a clever use of RAM and SSD —Many NoSQL systems focus on the effi-
cient use of RAM or solid state disks to increase performance. Though this is
important, NoSQL systems can run on standard hardware.
 It’s not an elite group of products —NoSQL isn’t an exclusive club with a few prod-
ucts. There are no membership dues or tests required to join. To be considered
a NoSQLer, you only need to convince others that you have innovative solutions
to their business problems.
NoSQL applications use a variety of data store types (databases). From the simple key-
value store that associates a unique key with a value, to graph stores used to associate
relationships, to document stores used for variable data, each NoSQL type of data
store has unique attributes and uses as identified in table 1.1.

Table 1.1 Types of NoSQL data stores—the four main categories of NoSQL systems, and sample products for each data store type

Key-value store—A simple data storage system that uses a key to access a value
Typical usage: image stores, key-based filesystems, object cache, systems designed to scale
Examples: Berkeley DB, Memcache, Redis, Riak, DynamoDB

Column family store—A sparse matrix system that uses a row and a column as keys
Typical usage: web crawler results, big data problems that can relax consistency rules
Examples: Apache HBase, Apache Cassandra, Hypertable, Apache Accumulo

Graph store—For relationship-intensive problems
Typical usage: social networks, fraud detection, relationship-heavy data
Examples: Neo4j, AllegroGraph, Bigdata (RDF data store), InfiniteGraph (Objectivity)

Document store—Storing hierarchical data structures directly in the database
Typical usage: high-variability data, document search, integration hubs, web content management, publishing
Examples: MongoDB (10Gen), CouchDB, Couchbase, MarkLogic, eXist-db, Berkeley DB XML
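
To make two of these categories concrete, here’s a minimal Python sketch with plain dictionaries standing in for the stores. The get/put flavor of key-value access and the join-free filtering over documents mirror what real products expose, but the names and data below are invented for illustration:

# A key-value store reduces everything to get/put against a unique key.
kv_store = {}
kv_store["user:42:avatar"] = b"binary image bytes"  # opaque value, fetched by key
avatar = kv_store["user:42:avatar"]

# A document store holds hierarchical records; two documents in the same
# collection don't need identical fields, and no schema is declared up front.
documents = [
    {"id": 1, "name": "Sally", "roles": ["architect"]},
    {"id": 2, "name": "Dan", "office": {"city": "St. Paul"}},
]

# A join-free query: filter on whatever fields a document happens to have.
architects = [d for d in documents if "architect" in d.get("roles", [])]
print(architects)  # [{'id': 1, 'name': 'Sally', 'roles': ['architect']}]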

NoSQL systems have unique characteristics and capabilities that can be used alone or
in conjunction with your existing systems. Many organizations considering NoSQL sys-
tems do so to overcome common issues such as volume, velocity, variability, and agility,
the business drivers behind the NoSQL movement.

1.2 NoSQL business drivers


The scientist-philosopher Thomas Kuhn coined the term paradigm shift to identify a
recurring process he observed in science, where innovative ideas came in bursts and
impacted the world in nonlinear ways. We’ll use Kuhn’s concept of the paradigm shift
as a way to think about and explain the NoSQL movement and the changes in
thought patterns, architectures, and methods emerging today.
Many organizations supporting single-CPU relational systems have come to a cross-
roads: the needs of their organizations are changing. Businesses have found value in
rapidly capturing and analyzing large amounts of variable data, and making immedi-
ate changes in their businesses based on the information they receive.
Figure 1.1 shows how the demands of volume, velocity, variability, and agility play a
key role in the emergence of NoSQL solutions. As each of these drivers applies pres-
sure to the single-processor relational model, its foundation becomes less stable and
in time no longer meets the organization’s needs.

Figure 1.1 The business drivers volume, velocity, variability, and agility apply pressure to the single-CPU system (the single-node RDBMS), resulting in the cracks. Volume and velocity refer to the ability to handle large datasets that arrive quickly. Variability refers to how diverse data types don’t fit into structured tables, and agility refers to how quickly an organization responds to business change.

1.2.1 Volume
Without a doubt, the key factor pushing organizations to look at alternatives to their
current RDBMSs is a need to query big data using clusters of commodity processors.
Until around 2005, performance concerns were resolved by purchasing faster proces-
sors. In time, the ability to increase processing speed was no longer an option. As chip
density increased, heat could no longer dissipate fast enough without chip overheat-
ing. This phenomenon, known as the power wall, forced systems designers to shift
their focus from increasing speed on a single chip to using more processors working
together. The need to scale out (also known as horizontal scaling), rather than scale up
(faster processors), moved organizations from serial to parallel processing where data
problems are split into separate paths and sent to separate processors to divide and
conquer the work.

1.2.2 Velocity
Though big data problems are a consideration for many organizations moving away
from RDBMSs, the ability of a single processor system to rapidly read and write data is
also key. Many single-processor RDBMSs are unable to keep up with the demands of
real-time inserts and online queries to the database made by public-facing websites.
RDBMSs frequently index many columns of every new row, a process which decreases
system performance. When single-processor RDBMSs are used as a back end to a web
store front, the random bursts in web traffic slow down response for everyone, and tun-
ing these systems can be costly when both high read and write throughput is desired.

1.2.3 Variability
Companies that want to capture and report on exception data struggle when attempt-
ing to use rigid database schema structures imposed by RDBMSs. For example, if a
business unit wants to capture a few custom fields for a particular customer, all cus-
tomer rows within the database need to store this information even though it doesn’t
apply. Adding new columns to an RDBMS requires the system be shut down and ALTER
TABLE commands to be run. When a database is large, this process can impact system
availability, costing time and money.
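
The difference is easy to see in a small sketch, with Python dictionaries standing in for rows and records (the customer fields are invented for illustration):

# Rigid schema: capturing loyalty_tier for one customer means a new column
# (and an ALTER TABLE) that every other row must now carry as NULL.
relational_rows = [
    {"id": 1, "name": "Acme", "loyalty_tier": None},
    {"id": 2, "name": "Zenith", "loyalty_tier": "gold"},
]

# Schema-free: each record stores only the fields that apply to it, so a
# custom field for one customer never touches the rest of the data.
schema_free_records = [
    {"id": 1, "name": "Acme"},
    {"id": 2, "name": "Zenith", "loyalty_tier": "gold"},
]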

1.2.4 Agility
The most complex part of building applications using RDBMSs is the process of putting
data into and getting data out of the database. If your data has nested and repeated
subgroups of data structures, you need to include an object-relational mapping layer.
The responsibility of this layer is to generate the correct combination of INSERT,
UPDATE, DELETE, and SELECT SQL statements to move object data to and from the
RDBMS persistence layer. This process isn’t simple and is associated with the largest bar-
rier to rapid change when developing new or modifying existing applications.
Generally, object-relational mapping requires experienced software developers
who are familiar with object-relational frameworks such as Java Hibernate (or NHiber-
nate for .Net systems). Even with experienced staff, small change requests can cause
slowdowns in development and testing schedules.
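Here’s a hand-rolled sketch of that mapping burden (illustrative Python, not Hibernate; the table and field names are invented). Saving one object with a nested collection forces the mapping layer to emit several coordinated SQL statements, while a document store can persist the same structure in a single write:

import json

order = {"id": 7, "customer": "Acme",
         "lines": [{"sku": "A1", "qty": 2}, {"sku": "B9", "qty": 1}]}

# Object-relational mapping: flatten the nested structure into coordinated
# statements against separate tables (a sketch of the generated SQL).
statements = [("INSERT INTO orders (id, customer) VALUES (?, ?)",
               (order["id"], order["customer"]))]
statements += [("INSERT INTO order_lines (order_id, sku, qty) VALUES (?, ?, ?)",
                (order["id"], line["sku"], line["qty"]))
               for line in order["lines"]]
for sql, params in statements:
    print(sql, params)

# Document store: the same object persists in one write, structure intact.
doc_store = {}
doc_store["order:7"] = json.dumps(order)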
You can see how velocity, volume, variability, and agility are the high-level drivers
most frequently associated with the NoSQL movement. Now that you’re familiar with
these drivers, you can look at your organization to see how NoSQL solutions might
impact these drivers in a positive way to help your business meet the changing
demands of today’s competitive marketplace.

1.3 NoSQL case studies


Our economy is changing. Companies that want to remain competitive need to find
new ways to attract and retain their customers. To do this, the technology and people
who create it must support these efforts quickly and in a cost-effective way. New
thoughts about how to implement solutions are moving away from traditional meth-
ods toward processes, procedures, and technologies that at times seem bleeding-edge.
The following case studies demonstrate how business problems have successfully
been solved faster, cheaper, and more effectively by thinking outside the box. Table 1.2
summarizes five case studies where NoSQL solutions were used to solve particular busi-
ness problems. It presents the problems, the business drivers, and the ultimate findings.
As you view subsequent sections, you’ll begin to see a common theme emerge: some
business problems require new thinking and technology to provide the best solution.

Table 1.2 The key case studies associated with the NoSQL movement—the name of the case study/standard, the business drivers, and the results (findings) of the selected solutions

LiveJournal’s Memcache
Driver: need to increase performance of database queries.
Finding: by using hashing and caching, data in RAM can be shared. This cuts down the number of read requests sent to the database, increasing performance.

Google’s MapReduce
Driver: need to index billions of web pages for search using low-cost hardware.
Finding: by using parallel processing, indexing billions of web pages can be done quickly with a large number of commodity processors.

Google’s Bigtable
Driver: need to flexibly store tabular data in a distributed system.
Finding: by using a sparse matrix approach, users can think of all data as being stored in a single table with billions of rows and millions of columns without the need for up-front data modeling.

Amazon’s Dynamo
Driver: need to accept a web order 24 hours a day, 7 days a week.
Finding: a key-value store with a simple interface can be replicated even when there are large volumes of data to be processed.

MarkLogic
Driver: need to query large collections of XML documents stored on commodity hardware using standard query languages.
Finding: by distributing queries to commodity servers that contain indexes of XML documents, each server can be responsible for processing data in its own local disk and returning the results to a query server.

1.3.1 Case study: LiveJournal’s Memcache


Engineers working on the blogging system LiveJournal started to look at how their sys-
tems were using their most precious resource: the RAM in each web server. Live-
Journal had a problem. Their website was so popular that the number of visitors using
the site continued to increase on a daily basis. The only way they could keep up with
demand was to continue to add more web servers, each with its own separate RAM.
To improve performance, the LiveJournal engineers found ways to keep the results
of the most frequently used database queries in RAM, avoiding the expensive cost of
rerunning the same SQL queries on their database. But each web server had its own
copy of the query in RAM; there was no way for any web server to know that the server
next to it in the rack already had a copy of the query sitting in RAM.
So the engineers at LiveJournal created a simple way to create a distinct “signa-
ture” of every SQL query. This signature or hash was a short string that represented a
SQL SELECT statement. By sending a small message between web servers, any web
server could ask the other servers if they had a copy of the SQL result already exe-
cuted. If one did, it would return the results of the query and avoid an expensive
round trip to the already overwhelmed SQL database. They called their new system
Memcache because it managed RAM memory cache.
Many other software engineers had come across this problem in the past. The con-
cept of large pools of shared-memory servers wasn’t new. What was different this time
was that the engineers for LiveJournal went one step further. They not only made this
system work (and work well), they shared their software using an open source license,
and they also standardized the communications protocol between the web front ends
(called the memcached protocol). Now anyone who wanted to keep their database from
getting overwhelmed with repetitive queries could use their front end tools.
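
The pattern at the heart of Memcache is easy to sketch (a single-process Python illustration; the real system pools RAM across servers over the memcached protocol, and the function names here are ours):

import hashlib

cache = {}  # stands in for the shared pool of RAM across web servers

def query_signature(sql):
    # A short hash string that uniquely identifies a SQL SELECT statement.
    return hashlib.md5(sql.encode("utf-8")).hexdigest()

def cached_query(sql, run_on_database):
    key = query_signature(sql)
    if key in cache:               # another server already ran this query
        return cache[key]
    result = run_on_database(sql)  # expensive round trip to the database
    cache[key] = result
    return result

# The second call is served from RAM and never reaches the database.
fake_db = lambda sql: [("post", 1), ("post", 2)]
cached_query("SELECT * FROM posts LIMIT 2", fake_db)
cached_query("SELECT * FROM posts LIMIT 2", fake_db)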

1.3.2 Case study: Google’s MapReduce—use commodity hardware to create search indexes
One of the most influential case studies in the NoSQL movement is the Google MapReduce system. In the paper describing that system, Google shared their process for transforming large volumes of web content into search indexes using low-cost commodity CPUs.
Though sharing of this information was significant, the concepts of map and reduce
weren’t new. Map and reduce functions are simply names for two stages of a data
transformation, as described in figure 1.2.
The initial stages of the transformation are called the map operation. They’re
responsible for data extraction, transformation, and filtering of data. The results of
the map operation are then sent to a second layer: the reduce function. The reduce
function is where the results are sorted, combined, and summarized to produce the
final result.
The core concepts behind the map and reduce functions are based on solid com-
puter science work that dates back to the 1950s when programmers at MIT imple-
mented these functions in the influential LISP system. LISP was different than other
programming languages because it emphasized functions that transformed isolated
lists of data. This focus is now the basis for many modern functional programming
languages that have desirable properties on distributed systems.
Google extended the map and reduce functions to reliably execute on billions of
web pages on hundreds or thousands of low-cost commodity CPUs. Google made map
and reduce work reliably on large volumes of data and did it at a low cost. It was Goo-
gle’s use of MapReduce that encouraged others to take another look at the power of
functional programming and the ability of functional programming systems to scale
over thousands of low-cost CPUs. Software packages such as Hadoop have closely mod-
eled these functions.

Figure 1.2 The map and reduce functions are ways of partitioning large datasets into smaller chunks that can be transformed on isolated and independent transformation systems. The key is isolating each function so that it can be scaled onto many servers. In the figure, the map layer extracts the data from the input and transforms the results into key-value pairs, which are sent to the shuffle/sort layer; the shuffle/sort layer returns the key-value pairs sorted by their keys; and the reduce layer collects the sorted results and performs counts and totals before returning the final result.
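
A toy word count shows the three stages in miniature (a single-process Python sketch; Google’s system and Hadoop run each stage in parallel across many machines):

from itertools import groupby
from operator import itemgetter

documents = ["nosql scales out", "sql scales up", "nosql is not only sql"]

# Map: extract (key, value) pairs from each input record.
pairs = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle/sort: order the pairs by key so equal keys sit together.
pairs.sort(key=itemgetter(0))

# Reduce: combine and summarize each group into the final result.
counts = {word: sum(v for _, v in group)
          for word, group in groupby(pairs, key=itemgetter(0))}
print(counts)  # {'is': 1, 'nosql': 2, 'not': 1, 'only': 1, 'out': 1, ...}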

The use of MapReduce inspired engineers from Yahoo! and other organizations to
create open source versions of Google’s MapReduce. It fostered a growing awareness
of the limitations of traditional procedural programming and encouraged others to
use functional programming systems.

1.3.3 Case study: Google’s Bigtable—a table with a billion rows and a million columns
Google also influenced many software developers when they announced their Big-
table system white paper titled A Distributed Storage System for Structured Data. The moti-
vation behind Bigtable was the need to store results from the web crawlers that extract
HTML pages, images, sounds, videos, and other media from the internet. The result-
ing dataset was so large that it couldn’t fit into a single relational database, so Google
built their own storage system. Their fundamental goal was to build a system that
would easily scale as their data increased without forcing them to purchase expensive
hardware. The solution was neither a full relational database nor a filesystem, but
what they called a “distributed storage system” that worked with structured data.
By all accounts, the Bigtable project was extremely successful. It gave Google devel-
opers a single tabular view of the data by creating one large table that stored all the data
they needed. In addition, they created a system that allowed the hardware to be located
in any data center, anywhere in the world, and created an environment where develop-
ers didn’t need to worry about the physical location of the data they manipulated.
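
You can sketch that sparse tabular view as a mapping from (row key, column) pairs to values: cells that were never written simply don’t exist, so a table can be a billion rows by a million columns while storing only its populated cells. This is a Python illustration, not Bigtable’s actual API, and the row keys and column names are invented:

# Each row stores only the columns it actually uses; empty cells cost nothing.
bigtable = {
    ("com.example/index.html", "contents:html"): "<html>...</html>",
    ("com.example/index.html", "anchor:news.com"): "Example anchor text",
    ("org.sample/page2.html", "contents:html"): "<html>...</html>",
}

def read_cell(row_key, column, default=None):
    return bigtable.get((row_key, column), default)

print(read_cell("com.example/index.html", "anchor:news.com"))
print(read_cell("org.sample/page2.html", "anchor:news.com"))  # sparse: None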

1.3.4 Case study: Amazon’s Dynamo—accept an order 24 hours a day, 7 days a week
Google’s work focused on ways to make distributed batch processing and reporting
easier, but wasn’t intended to support the need for highly scalable web storefronts that
ran 24/7. This development came from Amazon, which in 2007 published another significant NoSQL paper: Dynamo: A Highly Available Key-Value Store. The business motivation behind Dynamo was Amazon’s need to create a highly reliable web storefront that supported transactions from around the world 24 hours a day, 7 days a week, without interruption.
Traditional brick-and-mortar retailers that operate in a few locations have the lux-
ury of having their cash registers and point-of-sale equipment operating only during
business hours. When not open for business, they run daily reports, and perform back-
ups and software upgrades. The Amazon model is different. Not only are their custom-
ers from all corners of the world, but they shop at all hours of the day, every day. Any
downtime in the purchasing cycle could result in the loss of millions of dollars. Ama-
zon’s systems need to be iron-clad reliable and scalable without a loss in service.
In its initial offerings, Amazon used a relational database to support its shopping
cart and checkout system. They had unlimited licenses for RDBMS software and a
consulting budget that allowed them to attract the best and brightest consultants for
their projects. In spite of all that power and money, they eventually realized that a rela-
tional model wouldn’t meet their future business needs.

Many in the NoSQL community cite Amazon’s Dynamo paper as a significant turning point in the movement. At a time when relational models were still the default choice, it challenged the status quo and current best practices. Amazon found that because key-value stores have a simple interface, the data was easier to replicate and more reliable. In the end, Amazon used a key-value store to build a turnkey system that was reliable, extensible, and able to support their 24/7 business model, making them one of the most successful online retailers in the world.

1.3.5 Case study: MarkLogic


In 2001 a group of engineers in the San Francisco Bay Area with experience in docu-
ment search formed a company that focused on managing large collections of XML
documents. Because XML documents contained markup, they named the company
MarkLogic.
MarkLogic defined two types of nodes in a cluster: query and document nodes.
Query nodes receive query requests and coordinate all activities associated with execut-
ing a query. Document nodes contain XML documents and are responsible for executing
queries on the documents in the local filesystem.
Query requests are sent to a query node, which distributes queries to each remote
server that contains indexed XML documents. All document matches are returned to
the query node. When all document nodes have responded, the query result is then
returned.
The MarkLogic architecture, moving queries to documents rather than moving
documents to the query server, allowed them to achieve linear scalability with peta-
bytes of documents.
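
That architecture is a classic scatter-gather pattern, sketched below in Python (purely illustrative; MarkLogic itself is queried through XQuery or its REST interface, and these function names are invented):

# Each document node indexes and searches only its own local slice of documents.
document_nodes = [
    {"doc1.xml": "nosql case study", "doc2.xml": "xml publishing"},
    {"doc3.xml": "intelligence reports", "doc4.xml": "nosql standards"},
]

def search_local(node, term):
    return [name for name, text in node.items() if term in text]

def query_node(term):
    # Scatter the query to every document node, then gather the matches.
    matches = []
    for node in document_nodes:
        matches.extend(search_local(node, term))
    return matches

print(query_node("nosql"))  # ['doc1.xml', 'doc4.xml']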
MarkLogic found a demand for their products in US federal government systems
that stored terabytes of intelligence information and large publishing entities that
wanted to store and search their XML documents. Since 2001, MarkLogic has matured
into a general-purpose highly scalable document store with support for ACID transac-
tions and fine-grained, role-based access control. Initially, the primary language of
MarkLogic developers was XQuery paired with REST; newer versions support Java as
well as other language interfaces.
MarkLogic is a commercial product that requires a software license for any data-
sets over 40 GB. NoSQL is associated with commercial as well as open source products
that provide innovative solutions to business problems.

1.3.6 Applying your knowledge


To demonstrate how the concepts in this book can be applied, we introduce you to
Sally Solutions. Sally is a solution architect at a large organization that has many busi-
ness units. Business units that have information management issues are assigned a
solution architect to help them select the best solution to their information challenge.
Sally works on projects that need custom applications developed and she’s knowledge-
able about SQL and NoSQL technologies. Her job is to find the best fit for the busi-
ness problem.

Now let’s see how Sally applies her knowledge in two examples. In the first exam-
ple, a group that needed to track equipment warranties of hardware purchases came
to Sally for advice. Since the hardware information was already in an RDBMS and the
team had experience with SQL, Sally recommended they extend the RDBMS to
include warranty information and create reports using joins. In this case, it was clear
that SQL was appropriate.
In the second example, a group that was in charge of storing digital image infor-
mation within a relational database approached Sally because the performance of the
database was negatively impacting their web application’s page rendering. In this case,
Sally recommended moving all images to a key-value store, which referenced each
image with a URL. A key-value store is optimized for read-intensive applications and
works with content distribution networks. After removing the image management
load from the RDBMS, the web application as well as other applications saw an
improvement in performance.
Note that Sally doesn’t see her job as a black-and-white, RDBMS versus NoSQL
selection process. Sometimes the best solution involves using hybrid approaches.

Summary
This chapter began with an introduction to the concept of NoSQL and reviewed the
core business drivers behind the NoSQL movement. We then showed how the power
wall forced systems designers to use highly parallel processing designs and required a
new type of thinking for managing data. You also saw that traditional systems that use
object-middle tiers and RDBMS databases require the use of complex object-relational
mapping systems to manipulate the data. These layers often get in the way of an orga-
nization’s ability to react quickly to changes (agility).
When we venture into any new technology, it’s critical to understand that each
area has its own patterns of problem solving. These patterns vary dramatically from
technology to technology. Making the transition from SQL to NoSQL is no different.
NoSQL is a new paradigm and requires a new set of pattern recognition skills, new
ways of thinking, and new ways of solving problems. It requires a new cognitive style.
Opting to use NoSQL technologies can help organizations gain a competitive edge
in their market, making them more agile and better equipped to adapt to changing
business conditions. NoSQL approaches that leverage large numbers of commodity
processors save companies time and money and increase service reliability.
As you’ve seen in the case studies, these changes impacted more than early tech-
nology adopters: engineers around the world realize there are alternatives to the
RDBMS-as-our-only-option mantra. New companies focused on new thinking, technol-
ogies, and architectures have emerged not as a lark, but as a necessity to solving real
business problems that don’t fit into a relational mold. As organizations continue to
change and move into global economies, this trend will continue to expand.
As we move into our next chapter, we’ll begin looking at the core concepts and
technologies associated with NoSQL. We’ll talk about simplicity of design and see how
it’s fundamental to creating NoSQL systems that are modular, scalable, and ultimately
lower-cost to you and your organization.

index

A
AllegroGraph 31
Amazon DynamoDB, overview 36–37
Apache Accumulo, data stores 31
Apache Cassandra, data stores 31
Apache HBase 31
array, disk and 10
attaching database 4
Auto Close setting 4
Auto Create Statistics setting 4
Auto Shrink setting 4
autogrowth, file size problem 8–9
AWS RDS (Relational Database Service)
  building instances 19–23
  migrating databases to 18–19

B
Berkeley DB, data stores 31
Bigdata (RDF data store) 31
Bigtable 36
burst capable 18
business agility 33
business drivers
  agility 33
  variability 32
  velocity 32
  volume 32

C
case studies
  Amazon Dynamo 36–37
  comparison 33–38
  Google Bigtable 36
  Google MapReduce 35–36
  LiveJournal Memcache 34
  MarkLogic 37
chart, database and 9
class types, RDS 26
Collation setting 3
column family store, data store types 31
compatibility level, setting 3
configuring, security group settings 23–24
Containment type setting 4
Copy Database Wizard 5
costs, of databases 17–18
Couchbase 31
  data stores 31
CouchDB (cluster of unreliable commodity hardware) 31
  data stores 31

D
data accessibility 16
database
  as a basic unit of management 2
  autogrowth and physical fragmentation 8
  autogrowth vs. growth 8
  colocated log and data files 7
  configuring options 2–4
  detaching and attaching 4–5
  heavy disk fragmentation 7
  importance of size management 8
  managing 2–11
  primary file 6
  reason for different filegroups 9
  storage, assessment 6
  system 9
  transaction log file 6
  unnecessary secondary files 7
database dump 26
Database Properties dialog 2
databases
  building AWS RDS instances 19–23
  configuring security group settings 23–24
  estimating costs of 17–18
  infrastructure of 15–16
  migrating to AWS RDS 18–19
  models of 14–15
    choosing 15
    NoSQL 14–15
    relational 14
  overview 13
  populating new 25
detaching database 4
Dev/Test environment 19
disk
  array 10
  defined 10
  mirror 10
  redundancy and 10
  storage, overview 10–11
distributed storage system 36
document nodes 37
document store, data store types 31
downtime, and Amazon Dynamo 36
Drop Connections option 4
dumps, creating for MySQL database management 18–19
DynamoDB 31
  data stores 31
  overview 36–37

E
endpoint 16, 26
eXist-db 31

F
Failover 21
file
  layout, problems with 6–8
  primary 6
  size, problems with 8–9
filegroups 9
Filegroups page 2
Files page 2
formats for NoSQL systems 29

G
General tab 3
Google
  Bigtable 36
  MapReduce 35–36
graph store, data store types 31

H
Hadoop 35
hardware 16
hashes, for SQL query 34
heavy disk fragmentation 7
horizontal scaling, defined 32
Hypertable 31

I
InfiniteGraph 31
infrastructure, of databases 15–16
instance class types, RDS 26
instances, AWS RDS, building 19–23
IOPS (input/output operations per second) 21

J
Java Hibernate 33
joins, absence of 29

K
key-value store, data store types 31
Kuhn, Thomas 31

L
linear scaling, defined 30
LISP system 35
LiveJournal, Memcache 34
log and data files, colocated 7

M
managed services 26
map function 35
MapReduce, overview 35–36
MarkLogic 31
  data stores 31
  overview 37
master, system database 9
Memcache
  data stores 31
  overview 34
memory optimized class type 18
MoreLunches.com 11
migrating databases to AWS RDS 18–19
mirror, disk and 10
.MDF file 6–7
model, system database 9
MongoDB, data stores 31
MSDB, system database 10
Multi-AZ 26
MySQL database management system, creating dumps 18–19

N
Neo4j query language
  data stores 31
NHibernate 33
.NDF file 6–7
NoSQL
  business drivers
    agility 33
    variability 32
    velocity 32
    volume 32
  case studies
    Amazon Dynamo 36–37
    comparison 33–38
    Google Bigtable 36
    Google MapReduce 35–36
    LiveJournal Memcache 34
    MarkLogic 37
  defined 29–31
NoSQL (Not Only SQL) 14, 26

O
Options page 2

P
paradigm shift 31
parity information, disk and 10
populating new databases 25
power wall phenomenon 32
processors, support for 30
production environment 19

Q
query nodes 37

R
RAM (random access memory), dealing with shortage of 34
RDBMS (relational database management system), defined 29
RDS (Relational Database Service) 16
Recovery model setting 3
Redis, data stores 31
reduce function 35
redundancy, disk and 10
relational database elements 26
relational database management system. See RDBMS
relational databases 14
reliability, and Amazon Dynamo 36
Riak, data stores 31

S
secondary files, unnecessary 7
security 15
security groups, configuring settings 23–24
Simple Monthly Calculator 17–18
software 16
solid-state disk 10
SQL (Structured Query Language) 14
SSD. See solid-state disk
standard class type 18
Storage Area Network (SAN), adapter 7
stripe arrays 10
system databases 9–10
  master 9
  model 9
  MSDB 10
  temporary (TempDb) 9

T
TempDb, system database 9
transaction log file 6

U
Update Statistics option 4

V
variability business driver 32
velocity business driver 32
volume business driver 32
VPCs (virtual private clouds) 22

Y
Yahoo! 36
