Object Storage Overview
Object Storage: A Fresh Approach to Long-Term File Storage
A Dell Technical White Paper
THIS WHITE PAPER IS FOR INFORMATIONAL PURPOSES ONLY, AND MAY CONTAIN TYPOGRAPHICAL
ERRORS AND TECHNICAL INACCURACIES. THE CONTENT IS PROVIDED AS IS, WITHOUT EXPRESS OR
IMPLIED WARRANTIES OF ANY KIND.
© 2010 Dell Inc. All rights reserved. Reproduction of this material in any manner whatsoever without
the express written permission of Dell Inc. is strictly forbidden. For more information, contact Dell.
Dell, the DELL logo, and the DELL badge are trademarks of Dell Inc. Other trademarks and trade names
may be used in this document to refer to either the entities claiming the marks and names or their
products. Dell Inc. disclaims any proprietary interest in trademarks and trade names other than its
own.
May 2010
Contents
Executive Summary
Introduction
The new challenges of unstructured data
A Need for New File Storage Solutions
A fresh approach: Object Storage
Object Storage and Traditional NAS Coexist
Object Storage in Intelligent Data Management
Summary
Figures
Figure 1. Example contrasting the amount of metadata associated with an Object vs. a File
Figure 2. The frequency of data usage is a factor in using Object vs. traditional file storage
Executive Summary
The world is increasingly awash in digital data, not only because of the Internet and Web 2.0, but also
because data that used to be collected on paper or on media such as film, DVDs and compact discs has
moved online. Most of this data is unstructured and in diverse formats such as e-mail, instant
messages, documents, spreadsheets, graphics, images, and videos. For storage managers, the growth
in unstructured data is proving to be a challenge: companies require that the data be readily accessible
for business, regulatory and compliance needs, but traditional file storage systems such as
NAS are proving to be both costly and inadequate. With unstructured data growth expected to
continue unabated, at a compound annual growth rate estimated to exceed 60 percent [1], storage
managers are looking at new ways to cope. One alternative that has emerged is Object Storage, an
approach designed to address many of the shortcomings of traditional NAS and considered more
cost-effective. However, it is not a one-size-fits-all solution, and traditional NAS will continue to have
a strong role to play in today's storage environment. In this white paper we explore Object Storage,
compare it to traditional NAS, and demonstrate that an intelligent, policy-based data management
strategy is the best way to determine when organizations benefit from Object Storage and when
they should continue to use NAS.
Introduction
We live in interesting digital times. It used to be that computers primarily stored structured data such
as financial and supply chain information. This has changed. Today, more and more of the world's
unstructured data (everything from videos, music files, blogs, images and instant messages to the
day-to-day paperwork generated by businesses) is being created, distributed and stored digitally. This
phenomenon pervades all aspects of human life. In the doctor's office, for example, x-rays
that were once produced on film are now created and stored digitally. In banks, cashed checks that
used to be stored on microfiche are now stored on computer hard drives. Legal contracts, too, which
had been solely paper-based, are now created and stored digitally, with digital signatures taking
the place of handwritten ones. The end result is an explosion of predominantly unstructured data
being stored on computer storage systems. It is estimated that the amount of digital information will
double every 18 months, with 95% of this growth coming from unstructured data and only the remaining 5%
driven by traditional structured data [2]. Unstructured data is expected to far outpace the growth
of structured data well into the future.
For storage managers, this phenomenal growth in data, particularly in unstructured data, is creating
new challenges. It means they must continue to find cost-effective storage strategies while ensuring
data is available as needed for business or compliance requirements. It means they must make sure
the data is well protected according to backup and retention policies. But it now also means that they
must ask how best to accomplish these goals, and what they need to do differently, when most of the
data they are managing is unstructured and inherently different from structured data.
[1] "Object storage gains steam as unstructured data grows," Beth Pariseau, Storage Magazine, November/December 2009
[2] IDC White Paper sponsored by EMC, "As the Economy Contracts, the Digital Universe Expands," May 2009
Unstructured information is different. It is often generated at the time of a particular event and then
stored outside of a database. It also may not be touched or needed again after the particular event.
Take x-rays, for example. These are most often created to help a physician diagnose a patient. Once
the diagnosis is complete, and if the patient is cured, the x-rays are no longer needed and are stored
away. On the other hand, if the patient continues treatment, the x-rays may need to be recalled. This
example shows the challenges in managing unstructured data and the importance of understanding the
context around the data: x-rays that are no longer needed should be sent to archival storage, while
those still needed should be kept on near-line storage. The difficulty with unstructured data is that
there isn't a mechanism analogous to a database that allows this context to be maintained. Instead,
the context is often lost or separated from the data, and storage managers must make decisions based
purely on the data type (e.g., the x-ray image) itself.
Similarly, it is difficult to know the content in unstructured data and use this information to help guide
storage decisions. Consider, for example, a company that stores its product blueprint drawings as JPEG
files. Without knowing the content of the company's JPEG files, storage managers can incorrectly give
the blueprint files the same importance and storage priority as the JPEG picture files sent around to
announce the arrival of an employee's newborn baby. Likewise, HR information for an employee (e.g.,
an offer letter or performance review) clearly has a different
priority than the minutes of a staff meeting, even though both may be stored in the
same Microsoft Word format.
Given the characteristics of unstructured data, the difficult question facing storage managers is how
they can effectively and efficiently store this data while being mindful of both the context around and the
content within the data to make the right storage decisions. Part of the challenge in solving this
problem is that storage managers are working with a limited set of storage tools. Today's dominant
approach is to store unstructured data on file systems such as Network Attached Storage (NAS).
However, NAS was designed in a different age, when the world was much less digitized and
unstructured data was not as prevalent as it is today.
Figure 1. Example contrasting the amount of metadata associated with an Object vs. a File
When an MRI scan is stored as a file, the typical metadata attached to it is basic and may include only
information such as file name, creation date, creator, and file type. When the MRI scan is stored as an
Object, on the other hand, the generating application can include all the file metadata plus additional
metadata that might summarize the content contained within the file and include the
patient name, the patient's ID, the procedure date, the attending physician's name, the physician's
notes, as well as any other metadata that can help add context to the MRI scan.
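The contrast between file metadata and object metadata can be sketched in a few lines of Python. This is only an illustration of the idea, not the data model of any particular product: the class names, field names, and sample values below are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict


@dataclass
class FileRecord:
    """The basic metadata a file system typically keeps for a file."""
    file_name: str
    creation_date: date
    creator: str
    file_type: str


@dataclass
class StoredObject:
    """The same scan stored as an object: file metadata plus free-form
    metadata supplied by the generating application (hypothetical model)."""
    file_metadata: FileRecord
    custom_metadata: Dict[str, str] = field(default_factory=dict)


# All values below are invented sample data.
scan_file = FileRecord("scan_0001.img", date(2010, 5, 1), "mri-workstation", "MRI image")

scan_object = StoredObject(
    file_metadata=scan_file,
    custom_metadata={
        "patient_name": "Jane Doe",
        "patient_id": "P-12345",
        "procedure_date": "2010-05-01",
        "attending_physician": "Dr. Smith",
        "physician_notes": "Follow-up scan recommended in six months.",
    },
)

print(sorted(scan_object.custom_metadata))
```

The object carries everything the file record does, and the extra key/value pairs travel with the data rather than living in a separate application database.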
A richer set of metadata can also make it easier to apply eDiscovery and business intelligence tools to
help an organization uncover data assets and gain new insights. For example, the metadata can make
it possible for a hospital to find all stored MRIs for a particular disease and then collect statistics,
such as the number of MRI scans done per stage of the disease, to help allocate
resources. In this way, object storage helps to ensure that the value of information contained within
unstructured data is maximized and preserved for future use.
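As a rough sketch of the hospital example, the snippet below filters a small in-memory catalog by metadata and counts scans per disease stage. The catalog layout, the metadata keys, and the `find_objects` helper are assumptions made for illustration; a real object store would expose its own query interface.

```python
from collections import Counter

# Hypothetical metadata catalog: one dictionary per stored object.
catalog = [
    {"object_id": "a1", "modality": "MRI", "disease": "condition-x", "stage": "I"},
    {"object_id": "a2", "modality": "MRI", "disease": "condition-x", "stage": "II"},
    {"object_id": "a3", "modality": "MRI", "disease": "condition-x", "stage": "I"},
    {"object_id": "a4", "modality": "CT", "disease": "condition-x", "stage": "I"},
    {"object_id": "a5", "modality": "MRI", "disease": "condition-y", "stage": "III"},
]


def find_objects(catalog, **criteria):
    """Return the entries whose metadata matches every given key/value pair."""
    return [
        entry for entry in catalog
        if all(entry.get(key) == value for key, value in criteria.items())
    ]


# Find all MRI scans for one disease, then count them per stage.
matches = find_objects(catalog, modality="MRI", disease="condition-x")
scans_per_stage = Counter(entry["stage"] for entry in matches)

print(scans_per_stage)
```

None of this requires opening the image data itself; the query runs entirely against the metadata stored with each object.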
Objects are also useful in directly keeping related information together by enabling multiple file types
to be grouped together. For example, it is possible to group the MRI image with the physician's
recorded notes (in an MP3 file) along with the text file that has the patient's history. This is similar to
the way that information is managed in the paper-based world, where a patient's file can contain
different file types. And similar to the way that a traditional hospital file may be used, the object with
its grouped files can be used by any subsequent physician treating the patient to easily access the
previous physician's notes attached to the scan and obtain additional context about the patient's
condition.
An object also differs from a file in that a unique ID is assigned to each object.
This ID is generated using a 128-bit random number generator, which makes it extremely unlikely that
two objects will ever share an ID. It allows objects to be stored in a vast, flat address space containing
billions of objects without the complexity file systems impose. Similar to the function of URLs on the
Internet, an Object ID serves as the unique pointer to the object; hence there is no directory hierarchy
(or tree), and the object's location does not have to be specified in the way that a file's
directory path has to be known in order to retrieve it. The unique identifier also allows objects to be
easily migrated from one storage node or system to another, for example when the underlying
hardware is being upgraded, without interrupting application or user access.
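The role of the Object ID can be illustrated with a toy in-memory store. This is a minimal sketch under stated assumptions, not how any specific product works: the `FlatObjectStore` class, its method names, and the node names are hypothetical, and Python's `secrets.token_hex(16)` stands in for the 128-bit random ID generator.

```python
import secrets


class FlatObjectStore:
    """Toy flat namespace: objects are found by ID alone, never by path."""

    def __init__(self, node_names):
        # node name -> {object_id: data}
        self._nodes = {name: {} for name in node_names}
        # object_id -> node name currently holding the object
        self._index = {}

    def put(self, data: bytes, node: str) -> str:
        object_id = secrets.token_hex(16)  # 16 random bytes = 128 bits
        self._nodes[node][object_id] = data
        self._index[object_id] = node
        return object_id

    def get(self, object_id: str) -> bytes:
        # The caller supplies only the ID; the store resolves the location.
        return self._nodes[self._index[object_id]][object_id]

    def migrate(self, object_id: str, target_node: str) -> None:
        # Move the data to another node; the ID seen by callers is unchanged.
        source_node = self._index[object_id]
        self._nodes[target_node][object_id] = self._nodes[source_node].pop(object_id)
        self._index[object_id] = target_node


store = FlatObjectStore(["node-a", "node-b"])
oid = store.put(b"MRI pixel data", node="node-a")
store.migrate(oid, "node-b")
print(store.get(oid) == b"MRI pixel data")
```

Because applications hold only the ID, moving the object between nodes does not invalidate anything they have stored, which is the property the migration scenario above relies on.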
In addition to the unique Object ID, a hash signature also has a strong role to play in managing storage
for unstructured data, particularly in the removal of duplicates and in helping to address compliance
mandates. Since the hash signature associated with each object is generated from the data
contained within the object, if the same signature is found to be already present in the hash table, it is
immediately known that duplicate data exists in the storage system. With this knowledge, storage
managers can decide how they want to treat it.
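Duplicate detection by hash signature can be sketched as follows. The paper does not say which hash function is used, so SHA-256 is an assumption here, and the `DedupStore` class and its hash table layout are hypothetical.

```python
import hashlib
from typing import Dict, Tuple


class DedupStore:
    """Toy store that detects duplicate content through a hash table."""

    def __init__(self):
        # hash signature -> stored data
        self._hash_table: Dict[str, bytes] = {}

    def put(self, data: bytes) -> Tuple[str, bool]:
        """Store data and return (signature, was_duplicate)."""
        # SHA-256 is an assumed choice; the signature depends only on content.
        signature = hashlib.sha256(data).hexdigest()
        if signature in self._hash_table:
            # Same signature already present: the content is a duplicate.
            return signature, True
        self._hash_table[signature] = data
        return signature, False

    def stored_count(self) -> int:
        return len(self._hash_table)


store = DedupStore()
first_sig, first_dup = store.put(b"blueprint v1")
second_sig, second_dup = store.put(b"blueprint v1")
third_sig, third_dup = store.put(b"blueprint v2")

print(first_dup, second_dup, third_dup, store.stored_count())
```

In this sketch a duplicate is simply not stored again; a real system could instead flag it and leave the decision to policy. The same content-derived signature can also be recomputed later to check that an object has not been altered, which is one way such signatures can support compliance requirements.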
However, 10% of all data is actively used, and it is for this data that traditional file systems, such as
NAS, are best suited. For example, a company developing marketing collateral will often have a team
working on the content. This team will collaborate and in doing so may be simultaneously reading and
writing to the data (contained, for example, in a Microsoft Word document). In these I/O- and
performance-prioritized cases, traditional file storage systems like NAS can, and will, continue to play a role.
Consequently, as shown in Figure 2, the frequency of data usage is a driver for multiple storage system
types within a storage environment and demonstrates that both Object Storage and traditional file
storage systems have strong roles to play within an organization.
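A usage-frequency policy of this kind might be expressed as a simple rule. The sketch below is illustrative only: the 30-day window, the tier names, and the `choose_tier` function are assumptions, not values or interfaces taken from this paper.

```python
from datetime import datetime, timedelta

# Hypothetical threshold: data touched within this window counts as active.
ACTIVE_WINDOW = timedelta(days=30)


def choose_tier(last_accessed: datetime, now: datetime,
                active_window: timedelta = ACTIVE_WINDOW) -> str:
    """Pick a storage tier from how recently the data was used."""
    if now - last_accessed <= active_window:
        return "nas"      # actively read and written: keep on file storage
    return "object"       # rarely touched: move to object storage


now = datetime(2010, 5, 31)
active_tier = choose_tier(datetime(2010, 5, 20), now)
archive_tier = choose_tier(datetime(2009, 1, 15), now)

print(active_tier, archive_tier)
```

A production policy would likely weigh more than access time, for example content type or retention rules, but the structure stays the same: metadata in, placement decision out.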
Measurement & Analysis of Large-Scale Network File System Workloads, University of California Santa
Cruz, 2008
Figure 2. The frequency of data usage is a factor in using Object vs. traditional file storage
Summary
As an increasing amount of the world's information is born digital and
stays digital, IT organizations will need to manage two challenges
simultaneously. The first is that most of the digitized information will be
unstructured, which means that it is inherently highly variable and
not easily managed without understanding the content and context of
the data. The second is the huge growth in data, both currently
and in the years ahead. This growth in data will exceed the
capabilities of the file systems upon which storage managers have
traditionally relied to store unstructured (non-database) data. As the
volume of unstructured data continues to grow, organizations will
find it increasingly difficult to cope.