Distributed Data Storage Systems - From RAID To Blockchains
Applicant’s Data
Name and Surname: Angela Chifligaroska
A new era began at the start of the 21st century: the Digital Era. Most things have become
digital or heavily dependent on technology, from radio and TV, through healthcare, to most of
our memories. Between 1986 and 2007 the amount of data stored per person grew by 23% per year,
as Computerworld [1] reports. As a result, a huge amount of digital data is created daily and
accumulates to unprecedented volumes.
Data storage has evolved over the years to accommodate the rising needs of companies and
individuals. We are now reaching a tipping point at which the traditional approach to storage,
the use of a stand-alone, specialized storage box, no longer works for both technical and
economic reasons. We need not just faster drives and networks; we need a new approach, a new
concept of data storage. At present the best approach to satisfying current demands for storing
data seems to be distributed storage.
This concept has appeared in different forms and shapes through the years. While there is no
commonly accepted definition of what a distributed storage system is, we can summarize it as:
“Storing data on a multitude of standard servers, which behave as one storage system although
data is distributed between these servers.”
A Distributed Storage System (DSS) is an advanced form of the “Software-Defined Storage” (SDS)
concept [2-5]; it can be regarded as SDS 2.0 [6]. Unlike old-fashioned SDS solutions:
1. A DSS can run compute workloads on the same physical servers, i.e. it can form an
efficient Hyper-Converged Infrastructure (HCI).
2. A DSS can scale out, i.e. it builds one shared storage system out of a very large number of
nodes. Old-fashioned SDS solutions were scale-up systems, which formed two-node clusters in
active-passive or mirrored configurations.
3. A DSS can achieve performance that is out of reach for SDS 1.0 solutions, and it does so
with very low usage of compute resources (CPU and RAM). This is one of the reasons why a DSS
can run in a hyper-converged manner, unlike old-fashioned SDS solutions.
4. Finally, the usability and functionality of a good distributed storage system are
qualitatively different from those of generation 1 SDS. By analogy, SDS 1.0 has the usability
of a push-button mobile phone, while a DSS has the usability of a modern touch-screen
smartphone.
Why is the distributed storage system becoming so important?
The main reason is that the current approach to storage does not work anymore: it is not
flexible enough, not fast enough, or prohibitively expensive, and in many cases all three at
once. By design, a distributed storage system addresses all of these issues together.
The simplest way to store large-scale data or files online is to use cloud storage. However,
with increasing demand and usage, these centralized systems have become major targets for
hacks and data breaches, which leaves the data vulnerable to tampering. Even if encryption is
used, the keys are often stored with the cloud service provider, which reduces the security
that encryption provides. Another problem is that the data is not always encrypted during
transmission and can hence be intercepted on its way from the user’s computer to the cloud.
One way to make cloud storage faster and more secure is to use blockchain. A blockchain is
a database or ledger that is shared across a network. The ledger is encrypted such that only
authorized parties can access the data. Because the ledger is shared, its records cannot be
tampered with unnoticed, and the data is not held by a single entity. As a result, new players
will enter the market and the understanding of how data is stored will change. This will
inevitably take time in a market that is risk averse and focused on keeping the lights on [7].
However, it will also be driven by overarching business objectives that center on embracing
blockchain technologies and the decentralized applications that are built on them.
While it may take time to become the established go-to choice, decentralized data storage
offers a more secure, efficient and scalable solution in an increasingly data-hungry and data-
heavy world.
Work plan
This Final Project reviews the history and development of distributed data storage systems,
from the traditional RAID solution to today's blockchain technology. It covers the importance
of DSS and how data reliability, efficiency and security are achieved in the different kinds
of DSS.
First, I discuss the most commonly deployed multi-device storage systems, RAID (Redundant
Array of Independent/Inexpensive Disks) systems [8-10]. These store the data across multiple
disks, some of which contain the actual information, while the others provide fault tolerance
by storing redundancy. Furthermore, distributing the data over multiple disks may also increase
read throughput, thanks to the parallelization of disk accesses. RAID systems traditionally put
the multiple storage disks within a single computing unit, making the internal distribution
transparent to end users both logically and physically. Currently, typical RAID configurations
allow for two failures within a RAID unit, though configurations tolerating more failures have
also been studied.
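The redundancy idea described above can be illustrated with a minimal sketch of single-parity striping, in the style of RAID 5. The block contents, disk count and function names are hypothetical and chosen only for illustration. A single XOR parity block tolerates one failed disk; tolerating two failures, as mentioned above, requires a second, independent parity computation that is not shown here.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data_blocks):
    """Return the data blocks followed by one XOR parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def rebuild(stripe, failed_index):
    """Rebuild the block of one failed disk from all surviving blocks."""
    survivors = [block for i, block in enumerate(stripe) if i != failed_index]
    return xor_blocks(survivors)


# Three hypothetical data disks; the fourth stripe entry is the parity disk.
data = [b"AAAA", b"BBBB", b"CCCC"]
stripe = make_stripe(data)

# Pretend disk 1 has failed and reconstruct its content from the others.
recovered = rebuild(stripe, 1)
print(recovered == data[1])
```

Because XOR is its own inverse, the same `rebuild` function recovers a lost data block or a lost parity block; the price is the capacity of one extra disk per stripe.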
Recently, the computational requirements for large-scale, data-intensive analysis of scientific
data have grown significantly. Popular statistics and infographics often showcase the amount of
data, such as social media content, emails, or videos, being produced, uploaded, and stored
daily on servers around the world. Such numbers are impressive, with trillions of gigabytes of
data being created each day and many companies storing hundreds of terabytes of data; data
creation has increased so much that the amount of data in the world in 2020 is expected to be
300 times that of 2005 [11].
Current data storage systems based on RAID arrays were not designed to scale to this type of
data growth. As a result, the cost of RAID-based storage systems increases as the total amount of
data storage increases, while data protection degrades, resulting in permanent digital asset loss.
With the capacity of storage devices today, RAID-based systems cannot protect data from loss.
Most IT organizations using RAID for big data storage incur additional costs to copy their data
two or three times to protect it from inevitable data loss.
The idea of distributing data across multiple disks has been naturally extended to multiple
storage nodes interconnected over a network, as we witness in data centers and some
peer-to-peer (P2P) storage systems. Such systems are called networked distributed storage
systems (NDSS) [12], where the word “networked” emphasizes the importance of the network
interconnect. It is worth recalling that the individual storage nodes in an NDSS may themselves
consist of multi-disk RAID systems, whose storage disks may in turn employ some redundancy
scheme for fault tolerance of their physical medium. Thus, while redundancy is present at
several layers of a large storage system, in this thesis I look only at redundancy through
coding techniques [13-16] at the highest level of abstraction.
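To make the trade-off between plain copying and coding-based redundancy concrete, the following sketch compares storage overhead and tolerated failures. It assumes the standard property of an (n, k) maximum distance separable (MDS) erasure code: an object is split into k blocks, encoded into n blocks, and can be read back from any k of them. The class and function names, and the example parameters, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scheme:
    name: str
    n: int  # total blocks stored per object
    k: int  # blocks needed to read the object back

    @property
    def overhead(self):
        """Raw storage used per unit of user data."""
        return self.n / self.k

    @property
    def tolerated_failures(self):
        """Number of lost blocks (nodes) the object survives."""
        return self.n - self.k


def replication(copies):
    # Replication is the special case k = 1: any single copy suffices.
    return Scheme(f"{copies}-way replication", n=copies, k=1)


def mds_code(n, k):
    return Scheme(f"({n},{k}) MDS erasure code", n=n, k=k)


schemes = [replication(3), mds_code(6, 4), mds_code(14, 10)]
for s in schemes:
    print(f"{s.name}: {s.overhead:.2f}x storage, "
          f"tolerates {s.tolerated_failures} failures")
```

Under these assumptions, three copies and a (6, 4) code both survive two failures, but the code needs half the raw capacity. What the sketch does not capture is the repair cost after a failure, which is generally higher for codes than for replication.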
It’s estimated that there will be over 20 billion connected devices by 2020 [17], all of which
will generate and then require management, storage, and retrieval of enormous amounts of data.
Connected devices, combined with consumer personalization apps and the increasing need to
share data across business lines, are all playing their part in increasing demand for storage.
Businesses wanting to launch new, data-driven applications face a mountain of time, effort and
coordination to provision new databases today.
Right now, single-system and even cloud-based databases are highly centralized, which makes
them an attractive target for attackers.
Blockchain technology has been one of the major technological breakthroughs of this century
[18]. Bitcoin, the first blockchain application, allows a network of users to perform
transactions without having to trust anyone on the network or a third party. Everything is
cryptographically secured, and nobody can tamper with the blockchain without everyone else
noticing.
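The tamper-evidence property described above can be sketched with a simple hash chain, in which every block stores the SHA-256 hash of its predecessor. This is a deliberately reduced illustration: the block layout and function names are hypothetical, and the sketch leaves out everything that makes a real blockchain decentralized, such as digital signatures, consensus and proof of work.

```python
import hashlib
import json

GENESIS_PREV = "0" * 64  # placeholder predecessor hash for the first block


def block_hash(index, data, prev_hash):
    """Hash the block content together with the previous block's hash."""
    payload = json.dumps(
        {"index": index, "data": data, "prev": prev_hash}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def build_chain(records):
    chain = []
    prev_hash = GENESIS_PREV
    for index, data in enumerate(records):
        current = block_hash(index, data, prev_hash)
        chain.append(
            {"index": index, "data": data, "prev": prev_hash, "hash": current}
        )
        prev_hash = current
    return chain


def is_valid(chain):
    """Recompute every hash and check that each block links to its predecessor."""
    prev_hash = GENESIS_PREV
    for block in chain:
        if block["prev"] != prev_hash:
            return False
        if block["hash"] != block_hash(block["index"], block["data"], block["prev"]):
            return False
        prev_hash = block["hash"]
    return True


chain = build_chain(["alice pays bob 5", "bob pays carol 2", "carol pays dave 1"])
intact = is_valid(chain)

# Changing an old record invalidates its stored hash, so the edit is detected.
chain[1]["data"] = "bob pays carol 200"
tampering_detected = not is_valid(chain)
print(intact, tampering_detected)
```

An attacker who also recomputes the hashes of all later blocks would pass this local check, which is why real systems combine the chain with replication across many independent participants and a consensus rule.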
There are a few ways that a blockchain can be used in distributed storage software. One of
the most common is to break up data into chunks, encrypt the data so that you are the only one
with access to it, and distribute the chunks across a network so that all your files remain
available even if part of the network is down. Essentially, instead of handing your files to a
company like Amazon or Microsoft, you distribute them across a network of people all over the
world [19]. The cloud is shared by the community, and nobody can read or tamper with anyone
else’s sensitive data. In other words, you stay in control. This could also be useful in public
services to keep public records safe, available, and decentralized.
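The chunk-and-distribute scheme described above can be sketched as follows. The node names, chunk size, round-robin placement and replica count are hypothetical choices made for illustration, and the nodes are simulated as in-memory dictionaries. The encryption step is intentionally omitted; in a real system each chunk would be encrypted with a key held only by the owner before it leaves the owner's machine.

```python
import hashlib


def split_into_chunks(data, chunk_size):
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def distribute(chunks, node_names, replicas):
    """Place each chunk on `replicas` different nodes, round-robin.

    Assumes replicas <= len(node_names) so that the copies land on distinct nodes.
    """
    nodes = {name: {} for name in node_names}
    manifest = []  # ordered chunk identifiers, kept by the data owner
    for index, chunk in enumerate(chunks):
        chunk_id = hashlib.sha256(chunk).hexdigest()
        manifest.append(chunk_id)
        for r in range(replicas):
            holder = node_names[(index + r) % len(node_names)]
            nodes[holder][chunk_id] = chunk
    return nodes, manifest


def retrieve(nodes, manifest, offline=()):
    """Reassemble the data from any reachable node holding a valid copy."""
    parts = []
    for chunk_id in manifest:
        for name, store in nodes.items():
            if name in offline or chunk_id not in store:
                continue
            chunk = store[chunk_id]
            if hashlib.sha256(chunk).hexdigest() != chunk_id:
                continue  # copy was altered; try another node
            parts.append(chunk)
            break
        else:
            raise LookupError(f"chunk {chunk_id[:8]} is unavailable")
    return b"".join(parts)


document = b"public record: parcel 42 belongs to the municipality"
chunks = split_into_chunks(document, chunk_size=8)
nodes, manifest = distribute(
    chunks, ["node-a", "node-b", "node-c", "node-d"], replicas=2
)

# The data is still readable while one of the four nodes is down.
restored = retrieve(nodes, manifest, offline={"node-b"})
print(restored == document)
```

Using the hash of a chunk as its identifier lets the owner detect a modified copy without trusting the node that returned it. With two replicas the data survives any single node outage; two simultaneous outages of neighboring nodes would make some chunks unavailable.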
Therefore, I give a brief introduction to blockchains as a decentralized storage solution
that is expected to make file storage cheaper and much more secure.
I finish my thesis by drawing conclusions about the importance and usage of the internet and
cloud storage services. These conclusions are based on 2014 and 2018 Eurostat statistics on the
use of information technology among individuals in EU member states, and on my own survey based
on 2019 online interviews of individuals from Macedonia. The main findings are that individuals
in both the EU countries and Macedonia use the internet every day and use cloud storage
services occasionally. Moreover, although more than half of the citizens of developed EU
countries were aware of cloud storage services five years ago, only one in ten Macedonians was
aware of this possibility. Our survey shows that awareness of cloud storage services is growing
steadily, with little concern for the security of data.
References
[1] L. Mearian, Scientists calculate total data stored to date: 295+ exabytes, Computerworld,
2011, online https://www.computerworld.com/article/2513110/data-center/scientists-calculate-total-data-stored-to-date--295--exabytes.html .
[6] G. Crump, What is Software Defined Storage 2.0?, Storage Switzerland LLC, 2015, online
https://storageswiss.com/2015/07/07/what-is-software-defined-storage-2-0/ .
[7] P. Bains, Blockchain and data storage: the future is decentralized, Dataconomy, 2018,
online https://dataconomy.com/2018/01/blockchain-data-storage-decentralized-future/ .
[8] D. A. Patterson, G. Gibson, and R. H. Katz, A case for redundant arrays of inexpensive disks
(RAID), in Proc. ACM SIGMOD International Conference on Management of Data, 1988.
[9] Cisco, Cisco UCS Servers RAID Guide, Cisco Systems, Inc., 2013.
[11] IBM Big Data & Analytics Hub, The four v’s of big data, 2004, infographic online
http://www.ibmbigdatahub.com/infographic/four-vs-big-data .
[12] F. Oggier and A. Datta, Coding Techniques for Repairability in Networked Distributed
Storage Systems, Nanyang Technological University Singapore, 2008.
[14] K. V. Rashmi, N. B. Shah, and P. V. Kumar, Regenerating Codes for Errors and Erasures in
Distributed Storage, in Proc. IEEE International Symposium on Information Theory (ISIT),
(Cambridge, MA), 2012.
[17] R. Van Der Meulen, Gartner Says 8.4 Billion Connected "Things" Will Be in Use in 2017,
Up 31 Percent From 2016, Gartner, 2017, online
https://www.gartner.com/en/newsroom/press-releases/2017-02-07-gartner-says-8-billion-connected-things-will-be-in-use-in-2017-up-31-percent-from-2016 .
[18] M. A. Callahan, How Blockchain Can be Used to Secure Sensitive Data Storage,
Dataversity, 2017, online http://www.dataversity.net/blockchain-can-used-secure-sensitive-data-storage/ .
[19] S. Lee, Blockchain Is Critical To The Future Of Data Storage -- Here's Why, Forbes, 2018,
online https://www.forbes.com/sites/shermanlee/2018/06/08/blockchain-is-critical-to-the-future-of-data-storage-heres-why/#62e6cc2833e9 .
Suggested mentor
Aneta Velkoska, PhD
Suggested member
Aleksandar Karadimce, PhD
Date: Applicant:
__________________ Angela Chifligaroska
______________________