Unit 3: Cloud Computing
Storage in Cloud
Free Cloud Storage
Google Drive
Google is one of the giants in cloud storage. It offers:
• Free data storage up to 15 GB – Google Drive is one of the most
generous cloud offerings. This storage space is shared with other
Google services, including Gmail and Google Photos. Mobile apps are
also available for iOS and Android users for easy access.
• G Suite tools – includes online office tools for word processing,
spreadsheets, and presentations, which make sharing files with others
effortless.
OneDrive
• OneDrive is aimed particularly at Microsoft Windows users. It
allows 5 GB of free data storage and integrates tightly with
Microsoft products.
• Files can be edited without downloading them. File sharing in
OneDrive is possible even with people who aren't OneDrive users.
Dropbox
• It has great support for third-party apps, with a web interface
that remains streamlined and easy to use.
• Dropbox gives new users 2 GB of free storage. However, there are
ways to boost this space without paying, such as inviting friends
(500 MB per referral), completing the getting-started guide
(250 MB), etc.
• There are desktop apps for Windows, Linux, and macOS, and mobile
apps for Android, iOS, and even Kindle.
• The web version lets you edit files without needing to download
them.
Business Cloud Storage
SpiderOak
• Founded in 2007, SpiderOak is a collaboration tool, file-hosting
and online backup service. It allows users to access, synchronize,
and share data using a cloud-based server.
• The main focus of SpiderOak is on privacy and security.
• The tool has a very basic design, which makes the admin console and
its clients easy to use.
Tresorit
• Founded in 2011, Tresorit is a cloud storage provider based in
Hungary and Switzerland. It emphasizes enhanced security and data
encryption for business and personal users.
• It allows you to keep control of your files through 'zero-knowledge
encryption', which means that only you and the chosen few you decide
to share with can see your data.
Egnyte
• Founded in 2007, Egnyte provides software for enterprise file
synchronization and sharing. It allows businesses to store their data
locally and online.
Big Data in Cloud
Characteristics of big data
Volume
The key characteristic of big data is its scale—the volume of data that is
available for collection by your enterprise from a variety of devices and sources.
Variety
Variety refers to the formats that data comes in, such as email messages, audio
files, videos, sensor data, and more. Classifications of big data variety include
structured, semi-structured, and unstructured data.
Velocity
Big data velocity refers to the speed at which large datasets are acquired,
processed, and accessed.
Variability
Big data variability means the meaning of the data constantly changes.
Therefore, before big data can be analyzed, the context and meaning of the
datasets must be properly understood.
Cloud Computing and Big Data
• In cloud computing, all data is gathered in data centers and then
distributed to the end-users. Automatic backup and recovery of data
are also ensured for business continuity, and all such resources are
available in the cloud.
Cloud for Big Data
Below are some examples of how cloud applications are used for Big Data:
IaaS in a public cloud: Using a cloud provider's infrastructure for Big Data services
gives access to almost limitless storage and compute power. IaaS can be utilized by
enterprise customers to create cost-effective and easily scalable IT solutions, where the
cloud provider bears the complexity and expense of managing the underlying hardware.
PaaS in a private cloud: PaaS vendors are beginning to incorporate Big Data
technologies such as Hadoop and MapReduce into their PaaS offerings, which eliminates
the need to deal with the complexities of managing individual software and hardware
elements (a sketch of the MapReduce idea follows at the end of this section).
For example, web developers can use individual PaaS environments at every stage of
developing, testing, and ultimately hosting their websites.
However, businesses that are developing their own internal software can also utilize
Platform as a Service, particularly to create distinct ring-fenced development and
testing environments.
SaaS in a hybrid cloud: Many organizations feel the need to analyze the customer's
voice, especially on social media. SaaS vendors provide the platform for the analysis
as well as the social media data.
Office software is the best example of businesses utilizing SaaS. Tasks related to
accounting, sales, invoicing, and planning can all be performed through SaaS. Businesses
may wish to use one piece of software that performs all of these tasks, or several that
each perform a different task.
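As a concrete illustration of the MapReduce model mentioned above, here is a toy word
count in plain Python. It mimics the map, shuffle, and reduce phases conceptually; it
is a minimal sketch, not Hadoop API code, and the function names are invented for
illustration.

# Toy word count illustrating the MapReduce model that Hadoop implements:
# a map phase emits (key, value) pairs, the framework groups them by key,
# and a reduce phase combines each group. Plain Python, not Hadoop code.
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the input.
    for word in document.split():
        yield word.lower(), 1

def reduce_phase(word, counts):
    # Reduce: sum all counts emitted for the same word.
    return word, sum(counts)

documents = ["Big Data in Cloud", "Cloud storage and Big Data"]

# Shuffle: group intermediate pairs by key (done by the framework in Hadoop).
groups = defaultdict(list)
for doc in documents:
    for word, count in map_phase(doc):
        groups[word].append(count)

results = sorted(reduce_phase(w, c) for w, c in groups.items())
print(results)  # [('and', 1), ('big', 2), ('cloud', 2), ('data', 2), ('in', 1), ('storage', 1)]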
Providers in the Big Data Cloud Market
Infrastructure as a Service cloud computing companies:
• Amazon's offerings include S3 (data storage/file system), SimpleDB
(non-relational database), and EC2 (computing servers).
• Rackspace's offerings include Cloud Drive (data storage/file system),
Cloud Sites (website hosting on cloud), and Cloud Servers (computing
servers).
• IBM's offerings include Smart Business Storage Cloud and Computing on
Demand (CoD).
• AT&T provides Synaptic Storage and Synaptic Compute as a Service.
Virtual Data Center

Cloud File Systems
Google File System (GFS)
• Stored data is divided into large chunks (64 MB), which are replicated
in the network a minimum of three times. The large chunk size
reduces network overhead.
• GFS is designed to accommodate Google’s large cluster requirements
without burdening applications. Files are stored in hierarchical
directories identified by path names. Metadata - such as namespace,
access control data, and mapping information - is controlled by the
master, which interacts with and monitors the status updates of each
chunk server through timed heartbeat messages.
GFS features include:
• Fault tolerance
• Critical data replication
• Automatic and efficient data recovery
• High aggregate throughput
• Reduced client and master interaction because of the large chunk size
• Namespace management and locking
• High availability
• The largest GFS clusters have more than 1,000 nodes with 300 TB of
disk storage. (A sketch of GFS-style chunking follows this list.)
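The chunking and replication described above can be sketched in a few lines of Python.
The 64 MB chunk size and threefold replication come from the text; the round-robin
placement policy and server names are illustrative assumptions, not actual GFS logic.

# Sketch of GFS-style chunking and replica placement (illustrative only).
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks, as in GFS
REPLICAS = 3                   # each chunk is replicated at least three times

def chunk_count(file_size):
    # Number of fixed-size chunks needed to hold a file (round up).
    return (file_size + CHUNK_SIZE - 1) // CHUNK_SIZE

def place_replicas(chunk_index, servers):
    # Assign each chunk to REPLICAS distinct chunk servers (round-robin).
    return [servers[(chunk_index + i) % len(servers)] for i in range(REPLICAS)]

servers = ["chunkserver-%d" % i for i in range(5)]  # a hypothetical cluster
file_size = 200 * 1024 * 1024                       # a 200 MB file -> 4 chunks
for c in range(chunk_count(file_size)):
    print("chunk", c, "->", place_replicas(c, servers))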
What is HDFS?
Hadoop comes with a distributed file system called HDFS.
• In HDFS, data is distributed over several machines and replicated to
ensure durability against failure and high availability to parallel
applications.
• It is cost-effective, as it uses commodity hardware. It involves the
concepts of blocks, data nodes, and a name node.
Where to use HDFS
• Very Large Files: Files should be of hundreds of megabytes,
gigabytes, or more.
• Streaming Data Access: The time to read the whole data set is more
important than the latency in reading the first record. HDFS is built
on a write-once, read-many-times pattern.
• Commodity Hardware: It works on low-cost hardware.
HDFS Concepts
Blocks: A block is the minimum amount of data that HDFS can read or
write. HDFS blocks are 128 MB by default, and this is configurable.
• Files in HDFS are broken into block-sized chunks, which are stored as
independent units.
Name Node: HDFS works in a master-worker pattern, where the name node
acts as the master.
• The name node is the controller and manager of HDFS, as it knows the
status and the metadata of all the files in HDFS;
• the metadata information being file permissions, names, and the
location of each block.
• File system operations like opening, closing, renaming, etc. are
executed by it.
Data Node: Data nodes store and retrieve blocks when they are told to,
by the client or the name node.
• They report back to the name node periodically with a list of the
blocks that they are storing.
• The data node, being commodity hardware, also does the work of block
creation, deletion, and replication as instructed by the name node. (A
sketch of the read path follows this list.)
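The division of labor between the name node and the data nodes can be illustrated with
a small Python sketch of the read path: the client asks the name node for block
locations, then fetches each block from one of its data nodes. Every class name, block
ID, and path here is hypothetical, chosen only to mirror the concepts above.

# Toy read path: name node holds metadata, data nodes hold block bytes.
class NameNode:
    # Metadata only: file path -> ordered list of (block id, replica nodes).
    def __init__(self, block_map):
        self.block_map = block_map

    def get_block_locations(self, path):
        return self.block_map[path]

class DataNode:
    # Stores the actual block contents.
    def __init__(self, blocks):
        self.blocks = blocks

    def read_block(self, block_id):
        return self.blocks[block_id]

# One file split into two blocks, each replicated on several data nodes.
datanodes = {
    "dn1": DataNode({"blk_1": b"hello "}),
    "dn2": DataNode({"blk_1": b"hello ", "blk_2": b"world"}),
    "dn3": DataNode({"blk_2": b"world"}),
}
namenode = NameNode({"/demo/file.txt": [("blk_1", ["dn1", "dn2"]),
                                        ("blk_2", ["dn2", "dn3"])]})

def read_file(path):
    data = b""
    for block_id, replicas in namenode.get_block_locations(path):
        data += datanodes[replicas[0]].read_block(block_id)  # any replica works
    return data

print(read_file("/demo/file.txt"))  # b'hello world'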
Common HDFS shell commands:
Command   Description
rm        Removes a file or directory
ls        Lists files with permissions and other details
mkdir     Creates a directory at the given path in HDFS
cat       Shows the contents of a file
rmdir     Deletes a directory
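These commands are run as `hdfs dfs -<command>`. Below is a small Python sketch that
scripts them via subprocess; it assumes a configured Hadoop installation on the PATH,
and the file and directory paths are examples only.

# Driving the standard `hdfs dfs` shell from Python.
import subprocess

def hdfs(*args):
    # Run one `hdfs dfs` command and return its standard output.
    result = subprocess.run(["hdfs", "dfs"] + list(args),
                            capture_output=True, text=True, check=True)
    return result.stdout

hdfs("-mkdir", "-p", "/user/demo")            # create a directory
hdfs("-put", "local.txt", "/user/demo/")      # upload a local file
print(hdfs("-ls", "/user/demo"))              # list files with permissions
print(hdfs("-cat", "/user/demo/local.txt"))   # show the file's contents
hdfs("-rm", "/user/demo/local.txt")           # remove the file
hdfs("-rmdir", "/user/demo")                  # delete the now-empty directory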