Big Data and Cloud Computing
By
S.Pavithra
U.Roopashree
Department of Information Technology
Kongu Engineering College.
Introduction
Cloud and Big data
Part of Big data
Part of Cloud and its services
Apache Hadoop
Goals of HDFS
Implementation considerations
Conclusion
Big data
Big data is the term for a collection of data sets so large and
complex that they are difficult to handle using conventional data
processing applications and management tools.
The challenges include capture, curation, storage, search,
sharing, transfer, analysis, and visualization.
As of 2012, an estimated 2.5 exabytes of data were created
every day.
The data set may contain both structured and unstructured data.
Big data
Big data is characterized by three V's:
Variety
Volume
Velocity
Cloud Computing
Cloud computing is a phrase used to describe a variety of
computing concepts that involve a large number of computers
connected through a real-time communication network such as
the Internet.
The cloud provides three service models:
SaaS (Software as a Service)
IaaS (Infrastructure as a Service)
PaaS (Platform as a Service)
It is critical for network administrators to consider the impact
of these technologies on their server, storage, networking, and
operations infrastructure.
Big data is pushing the envelope of networking requirements,
and it is forcing many new strategies to provide real-time
business analytics.
Cost of processing 1 petabyte of data with 1,000 nodes?
1 PB = 10^15 B = 1 million gigabytes = 1 thousand terabytes
Each node takes 9 hours to process about 500 GB at a rate of 15 MB/s:
15 * 60 * 60 * 9 = 486,000 MB ~ 500 GB
1,000 nodes * 9 h * $0.34 per node-hour = $3,060 for a single run
1 PB / 500 GB = 2,000 runs per node; 2,000 * 9 h = 18,000 h;
18,000 h / 24 = 750 days
The cost for 1,000 cloud nodes each processing 1 PB:
2,000 * $3,060 = $6,120,000
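The arithmetic above can be checked with a short script. This is only a sketch: the constant and function names are illustrative, and the 0.34 figure is assumed to be a price in dollars per node per hour, which is how the slide's multiplication reads.

```python
# Figures are taken from the slide. The unit of the 0.34 price
# (USD per node-hour) is an assumption inferred from the arithmetic.
NODES = 1000
RATE_MB_PER_S = 15
HOURS_PER_RUN = 9
PRICE_PER_NODE_HOUR = 0.34
PETABYTE_GB = 1_000_000  # 1 PB = 10^15 B = 1 million GB (decimal units)
CHUNK_GB = 500           # the slide rounds 486 GB up to 500 GB


def mb_per_run(rate_mb_per_s, hours):
    """Data one node processes in a single run, in MB."""
    return rate_mb_per_s * 60 * 60 * hours


def cost_per_run(nodes, hours, price):
    """Cost of renting `nodes` machines for `hours` hours."""
    return nodes * hours * price


def runs_per_petabyte(petabyte_gb, chunk_gb):
    """Number of 9-hour runs one node needs to work through 1 PB."""
    return petabyte_gb // chunk_gb


if __name__ == "__main__":
    runs = runs_per_petabyte(PETABYTE_GB, CHUNK_GB)
    single = cost_per_run(NODES, HOURS_PER_RUN, PRICE_PER_NODE_HOUR)
    print("MB per node per run:", mb_per_run(RATE_MB_PER_S, HOURS_PER_RUN))
    print("Runs per node for 1 PB:", runs)
    print("Days for one node to process 1 PB:", runs * HOURS_PER_RUN / 24)
    print("Cost of a single run (USD):", round(single, 2))
    print("Cost of all runs (USD):", round(runs * single, 2))
```

Changing the price or throughput constants shows how sensitive the total is to the per-hour rate, which is the point of the slide.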
Big data stored in the cloud is used by managers and analysts to
understand the business and make judgments.
Scalability to large data volumes
Cost-efficiency:
Commodity nodes (cheap, but unreliable)
Commodity network
Automatic fault-tolerance (fewer administrators)
Easy to use (fewer programmers)
IaaS enables us to allocate or buy time on shared server
resources, which are often virtualized, to handle the computing
and storage needs for big data analytics.
Cloud operating systems manage high-performance servers,
network, and storage resources.
IaaS also requires some investment from the organization in
software such as the Hadoop framework or a NoSQL database.
PaaS provides developers with tools and libraries to build, test,
deploy, and run applications on cloud infrastructure.
PaaS reduces management workload by eliminating the need
to configure and scale elements of Hadoop implementation
and serves as a development platform for advanced analytics
applications.
Multiple SaaS applications can be used to cover a wide range
of business scenarios.
SaaS can be offered as a standalone application or part of a
greater cloud provider solution.
It can be offered as a pay-as-you-go application.
MapReduce
Hadoop
Column-oriented database
Pig
Hive
WibiData
Platfora
Skytree
The Apache Hadoop software library is a framework that
allows for the distributed processing of large data sets across
clusters of computers using simple programming models.
It is designed to scale up from single servers to thousands of
machines, each offering local computation and storage.
The library itself is designed to detect and handle failures at
the application layer, thereby delivering a highly available
service on top of a cluster of computers, each of which may be
prone to failures.
The Hadoop Distributed File System (HDFS) is a
distributed file system designed to run on commodity
hardware.
HDFS is highly fault-tolerant and is designed to be
deployed on low-cost hardware.
HDFS provides high-throughput access to application
data and is suitable for applications that have large data
sets.
A Hadoop implementation creates four unique node types for
cataloging, tracking, and managing data throughout the
infrastructure: data node, client node, name node, and job
tracker.
Data node - the data nodes are the repositories for the data,
and consist of multiple smaller database infrastructures that are
horizontally scaled across compute and storage resources
through the infrastructure.
Client - the client represents the user interface to the big data
implementation and query engine. The client could be a server
or PC with a traditional user interface.
Name node - the name node is the equivalent of the address
router for the big data implementation. This node maintains
the index and location of every data node.
Job tracker - the job tracker represents the software job
tracking mechanism to distribute and aggregate search queries
across multiple nodes for ultimate client analysis.
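The "distribute and aggregate" role described for the job tracker can be illustrated with a small, single-process sketch. It is not Hadoop code: `map_partition` and `run_job` are hypothetical names, each inner list stands in for the data held on one data node, and word counting is used only as a simple example query.

```python
from collections import Counter


def map_partition(lines):
    """Work done on one node: count the words in its local partition."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def run_job(partitions):
    """Hypothetical stand-in for the job tracker: fan out, then merge."""
    # Distribute: one task per partition (run sequentially in this sketch).
    partials = [map_partition(partition) for partition in partitions]
    # Aggregate: merge the per-node results into one answer for the client.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


if __name__ == "__main__":
    node_data = [["big data", "cloud"], ["Big cloud cloud"]]
    print(run_job(node_data).most_common())
```

In a real cluster the per-partition tasks would run in parallel on the nodes that hold the data, and only the small partial results would travel over the network.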
Hardware Failure
Data is stored across many systems, so if one system fails,
the data on it becomes unavailable. HDFS should therefore be
built to detect and recover from hardware failure.
Streaming Data Access
Since HDFS is designed for batch processing, it needs
streaming access to data sets.
Large Data Sets
HDFS should provide high aggregate data bandwidth and
scale to hundreds of nodes in a single cluster as it processes
terabytes to petabytes of data.
Simple Coherence Model
HDFS applications need a write-once-read-many access
model for files. A file once created, written, and closed
need not be changed.
This enables high throughput data access.
Data replication
HDFS stores each file as a sequence of blocks - all blocks
in a file except the last block are the same size.
The blocks of a file are replicated for fault tolerance.
The replication factor can be specified at file creation time
and can be changed later.
The Name Node makes all decisions regarding replication
of blocks.
The Name Node periodically receives a heartbeat signal and a
Block report from each of the Data Nodes in the cluster.
Receipt of the heartbeat indicates that the Data Node is
functioning properly.
A Block report contains a list of all blocks on a Data Node.
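The replication scheme described here can be sketched in a few lines. This is a toy model under stated assumptions, not the HDFS implementation: the function names are hypothetical, the block size is an arbitrary small number, and the round-robin placement is a simplification chosen for readability.

```python
def split_into_blocks(file_size, block_size):
    """Return block lengths: every block is full except possibly the last."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes (round-robin, assumed policy)."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of data nodes")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            nodes[(block + i) % len(nodes)] for i in range(replication)
        ]
    return placement


def under_replicated(placement, live_nodes, replication):
    """Blocks with fewer live replicas than the target, e.g. after a node stops signalling."""
    return [
        block
        for block, holders in placement.items()
        if sum(holder in live_nodes for holder in holders) < replication
    ]


if __name__ == "__main__":
    blocks = split_into_blocks(file_size=300, block_size=128)
    nodes = ["dn1", "dn2", "dn3", "dn4"]
    placement = place_replicas(len(blocks), nodes, replication=3)
    print(blocks)
    print(placement)
    # Suppose dn4 stops sending its periodic signal:
    print(under_replicated(placement, {"dn1", "dn2", "dn3"}, replication=3))
```

The last call plays the part of the name node's bookkeeping: once a data node is considered dead, the blocks it held are the ones that need new replicas.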
Financial service providers
Use big data to determine the eligibility of their customers
for equity capital, insurance, mortgage, or credit.
Airlines and trucking companies
Use big data to track fuel consumption and traffic patterns
across their fleets in real time to improve efficiency and
save costs.
Healthcare providers
Use big data to manage and share patient electronic health
records from multiple sources.
Telecommunications and utilities
Use big data solutions to analyze user behavior and
demand patterns for a better and more efficient power grid.
Location independence - allows big data clusters and
applications to be placed anywhere in the data center and still
achieve optimum performance.
Scalability - scales from tens of ports to 6,144 10GbE ports,
all managed as a single switch.
Performance - provides any-to-any connectivity within a data
center, increased bandwidth, and deterministic low latency,
giving a substantial improvement in performance (40 Tbps
capacity) compared with traditional network architectures.
Convergence - allows the Hadoop cluster, storage area network
(SAN), and network-attached storage (NAS) to communicate easily
across the fabric with Fibre Channel/Fibre Channel over
Ethernet (FC/FCoE).
Operational simplicity - helps manage a group of resources
dynamically through a single switch abstraction and apply
policies easily across the entire data center.
There is a great need to process and analyze the data being
generated.
Big data analysis helps us process this complex data more
easily.
Newer technologies can be implemented to analyze big data
efficiently.
The cloud reduces the investment required for analysis.
It provides the necessary infrastructure at lower cost.