Big Data Analytics
Vu Pham
What is Big Data?
Big data is a collection of data sets so large and
complex that they become difficult to process using
traditional relational database management systems.
[Infographic: drivers of big data — 25+ TB of log data generated every day; hundreds of millions of GPS-enabled devices sold annually; 2+ billion people on the Web by end of 2011; 76 million smart meters in 2009, 200M by 2014.]
Each row can have its own set of column values. NoSQL
gives better performance in storing massive amounts of
data.
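To make the schema-flexibility point concrete, here is a minimal plain-Python sketch (not tied to any particular NoSQL product; the row keys and column names are made up) of rows that each carry their own set of column values.

# Minimal sketch in plain Python (not a real NoSQL client):
# each row, identified by a row key, carries only the columns it actually has.
rows = {
    "user:1001": {"name": "Asha", "email": "asha@example.com"},
    "user:1002": {"name": "Ravi", "city": "Mumbai", "last_login": "2014-03-01"},
}

for row_key, columns in rows.items():
    print(row_key, columns)   # the two rows expose different column sets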
Master-Slave design
Master Node
Single NameNode for managing metadata
Slave Nodes
Multiple DataNodes for storing data
Other
Secondary NameNode as a backup
Master
Manages filesystem namespace
Maintains the filesystem tree and metadata persistently in
two files: the namespace image and the edit log
Stores locations of blocks, but not persistently
Metadata – inode data and the list of blocks of each
file
Master node
Also known as the NameNode in HDFS
Stores metadata
Might be replicated
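As an illustration of what this metadata looks like, here is a small plain-Python sketch (hypothetical structures, not real HDFS code) of a namespace image with per-file block lists, an append-only edit log, and block locations kept only in memory.

# Hypothetical structures illustrating the NameNode's metadata: the filesystem
# tree with each file's block list is persisted in the namespace image plus the
# edit log, while block locations come from DataNode block reports at runtime.
namespace_image = {
    "/logs/2014-01-01.log": {"replication": 3, "blocks": ["blk_1", "blk_2"]},
    "/logs/2014-01-02.log": {"replication": 3, "blocks": ["blk_3"]},
}
edit_log = []          # append-only record of namespace changes since the last image
block_locations = {}   # blk_id -> list of DataNode hosts, rebuilt at runtime

def create_file(path, blocks, replication=3):
    namespace_image[path] = {"replication": replication, "blocks": blocks}
    edit_log.append(("create", path, blocks))

create_file("/logs/2014-01-03.log", ["blk_4"])
print(namespace_image["/logs/2014-01-03.log"])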
Word count example

Input to each map task: <filename, file text>

MAP TASK 1
Input:
Welcome Everyone
Hello Everyone
Output (key, value) pairs:
(Welcome, 1) (Everyone, 1) (Hello, 1) (Everyone, 1)

MAP TASK 2
Input:
Why are you here
I am also here
They are also here
Yes, it's THEM!
The same people we were thinking of
Output (key, value) pairs:
(Why, 1) (Are, 1) (You, 1) (Here, 1) …

REDUCE
Intermediate pairs, grouped by key:
(Welcome, [1]) (Everyone, [1, 1]) (Hello, [1])
Output:
(Welcome, 1) (Everyone, 2) (Hello, 1)
map(key, value):
  // key: document name; value: text of document
  for each word w in value:
    emit(w, 1)

reduce(key, values):
  // key: a word; values: an iterator over counts
  result = 0
  for each count v in values:
    result += v
  emit(key, result)
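The pseudocode above can be run end to end on a single machine. Below is a small Python rendering of it (illustrative only: a real MapReduce framework distributes the map and reduce tasks and performs the shuffle itself; the sample documents are made up).

# Single-machine Python rendering of the word-count pseudocode above.
from collections import defaultdict

def map_phase(filename, text):
    # emit (word, 1) for every word in the document
    for word in text.split():
        yield (word, 1)

def reduce_phase(word, counts):
    # sum all counts emitted for this word
    return (word, sum(counts))

documents = {
    "doc1": "Welcome Everyone Hello Everyone",
    "doc2": "Why are you here I am also here",
}

grouped = defaultdict(list)          # stands in for the shuffle/group-by-key step
for name, text in documents.items():
    for word, one in map_phase(name, text):
        grouped[word].append(one)

for word in sorted(grouped):
    print(reduce_phase(word, grouped[word]))   # e.g. ('Everyone', 2)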
• Advantages:
• High scalability
• Distributed computing
• Lower cost
• Schema flexibility, semi-structured data
• No complicated relationships
• Disadvantages:
• No standardization
• Limited query capabilities (so far)
• Eventual consistency is not intuitive to program for
Row Key. Each row has a unique key, which is a unique identifier for that
row.
Column. Each column contains a name, a value, and a timestamp.
Name. This is the name of the name/value pair.
Value. This is the value of the name/value pair.
Timestamp. This provides the date and time that the data was inserted.
This can be used to determine the most recent version of data.
Examples: BigTable, Cassandra, SimpleDB
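A minimal sketch of such a row, using plain Python rather than a real BigTable/Cassandra/SimpleDB client, with made-up row key and column names:

import time

# One row of a column-family store, modelled with plain dictionaries:
# a unique row key, and columns that each hold a name, a value, and a timestamp.
row = {
    "row_key": "user:1001",
    "columns": {
        "email": {"value": "asha@example.com", "timestamp": time.time()},
        "city":  {"value": "Mumbai",           "timestamp": time.time()},
    },
}

def read_column(row, name):
    # timestamps can be compared across versions to pick the most recent value
    column = row["columns"][name]
    return column["value"], column["timestamp"]

print(read_column(row, "email"))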
• A collection of documents
• Data in this model is stored inside documents.
• A document is a key-value collection where the key
allows access to its value.
• Documents are not typically forced to have a schema
and therefore are flexible and easy to change.
• Documents are stored in collections in order to group
different kinds of data.
• Documents can contain many different key-value pairs,
or key-array pairs, or even nested documents.
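A short sketch of what such a document might look like, expressed as a Python dict (JSON-like); the field names are invented and no particular document database is assumed.

# A single document: a key-value collection that may nest other documents and arrays.
order_document = {
    "_id": "order-7841",                              # key that gives access to the document
    "customer": {"name": "Ravi", "city": "Delhi"},    # nested document
    "items": [                                        # key-array pair
        {"sku": "A12", "qty": 2},
        {"sku": "B07", "qty": 1},
    ],
    "status": "shipped",
}
# Another document in the same collection could omit "items" or add new fields,
# since documents are not forced to share a schema.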
[Diagram: data sharing in MapReduce vs. Spark — each MapReduce step reads its input from HDFS and writes its output back to HDFS, while Spark reads once and keeps intermediate results in an in-memory cache.]
Solution: Resilient Distributed Datasets (RDDs)
Fault Recovery?
Lineage!
Log the coarse-grained operations applied to a
partitioned dataset
Simply recompute the lost partition if failure occurs!
No cost if no failure
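A hedged PySpark sketch of this idea, using a made-up HDFS path and filter condition: the RDD records the coarse-grained transformations as lineage, so a lost cached partition can be recomputed rather than restored from a replica.

from pyspark import SparkContext

sc = SparkContext(appName="lineage-demo")

# Coarse-grained transformations are recorded as lineage: textFile -> map -> filter.
lines  = sc.textFile("hdfs:///logs/app.log")                 # made-up path
errors = lines.map(lambda l: l.strip()) \
              .filter(lambda l: l.startswith("ERROR"))       # made-up condition
errors.cache()                                               # keep partitions in memory

print(errors.count())            # triggers the computation
print(errors.toDebugString())    # prints the recorded lineage used for recovery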
Introduction to Spark
Control
Partitioning: Spark also gives you control over how your
RDDs are partitioned.
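A small PySpark sketch of explicit partitioning; the data and the choice of four partitions are illustrative.

from pyspark import SparkContext

sc = SparkContext(appName="partitioning-demo")

# Pair RDDs can be partitioned explicitly, so records with the same key
# end up in the same partition.
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("c", 4)])
partitioned = pairs.partitionBy(4)          # 4 partitions, default hash partitioner

print(partitioned.getNumPartitions())       # -> 4
print(partitioned.glom().collect())         # see which keys landed in which partition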
Sliding Interval
• Number of blocks (partitions) by which the window slides
forward each time a windowed RDD is computed
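A hedged Spark Streaming sketch, assuming a socket source on localhost:9999 as a placeholder: a 30-second window that slides forward every 10 seconds over 10-second batches.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc  = SparkContext(appName="window-demo")
ssc = StreamingContext(sc, batchDuration=10)          # 10-second batches

lines    = ssc.socketTextStream("localhost", 9999)    # placeholder source
windowed = lines.window(windowDuration=30, slideDuration=10)   # 3-batch window, slides by 1 batch
windowed.count().pprint()                             # count of records in each window

ssc.start()
ssc.awaitTermination()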
[Diagram: graph applications (web pages, recommendation) and the triplets view — Gather at a vertex expressed as a Group-By over triplets, Apply as a Map, and Scatter as a Join.]
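A conceptual plain-Python sketch of the Gather-Apply-Scatter idea on a triplets view (this is not the GraphX API; the graph, vertex values, and update rule are made up).

from collections import defaultdict

# (src, dst, weight) edges and per-vertex values; the numbers are arbitrary.
edges    = [("B", "A", 1.0), ("C", "A", 2.0), ("A", "C", 0.5)]
vertices = {"A": 0.0, "B": 1.0, "C": 1.0}

def triplets(edges, vertices):
    # a triplet = an edge plus the current values of its source and destination vertices
    return [(src, vertices[src], dst, vertices[dst], w) for src, dst, w in edges]

# Gather: group incoming triplets by destination vertex and combine them
gathered = defaultdict(float)
for src, src_val, dst, dst_val, w in triplets(edges, vertices):
    gathered[dst] += src_val * w

# Apply: map the gathered sums onto new vertex values (keep the old value if no messages)
vertices = {v: gathered.get(v, val) for v, val in vertices.items()}

# Scatter: join the updated vertex values back onto the edges to form new triplets
print(triplets(edges, vertices))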
https://www.guru99.com/cassandra-tutorial.html
http://labs.google.com/papers/mapreduce.html
Big Data
Cloud Computing and Analytics
Distributed Systems