Big Data Analytics

This document provides an overview of big data analytics. It discusses what big data is, sources of big data generation, and challenges of big data such as capturing, storing, searching and analyzing large volumes of varied data. It describes the 3 Vs of big data (volume, velocity and variety) and additional Vs such as value, veracity, validity and variability. It then covers technologies used for big data, including Apache Hadoop, HDFS, YARN, MapReduce and NoSQL databases, with details of the Hadoop ecosystem and the architecture of the HDFS and YARN frameworks.

Big Data Analytics

Dr. Rajiv Misra


Professor
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
[email protected]
Big Data Analytics Vu Pham FDP
What is Big Data?
Big data is a collection of data sets so large and
complex that it becomes difficult to process them using
traditional relational database management systems.

What’s making so much data?
Ubiquitous computing: more people carrying data-generating
devices (mobile phones with Facebook, GPS, cameras, etc.)



Source of Data Generation
12+ TBs of tweet data every day
25+ TBs of log data every day
? TBs of data every day
30 billion RFID tags today (1.3B in 2005)
4.6 billion camera phones worldwide
100s of millions of GPS-enabled devices sold annually
76 million smart meters in 2009... 200M by 2014
2+ billion people on the Web by end 2011


Where is the problem?
Traditional RDBMS queries aren't sufficient to get
useful information out of the huge volume of data.
Searching it with traditional tools to find out whether a
particular topic was trending would take so long that
the result would be meaningless by the time it was
computed.
Big data technologies provide ways to store this data in
novel forms that make it more accessible, and
methods for performing analysis on it.


Challenges:
Capturing
Storing
Searching
Sharing
Analyzing
Visualization



IBM considers Big Data (3 V's):
Big data spans three dimensions: Volume, Velocity and
Variety.
Volume: Enterprises are awash with ever-growing
data of all types, easily amassing terabytes, even
petabytes, of information.
Turn 12 terabytes of Tweets created each day into
improved product sentiment analysis
Convert 350 billion annual meter readings to better
predict power consumption


IBM considers Big Data (3 V's):
Big data spans three dimensions: Volume, Velocity and
Variety.
Velocity: Sometimes 2 minutes is too late. For time-
sensitive processes such as catching fraud, big data
must be used as it streams into your enterprise in order
to maximize its value.
Scrutinize 5 million trade events created each day to
identify potential fraud
Analyze 500 million daily call detail records in real
time to predict customer churn faster


IBM considers Big Data (3 V's):
Big data spans three dimensions: Volume, Velocity and
Variety.
Variety: Big data is any type of data -
Structured data (example: tabular data)
Unstructured data - text, sensor data, audio, video
Semi-structured data - web data, log files


The 3 Big V’s (+1) (+ N more)
Big 3V’s
Volume
Velocity
Variety
Plus 1
Value



The 3 Big V’s (+1) (+ N more)
Plus many more:
Veracity
Validity
Variability
Viscosity & Volatility
Viability
Venue
Vocabulary
Vagueness


Facts and Figures



Value
Integrating data:
Reducing data complexity
Increasing data availability
Unifying your data systems
All three of the above lead to increased data collaboration
-> adding value to your big data


Veracity
Veracity refers to the biases, noise and
abnormality in data - the trustworthiness of the data.
1 in 3 business leaders don't trust the information they
use to make decisions.
How can you act upon information if you don't trust it?
Establishing trust in big data presents a huge
challenge as the variety and number of sources grow.


Valence
Valence refers to the connectedness of big
data.
Such as in the form of graph networks



Validity
Accuracy and correctness of the data relative to a
particular use
Example: Gauging storm intensity
satellite imagery vs social media posts

prediction quality vs human impact



Variability
How the meaning of the data changes over time
Language evolution
Data availability
Sampling processes
Changes in characteristics of the data source



Viscosity & Volatility
Both related to velocity
Viscosity: data velocity relative to timescale of
event being studied
Volatility: rate of data loss and stable lifetime
of data
Scientific data often has practically unlimited
lifespan, but social / business data may evaporate
in finite time



More V’s
Viability
Which data has meaningful relations to questions of
interest?
Venue
Where does the data live and how do you get it?
Vocabulary
Metadata describing structure, content, & provenance
Schemas, semantics, ontologies, taxonomies, vocabularies
Vagueness
Confusion about what “Big Data” means



Dealing with Volume
Distill big data down to small information
Parallel and automated analysis
Automation requires standardization
Standardize by reducing Variety:
Format
Standards
Structure



Big Data Enabling Technologies



Introduction
Big Data refers to collections of data sets so large
and complex that they are difficult to process using
traditional tools.

A recent survey says that 80% of the data created in
the world is unstructured.

One challenge is how we can store and process this big
amount of data. The following slides discuss the top
technologies used to store and analyse Big Data.


Apache Hadoop
Apache Hadoop is a Java-based free software
framework that can effectively store large amounts of
data in a cluster.

The framework runs in parallel on a cluster and has the
ability to process data across all nodes.

The Hadoop Distributed File System (HDFS) is the
storage system of Hadoop, which splits big data into blocks
and distributes them across many nodes in a cluster.


Hadoop Ecosystem





HDFS Architecture



YARN
YARN - Yet Another Resource Negotiator.

Apache Hadoop YARN is the resource management and
job scheduling technology in the open source Hadoop
distributed processing framework.

YARN is responsible for allocating system resources to
the various applications running in a Hadoop cluster
and for scheduling tasks to be executed on different
cluster nodes.


YARN



MapReduce
MapReduce is a programming model and an
associated implementation for processing and
generating large data sets.

Users specify a map function that processes a
key/value pair to generate a set of intermediate
key/value pairs, and a reduce function that merges all
intermediate values associated with the same
intermediate key.


Map Reduce



NoSQL
While traditional SQL can be effectively used to
handle large amounts of structured data, we need
NoSQL (Not Only SQL) to handle unstructured data.

NoSQL databases store unstructured data with no
particular schema.

Each row can have its own set of column values. NoSQL
gives better performance in storing massive amounts of
data.


NoSQL



Hive
Hive is a distributed data warehouse system for Hadoop.

It supports an SQL-like query language, HiveQL (HQL), to
access big data.

It can be used primarily for data mining purposes.

It runs on top of Hadoop.


Apache Spark
Apache Spark is a big data analytics framework that
was originally developed at the University of California,
Berkeley's AMPLab, in 2012. Since then, it has gained a
lot of attention both in academia and in industry.

Apache Spark is a lightning-fast cluster computing
technology, designed for fast computation.


Cassandra
Apache Cassandra is a highly scalable, distributed and
high-performance NoSQL database. Cassandra is
designed to handle huge amounts of data.

Cassandra handles huge amounts of data with its
distributed architecture.

Data is placed on different machines with a replication
factor greater than one, which provides high availability
and no single point of failure.


Cassandra

(Figure: circles are Cassandra nodes; the lines between the
circles show the distributed architecture, with a client
sending data to a node.)


HBase
HBase is an open source, distributed database,
developed by the Apache Software Foundation.

It is modelled on Google's Bigtable design and is
primarily written in Java.

HBase can store massive amounts of data, from
terabytes to petabytes.


HBase



Spark Streaming
Spark Streaming is an extension of the core Spark API
that enables scalable, high-throughput, fault-tolerant
stream processing of live data streams.

Streaming data can be input from HDFS, Kafka, Flume, TCP
sockets, Kinesis, etc.

Spark ML (machine learning) functions and GraphX
graph processing algorithms are fully applicable to
streaming data.


Spark Streaming



Spark MLlib
Spark MLlib is a distributed machine-learning
framework on top of Spark Core.

MLlib is Spark's scalable machine learning library,
consisting of common learning algorithms and utilities,
including classification, regression, clustering,
collaborative filtering and dimensionality reduction.


Spark MLlib Components



Spark GraphX
GraphX is a component in Spark for graphs and
graph-parallel computation. At a high level, GraphX
extends the Spark RDD by introducing a new graph
abstraction.

GraphX reuses the Spark RDD concept, simplifies graph
analytics tasks, and provides the ability to perform operations
on a directed multigraph with properties attached to
each vertex and edge.


Spark GraphX



Hadoop HDFS

Hadoop HDFS
The Hadoop Distributed File System (based on the Google File
System (GFS) paper, 2004)
Serves as the distributed file system for most tools in the Hadoop
ecosystem
Scalability for large data sets
Reliability to cope with hardware failures

HDFS is good for:
Large files
Streaming data
Not good for:
Lots of small files
Random access to files
Low-latency access


Design of the Hadoop Distributed File System (HDFS)

Master-Slave design
Master node:
A single NameNode for managing metadata
Slave nodes:
Multiple DataNodes for storing data
Other:
A Secondary NameNode as a backup


HDFS Architecture
The NameNode keeps the metadata: the name, location and
directory of each file. DataNodes provide storage for blocks of data.


File system Namespace
Hierarchical file system with directories and files
Create, remove, move, rename, etc.
The NameNode maintains the file system
Any metadata changes to the file system are
recorded by the NameNode.
An application can specify the number of replicas of
a file needed: the replication factor of the file. This
information is stored in the NameNode.


Namenode

Master
Manages the filesystem namespace
Maintains the filesystem tree and metadata persistently in
two files: the namespace image and the edit log
Stores the locations of blocks, but not persistently
Metadata: inode data and the list of blocks of each
file


Datanodes
Workhorses of the filesystem
Store and retrieve blocks
Send block reports to the NameNode
Do not use data protection mechanisms like RAID; use
replication instead
Startup handshake:
o Namespace ID
o Software version
After the handshake:
o Registration
o Storage ID
o Block report
o Heartbeats
What if node(s) fail?
Replication of blocks provides fault tolerance


Secondary Namenode
If the NameNode fails, the filesystem cannot be used
Two ways to make it resilient to failure:
o Backup of files
o Secondary NameNode

Periodically merges the namespace image with the edit log

Runs on a separate physical machine


Secondary Namenode
Has a copy of the metadata, which can be used to
reconstruct the state of the NameNode
Disadvantage: its state lags that of the primary NameNode
Renamed CheckpointNode (CN) in the 0.21 release [1]
Checkpointing is periodic, not continuous
If the NameNode dies, it does not take over the
responsibilities of the NameNode


MapReduce

Introduction
MapReduce is a programming model and an associated
implementation for processing and generating large data
sets.

Users specify a map function that processes a key/value
pair to generate a set of intermediate key/value pairs, and
a reduce function that merges all intermediate values
associated with the same intermediate key.

Many real-world tasks are expressible in this model.


Contd…
Programs written in this functional style are automatically
parallelized and executed on a large cluster of commodity
machines.
The run-time system takes care of the details of partitioning the
input data, scheduling the program's execution across a set of
machines, handling machine failures, and managing the required
inter-machine communication.
This allows programmers without any experience with parallel and
distributed systems to easily utilize the resources of a large
distributed system.
A typical MapReduce computation processes many terabytes of
data on thousands of machines. Hundreds of MapReduce
programs have been implemented and upwards of one thousand
MapReduce jobs are executed on Google's clusters every day.



Distributed File System
Chunk servers
File is split into contiguous chunks
Typically each chunk is 16-64 MB
Each chunk is replicated (usually 2x or 3x)
Try to keep replicas in different racks

Master node
Also known as the NameNode in HDFS
Stores metadata
Might be replicated

Client library for file access
Talks to the master to find chunk servers
Connects directly to chunk servers to access data
Motivation for MapReduce (Why)

Large-Scale Data Processing


Want to use 1000s of CPUs
But don’t want hassle of managing things

MapReduce Architecture provides


Automatic parallelization & distribution
Fault tolerance
I/O scheduling
Monitoring & status updates



What is MapReduce?
Terms are borrowed from Functional Language (e.g., Lisp)
Sum of squares:
(map square ‘(1 2 3 4))
Output: (1 4 9 16)
[processes each record sequentially and independently]
(reduce + ‘(1 4 9 16))
(+ 16 (+ 9 (+ 4 1) ) )
Output: 30
[processes set of all records in batches]
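The same sum-of-squares pipeline can be sketched in Python with the built-in map and functools.reduce (a minimal illustration added here, not part of the original slides):

```python
from functools import reduce

# Map: square each record sequentially and independently
squares = list(map(lambda x: x * x, [1, 2, 3, 4]))  # [1, 4, 9, 16]

# Reduce: fold the whole set of intermediate records into one value,
# mirroring (+ 16 (+ 9 (+ 4 1)))
total = reduce(lambda a, b: a + b, squares)  # 30
```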
Let’s consider a sample application: Wordcount
You are given a huge dataset (e.g., Wikipedia dump or all of
Shakespeare’s works) and asked to list the count for each of the
words in each of the documents therein
Map

Process individual records to generate intermediate
key/value pairs.

Input <filename, file text>:
"Welcome Everyone"
"Hello Everyone"

Output (Key, Value):
(Welcome, 1)
(Everyone, 1)
(Hello, 1)
(Everyone, 1)


Map

Process individual records in parallel to generate
intermediate key/value pairs.

Input <filename, file text>:
MAP TASK 1: "Welcome Everyone" -> (Welcome, 1), (Everyone, 1)
MAP TASK 2: "Hello Everyone" -> (Hello, 1), (Everyone, 1)


Map

Process a large number of individual records in parallel
to generate intermediate key/value pairs.

Input <filename, file text>:
"Welcome Everyone"
"Hello Everyone"
"Why are you here"
"I am also here"
"They are also here"
"Yes, it's THEM!"
"The same people we were thinking of"
.......

MAP TASKS output: (Welcome, 1), (Everyone, 1), (Hello, 1),
(Everyone, 1), (Why, 1), (Are, 1), (You, 1), (Here, 1), ...


Reduce

Reduce processes and merges all intermediate values
associated with each key.

Input (Key, Value):     Output (Key, Value):
(Welcome, 1)            (Welcome, 1)
(Everyone, 1)           (Everyone, 2)
(Hello, 1)              (Hello, 1)
(Everyone, 1)


Reduce

• Each key is assigned to one reduce task
• Reduce tasks process and merge all intermediate values
in parallel by partitioning keys

REDUCE TASK 1: (Welcome, 1), (Everyone, 1), (Everyone, 1)
               -> (Welcome, 1), (Everyone, 2)
REDUCE TASK 2: (Hello, 1) -> (Hello, 1)

• Popular: hash partitioning, i.e., a key is assigned to
  reduce # = hash(key) % number of reduce tasks
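A minimal sketch of hash partitioning in Python (the key set and the two-task split are illustrative; a stable CRC32 hash stands in for a framework's partitioner, since Python's built-in hash() is randomized per process for strings):

```python
import zlib
from collections import defaultdict

def partition(key, num_reduce_tasks):
    # reduce # = hash(key) % number of reduce tasks
    return zlib.crc32(key.encode("utf-8")) % num_reduce_tasks

intermediate = [("Welcome", 1), ("Everyone", 1), ("Hello", 1), ("Everyone", 1)]

buckets = defaultdict(list)  # reduce task number -> its share of the pairs
for key, value in intermediate:
    buckets[partition(key, 2)].append((key, value))
```

Because the partition depends only on the key, every (Everyone, 1) pair lands in the same reduce task, which is what lets each reduce task merge all values for its keys independently.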
Programming Model

The computation takes a set of input key/value pairs, and


produces a set of output key/value pairs.

The user of the MapReduce library expresses the


computation as two functions:

(i) The Map

(ii) The Reduce



(i) Map Abstraction

Map, written by the user, takes an input pair and produces


a set of intermediate key/value pairs.

The MapReduce library groups together all intermediate


values associated with the same intermediate key ‘I’ and
passes them to the Reduce function.



(ii) Reduce Abstraction
The Reduce function, also written by the user, accepts an
intermediate key ‘I’ and a set of values for that key.

It merges together these values to form a possibly smaller


set of values.

Typically just zero or one output value is produced per


Reduce invocation. The intermediate values are supplied to
the user's reduce function via an iterator.

This allows us to handle lists of values that are too large to


fit in memory.



Map-Reduce Functions for Word Count

map(key, value):
// key: document name; value: text of document
for each word w in value:
emit(w, 1)

reduce(key, values):
// key: a word; values: an iterator over counts
result = 0
for each count v in values:
result += v
emit(key, result)

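The word-count pseudocode above can be run end-to-end as a small single-process Python sketch (the shuffle step that groups values by key is simulated with a dictionary; the document names and contents are made up for illustration):

```python
from collections import defaultdict

def map_fn(document_name, text):
    # key: document name; value: text of document
    for word in text.split():
        yield word, 1  # emit(w, 1)

def reduce_fn(word, counts):
    # key: a word; values: an iterator over counts
    return word, sum(counts)

def word_count(documents):
    groups = defaultdict(list)  # the "shuffle": group values by key
    for name, text in documents.items():
        for word, one in map_fn(name, text):
            groups[word].append(one)
    return dict(reduce_fn(w, c) for w, c in groups.items())

counts = word_count({"doc1": "Welcome Everyone", "doc2": "Hello Everyone"})
# counts == {'Welcome': 1, 'Everyone': 2, 'Hello': 1}
```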


Map-Reduce Functions

Input: a set of key/value pairs


User supplies two functions:
map(k,v) → list(k1,v1)
reduce(k1, list(v1)) → v2
(k1,v1) is an intermediate key/value pair
Output is the set of (k1,v2) pairs



MapReduce Applications



Applications
Here are a few simple applications of interesting programs that
can be easily expressed as MapReduce computations.
Distributed Grep: The map function emits a line if it matches a
supplied pattern. The reduce function is an identity function that
just copies the supplied intermediate data to the output.
Count of URL Access Frequency: The map function processes
logs of web page requests and outputs (URL; 1). The reduce
function adds together all values for the same URL and emits a
(URL; total count) pair.
Reverse Web-Link Graph: The map function outputs (target;
source) pairs for each link to a target URL found in a page named
source. The reduce function concatenates the list of all source
URLs associated with a given target URL and emits the pair:
(target; list(source))
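The Distributed Grep application reduces to a map that filters lines plus an identity reduce; a minimal Python sketch (the pattern and log lines are invented for illustration):

```python
import re

def grep_map(line, pattern):
    # emit the line if it matches the supplied pattern
    if re.search(pattern, line):
        yield line

def identity_reduce(lines):
    # the reduce just copies the supplied intermediate data to the output
    return list(lines)

log = ["GET /index.html", "POST /login", "GET /about.html"]
matches = identity_reduce(m for line in log for m in grep_map(line, r"^GET"))
# matches == ["GET /index.html", "GET /about.html"]
```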
Contd…
Term-Vector per Host: A term vector summarizes the
most important words that occur in a document or a set
of documents as a list of (word; frequency) pairs.

The map function emits a (hostname; term vector) pair


for each input document (where the hostname is
extracted from the URL of the document).

The reduce function is passed all per-document term


vectors for a given host. It adds these term vectors
together, throwing away infrequent terms, and then emits
a final (hostname; term vector) pair



Contd…
Inverted Index: The map function parses each document,
and emits a sequence of (word; document ID) pairs. The
reduce function accepts all pairs for a given word, sorts
the corresponding document IDs and emits a (word;
list(document ID)) pair. The set of all output pairs forms a
simple inverted index. It is easy to augment this
computation to keep track of word positions.

Distributed Sort: The map function extracts the key from


each record, and emits a (key; record) pair. The reduce
function emits all pairs unchanged.

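The Inverted Index example above can be sketched in a few lines of Python (the documents are invented; the map emits (word, document ID) pairs and the reduce sorts the IDs for each word):

```python
from collections import defaultdict

def inverted_index(documents):
    pairs = defaultdict(set)  # word -> set of document IDs (grouped map output)
    for doc_id, text in documents.items():
        for word in text.split():
            pairs[word].add(doc_id)  # emit (word; document ID)
    # reduce: sort the document IDs for each word
    return {word: sorted(ids) for word, ids in pairs.items()}

index = inverted_index({"d1": "big data analytics", "d2": "big ideas"})
# index == {'big': ['d1', 'd2'], 'data': ['d1'], 'analytics': ['d1'], 'ideas': ['d2']}
```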


Conclusion
The MapReduce programming model has been successfully used
at Google for many different purposes.

The model is easy to use, even for programmers without


experience with parallel and distributed systems, since it hides the
details of parallelization, fault-tolerance, locality optimization, and
load balancing.

A large variety of problems are easily expressible as MapReduce


computations.

For example, MapReduce is used for the generation of data for


Google's production web search service, for sorting, for data
mining, for machine learning, and many other systems.
Conclusion

MapReduce uses parallelization + aggregation to
schedule applications across clusters.

Need to deal with failure.

Plenty of ongoing research work in scheduling and
fault-tolerance for MapReduce and Hadoop.
Big Data Storage Technologies



BIG DATA STORAGE TECHNOLOGIES

• Distributed File Systems: Example - the Hadoop
Distributed File System. It is well suited for quickly
ingesting data and bulk processing.
• NoSQL Databases: Use data models that are
outside the relational world.
• NewSQL Databases: A modern form of relational
databases that aim for comparable scalability to
NoSQL databases.
• Big Data Querying Platforms: Technologies that
provide query facades in front of big data stores
such as distributed file systems or NoSQL
databases.


RDBMS TRANSACTION - ACID RULE

• Atomic - All of the work in a transaction completes
(commit) or none of it completes.
• Consistent - A transaction transforms the database
from one consistent state to another consistent state.
Consistency is defined in terms of constraints.
• Isolated - Modifications of data performed by a
transaction must be independent of other
transactions.
• Durable - When the transaction is completed, the
effects of the modifications performed by the
transaction must be permanent in the system.


WHAT IS NOSQL?

• NoSQL is a non-relational database management
system
• Significantly different from traditional RDBMS
• It is designed for distributed data stores with very
large-scale data storage needs (for example
Google or Facebook, which collect terabits of data
every day for their users)
• This type of data store may not require a fixed
schema, avoids join operations and typically scales
horizontally.


NOSQL

• Stands for Not Only SQL
• No declarative query language
• No predefined schema
• Key-value pair storage, column stores, document
stores, graph databases
• Eventual consistency rather than the ACID property
• Unstructured and unpredictable data
• CAP theorem
• Prioritizes high performance, high availability and
scalability


WHY NOSQL?

• Data is becoming easier to access and capture
• Personal user information, social graphs, geo-
location data, user-generated content and machine
logging data are just a few examples where the
data has been increasing exponentially
• It is required to process huge amounts of data for
which SQL databases were never designed
• NoSQL aims to handle these huge data sets properly


WHEN TO USE NOSQL

• Big amounts of data
• Lots of reads/writes
• Economic
• Flexible schema
• No transactions needed
• ACID is not important
• No joins


NOSQL: PROS/CONS

• Advantages:
• High scalability
• Distributed computing
• Lower cost
• Schema flexibility, semi-structured data
• No complicated relationships
• Disadvantages:
• No standardization
• Limited query capabilities (so far)
• Eventual consistency is not intuitive to program for


THE BASE

• Almost the opposite of ACID.
• The BASE acronym was defined by Eric Brewer, who is
also known for formulating the CAP theorem.
• A BASE system gives up on consistency.
• Basically Available indicates that the system does guarantee
availability, in terms of the CAP theorem.
• Soft state indicates that the state of the system may change
over time, even without input. This is because of the
eventual consistency model.
• Eventual consistency indicates that the system will become
consistent over time, given that the system doesn't receive
input during that time.


NOSQL DATABASE TYPES
Document databases pair each key with a complex data
structure known as a document. E.g. FirstName = "Rashmi",
LastName = "Taneja", Address = "IIT Patna", Spouse = [{Name:
"Manoj", Age: 30}]
Graph stores are used to store information about networks,
such as social connections
In the key-value-store category of NoSQL database, a user can store
data in a schema-less way. A key may be strings, hashes, lists, sets or
sorted sets, and values are stored against these keys.
Wide column stores are optimized for queries over large datasets
and store columns of data together instead of rows.


NOSQL CATEGORIES

• Four general types (most common categories)
of NoSQL databases:
• Key-value stores
• Column-oriented
• Graph
• Document-oriented


KEY - VALUE STORES
• Key-value stores are the most basic type of NoSQL database.
• Designed to handle huge amounts of data.
• Based on Amazon's Dynamo paper.
• Key-value stores allow developers to store schema-less data.
• In key-value storage, the database stores data as a hash table where each key is
unique and the value can be a string, JSON (JavaScript Object Notation), BLOB (binary
large object), etc.
• A key may be strings, hashes, lists, sets or sorted sets, and values are stored
against these keys.
• For example, a key-value pair might consist of a key like "Name" that is
associated with a value like "Robin".
• Key-value stores can be used as collections, dictionaries, associative arrays, etc.
• Key-value stores follow the 'Availability' and 'Partition' aspects of the CAP theorem.
• Key-value stores work well for shopping cart contents, or individual values
like colour schemes, a landing page URI, or a default account number.

Big Data Analytics Vu Pham FDP
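A minimal in-memory sketch of the key-value model for the shopping-cart use case (the class name, methods and keys are invented for illustration, not any particular store's API):

```python
import json

class KVStore:
    """Toy key-value store: unique keys, schema-less values."""
    def __init__(self):
        self._table = {}  # backing hash table

    def put(self, key, value):
        self._table[key] = value  # value may be a string, JSON, BLOB, ...

    def get(self, key, default=None):
        return self._table.get(key, default)

store = KVStore()
store.put("cart:user42", json.dumps(["book", "pen"]))  # cart contents as JSON
store.put("theme:user42", "dark")                      # individual value
cart = json.loads(store.get("cart:user42"))
```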


KEY - VALUE STORES

• Examples: Redis, Dynamo, Riak.


COLUMN ORIENTED DATABASES
• Work on columns, and every column is treated individually.
• Values of a single column are stored contiguously.
• Column stores keep data in column-specific files.
• In column stores, query processors work on columns too.
• All data within a column has the same type, which makes it ideal
for compression.
• Column stores can improve the performance of queries as they can
access specific column data.
• High performance on aggregation queries (e.g. COUNT, SUM, AVG,
MIN, MAX).
• Used for data warehouses and business intelligence,
customer relationship management (CRM), library card
catalogs, etc.
COLUMN ORIENTED DATABASES

Row Key. Each row has a unique key, which is a unique identifier for that
row.
Column. Each column contains a name, a value, and a timestamp.
Name. This is the name of the name/value pair.
Value. This is the value of the name/value pair.
Timestamp. This provides the date and time that the data was inserted.
This can be used to determine the most recent version of data.

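The row layout described above can be sketched as a plain Python structure (the field names mirror the slide; the row key and data are invented for illustration):

```python
import time

def column(name, value):
    # each column holds a name, a value and the insertion timestamp,
    # which can later decide the most recent version of the data
    return {"name": name, "value": value, "timestamp": time.time()}

row = {
    "row_key": "user#1001",  # unique identifier for the row
    "columns": {
        "city": column("city", "Patna"),
        "email": column("email", "user1001@example.com"),
    },
}
```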


COLUMN ORIENTED DATABASES EXAMPLE

Examples: BigTable, Cassandra, SimpleDB


GRAPH DATABASES

A graph data structure consists of a finite (and


possibly mutable) set of ordered pairs, called
edges or arcs, of certain entities called nodes or
vertices.



GRAPH DATABASES

• A graph database stores data in a graph.
• It is capable of elegantly representing any kind of data in a highly
accessible way.
• A graph database is a collection of nodes and edges.
• Each node represents an entity (such as a student or business) and
each edge represents a connection or relationship between two
nodes.
• Every node and edge is defined by a unique identifier.
• Each node knows its adjacent nodes.
• As the number of nodes increases, the cost of a local step (or hop)
remains the same.
• Index for lookups.

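A toy adjacency-list sketch of these ideas (the class and data are invented; the point is that a hop to adjacent nodes is a constant-cost lookup, however large the graph grows):

```python
from collections import defaultdict

class GraphStore:
    def __init__(self):
        self.nodes = {}                    # unique id -> entity properties
        self._adjacent = defaultdict(set)  # each node knows its neighbours

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props

    def add_edge(self, a, b):
        # an edge is a relationship between two nodes
        self._adjacent[a].add(b)
        self._adjacent[b].add(a)

    def neighbours(self, node_id):
        return self._adjacent[node_id]  # a "hop" is one dictionary lookup

g = GraphStore()
g.add_node("alice", kind="student")
g.add_node("bob", kind="business")
g.add_edge("alice", "bob")
```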


COMPARISON BETWEEN RELATIONAL MODEL AND GRAPH MODEL

Relational Model        Graph Model
Tables                  Vertex and edge sets
Rows                    Vertices
Columns                 Key/value pairs
Joins                   Edges


GRAPH DATABASES EXAMPLE

Examples: OrientDB, Neo4j, Titan


DOCUMENT ORIENTED DATABASES

• A collection of documents
• Data in this model is stored inside documents.
• A document is a key-value collection where the key
allows access to its value.
• Documents are not typically forced to have a schema
and are therefore flexible and easy to change.
• Documents are stored in collections in order to group
different kinds of data.
• Documents can contain many different key-value pairs,
or key-array pairs, or even nested documents.

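A sketch of the document model using plain Python dictionaries (the people data reuses the earlier NoSQL-types example; the "query" at the end is illustrative):

```python
# A collection groups documents; each document is a key-value collection
# that may nest other documents, and documents need not share a schema.
people = [
    {"FirstName": "Rashmi", "LastName": "Taneja", "Address": "IIT Patna",
     "Spouse": [{"Name": "Manoj", "Age": 30}]},    # nested document
    {"FirstName": "Robin", "Hobbies": ["chess"]},  # different shape, same collection
]

# "Query": find documents that contain a Spouse field
with_spouse = [doc for doc in people if "Spouse" in doc]
```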


DOCUMENT ORIENTED DATABASES

Example: MongoDB, CouchDB



PRODUCTION DEPLOYMENT

• A large number of companies use NoSQL:
• Google
• Facebook
• Mozilla
• Adobe
• Foursquare
• LinkedIn
• Digg
• McGraw-Hill Education
• Vermont Public Radio


NEWSQL

• A term coined by 451 Group analyst Matt Aslett
• Offers the best of both worlds:
• Relational data model
• ACID transactional consistency
• Familiarity and interactivity of SQL
• Scalability and speed of NoSQL
• Example: VoltDB, NuoDB, MemSQL



NEWSQL: WHAT IS NEW?

• Main-memory storage: reading and writing blocks
in a memory cache is much faster.
• Historically, memory was much more expensive and
had a limited capacity compared to disks.
• Now the scenario is different.
• Many NewSQL DBMSs are based on this:
• Academic (e.g., H-Store, HyPer)
• Commercial (e.g., MemSQL, SAP HANA, VoltDB) systems
• NewSQL systems evict a subset of the database out to
persistent storage to reduce their memory footprint.



CONCLUSION

•Big data can be operational or analytical.

•The two classes of technologies are complementary and
frequently deployed together.
•Big data storage technologies have grown in the
following areas:
•Distributed File Systems
•NoSQL databases: comply with BASE
•NewSQL databases: comply with ACID



Introduction to Spark

Dr. Rajiv Misra


Associate Professor
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
[email protected]
Need of Spark
Apache Spark is a big data analytics framework that
was originally developed at the University of
California, Berkeley's AMPLab, in 2012. Since then, it
has gained a lot of traction both in academia and in
industry.

It is another system for big data analytics

Isn’t MapReduce good enough?


Simplifies batch processing on large commodity clusters



Need of Spark
Map Reduce

[Figure: MapReduce dataflow (input → map → reduce → output); each job performs an expensive save to disk for fault tolerance]


Need of Spark
MapReduce can be expensive for some applications, e.g.:
Iterative
Interactive

It lacks efficient data sharing

Specialized frameworks evolved for different programming
models
Bulk Synchronous Processing (Pregel)
Iterative MapReduce (HaLoop) ….



Solution: Resilient Distributed Datasets (RDDs)

Resilient Distributed Datasets (RDDs)

Immutable, partitioned collection of records


Built through coarse-grained transformations (map, join …)
Can be cached for efficient reuse



Need of Spark
[Figure: iterative MapReduce reads from and writes to HDFS at every step, while Spark reads from HDFS once, caches, and passes RDDs between steps (RDD → RDD → RDD)]
Solution: Resilient Distributed Datasets (RDDs)


Fault Recovery?
Lineage!
Log the coarse-grained operations applied to a
partitioned dataset
Simply recompute the lost partition if a failure occurs!
No cost if there is no failure
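The lineage idea above can be illustrated with a toy class in plain Python (this is not Spark's API): each dataset logs the coarse-grained transformation and its parent, so a lost partition is recomputed from that log rather than restored from a disk replica.

```python
# Toy lineage-based recovery: record how each dataset was derived,
# then replay that derivation for a lost partition only.
class ToyRDD:
    def __init__(self, partitions, parent=None, fn=None):
        self.partitions = partitions   # list of lists (one list per partition)
        self.parent = parent           # lineage: where the data came from
        self.fn = fn                   # lineage: how it was derived

    def map(self, f):
        new = [[f(x) for x in p] for p in self.partitions]
        return ToyRDD(new, parent=self, fn=f)

    def recompute(self, i):
        """Rebuild partition i by replaying the logged fn on the parent."""
        self.partitions[i] = [self.fn(x) for x in self.parent.partitions[i]]

base = ToyRDD([[1, 2], [3, 4]])
doubled = base.map(lambda x: 2 * x)
doubled.partitions[1] = None       # simulate losing one partition
doubled.recompute(1)               # replay lineage for that partition only
print(doubled.partitions)          # -> [[2, 4], [6, 8]]
```

Because only the lost partition is recomputed, recovery work is proportional to the loss, and there is no overhead at all when nothing fails.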





RDDs track the graph of transformations that built them
(their lineage) to rebuild lost data


What can you do with Spark?
RDD operations
Transformations e.g., filter, join, map, group-by …
Actions e.g., count, print …

Control
Partitioning: Spark also gives you control over how you can
partition your RDDs.

Persistence: allows you to choose whether you want to
persist an RDD (in memory or on disk) for reuse.



Partitioning: PageRank

Joins take place repeatedly
Good partitioning reduces shuffles



Example: PageRank

Give pages ranks (scores) based on links to them


Links from many pages → high rank
Links from a high-rank page → high rank



Algorithm
Step-1: Start each page at a rank of 1
Step-2: On each iteration, have page p contribute rank_p / |neighbors_p|
to its neighbors
Step-3: Set each page's rank to 0.15 + 0.85 × contributions
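The three steps can be run directly in plain Python on a small hypothetical link graph (Spark's version below distributes the same computation over RDDs):

```python
# PageRank in miniature: 4 pages, "links" maps each page to its neighbors.
graph_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = {p: 1.0 for p in graph_links}                 # Step-1: start at rank 1

for _ in range(20):                                   # Step-2: iterate
    contribs = {p: 0.0 for p in graph_links}
    for page, neighbors in graph_links.items():
        for dest in neighbors:                        # rank_p / |neighbors_p|
            contribs[dest] += ranks[page] / len(neighbors)
    # Step-3: damped update
    ranks = {p: 0.15 + 0.85 * c for p, c in contribs.items()}

best = max(ranks, key=ranks.get)
print(best)  # "c", the page with the most incoming links
```

Page d links out but nothing links to it, so its rank settles at the 0.15 floor, while c, which collects links from a, b, and d, ends up highest.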



Spark Program

val links = // RDD of (url, neighbors) pairs
var ranks = // RDD of (url, rank) pairs

for (i <- 1 to ITERATIONS) {
  val contribs = links.join(ranks).flatMap {
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }
  ranks = contribs.reduceByKey(_ + _)
                  .mapValues(0.15 + 0.85 * _)
}
ranks.saveAsTextFile(...)



PageRank Performance



Generality
RDDs allow unification of different programming models
Stream Processing
Graph Processing
Machine Learning



Big Data Analytics
Spark Streaming



What is Spark Streaming?
Receive data streams from input sources,
process them in a cluster, and push results out to
databases/dashboards
Scalable, fault-tolerant, second-scale latencies



Spark Streaming



Spark Streaming
Spark Streaming Characteristics
Extension to the Spark core API
Live data streams can be processed
Fault-tolerant and scalable
High-throughput, (near) real-time data processing
Mini-batch size of 0.5 s or longer
Streaming data input from HDFS, Kafka, Flume, TCP
sockets, Kinesis, etc.



Spark Streaming
Spark Streaming Characteristics
High-level stream processing functions:
Map, Reduce, Join, Window, etc.
Processed output is saved to file systems,
databases, and live dashboards
Spark ML (machine learning) functions and
GraphX graph processing algorithms are fully
applicable to streaming data



Spark Streaming
Spark Streaming Characteristics
Spark uses (size-controllable) micro-batch
processing of data for real-time analysis
Hadoop uses batch processing of data, which is
time-consuming to obtain results
Spark uses RDDs to arrange data and recover from
failures



Spark Streaming
Spark Streaming Input and Output



Spark Streaming
Streaming Receiver Types
Basic
• File systems
• Socket connections
• Akka Actors
• Sources directly available in the StreamingContext API
Custom
• Requires implementing a user-defined receiver
• Can receive data from anywhere
Advanced
• Requires linking with systems that have extra
dependencies
• Kafka, Flume, Twitter
Spark Streaming
Spark Streaming process
1. A live input data stream is received
2. The input data stream is divided into mini-batches
called a DStream (Discretized Stream), which is
saved as a small RDD every mini-batch period
3. The Spark Streaming engine processes the mini-
batches and generates a final output stream of
mini-batches



Spark Streaming
DStream
A DStream (Discretized Stream) is a continuous stream
of data with a high-level abstraction
DStreams are created from input data stream
sources (Kafka, Flume, Kinesis, etc.) or by high-level
processing operations on other DStreams



Spark Streaming
DStream
DStreams are represented as a sequence of small
RDDs
Mini-batch size is 0.5 s or longer
Each RDD is processed through the DAG
Processing latency (through the DAG) has to be
smaller than the mini-batch period



How does Spark Streaming work?
Chop up data streams into batches of a few seconds
Spark treats each batch of data as an RDD and processes
them using RDD operations
Processed results are pushed out in batches
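Micro-batching can be shown in miniature with plain Python (not Spark's API): the stream is chopped into small batches, each batch is processed as an independent little dataset, and one result is pushed out per batch.

```python
# Chop a stream into fixed-size mini-batches and process each one.
def micro_batches(stream, batch_size):
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:                       # flush the final partial batch
        yield batch

stream = iter(["a", "b", "a", "c", "b", "a", "c"])
counts = []
for batch in micro_batches(stream, batch_size=3):
    # each mini-batch is processed like a small RDD (here: word counts)
    counts.append({w: batch.count(w) for w in set(batch)})

print(counts)  # one word-count result per mini-batch
```

In Spark Streaming the "batch" boundary is a time interval rather than a record count, but the processing model is the same: one small RDD per interval.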



Spark Streaming Programming Model
Discretized Stream (DStream)
Represents a stream of data
Implemented as a sequence of RDDs
DStreams API very similar to RDD API
Functional APIs in Scala, Java
Create input DStreams from different sources
Apply parallel operations



Example – Get hashtags from Twitter



Languages



Window-based Transformations
Window Operations
Window Length
• Number of blocks (partitions) processed together in one
windowed RDD operation

Sliding Interval
• Number of blocks (partitions) to slide the window after a
windowed RDD operation is conducted



Window-based Transformations
Transformations for Windows
Key parameters: windowLength, slideInterval
window()
countByWindow()
reduceByWindow()
reduceByKeyAndWindow()
countByValueAndWindow()
etc.



Window-based Transformations
Window Operations
If Window Length > Sliding Interval, then overlapping
blocks (Window Length - Sliding Interval of them) will
exist for each RDD process

Overlapping blocks help to analyze the correlation
(dependency) of sequential blocks of the streamed
data.
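A minimal sketch of the window/slide relationship in plain Python (hypothetical data, not Spark's API): with a window length of 3 batches and a sliding interval of 2, consecutive windows overlap by 3 - 2 = 1 batch.

```python
# Slide a fixed-length window over a list of mini-batches.
def windows(batches, window_length, sliding_interval):
    for start in range(0, len(batches) - window_length + 1, sliding_interval):
        yield batches[start:start + window_length]

batches = [[1], [2, 2], [3], [4], [5, 5]]   # five mini-batches of records
totals = [sum(len(b) for b in w) for w in windows(batches, 3, 2)]
print(totals)  # record count per window; batch [3] is shared by both windows
```

The shared batch is exactly the overlap block the slide describes: each window sees it, which is what lets windowed operations relate adjacent stretches of the stream.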



Window-based Transformations



Spark Streaming
Spark Streaming Examples
IPTV or web page live statistics
• Channel or page view click counts
• Use Kafka for buffering
• Spark Streaming for processing
• Draw a heat map of the current channel or page view
clicks



Spark Streaming
Spark Streaming Examples
Sales Product Type Monitoring
1. Online Sales
– Read through Kafka
2. Department Store Sales
– Read through Flume
Join 2 live data streams into Spark Streaming



Spark Streaming



Spark Streaming
One Stream Input (e.g., from Kafka)
1. One task slot in the Executor will serve as a Receiver
(thread) to receive the live streaming data into a
Block (Partition) of the RDD on the node
2. The Receiver will also make a copy of this Partition on
another node (e.g., replication factor 2)
3. DAG transformations are executed on the new RDD



Spark Streaming
Another Stream Added (e.g., from Flume)
1. On a different node, assign one task slot in the
Executor to serve as a Receiver (thread) to receive
the live streaming data into a Block (Partition) of the
RDD on the node
2. The Receiver will also make a copy of this Partition on
another node (e.g., replication factor 2)



Spark Streaming
Another Stream Added (e.g., from Flume)
3. DAG Transformations are executed on the new RDD
4. Union can be used to unify the two RDDs into one
RDD



Conclusion



Machine Learning Using Spark



What is Machine Learning ?

• Science of how machines learn without being


explicitly programmed.



Machine Learning Use Cases



What is ML Model ?

• A mathematical formula with a number of parameters that need
to be learned from the data. Fitting a model to the data is
a process known as model training.
• E.g. Linear Regression
• Goal: fit a line y = mx + c to data points

• After model training: y = 2x + 5

[Figure: input → model → output]



Components Of Machine Learning
• CLASSIFICATION
Logistic Regression
Naïve Bayes
Support Vector Machines (SVM)
Random Forest
• COLLABORATIVE FILTERING
Alternating Least Squares (ALS)
• REGRESSION
Linear Regression
• CLUSTERING
K-means, LDA
• DIMENSIONALITY REDUCTION
Principal Component Analysis (PCA)



Linear Regression Model Training

• Linear regression is used to predict a quantitative response


Y from the predictor variable X.
• Linear regression assumes that there is a
linear relationship between X and Y.
ONE FEATURE
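For the one-feature case, fitting y = mx + c has a closed-form least-squares solution, sketched below in plain Python on hypothetical data (MLlib trains the same kind of model at scale, typically with iterative optimizers):

```python
# Closed-form least-squares fit of y = m*x + c for a single feature.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [5.0, 7.0, 9.0, 11.0, 13.0]     # generated exactly by y = 2x + 5

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
# slope = covariance(x, y) / variance(x)
m = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
    / sum((x - mean_x) ** 2 for x in xs)
c = mean_y - m * mean_x              # line passes through the mean point
print(m, c)  # -> 2.0 5.0
```

After training, the fitted model y = 2x + 5 predicts the response Y for new values of the predictor X.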



ML in SPARK

• Spark provides a library for implementing machine
learning, called MLlib.



Spark MLlib (Goals)

• Make practical machine learning scalable and easy to use.


• Simplify the development and deployment of scalable
machine learning pipelines.
• ~75 organizations, ~200 individuals, ~20 companies.



Spark MLlib Components
ALGORITHMS
• Classification
• Regression
• Collaborative Filtering
• Clustering
PIPELINE
• Constructing
• Evaluating
• Tuning
• Persistence



Spark MLlib Components
UTILITIES
• Linear Algebra
• Statistics
FEATURIZATION
• Extraction
• Transformation



Spark MLlib Pipeline



Spark MLlib Pipeline Concepts



Spark MLlib: Transformer
• Preprocessing step of feature extraction.
• Transforms data into a consumable format.
• Takes an input column and transforms it into an output column.
• Examples:
• Normalizing the data.
• Tokenization – splitting sentences into words.
• Converting categorical values into numbers.



Spark MLlib: Estimator
• Another kind of Transformer.
• Transforms data, requiring two passes over the data.

• A learning algorithm that trains (fits) on data.

• Returns a model, which is a type of Transformer.
• Example:
• LogisticRegression.fit() => LogisticRegressionModel



Spark MLlib: Evaluator
• Evaluates model performance based on a certain metric.
• ROC, RMSE
• Helps with automating the model tuning process.
• Comparing model performance.
• Selecting the best model for generating predictions.
• Examples:
• BinaryClassificationEvaluator, CrossValidator.



Spark MLlib : Pipeline
• Represents an ML workflow.
• Consists of a set of stages.
• Leverages the uniform API of Transformers & Estimators.
• Itself a type of Estimator: has fit().
• Can be persisted.
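The Transformer / Estimator / Pipeline pattern can be sketched in miniature with hypothetical Python classes (not MLlib's actual API): Transformers map data to data, an Estimator's fit() returns a model that is itself a Transformer, and a Pipeline chains the stages.

```python
class Tokenizer:                       # a Transformer: data -> data
    def transform(self, docs):
        return [d.lower().split() for d in docs]

class CountVectorizer:                 # an Estimator: fit() learns a vocabulary
    def fit(self, token_lists):
        vocab = sorted({t for toks in token_lists for t in toks})
        return CountVectorizerModel(vocab)

class CountVectorizerModel:            # the fitted model is a Transformer
    def __init__(self, vocab):
        self.vocab = vocab
    def transform(self, token_lists):
        return [[toks.count(v) for v in self.vocab] for toks in token_lists]

class Pipeline:                        # runs stages in order, fitting Estimators
    def __init__(self, stages):
        self.stages = stages
    def fit(self, data):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):          # Estimator: fit, then transform
                stage = stage.fit(data)
            fitted.append(stage)
            data = stage.transform(data)
        return fitted                          # the fitted pipeline model

docs = ["Spark is fast", "Spark is easy"]
model = Pipeline([Tokenizer(), CountVectorizer()]).fit(docs)

vectors = docs                         # apply the fitted pipeline to data
for stage in model:
    vectors = stage.transform(vectors)
print(vectors)
```

The uniform transform() interface is what lets arbitrary preprocessing and modeling stages be chained, tuned, and persisted as one unit.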



Spark MLlib: Machine Learning Pipeline



Spark ML Pipeline



Spark MLlib : Pipeline Example in Scala



Spark MLlib: Automatic model tuning
• ParamGridBuilder
• Cross Validator (k-fold)



Exporting ML Models - PMML
• Predictive Model Markup Language (PMML)
– XML-based predictive model interchange format
• Supported models
– K-means
– Linear Regression
– Ridge Regression
– Lasso
– SVM
– Binary



Spark ML Pipeline – Sample Code



Parameter Server
• A machine learning framework
• Distributes a model over multiple machines
• Offers two operations:
• Pull: query parts of the model

• Push: update parts of the model

• Machine learning update equation

• (Stochastic) gradient descent


• Collapsed Gibbs sampling for topic modeling
• Aggregate push updates via addition (+)
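The pull/push interface can be sketched with a toy in-process class (a hypothetical single-shard server, ignoring networking and sharding): workers pull the parts of the model they need and push additive updates, which the server aggregates.

```python
# Toy parameter server: pull reads parts of the model,
# push aggregates updates via addition (+).
class ParameterServer:
    def __init__(self, num_params):
        self.params = [0.0] * num_params

    def pull(self, keys):
        return {k: self.params[k] for k in keys}

    def push(self, grads):
        for k, g in grads.items():     # additive aggregation of updates
            self.params[k] += g

ps = ParameterServer(4)
# two "workers" push partial (e.g., gradient) updates independently
ps.push({0: 0.5, 2: -0.1})
ps.push({0: 0.25, 3: 1.0})
print(ps.pull([0, 2, 3]))  # -> {0: 0.75, 2: -0.1, 3: 1.0}
```

Because addition is commutative, updates from many workers can be applied asynchronously and in any order, which is what makes the sharded, asynchronous training setup on the next slide work.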
Parameter Server (PS)
• Training state stored in PS shards, asynchronous
updates.



The GraphX API

What is a Graph?
Graph: vertices connected by edges



Graphs are Essential to Data Mining and Machine
Learning
• Identify influential entities (people,
information…)
• Find communities
• Understand people’s shared interests
• Model complex data dependencies



Real World Graphs
Web pages





Real World Graphs

Recommendation



Real World Graphs

Credit card fraud detection



Table and Graph Analytics



Apache Spark GraphX

• Spark component for graphs and graph-parallel
computations
• Combines data-parallel and graph-parallel processing
in a single API
• View data as graphs and as collections (RDD)
– no duplication or movement of data
• Operations for graph computation
– includes optimized version of Pregel
• Provides graph algorithms and builders



Property Graph



Gather-Apply-Scatter on GraphX

Triplets: the graph is represented in a table of
(source vertex, edge, destination vertex) records.

The Gather-Apply-Scatter steps map onto table operations:
Gather at A → Group-By A
Apply → Map
Scatter → Triplets Join
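That mapping can be sketched in plain Python over a small hypothetical edge table (not GraphX's API): gather is a group-by over edge messages, apply is a map over vertices, and scatter is a join of the new vertex values back onto the triplets.

```python
# Gather-Apply-Scatter over a triplet table: (src, dst, edge attribute).
edges = [("B", "A", 1.0), ("C", "A", 2.0), ("A", "C", 3.0)]
values = {"A": 1.0, "B": 1.0, "C": 1.0}   # current vertex values

# Gather: group edge attributes by destination vertex and sum them
gathered = {}
for src, dst, attr in edges:
    gathered[dst] = gathered.get(dst, 0.0) + attr

# Apply: map the gathered aggregate onto each vertex value
values = {v: gathered.get(v, 0.0) for v in values}

# Scatter: join the updated vertex values back onto the triplets
triplets = [(src, values[src], dst, values[dst]) for src, dst, _ in edges]
print(values)     # -> {'A': 3.0, 'B': 0.0, 'C': 3.0}
```

Expressing each phase as a relational operation (group-by, map, join) is what lets GraphX run graph algorithms on top of Spark's table-like RDD machinery.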



Graphs Property



Creating a Graph (Scala)



Graph Operations (Scala)



Built-in Algorithms (Scala)



The triplets view



The subgraph transformation



The subgraph transformation



Computation with aggregateMessages



Computation with aggregateMessages



Example: Graph Coarsening



How GraphX Works



Storing Graphs as Tables



Simple Operations
Reuse vertices or edges across multiple graphs



Implementing triplets



Implementing triplets



Implementing aggregateMessages



Future of GraphX
1. Language support
a) Java API
b) Python API: collaborating with Intel, SPARK-3789
2. More algorithms
a) LDA (topic modeling)
b) Correlation clustering
3. Research
a) Local graphs
b) Streaming/time-varying graphs
c) Graph database–like queries



Other Spark Applications
i. Twitter spam classification

ii. EM algorithm for traffic prediction

iii. K-means clustering

iv. Alternating Least Squares matrix factorization

v. In-memory OLAP aggregation on Hive data

vi. SQL on Spark



Thank You!

References

https://www.crayondata.com/blog/top-big-data-technologies-used-store-analyse-data

https://www.guru99.com/cassandra-tutorial.html

• Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica,
"Spark: Cluster Computing with Working Sets"

• Matei Zaharia, Mosharaf Chowdhury et al.,
"Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing"
https://spark.apache.org/

• Jeffrey Dean and Sanjay Ghemawat,
"MapReduce: Simplified Data Processing on Large Clusters"
http://labs.google.com/papers/mapreduce.html
