
Big Data Analytics

Unit 4
Hadoop Related Tools
Faculty: Dr. Vandana Bhatia
Contents
❑ HBase
➢ Data model and implementations
➢ HBase clients
➢ HBase examples – praxis
❑ Cassandra
➢ Cassandra data model
➢ Cassandra examples
➢ Cassandra clients
➢ Hadoop integration
❑ Pig
➢ Grunt
➢ Pig data model
➢ Pig Latin
➢ Developing and testing Pig Latin scripts
❑ Hive
➢ Data types and file formats
➢ HiveQL data definition
➢ HiveQL data manipulation – HiveQL queries
❑ Overview of Spark
Overview of Spark
• Apache Spark is a lightning-fast cluster computing technology,
designed for fast computation.
• It is based on Hadoop MapReduce and it extends the MapReduce
model to efficiently use it for more types of computations, which
includes interactive queries and stream processing.
• The main feature of Spark is its in-memory cluster
computing that increases the processing speed of an application.
• Spark is designed to cover a wide range of workloads such as batch
applications, iterative algorithms, interactive queries and
streaming.
• Apart from supporting all these workloads in a single system, it
reduces the management burden of maintaining separate tools.
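
As a concrete illustration of the in-memory, multi-workload model described above, here is a minimal PySpark sketch (assuming a local Spark installation; the file path and column names are purely hypothetical) that caches a dataset in memory and runs interactive-style queries over it:

from pyspark.sql import SparkSession

# Start a local Spark session (cluster managers such as YARN or Kubernetes work the same way).
spark = SparkSession.builder.appName("OverviewDemo").master("local[*]").getOrCreate()

# Read a CSV file; "sales.csv" and its columns are hypothetical.
df = spark.read.csv("sales.csv", header=True, inferSchema=True)

# Cache the DataFrame in memory so repeated, interactive queries avoid re-reading from disk.
df.cache()

# Interactive-style queries reuse the in-memory data.
df.groupBy("region").sum("amount").show()
df.filter(df.amount > 1000).count()

spark.stop()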
Features of Spark
• Fast - It provides high performance for both batch and streaming data, using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine.
• Easy to Use - Applications can be written in Java, Scala, Python, R, and SQL. It also provides more than 80 high-level operators.
• Generality - It provides a collection of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming.
• Lightweight - It is a light, unified analytics engine used for large-scale data processing.
• Runs Everywhere - It can easily run on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud.
Uses of Spark

• Data integration: The data generated by systems are not consistent enough to combine for
analysis. To fetch consistent data from systems we can use processes like Extract, transform,
and load (ETL). Spark is used to reduce the cost and time required for this ETL process.
• Stream processing: It is always difficult to handle real-time generated data such as log files.
Spark is capable of operating on streams of data and rejecting potentially fraudulent
operations.
• Machine learning: Machine learning approaches become more feasible and increasingly
accurate due to the growth in the volume of data. As Spark can store data in
memory and run repeated queries quickly, it makes it easy to work with machine learning
algorithms.
• Interactive analytics: Spark is able to respond rapidly, so instead of running only pre-
defined queries, we can explore the data interactively.
Components of Spark
Apache Spark Core
• Spark Core is the underlying general execution engine for the Spark platform that all other functionality is built upon. It provides in-memory computing and the ability to reference datasets in external storage systems.
• Spark SQL
• Spark SQL is a component on top of Spark Core that introduces a new data abstraction called SchemaRDD (now the DataFrame/Dataset API), which provides support for structured and semi-structured data (see the sketch after this list).
• Spark Streaming
• Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming analytics. It ingests data in mini-batches and performs RDD (Resilient Distributed Datasets) transformations on those mini-batches of data.
• MLlib (Machine Learning Library)
• MLlib is a distributed machine learning framework above Spark, thanks to the distributed memory-based Spark architecture.
• According to benchmarks done by the MLlib developers against the Alternating Least Squares (ALS) implementations, Spark MLlib is nine times as fast as the Hadoop disk-based version of Apache Mahout (before Mahout gained a Spark interface).
• GraphX
• GraphX is a distributed graph-processing framework on top of Spark. It provides an API for expressing graph computation that can model user-defined graphs by using the Pregel abstraction API. It also provides an optimized runtime for this abstraction.
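
The Spark SQL sketch referenced above, in PySpark (modern releases expose the DataFrame/Dataset API, which supersedes the older SchemaRDD abstraction; table and column names below are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQLDemo").getOrCreate()

# Build a small DataFrame from in-memory data (illustrative records).
emp = spark.createDataFrame(
    [(1, "Asha", 50000), (2, "Ravi", 60000)],
    ["empid", "empname", "salary"])

# Register it as a temporary view and query it with SQL.
emp.createOrReplaceTempView("emp")
spark.sql("SELECT empname FROM emp WHERE salary > 55000").show()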
Spark Streaming
• A data stream is data arriving continuously in an unbounded sequence.
• Streaming divides the continuously flowing input data into discrete units. Moreover, we can say it is low-latency processing and analysis of streaming data.
a. Internal working of Spark Streaming
• Live input data streams are received.
• They are divided into batches by Spark Streaming; afterwards, these batches are processed by the Spark engine to generate the final stream of results in batches.
b. Discretized Stream (DStream)
• Apache Spark Discretized Stream is the key abstraction of Spark Streaming.
• It represents a stream of data divided into small batches.
• DStreams are built on Spark RDDs, Spark's core data abstraction.
• This also allows Spark Streaming to seamlessly integrate with other Apache Spark components, such as Spark MLlib and Spark SQL.
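
A minimal DStream sketch in PySpark, assuming text lines arrive on a local socket (the host, port, and batch interval are illustrative, not part of the original slides):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "DStreamWordCount")
ssc = StreamingContext(sc, 5)  # 5-second micro-batches

# Each micro-batch of lines becomes an RDD inside the DStream.
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()  # print the word counts of each micro-batch

ssc.start()
ssc.awaitTermination()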
Spark Architecture
• Spark follows the master-slave architecture. Its cluster consists of a single master and multiple slaves.
• The Spark architecture depends upon two abstractions:
• Resilient Distributed Dataset (RDD)
• Directed Acyclic Graph (DAG)
Resilient Distributed Datasets (RDD)
• The Resilient Distributed Datasets are the group of data items that can be stored in-memory on worker nodes. Here,
• Resilient: Restore the data on failure.
• Distributed: Data is distributed among different nodes.
• Dataset: Group of data.
Ways to create Spark RDDs

There are 3 ways to create Spark RDDs:

i. Parallelized collections
By invoking the parallelize method in the driver program, we can create parallelized collections.

ii. External datasets
One can create Spark RDDs by calling the textFile method. This method takes the URL of a file and reads it as a collection of lines.

iii. Existing RDDs
Moreover, we can create a new RDD in Spark by applying a transformation operation on an existing RDD.
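
The three creation paths above, sketched in PySpark (the file path is hypothetical):

from pyspark import SparkContext

sc = SparkContext("local[*]", "RDDCreation")

# i. Parallelized collection
nums = sc.parallelize([1, 2, 3, 4, 5])

# ii. External dataset: textFile reads the file as a collection of lines
lines = sc.textFile("hdfs:///data/input.txt")  # hypothetical path

# iii. New RDD from an existing RDD via a transformation
squares = nums.map(lambda x: x * x)

print(squares.collect())  # [1, 4, 9, 16, 25]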
Features of Spark RDD

i. In-memory computation
The data inside an RDD is stored in memory for as long as you want to store it. Keeping the data in memory improves performance by an order of magnitude.
ii. Lazy Evaluation
The data inside RDDs is not evaluated on the go. The changes or the computation are performed only after an action is triggered. Thus, it limits how much work Spark has to do.
iii. Fault Tolerance
Upon the failure of a worker node, we can re-compute the lost partitions of an RDD from the original data using the lineage of operations. Thus, we can easily recover the lost data.
iv. Immutability
RDDs are immutable in nature, meaning once we create an RDD we cannot modify it. If we perform any transformation, it creates a new RDD. We achieve consistency through immutability.
v. Persistence
We can keep frequently used RDDs in memory and retrieve them directly from memory without going to disk; this speeds up execution. We can perform multiple operations on the same data by storing the data explicitly in memory with the persist() or cache() function, as in the sketch below.
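
A short PySpark sketch of lazy evaluation and persistence (the numbers are illustrative):

from pyspark import SparkContext, StorageLevel

sc = SparkContext("local[*]", "PersistDemo")

rdd = sc.parallelize(range(1, 1000))
evens = rdd.filter(lambda x: x % 2 == 0)   # transformation: nothing runs yet (lazy)

evens.persist(StorageLevel.MEMORY_ONLY)    # or evens.cache(); kept in memory after the first action

print(evens.count())   # first action triggers computation and caches the partitions
print(evens.sum())     # reuses the in-memory partitions instead of recomputing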
Features of Spark RDD

vi. Partitioning
RDDs partition the records logically and distribute the data across various nodes in the cluster. The logical divisions are only for processing; internally there is no physical division. Thus, it provides parallelism.
vii. Parallel
RDDs process the data in parallel over the cluster.
viii. Location-Stickiness
RDDs are capable of defining placement preferences to compute partitions. Placement preference refers to information about the location of the RDD. The DAGScheduler places the partitions in such a way that each task is as close to the data as possible, thus speeding up computation.
ix. Coarse-grained Operations
We apply coarse-grained transformations to an RDD. Coarse-grained means the operation applies to the whole dataset, not to an individual element in the dataset of the RDD.
x. Typed
We can have RDDs of various types, like RDD[Int], RDD[Long], RDD[String].
xi. No limitation
We can have any number of RDDs; there is no limit to their number. The practical limit depends on the size of disk and memory.
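
Partitioning can be inspected and adjusted explicitly, as in this small PySpark sketch (partition counts are illustrative):

from pyspark import SparkContext

sc = SparkContext("local[4]", "PartitionDemo")

rdd = sc.parallelize(range(100), numSlices=8)   # request 8 logical partitions
print(rdd.getNumPartitions())                   # 8

# Repartitioning changes parallelism but may shuffle data across the cluster.
smaller = rdd.coalesce(2)     # reduce partitions, avoiding a full shuffle
wider = rdd.repartition(16)   # increase partitions, full shuffle
print(smaller.getNumPartitions(), wider.getNumPartitions())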
Limitations of Spark
a. No Support for Real-time Processing
• Spark provides near real-time processing of live data: micro-batch processing takes place in Spark Streaming. Hence we cannot say Spark is a completely real-time processing engine.
b. Problem with Small Files
• When reading many small files, each file becomes its own small partition, so there is a large number of tiny partitions within an RDD. Hence, if we want efficient processing, the RDDs should be repartitioned into some manageable format, which demands extensive shuffling over the network.
c. No File Management System
• A major issue is that Spark does not have its own file management system. It relies on some other platform like Hadoop or another cloud-based platform.
d. Expensive
• While we desire cost-efficient processing of big data, Spark turns out to be expensive, since keeping data in memory is costly. The memory consumption is very high, and it is not handled in a user-friendly manner. Moreover, we require lots of RAM to run in-memory, so the cost of Spark is much higher.
e. Fewer Algorithms
• Spark MLlib has a comparatively small number of available algorithms; measures such as Tanimoto distance, for example, are missing.
f. Manual Optimization
• Spark jobs must be manually optimized and tuned to specific datasets. Moreover, to get partitioning and caching in Spark correct, they must be controlled manually.
g. Iterative Processing
• Here, data iterates in batches, and each iteration is scheduled and executed separately.
h. Latency
• Compared with Flink, Apache Spark has higher latency.
i. Window Criteria
• Spark only supports time-based window criteria, not record-based window criteria.
HBase
• HBase is an open-source, sorted-map database built on Hadoop. It is column-oriented and horizontally scalable.
• It is based on Google's Bigtable. It has a set of tables which keep data in key-value format.
• HBase is well suited for sparse data sets, which are very common in big data use cases.
• HBase provides APIs enabling development in practically any programming language.
• It is a part of the Hadoop ecosystem that provides random real-time read/write access to data in the Hadoop File System.
Features of HBase
• Horizontally scalable: You can add any number of columns anytime.
• Automatic failover: Automatic failover is a resource that allows a system administrator to automatically switch data handling to a standby system in the event of system compromise.
• Integration with the MapReduce framework: All the commands and Java code internally implement MapReduce to do the task, and it is built over the Hadoop Distributed File System.
• It is a sparse, distributed, persistent, multidimensional sorted map, which is indexed by row key, column key, and timestamp.
• Often referred to as a key-value store, a column-family-oriented database, or a store of versioned maps of maps.
• Fundamentally, it is a platform for storing and retrieving data with random access.
• It doesn't care about datatypes (storing an integer in one row and a string in another for the same column).
• It doesn't enforce relationships within your data.
• It is designed to run on a cluster of computers, built using commodity hardware.
HBase vs. RDBMS
➢ HBase is schema-less; an RDBMS has a fixed schema.
➢ HBase is a column-oriented datastore; an RDBMS is a row-oriented datastore.
➢ HBase is designed to store de-normalized data; an RDBMS is designed to store normalized data.
➢ HBase has wide and sparsely populated tables; an RDBMS contains thin tables.
➢ HBase supports automatic partitioning; an RDBMS has no built-in support for partitioning.
➢ HBase is well suited for OLAP systems; an RDBMS is well suited for OLTP systems.
➢ HBase reads only the relevant data from the database; an RDBMS retrieves one row at a time and hence could read unnecessary data if only some of the data in a row is required.
➢ Structured and semi-structured data can be stored and processed using HBase; structured data can be stored and processed using an RDBMS.
➢ HBase enables aggregation over many rows and columns; in an RDBMS aggregation is an expensive operation.
HBase Data Model
HBase Architecture
• In HBase, tables are split into
regions and are served by the
region servers. Regions are
vertically divided by column
families into “Stores”. Stores
are saved as files in HDFS.
• HBase has three major
components: the client
library, a master server, and
region servers.
• Region servers can be added
or removed as per
requirement.
HBase: Master Server
• The master server -
• Assigns regions to the region servers and takes
the help of Apache ZooKeeper for this task.
• Handles load balancing of the regions across
region servers. It unloads the busy servers and
shifts the regions to less occupied servers.
• Maintains the state of the cluster by negotiating
the load balancing.
• Is responsible for schema changes and other
metadata operations such as creation of tables
and column families.
HBase: Regions
• Regions are nothing but tables that are split up and spread across
the region servers.
Region server
• The region servers have regions that -
• Communicate with the client and handle data-related operations.
• Handle read and write requests for all the regions under it.
• Decide the size of the region by following the region size
thresholds.
• The store contains the MemStore and HFiles.
• The MemStore is just like a cache memory.
• Anything that is entered into HBase is stored here initially.
• Later, the data is transferred and saved in HFiles as blocks, and the MemStore is flushed.
HBase: ZooKeeper
• Zookeeper is an open-source project that provides
services like maintaining configuration information,
naming, providing distributed synchronization, etc.
• Zookeeper has ephemeral nodes representing
different region servers. Master servers use these
nodes to discover available servers.
• In addition to availability, the nodes are also used to
track server failures or network partitions.
• Clients communicate with region servers via
zookeeper.
• In pseudo and standalone modes, HBase itself will
take care of zookeeper.
HBase Shell
➢ Creating a table using the HBase shell
create '<table name>', '<column family>'
e.g. create 'emp', 'empid', 'empname', 'salary'
➢ List tables
list
➢ Disable a table
disable 'emp'
➢ Enable a table
enable 'emp'
➢ Describe
describe 'emp'
➢ Alter
• alter 'emp', NAME => 'empid', VERSIONS => 5
• alter 'emp', 'delete' => 'salary'
HBase Shell
• Existence of a table
• exists 'emp'
• Dropping a table
• disable 'emp'
• drop 'emp'
• drop_all
• drop_all 't.*'   // drops the tables matching the regex
• disable_all
• disable_all 'raj.*'
• Exit the shell
• exit
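
Beyond the shell, HBase clients exist for many languages. As a hedged illustration (not part of the original slides), the sketch below uses the third-party Python library happybase, which talks to HBase through the Thrift gateway; the Thrift server must be running, and the table and column family names follow the 'emp' example above.

import happybase

# Connect to the HBase Thrift server (default port 9090).
connection = happybase.Connection("localhost")
table = connection.table("emp")

# Write a row: cells are addressed as column_family:qualifier.
table.put(b"row1", {b"empname:first": b"Asha", b"salary:amount": b"50000"})

# Random read of a single row by row key.
print(table.row(b"row1"))

# Scan a range of rows.
for key, data in table.scan(row_start=b"row0", row_stop=b"row9"):
    print(key, data)

connection.close()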
Cassandra
• Cassandra is a distributed database from Apache that
is highly scalable and designed to manage very large
amounts of structured data.
• It provides high availability with no single point of
failure.
• It is a type of NoSQL database.
• It is scalable, fault-tolerant, and consistent.
• It is a column-oriented database.
• Its distribution design is based on Amazon’s Dynamo and its data model on
Google’s Bigtable.
• Created at Facebook, it differs sharply from relational database management
systems.
• Cassandra implements a Dynamo-style replication model with no single point of
failure, but adds a more powerful “column family” data model.
• Cassandra is being used by some of the biggest companies such as Facebook,
Twitter, Cisco, Rackspace, eBay, Netflix, and more.
Features of Cassandra

• Elastic scalability − Cassandra is highly scalable; it allows you to add more hardware to accommodate more customers and more
data as per requirement.
• Always on architecture − Cassandra has no single point of failure and it is continuously available for business-critical
applications that cannot afford a failure.
• Fast linear-scale performance − Cassandra is linearly scalable, i.e., it increases your throughput as you increase the number
of nodes in the cluster. Therefore it maintains a quick response time.
• Flexible data storage − Cassandra accommodates all possible data formats including: structured, semi-structured, and
unstructured. It can dynamically accommodate changes to your data structures according to your need.
• Easy data distribution − Cassandra provides the flexibility to distribute data where you need by replicating data across
multiple data centers.
• Transaction support − Cassandra supports properties like Atomicity, Consistency, Isolation, and Durability (ACID).
• Fast writes − Cassandra was designed to run on cheap commodity hardware. It performs blazingly fast writes and can store
hundreds of terabytes of data, without sacrificing the read efficiency.
Cassandra: Distributed Environment
• All the nodes in a cluster play the same role. Each node is independent and at the same time interconnected to other nodes.
• Each node in a cluster can accept read and write requests, regardless of where the data is actually located in the cluster.
• When a node goes down, read/write requests can be served from other nodes in the network.
Data Replication in
Cassandra
• In Cassandra, one or more of the nodes in a
cluster act as replicas for a given piece of data.
• If it is detected that some of the nodes
responded with an out-of-date value,
Cassandra will return the most recent value to
the client.
• After returning the most recent value,
Cassandra performs a read repair in the
background to update the stale values.
• Cassandra uses the Gossip Protocol in the
background to allow the nodes to communicate
with each other and detect any faulty nodes in
the cluster.
Components of Cassandra
• Node − It is the place where data is stored.
• Data center − It is a collection of related nodes.
• Cluster − A cluster is a component that contains one or more data centers.
• Commit log − The commit log is a crash-recovery mechanism in Cassandra. Every write operation is written to the commit log.
• Mem-table − A mem-table is a memory-resident data structure. After the commit log, the data is written to the mem-table. Sometimes, for a single column family, there will be multiple mem-tables.
• SSTable − It is a disk file to which the data is flushed from the mem-table when its contents reach a threshold value.
• Bloom filter − These are quick, nondeterministic algorithms for testing whether an element is a member of a set. It is a special kind of cache. Bloom filters are accessed after every query.
Cassandra Query Language
• Users can access Cassandra through its nodes using Cassandra Query Language (CQL).
• CQL treats the database (Keyspace) as a container of tables.
• Programmers use cqlsh: a prompt to work with CQL or separate application language drivers.
• Clients approach any of the nodes for their read-write operations. That node (the coordinator) acts as a proxy between
the client and the nodes holding the data.
Write Operations
• Every write activity of nodes is captured by the commit logs written in the nodes.
• Later the data will be captured and stored in the mem-table.
• Whenever the mem-table is full, data will be written into the SStable data file.
• All writes are automatically partitioned and replicated throughout the cluster.
• Cassandra periodically consolidates the SSTables, discarding unnecessary data.
Read Operations
• During read operations, Cassandra gets values from the mem-table and checks the bloom filter to find the
appropriate SSTable that holds the required data.
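
A hedged sketch of the write/read path from a client's point of view, using the DataStax Python driver (cassandra-driver); the keyspace, table, and host below are illustrative, not from the original slides:

from cassandra.cluster import Cluster

# Any node can act as the coordinator for a request.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")
session.set_keyspace("demo")
session.execute("CREATE TABLE IF NOT EXISTS emp (empid int PRIMARY KEY, empname text, salary int)")

# Write: recorded in the commit log and mem-table on each replica.
session.execute("INSERT INTO emp (empid, empname, salary) VALUES (%s, %s, %s)", (1, "Asha", 50000))

# Read: served from mem-tables/SSTables, with bloom filters narrowing the SSTable lookups.
for row in session.execute("SELECT empid, empname, salary FROM emp"):
    print(row.empid, row.empname, row.salary)

cluster.shutdown()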
Cluster

• The Cassandra database is distributed over several machines that operate together.
• The outermost container is known as the Cluster.
• For failure handling, every node contains a replica,
and in case of a failure, the replica takes charge.
• Cassandra arranges the nodes in a cluster, in a ring
format, and assigns data to them.
Keyspace
• Keyspace is the outermost container for data in Cassandra. The basic attributes of a Keyspace in Cassandra are −
• Replication factor − It is the number of machines in the cluster that will receive copies of the same data.
• Replica placement strategy − It is the strategy used to place Keyspace replicas in the ring. There are strategies such as simple strategy (rack-unaware), old network topology strategy (rack-aware), and network topology strategy (datacenter-aware).
• Column families − Keyspace is a container for a list of one or more column families. A column family, in turn, is a container of a collection of rows. Each row contains ordered columns. Column families represent the structure of your data. Each keyspace has at least one and often many column families.
• The syntax for creating a Keyspace is as follows −
CREATE KEYSPACE <keyspace_name>
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
Column Family
• A column family is a container for an ordered collection of rows. Each row, in turn, is an ordered collection of columns.

Relational Table vs. Cassandra Column Family
➢ A schema in a relational model is fixed: once we define certain columns for a table, every row inserted must fill all the columns, at least with a null value. In Cassandra, although the column families are defined, the columns are not; you can freely add any column to any column family at any time.
➢ Relational tables define only columns and the user fills in the table with values. In Cassandra, a table contains columns, or can be defined as a super column family.
Cassandra Column Family
• A Cassandra column family has the following attributes −
• keys_cached − It represents the number of locations to keep cached per SSTable.
• rows_cached − It represents the number of rows whose entire contents will be cached in memory.
• preload_row_cache − It specifies whether you want to pre-populate the row cache.
• Note − Unlike relational tables, a column family's schema is not fixed; Cassandra does not force individual rows to have all the columns.
• The following figure shows an example of a Cassandra column family.
Cassandra column family

Column
A column is the basic data structure of Cassandra, with three values: key or column name, value, and a timestamp. The structure of a column is sketched below.

Super Column
A super column is a special column; therefore, it is also a key-value pair, but a super column stores a map of sub-columns.
Generally, column families are stored on disk in individual files. Therefore, to optimize performance, it is important to keep columns that you are likely to query together in the same column family, and a super column can be helpful here. The structure of a super column is also sketched below.
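
Since the original figures for the column and super column structures are not reproduced here, a rough sketch of their shape follows, using plain Python dictionaries purely as illustration (the field values are hypothetical):

# A column: name, value, and timestamp.
column = {"name": "salary", "value": "50000", "timestamp": 1700000000}

# A super column: a key that maps to a collection of sub-columns.
super_column = {
    "name": "address",
    "sub_columns": [
        {"name": "city", "value": "Delhi", "timestamp": 1700000000},
        {"name": "pin", "value": "110001", "timestamp": 1700000000},
    ],
}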
Help Material
• 1. https://sparkbyexamples.com/
