Unit 4
Hadoop Related Tools
Faculty: Dr. Vandana Bhatia
Contents
❑ HBase
➢ Data model and implementations
➢ HBase clients
➢ HBase examples – praxis
❑ Cassandra
➢ Cassandra data model
❑ Pig
➢ Grunt
➢ Pig data model
➢ Pig Latin
➢ Developing and testing Pig Latin scripts
❑ Hive
➢ Data types and file formats
➢ HiveQL data definition
➢ HiveQL data manipulation – HiveQL queries
❑ Overview of Spark
Overview of Spark
• Apache Spark is a lightning-fast cluster computing technology, designed for fast computation.
• It is based on Hadoop MapReduce and extends the MapReduce model to use it efficiently for more types of computations, including interactive queries and stream processing.
• The main feature of Spark is its in-memory cluster computing, which increases the processing speed of an application.
• Spark is designed to cover a wide range of workloads such as batch applications, iterative algorithms, interactive queries, and streaming.
• Apart from supporting all these workloads in a single system, it reduces the management burden of maintaining separate tools.
Features of Spark
• Fast - It provides high performance for both batch and streaming data, using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine.
• Easy to Use - Applications can be written in Java, Scala, Python, R, and SQL. It also provides more than 80 high-level operators, a few of which appear in the sketch after this list.
• Generality - It provides a collection of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming.
• Lightweight - It is a light, unified analytics engine used for large-scale data processing.
• Runs Everywhere - It can easily run on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud.
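A minimal sketch (not from the original slides) of a few of these high-level operators in PySpark, assuming a local PySpark installation; the input sentences are invented for illustration.

# Word count with a handful of high-level operators (flatMap, map, reduceByKey).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("FeaturesDemo").master("local[*]").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["spark is fast", "spark is easy to use"])
counts = (lines.flatMap(lambda line: line.split(" "))   # split lines into words
               .map(lambda word: (word, 1))             # pair each word with 1
               .reduceByKey(lambda a, b: a + b))        # sum the counts per word
print(counts.collect())                                 # e.g. [('spark', 2), ('is', 2), ...]

spark.stop()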
Uses of Spark
• Data integration: The data generated by different systems is often not consistent enough to combine for analysis. To obtain consistent data, processes such as Extract, Transform, and Load (ETL) are used. Spark is used to reduce the cost and time required for this ETL process (see the sketch after this list).
• Stream processing: It is always difficult to handle real-time generated data such as log files. Spark can operate on streams of data and reject potentially fraudulent operations.
• Machine learning: Machine learning approaches become more feasible and increasingly accurate as the volume of data grows. Because Spark can hold data in memory and run repeated queries quickly, it makes it easy to work with machine learning algorithms.
• Interactive analytics: Spark is able to respond rapidly, so instead of running only pre-defined queries, we can explore the data interactively.
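A hedged PySpark sketch of such an ETL flow; the HDFS paths and column names (order_id, amount) are assumptions for illustration only.

# Simple Extract-Transform-Load flow with Spark DataFrames.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("SimpleETL").getOrCreate()

# Extract: read raw CSV data (the path is a made-up placeholder).
raw = spark.read.option("header", True).csv("hdfs:///data/raw/orders.csv")

# Transform: drop rows without an order id and normalise the amount column.
clean = (raw.dropna(subset=["order_id"])
            .withColumn("amount", F.col("amount").cast("double")))

# Load: write the cleaned data back out in a columnar format.
clean.write.mode("overwrite").parquet("hdfs:///data/clean/orders")

spark.stop()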
Components of Spark
• Spark Core
• Spark Core is the underlying general execution engine for the Spark platform upon which all other functionality is built. It provides in-memory computing and the ability to reference datasets in external storage systems.
• Spark SQL
• Spark SQL is a component on top of Spark Core that introduces a data abstraction called SchemaRDD, which provides support for structured and semi-structured data.
• Spark Streaming
• Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming analytics. It ingests data in mini-batches and performs RDD (Resilient Distributed Dataset) transformations on those mini-batches of data.
• MLlib (Machine Learning Library)
• MLlib is a distributed machine learning framework above Spark, built on the distributed memory-based Spark architecture.
• According to benchmarks done by the MLlib developers against the Alternating Least Squares (ALS) implementations, Spark MLlib is nine times as fast as the Hadoop disk-based version of Apache Mahout (before Mahout gained a Spark interface).
• GraphX
• GraphX is a distributed graph-processing framework on top of Spark. It provides an API for expressing graph computations that can model user-defined graphs using the Pregel abstraction API. It also provides an optimized runtime for this abstraction.
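A small illustrative Spark SQL sketch in PySpark; the employee rows are invented, and the SchemaRDD abstraction mentioned above is exposed today as the DataFrame API.

# Build a DataFrame, register it as a view, and query it with SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQLDemo").getOrCreate()

df = spark.createDataFrame(
    [(1, "Asha", 52000.0), (2, "Ravi", 61000.0)],
    ["empid", "empname", "salary"],
)

df.createOrReplaceTempView("emp")
spark.sql("SELECT empname, salary FROM emp WHERE salary > 55000").show()

spark.stop()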
Spark Streaming
• Data arriving continuously in an unbounded sequence is what we call a data stream.
• Streaming divides the continuously flowing input data into discrete units. In other words, it is low-latency processing and analysis of streaming data.
a. Internal working of Spark Streaming
• Live input data streams are received.
• They are divided into batches by Spark Streaming; afterwards, these batches are processed by the Spark engine to generate the final stream of results, also in batches.
b. Discretized Stream (DStream)
• The Discretized Stream is the key abstraction of Spark Streaming.
• It represents a stream of data divided into small batches.
• DStreams are built on Spark RDDs, Spark's core data abstraction.
• This also allows Spark Streaming to integrate seamlessly with other Apache Spark components such as Spark MLlib and Spark SQL. A minimal DStream example follows below.
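A minimal DStream word-count sketch using the classic Spark Streaming API; it assumes a text source on localhost port 9999 (for example, started with nc -lk 9999) and a 1-second batch interval chosen arbitrarily.

# Count words arriving over a socket, one micro-batch at a time.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "DStreamDemo")
ssc = StreamingContext(sc, 1)          # 1-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)
words = lines.flatMap(lambda line: line.split(" "))
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
counts.pprint()                        # print each batch's word counts

ssc.start()
ssc.awaitTermination()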
Spark Architecture
• Spark follows a master-slave architecture. Its cluster consists of a single master and multiple slaves.
• The Spark architecture depends upon two abstractions:
• Resilient Distributed Dataset (RDD)
• Directed Acyclic Graph (DAG)
Resilient Distributed Datasets (RDD)
• Resilient Distributed Datasets are groups of data items that can be stored in memory on worker nodes. Here,
• Resilient: restores the data on failure.
• Distributed: data is distributed among different nodes.
• Dataset: a group of data.
Ways to create a Spark RDD
i. Parallelized collections
By invoking the parallelize method in the driver program, we can create parallelized collections, as in the sketch below.
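A short sketch, assuming an existing SparkContext named sc (e.g. spark.sparkContext); the sample list and partition count are arbitrary.

# Create an RDD from an in-memory collection distributed over 3 partitions.
data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data, numSlices=3)
print(rdd.getNumPartitions())             # 3
print(rdd.reduce(lambda a, b: a + b))     # 15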
Features of Spark RDD
i. In-memory computation
The data inside an RDD is stored in memory for as long as you want to keep it there. Keeping the data in memory improves performance by an order of magnitude.
ii. Lazy Evaluation
The data inside RDDs is not evaluated on the go; the computation is performed only after an action is triggered. Thus, Spark limits how much work it has to do.
iii. Fault Tolerance
Upon the failure of a worker node, the lost partitions of an RDD can be re-computed from the original data using the lineage of operations. Thus, we can easily recover lost data.
iv. Immutability
RDDs are immutable in nature, meaning once we create an RDD we cannot modify it. Any transformation creates a new RDD. Consistency is achieved through immutability.
v. Persistence
We can keep frequently used RDDs in memory and retrieve them directly from memory without going to disk, which speeds up execution. Multiple operations can be performed on the same data by storing it explicitly in memory with the persist() or cache() function (see the sketch after this list).
Features of Spark RDD
vi. Partitioning
RDDs partition the records logically and distribute the data across various nodes in the cluster. The logical divisions are only for processing; internally, the data has no division. Thus, it provides parallelism.
vii. Parallel
RDDs process the data in parallel over the cluster.
viii. Location-Stickiness
RDDs are capable of defining placement preferences for computing partitions. A placement preference refers to information about the location of the RDD data. The DAGScheduler places the partitions in such a way that each task is as close to its data as possible, which speeds up computation.
ix. Coarse-grained Operations
We apply coarse-grained transformations to an RDD. Coarse-grained means the operation applies to the whole dataset, not to an individual element of the RDD.
x. Typed
We can have RDDs of various types, such as RDD[Int], RDD[Long], RDD[String].
xi. No limitation
We can have any number of RDDs; there is no limit. The practical limit depends on the size of disk and memory.
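A brief sketch of lazy evaluation and persistence together, again assuming an existing SparkContext sc; the numbers are arbitrary.

# Transformations are lazy; cache() keeps the materialised RDD in memory.
nums = sc.parallelize(range(1, 1_000_001))

evens = nums.filter(lambda n: n % 2 == 0)   # lazy: nothing runs yet
evens.cache()                               # mark for in-memory reuse

print(evens.count())    # first action: triggers computation and caching
print(evens.sum())      # second action: served from the cached partitions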
Limitations of Spark
a. No Support for Real-time Processing
• Spark processes live data in near real-time: micro-batch processing takes place in Spark Streaming. Hence, we cannot say that Spark is a completely real-time processing engine.
b. Problem with Small Files
• In an RDD, each small file becomes its own partition, which means there can be a large number of tiny partitions within an RDD. Hence, if we want efficient processing, the RDDs must be repartitioned into some manageable format, which demands extensive shuffling over the network.
c. No File Management System
• Spark does not have its own file management system; it relies on another platform such as Hadoop or a cloud-based platform.
d. Expensive
• When we want cost-efficient processing of big data, Spark can turn out to be very expensive, since keeping data in memory is costly. Memory consumption is very high and is not handled in a user-friendly manner. Because a lot of RAM is required to run in memory, the cost of Spark is much higher.
e. Fewer Algorithms
• Spark MLlib offers a relatively small number of algorithms; for example, Tanimoto distance is not available.
f. Manual Optimization
• Spark jobs must be manually optimized and tuned for specific datasets. Partitioning and caching in Spark must also be controlled manually to be correct.
g. Iterative Processing
• Data iterates in batches, and each iteration is scheduled and executed separately.
h. Latency
• Compared with Flink, Apache Spark has higher latency.
i. Window Criteria
• Spark only supports time-based window criteria, not record-based window criteria.
HBase
• HBase is an open-source, sorted-map datastore built on Hadoop. It is column-oriented and horizontally scalable.
• It is based on Google's Bigtable. It has a set of tables which keep data in key-value format.
• HBase is well suited for sparse data sets, which are very common in big data use cases.
• HBase provides APIs enabling development in practically any programming language.
• It is a part of the Hadoop ecosystem that provides random real-time read/write access to data in the Hadoop File System.
Features of HBase
• Horizontally scalable: You can add any number of columns at any time.
• Automatic failover: Automatic failover allows a system administrator to automatically switch data handling to a standby system in the event of a system compromise.
• Integration with the MapReduce framework: All the commands and Java code internally use MapReduce to do the task, and HBase is built over the Hadoop Distributed File System.
• It is a sparse, distributed, persistent, multidimensional sorted map, which is indexed by row key, column key, and timestamp.
• It is often referred to as a key-value store, a column family-oriented database, or as storing versioned maps of maps.
• Fundamentally, it is a platform for storing and retrieving data with random access.
• It does not care about datatypes (an integer can be stored in one row and a string in another for the same column).
• It does not enforce relationships within your data.
• It is designed to run on a cluster of computers built using commodity hardware.
HBase vs. RDBMS
➢ HBase is schema-less; an RDBMS has a fixed schema.
➢ HBase is a column-oriented datastore; an RDBMS is a row-oriented datastore.
➢ HBase is designed to store de-normalized data; an RDBMS is designed to store normalized data.
➢ HBase has wide and sparsely populated tables; an RDBMS contains thin tables.
➢ HBase supports automatic partitioning; an RDBMS has no built-in support for partitioning.
➢ HBase is well suited for OLAP systems; an RDBMS is well suited for OLTP systems.
➢ HBase reads only the relevant data from the database; an RDBMS retrieves one row at a time and hence could read unnecessary data if only some of the data in a row is required.
➢ Structured and semi-structured data can be stored and processed using HBase; only structured data can be stored and processed using an RDBMS.
➢ HBase enables aggregation over many rows and columns; in an RDBMS, aggregation is an expensive operation.
HBase Data Model
HBase Architecture
• In HBase, tables are split into
regions and are served by the
region servers. Regions are
vertically divided by column
families into “Stores”. Stores
are saved as files in HDFS.
• HBase has three major
components: the client
library, a master server, and
region servers.
• Region servers can be added
or removed as per
requirement.
HBase: Master Server
• The master server -
• Assigns regions to the region servers and takes
the help of Apache ZooKeeper for this task.
• Handles load balancing of the regions across
region servers. It unloads the busy servers and
shifts the regions to less occupied servers.
• Maintains the state of the cluster by negotiating
the load balancing.
• Is responsible for schema changes and other
metadata operations such as creation of tables
and column families.
HBase: Regions
• Regions are nothing but tables that are split up and spread across the region servers.
Region server
• The region servers have regions that:
• Communicate with the client and handle data-related operations.
• Handle read and write requests for all the regions under them.
• Decide the size of the regions by following the region size thresholds.
• A store contains the memory store (MemStore) and HFiles.
• The MemStore is just like a cache memory.
• Anything that is entered into HBase is stored here initially.
• Later, the data is transferred and saved in HFiles as blocks, and the MemStore is flushed.
HBase: ZooKeeper
• Zookeeper is an open-source project that provides
services like maintaining configuration information,
naming, providing distributed synchronization, etc.
• Zookeeper has ephemeral nodes representing
different region servers. Master servers use these
nodes to discover available servers.
• In addition to availability, the nodes are also used to
track server failures or network partitions.
• Clients communicate with region servers via
zookeeper.
• In pseudo and standalone modes, HBase itself will
take care of zookeeper.
HBase Shell
➢ Creating a table using the HBase shell
create '<table name>', '<column family>'
E.g. create 'emp', 'empid', 'empname', 'salary'
➢ List tables
list
➢ Disable a table
disable 'emp'
➢ Enable a table
enable 'emp'
➢ Describe a table
describe 'emp'
➢ Alter a table
• alter 'emp', NAME => 'empid', VERSIONS => 1
• alter 'emp', 'delete' => 'salary'
HBase Shell
• Check the existence of a table
• exists 'emp'
• Drop a table
• disable 'emp'
• drop 'emp'
• drop_all
• drop_all 't.*'   # drops all tables matching the regex
• disable_all
• disable_all 'raj.*'
• Exit the shell
• exit
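The same 'emp' table can also be reached from Python through the third-party happybase client. This is a hedged sketch that assumes happybase is installed (pip install happybase) and HBase's Thrift server is running on localhost; the row key and values are invented.

# Put, get, and scan rows in the 'emp' table via happybase.
import happybase

connection = happybase.Connection("localhost")   # Thrift server host
table = connection.table("emp")

# Column qualifiers are 'family:qualifier' byte strings; the families
# here mirror the ones created in the shell example above.
table.put(b"row1", {b"empid:id": b"1", b"empname:name": b"Asha"})

print(table.row(b"row1"))          # fetch one row as a dict
for key, data in table.scan():     # full table scan
    print(key, data)

connection.close()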
Cassandra
• Cassandra is a distributed database from Apache that
is highly scalable and designed to manage very large
amounts of structured data.
• It provides high availability with no single point of
failure.
• It is a type of NoSQL database.
Features of Cassandra
• Elastic scalability − Cassandra is highly scalable; it allows you to add more hardware to accommodate more customers and more data as per requirement.
• Always-on architecture − Cassandra has no single point of failure and is continuously available for business-critical applications that cannot afford a failure.
• Fast linear-scale performance − Cassandra is linearly scalable, i.e., it increases your throughput as you increase the number of nodes in the cluster. Therefore, it maintains a quick response time.
• Flexible data storage − Cassandra accommodates all possible data formats, including structured, semi-structured, and unstructured. It can dynamically accommodate changes to your data structures according to your need.
• Easy data distribution − Cassandra provides the flexibility to distribute data where you need it by replicating data across multiple data centers.
• Transaction support − Cassandra supports properties like Atomicity, Consistency, Isolation, and Durability (ACID).
• Fast writes − Cassandra was designed to run on cheap commodity hardware. It performs blazingly fast writes and can store hundreds of terabytes of data without sacrificing read efficiency.
Cassandra Data Model
Cluster
All the nodes in a cluster play the same role. Each node is independent and, at the same time, interconnected to the other nodes.
Column
A column is the basic data structure of Cassandra with three values, namely the key or column name, the value, and a timestamp.
Super Column
A super column is a special column; therefore, it is also a key-value pair. But a super column stores a map of sub-columns.
Generally, column families are stored on disk in individual files. Therefore, to optimize performance, it is important to keep columns that you are likely to query together in the same column family, and a super column can be helpful here.
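A hedged Python sketch using the DataStax cassandra-driver package (pip install cassandra-driver); the keyspace, table, and rows are invented for illustration, and since super columns are a legacy Thrift-era concept, the sketch models the data as ordinary CQL columns.

# Create a keyspace and table, insert one row, and read it back.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.users (
        user_id text PRIMARY KEY,
        name    text,
        email   text
    )
""")
session.execute(
    "INSERT INTO demo.users (user_id, name, email) VALUES (%s, %s, %s)",
    ("u1", "Asha", "asha@example.com"),
)
print(session.execute("SELECT * FROM demo.users").one())

cluster.shutdown()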
Help Material
• 1. https://sparkbyexamples.com/