
Big Data Analytics

Unit 4
Hadoop Related Tools
Faculty: Dr. Vandana Bhatia
Contents
❑ HBase
➢ Data model and implementations
➢ HBase clients
➢ HBase examples – praxis
❑ Cassandra
➢ Cassandra data model
➢ Cassandra examples
➢ Cassandra clients
➢ Hadoop integration
❑ Pig
➢ Grunt
➢ Pig data model
➢ Pig Latin
➢ Developing and testing Pig Latin scripts
❑ Hive
➢ Data types and file formats
➢ HiveQL data definition
➢ HiveQL data manipulation – HiveQL queries
❑ Overview of Spark
Overview of Spark
• Apache Spark is a lightning-fast cluster computing technology,
designed for fast computation.
• It is based on Hadoop MapReduce and it extends the MapReduce
model to efficiently use it for more types of computations, which
includes interactive queries and stream processing.
• The main feature of Spark is its in-memory cluster
computing that increases the processing speed of an application.
• Spark is designed to cover a wide range of workloads such as batch
applications, iterative algorithms, interactive queries and
streaming.
• Apart from supporting all these workloads in a single system, it
reduces the management burden of maintaining separate tools.
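
As a concrete illustration of the in-memory, multi-workload model described above, here is a minimal PySpark sketch (assuming a local Spark installation; the file path and column names are purely hypothetical) that caches a dataset in memory and runs interactive-style queries over it:

from pyspark.sql import SparkSession

# Start a local Spark session (cluster managers such as YARN or Kubernetes work the same way).
spark = SparkSession.builder.appName("OverviewDemo").master("local[*]").getOrCreate()

# Read a CSV file; "sales.csv" and its columns are hypothetical.
df = spark.read.csv("sales.csv", header=True, inferSchema=True)

# Cache the DataFrame in memory so repeated, interactive queries avoid re-reading from disk.
df.cache()

# Interactive-style queries reuse the in-memory data.
df.groupBy("region").sum("amount").show()
df.filter(df.amount > 1000).count()

spark.stop()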
Features of Spark
• Fast - It provides high performance for both batch and streaming data, using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine.
• Easy to Use - Applications can be written in Java, Scala, Python, R, and SQL. It also provides more than 80 high-level operators.
• Generality - It provides a collection of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming.
• Lightweight - It is a light, unified analytics engine used for large-scale data processing.
• Runs Everywhere - It can easily run on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud.
Uses of Spark

• Data integration: The data generated by systems are not consistent enough to combine for
analysis. To fetch consistent data from systems we can use processes like Extract, transform,
and load (ETL). Spark is used to reduce the cost and time required for this ETL process.
• Stream processing: It is always difficult to handle real-time generated data such as log files.
Spark is capable of operating on streams of data and rejecting potentially fraudulent
operations.
• Machine learning: Machine learning approaches become more feasible and increasingly
accurate due to the growth in the volume of data. As Spark can store data in
memory and run repeated queries quickly, it makes it easy to work with machine learning
algorithms.
• Interactive analytics: Spark is able to respond rapidly, so instead of running only pre-
defined queries, we can explore the data interactively.
Components of Spark
Apache Spark Core
• Spark Core is the underlying general execution engine for the Spark platform that all other functionality is built upon. It provides in-memory computing and the ability to reference datasets in external storage systems.
• Spark SQL
• Spark SQL is a component on top of Spark Core that introduces a new data abstraction called SchemaRDD (now the DataFrame/Dataset API), which provides support for structured and semi-structured data (see the sketch after this list).
• Spark Streaming
• Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming analytics. It ingests data in mini-batches and performs RDD (Resilient Distributed Datasets) transformations on those mini-batches of data.
• MLlib (Machine Learning Library)
• MLlib is a distributed machine learning framework above Spark, thanks to the distributed memory-based Spark architecture.
• According to benchmarks done by the MLlib developers against the Alternating Least Squares (ALS) implementations, Spark MLlib is nine times as fast as the Hadoop disk-based version of Apache Mahout (before Mahout gained a Spark interface).
• GraphX
• GraphX is a distributed graph-processing framework on top of Spark. It provides an API for expressing graph computation that can model user-defined graphs by using the Pregel abstraction API. It also provides an optimized runtime for this abstraction.
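
The Spark SQL sketch referenced above, in PySpark (modern releases expose the DataFrame/Dataset API, which supersedes the older SchemaRDD abstraction; table and column names below are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQLDemo").getOrCreate()

# Build a small DataFrame from in-memory data (illustrative records).
emp = spark.createDataFrame(
    [(1, "Asha", 50000), (2, "Ravi", 60000)],
    ["empid", "empname", "salary"])

# Register it as a temporary view and query it with SQL.
emp.createOrReplaceTempView("emp")
spark.sql("SELECT empname FROM emp WHERE salary > 55000").show()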
Spark Streaming
• A data stream is data arriving continuously in an unbounded sequence.
• Streaming divides the continuously flowing input data into discrete units. Moreover, we can say it is low-latency processing and analysis of streaming data.
a. Internal working of Spark Streaming
• Live input data streams are received.
• They are divided into batches by Spark Streaming; afterwards, these batches are processed by the Spark engine to generate the final stream of results in batches.
b. Discretized Stream (DStream)
• Apache Spark Discretized Stream is the key abstraction of Spark Streaming.
• It represents a stream of data divided into small batches.
• DStreams are built on Spark RDDs, Spark's core data abstraction.
• This also allows Spark Streaming to seamlessly integrate with other Apache Spark components, such as Spark MLlib and Spark SQL.
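
A minimal DStream sketch in PySpark, assuming text lines arrive on a local socket (the host, port, and batch interval are illustrative, not part of the original slides):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "DStreamWordCount")
ssc = StreamingContext(sc, 5)  # 5-second micro-batches

# Each micro-batch of lines becomes an RDD inside the DStream.
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()  # print the word counts of each micro-batch

ssc.start()
ssc.awaitTermination()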
Spark Architecture
• Spark follows the master-slave architecture. Its cluster consists of a single master and multiple slaves.
• The Spark architecture depends upon two abstractions:
• Resilient Distributed Dataset (RDD)
• Directed Acyclic Graph (DAG)
Resilient Distributed Datasets (RDD)
• The Resilient Distributed Datasets are the group of data items that can be stored in-memory on worker nodes. Here,
• Resilient: Restore the data on failure.
• Distributed: Data is distributed among different nodes.
• Dataset: Group of data.
Ways to create Spark RDDs

There are 3 ways to create Spark RDDs:

i. Parallelized collections
By invoking the parallelize method in the driver program, we can create parallelized collections.

ii. External datasets
One can create Spark RDDs by calling the textFile method. This method takes the URL of a file and reads it as a collection of lines.

iii. Existing RDDs
Moreover, we can create a new RDD in Spark by applying a transformation operation on an existing RDD.
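
The three creation paths above, sketched in PySpark (the file path is hypothetical):

from pyspark import SparkContext

sc = SparkContext("local[*]", "RDDCreation")

# i. Parallelized collection
nums = sc.parallelize([1, 2, 3, 4, 5])

# ii. External dataset: textFile reads the file as a collection of lines
lines = sc.textFile("hdfs:///data/input.txt")  # hypothetical path

# iii. New RDD from an existing RDD via a transformation
squares = nums.map(lambda x: x * x)

print(squares.collect())  # [1, 4, 9, 16, 25]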
Features of Spark RDD

i. In-memory computation
The data inside an RDD is stored in memory for as long as you want to store it. Keeping the data in memory improves performance by an order of magnitude.
ii. Lazy Evaluation
The data inside RDDs is not evaluated on the go. The changes or the computation are performed only after an action is triggered. Thus, it limits how much work Spark has to do.
iii. Fault Tolerance
Upon the failure of a worker node, we can re-compute the lost partitions of an RDD from the original data using the lineage of operations. Thus, we can easily recover the lost data.
iv. Immutability
RDDs are immutable in nature, meaning once we create an RDD we cannot modify it. If we perform any transformation, it creates a new RDD. We achieve consistency through immutability.
v. Persistence
We can keep frequently used RDDs in memory and retrieve them directly from memory without going to disk; this speeds up execution. We can perform multiple operations on the same data by storing the data explicitly in memory with the persist() or cache() function, as in the sketch below.
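
A short PySpark sketch of lazy evaluation and persistence (the numbers are illustrative):

from pyspark import SparkContext, StorageLevel

sc = SparkContext("local[*]", "PersistDemo")

rdd = sc.parallelize(range(1, 1000))
evens = rdd.filter(lambda x: x % 2 == 0)   # transformation: nothing runs yet (lazy)

evens.persist(StorageLevel.MEMORY_ONLY)    # or evens.cache(); kept in memory after the first action

print(evens.count())   # first action triggers computation and caches the partitions
print(evens.sum())     # reuses the in-memory partitions instead of recomputing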
Features of Spark RDD

vi. Partitioning
RDDs partition the records logically and distribute the data across various nodes in the cluster. The logical divisions are only for processing; internally there is no physical division. Thus, it provides parallelism.
vii. Parallel
RDDs process the data in parallel over the cluster.
viii. Location-Stickiness
RDDs are capable of defining placement preferences to compute partitions. Placement preference refers to information about the location of the RDD. The DAGScheduler places the partitions in such a way that each task is as close to the data as possible, thus speeding up computation.
ix. Coarse-grained Operations
We apply coarse-grained transformations to an RDD. Coarse-grained means the operation applies to the whole dataset, not to an individual element in the dataset of the RDD.
x. Typed
We can have RDDs of various types, like RDD[Int], RDD[Long], RDD[String].
xi. No limitation
We can have any number of RDDs; there is no limit to their number. The practical limit depends on the size of disk and memory.
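
Partitioning can be inspected and adjusted explicitly, as in this small PySpark sketch (partition counts are illustrative):

from pyspark import SparkContext

sc = SparkContext("local[4]", "PartitionDemo")

rdd = sc.parallelize(range(100), numSlices=8)   # request 8 logical partitions
print(rdd.getNumPartitions())                   # 8

# Repartitioning changes parallelism but may shuffle data across the cluster.
smaller = rdd.coalesce(2)     # reduce partitions, avoiding a full shuffle
wider = rdd.repartition(16)   # increase partitions, full shuffle
print(smaller.getNumPartitions(), wider.getNumPartitions())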
Limitations of Spark
a. No Support for Real-time Processing
• Spark provides near real-time processing of live data: micro-batch processing takes place in Spark Streaming. Hence we cannot say Spark is a completely real-time processing engine.
b. Problem with Small Files
• When reading many small files, each file becomes its own small partition, so there is a large number of tiny partitions within an RDD. Hence, if we want efficient processing, the RDDs should be repartitioned into some manageable format, which demands extensive shuffling over the network.
c. No File Management System
• A major issue is that Spark does not have its own file management system. It relies on some other platform like Hadoop or another cloud-based platform.
d. Expensive
• While we desire cost-efficient processing of big data, Spark turns out to be expensive, since keeping data in memory is costly. The memory consumption is very high, and it is not handled in a user-friendly manner. Moreover, we require lots of RAM to run in-memory, so the cost of Spark is much higher.
e. Fewer Algorithms
• Spark MLlib has a comparatively small number of available algorithms; measures such as Tanimoto distance, for example, are missing.
f. Manual Optimization
• Spark jobs must be manually optimized and tuned to specific datasets. Moreover, to get partitioning and caching in Spark correct, they must be controlled manually.
g. Iterative Processing
• Here, data iterates in batches, and each iteration is scheduled and executed separately.
h. Latency
• Compared with Flink, Apache Spark has higher latency.
i. Window Criteria
• Spark only supports time-based window criteria, not record-based window criteria.
HBase
• HBase is an open-source, sorted-map database built on Hadoop. It is column-oriented and horizontally scalable.
• It is based on Google's Bigtable. It has a set of tables which keep data in key-value format.
• HBase is well suited for sparse data sets, which are very common in big data use cases.
• HBase provides APIs enabling development in practically any programming language.
• It is a part of the Hadoop ecosystem that provides random real-time read/write access to data in the Hadoop File System.
Features of HBase
• Horizontally scalable: You can add any number of columns anytime.
• Automatic failover: Automatic failover is a resource that allows a system administrator to automatically switch data handling to a standby system in the event of system compromise.
• Integration with the MapReduce framework: All the commands and Java code internally implement MapReduce to do the task, and it is built over the Hadoop Distributed File System.
• It is a sparse, distributed, persistent, multidimensional sorted map, which is indexed by row key, column key, and timestamp.
• Often referred to as a key-value store, a column-family-oriented database, or a store of versioned maps of maps.
• Fundamentally, it is a platform for storing and retrieving data with random access.
• It doesn't care about datatypes (storing an integer in one row and a string in another for the same column).
• It doesn't enforce relationships within your data.
• It is designed to run on a cluster of computers, built using commodity hardware.
HBase vs. RDBMS
➢ HBase is schema-less; an RDBMS has a fixed schema.
➢ HBase is a column-oriented datastore; an RDBMS is a row-oriented datastore.
➢ HBase is designed to store de-normalized data; an RDBMS is designed to store normalized data.
➢ HBase has wide and sparsely populated tables; an RDBMS contains thin tables.
➢ HBase supports automatic partitioning; an RDBMS has no built-in support for partitioning.
➢ HBase is well suited for OLAP systems; an RDBMS is well suited for OLTP systems.
➢ HBase reads only the relevant data from the database; an RDBMS retrieves one row at a time and hence could read unnecessary data if only some of the data in a row is required.
➢ Structured and semi-structured data can be stored and processed using HBase; structured data can be stored and processed using an RDBMS.
➢ HBase enables aggregation over many rows and columns; in an RDBMS aggregation is an expensive operation.
HBase Data Model
HBase Architecture
• In HBase, tables are split into
regions and are served by the
region servers. Regions are
vertically divided by column
families into “Stores”. Stores
are saved as files in HDFS.
• HBase has three major
components: the client
library, a master server, and
region servers.
• Region servers can be added
or removed as per
requirement.
HBase: Master Server
• The master server -
• Assigns regions to the region servers and takes
the help of Apache ZooKeeper for this task.
• Handles load balancing of the regions across
region servers. It unloads the busy servers and
shifts the regions to less occupied servers.
• Maintains the state of the cluster by negotiating
the load balancing.
• Is responsible for schema changes and other
metadata operations such as creation of tables
and column families.
HBase: Regions
• Regions are nothing but tables that are split up and spread across
the region servers.
Region server
• The region servers have regions that -
• Communicate with the client and handle data-related operations.
• Handle read and write requests for all the regions under it.
• Decide the size of the region by following the region size
thresholds.
• The store contains the MemStore and HFiles.
• The MemStore is just like a cache memory.
• Anything that is entered into HBase is stored here initially.
• Later, the data is transferred and saved in HFiles as blocks, and the MemStore is flushed.
HBase: ZooKeeper
• Zookeeper is an open-source project that provides
services like maintaining configuration information,
naming, providing distributed synchronization, etc.
• Zookeeper has ephemeral nodes representing
different region servers. Master servers use these
nodes to discover available servers.
• In addition to availability, the nodes are also used to
track server failures or network partitions.
• Clients communicate with region servers via
zookeeper.
• In pseudo and standalone modes, HBase itself will
take care of zookeeper.
HBase Shell
➢ Creating a table using the HBase shell
create '<table name>', '<column family>'
e.g. create 'emp', 'empid', 'empname', 'salary'
➢ List tables
list
➢ Disable a table
disable 'emp'
➢ Enable a table
enable 'emp'
➢ Describe
describe 'emp'
➢ Alter
• alter 'emp', NAME => 'empid', VERSIONS => 5
• alter 'emp', 'delete' => 'salary'
HBase Shell
• Existence of a table
• exists 'emp'
• Dropping a table
• disable 'emp'
• drop 'emp'
• drop_all
• drop_all 't.*'   // drops the tables matching the regex
• disable_all
• disable_all 'raj.*'
• Exit the shell
• exit
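
Beyond the shell, HBase clients exist for many languages. As a hedged illustration (not part of the original slides), the sketch below uses the third-party Python library happybase, which talks to HBase through the Thrift gateway; the Thrift server must be running, and the table and column family names follow the 'emp' example above.

import happybase

# Connect to the HBase Thrift server (default port 9090).
connection = happybase.Connection("localhost")
table = connection.table("emp")

# Write a row: cells are addressed as column_family:qualifier.
table.put(b"row1", {b"empname:first": b"Asha", b"salary:amount": b"50000"})

# Random read of a single row by row key.
print(table.row(b"row1"))

# Scan a range of rows.
for key, data in table.scan(row_start=b"row0", row_stop=b"row9"):
    print(key, data)

connection.close()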
Cassandra
• Cassandra is a distributed database from Apache that
is highly scalable and designed to manage very large
amounts of structured data.
• It provides high availability with no single point of
failure.
• It is a type of NoSQL database.
• It is scalable, fault-tolerant, and consistent.
• It is a column-oriented database.
• Its distribution design is based on Amazon’s Dynamo and its data model on
Google’s Bigtable.
• Created at Facebook, it differs sharply from relational database management
systems.
• Cassandra implements a Dynamo-style replication model with no single point of
failure, but adds a more powerful “column family” data model.
• Cassandra is being used by some of the biggest companies such as Facebook,
Twitter, Cisco, Rackspace, eBay, Netflix, and more.
Features of Cassandra

• Elastic scalability − Cassandra is highly scalable; it allows you to add more hardware to accommodate more customers and more
data as per requirement.
• Always on architecture − Cassandra has no single point of failure and it is continuously available for business-critical
applications that cannot afford a failure.
• Fast linear-scale performance − Cassandra is linearly scalable, i.e., it increases your throughput as you increase the number
of nodes in the cluster. Therefore it maintains a quick response time.
• Flexible data storage − Cassandra accommodates all possible data formats including: structured, semi-structured, and
unstructured. It can dynamically accommodate changes to your data structures according to your need.
• Easy data distribution − Cassandra provides the flexibility to distribute data where you need by replicating data across
multiple data centers.
• Transaction support − Cassandra supports properties like Atomicity, Consistency, Isolation, and Durability (ACID).
• Fast writes − Cassandra was designed to run on cheap commodity hardware. It performs blazingly fast writes and can store
hundreds of terabytes of data, without sacrificing the read efficiency.
Cassandra: Distributed Environment
• All the nodes in a cluster play the same role. Each node is independent and at the same time interconnected to other nodes.
• Each node in a cluster can accept read and write requests, regardless of where the data is actually located in the cluster.
• When a node goes down, read/write requests can be served from other nodes in the network.
Data Replication in
Cassandra
• In Cassandra, one or more of the nodes in a
cluster act as replicas for a given piece of data.
• If it is detected that some of the nodes
responded with an out-of-date value,
Cassandra will return the most recent value to
the client.
• After returning the most recent value,
Cassandra performs a read repair in the
background to update the stale values.
• Cassandra uses the Gossip Protocol in the
background to allow the nodes to communicate
with each other and detect any faulty nodes in
the cluster.
Components of Cassandra
• Node − It is the place where data is stored.
• Data center − It is a collection of related nodes.
• Cluster − A cluster is a component that contains one or more data centers.
• Commit log − The commit log is a crash-recovery mechanism in Cassandra. Every write operation is written to the commit log.
• Mem-table − A mem-table is a memory-resident data structure. After the commit log, the data is written to the mem-table. Sometimes, for a single column family, there will be multiple mem-tables.
• SSTable − It is a disk file to which the data is flushed from the mem-table when its contents reach a threshold value.
• Bloom filter − These are quick, nondeterministic algorithms for testing whether an element is a member of a set. It is a special kind of cache. Bloom filters are accessed after every query.
Cassandra Query Language
• Users can access Cassandra through its nodes using Cassandra Query Language (CQL).
• CQL treats the database (Keyspace) as a container of tables.
• Programmers use cqlsh: a prompt to work with CQL or separate application language drivers.
• Clients approach any of the nodes for their read-write operations. That node (the coordinator) acts as a proxy between
the client and the nodes holding the data.
Write Operations
• Every write activity of nodes is captured by the commit logs written in the nodes.
• Later the data will be captured and stored in the mem-table.
• Whenever the mem-table is full, data will be written into the SStable data file.
• All writes are automatically partitioned and replicated throughout the cluster.
• Cassandra periodically consolidates the SSTables, discarding unnecessary data.
Read Operations
• During read operations, Cassandra gets values from the mem-table and checks the bloom filter to find the
appropriate SSTable that holds the required data.
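
A hedged sketch of the write/read path from a client's point of view, using the DataStax Python driver (cassandra-driver); the keyspace, table, and host below are illustrative, not from the original slides:

from cassandra.cluster import Cluster

# Any node can act as the coordinator for a request.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")
session.set_keyspace("demo")
session.execute("CREATE TABLE IF NOT EXISTS emp (empid int PRIMARY KEY, empname text, salary int)")

# Write: recorded in the commit log and mem-table on each replica.
session.execute("INSERT INTO emp (empid, empname, salary) VALUES (%s, %s, %s)", (1, "Asha", 50000))

# Read: served from mem-tables/SSTables, with bloom filters narrowing the SSTable lookups.
for row in session.execute("SELECT empid, empname, salary FROM emp"):
    print(row.empid, row.empname, row.salary)

cluster.shutdown()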
Cluster

• The Cassandra database is distributed over several machines that operate together.
• The outermost container is known as the Cluster.
• For failure handling, every node contains a replica,
and in case of a failure, the replica takes charge.
• Cassandra arranges the nodes in a cluster, in a ring
format, and assigns data to them.
Keyspace
• Keyspace is the outermost container for data in Cassandra. The basic attributes of a Keyspace in Cassandra are −
• Replication factor − It is the number of machines in the cluster that will receive copies of the same data.
• Replica placement strategy − It is the strategy used to place Keyspace replicas in the ring. There are strategies such as simple strategy (rack-unaware), old network topology strategy (rack-aware), and network topology strategy (datacenter-aware).
• Column families − Keyspace is a container for a list of one or more column families. A column family, in turn, is a container of a collection of rows. Each row contains ordered columns. Column families represent the structure of your data. Each keyspace has at least one and often many column families.
• The syntax for creating a Keyspace is as follows −
CREATE KEYSPACE <keyspace_name>
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
Column Family
• A column family is a container for an ordered collection of rows. Each row, in turn, is an ordered collection of columns.

Relational Table vs. Cassandra Column Family
➢ A schema in a relational model is fixed: once we define certain columns for a table, every row inserted must fill all the columns, at least with a null value. In Cassandra, although the column families are defined, the columns are not; you can freely add any column to any column family at any time.
➢ Relational tables define only columns and the user fills in the table with values. In Cassandra, a table contains columns, or can be defined as a super column family.
Cassandra Column Family
• A Cassandra column family has the following attributes −
• keys_cached − It represents the number of locations to keep cached per SSTable.
• rows_cached − It represents the number of rows whose entire contents will be cached in memory.
• preload_row_cache − It specifies whether you want to pre-populate the row cache.
• Note − Unlike relational tables, a column family's schema is not fixed; Cassandra does not force individual rows to have all the columns.
• The following figure shows an example of a Cassandra column family.
Cassandra column family

Column
A column is the basic data structure of Cassandra, with three values: key or column name, value, and a timestamp. The structure of a column is sketched below.

Super Column
A super column is a special column; therefore, it is also a key-value pair, but a super column stores a map of sub-columns.
Generally, column families are stored on disk in individual files. Therefore, to optimize performance, it is important to keep columns that you are likely to query together in the same column family, and a super column can be helpful here. The structure of a super column is also sketched below.
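
Since the original figures for the column and super column structures are not reproduced here, a rough sketch of their shape follows, using plain Python dictionaries purely as illustration (the field values are hypothetical):

# A column: name, value, and timestamp.
column = {"name": "salary", "value": "50000", "timestamp": 1700000000}

# A super column: a key that maps to a collection of sub-columns.
super_column = {
    "name": "address",
    "sub_columns": [
        {"name": "city", "value": "Delhi", "timestamp": 1700000000},
        {"name": "pin", "value": "110001", "timestamp": 1700000000},
    ],
}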
Help Material
• 1. https://sparkbyexamples.com/
