Lecture 3 PPT 22

The document discusses Apache Spark and Hive. It provides definitions, features, and use cases of each. Spark is an open-source framework for large-scale data processing, optimized for both batch and real-time workloads. Hive provides an SQL-like interface for querying and managing large datasets stored in Hadoop.


Lecture 3: EPBA

Spark & Hive

By: Aviral Apurva


Index
Brief Recap

Introduction to Spark

Features of Spark

Use cases of Spark

Introduction to Hive

Features of Hive

Use cases of Hive


1st Lecture Recap

• Structured, semi-structured, and unstructured data: the different types of data analyzed in Big Data Analytics.
• Five V's of Big Data: Volume, Velocity, Variety, Veracity, and Value.
• Applications of Big Data Analytics: used across industries for decision making, customer insights, product development, and more.
• Introduction to Hadoop: an open-source framework used to store and process large datasets across a cluster of computers.
• Hadoop Architecture: comprises HDFS, YARN, and MapReduce, with HDFS storing the data, YARN managing the resources, and MapReduce processing the data.
Definition of Apache Spark

• Apache Spark is an open-source distributed computing system designed to process large-scale data sets in parallel across a cluster of computers.
• It was initially developed at UC Berkeley's AMPLab in 2009 and later became an Apache Software Foundation project.
The History of Spark

• Spark was created by Matei Zaharia at UC Berkeley's AMPLab as part of his Ph.D. research in 2009.
• It was released as an open-source project in 2010 and became an Apache Software Foundation project in 2013.
• Since then, it has become one of the most popular big data processing frameworks, used by companies such as Netflix, IBM, and Uber.
Difference between Hadoop and Spark

• Hadoop is a distributed file system and a batch processing framework, whereas Spark is a general-purpose data processing framework that can work with different data sources.
• Hadoop MapReduce is optimized for batch processing, whereas Spark is optimized for both batch and real-time processing.
• Hadoop uses HDFS for storage, while Spark supports various storage options such as HDFS, NoSQL databases, and cloud storage services.
• Hadoop has a steep learning curve, and programming with MapReduce is complex, whereas Spark provides easy-to-use APIs and libraries for data processing, machine learning, and graph processing.
• Hadoop's performance is generally slower than Spark's, especially for iterative algorithms that require frequent access to the same data.
• Hadoop is well-suited for processing large datasets in batches, while Spark is better for applications that require processing large amounts of data in real time or near real time, such as streaming, machine learning, and graph processing.
Why use Spark?

• Spark is designed for large-scale data processing, making it ideal for handling big data workloads.
• Spark's in-memory processing capabilities enable faster processing times compared to disk-based processing.
• Spark supports real-time data processing and iterative algorithms, which are not well supported by other big data processing frameworks.
• Spark provides a wide range of libraries for machine learning, graph processing, and real-time stream processing.
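The in-memory advantage can be illustrated with a toy sketch in plain Python (this is not Spark itself; `read_from_disk` is an invented stand-in for an expensive scan of, say, an HDFS file):

```python
# Without caching, every "action" re-reads the source; with caching
# (the effect of rdd.cache() in Spark), the source is read only once.

disk_reads = 0

def read_from_disk():
    """Stand-in for an expensive disk scan (e.g., reading an HDFS file)."""
    global disk_reads
    disk_reads += 1
    return [1, 2, 3, 4, 5]

# Without caching: two "actions" trigger two full reads.
total = sum(read_from_disk())
count = len(read_from_disk())
assert disk_reads == 2

# With caching: the data is materialized once and reused by later actions.
disk_reads = 0
cached = read_from_disk()
total = sum(cached)
count = len(cached)
assert disk_reads == 1
```

For iterative algorithms that pass over the same dataset many times, this is exactly the pattern where Spark's in-memory caching pays off.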
Spark's architecture

• Spark consists of a cluster manager and a distributed storage system.
• The cluster manager allocates resources to the Spark applications running on the cluster.
• The distributed storage system allows Spark to store and manage data across a cluster of machines.
• The processing is done by Spark executors, which run on each machine in the cluster.
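The split between a coordinating driver and executors working on partitions can be sketched in plain Python (a conceptual model only, no cluster; the function names are invented, and the "executors" here run sequentially rather than on separate machines):

```python
# The driver splits data into partitions, each executor processes one
# partition independently, and the driver combines the partial results.

def executor_task(partition):
    """What one executor does with its partition: a local sum, in this sketch."""
    return sum(partition)

def driver(data, num_executors):
    # The cluster manager would assign these partitions to executors on
    # different machines; here they are processed one after another.
    size = -(-len(data) // num_executors)  # ceiling division
    partitions = [data[i:i + size] for i in range(0, len(data), size)]
    partial_results = [executor_task(p) for p in partitions]
    return sum(partial_results)

result = driver(list(range(1, 101)), num_executors=4)
# 1 + 2 + ... + 100 = 5050
```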
Architecture of Spark compared to Hadoop
RDD (Resilient Distributed Datasets)

• RDDs are Spark's fundamental abstraction for distributed data processing.
• RDDs are immutable (their state cannot be altered) and distributed across a cluster of machines.
• RDDs are fault-tolerant, meaning they can recover from node failures.
• RDDs support two types of operations: transformations and actions. Transformations (e.g., map, filter, union) create new RDDs from existing ones without computing anything; only when we want to work with the actual dataset is an action (e.g., reduce, count, take, foreach) performed.
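The transformation/action distinction can be imitated in a few lines of plain Python (a toy sketch, not real Spark; the class and method bodies are invented, though the chained style mirrors real PySpark code such as `sc.parallelize(data).filter(f).map(g).collect()`):

```python
# Transformations only record a plan; actions execute it.

class ToyRDD:
    def __init__(self, data, plan=None):
        self._data = data
        self._plan = plan or []   # recorded transformations, not yet run

    # --- transformations: lazy, each returns a new ToyRDD ---
    def map(self, f):
        return ToyRDD(self._data, self._plan + [("map", f)])

    def filter(self, pred):
        return ToyRDD(self._data, self._plan + [("filter", pred)])

    # --- actions: eager, run the recorded plan over the data ---
    def collect(self):
        out = self._data
        for kind, f in self._plan:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out

    def count(self):
        return len(self.collect())

rdd = ToyRDD([1, 2, 3, 4, 5])
squares_of_evens = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
# Nothing has been computed yet; computation happens only at the action:
assert squares_of_evens.collect() == [4, 16]
assert squares_of_evens.count() == 2
```

Immutability also shows up here: each transformation returns a new object, leaving the original untouched.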
Features of Apache Spark

• Distributed processing: Spark can distribute data across a cluster of machines for large-scale processing.
• In-memory processing: Spark keeps data in memory, making it faster than disk-based Hadoop MapReduce.
• Fault tolerance: built-in mechanisms ensure reliability in case of machine failures or data loss.
• Lazy evaluation: Spark delays computations until necessary, optimizing resource usage.
• Parallel processing: Spark can parallelize computations across cores and machines, achieving faster performance.
Features of Apache Spark (Contd.)

• Real-time processing: Spark supports real-time processing, making it ideal for applications like fraud detection and sensor data processing.
• Streaming: Spark Streaming is a real-time processing engine built on top of Spark for processing data streams.
• RDDs: Spark uses Resilient Distributed Datasets for parallel processing.
• Ecosystem: Spark has a rich ecosystem with libraries for machine learning, graph processing, and more.
• Accessibility: Spark provides easy-to-use APIs and libraries for data processing, making it more accessible than Hadoop.
Use Cases of Apache Spark
• Data processing and analytics: Spark can handle large-scale data processing and is commonly used for data analysis,
data mining, and data exploration. For example, Spark can be used to analyze social media data to extract insights on
customer sentiment and behavior.
• Machine learning and AI: Spark has a rich library of machine learning algorithms and is widely used for building AI
applications. For example, Spark can be used to build a recommendation engine for a streaming platform like Netflix.
• Real-time data processing: Spark can process data in real-time, making it suitable for applications like fraud detection
and sensor data processing. For example, Spark can be used to detect fraudulent credit card transactions in real-time.
• Graph processing: Spark can handle large-scale graph processing, making it suitable for applications like social network
analysis and recommendation systems. For example, Spark can be used to build a recommendation system for a social
media platform.
Use Cases of Apache Spark (Contd.)
• Log processing and analysis: Spark is commonly used for log processing and analysis to extract insights from server logs, application logs,
and more. For example, Spark can be used to analyze server logs to identify performance issues and improve server reliability.
• Sensor data processing: Spark can handle real-time sensor data processing, making it suitable for applications like IoT and smart cities.
For example, Spark can be used to process sensor data from a fleet of vehicles to optimize routing and reduce fuel consumption. (Tesla
Fleet Learning)
• Fraud detection: Spark is widely used for fraud detection in finance, insurance, and e-commerce industries. For example, Spark can be
used to detect insurance fraud by analyzing large volumes of claims data.
• E-commerce: Spark is used for personalization, recommendation, and customer segmentation in e-commerce. For example, Spark can be
used to recommend products to customers based on their past purchase history.
• Genomics: Spark is used for genomics processing and analysis in bioinformatics. For example, Spark can be used to analyze genomic data
to identify potential causes of disease and develop new treatments.
Practical Example
Database, Data Mart, Data Warehouse
• A database is a collection of data that is organized and managed to allow for easy access and
manipulation. It can be used for a wide range of purposes, from managing customer data to
tracking inventory.
• A data mart is a subset of a larger data warehouse that is focused on a particular area of the
business. It is designed to support specific business functions, such as sales or marketing, and
provides a simplified view of the data that is relevant to those functions.
• A data warehouse is a centralized repository for all the data that a business collects. It
integrates data from multiple sources, including databases and data marts, and is optimized
for fast querying and reporting. It is designed to support decision-making through analytics,
reporting, and data mining, and requires data governance, ETL processes, and data quality
management.
Examples
1. Database: A company that sells products online may have a database that stores customer
information, including names, addresses, and order history.
2. Data Mart: In the same company, the marketing department may have a data mart that contains
customer data relevant to their function, such as demographics, purchase behaviour, and
campaign response rates.
3. Data Warehouse: The company may then have a data warehouse that integrates data from
multiple sources, including the customer database and the marketing data mart, as well as data
from other departments such as finance and operations. The data warehouse provides a holistic
view of the business, allowing for analysis and reporting across all areas. For example, it may be
used to analyze sales trends by product category, customer segment, and geographic region, or to
track inventory levels and supply chain performance. The data warehouse requires data
governance to ensure data accuracy and consistency, ETL processes to extract, transform and load
data from various sources, and data quality management to identify and correct errors in the data.
ETL: Extract, Transform, Load
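A minimal ETL pipeline can be sketched in a few lines of Python, standing in for what a Hive- or Spark-based pipeline would do at scale (the file contents, table name, and column names are all invented for illustration):

```python
# Extract rows from CSV text, transform them, and load them into a
# SQLite table playing the role of the warehouse.
import csv
import io
import sqlite3

raw = "name,amount\nalice,10\nbob,20\ncarol,30\n"          # Extract: source data

rows = list(csv.DictReader(io.StringIO(raw)))

transformed = [(r["name"].title(), int(r["amount"]) * 2)   # Transform: clean + derive
               for r in rows]

conn = sqlite3.connect(":memory:")                         # Load: into the "warehouse"
conn.execute("CREATE TABLE sales (name TEXT, doubled_amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", transformed)

total = conn.execute("SELECT SUM(doubled_amount) FROM sales").fetchone()[0]
# total is (10 + 20 + 30) * 2 = 120
```

The three stages are deliberately kept separate: in a real pipeline each would read from and write to durable storage so a failed stage can be re-run on its own.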
Introduction to Hive

• Hive is a data warehousing tool built on top of Hadoop that provides an SQL-like interface to query data stored in Hadoop.
• Hive was developed by Facebook and is now an open-source project maintained by the Apache Software Foundation.
• Hive provides a familiar SQL-like syntax for querying data stored in Hadoop, making it easy for users who are familiar with SQL to work with big data.
Features of Hive

• Hive supports a wide range of data formats, including text, Avro, SequenceFile, ORC( Optimized Row
Columnar), and Parquet.

• Hive provides a highly scalable and fault-tolerant data warehousing solution, making it suitable for
processing large volumes of data.
• Hive provides a metadata repository that stores information about the structure of data, making it
easy to manage and query large datasets.
• Hive supports complex queries, including joins and subqueries.
• Hive provides support for user-defined functions (UDFs), which can be used to extend Hive's
functionality and perform custom transformations on data.
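Because HiveQL closely resembles standard SQL, the flavor of a Hive query plus a UDF can be sketched with Python's built-in sqlite3 module (this is SQLite, not Hive; the table, columns, and `popularity` function are invented for illustration — in Hive a UDF would instead be registered with CREATE FUNCTION and run over data in HDFS):

```python
# An SQL-like aggregation plus a user-defined scalar function,
# roughly the role Hive UDFs play.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (url TEXT, views INTEGER)")
conn.executemany("INSERT INTO page_views VALUES (?, ?)",
                 [("/home", 120), ("/about", 30), ("/home", 80)])

# Register a UDF with the engine, analogous to extending Hive.
conn.create_function("popularity", 1,
                     lambda views: "hot" if views >= 100 else "cold")

rows = conn.execute("""
    SELECT url, SUM(views) AS total, popularity(SUM(views)) AS bucket
    FROM page_views
    GROUP BY url
    ORDER BY total DESC
""").fetchall()
# rows == [('/home', 200, 'hot'), ('/about', 30, 'cold')]
```

The query itself (SELECT / GROUP BY / ORDER BY) would run essentially unchanged in Hive, which is the point of its SQL-like syntax.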
Use Cases of Hive

• Data warehousing: Hive is commonly used for data warehousing and business intelligence
applications, where users need to query and analyze large volumes of data stored in Hadoop. For
example, Hive can be used to analyze sales data in a retail organization.
• Log processing: Hive can be used for log processing and analysis, allowing users to extract insights
from log data stored in Hadoop. For example, Hive can be used to analyze web server logs to identify
patterns and trends in user behavior.
• Machine learning: Hive can be used for machine learning applications, allowing users to train
machine learning models on large datasets stored in Hadoop. For example, Hive can be used to train
a machine learning model to predict customer churn in a telecommunications company.
Use Cases of Hive (Continued)

• Ad-hoc analysis: Hive can be used for ad-hoc analysis, allowing users to quickly explore and
analyze data stored in Hadoop. For example, Hive can be used to analyze social media data
to identify trending topics and popular hashtags.
• ETL (Extract, Transform, Load): Hive can be used for ETL operations, allowing users to
extract data from various sources, transform it into the desired format, and load it into
Hadoop. For example, Hive can be used to extract data from a relational database,
transform it into a format suitable for Hadoop, and load it into Hadoop for further analysis.
Recap

• Spark: Open-source distributed computing system with features like distributed and
in-memory processing, fault-tolerance, parallel and real-time processing, used for
data analytics, machine learning, graph processing, etc.
• Hive: Data warehousing system built on top of Hadoop for querying and analyzing
large datasets, features include SQL-like queries, data summarization, indexing, etc.,
used for ETL processing, log analysis, machine learning, etc.
• Both Spark and Hive have wide-ranging use cases across various industries and
domains, from processing large amounts of data to real-time analytics, fraud
detection, and more. They offer powerful features and tools for big data analysis and
management, making them essential technologies in the world of data science and
engineering.
