Lecture 3 PPT 22

The document discusses Apache Spark and Hive. It provides definitions, features, and use cases of each. Spark is an open-source framework for large-scale data processing, optimized for both batch and real-time workloads. Hive provides an SQL-like interface for querying and managing large datasets stored in Hadoop.


Lecture 3: EPBA

Spark & Hive

By: Aviral Apurva


Index
Brief Recap

Introduction to Spark

Features of Spark

Use cases of Spark

Introduction to Hive

Features of Hive

Use cases of Hive


1st Lecture Recap

• Structured, semi-structured, and unstructured data: the different types of data analyzed in Big Data Analytics.
• Five V's of Big Data: Volume, Velocity, Variety, Veracity, and Value.
• Applications of Big Data Analytics: used across industries for decision making, customer insights, product development, and more.
• Introduction to Hadoop: an open-source framework used to store and process large datasets across a cluster of computers.
• Hadoop Architecture: comprises HDFS, YARN, and MapReduce, with HDFS storing the data, YARN managing the resources, and MapReduce processing the data.
Definition of Apache Spark

• Apache Spark is an open-source distributed computing system designed to process large-scale data sets in parallel across a cluster of computers.
• It was initially developed at UC Berkeley's AMPLab in 2009 and later became an Apache Software Foundation project.
The History of Spark

• Spark was created by Matei Zaharia at UC Berkeley's AMPLab as part of his Ph.D. research in 2009.
• It was released as an open-source project in 2010 and became an Apache Software Foundation project in 2013.
• Since then, it has become one of the most popular big data processing frameworks, used by companies such as Netflix, IBM, and Uber.
Difference between Hadoop and Spark

• Hadoop is a distributed file system and a batch processing framework, whereas Spark is a general-purpose data processing framework that can work with different data sources.
• Hadoop MapReduce is optimized for batch processing, whereas Spark is optimized for both batch and real-time processing.
• Hadoop uses HDFS for storage, while Spark supports various storage options such as HDFS, NoSQL databases, and cloud storage services.
• Hadoop has a steep learning curve, and programming with MapReduce is complex, whereas Spark provides easy-to-use APIs and libraries for data processing, machine learning, and graph processing.
• Hadoop's performance is generally slower than Spark's, especially for iterative algorithms that require frequent access to the same data.
• Hadoop is well-suited for processing large datasets in batches, while Spark is better for applications that require processing large amounts of data in real time or near real time, such as streaming, machine learning, and graph processing.
Why use Spark?

• Spark is designed for large-scale data processing, making it ideal for handling big data workloads.
• Spark's in-memory processing capabilities enable faster processing times compared to disk-based processing.
• Spark supports real-time data processing and iterative algorithms, which are not well supported by other big data processing frameworks.
• Spark provides a wide range of libraries for machine learning, graph processing, and real-time stream processing.
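The in-memory advantage can be illustrated with a toy sketch in plain Python (this is not Spark itself; `read_from_disk` is an invented stand-in for an expensive scan of, say, an HDFS file):

```python
# Without caching, every "action" re-reads the source; with caching
# (the effect of rdd.cache() in Spark), the source is read only once.

disk_reads = 0

def read_from_disk():
    """Stand-in for an expensive disk scan (e.g., reading an HDFS file)."""
    global disk_reads
    disk_reads += 1
    return [1, 2, 3, 4, 5]

# Without caching: two "actions" trigger two full reads.
total = sum(read_from_disk())
count = len(read_from_disk())
assert disk_reads == 2

# With caching: the data is materialized once and reused by later actions.
disk_reads = 0
cached = read_from_disk()
total = sum(cached)
count = len(cached)
assert disk_reads == 1
```

For iterative algorithms that pass over the same dataset many times, this is exactly the pattern where Spark's in-memory caching pays off.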
Spark's architecture

• Spark consists of a cluster manager and a distributed storage system.
• The cluster manager allocates resources to the Spark applications running on the cluster.
• The distributed storage system allows Spark to store and manage data across a cluster of machines.
• The processing is done by Spark executors, which run on each machine in the cluster.
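The split between a coordinating driver and executors working on partitions can be sketched in plain Python (a conceptual model only, no cluster; the function names are invented, and the "executors" here run sequentially rather than on separate machines):

```python
# The driver splits data into partitions, each executor processes one
# partition independently, and the driver combines the partial results.

def executor_task(partition):
    """What one executor does with its partition: a local sum, in this sketch."""
    return sum(partition)

def driver(data, num_executors):
    # The cluster manager would assign these partitions to executors on
    # different machines; here they are processed one after another.
    size = -(-len(data) // num_executors)  # ceiling division
    partitions = [data[i:i + size] for i in range(0, len(data), size)]
    partial_results = [executor_task(p) for p in partitions]
    return sum(partial_results)

result = driver(list(range(1, 101)), num_executors=4)
# 1 + 2 + ... + 100 = 5050
```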
Architecture of Spark compared to Hadoop
RDD (Resilient Distributed Datasets)

• RDDs are Spark's fundamental abstraction for distributed data processing.
• RDDs are immutable (their state cannot be altered) and distributed across a cluster of machines.
• RDDs are fault-tolerant, meaning they can recover from node failures.
• RDDs support two types of operations: transformations and actions. Transformations (e.g., map, filter, union) create new RDDs from existing ones without computing anything; only when we want to work with the actual dataset is an action (e.g., reduce, count, take, foreach) performed.
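The transformation/action distinction can be imitated in a few lines of plain Python (a toy sketch, not real Spark; the class and method bodies are invented, though the chained style mirrors real PySpark code such as `sc.parallelize(data).filter(f).map(g).collect()`):

```python
# Transformations only record a plan; actions execute it.

class ToyRDD:
    def __init__(self, data, plan=None):
        self._data = data
        self._plan = plan or []   # recorded transformations, not yet run

    # --- transformations: lazy, each returns a new ToyRDD ---
    def map(self, f):
        return ToyRDD(self._data, self._plan + [("map", f)])

    def filter(self, pred):
        return ToyRDD(self._data, self._plan + [("filter", pred)])

    # --- actions: eager, run the recorded plan over the data ---
    def collect(self):
        out = self._data
        for kind, f in self._plan:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out

    def count(self):
        return len(self.collect())

rdd = ToyRDD([1, 2, 3, 4, 5])
squares_of_evens = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
# Nothing has been computed yet; computation happens only at the action:
assert squares_of_evens.collect() == [4, 16]
assert squares_of_evens.count() == 2
```

Immutability also shows up here: each transformation returns a new object, leaving the original untouched.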
Features of Apache Spark

• Distributed processing: Spark can distribute data across a cluster of machines for large-scale processing.
• In-memory processing: Spark keeps data in memory, making it faster than disk-based Hadoop MapReduce.
• Fault tolerance: built-in mechanisms ensure reliability in case of machine failures or data loss.
• Lazy evaluation: Spark delays computations until necessary, optimizing resource usage.
• Parallel processing: Spark can parallelize computations across cores and machines, achieving faster performance.
Features of Apache Spark (Contd.)

• Real-time processing: Spark supports real-time processing, making it ideal for applications like fraud detection and sensor data processing.
• Streaming: Spark Streaming is a real-time processing engine built on top of Spark for processing data streams.
• RDDs: Spark uses Resilient Distributed Datasets for parallel processing.
• Ecosystem: Spark has a rich ecosystem with libraries for machine learning, graph processing, and more.
• Accessibility: Spark provides easy-to-use APIs and libraries for data processing, making it more accessible than Hadoop.
Use Cases of Apache Spark
• Data processing and analytics: Spark can handle large-scale data processing and is commonly used for data analysis,
data mining, and data exploration. For example, Spark can be used to analyze social media data to extract insights on
customer sentiment and behavior.
• Machine learning and AI: Spark has a rich library of machine learning algorithms and is widely used for building AI
applications. For example, Spark can be used to build a recommendation engine for a streaming platform like Netflix.
• Real-time data processing: Spark can process data in real-time, making it suitable for applications like fraud detection
and sensor data processing. For example, Spark can be used to detect fraudulent credit card transactions in real-time.
• Graph processing: Spark can handle large-scale graph processing, making it suitable for applications like social network
analysis and recommendation systems. For example, Spark can be used to build a recommendation system for a social
media platform.
Use Cases of Apache Spark (Contd.)
• Log processing and analysis: Spark is commonly used for log processing and analysis to extract insights from server logs, application logs,
and more. For example, Spark can be used to analyze server logs to identify performance issues and improve server reliability.
• Sensor data processing: Spark can handle real-time sensor data processing, making it suitable for applications like IoT and smart cities.
For example, Spark can be used to process sensor data from a fleet of vehicles to optimize routing and reduce fuel consumption. (Tesla
Fleet Learning)
• Fraud detection: Spark is widely used for fraud detection in finance, insurance, and e-commerce industries. For example, Spark can be
used to detect insurance fraud by analyzing large volumes of claims data.
• E-commerce: Spark is used for personalization, recommendation, and customer segmentation in e-commerce. For example, Spark can be
used to recommend products to customers based on their past purchase history.
• Genomics: Spark is used for genomics processing and analysis in bioinformatics. For example, Spark can be used to analyze genomic data
to identify potential causes of disease and develop new treatments.
Practical Example
Database, Data Mart, Data Warehouse
• A database is a collection of data that is organized and managed to allow for easy access and
manipulation. It can be used for a wide range of purposes, from managing customer data to
tracking inventory.
• A data mart is a subset of a larger data warehouse that is focused on a particular area of the
business. It is designed to support specific business functions, such as sales or marketing, and
provides a simplified view of the data that is relevant to those functions.
• A data warehouse is a centralized repository for all the data that a business collects. It
integrates data from multiple sources, including databases and data marts, and is optimized
for fast querying and reporting. It is designed to support decision-making through analytics,
reporting, and data mining, and requires data governance, ETL processes, and data quality
management.
Examples
1. Database: A company that sells products online may have a database that stores customer
information, including names, addresses, and order history.
2. Data Mart: In the same company, the marketing department may have a data mart that contains
customer data relevant to their function, such as demographics, purchase behaviour, and
campaign response rates.
3. Data Warehouse: The company may then have a data warehouse that integrates data from
multiple sources, including the customer database and the marketing data mart, as well as data
from other departments such as finance and operations. The data warehouse provides a holistic
view of the business, allowing for analysis and reporting across all areas. For example, it may be
used to analyze sales trends by product category, customer segment, and geographic region, or to
track inventory levels and supply chain performance. The data warehouse requires data
governance to ensure data accuracy and consistency, ETL processes to extract, transform and load
data from various sources, and data quality management to identify and correct errors in the data.
ETL: Extract, Transform, Load
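A minimal ETL pipeline can be sketched in a few lines of Python, standing in for what a Hive- or Spark-based pipeline would do at scale (the file contents, table name, and column names are all invented for illustration):

```python
# Extract rows from CSV text, transform them, and load them into a
# SQLite table playing the role of the warehouse.
import csv
import io
import sqlite3

raw = "name,amount\nalice,10\nbob,20\ncarol,30\n"          # Extract: source data

rows = list(csv.DictReader(io.StringIO(raw)))

transformed = [(r["name"].title(), int(r["amount"]) * 2)   # Transform: clean + derive
               for r in rows]

conn = sqlite3.connect(":memory:")                         # Load: into the "warehouse"
conn.execute("CREATE TABLE sales (name TEXT, doubled_amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", transformed)

total = conn.execute("SELECT SUM(doubled_amount) FROM sales").fetchone()[0]
# total is (10 + 20 + 30) * 2 = 120
```

The three stages are deliberately kept separate: in a real pipeline each would read from and write to durable storage so a failed stage can be re-run on its own.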
Introduction to Hive

• Hive is a data warehousing tool built on top of Hadoop that provides an SQL-like interface to query data stored in Hadoop.
• Hive was developed by Facebook and is now an open-source project maintained by the Apache Software Foundation.
• Hive provides a familiar SQL-like syntax for querying data stored in Hadoop, making it easy for users who are familiar with SQL to work with big data.
Features of Hive

• Hive supports a wide range of data formats, including text, Avro, SequenceFile, ORC( Optimized Row
Columnar), and Parquet.

• Hive provides a highly scalable and fault-tolerant data warehousing solution, making it suitable for
processing large volumes of data.
• Hive provides a metadata repository that stores information about the structure of data, making it
easy to manage and query large datasets.
• Hive supports complex queries, including joins and subqueries.
• Hive provides support for user-defined functions (UDFs), which can be used to extend Hive's
functionality and perform custom transformations on data.
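Because HiveQL closely resembles standard SQL, the flavor of a Hive query plus a UDF can be sketched with Python's built-in sqlite3 module (this is SQLite, not Hive; the table, columns, and `popularity` function are invented for illustration — in Hive a UDF would instead be registered with CREATE FUNCTION and run over data in HDFS):

```python
# An SQL-like aggregation plus a user-defined scalar function,
# roughly the role Hive UDFs play.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (url TEXT, views INTEGER)")
conn.executemany("INSERT INTO page_views VALUES (?, ?)",
                 [("/home", 120), ("/about", 30), ("/home", 80)])

# Register a UDF with the engine, analogous to extending Hive.
conn.create_function("popularity", 1,
                     lambda views: "hot" if views >= 100 else "cold")

rows = conn.execute("""
    SELECT url, SUM(views) AS total, popularity(SUM(views)) AS bucket
    FROM page_views
    GROUP BY url
    ORDER BY total DESC
""").fetchall()
# rows == [('/home', 200, 'hot'), ('/about', 30, 'cold')]
```

The query itself (SELECT / GROUP BY / ORDER BY) would run essentially unchanged in Hive, which is the point of its SQL-like syntax.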
Use Cases of Hive

• Data warehousing: Hive is commonly used for data warehousing and business intelligence
applications, where users need to query and analyze large volumes of data stored in Hadoop. For
example, Hive can be used to analyze sales data in a retail organization.
• Log processing: Hive can be used for log processing and analysis, allowing users to extract insights
from log data stored in Hadoop. For example, Hive can be used to analyze web server logs to identify
patterns and trends in user behavior.
• Machine learning: Hive can be used for machine learning applications, allowing users to train
machine learning models on large datasets stored in Hadoop. For example, Hive can be used to train
a machine learning model to predict customer churn in a telecommunications company.
Use Cases of Hive (Continued)

• Ad-hoc analysis: Hive can be used for ad-hoc analysis, allowing users to quickly explore and
analyze data stored in Hadoop. For example, Hive can be used to analyze social media data
to identify trending topics and popular hashtags.
• ETL (Extract, Transform, Load): Hive can be used for ETL operations, allowing users to
extract data from various sources, transform it into the desired format, and load it into
Hadoop. For example, Hive can be used to extract data from a relational database,
transform it into a format suitable for Hadoop, and load it into Hadoop for further analysis.
Recap

• Spark: Open-source distributed computing system with features like distributed and
in-memory processing, fault-tolerance, parallel and real-time processing, used for
data analytics, machine learning, graph processing, etc.
• Hive: Data warehousing system built on top of Hadoop for querying and analyzing
large datasets, features include SQL-like queries, data summarization, indexing, etc.,
used for ETL processing, log analysis, machine learning, etc.
• Both Spark and Hive have wide-ranging use cases across various industries and
domains, from processing large amounts of data to real-time analytics, fraud
detection, and more. They offer powerful features and tools for big data analysis and
management, making them essential technologies in the world of data science and
engineering.
