Lecture 3 PPT 22
Introduction to Spark
Features of Spark
Introduction to Hive
Features of Hive
Recap of the previous lecture:
• Structured, semi-structured, and unstructured data: the different types of data analyzed in Big Data Analytics.
• Five V's of Big Data: Volume, Velocity, Variety, Veracity, and Value.
• Applications of Big Data Analytics: used across industries for decision making, customer insights, product development, and more.
• Introduction to Hadoop: an open-source framework used to store and process large datasets across a cluster of computers.
• Hadoop architecture: comprises HDFS, YARN, and MapReduce, with HDFS storing the data, YARN managing the resources, and MapReduce processing the data.
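The MapReduce model recapped above can be sketched in plain Python (no Hadoop required): a map step emits key/value pairs, a shuffle step groups them by key, and a reduce step aggregates each group. The function names below are illustrative, not the actual Hadoop API.

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle: group all values by key, as Hadoop does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's list of values (here, sum the counts).
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big cluster", "data cluster data"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
# counts -> {"big": 2, "data": 3, "cluster": 2}
```

In real Hadoop the map and reduce steps run in parallel on different nodes, and the shuffle moves data over the network; the logical flow is the same.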
Definition of Apache Spark
• Apache Spark is an open-source distributed computing system designed to process large-scale data sets in parallel across a cluster of computers.
• It was initially developed at UC Berkeley's AMPLab in 2009 and later became an Apache Software Foundation project.
Spark was created by Matei Zaharia at UC Berkeley's AMPLab as part of his Ph.D. research in 2009.
Spark is designed for large-scale data processing, making it well suited to big data workloads.
Spark supports real-time data processing and iterative algorithms, which batch-oriented frameworks such as Hadoop MapReduce handle poorly.
Spark provides a wide range of libraries for machine learning (MLlib), graph processing (GraphX), and real-time stream processing (Spark Streaming).
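A defining trait behind these features is that Spark's core abstraction, the resilient distributed dataset (RDD), records transformations lazily and only evaluates them when an action is called. The toy class below sketches that lazy map/filter pipeline style on a single machine in plain Python; the class and method names mirror Spark's but this is not the real PySpark API.

```python
class MiniRDD:
    # A toy, single-machine stand-in for Spark's RDD: transformations
    # (map, filter) are only recorded; actions (collect) execute the pipeline.
    def __init__(self, data):
        self._data = data
        self._ops = []

    def map(self, fn):
        self._ops.append(("map", fn))
        return self

    def filter(self, pred):
        self._ops.append(("filter", pred))
        return self

    def collect(self):
        # Action: evaluate the recorded pipeline over the data.
        result = list(self._data)
        for kind, fn in self._ops:
            if kind == "map":
                result = [fn(x) for x in result]
            else:
                result = [x for x in result if fn(x)]
        return result

squares_of_evens = (
    MiniRDD(range(10)).filter(lambda x: x % 2 == 0).map(lambda x: x * x).collect()
)
# squares_of_evens -> [0, 4, 16, 36, 64]
```

In real Spark the data is partitioned across executors, the recorded transformations form a lineage graph used for fault recovery, and intermediate results can be cached in memory, which is what makes iterative algorithms fast.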
Spark's architecture
• A Spark application runs as a driver program that builds the computation and coordinates work across the cluster.
• The driver obtains resources from a cluster manager (e.g., YARN, Mesos, or Spark's standalone manager) and launches executors on worker nodes.
• Executors run tasks in parallel and can cache data in memory, which enables fast iterative and interactive workloads.
Features of Hive
• Hive supports a wide range of data formats, including text, Avro, SequenceFile, ORC (Optimized Row Columnar), and Parquet.
• Hive provides a highly scalable and fault-tolerant data warehousing solution, making it suitable for processing large volumes of data.
• Hive provides a metadata repository (the metastore) that stores information about the structure of data, making it easy to manage and query large datasets.
• Hive supports complex queries, including joins and subqueries.
• Hive supports user-defined functions (UDFs), which can be used to extend Hive's functionality and perform custom transformations on data.
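Hive exposes these features through HiveQL, a SQL-like query language. As a rough stand-in, the same join-free aggregate pattern can be demonstrated with Python's built-in sqlite3 module; the table name and data below are invented for illustration, and real Hive would run an equivalent HiveQL statement over files in HDFS rather than an in-memory database.

```python
import sqlite3

# In-memory SQLite database standing in for a Hive table (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (store TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 100.0), ("south", 50.0), ("north", 25.0)],
)

# A HiveQL-style aggregation: total sales per store.
rows = conn.execute(
    "SELECT store, SUM(amount) FROM sales GROUP BY store ORDER BY store"
).fetchall()
conn.close()
# rows -> [("north", 125.0), ("south", 50.0)]
```

The point of Hive is exactly this familiarity: analysts write ordinary-looking SQL, and Hive compiles it into distributed jobs over data stored in Hadoop.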
Use Cases of Hive
• Data warehousing: Hive is commonly used for data warehousing and business intelligence applications, where users need to query and analyze large volumes of data stored in Hadoop. For example, Hive can be used to analyze sales data in a retail organization.
• Log processing: Hive can be used for log processing and analysis, allowing users to extract insights from log data stored in Hadoop. For example, Hive can be used to analyze web server logs to identify patterns and trends in user behavior.
• Machine learning: Hive can be used to prepare the large training datasets stored in Hadoop that machine learning applications consume. For example, Hive can be used to assemble the features for a model that predicts customer churn in a telecommunications company.
Use Cases of Hive (Continued)
• Ad-hoc analysis: Hive can be used for ad-hoc analysis, allowing users to quickly explore and analyze data stored in Hadoop. For example, Hive can be used to analyze social media data to identify trending topics and popular hashtags.
• ETL (Extract, Transform, Load): Hive can be used for ETL operations, allowing users to extract data from various sources, transform it into the desired format, and load it into Hadoop. For example, Hive can be used to extract data from a relational database, transform it into a format suitable for Hadoop, and load it into Hadoop for further analysis.
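The ETL pattern just described can be sketched end to end in a few lines of plain Python: extract records from a source, transform them into the target schema, and load them into a destination. In-memory lists stand in for the source database and Hadoop here, and all names and records are illustrative.

```python
def extract():
    # Extract: pull raw records from a source system (hard-coded here).
    return [
        {"id": 1, "name": "  Alice ", "spend": "120.50"},
        {"id": 2, "name": "Bob", "spend": "80.00"},
    ]

def transform(records):
    # Transform: clean strings and cast types into the target schema.
    return [
        {"id": r["id"], "name": r["name"].strip(), "spend": float(r["spend"])}
        for r in records
    ]

def load(records, target):
    # Load: append the transformed rows into the target store.
    target.extend(records)
    return target

warehouse = load(transform(extract()), [])
```

In a real pipeline, extract would read from a relational database or log files, and load would write files into HDFS or a Hive table; the three-stage shape stays the same.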
Recap
• Spark: an open-source distributed computing system with features such as distributed and in-memory processing, fault tolerance, and parallel and real-time processing; used for data analytics, machine learning, graph processing, and more.
• Hive: a data warehousing system built on top of Hadoop for querying and analyzing large datasets; its features include SQL-like queries, data summarization, and indexing; used for ETL processing, log analysis, data preparation for machine learning, and more.
• Both Spark and Hive have wide-ranging use cases across industries and domains, from processing large amounts of data to real-time analytics, fraud detection, and more. Their features and tooling for big data analysis and management make them essential technologies in data science and engineering.