Spark Big Data

• Spark was proposed by the Apache Software Foundation to speed up Hadoop's computational processing. Spark includes its own cluster management, so Hadoop is only one of the ways to deploy Spark.
• Spark can use Hadoop in two ways: for storage and for processing. Because Spark has its own computation engine and cluster management, in practice it often uses Hadoop only for storage.
Apache Spark
• Apache Spark is an open-source, distributed processing system used for big data workloads. It uses optimized query execution and in-memory caching to run fast queries against data of any size. In short, it is a fast, general-purpose engine for large-scale data processing.
• It is much faster than earlier big data approaches such as classical MapReduce, because it executes in RAM/memory rather than reading and writing to disk at every step.
Apache Spark Evolution

• Spark began as one of the most important sub-projects of Hadoop. It was developed in 2009 by Matei Zaharia at UC Berkeley's AMPLab and open-sourced in 2010 under a BSD license. Spark was donated to the Apache Software Foundation in 2013 and has been a top-level Apache project since February 2014.
• Apache Spark Core: the underlying general execution engine of the Spark platform. It provides in-memory computing and the ability to reference datasets in external storage systems.
• Spark SQL: a module of Apache Spark for working with many kinds of structured data. The interfaces provided by Spark SQL give Spark more information about both the data and the computation being performed, which it can use for optimization (a short Spark SQL sketch follows this list).
• Spark Streaming: allows Spark to process streaming data in real time. Data can be ingested from several sources such as the Hadoop Distributed File System (HDFS), Flume, and Kafka, processed with complex algorithms, and then pushed out to live dashboards, databases, and file systems.
• Machine Learning Library (MLlib): a rich library of machine learning algorithms, including collaborative filtering, clustering, regression, and classification. It also contains tools for constructing, evaluating, and tuning ML pipelines. All of these functionalities scale out across the cluster.
• GraphX: Spark's library for manipulating graphs and performing graph computations. This component unifies the Extract, Transform, and Load (ETL) process, iterative graph computation, and exploratory analysis in a single system.
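As a minimal illustration of how these components are used from the Python API, the sketch below builds a tiny DataFrame and queries it with Spark SQL. The application name, column names, and sample rows are made up for the example.

from pyspark.sql import SparkSession

# Entry point for DataFrame and SQL functionality
spark = SparkSession.builder.appName("ComponentsSketch").getOrCreate()

# A tiny in-memory dataset (hypothetical values, just for illustration)
people = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45), ("Carol", 29)],
    ["name", "age"],
)

# Register the DataFrame as a temporary view and query it with Spark SQL
people.createOrReplaceTempView("people")
adults = spark.sql("SELECT name FROM people WHERE age >= 30")
adults.show()

spark.stop()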
Architecture of Spark
• The architecture of Spark contains three main elements, listed below:
• API
• Data Storage
• Resource Management
API
• This element lets application developers create Spark-based applications through a standard API. Spark offers APIs for the Python, Java, and Scala programming languages.
Data Storage
• Spark uses the Hadoop Distributed File System for data storage. It works with any Hadoop-compatible data source, including HDFS, HBase, and Cassandra.
Resource Management
• Spark can be deployed as a stand-alone server, or on a shared computing framework such as YARN or Mesos (see the sketch below).
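A minimal sketch tying the three elements together from Python. The HDFS host, port, and path are hypothetical, and the master URL would be "yarn", a Mesos URL, or "local[*]" depending on the chosen resource manager.

from pyspark.sql import SparkSession

# API element: the Python entry point.
# Resource management element: the master URL selects the cluster manager
# ("local[*]" for a single machine, "yarn" or a Mesos URL for a cluster).
spark = (
    SparkSession.builder
    .appName("ArchitectureSketch")
    .master("local[*]")   # assumption: running locally for this example
    .getOrCreate()
)

# Data storage element: read from a Hadoop-compatible source
# (the namenode host and file path below are hypothetical).
df = spark.read.text("hdfs://namenode:9000/data/input.txt")
print(df.count())

spark.stop()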
RDD in Spark
• RDD stands for Resilient Distributed Dataset. It is a core concept in the Spark framework; think of an RDD as a table in a database. RDDs support two kinds of operations:
• Action: triggers the actual computation and returns a result to the driver (for example count or collect).
• Transformation: builds a new RDD from an existing one (for example map or filter) and is evaluated lazily.
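A minimal RDD sketch in Python: the transformations map and filter only describe new datasets, while the actions collect and count trigger the actual computation.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDDSketch").getOrCreate()
sc = spark.sparkContext

# Create an RDD from a local Python list
numbers = sc.parallelize([1, 2, 3, 4, 5])

# Transformations: lazily define new RDDs, nothing runs yet
squares = numbers.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# Actions: trigger computation and return results to the driver
print(evens.collect())   # [4, 16]
print(squares.count())   # 5

spark.stop()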
Spark Installation

• There are several ways to use and install Spark. We can install Spark on our own machine as a stand-alone framework, or use the Spark VM (Virtual Machine) images available from vendors such as MapR, Hortonworks, and Cloudera. We can also use Spark pre-configured and installed in the cloud (such as Databricks Cloud).
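For a quick stand-alone setup on a single machine, the PySpark package from PyPI is usually enough; this only covers local use, and cluster installs follow the vendor's own instructions.

pip install pyspark
pyspark   # launches an interactive Spark shell for Python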
Features of Spark

• Fast processing: One of the most important aspects of Spark is its speed, which has led the big data world to choose it over other technologies. Big data is characterized by volume, velocity, variety, and veracity, and it needs to be processed at high speed. Spark's RDD (Resilient Distributed Dataset) saves time on read and write operations, allowing it to run many times faster than Hadoop MapReduce.
• Flexibility: Spark supports multiple languages and lets developers write applications in Python, R, Scala, or Java.
• In-memory computing: Apache Spark can cache data in the servers' RAM, which permits quick access and speeds up analytics (a caching sketch follows this list).
• Real-time processing: Apache Spark can process streaming data in real time. Unlike MapReduce, which only processes stored data, Spark can work on data as it arrives and therefore produce near-instant results.
• Better analytics: Where MapReduce offers only map and reduce functions, Spark provides much more. It combines a rich set of SQL queries, machine learning algorithms, and complex analytics, so analytics can be carried out far more effectively with Spark.
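A minimal sketch of the in-memory computing feature: cache() keeps a dataset in the executors' memory after it is first computed, so later actions on it can skip re-reading the source. The file name and column name below are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CacheSketch").getOrCreate()

# Hypothetical input file with a hypothetical "status" column
df = spark.read.csv("events.csv", header=True, inferSchema=True)

# Keep the DataFrame in memory once it has been materialized
df.cache()

# The first action computes and caches the data ...
print(df.count())
# ... later actions reuse the cached copy instead of re-reading the file
print(df.filter(df["status"] == "error").count())

spark.stop()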
Hadoop Ecosystem
• Apache Hadoop is an open-source framework intended to make working with big data easier. For those not acquainted with the technology, the first question is: what is big data? Big data refers to data sets that cannot be processed efficiently with traditional approaches such as an RDBMS. Hadoop has earned its place in industries and companies that need to work with large, sensitive data sets requiring efficient handling.
Components
• There are four major elements of Hadoop, namely HDFS, MapReduce, YARN, and Hadoop Common utilities.
• The following components collectively form the Hadoop ecosystem:
• HDFS: Hadoop Distributed File System
• YARN: Yet Another Resource Negotiator
• MapReduce: Programming based Data
Processing
• Spark: In-Memory data processing
Moving data in and out of Hadoop

• Moving data in and out of Hadoop involves various methods depending on the source and destination of the data, as well as the specific Hadoop components involved. Here are some common techniques and tools used:
• 1. HDFS (Hadoop Distributed File System)
• Uploading Data to HDFS:
• HDFS Command Line Interface (CLI): You can use the hadoop fs -put or hdfs dfs -put commands to upload data to HDFS. Example:
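For example, with hypothetical local and HDFS paths:

hadoop fs -put /tmp/sales.csv /user/hadoop/sales/   # copy a local file into HDFS
hadoop fs -get /user/hadoop/results/report.csv .    # copy a file back out of HDFS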
• 3. Apache Flume
• Flume is used for efficiently collecting, aggregating,
and moving large amounts of log data from various
sources to a centralized data store, such as HDFS.
• 4. Apache Kafka
• Kafka is a distributed messaging system often used for
building real-time data pipelines. Data from various
sources can be sent to Kafka topics, and from there,
Kafka consumers can write the data into Hadoop.
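A minimal sketch of this pattern, assuming the third-party kafka-python client, a broker at localhost:9092, and a hypothetical topic name; in practice the consumer side is often handled by a connector such as Kafka Connect rather than hand-written code.

from kafka import KafkaProducer, KafkaConsumer

# Producer: send records from some source system into a Kafka topic
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("web-logs", b'{"user": "alice", "action": "login"}')
producer.flush()

# Consumer: read records from the topic; a real pipeline would then
# write each batch into Hadoop (for example via an HDFS client)
consumer = KafkaConsumer("web-logs", bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest")
for message in consumer:
    print(message.value)
    break  # read just one record for the sketch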
• 5. Apache NiFi
• NiFi is a data integration tool that supports data ingestion, routing, and transformation. It can be used to move data between different systems, including Hadoop.
• 6. Hive and HBase Integration
• Hive: You can load data into Hive tables using the LOAD
DATA command, and data from Hive tables can be exported
using commands like INSERT OVERWRITE.
• HBase: Data can be imported into HBase using tools like
HBase bulkload, and exported using HBase Export.
• 7. Custom Scripts and APIs
• For specific use cases, custom scripts using Hadoop APIs
(Java, Python, etc.) can be written to move data in and out
of Hadoop.
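As one hypothetical example of such a script, the third-party Python hdfs package (which talks to the WebHDFS REST API) can push a local file into HDFS and pull results back out. The host, port, user, and paths below are assumptions.

from hdfs import InsecureClient

# Connect to the namenode's WebHDFS endpoint (hypothetical host/port/user)
client = InsecureClient("http://namenode:9870", user="hadoop")

# Move data into Hadoop: upload a local file to an HDFS directory
client.upload("/user/hadoop/incoming/sales.csv", "sales.csv")

# Move data out of Hadoop: download a result file to the local machine
client.download("/user/hadoop/results/report.csv", "report.csv", overwrite=True)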
• 8. Cloud Integration
• If you're using Hadoop in the cloud, integration with cloud storage (such as AWS S3, Azure Blob Storage, or Google Cloud Storage) can be done using tools like S3DistCp for Amazon S3, or similar tools for other cloud providers.
