
Vinit Patil (D17B/55)

AIM:
To implement the following programs using PySpark.
1. Word count program
2. Program to find the number of words starting with a specific letter (e.g. 'h' or 'a')

Theory:

Spark
Apache Spark is an open-source, distributed data processing framework designed for big data processing
and analytics. It was developed to address limitations in the Hadoop MapReduce model, offering
improved performance, ease of use, and a broader range of data processing capabilities. Spark provides a
unified platform for various data processing tasks, including batch processing, interactive queries, stream
processing, machine learning, and graph processing.

Key features of Spark:


● In-Memory Processing: Spark stores data in memory, which significantly accelerates data
processing compared to disk-based processing in Hadoop MapReduce.
● Distributed Computing: Spark can distribute data and computations across a cluster of machines,
enabling parallel processing for enhanced scalability.
● High-Level APIs: It offers high-level APIs in languages like Scala, Java, Python (PySpark), and
R, making it accessible to a wide range of developers.
● Rich Ecosystem: Spark has a rich ecosystem of libraries and extensions for various data
processing needs, such as Spark SQL for structured data processing, MLlib for machine learning,
GraphX for graph processing, and more.

PySpark
PySpark is the Python library for Apache Spark, allowing developers to write Spark applications using
Python. PySpark provides a high-level API for Spark, making it easier for Python developers to harness
the power of Spark's distributed data processing capabilities. It seamlessly integrates with the Spark
ecosystem, enabling Python users to leverage Spark's features for data analysis, machine learning, and
more.
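
As a brief illustration (a minimal sketch, not taken from the code of this experiment), a PySpark program typically begins by creating a SparkSession, the unified entry point to Spark's functionality; the application name and master URL below are placeholder values.

from pyspark.sql import SparkSession

# Build a local SparkSession; "local[*]" runs Spark on all cores of this machine.
# The application name "PySparkDemo" is an illustrative placeholder.
spark = SparkSession.builder \
    .appName("PySparkDemo") \
    .master("local[*]") \
    .getOrCreate()

# The underlying SparkContext is used for RDD operations.
sc = spark.sparkContext
print(spark.version)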

Key benefits of PySpark:


● Pythonic Syntax: Developers can use familiar Python syntax and libraries while working with
Spark.
● Interactive Data Exploration: PySpark can be used interactively in tools like Jupyter notebooks
for data exploration and analysis.
● Integration with Other Python Libraries: You can easily integrate PySpark with popular Python
libraries like NumPy, pandas, and scikit-learn.
PySpark supports all of Spark's features, such as Spark SQL, DataFrames, Structured Streaming, Machine Learning (MLlib), and Spark Core:

● Spark SQL: PySpark allows you to work with structured data using SQL queries, making it easy
to perform data analysis and transformations on structured data.
● DataFrames: PySpark provides DataFrames, which are distributed collections of data organized into named columns. DataFrames offer a high-level API for working with structured data and are well-suited for data manipulation and exploration (see the sketch after this list).
● Structured Streaming: With PySpark, you can process real-time data using Structured Streaming,
a scalable and fault-tolerant stream processing engine. It enables you to perform continuous data
processing and analytics on live data streams.
● Machine Learning (MLlib): PySpark's MLlib library offers a wide range of machine learning
algorithms and tools for building and deploying machine learning models at scale. It supports
various tasks like classification, regression, clustering, and more.
● Spark Core: PySpark is built on top of the Spark Core, which provides the foundational
components for distributed data processing, including Resilient Distributed Datasets (RDDs) and
the distributed computing engine.
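
To make the DataFrame and Spark SQL features above concrete, the following is a minimal sketch; the column names and sample rows are illustrative assumptions, not data from this experiment.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DataFrameSQLDemo").getOrCreate()

# Create a small DataFrame from an in-memory list of tuples (hypothetical data).
data = [("Alice", 34), ("Bob", 45), ("Cathy", 29)]
df = spark.createDataFrame(data, ["name", "age"])

# DataFrame API: filter rows and select a column by name.
df.filter(df.age > 30).select("name").show()

# Spark SQL: register the DataFrame as a temporary view and query it with SQL.
df.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 30").show()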

RDD (Resilient Distributed Dataset):


Resilient Distributed Dataset (RDD) is a fundamental data structure in Apache Spark. It serves as the core
abstraction for distributed data processing in Spark. RDDs are immutable, distributed collections of data
that can be processed in parallel across a cluster of machines.

Execution flow in the Spark architecture:


● Spark builds a graph of operations as you enter code in the Spark console (e.g. the PySpark shell).
● When an action is called, Spark submits the graph to the DAG scheduler.
● The DAG scheduler divides the operators into stages of tasks.
● The stages are passed to the Task scheduler, which launches the tasks through the Cluster Manager.
Here are some key characteristics and concepts related to RDDs:
● Resilient: RDDs are resilient because they can recover from node failures. Spark automatically
rebuilds lost data partitions using lineage information (the history of transformations applied to
the data).
● Distributed: RDDs are distributed across multiple nodes in a cluster, enabling parallel processing.
This distribution is transparent to the developer.
● Immutable: RDDs are immutable, meaning once created, their data cannot be modified. Any
transformation applied to an RDD results in the creation of a new RDD.
● Lazily Evaluated: RDD transformations are lazily evaluated, which means they are not executed
immediately. Instead, Spark builds a lineage graph to record the transformations and only
computes them when an action is called. This optimization improves performance.
● Partitioned: RDDs are divided into partitions, which are the basic units of parallelism. Each
partition is processed on a separate node in the cluster.
● Parallel Operations: RDDs support parallel operations such as map, reduce, filter, and more. These operations can be chained together to perform complex data processing tasks, as illustrated in the sketch below.
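
The following sketch demonstrates these characteristics, in particular lazy evaluation and parallel operations (the numbers used are illustrative, not part of the experiment):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDDDemo").getOrCreate()
sc = spark.sparkContext

# Create an RDD from a local collection, split into 4 partitions.
rdd = sc.parallelize(range(1, 11), numSlices=4)

# Transformations are lazy: nothing executes until an action is called.
squares = rdd.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# Actions trigger execution of the lineage built above.
print(evens.collect())                    # [4, 16, 36, 64, 100]
print(evens.reduce(lambda a, b: a + b))   # 220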

Conclusion:
In summary, Apache Spark is a powerful open-source framework for distributed data processing. PySpark
extends Spark's capabilities to Python developers, making it accessible and user-friendly. RDDs, as the
core data structure in Spark, provide resilience, distribution, immutability, and parallelism, enabling
efficient and fault-tolerant processing of large-scale data sets across a cluster of machines. Understanding
these concepts is essential for harnessing the full potential of Spark and PySpark in big data analytics and
processing tasks.
Output:

Text file:

1. Word count program
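
A minimal PySpark word count sketch (the input file name "sample.txt" is an assumed placeholder for the text file used in this experiment):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()
sc = spark.sparkContext

# Read the text file as an RDD of lines ("sample.txt" is an assumed file name).
lines = sc.textFile("sample.txt")

# Split lines into words, map each word to (word, 1), and sum counts per word.
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word.lower(), 1))
               .reduceByKey(lambda a, b: a + b))

for word, count in counts.collect():
    print(word, count)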

2. Program to find the number of words starting with a specific letter (e.g. 'h' or 'a')
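
A minimal sketch for counting words that start with a given letter (again, "sample.txt" is an assumed input file name):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordsStartingWith").getOrCreate()
sc = spark.sparkContext

letter = "h"  # letter to search for; can be changed to 'a', etc.

# Split the file into words and keep those beginning with the chosen letter.
words = sc.textFile("sample.txt").flatMap(lambda line: line.split())
starting = words.filter(lambda w: w.lower().startswith(letter))

print("Number of words starting with '%s': %d" % (letter, starting.count()))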
