
Vinit Patil (D17B/55)

AIM:
To implement the following programs using PySpark.
1. Word count program
2. Program to find the number of words starting with a specific letter (e.g. 'h' or 'a')

Theory:

Spark
Apache Spark is an open-source, distributed data processing framework designed for big data processing
and analytics. It was developed to address limitations in the Hadoop MapReduce model, offering
improved performance, ease of use, and a broader range of data processing capabilities. Spark provides a
unified platform for various data processing tasks, including batch processing, interactive queries, stream
processing, machine learning, and graph processing.

Key features of Spark:


● In-Memory Processing: Spark stores data in memory, which significantly accelerates data
processing compared to disk-based processing in Hadoop MapReduce.
● Distributed Computing: Spark can distribute data and computations across a cluster of machines,
enabling parallel processing for enhanced scalability.
● High-Level APIs: It offers high-level APIs in languages like Scala, Java, Python (PySpark), and
R, making it accessible to a wide range of developers.
● Rich Ecosystem: Spark has a rich ecosystem of libraries and extensions for various data
processing needs, such as Spark SQL for structured data processing, MLlib for machine learning,
GraphX for graph processing, and more.

PySpark
PySpark is the Python library for Apache Spark, allowing developers to write Spark applications using
Python. PySpark provides a high-level API for Spark, making it easier for Python developers to harness
the power of Spark's distributed data processing capabilities. It seamlessly integrates with the Spark
ecosystem, enabling Python users to leverage Spark's features for data analysis, machine learning, and
more.
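
As a brief illustration (a minimal sketch, not taken from the code of this experiment), a PySpark program typically begins by creating a SparkSession, the unified entry point to Spark's functionality; the application name and master URL below are placeholder values.

from pyspark.sql import SparkSession

# Build a local SparkSession; "local[*]" runs Spark on all cores of this machine.
# The application name "PySparkDemo" is an illustrative placeholder.
spark = SparkSession.builder \
    .appName("PySparkDemo") \
    .master("local[*]") \
    .getOrCreate()

# The underlying SparkContext is used for RDD operations.
sc = spark.sparkContext
print(spark.version)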

Key benefits of PySpark:


● Pythonic Syntax: Developers can use familiar Python syntax and libraries while working with
Spark.
● Interactive Data Exploration: PySpark can be used interactively in tools like Jupyter notebooks
for data exploration and analysis.
● Integration with Other Python Libraries: You can easily integrate PySpark with popular Python
libraries like NumPy, pandas, and scikit-learn.
PySpark supports all of Spark's features, such as Spark SQL, DataFrames, Structured Streaming, Machine Learning (MLlib), and Spark Core:

● Spark SQL: PySpark allows you to work with structured data using SQL queries, making it easy
to perform data analysis and transformations on structured data.
● DataFrames: PySpark provides DataFrames, which are distributed collections of data organized into named columns. DataFrames offer a high-level API for working with structured data and are well-suited for data manipulation and exploration (see the sketch after this list).
● Structured Streaming: With PySpark, you can process real-time data using Structured Streaming,
a scalable and fault-tolerant stream processing engine. It enables you to perform continuous data
processing and analytics on live data streams.
● Machine Learning (MLlib): PySpark's MLlib library offers a wide range of machine learning
algorithms and tools for building and deploying machine learning models at scale. It supports
various tasks like classification, regression, clustering, and more.
● Spark Core: PySpark is built on top of the Spark Core, which provides the foundational
components for distributed data processing, including Resilient Distributed Datasets (RDDs) and
the distributed computing engine.
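
To make the DataFrame and Spark SQL features above concrete, the following is a minimal sketch; the column names and sample rows are illustrative assumptions, not data from this experiment.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DataFrameSQLDemo").getOrCreate()

# Create a small DataFrame from an in-memory list of tuples (hypothetical data).
data = [("Alice", 34), ("Bob", 45), ("Cathy", 29)]
df = spark.createDataFrame(data, ["name", "age"])

# DataFrame API: filter rows and select a column by name.
df.filter(df.age > 30).select("name").show()

# Spark SQL: register the DataFrame as a temporary view and query it with SQL.
df.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 30").show()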

RDD (Resilient Distributed Dataset):


Resilient Distributed Dataset (RDD) is a fundamental data structure in Apache Spark. It serves as the core
abstraction for distributed data processing in Spark. RDDs are immutable, distributed collections of data
that can be processed in parallel across a cluster of machines.

Execution flow in the Spark architecture:


● Spark builds a graph of operations as you enter code in the Spark console (e.g. the PySpark shell).
● When an action is called, Spark submits the graph to the DAG scheduler.
● The DAG scheduler divides the operators into stages of tasks.
● The stages are passed to the Task scheduler, which launches the tasks through the Cluster Manager.
Here are some key characteristics and concepts related to RDDs:
● Resilient: RDDs are resilient because they can recover from node failures. Spark automatically
rebuilds lost data partitions using lineage information (the history of transformations applied to
the data).
● Distributed: RDDs are distributed across multiple nodes in a cluster, enabling parallel processing.
This distribution is transparent to the developer.
● Immutable: RDDs are immutable, meaning once created, their data cannot be modified. Any
transformation applied to an RDD results in the creation of a new RDD.
● Lazily Evaluated: RDD transformations are lazily evaluated, which means they are not executed
immediately. Instead, Spark builds a lineage graph to record the transformations and only
computes them when an action is called. This optimization improves performance.
● Partitioned: RDDs are divided into partitions, which are the basic units of parallelism. Each
partition is processed on a separate node in the cluster.
● Parallel Operations: RDDs support parallel operations such as map, reduce, filter, and more. These operations can be chained together to perform complex data processing tasks, as illustrated in the sketch below.
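
The following sketch demonstrates these characteristics, in particular lazy evaluation and parallel operations (the numbers used are illustrative, not part of the experiment):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDDDemo").getOrCreate()
sc = spark.sparkContext

# Create an RDD from a local collection, split into 4 partitions.
rdd = sc.parallelize(range(1, 11), numSlices=4)

# Transformations are lazy: nothing executes until an action is called.
squares = rdd.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# Actions trigger execution of the lineage built above.
print(evens.collect())                    # [4, 16, 36, 64, 100]
print(evens.reduce(lambda a, b: a + b))   # 220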

Conclusion:
In summary, Apache Spark is a powerful open-source framework for distributed data processing. PySpark
extends Spark's capabilities to Python developers, making it accessible and user-friendly. RDDs, as the
core data structure in Spark, provide resilience, distribution, immutability, and parallelism, enabling
efficient and fault-tolerant processing of large-scale data sets across a cluster of machines. Understanding
these concepts is essential for harnessing the full potential of Spark and PySpark in big data analytics and
processing tasks.
Output:

Text file:

1. Word count program
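
A minimal PySpark word count sketch (the input file name "sample.txt" is an assumed placeholder for the text file used in this experiment):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()
sc = spark.sparkContext

# Read the text file as an RDD of lines ("sample.txt" is an assumed file name).
lines = sc.textFile("sample.txt")

# Split lines into words, map each word to (word, 1), and sum counts per word.
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word.lower(), 1))
               .reduceByKey(lambda a, b: a + b))

for word, count in counts.collect():
    print(word, count)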

2. Program to find the number of words starting with a specific letter (e.g. 'h' or 'a')
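
A minimal sketch for counting words that start with a given letter (again, "sample.txt" is an assumed input file name):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordsStartingWith").getOrCreate()
sc = spark.sparkContext

letter = "h"  # letter to search for; can be changed to 'a', etc.

# Split the file into words and keep those beginning with the chosen letter.
words = sc.textFile("sample.txt").flatMap(lambda line: line.split())
starting = words.filter(lambda w: w.lower().startswith(letter))

print("Number of words starting with '%s': %d" % (letter, starting.count()))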
