
1. **Resilient Distributed Datasets (RDDs) in Apache Spark**

**Overview:**
- RDDs are the core data structure of Apache Spark and have been available since its initial release.
- They are partitioned across the nodes of a cluster and can be held in RAM, which makes processing considerably faster than traditional disk-based approaches.
.........................
2. **Features of RDDs:**
1. **Immutable:** RDDs are immutable collections; changes result in the creation of new RDDs.
2. **Resilient:** They offer fault tolerance through lineage graphs, allowing recomputation of partitions lost due to node failures.
3. **Lazy Evaluation:** Transformations are executed only when actions are triggered, enhancing performance by avoiding unnecessary computations.
4. **Distributed:** Data is spread across multiple nodes, ensuring high availability.
5. **Persistence:** Intermediate results can be stored in memory or on disk for reuse and faster computations (see the sketch below).
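
To illustrate lazy evaluation and persistence, here is a minimal PySpark sketch; the numbers and application name are placeholders:

from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.appName("RDD-Features").getOrCreate()
sc = spark.sparkContext

# Transformations are lazy: nothing runs until an action is called.
numbers = sc.parallelize(range(1, 1_000_001))
squares = numbers.map(lambda x: x * x)        # not executed yet

# Persist the intermediate RDD so later actions reuse it
# instead of recomputing the whole lineage.
squares.persist(StorageLevel.MEMORY_ONLY)     # or squares.cache()

print(squares.count())   # first action: computes and caches the RDD
print(squares.sum())     # second action: served from the cache

squares.unpersist()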
.........................
3. **Ways to Create RDDs:**
- Using the `parallelize()` method.
- From existing RDDs.
- From external storage systems (e.g., HDFS, Amazon S3, HBase).
- From existing DataFrames and Datasets.
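
A minimal sketch of these four approaches, assuming a local SparkSession and a hypothetical HDFS path:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDD-Creation").getOrCreate()
sc = spark.sparkContext

# 1. Using parallelize() on a local collection
rdd1 = sc.parallelize([1, 2, 3, 4, 5])

# 2. From an existing RDD (every transformation returns a new RDD)
rdd2 = rdd1.map(lambda x: x * 2)

# 3. From an external storage system (hypothetical path; could also be S3 or HBase)
rdd3 = sc.textFile("hdfs:///data/example.txt")

# 4. From an existing DataFrame
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
rdd4 = df.rdd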
..............................
4. **Operations on RDDs in Apache Spark**

**1. Transformations:**
- **Purpose:** Used to manipulate RDD data and return new RDDs.
- **Evaluation:** Lazy; they are not executed until an action is performed.
- **Lineage:** Each transformation is linked to its parent RDD through a lineage graph.
- **Types:**
- **Narrow Transformations:** No data movement between RDD partitions. Examples: `map()`, `flatMap()`, `filter()`, `union()`, `mapPartitions()`.
- **Wide Transformations:** Data movement (shuffling) between RDD partitions. Examples: `groupByKey()`, `reduceByKey()`, `aggregateByKey()`, `join()`, `repartition()`.

**2. Actions:**
- **Purpose:** Return a non-RDD value, with the result delivered to the driver program.
- **Examples:** `count()`, `collect()`, `first()`, `take()`.
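
A brief sketch, using a hypothetical word list, of how transformations stay lazy until an action triggers the lineage:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDD-Operations").getOrCreate()
sc = spark.sparkContext

words = sc.parallelize(["spark", "rdd", "spark", "action"])

# Narrow transformation: each partition is processed independently, no shuffle.
pairs = words.map(lambda w: (w, 1))

# Wide transformation: values with the same key are shuffled to the same partition.
counts = pairs.reduceByKey(lambda a, b: a + b)

# Nothing has executed yet; the action below triggers the whole lineage.
print(counts.collect())   # e.g. [('spark', 2), ('rdd', 1), ('action', 1)]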
.........................
5. **Introduction to PySpark**

PySpark is an open-source, distributed computing framework designed for real-time, large-scale data
processing. It serves as the Python API for Apache Spark, enabling Python users to leverage Spark’s
capabilities.

**Use Cases:**
- **Batch Processing:** Handling large volumes of data in batch pipelines.
- **Real-time Processing:** Processing data as it arrives.
- **Machine Learning:** Implementing machine learning algorithms on big data.
- **Graph Processing:** Analyzing and processing graph data structures.

**Key Features:**
- **Real-time Computations:** Supports real-time data processing.
- **Caching and Disk Persistence:** Allows for caching intermediate results and storing data on disk.
- **Fast Processing:** Provides high-speed data processing capabilities.
- **Works Well with RDDs:** Efficiently processes data using Resilient Distributed Datasets (RDDs).
..........................
6. Loading a Dataset Into PySpark

Loading libraries:

import pyspark
from pyspark.sql import SparkSession
import pandas as pd

Creating a Spark session:

spark = SparkSession.builder.appName("Spark-Introduction").getOrCreate()
spark

Loading the dataset into PySpark:

To load a dataset into the Spark session, we can use the spark.read.csv() method and save the result in df_pyspark.

df_pyspark = spark.read.csv("Example.csv")
df_pyspark
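
By default, spark.read.csv() treats the first row as data and reads every column as a string. A brief sketch of the commonly used options, assuming the same hypothetical Example.csv with a header row:

df_pyspark = spark.read.csv("Example.csv", header=True, inferSchema=True)
df_pyspark.printSchema()   # column names taken from the header row, types inferred
df_pyspark.show(5)         # preview the first five rows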
.............................
7. **SparkSession Overview**

**Introduction:**
- **SparkSession** is the entry point for Spark, introduced in Spark 2.0.
- It allows the creation of Spark RDDs, DataFrames, and Datasets.
- It replaces older contexts like `SQLContext`, `HiveContext`, and others.

**Key Points:**

1. **SparkSession in Spark 2.0:**
- Combines the functionality of `SQLContext`, `HiveContext`, and other contexts into one unified class.
- Created using `SparkSession.builder()`.

2. **SparkSession in Spark-Shell:**
- The default `spark` object in the Spark shell is an instance of `SparkSession`.
- Use `spark.version` to check the Spark version.

3. **Creating SparkSession:**
- **From Scala or Python:** Use the `builder()` and `getOrCreate()` methods.
- **Multiple Sessions:** Create additional sessions using `newSession()`.

4. **Setting Configurations:**
- Use the `config()` method to set Spark configurations.

5. **Hive Support:**
- Enable Hive support with `enableHiveSupport()`.

6. **Other Usages:**
- **Set & Get Configs:** Manage configurations with `spark.conf`.
- **Create DataFrame:** Use `createDataFrame()` for building DataFrames.
- **Spark SQL:** Use `spark.sql()` to execute SQL queries on temporary views.
- **Create Hive Table:** Use `saveAsTable()` to create and query Hive tables.
- **Catalogs:** Access metadata with `spark.catalog`.

**Commonly Used Methods:**

- `version()`: Returns the Spark version.
- `catalog()`: Access catalog metadata.
- `conf()`: Get the runtime configuration.
- `builder()`: Create a new `SparkSession`.
- `newSession()`: Create an additional `SparkSession`.
- `createDataFrame()`: Create a DataFrame from a collection or RDD.
- `createDataset()`: Create a Dataset from a collection, DataFrame, or RDD.
- `emptyDataFrame()`: Create an empty DataFrame.
- `emptyDataset()`: Create an empty Dataset.
- `sparkContext()`: Get the `SparkContext`.
- `sql(String sql)`: Execute an SQL query and return a DataFrame.
- `sqlContext()`: Return the `SQLContext`.
- `stop()`: Stop the current `SparkContext`.
- `table()`: Return a DataFrame of a table or view.
- `udf()`: Create a Spark UDF (User-Defined Function).
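
A minimal PySpark sketch of creating a SparkSession with a configuration, building a DataFrame, and running Spark SQL on a temporary view; the application name, configuration value, and data are placeholders:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("SparkSession-Demo")                 # placeholder application name
         .config("spark.sql.shuffle.partitions", "8")  # example configuration setting
         .getOrCreate())

print(spark.version)                                   # check the Spark version

df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
df.createOrReplaceTempView("people")                   # temporary view for Spark SQL

spark.sql("SELECT name FROM people WHERE id = 2").show()

# spark.newSession() would return an additional session sharing the same SparkContext.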
................................
8. Introduction to Shared Variables

**Shared variables** are used to efficiently manage and share data across multiple nodes in a distributed computing environment. Instead of creating separate copies of variables for each node, shared variables allow nodes to access and update shared data consistently.

### Types of Shared Variables

1. **Broadcast Variables**:
- **Purpose**: Efficiently share read-only variables across all nodes.
- **Usage**: Useful for distributing large, read-only data (like lookup tables) to all nodes in a cluster.
- **Creation**: Use `SparkContext.broadcast(value)` to create a broadcast variable.

2. **Accumulators**:
- **Purpose**: Accumulate values across multiple tasks.
- **Usage**: Useful for counters or sums where tasks can update the accumulator.
- **Creation**: Use `SparkContext.accumulator(initialValue)` to create an accumulator.
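
A minimal sketch of both kinds of shared variable; the lookup table and country codes below are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Shared-Variables").getOrCreate()
sc = spark.sparkContext

# Broadcast variable: a read-only lookup table shipped once to each executor.
country_lookup = sc.broadcast({"IN": "India", "US": "United States"})

# Accumulator: a counter that tasks can only add to; the driver reads the total.
unknown_codes = sc.accumulator(0)

def to_country(code):
    name = country_lookup.value.get(code)
    if name is None:
        unknown_codes.add(1)
    return name

codes = sc.parallelize(["IN", "US", "XX", "IN"])
print(codes.map(to_country).collect())    # the action triggers the computation
print("Unknown codes:", unknown_codes.value)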

For the full implementation, see the TechM resource.


.....................................
9. **Actions in Spark RDDs**

**Purpose:**
- Actions perform operations on RDDs that return non-RDD values, with the result stored in the driver program.

**Common Actions:**
- **`collect()`:** Returns all data from the RDD as a list.
- **`count()`:** Returns the total number of elements in the RDD.
- **`reduce()`:** Computes a summarized result by applying a function across the RDD.
- **`first()`:** Retrieves the first item from the RDD.
- **`take(n)`:** Retrieves the first `n` items from the RDD.
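
A short sketch applying these actions to a small hypothetical RDD of numbers (assuming an existing SparkContext `sc`):

nums = sc.parallelize([5, 3, 8, 1, 9])

print(nums.collect())                     # [5, 3, 8, 1, 9]
print(nums.count())                       # 5
print(nums.reduce(lambda a, b: a + b))    # 26
print(nums.first())                       # 5
print(nums.take(3))                       # [5, 3, 8] (first three elements)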

Implementation: see the TechM resource.

........................
10. Transformations in Spark RDDs

**Transformations Overview**:
- Transformations manipulate RDD data and return a new RDD.
- They are evaluated lazily, meaning they are not executed until an action is performed.

**Key Transformations**:
1. **distinct**: Retrieves the unique elements of an RDD.
2. **filter**: Selects elements based on a condition.
3. **sortBy**: Sorts the data in ascending or descending order based on a given key.
4. **map**: Applies an operation to each element in an RDD, returning a new RDD with the same length as the original.
5. **flatMap**: Similar to `map`, but flattens the result, which may change the length of the RDD.
6. **union**: Combines data from two RDDs into a new RDD.
7. **intersection**: Retrieves the data common to two RDDs into a new RDD.
8. **repartition**: Changes (typically increases) the number of partitions of an RDD, performing a full shuffle.
9. **coalesce**: Decreases the number of partitions of an RDD, avoiding a full shuffle.

**Grouping Operations**:
- **groupByKey()**: Groups the values for each key in the RDD into an iterable, which results in data shuffling over the network.
- **reduceByKey()**: Similar to `groupByKey()`, but it performs a map-side combine to optimize processing.
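
A compact sketch of several of these transformations on small hypothetical RDDs (again assuming an existing SparkContext `sc`):

a = sc.parallelize([1, 2, 2, 3, 4])
b = sc.parallelize([3, 4, 5])

print(a.distinct().collect())                      # unique elements
print(a.filter(lambda x: x % 2 == 0).collect())    # even numbers only
print(a.sortBy(lambda x: x, ascending=False).collect())
print(a.map(lambda x: x * 10).collect())           # same length as `a`
print(a.flatMap(lambda x: [x, x]).collect())       # flattened, longer than `a`
print(a.union(b).collect())
print(a.intersection(b).collect())

print(a.repartition(8).getNumPartitions())         # more partitions, full shuffle
print(a.coalesce(1).getNumPartitions())            # fewer partitions, no full shuffle

pairs = sc.parallelize([("x", 1), ("y", 2), ("x", 3)])
print(pairs.groupByKey().mapValues(list).collect())     # e.g. [('x', [1, 3]), ('y', [2])] (order may vary)
print(pairs.reduceByKey(lambda u, v: u + v).collect())  # e.g. [('x', 4), ('y', 2)] (order may vary)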

Implementation: see the TechM resource.


.................................................
