
1. **Resilient Distributed Datasets (RDDs) in Apache Spark**

**Overview:**
- RDDs are the core data structure of Apache Spark and have been available since its initial release.
- They are partitioned across the nodes of a cluster and can be held in RAM, which makes processing considerably faster than traditional disk-based approaches.
.........................
2. **Features of RDDs:**
1. **Immutable:** RDDs are immutable collections; changes result in the creation of new RDDs.
2. **Resilient:** They offer fault tolerance through lineage graphs, allowing recomputation of partitions lost due to node failures.
3. **Lazy Evaluation:** Transformations are executed only when actions are triggered, enhancing performance by avoiding unnecessary computations.
4. **Distributed:** Data is spread across multiple nodes, ensuring high availability.
5. **Persistence:** Intermediate results can be stored in memory or on disk for reuse and faster computations (see the sketch below).
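
To illustrate lazy evaluation and persistence, here is a minimal PySpark sketch; the numbers and application name are placeholders:

from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.appName("RDD-Features").getOrCreate()
sc = spark.sparkContext

# Transformations are lazy: nothing runs until an action is called.
numbers = sc.parallelize(range(1, 1_000_001))
squares = numbers.map(lambda x: x * x)        # not executed yet

# Persist the intermediate RDD so later actions reuse it
# instead of recomputing the whole lineage.
squares.persist(StorageLevel.MEMORY_ONLY)     # or squares.cache()

print(squares.count())   # first action: computes and caches the RDD
print(squares.sum())     # second action: served from the cache

squares.unpersist()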
.........................
3. **Ways to Create RDDs:**
- Using the `parallelize()` method.
- From existing RDDs.
- From external storage systems (e.g., HDFS, Amazon S3, HBase).
- From existing DataFrames and Datasets.
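
A minimal sketch of these four approaches, assuming a local SparkSession and a hypothetical HDFS path:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDD-Creation").getOrCreate()
sc = spark.sparkContext

# 1. Using parallelize() on a local collection
rdd1 = sc.parallelize([1, 2, 3, 4, 5])

# 2. From an existing RDD (every transformation returns a new RDD)
rdd2 = rdd1.map(lambda x: x * 2)

# 3. From an external storage system (hypothetical path; could also be S3 or HBase)
rdd3 = sc.textFile("hdfs:///data/example.txt")

# 4. From an existing DataFrame
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
rdd4 = df.rdd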
..............................
4. **Operations on RDDs in Apache Spark**

**1. Transformations:**
- **Purpose:** Used to manipulate RDD data and return new RDDs.
- **Evaluation:** Lazy; they are not executed until an action is performed.
- **Lineage:** Each transformation is linked to its parent RDD through a lineage graph.
- **Types:**
- **Narrow Transformations:** No data movement between RDD partitions. Examples: `map()`, `flatMap()`, `filter()`, `union()`, `mapPartitions()`.
- **Wide Transformations:** Data movement (shuffling) between RDD partitions. Examples: `groupByKey()`, `reduceByKey()`, `aggregateByKey()`, `join()`, `repartition()`.

**2. Actions:**
- **Purpose:** Return a non-RDD value, with the result delivered to the driver program.
- **Examples:** `count()`, `collect()`, `first()`, `take()`.
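
A brief sketch, using a hypothetical word list, of how transformations stay lazy until an action triggers the lineage:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDD-Operations").getOrCreate()
sc = spark.sparkContext

words = sc.parallelize(["spark", "rdd", "spark", "action"])

# Narrow transformation: each partition is processed independently, no shuffle.
pairs = words.map(lambda w: (w, 1))

# Wide transformation: values with the same key are shuffled to the same partition.
counts = pairs.reduceByKey(lambda a, b: a + b)

# Nothing has executed yet; the action below triggers the whole lineage.
print(counts.collect())   # e.g. [('spark', 2), ('rdd', 1), ('action', 1)]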
.........................
5. **Introduction to PySpark**

PySpark is an open-source, distributed computing framework designed for real-time, large-scale data
processing. It serves as the Python API for Apache Spark, enabling Python users to leverage Spark’s
capabilities.

**Use Cases:**
- **Batch Processing:** Handling large volumes of data in batch pipelines.
- **Real-time Processing:** Processing data as it arrives.
- **Machine Learning:** Implementing machine learning algorithms on big data.
- **Graph Processing:** Analyzing and processing graph data structures.

**Key Features:**
- **Real-time Computations:** Supports real-time data processing.
- **Caching and Disk Persistence:** Allows for caching intermediate results and storing data on disk.
- **Fast Processing:** Provides high-speed data processing capabilities.
- **Works Well with RDDs:** Efficiently processes data using Resilient Distributed Datasets (RDDs).
..........................
6. Loading a Dataset Into PySpark

Loading libraries:

import pyspark
from pyspark.sql import SparkSession
import pandas as pd

Creating a Spark session:

spark = SparkSession.builder.appName("Spark-Introduction").getOrCreate()
spark

Loading the dataset into PySpark:

To load a dataset into the Spark session, we can use the spark.read.csv() method and save the result in df_pyspark.

df_pyspark = spark.read.csv("Example.csv")
df_pyspark
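
By default, spark.read.csv() treats the first row as data and reads every column as a string. A brief sketch of the commonly used options, assuming the same hypothetical Example.csv with a header row:

df_pyspark = spark.read.csv("Example.csv", header=True, inferSchema=True)
df_pyspark.printSchema()   # column names taken from the header row, types inferred
df_pyspark.show(5)         # preview the first five rows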
.............................
7. **SparkSession Overview**

**Introduction:**
- **SparkSession** is the entry point for Spark, introduced in Spark 2.0.
- It allows the creation of Spark RDDs, DataFrames, and Datasets.
- It replaces older contexts like `SQLContext`, `HiveContext`, and others.

**Key Points:**

1. **SparkSession in Spark 2.0:**
- Combines the functionality of `SQLContext`, `HiveContext`, and other contexts into one unified class.
- Created using `SparkSession.builder()`.

2. **SparkSession in Spark-Shell:**
- The default `spark` object in the Spark shell is an instance of `SparkSession`.
- Use `spark.version` to check the Spark version.

3. **Creating SparkSession:**
- **From Scala or Python:** Use the `builder()` and `getOrCreate()` methods.
- **Multiple Sessions:** Create additional sessions using `newSession()`.

4. **Setting Configurations:**
- Use the `config()` method to set Spark configurations.

5. **Hive Support:**
- Enable Hive support with `enableHiveSupport()`.

6. **Other Usages:**
- **Set & Get Configs:** Manage configurations with `spark.conf`.
- **Create DataFrame:** Use `createDataFrame()` for building DataFrames.
- **Spark SQL:** Use `spark.sql()` to execute SQL queries on temporary views.
- **Create Hive Table:** Use `saveAsTable()` to create and query Hive tables.
- **Catalogs:** Access metadata with `spark.catalog`.

**Commonly Used Methods:**

- `version()`: Returns the Spark version.
- `catalog()`: Access catalog metadata.
- `conf()`: Get the runtime configuration.
- `builder()`: Create a new `SparkSession`.
- `newSession()`: Create an additional `SparkSession`.
- `createDataFrame()`: Create a DataFrame from a collection or RDD.
- `createDataset()`: Create a Dataset from a collection, DataFrame, or RDD.
- `emptyDataFrame()`: Create an empty DataFrame.
- `emptyDataset()`: Create an empty Dataset.
- `sparkContext()`: Get the `SparkContext`.
- `sql(String sql)`: Execute an SQL query and return a DataFrame.
- `sqlContext()`: Return the `SQLContext`.
- `stop()`: Stop the current `SparkContext`.
- `table()`: Return a DataFrame of a table or view.
- `udf()`: Create a Spark UDF (User-Defined Function).
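
A minimal PySpark sketch of creating a SparkSession with a configuration, building a DataFrame, and running Spark SQL on a temporary view; the application name, configuration value, and data are placeholders:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("SparkSession-Demo")                 # placeholder application name
         .config("spark.sql.shuffle.partitions", "8")  # example configuration setting
         .getOrCreate())

print(spark.version)                                   # check the Spark version

df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
df.createOrReplaceTempView("people")                   # temporary view for Spark SQL

spark.sql("SELECT name FROM people WHERE id = 2").show()

# spark.newSession() would return an additional session sharing the same SparkContext.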
................................
8. Introduction to Shared Variables

**Shared variables** are used to efficiently manage and share data across multiple nodes in a distributed computing environment. Instead of creating separate copies of variables for each node, shared variables allow nodes to access and update shared data consistently.

### Types of Shared Variables

1. **Broadcast Variables**:
- **Purpose**: Efficiently share read-only variables across all nodes.
- **Usage**: Useful for distributing large, read-only data (like lookup tables) to all nodes in a cluster.
- **Creation**: Use `SparkContext.broadcast(value)` to create a broadcast variable.

2. **Accumulators**:
- **Purpose**: Accumulate values across multiple tasks.
- **Usage**: Useful for counters or sums where tasks can update the accumulator.
- **Creation**: Use `SparkContext.accumulator(initialValue)` to create an accumulator.
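
A minimal sketch of both kinds of shared variable; the lookup table and country codes below are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Shared-Variables").getOrCreate()
sc = spark.sparkContext

# Broadcast variable: a read-only lookup table shipped once to each executor.
country_lookup = sc.broadcast({"IN": "India", "US": "United States"})

# Accumulator: a counter that tasks can only add to; the driver reads the total.
unknown_codes = sc.accumulator(0)

def to_country(code):
    name = country_lookup.value.get(code)
    if name is None:
        unknown_codes.add(1)
    return name

codes = sc.parallelize(["IN", "US", "XX", "IN"])
print(codes.map(to_country).collect())    # the action triggers the computation
print("Unknown codes:", unknown_codes.value)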

For the full implementation, see the TechM resource.


.....................................
9. **Actions in Spark RDDs**

**Purpose:**
- Actions perform operations on RDDs that return non-RDD values, with the result stored in the driver program.

**Common Actions:**
- **`collect()`:** Returns all data from the RDD as a list.
- **`count()`:** Returns the total number of elements in the RDD.
- **`reduce()`:** Computes a summarized result by applying a function across the RDD.
- **`first()`:** Retrieves the first item from the RDD.
- **`take(n)`:** Retrieves the first `n` items from the RDD.
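
A short sketch applying these actions to a small hypothetical RDD of numbers (assuming an existing SparkContext `sc`):

nums = sc.parallelize([5, 3, 8, 1, 9])

print(nums.collect())                     # [5, 3, 8, 1, 9]
print(nums.count())                       # 5
print(nums.reduce(lambda a, b: a + b))    # 26
print(nums.first())                       # 5
print(nums.take(3))                       # [5, 3, 8] (first three elements)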

Implementation: see the TechM resource.

........................
10. Transformations in Spark RDDs

**Transformations Overview**:
- Transformations manipulate RDD data and return a new RDD.
- They are evaluated lazily, meaning they are not executed until an action is performed.

**Key Transformations**:
1. **distinct**: Retrieves the unique elements of an RDD.
2. **filter**: Selects elements based on a condition.
3. **sortBy**: Sorts the data in ascending or descending order based on a given key.
4. **map**: Applies an operation to each element in an RDD, returning a new RDD with the same length as the original.
5. **flatMap**: Similar to `map`, but flattens the result, which may change the length of the RDD.
6. **union**: Combines data from two RDDs into a new RDD.
7. **intersection**: Retrieves the data common to two RDDs into a new RDD.
8. **repartition**: Changes (typically increases) the number of partitions of an RDD, performing a full shuffle.
9. **coalesce**: Decreases the number of partitions of an RDD, avoiding a full shuffle.

**Grouping Operations**:
- **groupByKey()**: Groups the values for each key in the RDD into an iterable, which results in data shuffling over the network.
- **reduceByKey()**: Similar to `groupByKey()`, but it performs a map-side combine to optimize processing.
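
A compact sketch of several of these transformations on small hypothetical RDDs (again assuming an existing SparkContext `sc`):

a = sc.parallelize([1, 2, 2, 3, 4])
b = sc.parallelize([3, 4, 5])

print(a.distinct().collect())                      # unique elements
print(a.filter(lambda x: x % 2 == 0).collect())    # even numbers only
print(a.sortBy(lambda x: x, ascending=False).collect())
print(a.map(lambda x: x * 10).collect())           # same length as `a`
print(a.flatMap(lambda x: [x, x]).collect())       # flattened, longer than `a`
print(a.union(b).collect())
print(a.intersection(b).collect())

print(a.repartition(8).getNumPartitions())         # more partitions, full shuffle
print(a.coalesce(1).getNumPartitions())            # fewer partitions, no full shuffle

pairs = sc.parallelize([("x", 1), ("y", 2), ("x", 3)])
print(pairs.groupByKey().mapValues(list).collect())     # e.g. [('x', [1, 3]), ('y', [2])] (order may vary)
print(pairs.reduceByKey(lambda u, v: u + v).collect())  # e.g. [('x', 4), ('y', 2)] (order may vary)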

Implementation: see the TechM resource.


.................................................
