PySpark Essentials
Transformations
PySpark RDD (Resilient Distributed Dataset) transformations are lazy operations that return a new RDD; they are only executed when an action is called. Here is a list of some common PySpark RDD transformations:
1. **`map(func)`**: Applies a function to each element of the RDD and returns a new RDD.
2. **`filter(func)`**: Returns a new RDD containing only the elements that satisfy the given predicate.
3. **`flatMap(func)`**: Similar to `map`, but each input item can be mapped to zero or more output items.
4. **`union(otherRDD)`**: Returns a new RDD containing the elements of both the original RDD and the other RDD.
5. **`reduceByKey(func, numPartitions=None)`**: Merges the values for each key using the specified associative and commutative function.
6. **`join(otherRDD, numPartitions=None)`**: Performs an inner join between two RDDs based on their keys.
7. **`cogroup(otherRDD, numPartitions=None)`**: For each key, groups the values from both RDDs into a pair of iterables.
8. **`mapValues(func)`**: Applies a function to the values of each key-value pair without changing the keys.
9. **`flatMapValues(func)`**: Similar to `mapValues`, but each input value can be mapped to zero or more output values.
These transformations are fundamental building blocks for constructing more complex data processing
pipelines in PySpark.
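A minimal sketch of how a few of these transformations chain together, assuming an existing SparkContext `sc`; the sample data and variable names are illustrative:

```python
# Assumes an existing SparkContext `sc`; the sample data is illustrative.
lines = sc.parallelize(["spark makes big data simple", "big data with spark"])

pairs = (lines.flatMap(lambda line: line.split())   # one line -> many words
              .filter(lambda w: len(w) > 3)         # keep words longer than 3 characters
              .map(lambda w: (w, 1)))               # word -> (word, 1)

counts = pairs.reduceByKey(lambda a, b: a + b)      # sum the 1s per word
scaled = counts.mapValues(lambda n: n * 10)         # transform values, keys stay the same
```

Nothing is computed yet: these calls only build the RDD lineage, and execution is deferred until an action (see below) is invoked.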
Actions
PySpark RDD (Resilient Distributed Dataset) actions are operations that return values to the driver
program or write data to an external storage system. Here is a list of some common PySpark RDD
actions:
1. **`collect()`**: Returns all elements of the RDD as a list to the driver program. Use it cautiously: it brings all the data to the driver and can cause out-of-memory errors for large datasets.
2. **`reduce(func)`**: Aggregates the elements of the RDD using the specified associative and commutative binary operator.
3. **`fold(zeroValue, func)`**: Aggregates the elements of the RDD using the specified associative binary operator and a neutral "zero value."
4. **`aggregate(zeroValue, seqOp, combOp)`**: Aggregates the elements of the RDD using two different aggregation functions: `seqOp` within each partition and `combOp` to merge the partition results.
5. **`foreach(func)`**: Applies a function to each element of the RDD. The function runs on the executors, typically for side effects.
6. **`countByKey()`**: Counts the number of occurrences of each key in a key-value RDD.
7. **`collectAsMap()`**: Returns the key-value pairs of the RDD as a dictionary to the driver program.
8. **`saveAsTextFile(path)`**: Writes the elements of the RDD to a text file or a set of text files in the specified directory.
9. **`saveAsPickleFile(path)`**: Writes the elements of the RDD to a file in pickle format.
10. **`foreachPartition(func)`**: Applies a function to each partition of the RDD. This is useful for operations that require per-partition setup (for example, opening a database connection).
These actions are used to trigger the execution of the computation defined by transformations on RDDs.
They return values to the driver program or save the data to an external storage system.
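A quick sketch of a few of these actions, reusing the `pairs` and `counts` RDDs from the transformation sketch above (the output path is illustrative):

```python
# Actions trigger the actual computation of the RDD lineage.
print(counts.collect())                       # all (word, count) pairs on the driver
print(pairs.countByKey())                     # occurrences per word, returned as a dict
total = counts.map(lambda kv: kv[1]).reduce(lambda a, b: a + b)  # total word count
print(total)
counts.saveAsTextFile("output/word_counts")   # illustrative output directory
```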
Reading and writing files from external storage with RDDs
**Read Files:**
1. **Text Files (`textFile`):** Read text files from HDFS, local file system, or other supported file systems.
```python
rdd = sc.textFile("hdfs:///path/to/textfile/*.txt")
```
2. **Sequence Files (`sequenceFile`):** Read Hadoop SequenceFiles of key-value pairs.
```python
rdd = sc.sequenceFile("hdfs:///path/to/sequencefile")
```
3. **JSON Files (`json`):** Read JSON files through the SQL context and convert the result to an RDD.
```python
rdd = sqlContext.read.json("hdfs:///path/to/jsonfile/*.json").rdd
```
4. **Parquet Files (`parquet`):** Read Parquet files through the SQL context and convert the result to an RDD.
```python
rdd = sqlContext.read.parquet("hdfs:///path/to/parquetfile").rdd
```
5. **Hive Tables (`table`):** Read a Hive table through the Hive context and convert it to an RDD.
```python
rdd = hiveContext.table("database.table_name").rdd
```
6. **MongoDB (`mongoRDD`):** Read a MongoDB collection (requires the mongo-hadoop / pymongo-spark connector).
```python
rdd = sc.mongoRDD("mongodb://localhost:27017/db.collection")
```
**Write Files:**
1. **Text Files (`saveAsTextFile`):** Write the RDD as plain text files.
```python
rdd.saveAsTextFile("hdfs:///path/to/output")
```
2. **Sequence Files (`saveAsSequenceFile`):** Write a key-value RDD as a Hadoop SequenceFile.
```python
rdd.saveAsSequenceFile("hdfs:///path/to/output")
```
3. **JSON Files (`json`):** Convert the RDD to a DataFrame and write it as JSON.
```python
rdd.toDF().write.json("hdfs:///path/to/output/jsonfile")
```
4. **Parquet Files (`parquet`):** Convert the RDD to a DataFrame and write it as Parquet.
```python
rdd.toDF().write.parquet("hdfs:///path/to/output/parquetfile")
```
5. **Avro Files (`saveAsHadoopFile`):** Write the RDD using the Avro Hadoop output format.
```python
rdd.saveAsHadoopFile("hdfs:///path/to/output/avrofile",
                     "org.apache.avro.mapred.AvroOutputFormat")
```
6. **Hive Tables (`saveAsTable`):** Convert the RDD to a DataFrame and save it as a Hive table.
```python
rdd.toDF().write.saveAsTable("database.table_name")
```
7. **Cassandra (`saveToCassandra`):** Write the RDD to Cassandra (requires the pyspark-cassandra connector).
```python
rdd.saveToCassandra("keyspace", "table")
```
8. **JDBC:** Write RDD content to a relational database using JDBC, typically by converting the RDD to a DataFrame first.
```python
rdd.toDF().write.jdbc("jdbc:postgresql://localhost/db", "table_name",   # illustrative connection details
                      properties={"user": "user", "password": "password"})
```
9. **MongoDB (`saveToMongoDB`):** Write the RDD to MongoDB (requires the mongo-hadoop / pymongo-spark connector).
```python
rdd.saveToMongoDB("mongodb://localhost:27017/db.collection")
```
Note: The examples provided assume that SparkContext (`sc`), SQLContext (`sqlContext`), and HiveContext (`hiveContext`) objects are available. The actual code might vary based on your specific Spark version and configuration.
DataFrame
Transformations
PySpark DataFrame transformations return a new DataFrame and are evaluated lazily; nothing is computed until an action is called. Here is a list of some common PySpark DataFrame transformations:
1. **`filter(condition)`**: Returns a new DataFrame with rows that satisfy the given condition.
2. **`pivot(pivot_col, values=None)`**: Pivots a column of the DataFrame (called on the result of `groupBy`) and performs the specified aggregation.
3. **`join(other, on=None, how=None)`**: Joins the DataFrame with another DataFrame.
4. **`union(other)`**: Returns a new DataFrame containing rows from both DataFrames.
5. **`na.fill(value, subset=None)`**: Returns a new DataFrame with missing values filled.
6. **`na.drop(how='any', subset=None)`**: Returns a new DataFrame with rows containing null or NaN values dropped.
7. **`na.replace(to_replace, value, subset=None)`**: Returns a new DataFrame with the specified values replaced.
8. **`limit(n)`**: Returns a new DataFrame with only the first n rows.
9. **`describe(*cols)`**: Computes basic statistics for numeric and string columns.
10. **`randomSplit(weights, seed=None)`**: Splits the DataFrame into multiple DataFrames based on the provided weights.
11. **`cache()`**: Persists the DataFrame in memory so that subsequent actions reuse the computed result.
12. **`fillna(value, subset=None)`**: Returns a new DataFrame with missing values filled (an alias for `na.fill`).
13. **`replace(to_replace, value, subset=None)`**: Replaces values matching the specified conditions with new values.
14. **`cacheTable(tableName)`**: Caches the contents of a table or view in memory (via `spark.catalog.cacheTable`).
15. **`uncacheTable(tableName)`**: Removes a cached table or view from memory (via `spark.catalog.uncacheTable`).
16. **`rollup(*cols)`**: Creates a multi-dimensional rollup (sub-totals) over the specified columns.
17. **`freqItems(cols, support=None)`**: Finds frequent items for columns with categorical data.
18. **`sampleBy(col, fractions, seed=None)`**: Returns a stratified sample of the DataFrame based on values in the specified column.
19. **`transform(func)`**: Applies a function to the DataFrame and returns a new DataFrame.
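A minimal sketch chaining a few of these DataFrame transformations, assuming an existing SparkSession `spark`; the column names and values are illustrative:

```python
# Assumes an existing SparkSession `spark`; columns and values are illustrative.
orders = spark.createDataFrame(
    [("alice", "books", 12.0), ("bob", "games", None), ("alice", "games", 30.0)],
    ["customer", "category", "amount"],
)

cleaned = (orders.na.fill({"amount": 0.0})   # fill missing amounts with 0.0
                 .filter("amount > 0")       # keep rows with a positive amount
                 .limit(100))                # cap the number of rows

# pivot is used after groupBy: one column per category, containing the summed amounts
per_customer = cleaned.groupBy("customer").pivot("category").sum("amount")
```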
Actions
Here is a list of PySpark DataFrame actions, which are operations that return values to the driver
program or write data to an external storage system:
1. **`show(n=20, truncate=True, vertical=False)`**: Prints the first n rows of the DataFrame to the console.
2. **`write`**: Writes the content of the DataFrame to external storage systems (e.g., Parquet, CSV).
   - `write.format("parquet").mode("overwrite").save("/path/to/parquet")`
   - `write.format("csv").mode("overwrite").save("/path/to/csv")`
3. **`saveAsTable(tableName, format=None, mode=None, partitionBy=None)`**: Saves the DataFrame as a table in the Hive metastore (via `df.write.saveAsTable`).
4. **`isLocal()`**: Returns True if the DataFrame can be computed locally on the driver.
5. **`rdd`**: Returns the content of the DataFrame as an RDD of Row objects.
6. **`toDF(*cols)`**: Returns a new DataFrame with the specified column names.
7. **`createGlobalTempView(name)`**: Creates a global temporary view from the DataFrame (use `createOrReplaceGlobalTempView` to overwrite an existing one).
8. **`toLocalIterator()`**: Returns an iterator over all rows in the DataFrame.
9. **`foreach(func)`**: Applies a function to each row of the DataFrame.
These actions execute the computation plan defined by transformations on the DataFrame and return
results to the driver program or perform other output-related tasks.
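A quick sketch of a few of these actions, reusing the `per_customer` DataFrame from the transformation sketch above (the output path is illustrative):

```python
# Actions trigger execution of the DataFrame's computation plan.
per_customer.show(5, truncate=False)                  # print the first rows to the console
for row in per_customer.toLocalIterator():            # iterate over all rows on the driver
    print(row["customer"])
per_customer.write.mode("overwrite").parquet("output/per_customer")  # illustrative path
```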
Below is a list of the different file formats and storage systems that you can read from and write to using PySpark DataFrames:
**Read Files:**
1. **Text Files (`text`):** Read text files from HDFS or local file system.
```python
df = spark.read.text("hdfs:///path/to/textfile/*.txt")
```
2. **CSV Files (`csv`):** Read CSV files, optionally with a header row and schema inference.
```python
df = spark.read.csv("hdfs:///path/to/csvfile/*.csv", header=True, inferSchema=True)
```
3. **JSON Files (`json`):** Read JSON files.
```python
df = spark.read.json("hdfs:///path/to/jsonfile/*.json")
```
4. **Parquet Files (`parquet`):** Read Parquet files.
```python
df = spark.read.parquet("hdfs:///path/to/parquetfile")
```
5. **Avro Files (`avro`):** Read Avro files.
```python
df = spark.read.format("avro").load("hdfs:///path/to/avrofile")
```
6. **ORC Files (`orc`):** Read ORC files.
```python
df = spark.read.orc("hdfs:///path/to/orcfile")
```
7. **Delta Lake (`delta`):** Read a Delta table (requires the Delta Lake package).
```python
df = spark.read.format("delta").table("table_name")
```
8. **Hive Tables (SQL):** Read a Hive table with a SQL query.
```python
df = spark.sql("SELECT * FROM database.table_name")
```
9. **Cassandra:** Read a Cassandra table (requires the Spark Cassandra connector).
```python
df = (spark.read.format("org.apache.spark.sql.cassandra")
      .options(table="table", keyspace="keyspace")
      .load())
```
10. **JDBC (`jdbc`):** Read data from relational databases using JDBC.
```python
df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://localhost/db")    # illustrative connection details
      .option("dbtable", "table_name")
      .option("user", "user")
      .option("password", "password")
      .load())
```
11. **MongoDB (`mongo`):** Read a MongoDB collection (requires the MongoDB Spark connector).
```python
df = (spark.read.format("com.mongodb.spark.sql.DefaultSource")
      .option("uri", "mongodb://localhost:27017/db.collection")
      .load())
```
12. **GraphQL (third-party connector):** Read data exposed through a GraphQL endpoint; the data source class name depends on the connector package you use.
```python
df = (spark.read.format("io.github.andykay.ghql.spark.GhqlDataSource")
      .option("url", "https://api.example.com/graphql")
      .load())
```
13. **Kafka (`kafka`):** Read data from Apache Kafka topics.
```python
df = (spark.read.format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "topic_name")
      .load())
```
14. **Feather Files (`arrow`):** Read Feather files via a third-party Arrow data source.
```python
df = spark.read.format("arrow").load("hdfs:///path/to/featherfile")
```
15. **Image Files (`image`):** Read a folder of images into a DataFrame.
```python
df = spark.read.format("image").option("path", "hdfs:///path/to/imagefolder").load()
```
16. **GraphFrames:** Build a graph from vertex and edge DataFrames (requires the GraphFrames package).
```python
from graphframes import GraphFrame   # vertices and edges are existing DataFrames
g = GraphFrame(vertices, edges)
```
17. **Excel Files (`com.crealytics.spark.excel`):** Read an Excel file (requires the spark-excel package).
```python
df = (spark.read.format("com.crealytics.spark.excel")
      .option("location", "hdfs:///path/to/excelfile.xlsx")
      .load())
```
18. **Arrow Files (`arrow`):** Read Arrow files via a third-party Arrow data source.
```python
df = spark.read.format("arrow").load("hdfs:///path/to/arrowfile")
```
19. **Elasticsearch (`org.elasticsearch.spark.sql`):** Read an Elasticsearch index (requires the Elasticsearch-Hadoop connector).
```python
df = (spark.read.format("org.elasticsearch.spark.sql")
      .option("es.nodes", "localhost")
      .option("es.resource", "index/type")
      .load())
```
20. **Google Cloud Bigtable (`com.google.cloud.spark.bigtable`):** Read a Bigtable table (requires the Bigtable Spark connector).
```python
df = spark.read.format("com.google.cloud.spark.bigtable").option("tableId", "my-table").load()
```
**Write Files:**
1. **Text Files (`text`):** Write the DataFrame (which must have a single string column) as text files.
```python
df.write.text("hdfs:///path/to/output/textfile")
```
2. **CSV Files (`csv`):** Write the DataFrame as CSV files.
```python
df.write.csv("hdfs:///path/to/output/csvfile")
```
3. **JSON Files (`json`):** Write the DataFrame as JSON files.
```python
df.write.json("hdfs:///path/to/output/jsonfile")
```
4. **Parquet Files (`parquet`):** Write the DataFrame as Parquet files.
```python
df.write.parquet("hdfs:///path/to/output/parquetfile")
```
5. **Avro Files (`avro`):** Write the DataFrame as Avro files.
```python
df.write.format("avro").save("hdfs:///path/to/output/avrofile")
```
6. **ORC Files (`orc`):** Write the DataFrame as ORC files.
```python
df.write.orc("hdfs:///path/to/output/orcfile")
```
7. **Delta Lake (`delta`):** Write the DataFrame as a Delta table (requires the Delta Lake package).
```python
df.write.format("delta").save("/path/to/delta/table")
```
8. **Hive Tables (`saveAsTable`):** Save the DataFrame as a table in the Hive metastore.
```python
df.write.saveAsTable("database.table_name")
```
9. **Cassandra:** Write DataFrame content to Cassandra (requires the Spark Cassandra connector).
```python
(df.write.format("org.apache.spark.sql.cassandra")
   .options(table="table", keyspace="keyspace")
   .save())
```
10. **JDBC (`jdbc`):** Write DataFrame content to a relational database using JDBC.
```python
(df.write.format("jdbc")
   .option("url", "jdbc:postgresql://localhost/db")    # illustrative connection details
   .option("dbtable", "table_name")
   .option("user", "user")
   .option("password", "password")
   .mode("overwrite")
   .save())
```
11. **MongoDB (`mongo`):** Write DataFrame content to MongoDB (requires the MongoDB Spark connector).
```python
(df.write.format("com.mongodb.spark.sql.DefaultSource")
   .option("uri", "mongodb://localhost:27017/db.collection")
   .mode("overwrite")
   .save())
```
12. **GraphQL (third-party connector):** Write data through a GraphQL endpoint; the data source class name depends on the connector package you use.
```python
(df.write.format("io.github.andykay.ghql.spark.GhqlDataSource")
   .option("url", "https://api.example.com/graphql")
   .save())
```
13. **Kafka (`kafka`):** Write DataFrame content to an Apache Kafka topic; the DataFrame needs a `value` column (and optionally a `key` column).
```python
(df.write.format("kafka")
   .option("kafka.bootstrap.servers", "localhost:9092")
   .option("topic", "topic_name")    # illustrative topic name
   .save())
```
14. **Feather Files (`arrow`):** Write Feather files via a third-party Arrow data source.
```python
df.write.format("arrow").save("hdfs:///path/to/output/featherfile")
```
15. **Image Files (`image`):** Write DataFrame content to image files.
```python
df.write.format("image").option("path", "hdfs:///path/to/output/imagefolder").save()
```
16. **GraphFrames:** Persist the vertex and edge DataFrames of a graph.
```python
g.vertices.write.format("graphframes").mode("overwrite").save("hdfs:///path/to/output/graph/vertices")
g.edges.write.format("graphframes").mode("overwrite").save("hdfs:///path/to/output/graph/edges")
```
17. **Excel Files (`com.crealytics.spark.excel`):** Write DataFrame content to an Excel file (requires the spark-excel package).
```python
(df.write.format("com.crealytics.spark.excel")
   .option("location", "hdfs:///path/to/output/excelfile.xlsx")
   .save())
```
18. **Arrow Files (`arrow`):** Write Arrow files via a third-party Arrow data source.
```python
df.write.format("arrow").save("hdfs:///path/to/output/arrowfile")
```
19. **Elasticsearch (`org.elasticsearch.spark.sql`):** Write DataFrame content to an Elasticsearch index (requires the Elasticsearch-Hadoop connector).
```python
(df.write.format("org.elasticsearch.spark.sql")
   .option("es.nodes", "localhost")
   .option("es.resource", "index/type")
   .mode("overwrite")
   .save())
```
20. **Google Cloud Bigtable (`com.google.cloud.spark.bigtable`):** Write DataFrame content to Bigtable (requires the Bigtable Spark connector).
```python
df.write.format("com.google.cloud.spark.bigtable").option("tableId", "my-table").save()
```
Note: The examples provided assume that `spark` is a `SparkSession` object. The actual code might vary
based on your specific Spark version and configuration.