PySpark Essentials

This document summarizes common RDD and DataFrame transformations and actions in PySpark, and shows how to read and write a range of file formats and external storage systems (text, sequence files, JSON, Parquet, Avro, CSV, Hive, Cassandra, JDBC, MongoDB, and others) using both RDDs and DataFrames.


RDD

Transformations:

Here is a list of common PySpark RDD transformations (a short combined example follows the list):

1. **`map(func)`**: Applies a function to each element of the RDD and returns a new RDD.

2. **`filter(func)`**: Returns a new RDD containing only the elements that satisfy the given predicate.

3. **`flatMap(func)`**: Similar to `map`, but each input item can be mapped to zero or more output
items.

4. **`union(otherRDD)`**: Returns a new RDD containing the elements of the original RDD and the
other RDD.

5. **`distinct(numPartitions=None)`**: Returns a new RDD with distinct elements.

6. **`groupByKey(numPartitions=None)`**: Groups the elements of the RDD by key.

7. **`reduceByKey(func, numPartitions=None)`**: Reduces the elements of the RDD by key using the
specified function.

8. **`sortByKey(ascending=True, numPartitions=None)`**: Sorts the elements of the RDD by key.

9. **`join(otherRDD, numPartitions=None)`**: Performs an inner join between two RDDs based on their
keys.

10. **`cogroup(otherRDD, numPartitions=None)`**: For each key in either RDD, groups together the values from both RDDs (yielding, per key, the values from this RDD and the values from the other RDD).

11. **`mapValues(func)`**: Applies a function to the values of each key-value pair without changing the
keys.

12. **`flatMapValues(func)`**: Similar to `mapValues`, but each input value can be mapped to zero or
more output values.

13. **`keys()`**: Returns an RDD of the keys of key-value pairs.

14. **`values()`**: Returns an RDD of the values of key-value pairs.

15. **`sample(withReplacement, fraction, seed=None)`**: Returns a random sample of the RDD.

These transformations are fundamental building blocks for constructing more complex data processing
pipelines in PySpark.
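
As a quick illustration, here is a minimal sketch that chains several of these transformations (it assumes an existing SparkContext named `sc`; the data and variable names are hypothetical):

```python

# Chaining common RDD transformations; nothing executes until an action is called
words = sc.parallelize(["spark", "hadoop", "spark", "hive"])

pairs = words.map(lambda w: (w, 1))                   # map each word to (word, 1)
non_hive = pairs.filter(lambda kv: kv[0] != "hive")   # keep only elements matching a predicate
counts = non_hive.reduceByKey(lambda a, b: a + b)     # sum the counts per key
sorted_counts = counts.sortByKey()                    # sort the pairs by key

```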

Actions
PySpark RDD (Resilient Distributed Dataset) actions are operations that return values to the driver
program or write data to an external storage system. Here is a list of some common PySpark RDD
actions:

1. **`collect()`**: Returns all elements of the RDD as a list to the driver program. Use it cautiously, as it brings all the data to the driver and may cause out-of-memory errors for large datasets.

2. **`count()`**: Returns the number of elements in the RDD.

3. **`first()`**: Returns the first element of the RDD.

4. **`take(n)`**: Returns the first n elements of the RDD.

5. **`takeSample(withReplacement, num, seed=None)`**: Returns a random sample of num elements from the RDD, with or without replacement.

6. **`reduce(func)`**: Aggregates the elements of the RDD using a specified associative and
commutative binary operator.

7. **`fold(zeroValue, func)`**: Aggregates the elements of the RDD using a specified associative binary
operator and a neutral "zero value."

8. **`aggregate(zeroValue, seqOp, combOp)`**: Aggregates the elements of the RDD using a sequence function (applied within each partition) and a combine function (applied across partitions), starting from the given zero value.

9. **`foreach(func)`**: Applies a function to each element of the RDD. This is a way to execute code on
each node of the cluster.

10. **`countByKey()`**: Counts the number of occurrences of each key in a key-value RDD.

11. **`collectAsMap()`**: Returns the key-value pairs of the RDD as a dictionary to the driver program.

12. **`saveAsTextFile(path)`**: Writes the elements of the RDD to a text file or a set of text files in a specified directory.

13. **`saveAsSequenceFile(path)`**: Writes the elements of the RDD to a Hadoop SequenceFile.

14. **`saveAsPickleFile(path)`**: Writes the elements of the RDD to a file in pickle format.

15. **`foreachPartition(func)`**: Applies a function to each partition of the RDD. This can be useful for
performing operations that require a per-partition setup.

These actions are used to trigger the execution of the computation defined by transformations on RDDs.
They return values to the driver program or save the data to an external storage system.
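
As a quick illustration, here is a minimal sketch of a few actions (it assumes an existing SparkContext named `sc`):

```python

# Actions trigger execution and return results to the driver
nums = sc.parallelize([1, 2, 3, 4, 5])

print(nums.count())                     # 5
print(nums.first())                     # 1
print(nums.take(3))                     # [1, 2, 3]
print(nums.reduce(lambda a, b: a + b))  # 15
print(nums.collect())                   # [1, 2, 3, 4, 5]; use with care on large datasets

```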

Reading and writing different file formats from external sources using RDDs

RDD (Resilient Distributed Dataset) in PySpark is a low-level abstraction representing a distributed collection of objects. RDDs can be used to read data from various external sources and write data to external storage systems. Below are examples of 20 different file formats and storage systems that you can read from and write to using RDDs in PySpark:

**Read Files:**

1. **Text Files (`textFile`):** Read text files from HDFS, local file system, or other supported file systems.

```python

rdd = sc.textFile("hdfs:///path/to/textfile/*.txt")

```

2. **Sequence Files (`sequenceFile`):** Read Hadoop SequenceFiles.


```python

rdd = sc.sequenceFile("hdfs:///path/to/sequencefile")

```

3. **JSON Files (`jsonFile`):** Read JSON files.

```python

rdd = sqlContext.read.json("hdfs:///path/to/jsonfile/*.json").rdd

```

4. **Parquet Files (`parquetFile`):** Read Parquet files.

```python

rdd = sqlContext.read.parquet("hdfs:///path/to/parquetfile").rdd

```

5. **Avro Files (via `hadoopFile`):** Read Avro files. Requires the Avro MapReduce libraries on the classpath.

```python

rdd = sc.hadoopFile("hdfs:///path/to/avrofile", "org.apache.avro.mapred.AvroInputFormat",
                    "org.apache.avro.mapred.AvroWrapper", "org.apache.hadoop.io.NullWritable")

```

6. **CSV Files (`csvFile`):** Read CSV files.

```python

rdd = sc.textFile("hdfs:///path/to/csvfile/*.csv").map(lambda line: line.split(','))

```

7. **Hive Tables (`hiveContext.table`):** Read data from Hive tables.


```python

rdd = hiveContext.table("database.table_name").rdd

```

8. **Cassandra Tables (`cassandraTable`):** Read data from Cassandra tables.

```python

# requires the pyspark-cassandra connector package
rdd = sc.cassandraTable("keyspace", "table")

```

9. **JDBC (via the DataFrame reader):** Read data from relational databases using JDBC and convert the result to an RDD (PySpark has no RDD-level JDBC reader).

```python

rdd = sqlContext.read.jdbc("jdbc:postgresql:dbserver", "table_name",
                           properties={"user": "username", "password": "password"}).rdd

```

10. **MongoDB (`mongoRDD`):** Read data from MongoDB.

```python

# requires the mongo-hadoop connector (pymongo_spark must be imported and activated)
rdd = sc.mongoRDD("mongodb://localhost:27017/db.collection")

```

**Write Files:**

11. **Text Files (`saveAsTextFile`):** Write RDD content to text files.

```python
rdd.saveAsTextFile("hdfs:///path/to/output")

```

12. **Sequence Files (`saveAsSequenceFile`):** Write RDD content to Hadoop SequenceFiles.

```python

rdd.saveAsSequenceFile("hdfs:///path/to/output")

```

13. **JSON Files (via `toDF().write.json`):** Write RDD content to JSON files by converting to a DataFrame.

```python

rdd.toDF().write.json("hdfs:///path/to/output/jsonfile")

```

14. **Parquet Files (via `toDF().write.parquet`):** Write RDD content to Parquet files by converting to a DataFrame.

```python

rdd.toDF().write.parquet("hdfs:///path/to/output/parquetfile")

```

15. **Avro Files (via `saveAsHadoopFile`):** Write RDD content to Avro files. Requires the Avro MapReduce libraries on the classpath; key/value converters may also be needed depending on the RDD's contents.

```python

rdd.saveAsHadoopFile("hdfs:///path/to/output/avrofile",
"org.apache.avro.mapred.AvroOutputFormat")

```

16. **CSV Files (via `map` + `saveAsTextFile`):** Write RDD content to CSV files.


```python

rdd.map(lambda x: ','.join(map(str, x))).saveAsTextFile("hdfs:///path/to/output/csvfile")

```

17. **Hive Tables (`saveAsTable`):** Save RDD data to a Hive table.

```python

rdd.toDF().write.saveAsTable("database.table_name")

```

18. **Cassandra Tables (`saveToCassandra`):** Save RDD data to a Cassandra table.

```python

# requires the pyspark-cassandra connector package
rdd.saveToCassandra("keyspace", "table")

```

19. **JDBC (via `toDF().write.jdbc`):** Write RDD content to a relational database using JDBC by converting to a DataFrame.

```python

# column names depend on the structure of the RDD
rdd.toDF(["col1"]).write.jdbc("jdbc:postgresql://localhost:5432/database", "table_name",
                              properties={"user": "username", "password": "password"})

```

20. **MongoDB (`saveToMongoDB`):** Write RDD content to MongoDB.

```python

# requires the mongo-hadoop connector (pymongo_spark must be imported and activated)
rdd.saveToMongoDB("mongodb://localhost:27017/db.collection")

```
Note: The examples provided assume that a SparkContext (`sc`), a SQLContext (`sqlContext`), and, for the Hive example, a HiveContext (`hiveContext`) are available. The actual code might vary based on your specific Spark version and configuration.
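
For reference, here is one way these entry points can be created (a minimal sketch using the legacy contexts; in Spark 2.x and later they are superseded by `SparkSession` but remain available):

```python

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, HiveContext

# Entry points assumed by the RDD examples above
conf = SparkConf().setAppName("rdd-io-examples")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
hiveContext = HiveContext(sc)   # needed only for the Hive table example

```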

DataFrame

Transformations

Here is a list of 50 common PySpark DataFrame transformations (a short combined example follows the list):

1. **`select(*cols)`**: Returns a new DataFrame with selected columns.

2. **`filter(condition)`**: Returns a new DataFrame with rows that satisfy the given condition.

3. **`withColumn(colName, col)`**: Returns a new DataFrame with a new column added or an existing one replaced.

4. **`withColumnRenamed(existing, new)`**: Returns a new DataFrame with a column renamed.

5. **`drop(*cols)`**: Returns a new DataFrame with specified columns dropped.

6. **`distinct()`**: Returns a new DataFrame with distinct rows.

7. **`orderBy(*cols, ascending=True)`**: Returns a new DataFrame sorted by the specified columns.

8. **`groupBy(*cols)`**: Groups the DataFrame by the specified columns.

9. **`agg(*exprs)`**: Aggregates the grouped data using specified aggregation expressions.

10. **`pivot(pivot_col, values=None)`**: Pivots a column of the DataFrame and performs the specified
aggregation.

11. **`join(other, on=None, how=None)`**: Joins the DataFrame with another DataFrame.
12. **`union(other)`**: Returns a new DataFrame containing rows from both DataFrames.

13. **`na.fill(value, subset=None)`**: Returns a new DataFrame with missing values filled.

14. **`na.drop(how='any', subset=None)`**: Returns a new DataFrame with rows containing null or NaN
values dropped.

15. **`na.replace(to_replace, value, subset=None)`**: Returns a new DataFrame with specified values
replaced.

16. **`withWatermark(eventTime, delayThreshold)`**: Specifies the watermark for a streaming DataFrame.

17. **`selectExpr(expr)`**: Selects columns using SQL expressions.

18. **`limit(n)`**: Returns a new DataFrame with only the first n rows.

19. **`repartition(numPartitions, *cols)`**: Returns a new DataFrame with a specified number of partitions.

20. **`coalesce(numPartitions)`**: Returns a new DataFrame with a reduced number of partitions.

21. **`rollup(*cols)`**: Creates a multi-dimensional rollup for the DataFrame.

22. **`cube(*cols)`**: Creates a multi-dimensional cube for the DataFrame.

23. **`describe(*cols)`**: Computes basic statistics for numeric and string columns.

24. **`summary(*statistics)`**: Generates descriptive statistics for the DataFrame.


25. **`sample(withReplacement, fraction, seed=None)`**: Returns a random sample of the DataFrame (for stratified sampling, use `sampleBy`).

26. **`randomSplit(weights, seed=None)`**: Splits the DataFrame into multiple DataFrames based on
the provided weights.

27. **`crossJoin(other)`**: Returns a Cartesian product with another DataFrame.

28. **`hint(name, *parameters)`**: Specifies a hint to the query optimizer.

29. **`explain(extended=False)`**: Displays the physical plan to compute a DataFrame.

30. **`cache()`**: Persists the DataFrame in memory for faster access.

31. **`unpersist()`**: Removes the DataFrame from memory.

32. **`persist(storageLevel)`**: Persists the DataFrame with the specified storage level.

33. **`fillna(value, subset=None)`**: Returns a new DataFrame with missing values filled.

34. **`dropDuplicates(subset=None)`**: Returns a new DataFrame with duplicate rows removed.

35. **`transform(function)`**: Applies a function to the DataFrame.

36. **`where(condition)`**: Filters rows using the given condition (an alias for `filter`).

37. **`replace`**: Replaces values matching specified conditions with new values.

38. **`alias(alias)`**: Returns a new DataFrame with an alias set.


39. **`printSchema`**: Prints the schema of the DataFrame.

40. **`createOrReplaceTempView(name)`**: Creates or replaces a temporary view using the DataFrame.

41. **`write`**: Writes the content of the DataFrame to external storage systems (e.g., Parquet, CSV).

42. **`spark.catalog.cacheTable(tableName)`**: Caches the contents of the named table (for example, a temp view created from a DataFrame) in memory.

43. **`spark.catalog.uncacheTable(tableName)`**: Removes the named table's contents from the in-memory cache.

44. **`rollup`**: Creates a rollup (also called a sub-total) for the DataFrame.

45. **`corr(col1, col2)`**: Computes the correlation of two columns of the DataFrame as a double.

46. **`cov(col1, col2)`**: Computes the sample covariance of two columns of the DataFrame as a double.

47. **`approxQuantile`**: Computes approximate quantiles of numerical columns.

48. **`freqItems`**: Finds frequent items for columns with categorical data.

49. **`sampleBy`**: Returns a stratified sample of a DataFrame based on values in a specified column.

50. **`transform`**: Applies a function to the DataFrame and returns a new DataFrame.
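
As a quick illustration, here is a minimal sketch chaining several of these transformations (it assumes a SparkSession named `spark`; the data and column names are hypothetical):

```python

from pyspark.sql import functions as F

df = spark.createDataFrame(
    [("alice", "sales", 3000), ("bob", "sales", 4000), ("carol", "hr", 3500)],
    ["name", "dept", "salary"],
)

result = (
    df.filter(F.col("salary") > 3000)                # keep rows matching a condition
      .withColumn("bonus", F.col("salary") * 0.1)    # add a derived column
      .groupBy("dept")                               # group by department
      .agg(F.avg("salary").alias("avg_salary"))      # aggregate per group
      .orderBy("dept")                               # sort the result
)

```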

Actions

Here is a list of PySpark DataFrame actions, which are operations that return values to the driver
program or write data to an external storage system:
1. **`show(n=20, truncate=True, vertical=False)`**: Prints the first n rows of the DataFrame to the
console.

2. **`count()`**: Returns the number of rows in the DataFrame.

3. **`collect()`**: Returns all rows in the DataFrame as a list.

4. **`first()`**: Returns the first row of the DataFrame.

5. **`head(n=None)`**: Returns the first n rows of the DataFrame as a list (or the first row if n is omitted).

6. **`take(n)`**: Returns the first n rows of the DataFrame as a list.

7. **`describe(*cols)`**: Computes basic statistics for numeric and string columns.

8. **`summary(*statistics)`**: Generates descriptive statistics for the DataFrame.

9. **`printSchema()`**: Prints the schema of the DataFrame.

10. **`explain(extended=False)`**: Displays the physical plan to compute a DataFrame.

11. **`toPandas()`**: Converts the DataFrame to a Pandas DataFrame.

12. **`write`**: Writes the content of the DataFrame to external storage systems (e.g., Parquet, CSV).

- `write.format("parquet").mode("overwrite").save("/path/to/parquet")`

- `write.format("csv").mode("overwrite").save("/path/to/csv")`
13. **`saveAsTable(tableName, format=None, mode=None, partitionBy=None)`**: Saves the DataFrame
as a table in the Hive metastore.

14. **`cache()`**: Persists the DataFrame in memory for faster access.

15. **`unpersist()`**: Removes the DataFrame from memory.

16. **`createOrReplaceTempView(name)`**: Creates or replaces a temporary view using the DataFrame.

17. **`createGlobalTempView(name)`**: Creates a global temporary view using the DataFrame (use `createOrReplaceGlobalTempView` to replace an existing one).

18. **`explain()`**: Displays the physical plan to compute the DataFrame.

19. **`isLocal()`**: Returns True if the DataFrame is executed locally on the driver.

20. **`rdd`**: Returns the content of the DataFrame as an RDD of Row objects (this is a property, not a method).

21. **`toDF(*cols)`**: Returns a new DataFrame with the specified column names.

22. **`join`**: Performs a join with another DataFrame.

23. **`groupBy`**: Groups the DataFrame using the specified columns.

24. **`orderBy`**: Sorts the DataFrame based on the specified columns.

25. **`agg`**: Aggregates the DataFrame using specified aggregation expressions.

26. **`rollup`**: Creates a multi-dimensional rollup for the DataFrame.


27. **`cube`**: Creates a multi-dimensional cube for the DataFrame.

28. **`selectExpr(expr)`**: Selects columns using SQL expressions.

29. **`repartition`**: Returns a new DataFrame with a specified number of partitions.

30. **`coalesce`**: Returns a new DataFrame with a reduced number of partitions.

31. **`foreach`**: Applies a function to each row of the DataFrame.

32. **`foreachPartition`**: Applies a function to each partition of the DataFrame.

33. **`createOrReplaceTempView`**: Creates or replaces a temporary view using the DataFrame.

34. **`createGlobalTempView`**: Creates or replaces a global temporary view using the DataFrame.

35. **`toJSON`**: Converts the DataFrame into an RDD of JSON strings, one per row.

36. **`toLocalIterator`**: Returns an iterator that contains all rows in the DataFrame.

37. **`explain`**: Displays the physical plan to compute the DataFrame.

38. **`head`**: Returns the first n rows of the DataFrame as a list.

39. **`take`**: Returns the first n rows of the DataFrame as a list.

40. **`limit`**: Returns a new DataFrame with only the first n rows.
41. **`foreach`**: Applies a function to each row of the DataFrame.

42. **`foreachPartition`**: Applies a function to each partition of the DataFrame.

These actions execute the computation plan defined by transformations on the DataFrame and return results to the driver program or perform other output-related tasks. Note that several of the entries above (for example `join`, `groupBy`, `orderBy`, `selectExpr`, `repartition`, `limit`) are technically lazy transformations and are listed here only for completeness. A short combined example follows.
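
As a quick illustration, here is a minimal sketch of a few actions (it assumes a SparkSession named `spark`; the output path is hypothetical):

```python

df = spark.range(10).withColumnRenamed("id", "value")

df.printSchema()                 # print the schema
df.show(5, truncate=False)       # display the first 5 rows
print(df.count())                # 10
rows = df.take(3)                # first three Row objects returned to the driver
pdf = df.toPandas()              # convert to a Pandas DataFrame (small data only)
df.write.mode("overwrite").parquet("/tmp/actions_example")  # write to external storage

```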

Reading and writing different file formats using DataFrames

Here is a list of 20 different file formats and storage systems that you can read from and write to using PySpark DataFrames:

**Read Files:**

1. **Text Files (`text`):** Read text files from HDFS or local file system.

```python

df = spark.read.text("hdfs:///path/to/textfile/*.txt")

```

2. **CSV Files (`csv`):** Read CSV files.

```python

df = spark.read.csv("hdfs:///path/to/csvfile/*.csv", header=True, inferSchema=True)

```

3. **JSON Files (`json`):** Read JSON files.

```python

df = spark.read.json("hdfs:///path/to/jsonfile/*.json")
```

4. **Parquet Files (`parquet`):** Read Parquet files.

```python

df = spark.read.parquet("hdfs:///path/to/parquetfile")

```

5. **Avro Files (`avro`):** Read Avro files.

```python

df = spark.read.format("avro").load("hdfs:///path/to/avrofile")

```

6. **ORC Files (`orc`):** Read ORC files.

```python

df = spark.read.orc("hdfs:///path/to/orcfile")

```

7. **Delta Lake (`delta`):** Read Delta Lake tables.

```python

df = spark.read.format("delta").table("table_name")

```

8. **Hive Tables (`hive`):** Read data from Hive tables.

```python
df = spark.sql("SELECT * FROM database.table_name")

```

9. **Cassandra Tables (`cassandra`):** Read data from Cassandra tables.

```python

df = spark.read.format("org.apache.spark.sql.cassandra").options(table="table",
keyspace="keyspace").load()

```

10. **JDBC (`jdbc`):** Read data from relational databases using JDBC.

```python

df = spark.read.jdbc("jdbc:postgresql://localhost:5432/database", "table_name",
                     properties={"user": "username", "password": "password"})

```

11. **MongoDB (`mongo`):** Read data from MongoDB.

```python

df = spark.read.format("com.mongodb.spark.sql.DefaultSource").option("uri",
"mongodb://localhost:27017/db.collection").load()

```

12. **GraphQL (`graphql`):** Read data from GraphQL APIs. Spark has no built-in GraphQL connector; this requires a third-party data source, and the class name below is illustrative.

```python

# hypothetical third-party GraphQL data source
df = spark.read.format("io.github.andykay.ghql.spark.GhqlDataSource").option("url",
    "https://api.example.com/graphql").load()

```
13. **Kafka (`kafka`):** Read data from Apache Kafka topics.

```python

df = spark.read.format("kafka").option("kafka.bootstrap.servers",
"localhost:9092").option("subscribe", "topic_name").load()

```

14. **Feather Files (`arrow`):** Read Feather files. This is not built into Spark and requires a third-party Arrow/Feather data source.

```python

# hypothetical third-party Arrow/Feather data source
df = spark.read.format("arrow").load("hdfs:///path/to/featherfile")

```

15. **Image Files (`image`):** Read image files.

```python

df = spark.read.format("image").option("path", "hdfs:///path/to/imagefolder").load()

```

16. **GraphFrames (`graphframes`):** Build a graph from vertex and edge DataFrames using the GraphFrames package.

```python

from graphframes import GraphFrame

# vertices must have an "id" column; edges must have "src" and "dst" columns
g = GraphFrame(vertices, edges)

```

17. **Excel Files (`excel`):** Read Excel files. Requires the spark-excel package; option names vary by version.


```python

# spark-excel data source; newer versions pass the path to .load(path) instead of option("location", ...)
df = spark.read.format("com.crealytics.spark.excel").option("location",
    "hdfs:///path/to/excelfile.xlsx").load()

```

18. **Arrow Files (`arrow`):** Read Arrow files. This is not built into Spark and requires a third-party Arrow data source.

```python

# hypothetical third-party Arrow data source
df = spark.read.format("arrow").load("hdfs:///path/to/arrowfile")

```

19. **Elasticsearch (`org.elasticsearch.spark.sql`):** Read data from Elasticsearch.

```python

df = spark.read.format("org.elasticsearch.spark.sql").option("es.nodes",
"localhost").option("es.resource", "index/type").load()

```

20. **Bigtable (`bigtable`):** Read data from Google Cloud Bigtable. Requires the Bigtable Spark connector; the format and option names depend on the connector version.

```python

# Bigtable Spark connector; the option names shown here are illustrative
df = spark.read.format("com.google.cloud.spark.bigtable").option("tableId", "my-table").load()

```

**Write Files:**

1. **Text Files (`text`):** Write DataFrame content to text files.

```python
df.write.text("hdfs:///path/to/output/textfile")

```

2. **CSV Files (`csv`):** Write DataFrame content to CSV files.

```python

df.write.csv("hdfs:///path/to/output/csvfile")

```

3. **JSON Files (`json`):** Write DataFrame content to JSON files.

```python

df.write.json("hdfs:///path/to/output/jsonfile")

```

4. **Parquet Files (`parquet`):** Write DataFrame content to Parquet files.

```python

df.write.parquet("hdfs:///path/to/output/parquetfile")

```

5. **Avro Files (`avro`):** Write DataFrame content to Avro files.

```python

df.write.format("avro").save("hdfs:///path/to/output/avrofile")

```

6. **ORC Files (`orc`):** Write DataFrame content to ORC files.


```python

df.write.orc("hdfs:///path/to/output/orcfile")

```

7. **Delta Lake (`delta`):** Write DataFrame content to Delta Lake tables.

```python

df.write.format("delta").save("/path/to/delta/table")

```

8. **Hive Tables (`hive`):** Save DataFrame data to a Hive table.

```python

df.write.saveAsTable("database.table_name")

```

9. **Cassandra Tables (`cassandra`):** Save DataFrame data to a Cassandra table.

```python

df.write.format("org.apache.spark.sql.cassandra").options(table="table", keyspace="keyspace").save()

```

10. **JDBC (`jdbc`):** Write DataFrame content to a relational database using JDBC.

```python

df.write.jdbc("jdbc:postgresql://localhost:5432/database", "table_name",
              properties={"user": "username", "password": "password"})

```
11. **MongoDB (`mongo`):** Write DataFrame content to MongoDB.

```python

df.write.format("com.mongodb.spark.sql.DefaultSource").option("uri",
"mongodb://localhost:27017/db.collection").mode("overwrite").save()

```

12. **GraphQL (`graphql`):** Write DataFrame content to GraphQL APIs. Spark has no built-in GraphQL connector; this requires a third-party data source, and the class name below is illustrative.

```python

# hypothetical third-party GraphQL data source
df.write.format("io.github.andykay.ghql.spark.GhqlDataSource").option("url",
    "https://api.example.com/graphql").save()

```

13. **Kafka (`kafka`):** Write DataFrame content to an Apache Kafka topic.

```python

df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") \
  .write.format("kafka") \
  .option("kafka.bootstrap.servers", "localhost:9092") \
  .option("topic", "topic_name") \
  .save()

```

14. **Feather Files (`arrow`):** Write DataFrame content to Feather files. This is not built into Spark and requires a third-party Arrow/Feather data source.

```python

# hypothetical third-party Arrow/Feather data source
df.write.format("arrow").save("hdfs:///path/to/output/featherfile")

```
15. **Image Files (`image`):** The built-in `image` data source is read-only, so DataFrame content cannot be written back as image files directly; persist the image rows (including their binary content) in a supported format instead.

```python

# the image data source does not support writing; save the rows in a format such as Parquet
df.write.parquet("hdfs:///path/to/output/imagedata")

```

16. **GraphFrames (`graphframes`):** Persist graph data from GraphFrames. There is no "graphframes" write format; save the vertex and edge DataFrames in a regular format such as Parquet.

```python

g.vertices.write.mode("overwrite").parquet("hdfs:///path/to/output/graph/vertices")

g.edges.write.mode("overwrite").parquet("hdfs:///path/to/output/graph/edges")

```

17. **Excel Files (`excel`):** Write DataFrame content to Excel files. Requires the spark-excel package; option names vary by version.

```python

# spark-excel data source; newer versions pass the path to .save(path) instead of option("location", ...)
df.write.format("com.crealytics.spark.excel").option("location",
    "hdfs:///path/to/output/excelfile.xlsx").save()

```

18. **Arrow Files (`arrow`):** Write DataFrame content to Arrow files. This is not built into Spark and requires a third-party Arrow data source.

```python

# hypothetical third-party Arrow data source
df.write.format("arrow").save("hdfs:///path/to/output/arrowfile")

```

19. **Elasticsearch (`org.elasticsearch.spark.sql`):** Write DataFrame content to Elasticsearch.


```python

df.write.format("org.elasticsearch.spark.sql").option("es.nodes", "localhost").option("es.resource",
"index/type").mode("overwrite").save()

```

20. **Bigtable (`bigtable`):** Write DataFrame content to Google Cloud Bigtable. Requires the Bigtable Spark connector; the format and option names depend on the connector version.

```python

# Bigtable Spark connector; the option names shown here are illustrative
df.write.format("com.google.cloud.spark.bigtable").option("tableId", "my-table").save()

```

Note: The examples provided assume that `spark` is a `SparkSession` object. The actual code might vary
based on your specific Spark version and configuration.
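
For reference, here is one way that `SparkSession` can be created (a minimal sketch; Hive support is needed only for the Hive table examples):

```python

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("dataframe-io-examples")
    .enableHiveSupport()   # only required for the Hive examples
    .getOrCreate()
)

```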
