PySpark Essentials
Transformations
PySpark RDD (Resilient Distributed Dataset) transformations are lazy operations that return a new RDD; they are only executed when an action is called. Here is a list of some common PySpark RDD transformations:
1. **`map(func)`**: Applies a function to each element of the RDD and returns a new RDD.
2. **`filter(func)`**: Returns a new RDD containing only the elements that satisfy the given predicate.
3. **`flatMap(func)`**: Similar to `map`, but each input item can be mapped to zero or more output items.
4. **`union(otherRDD)`**: Returns a new RDD containing the elements of both the original RDD and the other RDD.
5. **`reduceByKey(func, numPartitions=None)`**: Merges the values for each key using the specified associative and commutative function.
6. **`join(otherRDD, numPartitions=None)`**: Performs an inner join between two RDDs based on their keys.
7. **`cogroup(otherRDD, numPartitions=None)`**: For each key, groups the values from both RDDs into a pair of iterables.
8. **`mapValues(func)`**: Applies a function to the values of each key-value pair without changing the keys.
9. **`flatMapValues(func)`**: Similar to `mapValues`, but each input value can be mapped to zero or more output values.
These transformations are fundamental building blocks for constructing more complex data processing
pipelines in PySpark.
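A minimal sketch of how a few of these transformations chain together, assuming an existing SparkContext `sc`; the sample data and variable names are illustrative:

```python
# Assumes an existing SparkContext `sc`; the sample data is illustrative.
lines = sc.parallelize(["spark makes big data simple", "big data with spark"])

pairs = (lines.flatMap(lambda line: line.split())   # one line -> many words
              .filter(lambda w: len(w) > 3)         # keep words longer than 3 characters
              .map(lambda w: (w, 1)))               # word -> (word, 1)

counts = pairs.reduceByKey(lambda a, b: a + b)      # sum the 1s per word
scaled = counts.mapValues(lambda n: n * 10)         # transform values, keys stay the same
```

Nothing is computed yet: these calls only build the RDD lineage, and execution is deferred until an action (see below) is invoked.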
Actions
PySpark RDD (Resilient Distributed Dataset) actions are operations that return values to the driver
program or write data to an external storage system. Here is a list of some common PySpark RDD
actions:
1. **`collect()`**: Returns all elements of the RDD as a list to the driver program. Use it cautiously: it brings all the data to the driver and can cause out-of-memory errors for large datasets.
2. **`reduce(func)`**: Aggregates the elements of the RDD using the specified associative and commutative binary operator.
3. **`fold(zeroValue, func)`**: Aggregates the elements of the RDD using the specified associative binary operator and a neutral "zero value."
4. **`aggregate(zeroValue, seqOp, combOp)`**: Aggregates the elements of the RDD using two different aggregation functions: `seqOp` within each partition and `combOp` to merge the partition results.
5. **`foreach(func)`**: Applies a function to each element of the RDD. The function runs on the executors, typically for side effects.
6. **`countByKey()`**: Counts the number of occurrences of each key in a key-value RDD.
7. **`collectAsMap()`**: Returns the key-value pairs of the RDD as a dictionary to the driver program.
8. **`saveAsTextFile(path)`**: Writes the elements of the RDD to a text file or a set of text files in the specified directory.
9. **`saveAsPickleFile(path)`**: Writes the elements of the RDD to a file in pickle format.
10. **`foreachPartition(func)`**: Applies a function to each partition of the RDD. This is useful for operations that require per-partition setup (for example, opening a database connection).
These actions are used to trigger the execution of the computation defined by transformations on RDDs.
They return values to the driver program or save the data to an external storage system.
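A quick sketch of a few of these actions, reusing the `pairs` and `counts` RDDs from the transformation sketch above (the output path is illustrative):

```python
# Actions trigger the actual computation of the RDD lineage.
print(counts.collect())                       # all (word, count) pairs on the driver
print(pairs.countByKey())                     # occurrences per word, returned as a dict
total = counts.map(lambda kv: kv[1]).reduce(lambda a, b: a + b)  # total word count
print(total)
counts.saveAsTextFile("output/word_counts")   # illustrative output directory
```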
Reading and writing files from external storage with RDDs
**Read Files:**
1. **Text Files (`textFile`):** Read text files from HDFS, local file system, or other supported file systems.
```python
rdd = sc.textFile("hdfs:///path/to/textfile/*.txt")
```
2. **Sequence Files (`sequenceFile`):** Read Hadoop SequenceFiles of key-value pairs.
```python
rdd = sc.sequenceFile("hdfs:///path/to/sequencefile")
```
3. **JSON Files (`json`):** Read JSON files through the SQL context and convert the result to an RDD.
```python
rdd = sqlContext.read.json("hdfs:///path/to/jsonfile/*.json").rdd
```
4. **Parquet Files (`parquet`):** Read Parquet files through the SQL context and convert the result to an RDD.
```python
rdd = sqlContext.read.parquet("hdfs:///path/to/parquetfile").rdd
```
5. **Hive Tables (`table`):** Read a Hive table through the Hive context and convert it to an RDD.
```python
rdd = hiveContext.table("database.table_name").rdd
```
6. **MongoDB (`mongoRDD`):** Read a MongoDB collection (requires the mongo-hadoop / pymongo-spark connector).
```python
rdd = sc.mongoRDD("mongodb://localhost:27017/db.collection")
```
**Write Files:**
1. **Text Files (`saveAsTextFile`):** Write the RDD as plain text files.
```python
rdd.saveAsTextFile("hdfs:///path/to/output")
```
2. **Sequence Files (`saveAsSequenceFile`):** Write a key-value RDD as a Hadoop SequenceFile.
```python
rdd.saveAsSequenceFile("hdfs:///path/to/output")
```
3. **JSON Files (`json`):** Convert the RDD to a DataFrame and write it as JSON.
```python
rdd.toDF().write.json("hdfs:///path/to/output/jsonfile")
```
4. **Parquet Files (`parquet`):** Convert the RDD to a DataFrame and write it as Parquet.
```python
rdd.toDF().write.parquet("hdfs:///path/to/output/parquetfile")
```
5. **Avro Files (`saveAsHadoopFile`):** Write the RDD using the Avro Hadoop output format.
```python
rdd.saveAsHadoopFile("hdfs:///path/to/output/avrofile",
                     "org.apache.avro.mapred.AvroOutputFormat")
```
6. **Hive Tables (`saveAsTable`):** Convert the RDD to a DataFrame and save it as a Hive table.
```python
rdd.toDF().write.saveAsTable("database.table_name")
```
7. **Cassandra (`saveToCassandra`):** Write the RDD to Cassandra (requires the pyspark-cassandra connector).
```python
rdd.saveToCassandra("keyspace", "table")
```
8. **JDBC:** Write RDD content to a relational database using JDBC, typically by converting the RDD to a DataFrame first.
```python
rdd.toDF().write.jdbc("jdbc:postgresql://localhost/db", "table_name",   # illustrative connection details
                      properties={"user": "user", "password": "password"})
```
9. **MongoDB (`saveToMongoDB`):** Write the RDD to MongoDB (requires the mongo-hadoop / pymongo-spark connector).
```python
rdd.saveToMongoDB("mongodb://localhost:27017/db.collection")
```
Note: The examples provided assume that SparkContext (`sc`), SQLContext (`sqlContext`), and HiveContext (`hiveContext`) objects are available. The actual code might vary based on your specific Spark version and configuration.
DataFrame
Transformations
PySpark DataFrame transformations return a new DataFrame and are evaluated lazily; nothing is computed until an action is called. Here is a list of some common PySpark DataFrame transformations:
1. **`filter(condition)`**: Returns a new DataFrame with rows that satisfy the given condition.
2. **`pivot(pivot_col, values=None)`**: Pivots a column of the DataFrame (called on the result of `groupBy`) and performs the specified aggregation.
3. **`join(other, on=None, how=None)`**: Joins the DataFrame with another DataFrame.
4. **`union(other)`**: Returns a new DataFrame containing rows from both DataFrames.
5. **`na.fill(value, subset=None)`**: Returns a new DataFrame with missing values filled.
6. **`na.drop(how='any', subset=None)`**: Returns a new DataFrame with rows containing null or NaN values dropped.
7. **`na.replace(to_replace, value, subset=None)`**: Returns a new DataFrame with the specified values replaced.
8. **`limit(n)`**: Returns a new DataFrame with only the first n rows.
9. **`describe(*cols)`**: Computes basic statistics for numeric and string columns.
10. **`randomSplit(weights, seed=None)`**: Splits the DataFrame into multiple DataFrames based on the provided weights.
11. **`cache()`**: Persists the DataFrame in memory so that subsequent actions reuse the computed result.
12. **`fillna(value, subset=None)`**: Returns a new DataFrame with missing values filled (an alias for `na.fill`).
13. **`replace(to_replace, value, subset=None)`**: Replaces values matching the specified conditions with new values.
14. **`cacheTable(tableName)`**: Caches the contents of a table or view in memory (via `spark.catalog.cacheTable`).
15. **`uncacheTable(tableName)`**: Removes a cached table or view from memory (via `spark.catalog.uncacheTable`).
16. **`rollup(*cols)`**: Creates a multi-dimensional rollup (sub-totals) over the specified columns.
17. **`freqItems(cols, support=None)`**: Finds frequent items for columns with categorical data.
18. **`sampleBy(col, fractions, seed=None)`**: Returns a stratified sample of the DataFrame based on values in the specified column.
19. **`transform(func)`**: Applies a function to the DataFrame and returns a new DataFrame.
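A minimal sketch chaining a few of these DataFrame transformations, assuming an existing SparkSession `spark`; the column names and values are illustrative:

```python
# Assumes an existing SparkSession `spark`; columns and values are illustrative.
orders = spark.createDataFrame(
    [("alice", "books", 12.0), ("bob", "games", None), ("alice", "games", 30.0)],
    ["customer", "category", "amount"],
)

cleaned = (orders.na.fill({"amount": 0.0})   # fill missing amounts with 0.0
                 .filter("amount > 0")       # keep rows with a positive amount
                 .limit(100))                # cap the number of rows

# pivot is used after groupBy: one column per category, containing the summed amounts
per_customer = cleaned.groupBy("customer").pivot("category").sum("amount")
```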
Actions
Here is a list of PySpark DataFrame actions, which are operations that return values to the driver
program or write data to an external storage system:
1. **`show(n=20, truncate=True, vertical=False)`**: Prints the first n rows of the DataFrame to the console.
2. **`write`**: Writes the content of the DataFrame to external storage systems (e.g., Parquet, CSV).
   - `write.format("parquet").mode("overwrite").save("/path/to/parquet")`
   - `write.format("csv").mode("overwrite").save("/path/to/csv")`
3. **`saveAsTable(tableName, format=None, mode=None, partitionBy=None)`**: Saves the DataFrame as a table in the Hive metastore (via `df.write.saveAsTable`).
4. **`isLocal()`**: Returns True if the DataFrame can be computed locally on the driver.
5. **`rdd`**: Returns the content of the DataFrame as an RDD of Row objects.
6. **`toDF(*cols)`**: Returns a new DataFrame with the specified column names.
7. **`createGlobalTempView(name)`**: Creates a global temporary view from the DataFrame (use `createOrReplaceGlobalTempView` to overwrite an existing one).
8. **`toLocalIterator()`**: Returns an iterator over all rows in the DataFrame.
9. **`foreach(func)`**: Applies a function to each row of the DataFrame.
These actions execute the computation plan defined by transformations on the DataFrame and return
results to the driver program or perform other output-related tasks.
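A quick sketch of a few of these actions, reusing the `per_customer` DataFrame from the transformation sketch above (the output path is illustrative):

```python
# Actions trigger execution of the DataFrame's computation plan.
per_customer.show(5, truncate=False)                  # print the first rows to the console
for row in per_customer.toLocalIterator():            # iterate over all rows on the driver
    print(row["customer"])
per_customer.write.mode("overwrite").parquet("output/per_customer")  # illustrative path
```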
Below is a list of the different file formats and storage systems that you can read from and write to using PySpark DataFrames:
**Read Files:**
1. **Text Files (`text`):** Read text files from HDFS or local file system.
```python
df = spark.read.text("hdfs:///path/to/textfile/*.txt")
```
2. **CSV Files (`csv`):** Read CSV files, optionally with a header row and schema inference.
```python
df = spark.read.csv("hdfs:///path/to/csvfile/*.csv", header=True, inferSchema=True)
```
3. **JSON Files (`json`):** Read JSON files.
```python
df = spark.read.json("hdfs:///path/to/jsonfile/*.json")
```
4. **Parquet Files (`parquet`):** Read Parquet files.
```python
df = spark.read.parquet("hdfs:///path/to/parquetfile")
```
5. **Avro Files (`avro`):** Read Avro files.
```python
df = spark.read.format("avro").load("hdfs:///path/to/avrofile")
```
6. **ORC Files (`orc`):** Read ORC files.
```python
df = spark.read.orc("hdfs:///path/to/orcfile")
```
7. **Delta Lake (`delta`):** Read a Delta table (requires the Delta Lake package).
```python
df = spark.read.format("delta").table("table_name")
```
8. **Hive Tables (SQL):** Read a Hive table with a SQL query.
```python
df = spark.sql("SELECT * FROM database.table_name")
```
9. **Cassandra:** Read a Cassandra table (requires the Spark Cassandra connector).
```python
df = (spark.read.format("org.apache.spark.sql.cassandra")
      .options(table="table", keyspace="keyspace")
      .load())
```
10. **JDBC (`jdbc`):** Read data from relational databases using JDBC.
```python
df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://localhost/db")    # illustrative connection details
      .option("dbtable", "table_name")
      .option("user", "user")
      .option("password", "password")
      .load())
```
11. **MongoDB (`mongo`):** Read a MongoDB collection (requires the MongoDB Spark connector).
```python
df = (spark.read.format("com.mongodb.spark.sql.DefaultSource")
      .option("uri", "mongodb://localhost:27017/db.collection")
      .load())
```
12. **GraphQL (third-party connector):** Read data exposed through a GraphQL endpoint; the data source class name depends on the connector package you use.
```python
df = (spark.read.format("io.github.andykay.ghql.spark.GhqlDataSource")
      .option("url", "https://api.example.com/graphql")
      .load())
```
13. **Kafka (`kafka`):** Read data from Apache Kafka topics.
```python
df = (spark.read.format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "topic_name")
      .load())
```
14. **Feather Files (`arrow`):** Read Feather files via a third-party Arrow data source.
```python
df = spark.read.format("arrow").load("hdfs:///path/to/featherfile")
```
15. **Image Files (`image`):** Read a folder of images into a DataFrame.
```python
df = spark.read.format("image").option("path", "hdfs:///path/to/imagefolder").load()
```
16. **GraphFrames:** Build a graph from vertex and edge DataFrames (requires the GraphFrames package).
```python
from graphframes import GraphFrame   # vertices and edges are existing DataFrames
g = GraphFrame(vertices, edges)
```
17. **Excel Files (`com.crealytics.spark.excel`):** Read an Excel file (requires the spark-excel package).
```python
df = (spark.read.format("com.crealytics.spark.excel")
      .option("location", "hdfs:///path/to/excelfile.xlsx")
      .load())
```
18. **Arrow Files (`arrow`):** Read Arrow files via a third-party Arrow data source.
```python
df = spark.read.format("arrow").load("hdfs:///path/to/arrowfile")
```
19. **Elasticsearch (`org.elasticsearch.spark.sql`):** Read an Elasticsearch index (requires the Elasticsearch-Hadoop connector).
```python
df = (spark.read.format("org.elasticsearch.spark.sql")
      .option("es.nodes", "localhost")
      .option("es.resource", "index/type")
      .load())
```
20. **Google Cloud Bigtable (`com.google.cloud.spark.bigtable`):** Read a Bigtable table (requires the Bigtable Spark connector).
```python
df = spark.read.format("com.google.cloud.spark.bigtable").option("tableId", "my-table").load()
```
**Write Files:**
1. **Text Files (`text`):** Write the DataFrame (which must have a single string column) as text files.
```python
df.write.text("hdfs:///path/to/output/textfile")
```
2. **CSV Files (`csv`):** Write the DataFrame as CSV files.
```python
df.write.csv("hdfs:///path/to/output/csvfile")
```
3. **JSON Files (`json`):** Write the DataFrame as JSON files.
```python
df.write.json("hdfs:///path/to/output/jsonfile")
```
4. **Parquet Files (`parquet`):** Write the DataFrame as Parquet files.
```python
df.write.parquet("hdfs:///path/to/output/parquetfile")
```
5. **Avro Files (`avro`):** Write the DataFrame as Avro files.
```python
df.write.format("avro").save("hdfs:///path/to/output/avrofile")
```
6. **ORC Files (`orc`):** Write the DataFrame as ORC files.
```python
df.write.orc("hdfs:///path/to/output/orcfile")
```
7. **Delta Lake (`delta`):** Write the DataFrame as a Delta table (requires the Delta Lake package).
```python
df.write.format("delta").save("/path/to/delta/table")
```
8. **Hive Tables (`saveAsTable`):** Save the DataFrame as a table in the Hive metastore.
```python
df.write.saveAsTable("database.table_name")
```
9. **Cassandra:** Write DataFrame content to Cassandra (requires the Spark Cassandra connector).
```python
(df.write.format("org.apache.spark.sql.cassandra")
   .options(table="table", keyspace="keyspace")
   .save())
```
10. **JDBC (`jdbc`):** Write DataFrame content to a relational database using JDBC.
```python
(df.write.format("jdbc")
   .option("url", "jdbc:postgresql://localhost/db")    # illustrative connection details
   .option("dbtable", "table_name")
   .option("user", "user")
   .option("password", "password")
   .mode("overwrite")
   .save())
```
11. **MongoDB (`mongo`):** Write DataFrame content to MongoDB (requires the MongoDB Spark connector).
```python
(df.write.format("com.mongodb.spark.sql.DefaultSource")
   .option("uri", "mongodb://localhost:27017/db.collection")
   .mode("overwrite")
   .save())
```
12. **GraphQL (third-party connector):** Write data through a GraphQL endpoint; the data source class name depends on the connector package you use.
```python
(df.write.format("io.github.andykay.ghql.spark.GhqlDataSource")
   .option("url", "https://api.example.com/graphql")
   .save())
```
13. **Kafka (`kafka`):** Write DataFrame content to an Apache Kafka topic; the DataFrame needs a `value` column (and optionally a `key` column).
```python
(df.write.format("kafka")
   .option("kafka.bootstrap.servers", "localhost:9092")
   .option("topic", "topic_name")    # illustrative topic name
   .save())
```
14. **Feather Files (`arrow`):** Write Feather files via a third-party Arrow data source.
```python
df.write.format("arrow").save("hdfs:///path/to/output/featherfile")
```
15. **Image Files (`image`):** Write DataFrame content to image files.
```python
df.write.format("image").option("path", "hdfs:///path/to/output/imagefolder").save()
```
16. **GraphFrames:** Persist the vertex and edge DataFrames of a graph.
```python
g.vertices.write.format("graphframes").mode("overwrite").save("hdfs:///path/to/output/graph/vertices")
g.edges.write.format("graphframes").mode("overwrite").save("hdfs:///path/to/output/graph/edges")
```
17. **Excel Files (`com.crealytics.spark.excel`):** Write DataFrame content to an Excel file (requires the spark-excel package).
```python
(df.write.format("com.crealytics.spark.excel")
   .option("location", "hdfs:///path/to/output/excelfile.xlsx")
   .save())
```
18. **Arrow Files (`arrow`):** Write Arrow files via a third-party Arrow data source.
```python
df.write.format("arrow").save("hdfs:///path/to/output/arrowfile")
```
19. **Elasticsearch (`org.elasticsearch.spark.sql`):** Write DataFrame content to an Elasticsearch index (requires the Elasticsearch-Hadoop connector).
```python
(df.write.format("org.elasticsearch.spark.sql")
   .option("es.nodes", "localhost")
   .option("es.resource", "index/type")
   .mode("overwrite")
   .save())
```
20. **Google Cloud Bigtable (`com.google.cloud.spark.bigtable`):** Write DataFrame content to Bigtable (requires the Bigtable Spark connector).
```python
df.write.format("com.google.cloud.spark.bigtable").option("tableId", "my-table").save()
```
Note: The examples provided assume that `spark` is a `SparkSession` object. The actual code might vary
based on your specific Spark version and configuration.