PySpark Intro

This document provides a cheat sheet for commonly used PySpark functionality, including initializing a SparkSession, creating and manipulating DataFrames, performing aggregations, running SQL queries, and using string and window functions. Key operations such as filtering rows, selecting columns, and concatenating columns are also covered. It serves as a quick reference for anyone working with PySpark.


Here’s a quick cheat sheet for PySpark, covering some of the most commonly used operations:

Basics

Initialize SparkSession:

python

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("App Name").getOrCreate()
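
If you’re experimenting on a single machine, you can also pin the master URL explicitly; this is a minimal sketch, assuming a local run, where "local[*]" tells Spark to use all available local cores:

python

# Assumption: running locally; "local[*]" uses all local cores
spark = (
    SparkSession.builder
    .appName("App Name")
    .master("local[*]")
    .getOrCreate()
)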

DataFrame Operations

Create DataFrame:

python

df = spark.read.csv("path/to/file.csv", header=True, inferSchema=True)
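
For quick tests you can also build a DataFrame from in-memory rows instead of a file; a small sketch with hypothetical column names:

python

# Hypothetical example rows; "name" and "age" are illustrative column names
data = [("Alice", 34), ("Bob", 45)]
df = spark.createDataFrame(data, schema=["name", "age"])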

Show Data:

python

df.show()
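
Two related inspection helpers that are often used alongside show():

python

df.printSchema()   # column names and inferred types
print(df.count())  # number of rows (triggers a Spark job)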

Filter Rows:

python

df.filter(df["column_name"] > value).show()  # value is a placeholder literal, e.g. 100
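
Multiple conditions can be combined with & and |; note that each condition needs its own parentheses. A sketch with hypothetical column names and values:

python

from pyspark.sql.functions import col

# Hypothetical columns and literals; parentheses around each condition are required
df.filter((col("column_name") > 100) & (col("other_column") == "x")).show()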

Select Columns:

python

df.select("column_name").show()
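
You can select several columns at once, or compute expressions with selectExpr; column names here are hypothetical:

python

df.select("col1", "col2").show()
df.selectExpr("col1", "col2 * 2 AS doubled").show()  # SQL expressions per column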

Aggregations

Group By and Aggregate:

python

df.groupBy("column_name").agg({"another_column": "sum"}).show()
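
For multiple named aggregates, the functions in pyspark.sql.functions are usually clearer than the dict form; a sketch using the same hypothetical column names:

python

from pyspark.sql.functions import sum as sum_, avg

df.groupBy("column_name").agg(
    sum_("another_column").alias("total"),
    avg("another_column").alias("average"),
).show()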

SQL Queries

Enable SQL Queries:

python

df.createOrReplaceTempView("table_name")

spark.sql("SELECT * FROM table_name").show()
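
The temp view is scoped to the current SparkSession and can be queried like any table; for example, an aggregate query (table and column names as above):

python

# Any SQL that Spark supports works against the registered view
spark.sql(
    "SELECT column_name, COUNT(*) AS cnt FROM table_name GROUP BY column_name"
).show()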

String Operations

Concatenate Columns:

python

from pyspark.sql.functions import concat, lit

df.withColumn("new_column", concat(df["col1"], lit("_"), df["col2"])).show()
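
Note that concat returns NULL if any input column is NULL; concat_ws (separator first) skips NULL inputs, which is often what you want:

python

from pyspark.sql.functions import concat_ws

# concat_ws("_", ...) joins with "_" and ignores NULL columns
df.withColumn("new_column", concat_ws("_", df["col1"], df["col2"])).show()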

Window Functions

Add Row Numbers:

python

from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

window_spec = Window.partitionBy("column_name").orderBy("another_column")
df.withColumn("row_number", row_number().over(window_spec)).show()
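
A common use of row_number is keeping only the first row per partition (e.g. deduplication); a sketch building on the window_spec above:

python

# Keep the top row per partition, then drop the helper column
(
    df.withColumn("row_number", row_number().over(window_spec))
    .filter("row_number = 1")
    .drop("row_number")
    .show()
)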
