Pandas - PySpark Equivalents-1

The document compares common data-wrangling operations in Pandas and PySpark, listing the syntax for each: reading and writing CSVs, selecting columns, filtering rows, grouping and aggregating, sorting, handling missing values, renaming columns, creating and calculating new columns, joining, pivoting, dropping columns and duplicates, concatenating dataframes, and finding unique values.

Uploaded by Rufai

Reading CSV
  Pandas:  pd.read_csv('file.csv')
  PySpark: spark.read.csv('file.csv', header=True)

Writing CSV
  Pandas:  df.to_csv('file.csv', index=False)
  PySpark: df.write.csv('file.csv', header=True)
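To make the CSV rows concrete, here is a minimal pandas round trip (the dataframe contents and file name are illustrative, not from the cheat sheet):

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"name": ["Ada", "Linus"], "score": [90, 85]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "file.csv")
    df.to_csv(path, index=False)   # index=False drops the row-index column
    df2 = pd.read_csv(path)        # the header row is inferred by default

print(df2.equals(df))  # → True
```

The PySpark side differs in that `header=True` must be passed explicitly on both read and write, since Spark does not assume a header row.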

Selecting Columns
  Pandas:  df[['column1', 'column2']]
  PySpark: df.select('column1', 'column2')

Filtering Data
  Pandas:  df[df['column'] > value]
  PySpark: df.filter(df['column'] > value)

Grouping and Aggregating
  Pandas:  df.groupby('group_column').agg({'numeric_column': 'mean'})
  PySpark: df.groupBy('group_column').agg({'numeric_column': 'mean'})
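The select/filter/group-by entries above can be sketched end to end in pandas; the column names follow the cheat sheet's placeholders, and the data is made up:

```python
import pandas as pd

df = pd.DataFrame({
    "group_column": ["a", "a", "b"],
    "numeric_column": [1.0, 3.0, 5.0],
})

selected = df[["group_column", "numeric_column"]]            # selecting columns
filtered = df[df["numeric_column"] > 2]                      # filtering rows
means = df.groupby("group_column").agg({"numeric_column": "mean"})

print(means.loc["a", "numeric_column"])  # → 2.0
```

In PySpark the same pipeline reads almost identically: `df.select(...)`, `df.filter(...)`, and `df.groupBy(...).agg(...)`, the main difference being the camelCase `groupBy`.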

Moses David Kalyanapu


Sorting Data
  Pandas:  df.sort_values(by='column', ascending=False)
  PySpark: df.orderBy('column', ascending=False)

Handling Missing Values
  Pandas:  df.dropna()
  PySpark: df.na.drop()
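A short pandas sketch of the sorting and missing-value entries (the single `column` and its values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"column": [3.0, None, 1.0]})

ordered = df.sort_values(by="column", ascending=False)  # NaN sorts last by default
clean = df.dropna()                                     # PySpark: df.na.drop()

print(len(clean))  # → 2
```

Note that pandas places NaN rows at the end of a sort by default (`na_position='last'`), whereas in PySpark null ordering is controlled with `asc_nulls_last`/`desc_nulls_last` column methods.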

Renaming Columns
  Pandas:  df.rename(columns={'old_name': 'new_name'})
  PySpark: df.withColumnRenamed('old_name', 'new_name')

Creating New Column
  Pandas:  df['new_column'] = values
  PySpark: df.withColumn('new_column', values)

Calculated Column
  Pandas:  df['sum_column'] = df['column1'] + df['column2']
  PySpark: df.withColumn('sum_column', df['column1'] + df['column2'])
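The renaming and column-creation entries combine naturally into one pandas snippet; the data is made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"old_name": [1, 2], "column1": [10, 20], "column2": [1, 2]})

df = df.rename(columns={"old_name": "new_name"})   # PySpark: withColumnRenamed
df["flag"] = True                                  # new column from a literal
df["sum_column"] = df["column1"] + df["column2"]   # calculated column

print(list(df["sum_column"]))  # → [11, 22]
```

One design difference worth noting: pandas mutates the dataframe in place when you assign a column, while PySpark's `withColumn` always returns a new immutable DataFrame that you must reassign.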

Display DF Schema Info
  Pandas:  df.info()
  PySpark: df.printSchema()
Data Joining
  Pandas:  pd.merge(df1, df2, on='key_column', how='join_type')
  PySpark: df1.join(df2, on='key_column', how='join_type')
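A worked inner join in pandas, using small made-up frames in place of the cheat sheet's `df1`/`df2` placeholders:

```python
import pandas as pd

left = pd.DataFrame({"key_column": [1, 2], "x": ["a", "b"]})
right = pd.DataFrame({"key_column": [2, 3], "y": ["c", "d"]})

# Only key_column == 2 appears in both frames, so one row survives.
inner = pd.merge(left, right, on="key_column", how="inner")
# PySpark: left.join(right, on="key_column", how="inner")

print(len(inner))  # → 1
```

Both libraries accept the same `how` values for the common cases ('inner', 'left', 'right', 'outer'); PySpark additionally supports semi and anti joins.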

Pivot Tables
  Pandas:  pd.pivot_table(df, values='value', index='index_column', columns='column_name', aggfunc='agg_func')
  PySpark: df.groupBy('index_column').pivot('column_name').agg({'value': 'agg_func'})
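The pivot entry with `aggfunc='mean'` substituted for the `agg_func` placeholder, on made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "index_column": ["r1", "r1", "r2"],
    "column_name": ["c1", "c2", "c1"],
    "value": [1.0, 2.0, 3.0],
})

pivoted = pd.pivot_table(df, values="value", index="index_column",
                         columns="column_name", aggfunc="mean")
# PySpark: df.groupBy("index_column").pivot("column_name").agg({"value": "mean"})

print(pivoted.loc["r1", "c1"])  # → 1.0
```

The resulting table has one row per `index_column` value and one column per `column_name` value, with NaN where a combination never occurs (here, r2/c2).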

Column Deletion
  Pandas:  df.drop(columns=['column_name'])
  PySpark: df.drop('column_name')

Dropping Duplicates
  Pandas:  df.drop_duplicates()
  PySpark: df.dropDuplicates()

Dataframe Concatenation
  Pandas:  pd.concat([df1, df2])
  PySpark: df1.union(df2)

Find Unique Values
  Pandas:  df['column_name'].unique()
  PySpark: df.select('column_name').distinct()
