
Data Engineering Fundamentals
Pandas vs PySpark
Eren Han

1. LOAD CSV

Pandas:

import pandas as pd

df = pd.read_csv('sample.csv')

PySpark:

df = spark.read \
    .options(header=True, inferSchema=True) \
    .csv('sample.csv')
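
The PySpark snippet assumes an existing SparkSession named spark. A minimal setup sketch (the app name is illustrative):

from pyspark.sql import SparkSession

# Create (or reuse) a local SparkSession; notebooks and spark-submit
# environments usually provide one already.
spark = SparkSession.builder \
    .appName("pandas-vs-pyspark") \
    .getOrCreate()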

2. VIEW DATAFRAME

Pandas:

df
df.head(10)

PySpark:

df.show()
df.show(10)
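
By default show() truncates long cell values; truncate=False prints them in full:

# Show 10 rows without truncating long string values.
df.show(10, truncate=False)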

3. CHECK COLUMNS AND DATA TYPES

Pandas:

df.columns
df.dtypes

PySpark:

df.columns
df.dtypes
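
In PySpark, df.dtypes returns a plain list of (name, type) tuples. For nested types, printSchema() gives a more readable view:

# Print the schema as an indented tree (useful for nested structs).
df.printSchema()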

4. RENAME COLUMNS

Pandas:

df.columns = ["x", "y", "z"]
df.rename(columns={"old": "new"})

PySpark:

df.toDF("x", "y", "z")
df.withColumnRenamed("old", "new")

5. DROP COLUMN

Pandas:

df.drop("column", axis=1)

PySpark:

df.drop("column")

6. FILTERING

Pandas:

df[df.column < 80]
df[(df.column < 80) & (df.column2 == 50)]

PySpark:

df[df.column < 80]
df[(df.column < 80) & (df.column2 == 50)]
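
The bracket syntax works in PySpark too, but filter() (alias where()) is the more idiomatic form:

# Equivalent, more idiomatic PySpark filters.
df.filter(df.column < 80)
df.filter((df.column < 80) & (df.column2 == 50))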

7. ADD COLUMN

Pandas:

df["new"] = 1 / df.column

Note: Division by zero is infinite.

PySpark:

df.withColumn("new", 1 / df.column)

Note: Division by zero is NULL.
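
A minimal sketch of the divergence, reusing the spark session from above (the zero value is illustrative; in ANSI mode, newer Spark versions raise an error instead of returning NULL):

import pandas as pd

pdf = pd.DataFrame({"column": [0.0]})
pdf["new"] = 1 / pdf.column                     # new is inf

sdf = spark.createDataFrame(pdf)
sdf.withColumn("new", 1 / sdf.column).show()    # new is NULL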

8. FILL NULLS

Pandas:

df.fillna(0)

PySpark:

df.fillna(0)
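
Both versions of fillna() also accept a per-column mapping (the column names here are illustrative):

# Fill different defaults per column; works in pandas and PySpark alike.
df.fillna({"sales": 0, "product": "unknown"})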

9. AGGREGATION

Pandas:

df.groupby(["date", "product"]) \
    .agg({"sales": "mean", "revenue": "max"})

PySpark:

df.groupby(["date", "product"]) \
    .agg({"sales": "mean", "revenue": "max"})

10. STANDARD TRANSFORMATIONS

Pandas PySpark

import numpy as np import pysapark.sql.functions as F


df["logcolumn"] = np.log(df.column) df.withColumn("logcolumn",
F.log(df.column)
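
Other elementwise math functions follow the same pattern, for example:

# Square root and exponential, elementwise.
df.withColumn("sqrtcolumn", F.sqrt(df.column))
df.withColumn("expcolumn", F.exp(df.column))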

11. CONDITIONAL STATEMENTS

Pandas PySpark

df["cond"]= df.apply(lambda x: 1 if import pysapark.sql.functions as F


df.col1>20 else 2 if df.col2==6 else df.withColumn("cond", \
3, axis=1) F.when(df.col1>20,1) \
.when(df.col2==6,2)
.otherwise(3))
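
In pandas, a vectorized np.select is usually faster than the row-wise apply:

import numpy as np

# Conditions are checked in order; the first match wins.
df["cond"] = np.select(
    [df.col1 > 20, df.col2 == 6],
    [1, 2],
    default=3,
)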

12. MERGE / JOIN DATAFRAMES

Pandas:

df.merge(df2, on="key")
df.merge(df2, left_on="a", right_on="b")

PySpark:

df.join(df2, on="key")
df.join(df2, df.a == df2.b)
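
Both default to an inner join; other join types are selected with the how argument:

# Left join in both APIs.
df.merge(df2, on="key", how="left")
df.join(df2, on="key", how="left")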

13. SUMMARY STATISTICS

Pandas:

df.describe()

PySpark:

df.describe().show()

Note: PySpark's describe() reports only count, mean, stddev, min, and max.
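
If percentiles are needed on the Spark side, summary() adds them:

# summary() includes approximate 25%, 50%, and 75% percentiles.
df.summary().show()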

14. CHANGE DATA TYPES

Pandas PySpark

from pyspark.sql.types
df['A'] = df['A'].astype(int)
import IntegerType

df = df.withColumn('A',
col('A').cast(IntegerType()))
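
cast() also accepts a type name as a string, which skips the types import:

# Equivalent cast using the string shorthand.
df = df.withColumn('A', col('A').cast('int'))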

Thank you for reading. I hope you enjoyed it.

Eren Han
