Python Data Exploratory Commands
This document provides a cheat sheet on exploratory data analysis (EDA) techniques that can be performed with PySpark. It lists over 40 techniques organized into categories like data loading, inspection, cleaning, transformation, SQL queries, statistical analysis, machine learning integration, and more. The techniques are concisely explained and include relevant code snippets using PySpark APIs and functions.
● Fill Missing Values: df.na.fill(value)
● Drop Column: df.drop('column_name')
● Rename Column: df.withColumnRenamed('old_name', 'new_name') (see the sketch below)
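The minimal sketch below chains these cleaning calls on a toy DataFrame; the column names, sample rows, and fill values are illustrative assumptions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cleaning-demo").getOrCreate()
df = spark.createDataFrame(
    [(1, None, "NY"), (2, 30, None)], ["id", "age", "city"])

df = df.na.fill({"age": 0})                      # fill missing values in one column
df = df.drop("city")                             # drop an unneeded column
df = df.withColumnRenamed("age", "age_years")    # rename a column
df.show()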
4. Data Transformation
● Select Columns: df.select('column1', 'column2')
● Add New or Transform Column: df.withColumn('new_column', expression)
● Filter Rows: df.filter(df['column'] > value)
● Group By and Aggregate: df.groupBy('column').agg({'column': 'sum'})
● Sort Rows: df.sort(df['column'].desc())
● Apply a Function to Each Partition: df.rdd.mapPartitions(lambda partition: [transform(row) for row in partition]) (combined sketch below)
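A minimal sketch combining the transformation commands above on a toy DataFrame; the column names, sample rows, and the per-partition transform() helper are assumptions for illustration.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("transform-demo").getOrCreate()
df = spark.createDataFrame(
    [("a", 1), ("a", 3), ("b", 2)], ["key", "value"])

out = (df.select("key", "value")
         .withColumn("value_x2", F.col("value") * 2)            # add a derived column
         .filter(F.col("value") > 1)                            # keep rows with value > 1
         .groupBy("key").agg(F.sum("value_x2").alias("total"))  # aggregate per key
         .sort(F.col("total").desc()))                          # sort descending
out.show()

# mapPartitions applies a plain Python function to the rows of each partition
def transform(row):
    return (row["key"], row["value"] * 10)

print(df.rdd.mapPartitions(lambda part: [transform(r) for r in part]).collect())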
38. Advanced Machine Learning Operations
● Using ML Pipelines: from pyspark.ml import Pipeline; pipeline = Pipeline(stages=[stage1, stage2]); model = pipeline.fit(df)
● Model Evaluation: from pyspark.ml.evaluation import BinaryClassificationEvaluator; evaluator = BinaryClassificationEvaluator(); evaluator.evaluate(predictions) (see the sketch below)
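A hedged sketch of a small pipeline plus evaluation; the VectorAssembler and LogisticRegression stages, the feature columns, and the toy data are assumptions chosen to make the example self-contained.

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("ml-demo").getOrCreate()
df = spark.createDataFrame(
    [(1.0, 2.0, 0.0), (2.0, 1.0, 1.0), (3.0, 4.0, 0.0), (4.0, 3.0, 1.0)],
    ["f1", "f2", "label"])

assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[assembler, lr])        # chain the two stages

model = pipeline.fit(df)                           # fit the whole pipeline at once
predictions = model.transform(df)
evaluator = BinaryClassificationEvaluator(labelCol="label")
print(evaluator.evaluate(predictions))             # area under ROC by default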
39. Optimization Techniques
● Broadcast Variables for Efficiency: from pyspark.sql.functions import broadcast; df.join(broadcast(df2), 'key')
● Using Accumulators for Global Aggregates: accumulator = spark.sparkContext.accumulator(0); rdd.foreach(lambda x: accumulator.add(x)) (see the sketch below)
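A minimal sketch of both optimization patterns; the toy tables, the 'key' join column, and the parallelized numbers are illustrative assumptions.

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("opt-demo").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["key", "val"])
df2 = spark.createDataFrame([(1, "x"), (2, "y")], ["key", "lookup"])

# Hint that the small table should be broadcast to every executor
joined = df.join(broadcast(df2), "key")
joined.show()

# Accumulators collect a global aggregate as a side effect of an action
acc = spark.sparkContext.accumulator(0)
spark.sparkContext.parallelize([1, 2, 3, 4]).foreach(lambda x: acc.add(x))
print(acc.value)   # 10 once the action has run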
40. Advanced Data Import/Export
● Reading Data from Multiple Sources: df = spark.read.format('format').option('option', 'value').load(['path1', 'path2'])
● Writing Data to Multiple Formats: df.write.format('format').save('path', mode='overwrite') (see the sketch below)
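A hedged sketch of a multi-path read and writes to two formats; the CSV/Parquet/JSON choices and the file paths are assumptions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("io-demo").getOrCreate()

# Read several CSV files into a single DataFrame (load accepts a list of paths)
df = (spark.read.format("csv")
          .option("header", "true")
          .load(["data/part1.csv", "data/part2.csv"]))

# Write the same DataFrame out in two different formats
df.write.format("parquet").save("out/parquet", mode="overwrite")
df.write.format("json").save("out/json", mode="overwrite")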
41. Utilizing External Data Sources
● Connecting to External Data Sources (e.g., Kafka, S3): df =
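A hedged sketch of one way such a connection can look; the broker address, topic name, and bucket path are assumptions, and the Kafka reader additionally requires the spark-sql-kafka connector package (and S3 reads require hadoop-aws plus credentials) on the session's classpath.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("external-demo").getOrCreate()

# Batch read from a Kafka topic (needs the spark-sql-kafka-0-10 package)
kafka_df = (spark.read.format("kafka")
                .option("kafka.bootstrap.servers", "broker:9092")
                .option("subscribe", "events")
                .load())

# Read Parquet data directly from S3 via the s3a filesystem
s3_df = spark.read.parquet("s3a://my-bucket/path/to/data/")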