
#_ Important PySpark Operations [ +100 ]

RDD (Resilient Distributed Dataset) Operations:

● parallelize(): Create an RDD.
● map(): Transform each element of the RDD.
● filter(): Return a new RDD with only the elements that satisfy a
condition.
● reduce(): Aggregate RDD elements using a function.
● collect(): Return all the elements of the RDD.
● count(): Count the RDD's elements.
● first(): Return the first element of the RDD.
● take(): Return the first 'n' elements of the RDD.
● foreach(): Apply a function to each element of the RDD.
● groupByKey(): Group values with the same key.
● reduceByKey(): Reduce values with the same key using a function.
● sortBy(): Sort the RDD.
● join(): Join two RDDs.
● union(): Return a new RDD that contains the union of the elements
in the source RDD and another RDD.
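
A minimal sketch of these RDD operations in a local PySpark session; the data and variable names are purely illustrative:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

# parallelize(): create an RDD from a local collection
nums = sc.parallelize([1, 2, 3, 4, 5])

# map() / filter(): element-wise transformations
squares = nums.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# reduce(), count(), first(), take(), collect(): actions that return results to the driver
total = squares.reduce(lambda a, b: a + b)                 # 55
print(total, squares.count(), squares.first(), squares.take(3), evens.collect())

# key-value operations: reduceByKey(), groupByKey(), sortBy()
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
print(pairs.reduceByKey(lambda a, b: a + b).collect())     # e.g. [('a', 4), ('b', 2)]
print(pairs.groupByKey().mapValues(list).collect())
print(pairs.sortBy(lambda kv: kv[1]).collect())

# join() and union() combine two RDDs
other = sc.parallelize([("a", "x"), ("b", "y")])
print(pairs.join(other).collect())
print(nums.union(sc.parallelize([6, 7])).collect())

# foreach(): run a side-effecting function on each element (on the executors)
nums.foreach(lambda x: None)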

DataFrame Operations:

● createDataFrame(): Create a DataFrame from an RDD or list.
● select(): Select specific columns from a DataFrame.
● filter() or where(): Filter rows in a DataFrame.
● groupBy(): Group by a column or columns.
● orderBy() or sort(): Sort by one or more columns.
● drop(): Drop a column.
● withColumn(): Add or replace a column.
● withColumnRenamed(): Rename a column.
● join(): Join two DataFrames.
● describe(): Compute summary statistics.
● dropna(): Drop rows with null values.
● fillna(): Fill null values.
● agg(): Aggregate data after grouping.

● distinct(): Return distinct rows.
● limit(): Limit the number of rows.
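
A short DataFrame sketch covering most of the calls above; the people/depts data is made up for illustration:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("df-demo").getOrCreate()

# createDataFrame(): build a DataFrame from a list of tuples plus column names
people = spark.createDataFrame(
    [("Alice", "HR", 30), ("Bob", "IT", None), ("Cara", "IT", 25)],
    ["name", "dept", "age"],
)

# select(), where(), withColumn(), withColumnRenamed(), drop()
adults = (people
          .select("name", "dept", "age")
          .where(F.col("age").isNotNull())
          .withColumn("age_next_year", F.col("age") + 1)
          .withColumnRenamed("dept", "department")
          .drop("age_next_year"))

# groupBy() + agg(), then orderBy()
adults.groupBy("department").agg(F.avg("age").alias("avg_age")).orderBy("avg_age").show()

# join(), dropna(), fillna(), describe(), distinct(), limit()
depts = spark.createDataFrame([("HR", "Building A"), ("IT", "Building B")], ["dept", "location"])
people.join(depts, on="dept", how="left").show()
people.dropna(subset=["age"]).show()
people.fillna({"age": 0}).show()
people.describe("age").show()
people.select("dept").distinct().limit(10).show()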

SparkSQL Operations:

● spark.sql(): Execute SQL queries.
● createOrReplaceTempView(): Create a temporary view.
● createGlobalTempView(): Create a global temporary view.
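
Building on the illustrative people DataFrame from the sketch above, a temporary view lets you mix SQL with the DataFrame API:

# createOrReplaceTempView(): session-scoped view, queried via spark.sql()
people.createOrReplaceTempView("people")
spark.sql("SELECT dept, COUNT(*) AS n FROM people GROUP BY dept").show()

# createGlobalTempView(): shared across sessions, lives in the reserved global_temp database
people.createGlobalTempView("people_global")
spark.sql("SELECT * FROM global_temp.people_global").show()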

Data Sources and Writing Data:

● read.csv(): Read data from a CSV file.
● write.csv(): Write data to a CSV file.
● read.json(): Read data from a JSON file.
● write.json(): Write data to a JSON file.
● read.parquet(): Read data from a Parquet file.
● write.parquet(): Write data to a Parquet file.
● read.jdbc(): Read data from a JDBC source.
● write.jdbc(): Write data to a JDBC source.
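
A sketch of reading and writing each format with an existing SparkSession named spark; all paths, the JDBC URL, the table names, and the credentials are placeholders:

# Reading: CSV, JSON, Parquet
df_csv = spark.read.csv("data/people.csv", header=True, inferSchema=True)
df_json = spark.read.json("data/people.json")
df_parquet = spark.read.parquet("data/people.parquet")

# Writing: mode controls what happens if the target already exists
df_csv.write.csv("out/people_csv", header=True, mode="overwrite")
df_json.write.json("out/people_json", mode="overwrite")
df_parquet.write.parquet("out/people_parquet", mode="overwrite")

# JDBC round trip (requires the JDBC driver jar on the classpath)
jdbc_url = "jdbc:postgresql://localhost:5432/mydb"
props = {"user": "myuser", "password": "secret", "driver": "org.postgresql.Driver"}
df_db = spark.read.jdbc(url=jdbc_url, table="public.people", properties=props)
df_db.write.jdbc(url=jdbc_url, table="public.people_copy", mode="append", properties=props)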

MLlib - Machine Learning Library:

● VectorAssembler(): Assemble feature vectors.
● StringIndexer(): Convert string columns to numeric.
● OneHotEncoder(): One-hot encode categorical features.
● StandardScaler(): Scale features.
● LinearRegression(): Linear regression model.
● DecisionTreeClassifier(): Decision tree classification model.
● KMeans(): K-means clustering.
● CrossValidator(): Cross-validation for model selection.
● TrainValidationSplit(): Train-validation for hyperparameter
tuning.
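
A small pyspark.ml pipeline wiring several of these stages together; the training data, column names, and values are made up for illustration, and a SparkSession named spark is assumed:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler, StandardScaler
from pyspark.ml.regression import LinearRegression

train = spark.createDataFrame(
    [("HR", 30.0, 50000.0), ("IT", 25.0, 60000.0),
     ("IT", 40.0, 80000.0), ("HR", 35.0, 55000.0)],
    ["dept", "age", "salary"],
)

indexer = StringIndexer(inputCol="dept", outputCol="dept_idx")          # string -> numeric index
encoder = OneHotEncoder(inputCols=["dept_idx"], outputCols=["dept_vec"])
assembler = VectorAssembler(inputCols=["dept_vec", "age"], outputCol="features_raw")
scaler = StandardScaler(inputCol="features_raw", outputCol="features")
lr = LinearRegression(featuresCol="features", labelCol="salary")

model = Pipeline(stages=[indexer, encoder, assembler, scaler, lr]).fit(train)
model.transform(train).select("dept", "age", "prediction").show()

# CrossValidator and TrainValidationSplit wrap an estimator, a parameter grid,
# and an evaluator for hyperparameter tuning; they follow the same fit()/transform() pattern.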

GraphX Operations (Scala/Java API; GraphX is not exposed in PySpark):

● Graph(): Create a graph.
● vertices: Access vertices of a graph.
● edges: Access edges of a graph.
● triplets: Access triplets of a graph.
● inDegrees: Compute the in-degree of each vertex.
● outDegrees: Compute the out-degree of each vertex.
● subgraph(): Generate a subgraph.
● mapVertices(): Transform the vertices of a graph.
● mapEdges(): Transform the edges of a graph.

Streaming:

● StreamingContext(): Create a streaming context.
● updateStateByKey(): Maintain stateful information.
● window(): Return a new DStream computed based on windowed batches.
● reduceByKeyAndWindow(): Reduce by key over a window.
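
A sketch using the older DStream API; it assumes an existing SparkSession named spark and a text source on localhost:9999 (for example one started with nc -lk 9999), and the host, port, and checkpoint path are placeholders:

from pyspark.streaming import StreamingContext

ssc = StreamingContext(spark.sparkContext, batchDuration=5)
ssc.checkpoint("checkpoint/")          # needed for stateful and inverse-window operations

lines = ssc.socketTextStream("localhost", 9999)
pairs = lines.flatMap(lambda line: line.split()).map(lambda w: (w, 1))

# window(): lines.window(30, 10) would expose the raw records over the same window
# reduceByKeyAndWindow(): word counts over a 30-second window, sliding every 10 seconds
windowed = pairs.reduceByKeyAndWindow(lambda a, b: a + b, lambda a, b: a - b, 30, 10)
windowed.pprint()

# updateStateByKey(): running totals maintained across all batches
def update(new_values, running):
    return sum(new_values) + (running or 0)

pairs.updateStateByKey(update).pprint()

ssc.start()
ssc.awaitTermination()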

Performance and Optimization:

● cache() or persist(): Cache an RDD or DataFrame.
● unpersist(): Remove data from memory.
● broadcast(): Broadcast a read-only variable.
● repartition(): Repartition the data.
● coalesce(): Decrease the number of partitions.
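
A sketch of these calls, reusing the illustrative people DataFrame and spark session from the DataFrame example above:

from pyspark.sql import functions as F

# cache()/persist() keep a frequently reused DataFrame in memory; unpersist() frees it
people.cache()
people.count()                      # the first action materializes the cache
people.unpersist()

# broadcast(): ship a small read-only lookup table to every executor once
lookup = spark.sparkContext.broadcast({"HR": "Human Resources", "IT": "Information Technology"})
expand = F.udf(lambda code: lookup.value.get(code, code))
people.withColumn("dept_full", expand("dept")).show()

# repartition() can increase or rebalance partitions (full shuffle);
# coalesce() only decreases them and avoids a full shuffle
print(people.rdd.getNumPartitions())
narrow = people.repartition(8, "dept").coalesce(2)
print(narrow.rdd.getNumPartitions())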

Utility Functions:

● udf(): Create a user-defined function.
● lit(): Create a column of literal value.
● when(): Evaluate a condition.
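
For example, again using the illustrative people DataFrame from above:

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# udf(): wrap a plain Python function so it can be applied to a column
initials = F.udf(lambda name: name[0].upper() if name else None, StringType())

# lit(): a literal column; when()/otherwise(): a conditional expression
people.select(
    "name",
    initials("name").alias("initial"),
    F.lit("employee").alias("kind"),
    F.when(F.col("age") >= 30, "senior").otherwise("junior").alias("band"),
).show()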

Statistics and Linear Algebra (MLlib):

● Statistics.colStats(): Column statistics.
● Statistics.corr(): Correlation between two series.
● DenseVector(): Create a dense vector.
● SparseVector(): Create a sparse vector.
● RowMatrix(): Create a row matrix.
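
These live in the older RDD-based pyspark.mllib package; a small sketch with made-up numbers, assuming a SparkSession named spark:

from pyspark.mllib.stat import Statistics
from pyspark.mllib.linalg import DenseVector, SparseVector, Vectors
from pyspark.mllib.linalg.distributed import RowMatrix

# Dense and sparse vectors (sparse: size, non-zero indices, values)
dv = DenseVector([1.0, 0.0, 3.0])
sv = SparseVector(3, [0, 2], [1.0, 3.0])

# colStats(): column-wise summary statistics over an RDD of vectors
rows = spark.sparkContext.parallelize(
    [Vectors.dense(1.0, 10.0), Vectors.dense(2.0, 20.0), Vectors.dense(3.0, 30.0)])
summary = Statistics.colStats(rows)
print(summary.mean(), summary.variance())

# corr(): correlation between two RDDs of doubles
x = spark.sparkContext.parallelize([1.0, 2.0, 3.0])
y = spark.sparkContext.parallelize([10.0, 20.0, 30.0])
print(Statistics.corr(x, y, method="pearson"))

# RowMatrix: a distributed matrix backed by an RDD of rows
mat = RowMatrix(rows)
print(mat.numRows(), mat.numCols())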

Advanced Features:

● Window.partitionBy().orderBy(): Define a window specification (a WindowSpec).
● over(): Apply a window specification.
● lead() and lag(): Lead and lag functions in window operations.
● pivot(): Pivot data.
● explode(): Transform array or map column into multiple rows.
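
A sketch of window functions, pivot, and explode on made-up monthly sales data, again assuming a SparkSession named spark:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

sales = spark.createDataFrame(
    [("2024-01", "HR", 100), ("2024-02", "HR", 120),
     ("2024-01", "IT", 200), ("2024-02", "IT", 210)],
    ["month", "dept", "amount"],
)

# Window specification + over(): lag()/lead() within each department, ordered by month
w = Window.partitionBy("dept").orderBy("month")
sales.withColumn("prev_amount", F.lag("amount").over(w)) \
     .withColumn("next_amount", F.lead("amount").over(w)).show()

# pivot(): distinct months become columns
sales.groupBy("dept").pivot("month").sum("amount").show()

# explode(): one output row per element of an array column
tags = spark.createDataFrame([("Alice", ["hr", "admin"]), ("Bob", ["it"])], ["name", "tags"])
tags.select("name", F.explode("tags").alias("tag")).show()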

Other Functions and Methods:

● functions.concat(): Concatenate two or more columns.
● functions.substring(): Extract a substring.
● functions.year() and functions.month(): Extract year and month.
● functions.dayofyear() and functions.dayofmonth(): Extract day.
● functions.round(): Round numbers.
● functions.length(): Compute the length of a string.
● functions.size(): Compute the size of a list or map.
● functions.isnan(): Check for NaN values.
● functions.isnull(): Check for NULL values.
● functions.rand(): Generate random values.
● functions.split(): Split a string.
● functions.array(): Create an array.
● functions.array_contains(): Check if an array contains a value.
● functions.create_map(): Create a map column.
● functions.map_keys() and functions.map_values(): Access keys and
values of a map.
● functions.struct(): Create a struct.
● functions.from_json() and functions.to_json(): Work with JSON.
● functions.current_date() and functions.current_timestamp(): Get
current date and time.
● functions.date_add() and functions.date_sub(): Add or subtract days
from a date.
● functions.datediff(): Compute the difference in days between two dates.
● SparkContext.addFile() and SparkFiles.get(): Distribute auxiliary files
(e.g., Python files, data files) required by tasks.
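
A single select() can exercise most of these column functions; the data and the file path below are illustrative, and a SparkSession named spark is assumed:

from pyspark.sql import functions as F
from pyspark import SparkFiles

df = spark.createDataFrame([("Alice Smith", "2024-03-15", ["a", "b"])],
                           ["full_name", "hired", "tags"])

df.select(
    F.concat(F.col("full_name"), F.lit(" (staff)")).alias("label"),
    F.substring("full_name", 1, 5).alias("prefix"),
    F.length("full_name").alias("name_len"),
    F.split("full_name", " ").alias("parts"),
    F.size("tags").alias("n_tags"),
    F.array_contains("tags", "a").alias("has_a"),
    F.year(F.to_date("hired")).alias("hire_year"),
    F.month(F.to_date("hired")).alias("hire_month"),
    F.datediff(F.current_date(), F.to_date("hired")).alias("days_since_hire"),
    F.date_add(F.to_date("hired"), 90).alias("review_date"),
    F.round(F.rand() * 100, 2).alias("random_score"),
    F.create_map(F.lit("dept"), F.lit("HR")).alias("attrs"),
    F.struct("full_name", "hired").alias("record"),
    F.to_json(F.struct("full_name", "tags")).alias("as_json"),
).show(truncate=False)

# addFile()/SparkFiles.get(): ship an auxiliary file to every executor
spark.sparkContext.addFile("config/lookup.txt")      # illustrative path
local_path = SparkFiles.get("lookup.txt")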

By: Waleed Mousa
