
#_ Important PySpark Operations [ +100 ]

RDD (Resilient Distributed Dataset) Operations:

● parallelize(): Create an RDD.
● map(): Transform each element of the RDD.
● filter(): Return a new RDD with only the elements that satisfy a
condition.
● reduce(): Aggregate RDD elements using a function.
● collect(): Return all the elements of the RDD.
● count(): Count the RDD's elements.
● first(): Return the first element of the RDD.
● take(): Return the first 'n' elements of the RDD.
● foreach(): Apply a function to each element of the RDD.
● groupByKey(): Group values with the same key.
● reduceByKey(): Reduce values with the same key using a function.
● sortBy(): Sort the RDD.
● join(): Join two RDDs.
● union(): Return a new RDD that contains the union of the elements
in the source RDD and another RDD.
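
A minimal sketch of these RDD operations in a local PySpark session; the data and variable names are purely illustrative:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

# parallelize(): create an RDD from a local collection
nums = sc.parallelize([1, 2, 3, 4, 5])

# map() / filter(): element-wise transformations
squares = nums.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# reduce(), count(), first(), take(), collect(): actions that return results to the driver
total = squares.reduce(lambda a, b: a + b)                 # 55
print(total, squares.count(), squares.first(), squares.take(3), evens.collect())

# key-value operations: reduceByKey(), groupByKey(), sortBy()
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
print(pairs.reduceByKey(lambda a, b: a + b).collect())     # e.g. [('a', 4), ('b', 2)]
print(pairs.groupByKey().mapValues(list).collect())
print(pairs.sortBy(lambda kv: kv[1]).collect())

# join() and union() combine two RDDs
other = sc.parallelize([("a", "x"), ("b", "y")])
print(pairs.join(other).collect())
print(nums.union(sc.parallelize([6, 7])).collect())

# foreach(): run a side-effecting function on each element (on the executors)
nums.foreach(lambda x: None)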

DataFrame Operations:

● createDataFrame(): Create a DataFrame from an RDD or list.
● select(): Select specific columns from a DataFrame.
● filter() or where(): Filter rows in a DataFrame.
● groupBy(): Group by a column or columns.
● orderBy() or sort(): Sort by one or more columns.
● drop(): Drop a column.
● withColumn(): Add or replace a column.
● withColumnRenamed(): Rename a column.
● join(): Join two DataFrames.
● describe(): Compute summary statistics.
● dropna(): Drop rows with null values.
● fillna(): Fill null values.
● agg(): Aggregate data after grouping.

● distinct(): Return distinct rows.
● limit(): Limit the number of rows.
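
A short DataFrame sketch covering most of the calls above; the people/depts data is made up for illustration:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("df-demo").getOrCreate()

# createDataFrame(): build a DataFrame from a list of tuples plus column names
people = spark.createDataFrame(
    [("Alice", "HR", 30), ("Bob", "IT", None), ("Cara", "IT", 25)],
    ["name", "dept", "age"],
)

# select(), where(), withColumn(), withColumnRenamed(), drop()
adults = (people
          .select("name", "dept", "age")
          .where(F.col("age").isNotNull())
          .withColumn("age_next_year", F.col("age") + 1)
          .withColumnRenamed("dept", "department")
          .drop("age_next_year"))

# groupBy() + agg(), then orderBy()
adults.groupBy("department").agg(F.avg("age").alias("avg_age")).orderBy("avg_age").show()

# join(), dropna(), fillna(), describe(), distinct(), limit()
depts = spark.createDataFrame([("HR", "Building A"), ("IT", "Building B")], ["dept", "location"])
people.join(depts, on="dept", how="left").show()
people.dropna(subset=["age"]).show()
people.fillna({"age": 0}).show()
people.describe("age").show()
people.select("dept").distinct().limit(10).show()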

SparkSQL Operations:

● spark.sql(): Execute SQL queries.
● createOrReplaceTempView(): Create a temporary view.
● createGlobalTempView(): Create a global temporary view.
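
Building on the illustrative people DataFrame from the sketch above, a temporary view lets you mix SQL with the DataFrame API:

# createOrReplaceTempView(): session-scoped view, queried via spark.sql()
people.createOrReplaceTempView("people")
spark.sql("SELECT dept, COUNT(*) AS n FROM people GROUP BY dept").show()

# createGlobalTempView(): shared across sessions, lives in the reserved global_temp database
people.createGlobalTempView("people_global")
spark.sql("SELECT * FROM global_temp.people_global").show()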

Data Sources and Writing Data:

● read.csv(): Read data from a CSV file.
● write.csv(): Write data to a CSV file.
● read.json(): Read data from a JSON file.
● write.json(): Write data to a JSON file.
● read.parquet(): Read data from a Parquet file.
● write.parquet(): Write data to a Parquet file.
● read.jdbc(): Read data from a JDBC source.
● write.jdbc(): Write data to a JDBC source.
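
A sketch of reading and writing each format with an existing SparkSession named spark; all paths, the JDBC URL, the table names, and the credentials are placeholders:

# Reading: CSV, JSON, Parquet
df_csv = spark.read.csv("data/people.csv", header=True, inferSchema=True)
df_json = spark.read.json("data/people.json")
df_parquet = spark.read.parquet("data/people.parquet")

# Writing: mode controls what happens if the target already exists
df_csv.write.csv("out/people_csv", header=True, mode="overwrite")
df_json.write.json("out/people_json", mode="overwrite")
df_parquet.write.parquet("out/people_parquet", mode="overwrite")

# JDBC round trip (requires the JDBC driver jar on the classpath)
jdbc_url = "jdbc:postgresql://localhost:5432/mydb"
props = {"user": "myuser", "password": "secret", "driver": "org.postgresql.Driver"}
df_db = spark.read.jdbc(url=jdbc_url, table="public.people", properties=props)
df_db.write.jdbc(url=jdbc_url, table="public.people_copy", mode="append", properties=props)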

MLlib - Machine Learning Library:

● VectorAssembler(): Assemble feature vectors.
● StringIndexer(): Convert string columns to numeric.
● OneHotEncoder(): One-hot encode categorical features.
● StandardScaler(): Scale features.
● LinearRegression(): Linear regression model.
● DecisionTreeClassifier(): Decision tree classification model.
● KMeans(): K-means clustering.
● CrossValidator(): Cross-validation for model selection.
● TrainValidationSplit(): Train-validation for hyperparameter
tuning.
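
A small pyspark.ml pipeline wiring several of these stages together; the training data, column names, and values are made up for illustration, and a SparkSession named spark is assumed:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler, StandardScaler
from pyspark.ml.regression import LinearRegression

train = spark.createDataFrame(
    [("HR", 30.0, 50000.0), ("IT", 25.0, 60000.0),
     ("IT", 40.0, 80000.0), ("HR", 35.0, 55000.0)],
    ["dept", "age", "salary"],
)

indexer = StringIndexer(inputCol="dept", outputCol="dept_idx")          # string -> numeric index
encoder = OneHotEncoder(inputCols=["dept_idx"], outputCols=["dept_vec"])
assembler = VectorAssembler(inputCols=["dept_vec", "age"], outputCol="features_raw")
scaler = StandardScaler(inputCol="features_raw", outputCol="features")
lr = LinearRegression(featuresCol="features", labelCol="salary")

model = Pipeline(stages=[indexer, encoder, assembler, scaler, lr]).fit(train)
model.transform(train).select("dept", "age", "prediction").show()

# CrossValidator and TrainValidationSplit wrap an estimator, a parameter grid,
# and an evaluator for hyperparameter tuning; they follow the same fit()/transform() pattern.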

GraphX Operations (Scala/Java API; GraphX is not exposed in PySpark):

● Graph(): Create a graph.
● vertices: Access vertices of a graph.
● edges: Access edges of a graph.
● triplets: Access triplets of a graph.
● inDegrees: Compute the in-degree of each vertex.
● outDegrees: Compute the out-degree of each vertex.
● subgraph(): Generate a subgraph.
● mapVertices(): Transform the vertices of a graph.
● mapEdges(): Transform the edges of a graph.

Streaming:

● StreamingContext(): Create a streaming context.
● updateStateByKey(): Maintain stateful information.
● window(): Return a new DStream computed based on windowed batches.
● reduceByKeyAndWindow(): Reduce by key over a window.
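
A sketch using the older DStream API; it assumes an existing SparkSession named spark and a text source on localhost:9999 (for example one started with nc -lk 9999), and the host, port, and checkpoint path are placeholders:

from pyspark.streaming import StreamingContext

ssc = StreamingContext(spark.sparkContext, batchDuration=5)
ssc.checkpoint("checkpoint/")          # needed for stateful and inverse-window operations

lines = ssc.socketTextStream("localhost", 9999)
pairs = lines.flatMap(lambda line: line.split()).map(lambda w: (w, 1))

# window(): lines.window(30, 10) would expose the raw records over the same window
# reduceByKeyAndWindow(): word counts over a 30-second window, sliding every 10 seconds
windowed = pairs.reduceByKeyAndWindow(lambda a, b: a + b, lambda a, b: a - b, 30, 10)
windowed.pprint()

# updateStateByKey(): running totals maintained across all batches
def update(new_values, running):
    return sum(new_values) + (running or 0)

pairs.updateStateByKey(update).pprint()

ssc.start()
ssc.awaitTermination()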

Performance and Optimization:

● cache() or persist(): Cache an RDD or DataFrame.
● unpersist(): Remove data from memory.
● broadcast(): Broadcast a read-only variable.
● repartition(): Repartition the data.
● coalesce(): Decrease the number of partitions.
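
A sketch of these calls, reusing the illustrative people DataFrame and spark session from the DataFrame example above:

from pyspark.sql import functions as F

# cache()/persist() keep a frequently reused DataFrame in memory; unpersist() frees it
people.cache()
people.count()                      # the first action materializes the cache
people.unpersist()

# broadcast(): ship a small read-only lookup table to every executor once
lookup = spark.sparkContext.broadcast({"HR": "Human Resources", "IT": "Information Technology"})
expand = F.udf(lambda code: lookup.value.get(code, code))
people.withColumn("dept_full", expand("dept")).show()

# repartition() can increase or rebalance partitions (full shuffle);
# coalesce() only decreases them and avoids a full shuffle
print(people.rdd.getNumPartitions())
narrow = people.repartition(8, "dept").coalesce(2)
print(narrow.rdd.getNumPartitions())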

Utility Functions:

● udf(): Create a user-defined function.
● lit(): Create a column of literal value.
● when(): Evaluate a condition.
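
For example, again using the illustrative people DataFrame from above:

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# udf(): wrap a plain Python function so it can be applied to a column
initials = F.udf(lambda name: name[0].upper() if name else None, StringType())

# lit(): a literal column; when()/otherwise(): a conditional expression
people.select(
    "name",
    initials("name").alias("initial"),
    F.lit("employee").alias("kind"),
    F.when(F.col("age") >= 30, "senior").otherwise("junior").alias("band"),
).show()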

Statistics and Linear Algebra (MLlib):

● Statistics.colStats(): Column statistics.
● Statistics.corr(): Correlation between two series.
● DenseVector(): Create a dense vector.
● SparseVector(): Create a sparse vector.
● RowMatrix(): Create a row matrix.
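
These live in the older RDD-based pyspark.mllib package; a small sketch with made-up numbers, assuming a SparkSession named spark:

from pyspark.mllib.stat import Statistics
from pyspark.mllib.linalg import DenseVector, SparseVector, Vectors
from pyspark.mllib.linalg.distributed import RowMatrix

# Dense and sparse vectors (sparse: size, non-zero indices, values)
dv = DenseVector([1.0, 0.0, 3.0])
sv = SparseVector(3, [0, 2], [1.0, 3.0])

# colStats(): column-wise summary statistics over an RDD of vectors
rows = spark.sparkContext.parallelize(
    [Vectors.dense(1.0, 10.0), Vectors.dense(2.0, 20.0), Vectors.dense(3.0, 30.0)])
summary = Statistics.colStats(rows)
print(summary.mean(), summary.variance())

# corr(): correlation between two RDDs of doubles
x = spark.sparkContext.parallelize([1.0, 2.0, 3.0])
y = spark.sparkContext.parallelize([10.0, 20.0, 30.0])
print(Statistics.corr(x, y, method="pearson"))

# RowMatrix: a distributed matrix backed by an RDD of rows
mat = RowMatrix(rows)
print(mat.numRows(), mat.numCols())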

Advanced Features:

● Window.partitionBy().orderBy(): Define a window specification (a WindowSpec).
● over(): Apply a window specification.
● lead() and lag(): Lead and lag functions in window operations.
● pivot(): Pivot data.
● explode(): Transform array or map column into multiple rows.
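
A sketch of window functions, pivot, and explode on made-up monthly sales data, again assuming a SparkSession named spark:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

sales = spark.createDataFrame(
    [("2024-01", "HR", 100), ("2024-02", "HR", 120),
     ("2024-01", "IT", 200), ("2024-02", "IT", 210)],
    ["month", "dept", "amount"],
)

# Window specification + over(): lag()/lead() within each department, ordered by month
w = Window.partitionBy("dept").orderBy("month")
sales.withColumn("prev_amount", F.lag("amount").over(w)) \
     .withColumn("next_amount", F.lead("amount").over(w)).show()

# pivot(): distinct months become columns
sales.groupBy("dept").pivot("month").sum("amount").show()

# explode(): one output row per element of an array column
tags = spark.createDataFrame([("Alice", ["hr", "admin"]), ("Bob", ["it"])], ["name", "tags"])
tags.select("name", F.explode("tags").alias("tag")).show()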

Other Functions and Methods:

● functions.concat(): Concatenate two or more columns.
● functions.substring(): Extract a substring.
● functions.year() and functions.month(): Extract year and month.
● functions.dayofyear() and functions.dayofmonth(): Extract day.
● functions.round(): Round numbers.
● functions.length(): Compute the length of a string.
● functions.size(): Compute the size of a list or map.
● functions.isnan(): Check for NaN values.
● functions.isnull(): Check for NULL values.
● functions.rand(): Generate random values.
● functions.split(): Split a string.
● functions.array(): Create an array.
● functions.array_contains(): Check if an array contains a value.
● functions.create_map(): Create a map column.
● functions.map_keys() and functions.map_values(): Access keys and
values of a map.
● functions.struct(): Create a struct.
● functions.from_json() and functions.to_json(): Work with JSON.
● functions.current_date() and functions.current_timestamp(): Get
current date and time.
● functions.date_add() and functions.date_sub(): Add or subtract days
from a date.
● functions.datediff(): Compute the difference in days between two dates.
● SparkContext.addFile() and SparkFiles.get(): Distribute auxiliary files
(e.g., Python files, data files) required by tasks.
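
A single select() can exercise most of these column functions; the data and the file path below are illustrative, and a SparkSession named spark is assumed:

from pyspark.sql import functions as F
from pyspark import SparkFiles

df = spark.createDataFrame([("Alice Smith", "2024-03-15", ["a", "b"])],
                           ["full_name", "hired", "tags"])

df.select(
    F.concat(F.col("full_name"), F.lit(" (staff)")).alias("label"),
    F.substring("full_name", 1, 5).alias("prefix"),
    F.length("full_name").alias("name_len"),
    F.split("full_name", " ").alias("parts"),
    F.size("tags").alias("n_tags"),
    F.array_contains("tags", "a").alias("has_a"),
    F.year(F.to_date("hired")).alias("hire_year"),
    F.month(F.to_date("hired")).alias("hire_month"),
    F.datediff(F.current_date(), F.to_date("hired")).alias("days_since_hire"),
    F.date_add(F.to_date("hired"), 90).alias("review_date"),
    F.round(F.rand() * 100, 2).alias("random_score"),
    F.create_map(F.lit("dept"), F.lit("HR")).alias("attrs"),
    F.struct("full_name", "hired").alias("record"),
    F.to_json(F.struct("full_name", "tags")).alias("as_json"),
).show(truncate=False)

# addFile()/SparkFiles.get(): ship an auxiliary file to every executor
spark.sparkContext.addFile("config/lookup.txt")      # illustrative path
local_path = SparkFiles.get("lookup.txt")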

By: Waleed Mousa
