ETL Commands for PySpark
By: Waleed Mousa
This document is a cheat sheet of ETL commands for PySpark, organized into sections covering basic and advanced DataFrame operations, data transformation, data profiling, data visualization, data import/export, machine learning, graph processing, and performance tuning.
● UDF (User Defined Function): from pyspark.sql.functions import udf; udf_function = udf(lambda z: custom_function(z))
● String Operations: from pyspark.sql.functions import lower, upper; df.select(upper(df["column"]))
● Date and Time Functions: from pyspark.sql.functions import current_date, current_timestamp; df.select(current_date())
● Numeric Functions: from pyspark.sql.functions import abs, sqrt; df.select(abs(df["column"]))
● Conditional Expressions: from pyspark.sql.functions import when; df.select(when(df["column"] > value, "true").otherwise("false"))
● Type Casting: df.withColumn("column", df["column"].cast("new_type"))
● Explode Function (Array to Rows): from pyspark.sql.functions import explode; df.withColumn("exploded_column", explode(df["array_column"]))
● Pandas UDF: from pyspark.sql.functions import pandas_udf; @pandas_udf("return_type") def pandas_function(col1, col2): return operation
● Aggregating with Custom Functions: df.groupBy("column").agg(custom_agg_function(df["another_column"]))
● Window Functions (Rank, Lead, Lag): from pyspark.sql.window import Window; from pyspark.sql.functions import rank, lead, lag; windowSpec = Window.orderBy("column"); df.withColumn("rank", rank().over(windowSpec))
● Handling JSON Columns: from pyspark.sql.functions import from_json; df.withColumn("parsed_json", from_json(df["json_column"], json_schema)) (json_schema is a DDL string or StructType)
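A minimal sketch combining several of the transformations above on a small, made-up DataFrame (column names and sample data are assumptions):

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, when, col, rank, explode
from pyspark.sql.types import StringType
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("transform-demo").getOrCreate()

# Hypothetical sample data: (name, score, tags)
df = spark.createDataFrame(
    [("alice", 85, ["a", "b"]), ("bob", 72, ["c"]), ("carol", 91, ["a"])],
    ["name", "score", "tags"],
)

# Plain Python function wrapped as a UDF
shout = udf(lambda s: s.upper(), StringType())

result = (
    df.withColumn("name_upper", shout(col("name")))                              # UDF
      .withColumn("passed", when(col("score") > 80, "true").otherwise("false"))  # conditional
      .withColumn("rank", rank().over(Window.orderBy(col("score").desc())))      # window function
      .withColumn("tag", explode(col("tags")))                                   # array -> rows
)
result.show()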
5. Data Profiling
● Column Value Counts: df.groupBy("column").count()
● Summary Statistics for Numeric Columns: df.describe()
● Correlation Between Columns: df.stat.corr("column1", "column2")
● Crosstabulation and Contingency Tables: df.stat.crosstab("column1", "column2")
● Frequent Items in Columns: df.stat.freqItems(["column1", "column2"])
● Approximate Quantile Calculation: df.approxQuantile("column", [0.25, 0.5, 0.75], relativeError)
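A short profiling sketch using the calls above; it assumes a DataFrame df with numeric columns "col1" and "col2" and a categorical column "category" (placeholder names):

df.groupBy("category").count().show()           # value counts per category
df.describe("col1", "col2").show()              # count / mean / stddev / min / max
print(df.stat.corr("col1", "col2"))             # Pearson correlation
df.stat.crosstab("category", "col1").show()     # contingency table
print(df.stat.freqItems(["col1", "col2"]).collect())
print(df.approxQuantile("col1", [0.25, 0.5, 0.75], 0.01))  # quartiles, 1% relative error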
6. Data Visualization (Integration with other libraries)
● Convert to Pandas for Visualization: df.toPandas().plot(kind='bar')
● Histograms using Matplotlib: df.toPandas()["column"].hist()
● Box Plots using Seaborn: import seaborn as sns; sns.boxplot(x=df.toPandas()["column"])
● Scatter Plots using Matplotlib: df.toPandas().plot.scatter(x='col1', y='col2')
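A minimal plotting sketch, assuming df is small enough to collect (toPandas() brings everything to the driver, so sample or aggregate first for large data; column names are placeholders):

import matplotlib.pyplot as plt
import seaborn as sns

pdf = df.select("col1", "col2").toPandas()

pdf["col1"].hist(bins=20)             # histogram with matplotlib
plt.show()

sns.boxplot(x=pdf["col1"])            # box plot with seaborn
plt.show()

pdf.plot.scatter(x="col1", y="col2")  # scatter plot
plt.show()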
7. Data Import/Export
● Reading Data from JDBC Sources: spark.read.format("jdbc").options(url="jdbc_url", dbtable="table_name").load()
● Writing Data to JDBC Sources: df.write.format("jdbc").options(url="jdbc_url", dbtable="table_name").save()
● Reading Data from HDFS: spark.read.text("hdfs://path/to/file")
● Writing Data to HDFS: df.write.save("hdfs://path/to/output")
● Creating DataFrames from Hive Tables: spark.table("hive_table_name")
● Coalesce Partitions: df.coalesce(numPartitions)
● Reading Data in Chunks (Structured Streaming): spark.readStream.schema(schema).option("maxFilesPerTrigger", 1).csv("path/to/dir")
● Optimizing Data for Skewed Joins: df.repartition("skewed_column")
● Handling Data Skew in Joins: df1.join(df2.hint("broadcast"), "column")
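A small partitioning/skew sketch; df_large, df_small, the key column, and the partition counts are assumptions:

from pyspark.sql.functions import broadcast

# Spread a skewed key over more partitions before the join.
balanced = df_large.repartition(200, "customer_id")

# Broadcast the small side so the large, skewed table is not shuffled.
joined = balanced.join(broadcast(df_small), "customer_id")

# Reduce the number of output files after heavy filtering.
joined.coalesce(10).write.mode("overwrite").parquet("path/to/output")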
9. Spark SQL
● Running SQL Queries on DataFrames: df.createOrReplaceTempView("table"); spark.sql("SELECT * FROM table")
● Registering UDF for SQL Queries: spark.udf.register("udf_name", lambda x: custom_function(x))
● Using SQL Functions in DataFrames: from pyspark.sql.functions import expr; df.withColumn("new_column", expr("SQL_expression"))
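A Spark SQL sketch combining a temp view, a registered UDF, and expr(); the users DataFrame and its columns are hypothetical:

from pyspark.sql.functions import expr
from pyspark.sql.types import StringType

users.createOrReplaceTempView("users")

# Register a Python UDF so it can be called from SQL.
spark.udf.register("initials", lambda name: "".join(w[0].upper() for w in name.split()), StringType())

spark.sql("""
    SELECT country, initials(name) AS initials, COUNT(*) OVER (PARTITION BY country) AS per_country
    FROM users
""").show()

# The same SQL expressions can be used directly on DataFrames via expr().
users.withColumn("name_upper", expr("upper(name)")).show()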
10. Machine Learning and Advanced Analytics
● VectorAssembler for Feature Vectors: from pyspark.ml.feature import VectorAssembler; assembler = VectorAssembler(inputCols=["col1", "col2"], outputCol="features")
● StandardScaler for Feature Scaling: from pyspark.ml.feature import StandardScaler; scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures")
● Building a Machine Learning Pipeline: from pyspark.ml import Pipeline; pipeline = Pipeline(stages=[assembler, scaler, ml_model])
● Train-Test Split: train, test = df.randomSplit([0.7, 0.3])
● Model Fitting and Predictions: model = pipeline.fit(train); predictions = model.transform(test)
● Cross-Validation for Model Tuning: from pyspark.ml.tuning import CrossValidator; crossval = CrossValidator(estimator=pipeline, estimatorParamMaps=paramGrid, evaluator=evaluator)
● Hyperparameter Tuning: from pyspark.ml.tuning import ParamGridBuilder; paramGrid = ParamGridBuilder().addGrid(model.param, [value1, value2]).build()
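An end-to-end pipeline sketch tying the pieces above together; it assumes df has numeric columns "col1", "col2" and a binary "label" column, and uses logistic regression as the example model:

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import StandardScaler, VectorAssembler
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

assembler = VectorAssembler(inputCols=["col1", "col2"], outputCol="raw_features")
scaler = StandardScaler(inputCol="raw_features", outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[assembler, scaler, lr])

train, test = df.randomSplit([0.7, 0.3], seed=42)

paramGrid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()
evaluator = BinaryClassificationEvaluator(labelCol="label")
crossval = CrossValidator(estimator=pipeline, estimatorParamMaps=paramGrid,
                          evaluator=evaluator, numFolds=3)

model = crossval.fit(train)
predictions = model.transform(test)
print(evaluator.evaluate(predictions))  # area under ROC by default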
11. Graph and Network Analysis
● Creating a GraphFrame: from graphframes import GraphFrame; g = GraphFrame(vertices_df, edges_df)
● Time Series Window Functions: from pyspark.sql.functions import window; df.groupBy(window("timestamp", "1 hour")).mean()
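A time-series aggregation sketch for the windowed grouping above; the timestamps and values are made-up sample data:

from pyspark.sql.functions import window, avg, col

events = spark.createDataFrame(
    [("2024-01-01 10:05:00", 3.0), ("2024-01-01 10:40:00", 5.0), ("2024-01-01 11:10:00", 7.0)],
    ["ts", "value"],
).withColumn("ts", col("ts").cast("timestamp"))

# Average value per 1-hour tumbling window.
events.groupBy(window("ts", "1 hour")).agg(avg("value").alias("avg_value")).show(truncate=False)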
21. Advanced Machine Learning Operations
● Custom Machine Learning Models with MLlib: from pyspark.ml.classification import LogisticRegression; lr = LogisticRegression()
● Text Analysis with MLlib: from pyspark.ml.feature import Tokenizer; tokenizer = Tokenizer(inputCol="text", outputCol="words")
● Model Evaluation and Metrics: from pyspark.ml.evaluation import BinaryClassificationEvaluator; evaluator = BinaryClassificationEvaluator()
● Model Persistence and Loading: model.save("path"); ModelType.load("path")
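A sketch of training, evaluating, and persisting a model, using a tiny hypothetical dataset with a precomputed "features" vector column and a placeholder save path:

from pyspark.ml.classification import LogisticRegression, LogisticRegressionModel
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.linalg import Vectors

train = spark.createDataFrame(
    [(Vectors.dense([0.0, 1.0]), 0.0), (Vectors.dense([2.0, 3.0]), 1.0)],
    ["features", "label"],
)

lr_model = LogisticRegression().fit(train)
predictions = lr_model.transform(train)
print(BinaryClassificationEvaluator().evaluate(predictions))  # area under ROC

# Persist and reload; the concrete Model class provides load().
lr_model.write().overwrite().save("/tmp/lr_model")
reloaded = LogisticRegressionModel.load("/tmp/lr_model")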
22. Graph Analysis with GraphFrames
● Creating GraphFrames for Network Analysis: from graphframes import GraphFrame; g = GraphFrame(vertices_df, edges_df)
● Applying Custom UDFs: df.select(custom_udf("column"))
● Vector Operations for ML Features: from pyspark.ml.linalg import Vectors, VectorUDT; from pyspark.sql.functions import udf; to_vector = udf(lambda x: Vectors.dense(x), VectorUDT()); df.withColumn("vector_col", to_vector("column"))
24. Logging and Monitoring
● Logging Operations in Spark: spark.sparkContext.setLogLevel("WARN")
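A small logging sketch: setLogLevel tames Spark's own output, while driver-side job code can use the standard Python logging module (logger name and the df placeholder are assumptions):

import logging

spark.sparkContext.setLogLevel("WARN")   # reduce Spark log verbosity

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("etl_job")
log.info("Loaded %d rows", df.count())   # df is a placeholder DataFrame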
25. Best Practices and Patterns
● Following Data Partitioning Best Practices: (Optimizing partition strategy for data size and operations)
● Efficient Data Serialization: (Using Kryo serialization for performance)
● Optimizing Data Locality: (Ensuring data is close to computation resources)
● Error Handling and Recovery Strategies: (Implementing try/except logic and checkpointing)
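A sketch illustrating two of these practices: Kryo serialization configured at session build time plus checkpointing inside a try/except block. Paths, app name, and column names are placeholders:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("etl-job")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)
spark.sparkContext.setCheckpointDir("hdfs://path/to/checkpoints")

try:
    df = spark.read.parquet("hdfs://path/to/input")
    # Checkpointing truncates a long lineage so a failure does not recompute everything.
    df = df.repartition("partition_column").checkpoint()
    df.write.mode("overwrite").partitionBy("partition_column").parquet("hdfs://path/to/output")
except Exception as exc:
    # Minimal recovery pattern: log and re-raise (or divert to a dead-letter path).
    print(f"ETL step failed: {exc}")
    raise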
26. Security and Compliance
● Data Encryption and Security: (Configuring Spark with encryption and security features)
● GDPR Compliance and Data Anonymization: (Implementing data masking and anonymization)
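An anonymization sketch: pseudonymize a direct identifier with a salted hash and drop free-text PII columns. Column names and the salt handling are illustrative assumptions, not a complete GDPR solution:

from pyspark.sql.functions import sha2, col, lit, concat

salt = "application-managed-secret"
anonymized = (
    df.withColumn("email_hash", sha2(concat(col("email"), lit(salt)), 256))
      .drop("email", "full_name")
)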
27. Advanced Data Science Techniques
● Deep Learning Integration (e.g., with TensorFlow): (Using Spark with TensorFlow for distributed deep learning)
● Complex Event Processing in Streams: (Using Structured Streaming for event pattern detection)
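A minimal Structured Streaming sketch using the built-in "rate" source; the threshold-based "pattern" is an illustrative stand-in for real event-processing logic, and the threshold value is arbitrary:

from pyspark.sql.functions import window, count, col

stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

suspicious = (
    stream.withWatermark("timestamp", "1 minute")
          .groupBy(window("timestamp", "30 seconds"))
          .agg(count("*").alias("events"))
          .where(col("events") > 200)        # flag unusually busy windows
)

query = suspicious.writeStream.outputMode("update").format("console").start()
# query.awaitTermination()  # uncomment to block in a standalone script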
28. Cloud Integration
● Running Spark on Cloud Platforms (e.g., AWS, Azure, GCP): (Setting up Spark clusters on cloud services)
● Integrating with Cloud Storage Services: (Reading and writing data to cloud storage like S3, ADLS, GCS)
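A cloud-storage sketch for S3 via the s3a connector; it assumes the hadoop-aws jar (and matching AWS SDK) is on the classpath, and the bucket names and keys are placeholders. Prefer instance profiles or the default credential provider chain over hard-coded keys:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("cloud-io")
    .config("spark.hadoop.fs.s3a.access.key", "AKIA...")
    .config("spark.hadoop.fs.s3a.secret.key", "...")
    .getOrCreate()
)

sales = spark.read.parquet("s3a://my-bucket/raw/sales/")
sales.groupBy("region").count().write.mode("overwrite").parquet("s3a://my-bucket/curated/sales_by_region/")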