Python Data Exploratory Commands
This document provides a cheat sheet on exploratory data analysis (EDA) techniques that can be performed with PySpark. It lists over 40 techniques organized into categories like data loading, inspection, cleaning, transformation, SQL queries, statistical analysis, machine learning integration, and more. The techniques are concisely explained and include relevant code snippets using PySpark APIs and functions.
● Fill Missing Values: df.na.fill(value)
● Drop Column: df.drop('column_name')
● Rename Column: df.withColumnRenamed('old_name', 'new_name') (see the sketch below)
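The minimal sketch below chains these cleaning calls on a toy DataFrame; the column names, sample rows, and fill values are illustrative assumptions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cleaning-demo").getOrCreate()
df = spark.createDataFrame(
    [(1, None, "NY"), (2, 30, None)], ["id", "age", "city"])

df = df.na.fill({"age": 0})                      # fill missing values in one column
df = df.drop("city")                             # drop an unneeded column
df = df.withColumnRenamed("age", "age_years")    # rename a column
df.show()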
4. Data Transformation
● Select Columns: df.select('column1', 'column2')
● Add New or Transform Column: df.withColumn('new_column', expression)
● Filter Rows: df.filter(df['column'] > value)
● Group By and Aggregate: df.groupBy('column').agg({'column': 'sum'})
● Sort Rows: df.sort(df['column'].desc())
● Apply a Function to Each Partition: df.rdd.mapPartitions(lambda partition: [transform(row) for row in partition]) (combined sketch below)
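A minimal sketch combining the transformation commands above on a toy DataFrame; the column names, sample rows, and the per-partition transform() helper are assumptions for illustration.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("transform-demo").getOrCreate()
df = spark.createDataFrame(
    [("a", 1), ("a", 3), ("b", 2)], ["key", "value"])

out = (df.select("key", "value")
         .withColumn("value_x2", F.col("value") * 2)            # add a derived column
         .filter(F.col("value") > 1)                            # keep rows with value > 1
         .groupBy("key").agg(F.sum("value_x2").alias("total"))  # aggregate per key
         .sort(F.col("total").desc()))                          # sort descending
out.show()

# mapPartitions applies a plain Python function to the rows of each partition
def transform(row):
    return (row["key"], row["value"] * 10)

print(df.rdd.mapPartitions(lambda part: [transform(r) for r in part]).collect())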
38. Advanced Machine Learning Operations
● Using ML Pipelines: from pyspark.ml import Pipeline; pipeline = Pipeline(stages=[stage1, stage2]); model = pipeline.fit(df)
● Model Evaluation: from pyspark.ml.evaluation import BinaryClassificationEvaluator; evaluator = BinaryClassificationEvaluator(); evaluator.evaluate(predictions) (see the sketch below)
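A hedged sketch of a small pipeline plus evaluation; the VectorAssembler and LogisticRegression stages, the feature columns, and the toy data are assumptions chosen to make the example self-contained.

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("ml-demo").getOrCreate()
df = spark.createDataFrame(
    [(1.0, 2.0, 0.0), (2.0, 1.0, 1.0), (3.0, 4.0, 0.0), (4.0, 3.0, 1.0)],
    ["f1", "f2", "label"])

assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[assembler, lr])        # chain the two stages

model = pipeline.fit(df)                           # fit the whole pipeline at once
predictions = model.transform(df)
evaluator = BinaryClassificationEvaluator(labelCol="label")
print(evaluator.evaluate(predictions))             # area under ROC by default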
39. Optimization Techniques
● Broadcast Variables for Efficiency: from pyspark.sql.functions import broadcast; df.join(broadcast(df2), 'key')
● Using Accumulators for Global Aggregates: accumulator = spark.sparkContext.accumulator(0); rdd.foreach(lambda x: accumulator.add(x)) (see the sketch below)
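A minimal sketch of both optimization patterns; the toy tables, the 'key' join column, and the parallelized numbers are illustrative assumptions.

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("opt-demo").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["key", "val"])
df2 = spark.createDataFrame([(1, "x"), (2, "y")], ["key", "lookup"])

# Hint that the small table should be broadcast to every executor
joined = df.join(broadcast(df2), "key")
joined.show()

# Accumulators collect a global aggregate as a side effect of an action
acc = spark.sparkContext.accumulator(0)
spark.sparkContext.parallelize([1, 2, 3, 4]).foreach(lambda x: acc.add(x))
print(acc.value)   # 10 once the action has run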
40. Advanced Data Import/Export
● Reading Data from Multiple Sources: df = spark.read.format('format').option('option', 'value').load(['path1', 'path2'])
● Writing Data to Multiple Formats: df.write.format('format').save('path', mode='overwrite') (see the sketch below)
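A hedged sketch of a multi-path read and writes to two formats; the CSV/Parquet/JSON choices and the file paths are assumptions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("io-demo").getOrCreate()

# Read several CSV files into a single DataFrame (load accepts a list of paths)
df = (spark.read.format("csv")
          .option("header", "true")
          .load(["data/part1.csv", "data/part2.csv"]))

# Write the same DataFrame out in two different formats
df.write.format("parquet").save("out/parquet", mode="overwrite")
df.write.format("json").save("out/json", mode="overwrite")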
41. Utilizing External Data Sources
● Connecting to External Data Sources (e.g., Kafka, S3): df =
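A hedged sketch of one way such a connection can look; the broker address, topic name, and bucket path are assumptions, and the Kafka reader additionally requires the spark-sql-kafka connector package (and S3 reads require hadoop-aws plus credentials) on the session's classpath.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("external-demo").getOrCreate()

# Batch read from a Kafka topic (needs the spark-sql-kafka-0-10 package)
kafka_df = (spark.read.format("kafka")
                .option("kafka.bootstrap.servers", "broker:9092")
                .option("subscribe", "events")
                .load())

# Read Parquet data directly from S3 via the s3a filesystem
s3_df = spark.read.parquet("s3a://my-bucket/path/to/data/")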