PySpark Basics
PySpark is a powerful tool for processing and analyzing big data. It is aimed at data
engineers, data scientists, and ML practitioners who work with large-scale datasets in
distributed environments, with the goal of turning raw data into insights.
What is PySpark
• Apache Spark: An open-source distributed computing system.
• PySpark: The Python API for Spark, allowing parallel computation on large datasets.
• It’s suitable for:
o Batch processing
o Real-time streaming
o Machine learning (ML)
o SQL-based analytics
• Industries benefiting from PySpark include finance, healthcare, and e-commerce
due to its speed and scalability.
When to Use PySpark
PySpark shines when handling big data that exceeds single-machine capacity:
• Distributed data processing using Spark's in-memory architecture.
• ML on large datasets using Spark’s MLlib.
• ETL/ELT pipelines for structured transformation of raw data.
• Works well with formats like CSV, Parquet, and others.
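As a small, hedged sketch of the file-format point (the file paths here are hypothetical, and a SparkSession named spark is assumed, as created in the SparkSession section below):
# Read a CSV file, letting Spark infer column types
events_df = spark.read.csv("events.csv", header=True, inferSchema=True)
# Write the same data back out as Parquet, a columnar format that Spark
# can usually read back more efficiently than CSV
events_df.write.mode("overwrite").parquet("events_parquet")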
Spark Cluster
A Spark cluster consists of:
• Master node: Manages task distribution and resources.
• Worker nodes: Execute the computing tasks.
This setup allows parallel, distributed processing of large datasets, which is a core part
of PySpark’s power.
SparkSession
• SparkSession is the entry point for using PySpark.
• Enables access to Spark features like SQL, streaming, ML, and data handling.
• To create:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("MySparkApp").getOrCreate()
• .appName() names the application, which is useful when managing multiple applications.
• .getOrCreate() ensures you don’t accidentally start multiple sessions by returning the
existing session if one is already running. This helps avoid creating duplicate sessions,
which can waste memory or even crash the app.
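A minimal hedged sketch of this behavior: calling the builder a second time returns the same session object rather than starting a new one.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("MySparkApp").getOrCreate()
# A second call to getOrCreate() hands back the existing session
spark_again = SparkSession.builder.getOrCreate()
print(spark is spark_again)  # True: both names refer to the same session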
PySpark DataFrames
• Similar to Pandas DataFrames, but distributed for large-scale processing.
• Created using:
df = spark.read.csv("file.csv", header=True, inferSchema=True)
• df.show() displays contents.
• Syntax is intuitive for those familiar with Pandas, but underlying processing is
optimized for distributed environments.
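As a hedged illustration of the Pandas-like syntax (the column names "name" and "salary" are hypothetical and depend on the CSV that was loaded):
# Inspect the schema that inferSchema produced
df.printSchema()
# Select columns and filter rows, much as you would in Pandas
df.select("name", "salary").filter(df["salary"] > 50000).show()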
Introduction to PySpark DataFrames
• PySpark DataFrames vs Pandas:
o Pandas: Operates on a single machine — not ideal for huge datasets.
o PySpark: Designed for distributed computing — data is spread across a
cluster, enabling faster and scalable processing.
• Why PySpark DataFrames matter:
o Efficient for big data handling.
o Support common operations like filtering, grouping, and aggregating.
o Provide SQL-like syntax, but execute operations differently behind the scenes.
• Although syntax might feel familiar to those with Pandas experience, the
performance model is different due to distributed execution.
(Table: DataFrame function, its use case, and its equivalent in SQL.)
Takeaway:
PySpark DataFrames are essential for scalable, high-performance data processing. They:
• Resemble Pandas in syntax but work across clusters.
• Support SQL-like operations.
• Allow flexible, powerful analytics on massive datasets.
# Load the CSV file into a DataFrame
salaries_df = spark.read.csv("salaries.csv", header=True, inferSchema=True)
# Count the total number of rows
row_count = salaries_df.count()
print(f"Total rows: {row_count}")
# Group by company size and calculate the average of salaries
salaries_df.groupBy("company_size").agg({"salary_in_usd": "avg"}).show()
salaries_df.show()
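Since the notes mention SQL-like syntax, here is a hedged sketch of the same aggregation expressed through Spark SQL by registering the DataFrame as a temporary view:
# Register the DataFrame as a temporary view so it can be queried with SQL
salaries_df.createOrReplaceTempView("salaries")
# Same group-by average as above, written as a SQL query
spark.sql("""
    SELECT company_size, AVG(salary_in_usd) AS avg_salary
    FROM salaries
    GROUP BY company_size
""").show()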
# Average salary for entry-level roles at companies located in Canada ("CA")
CA_jobs = salaries_df.filter(salaries_df['company_location'] == "CA") \
    .filter(salaries_df['experience_level'] == "EN") \
    .groupBy().avg("salary_in_usd")
# Show the result
CA_jobs.show()
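The two chained filters above can also be combined into a single condition; a hedged, equivalent alternative using pyspark.sql.functions:
from pyspark.sql import functions as F
# Combine both conditions in one filter and give the aggregate a clear name
CA_entry_avg = salaries_df.filter(
    (F.col("company_location") == "CA") & (F.col("experience_level") == "EN")
).agg(F.avg("salary_in_usd").alias("avg_entry_salary_ca"))
CA_entry_avg.show()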
Union Operation
The union operation in PySpark combines or “stacks” two DataFrames with the same
schema (same number of columns, with matching types and order). This operation is crucial
when datasets are split across different sources or time periods and need to be consolidated
into one.
• Conditions for Union:
o The DataFrames must have the same number of columns and matching data
types for the operation to succeed. If they don’t match, PySpark will throw an
error.
o The union operation is typically used to combine data from separate files, such
as monthly sales data into a unified yearly dataset.
df_union = df1.union(df2)
o This stacks df2 below df1, creating a single DataFrame that consolidates rows
from both DataFrames.
• Schema Alignment: It is critical to ensure that the schemas of the two DataFrames
align correctly. Mismatched schemas or column types will prevent the union from
working.
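A minimal, hedged sketch of the monthly-sales scenario described above, using two small in-memory DataFrames (the column names and values are hypothetical):
# Two DataFrames with the same schema: (month, sales)
jan_df = spark.createDataFrame([("Jan", 1000), ("Jan", 1500)], ["month", "sales"])
feb_df = spark.createDataFrame([("Feb", 1200), ("Feb", 900)], ["month", "sales"])
# Stack feb_df below jan_df; both schemas match, so the union succeeds
sales_df = jan_df.union(feb_df)
sales_df.show()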
Joining DataFrames
• Examine the airports DataFrame. Note which key column will let you join airports to
the flights table.
• Join the flights with the airports DataFrame on the "dest" column. Save the result
as flights_with_airports.
• Examine flights_with_airports again. Note the new information that has been added.
# Examine the data
airports.show()

# .withColumnRenamed() renames the "faa" column to "dest"
airports = airports.withColumnRenamed("faa", "dest")

# Join the DataFrames on the shared "dest" column, keeping every flight
flights_with_airports = flights.join(airports, on='dest', how='leftouter')

# Examine the joined data to see the added airport information
flights_with_airports.show()
1. Introduction to UDFs
o UDF stands for User-Defined Function in PySpark.
o UDFs are custom functions that you define and use within PySpark DataFrames
to manipulate data in ways that built-in PySpark functions may not support.
o There are two main types of UDFs in PySpark: PySpark UDFs and pandas
UDFs.
2. UDFs for Repeatable Tasks
o UDFs are designed to be reusable and repeatable. This means that once
created and registered with the SparkSession, they can be applied multiple times
across different DataFrames.
o PySpark UDFs are suitable for handling smaller datasets.
o pandas UDFs are designed for larger datasets and typically offer better
performance in those cases.
o Both types of UDFs are used in much the same way, but they differ in how they
process data: PySpark UDFs work row by row, while pandas UDFs operate on
batches of rows, which is where their performance advantage comes from.
3. Defining and Registering a UDF
o To create a UDF, you first define a standard Python function. For example, a
function like to_upper_case() could be used to convert text data in a column to
uppercase.
o Once you define the function, you need to register it as a UDF using the .udf()
function in PySpark and specify the correct data type (e.g., StringType() for
string operations).
o Registration is important because it makes the UDF accessible across all
worker nodes in the Spark cluster through the SparkSession.
Example code:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Define a standard Python function
def to_upper_case(input_string):
    return input_string.upper()

# Register it as a UDF with its return type so it can be used on DataFrames
to_upper_case_udf = udf(to_upper_case, StringType())
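A hedged usage sketch, assuming a DataFrame df with a string column called "name" (both hypothetical):
# Apply the registered UDF to build an uppercase version of the column
df = df.withColumn("name_upper", to_upper_case_udf(df["name"]))
df.show()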
4. pandas UDF
o pandas UDFs are designed to work with larger datasets and provide better
performance.
o You define a pandas UDF using the @pandas_udf() decorator and specify the
return type (e.g., @pandas_udf("float")).
o Unlike PySpark UDFs, pandas UDFs do not need to be explicitly registered
with the SparkSession, and they are more efficient because they handle data in
bulk rather than row by row.
Example code:
from pyspark.sql.functions import pandas_udf
import pandas as pd
# Define a pandas UDF
@pandas_udf("float")
def calculate_square(s: pd.Series) -> pd.Series:
    return s * s
# Apply the pandas UDF to a DataFrame
df = df.withColumn("squared_column", calculate_square(df["numeric_column"]))
df.show()
5. PySpark UDFs vs. pandas UDFs
o PySpark UDFs are ideal for small datasets and simple transformations. They
operate at the column level and require registration with the SparkSession.
o pandas UDFs are better for large datasets because they leverage the power of
pandas, making them more efficient for processing large volumes of data.
o The decision to use one over the other depends on several factors, including:
▪ Data size: Small datasets work better with PySpark UDFs, while large
datasets benefit from pandas UDFs.
▪ Complexity: Simple transformations may be fine with PySpark UDFs,
but for more complex operations on large data, pandas UDFs offer
performance benefits.
▪ Performance considerations: pandas UDFs typically offer better
performance, but they may also require a more complex setup and
handling of data types.
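To make the comparison concrete, here is a hedged sketch of the same uppercase transformation written both ways, reusing the hypothetical df with a "name" column from the earlier examples:
from pyspark.sql.functions import udf, pandas_udf
from pyspark.sql.types import StringType
import pandas as pd

# Row-by-row PySpark UDF
upper_udf = udf(lambda s: s.upper(), StringType())

# Vectorized pandas UDF that receives a whole pd.Series per batch
@pandas_udf(StringType())
def upper_pandas_udf(s: pd.Series) -> pd.Series:
    return s.str.upper()

df.withColumn("name_upper", upper_udf(df["name"])).show()
df.withColumn("name_upper", upper_pandas_udf(df["name"])).show()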