Basic DataFrame Operations
Creating a SparkSession
The SparkSession is the entry point to programming with Spark SQL.
It allows you to create DataFrames, register DataFrames as tables, execute SQL over tables, cache tables, and read
parquet files.
SparkSession.builder: The builder attribute is a class attribute of SparkSession that provides a way to configure
and create a SparkSession instance.
appName("Example App"): The appName method sets the name of the Spark application. This name will appear
in the Spark web UI and can help you identify your application among others running on the same cluster.
config("spark.some.config.option", "some-value"): The config method allows you to set various configuration
options for the Spark session. In this example, " spark.some.config.option " is a placeholder for an actual
configuration key, and "some-value" is the value for that configuration. You can set multiple configuration options
by chaining multiple config calls.
getOrCreate(): The getOrCreate method either retrieves an existing SparkSession if one already exists or creates a
new one if it does not. This ensures that you do not accidentally create multiple SparkSession instances in your
application.
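A minimal sketch putting these calls together (the app name and config key/value are the placeholders described above):
%python
from pyspark.sql import SparkSession
# Build a new SparkSession, or return the existing one if already created
spark = SparkSession.builder \
    .appName("Example App") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()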
Note: In Databricks, you do not need to create or override the SparkSession as it is automatically created for each
notebook or job executed against the cluster. Databricks manages the SparkSession and SparkContext for you,
ensuring optimal configuration and resource usage.
Creating a DataFrame
1. From a Python List of Tuples
%python
# List of tuples
data = [("John", 25), ("Doe", 30), ("Jane", 22)]
# Creating DataFrame
df_list = spark.createDataFrame(data, ["Name", "Age"])
2. From a List of Dictionaries
%python
# List of dictionaries
data = [{"Name": "Alice", "Id": 1}, {"Name": "Bob", "Id": 2}, {"Name": "Cathy", "Id": 3}]
# Creating DataFrame
df_dict = spark.createDataFrame(data)
3. From a List of Rows
%python
from pyspark.sql import Row
# List of Rows
data = [ Row(Name="Cathy", Id=1),
Row(Name="David", Id=2),
Row(Name="Eva", Id=3),
Row(Name="Frank", Id=4)]
# Creating DataFrame
df_row = spark.createDataFrame(data)
+-----+---+
| Name| Id|
+-----+---+
|Cathy|  1|
|David|  2|
|  Eva|  3|
|Frank|  4|
+-----+---+
4. From an RDD
%python
# Import necessary modules
from pyspark.sql import Row
# Create an RDD
rdd = spark.sparkContext.parallelize([
    Row(Name="Alice", Age=25),
    Row(Name="Bob", Age=30),
    Row(Name="Cathy", Age=22),
    Row(Name="David", Age=35),
    Row(Name="Eva", Age=28),
    Row(Name="Frank", Age=40)
])
# Creating DataFrame from the RDD
df_rdd = spark.createDataFrame(rdd)
df_rdd.show()
+-----+---+
| Name|Age|
+-----+---+
|Alice| 25|
|  Bob| 30|
|Cathy| 22|
|David| 35|
|  Eva| 28|
|Frank| 40|
+-----+---+
.format("csv"): Specifies the format of the data source. In this case, it indicates that the data is in CSV (Comma-
Separated Values) format.
.option("header", "true"): This option tells Spark that the first row of the CSV file contains the column names. If this
option is set to false, Spark will treat the first row as data. "true" means that the CSV file has a header row.
.option("inferSchema", "true"): This option tells Spark to automatically infer the data types of each column in the
CSV file. If this option is set to false, all columns will be read as strings (default behavior). "true" means that Spark will
try to infer the schema (data types) of the columns based on the data.
.load("/FileStore/tables/retail_db/customers"):
This method specifies the path to the CSV file or directory containing CSV files that you want to read.
customer_df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("dbfs:/FileStore/tables/customers_300mb.csv")
%python
# Employee data and schemas
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, FloatType, DateType
from datetime import date
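The cell above only imports the type classes. A minimal sketch of how they could be used to build an explicit schema (the employee fields and sample values below are assumptions, not from the original notebook):
%python
# Hypothetical employee schema built from the imported type classes
schema = StructType([
    StructField("emp_id", IntegerType(), False),
    StructField("name", StringType(), True),
    StructField("salary", FloatType(), True),
    StructField("join_date", DateType(), True)
])
# Sample rows matching the schema (assumed values)
employee_data = [(1, "Alice", 50000.0, date(2023, 1, 15)),
                 (2, "Bob", 62000.5, date(2023, 3, 1))]
df_employee = spark.createDataFrame(employee_data, schema)
df_employee.printSchema()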
Viewing Data: show() and display()
show(): This is a method available on Spark DataFrames that prints the first n rows to the console. It is useful for
quick inspection of data but does not provide rich formatting or interactivity. You can specify the number of rows to
display; it defaults to 20 rows if not specified.
display(): This is a Databricks-specific function that provides a rich, interactive view of the DataFrame. It is more
suitable for use within notebooks as it allows for better visualization, including sorting, filtering, and graphical
representation of data.
customer_df.show(5)
+-----------+----------+------+-----------+-------+-----------------+---------+
|customer_id|      name|  city|      state|country|registration_date|is_active|
+-----------+----------+------+-----------+-------+-----------------+---------+
|          0|Customer_0|  Pune|Maharashtra|  India|       2023-01-19|     true|
|          1|Customer_1|  Pune|West Bengal|  India|       2023-08-10|     true|
|          2|Customer_2| Delhi|Maharashtra|  India|       2023-08-05|     true|
|          3|Customer_3|Mumbai|  Telangana|  India|       2023-06-04|     true|
|          4|Customer_4| Delhi|  Karnataka|  India|       2023-03-15|    false|
+-----------+----------+------+-----------+-------+-----------------+---------+
only showing top 5 rows
customer_df.display()
#display(customer_df)
Inspecting Columns and Schema
columns: This attribute returns a list of the column names in the DataFrame.
printSchema(): This method prints the schema of the DataFrame, including column names and data types, in a
tree format.
customer_df.columns
['customer_id',
'name',
'city',
'state',
'country',
'registration_date',
'is_active']
customer_df.printSchema()
root
|-- customer_id: integer (nullable = true)
|-- name: string (nullable = true)
|-- city: string (nullable = true)
|-- state: string (nullable = true)
|-- country: string (nullable = true)
|-- registration_date: date (nullable = true)
|-- is_active: boolean (nullable = true)
3. Select columns
customer_df.select("name","city").show()
+-----------+---------+
|       name|     city|
+-----------+---------+
| Customer_0|     Pune|
| Customer_1|     Pune|
| Customer_2|    Delhi|
| Customer_3|   Mumbai|
| Customer_4|    Delhi|
| Customer_5|  Kolkata|
| Customer_6|  Kolkata|
| Customer_7|   Mumbai|
| Customer_8|     Pune|
| Customer_9|    Delhi|
|Customer_10|Hyderabad|
|Customer_11|    Delhi|
|Customer_12|    Delhi|
|Customer_13|     Pune|
|Customer_14|  Chennai|
|Customer_15|Hyderabad|
|Customer_16|  Chennai|
|Customer_17|     Pune|
|Customer_18|  Chennai|
|Customer_19|  Chennai|
+-----------+---------+
only showing top 20 rows
4. Filter rows
customer_df.filter(customer_df.city=="Hyderabad").show()
customer_df.where(customer_df.city=="Hyderabad").show()
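filter and where are aliases and return the same result. A sketch of combining multiple conditions (column names taken from the customer schema above; the specific filter is illustrative):
%python
from pyspark.sql.functions import col
# Each condition must be parenthesized when combined with & (and) or | (or)
customer_df.filter((col("city") == "Hyderabad") & (col("is_active") == True)).show(5)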
5. Adding or Replacing a Column
The withColumn method is used to create a new column or replace an existing column in a DataFrame. It takes the new column's name and a Column expression defining its values:
df.withColumn("column_name", expression)
%python
from pyspark.sql.functions import col, concat, lit
# Adding a "full name" column (reconstructed from the output shown below, where names appear as "Customer_N Singh")
df_with_new_column = customer_df.withColumn("full name", concat(col("name"), lit(" Singh")))
withColumnRenamed
The withColumnRenamed method is used to rename a single column in a DataFrame.
%python
# Example: Renaming a column
df_renamed_column = df_with_new_column.withColumnRenamed("full name", "Full Name")
6. Dropping a Column
The drop method is used to remove one or more columns from a DataFrame.
%python
# Dropping multiple columns
df_dropped_columns = df_renamed_column.drop("name", "country")
df_dropped_columns: pyspark.sql.dataframe.DataFrame = [customer_id: integer, city: string ... 4 more fields]
df_dropped_columns.show()
+-----------+---------+-----------+-----------------+---------+-----------------+
|customer_id|     city|      state|registration_date|is_active|        Full Name|
+-----------+---------+-----------+-----------------+---------+-----------------+
|          0|     Pune|Maharashtra|       2023-01-19|     true| Customer_0 Singh|
|          1|     Pune|West Bengal|       2023-08-10|     true| Customer_1 Singh|
|          2|    Delhi|Maharashtra|       2023-08-05|     true| Customer_2 Singh|
|          3|   Mumbai|  Telangana|       2023-06-04|     true| Customer_3 Singh|
|          4|    Delhi|  Karnataka|       2023-03-15|    false| Customer_4 Singh|
|          5|  Kolkata|West Bengal|       2023-08-19|     true| Customer_5 Singh|
|          6|  Kolkata| Tamil Nadu|       2023-04-21|    false| Customer_6 Singh|
|          7|   Mumbai|  Telangana|       2023-05-23|     true| Customer_7 Singh|
|          8|     Pune| Tamil Nadu|       2023-07-17|     true| Customer_8 Singh|
|          9|    Delhi|  Karnataka|       2023-06-02|     true| Customer_9 Singh|
|         10|Hyderabad|      Delhi|       2023-02-23|     true|Customer_10 Singh|
|         11|    Delhi|West Bengal|       2023-11-08|     true|Customer_11 Singh|
|         12|    Delhi|      Delhi|       2023-06-27|    false|Customer_12 Singh|
|         13|     Pune|Maharashtra|       2023-02-03|     true|Customer_13 Singh|
|         14|  Chennai|  Karnataka|       2023-04-06|     true|Customer_14 Singh|
|         15|Hyderabad|West Bengal|       2023-03-31|     true|Customer_15 Singh|
|         16|  Chennai|Maharashtra|       2023-04-26|     true|Customer_16 Singh|
|         17|     Pune|      Delhi|       2023-04-14|    false|Customer_17 Singh|
|         18|  Chennai|Maharashtra|       2023-02-04|    false|Customer_18 Singh|
|         19|  Chennai|  Karnataka|       2023-01-22|     true|Customer_19 Singh|
+-----------+---------+-----------+-----------------+---------+-----------------+
only showing top 20 rows
7. Removing Duplicate Rows
The distinct method returns a new DataFrame containing only the distinct rows.
%python
# Removing duplicate rows
df_distinct = df_renamed_column.distinct()
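distinct() compares entire rows. For deduplicating on a subset of columns, dropDuplicates accepts a list of column names; a small illustrative sketch:
%python
# Keep one row per (city, state) pair; other columns come from an arbitrary surviving row
df_dedup = df_renamed_column.dropDuplicates(["city", "state"])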
Aggregation
Will cover in detail tomorrow
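The code cell for the output below is missing from the export; a groupBy count of this form would produce it:
%python
# Count the number of customers in each city
customer_df.groupBy("city").count().show()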
+---------+------+
|     city| count|
+---------+------+
|Bangalore|661013|
|  Chennai|660249|
|   Mumbai|661241|
|Ahmedabad|660218|
|  Kolkata|660174|
|     Pune|660737|
|    Delhi|661025|
|Hyderabad|662281|
+---------+------+