
Scenario Series: Working With CSV Files and Other Formats

Here's a detailed guide to reading CSV files and other formats (such as text and Excel files)
with PySpark in Databricks, covering the most common scenarios:

1. Read a CSV with Header and inferSchema

Scenario: You have a sales data CSV file, and you want to read it with the header included
and let PySpark infer the schema automatically.

df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("dbfs:/mnt/sales_data.csv"))
df.show()
df.printSchema()

• Explanation: By setting header=true, PySpark considers the first row as the header.
inferSchema=true automatically detects the data types for each column.
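
The same read can also be expressed with keyword arguments instead of chained option() calls; a minimal sketch using the same illustrative path as above:

# Keyword-argument form of the same read
df = spark.read.csv(
    "dbfs:/mnt/sales_data.csv",
    header=True,        # treat the first row as column names
    inferSchema=True    # scan the data to detect column types
)
df.printSchema()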

2. Read a CSV with Header and Predefined Schema


Scenario: You want more control over the schema, so instead of inferring it automatically,
you define the schema yourself.

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Define the schema manually
schema = StructType([
    StructField("CustomerID", IntegerType(), True),
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("Country", StringType(), True)
])

df = spark.read.option("header",
"true").schema(schema).csv("dbfs:/mnt/customer_data.csv")
df.show()
df.printSchema()



• Explanation: By defining a StructType schema, you have explicit control over the data
types for each column. This ensures consistency across data loads.
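
As a lighter-weight alternative, the schema can also be passed as a DDL-formatted string; a brief sketch using the same illustrative columns and path:

# Same schema expressed as a DDL string instead of a StructType
ddl_schema = "CustomerID INT, Name STRING, Age INT, Country STRING"

df = (spark.read
      .option("header", "true")
      .schema(ddl_schema)
      .csv("dbfs:/mnt/customer_data.csv"))
df.printSchema()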

3. Read CSV with a Different Delimiter and Multiple CSV Files Together

Scenario: You have sales data partitioned across multiple CSV files, and they use a
semicolon (;) as the delimiter.

df = spark.read.option("header", "true").option("delimiter",
";").csv("dbfs:/mnt/sales_data_part*.csv")
df.show()

• Explanation: You can specify custom delimiters (like ; or |) using the delimiter option.
The wildcard (*) is used to read multiple files at once.
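
If the files don't share a convenient glob pattern, you can also pass an explicit list of paths; a short sketch with hypothetical file names:

# Reading several specific files at once (file names below are illustrative)
paths = [
    "dbfs:/mnt/sales_data_part1.csv",
    "dbfs:/mnt/sales_data_part2.csv"
]
df = (spark.read
      .option("header", "true")
      .option("delimiter", ";")
      .csv(paths))
df.show()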

4. Handling Malformed Data While Reading CSV


Scenario: Some rows in your CSV may have more or fewer columns than expected, and you
need to handle these malformed rows.

df = spark.read.option("header", "true").option("mode",
"PERMISSIVE").csv("dbfs:/mnt/sales_data.csv")
df.show()

• Explanation: The mode option controls how malformed rows are handled:

   o PERMISSIVE (default): keeps every row; malformed fields are set to null, and the raw
     record can be captured in a column named _corrupt_record (see the next scenario).
   o DROPMALFORMED: discards rows that don't match the schema.
   o FAILFAST: throws an error and stops reading as soon as malformed data is
     encountered.
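
A minimal sketch of the two stricter modes, assuming an explicit schema (the column names below are hypothetical):

from pyspark.sql.types import StructType, StructField, IntegerType

# Illustrative schema; replace with your real column definitions
schema = StructType([
    StructField("OrderID", IntegerType(), True),
    StructField("Amount", IntegerType(), True)
])

# Silently drop rows that do not fit the schema
df_drop = (spark.read
           .option("header", "true")
           .schema(schema)
           .option("mode", "DROPMALFORMED")
           .csv("dbfs:/mnt/sales_data.csv"))

# Fail the read on the first malformed row
df_fail = (spark.read
           .option("header", "true")
           .schema(schema)
           .option("mode", "FAILFAST")
           .csv("dbfs:/mnt/sales_data.csv"))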



5. Handling Corrupt Records with _corrupt_record

Scenario: You want to capture malformed or corrupt rows in a separate column for further
investigation.

df = spark.read.option("header", "true").option("mode",
"PERMISSIVE").option("columnNameOfCorruptRecord",
"_corrupt_record").csv("dbfs:/mnt/sales_data.csv")
df.select("_corrupt_record").show(truncate=False)

• Explanation: The columnNameOfCorruptRecord option lets you specify the column in which
corrupt rows are stored. You can inspect this column to identify data issues.
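
Note that when you supply an explicit schema, the corrupt-record column usually needs to be declared in it as a string field; a hedged sketch with hypothetical columns:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Illustrative schema that reserves a string field for the raw malformed line
schema = StructType([
    StructField("OrderID", IntegerType(), True),
    StructField("Amount", IntegerType(), True),
    StructField("_corrupt_record", StringType(), True)
])

df = (spark.read
      .option("header", "true")
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .schema(schema)
      .csv("dbfs:/mnt/sales_data.csv"))

df.cache()  # caching avoids a known restriction when filtering on the corrupt-record column
df.filter(df["_corrupt_record"].isNotNull()).show(truncate=False)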

6. Reading a Text File

Scenario: You have log data or a document saved as a .txt file, and you need to read it into a
DataFrame.

df = spark.read.text("dbfs:/mnt/log_data.txt")
df.show(truncate=False)

• Explanation: Reading a text file treats each line as a row, and the entire line is placed
in a single column called value. This is useful for log processing or analysing
unstructured text data.
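
Once the lines are loaded into the value column, you can split them into fields with standard DataFrame functions; a brief sketch that assumes a simple space-separated log format:

from pyspark.sql.functions import split

# Split each raw line into illustrative fields (the log layout is an assumption)
logs = spark.read.text("dbfs:/mnt/log_data.txt")
parsed = logs.select(
    split(logs.value, " ").getItem(0).alias("timestamp"),
    split(logs.value, " ").getItem(1).alias("level")
)
parsed.show(truncate=False)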

7. Reading an Excel (XLSX) File

Scenario: You have customer data saved in an Excel file, and you need to read it into
Databricks.
Step 1: Install the necessary library in Databricks:
1. Open your cluster's configuration page and go to the "Libraries" tab, then click "Install New".
2. Choose Maven as the library source, enter the coordinate com.crealytics:spark-excel_2.12:0.13.5, and install it on your cluster.

Step 2: Use the following code to read the Excel file:



df = spark.read.format("com.crealytics.spark.excel") \
.option("header", "true") \
.option("inferSchema", "true") \
.load("dbfs:/mnt/customer_data.xlsx")
df.show()

• Explanation: The spark-excel library allows you to read .xlsx files in Databricks. You
can specify whether the file has a header and whether to infer the schema.
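
If the workbook contains several sheets, the library also exposes a dataAddress option for choosing the sheet and starting cell; a hedged sketch (the sheet name below is hypothetical):

# Read a specific sheet starting at cell A1 (sheet name is illustrative)
df = (spark.read.format("com.crealytics.spark.excel")
      .option("header", "true")
      .option("inferSchema", "true")
      .option("dataAddress", "'Customers'!A1")
      .load("dbfs:/mnt/customer_data.xlsx"))
df.show()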

Conclusion
By understanding and utilizing these different options, you can handle a wide variety of CSV
files and other formats like text and Excel in Databricks. Whether you need to deal with
malformed data, read multiple files at once, or handle complex delimiters, PySpark in
Databricks provides robust tools for efficient data processing.

Follow me on LinkedIn – Shivakiran kotur
