
Scenario Series: Working With CSV Files and Other Formats

Here's a detailed guide to reading CSV files and other formats (such as text and Excel files)
with PySpark in Databricks, covering the most common scenarios:

1. Read a CSV with Header and inferSchema

Scenario: You have a sales data CSV file, and you want to read it with the header included
and let PySpark infer the schema automatically.

df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("dbfs:/mnt/sales_data.csv"))
df.show()
df.printSchema()

• Explanation: By setting header=true, PySpark considers the first row as the header.
inferSchema=true automatically detects the data types for each column.
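
The same read can also be expressed with keyword arguments instead of chained option() calls; a minimal sketch using the same illustrative path as above:

# Keyword-argument form of the same read
df = spark.read.csv(
    "dbfs:/mnt/sales_data.csv",
    header=True,        # treat the first row as column names
    inferSchema=True    # scan the data to detect column types
)
df.printSchema()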

2. Read a CSV with Header and Predefined Schema


Scenario: You want more control over the schema, so instead of inferring it automatically,
you define the schema yourself.

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Define the schema manually
schema = StructType([
    StructField("CustomerID", IntegerType(), True),
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("Country", StringType(), True)
])

df = spark.read.option("header",
"true").schema(schema).csv("dbfs:/mnt/customer_data.csv")
df.show()
df.printSchema()



• Explanation: By defining a StructType schema, you have explicit control over the data
types for each column. This ensures consistency across data loads.
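
As a lighter-weight alternative, the schema can also be passed as a DDL-formatted string; a brief sketch using the same illustrative columns and path:

# Same schema expressed as a DDL string instead of a StructType
ddl_schema = "CustomerID INT, Name STRING, Age INT, Country STRING"

df = (spark.read
      .option("header", "true")
      .schema(ddl_schema)
      .csv("dbfs:/mnt/customer_data.csv"))
df.printSchema()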

3. Read CSV with a Different Delimiter and Multiple CSV Files Together

Scenario: You have sales data partitioned across multiple CSV files, and they use a
semicolon (;) as the delimiter.

df = spark.read.option("header", "true").option("delimiter",
";").csv("dbfs:/mnt/sales_data_part*.csv")
df.show()

• Explanation: You can specify custom delimiters (like ; or |) using the delimiter option.
The wildcard (*) is used to read multiple files at once.
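
If the files don't share a convenient glob pattern, you can also pass an explicit list of paths; a short sketch with hypothetical file names:

# Reading several specific files at once (file names below are illustrative)
paths = [
    "dbfs:/mnt/sales_data_part1.csv",
    "dbfs:/mnt/sales_data_part2.csv"
]
df = (spark.read
      .option("header", "true")
      .option("delimiter", ";")
      .csv(paths))
df.show()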

4. Handling Malformed Data While Reading CSV


Scenario: Some rows in your CSV may have more or fewer columns than expected, and you
need to handle these malformed rows.

df = spark.read.option("header", "true").option("mode",
"PERMISSIVE").csv("dbfs:/mnt/sales_data.csv")
df.show()

• Explanation: The mode option controls how malformed rows are handled:

   o PERMISSIVE (default): keeps every row; malformed fields are set to null, and the raw
     record can be captured in a column named _corrupt_record (see the next scenario).
   o DROPMALFORMED: discards rows that don't match the schema.
   o FAILFAST: throws an error and stops reading as soon as malformed data is
     encountered.
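
A minimal sketch of the two stricter modes, assuming an explicit schema (the column names below are hypothetical):

from pyspark.sql.types import StructType, StructField, IntegerType

# Illustrative schema; replace with your real column definitions
schema = StructType([
    StructField("OrderID", IntegerType(), True),
    StructField("Amount", IntegerType(), True)
])

# Silently drop rows that do not fit the schema
df_drop = (spark.read
           .option("header", "true")
           .schema(schema)
           .option("mode", "DROPMALFORMED")
           .csv("dbfs:/mnt/sales_data.csv"))

# Fail the read on the first malformed row
df_fail = (spark.read
           .option("header", "true")
           .schema(schema)
           .option("mode", "FAILFAST")
           .csv("dbfs:/mnt/sales_data.csv"))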



5. Handling Corrupt Records with _corrupt_record

Scenario: You want to capture malformed or corrupt rows in a separate column for further
investigation.

df = spark.read.option("header", "true").option("mode",
"PERMISSIVE").option("columnNameOfCorruptRecord",
"_corrupt_record").csv("dbfs:/mnt/sales_data.csv")
df.select("_corrupt_record").show(truncate=False)

• Explanation: The columnNameOfCorruptRecord option lets you specify the column in which
corrupt rows are stored. You can inspect this column to identify data issues.
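
Note that when you supply an explicit schema, the corrupt-record column usually needs to be declared in it as a string field; a hedged sketch with hypothetical columns:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Illustrative schema that reserves a string field for the raw malformed line
schema = StructType([
    StructField("OrderID", IntegerType(), True),
    StructField("Amount", IntegerType(), True),
    StructField("_corrupt_record", StringType(), True)
])

df = (spark.read
      .option("header", "true")
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .schema(schema)
      .csv("dbfs:/mnt/sales_data.csv"))

df.cache()  # caching avoids a known restriction when filtering on the corrupt-record column
df.filter(df["_corrupt_record"].isNotNull()).show(truncate=False)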

6. Reading a Text File

Scenario: You have log data or a document saved as a .txt file, and you need to read it into a
DataFrame.

df = spark.read.text("dbfs:/mnt/log_data.txt")
df.show(truncate=False)

• Explanation: Reading a text file treats each line as a row, and the entire line is placed
in a single column called value. This is useful for log processing or analysing
unstructured text data.
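
Once the lines are loaded into the value column, you can split them into fields with standard DataFrame functions; a brief sketch that assumes a simple space-separated log format:

from pyspark.sql.functions import split

# Split each raw line into illustrative fields (the log layout is an assumption)
logs = spark.read.text("dbfs:/mnt/log_data.txt")
parsed = logs.select(
    split(logs.value, " ").getItem(0).alias("timestamp"),
    split(logs.value, " ").getItem(1).alias("level")
)
parsed.show(truncate=False)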

7. Reading an Excel (XLSX) File

Scenario: You have customer data saved in an Excel file, and you need to read it into
Databricks.
Step 1: Install the necessary library in Databricks:
1. Open your cluster's configuration page and go to the "Libraries" tab, then click "Install New".
2. Choose Maven as the library source, enter the coordinate com.crealytics:spark-excel_2.12:0.13.5, and install it on your cluster.

Step 2: Use the following code to read the Excel file:



df = spark.read.format("com.crealytics.spark.excel") \
.option("header", "true") \
.option("inferSchema", "true") \
.load("dbfs:/mnt/customer_data.xlsx")
df.show()

• Explanation: The spark-excel library allows you to read .xlsx files in Databricks. You
can specify whether the file has a header and whether to infer the schema.
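
If the workbook contains several sheets, the library also exposes a dataAddress option for choosing the sheet and starting cell; a hedged sketch (the sheet name below is hypothetical):

# Read a specific sheet starting at cell A1 (sheet name is illustrative)
df = (spark.read.format("com.crealytics.spark.excel")
      .option("header", "true")
      .option("inferSchema", "true")
      .option("dataAddress", "'Customers'!A1")
      .load("dbfs:/mnt/customer_data.xlsx"))
df.show()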

Conclusion
By understanding and utilizing these different options, you can handle a wide variety of CSV
files and other formats like text and Excel in Databricks. Whether you need to deal with
malformed data, read multiple files at once, or handle complex delimiters, PySpark in
Databricks provides robust tools for efficient data processing.

Follow me on LinkedIn – Shivakiran kotur
