Working with CSV Files in Databricks
Here's a detailed guide to reading CSV files and other formats (like text and Excel files) using PySpark in Databricks, covering the most common scenarios.
1. Read CSV with Header and Inferred Schema
Scenario: You have a sales data CSV file, and you want to read it with the header included and let PySpark infer the schema automatically.
df = spark.read.option("header", "true").option("inferSchema", "true").csv("dbfs:/mnt/sales_data.csv")
df.show()
df.printSchema()
• Explanation: Setting header=true makes PySpark treat the first row as the header, and inferSchema=true automatically detects the data type of each column.
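You can also supply an explicit schema instead of inferring one. Define it with StructType before the read below; a minimal sketch, assuming hypothetical columns customer_id and name:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Hypothetical schema: adjust the field names and types to match customer_data.csv
schema = StructType([
    StructField("customer_id", IntegerType(), True),
    StructField("name", StringType(), True)
])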
df = spark.read.option("header",
"true").schema(schema).csv("dbfs:/mnt/customer_data.csv")
df.show()
df.printSchema()
2. Read CSV with a Different Delimiter and Multiple CSV Files Together
Scenario: You have sales data partitioned across multiple CSV files, and they use a
semicolon (;) as the delimiter.
df = spark.read.option("header", "true").option("delimiter",
";").csv("dbfs:/mnt/sales_data_part*.csv")
df.show()
• Explanation: You can specify custom delimiters (like ; or |) using the delimiter option.
The wildcard (*) is used to read multiple files at once.
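Instead of a wildcard, you can also pass an explicit list of paths; a small sketch with hypothetical file names:

df = spark.read.option("header", "true").option("delimiter", ";") \
    .csv(["dbfs:/mnt/sales_data_part1.csv", "dbfs:/mnt/sales_data_part2.csv"])
df.show()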
df = spark.read.option("header", "true").option("mode",
"PERMISSIVE").csv("dbfs:/mnt/sales_data.csv")
df.show()
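For comparison, a quick sketch of the two stricter modes against the same file:

# Drop malformed rows entirely
df_drop = spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("dbfs:/mnt/sales_data.csv")

# Fail immediately on the first malformed row
df_fail = spark.read.option("header", "true").option("mode", "FAILFAST").csv("dbfs:/mnt/sales_data.csv")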
Scenario: You want to capture malformed or corrupt rows in a separate column for further
investigation.
df = spark.read.option("header", "true").option("mode",
"PERMISSIVE").option("columnNameOfCorruptRecord",
"_corrupt_record").csv("dbfs:/mnt/sales_data.csv")
df.select("_corrupt_record").show(truncate=False)
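A complete sketch with an explicit schema (product and amount are hypothetical data columns; adjust them to your file). Since Spark disallows queries that reference only the corrupt record column on a raw file scan, the result is cached first:

from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("product", StringType(), True),         # hypothetical column
    StructField("amount", StringType(), True),          # hypothetical column
    StructField("_corrupt_record", StringType(), True)  # receives the raw malformed line
])

df = spark.read.option("header", "true") \
    .option("mode", "PERMISSIVE") \
    .option("columnNameOfCorruptRecord", "_corrupt_record") \
    .schema(schema) \
    .csv("dbfs:/mnt/sales_data.csv")

df.cache()  # required before selecting only _corrupt_record from a raw read
df.select("_corrupt_record").show(truncate=False)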
4. Read a Text File
Scenario: You have log data or a document saved as a .txt file, and you need to read it into a DataFrame.
df = spark.read.text("dbfs:/mnt/log_data.txt")
df.show(truncate=False)
• Explanation: Reading a text file treats each line as a row and places the entire line in a single column called value. This is useful for log processing or analyzing unstructured text data.
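For instance, a short sketch that pulls error lines out of the log (assuming the hypothetical convention that log lines contain the literal string "ERROR"):

from pyspark.sql import functions as F

errors = df.filter(F.col("value").contains("ERROR"))
errors.show(truncate=False)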
5. Read an Excel File
Scenario: You have customer data saved in an Excel file, and you need to read it into Databricks.
Step 1: Install the necessary library in Databricks:
1. In the Databricks workspace, open your cluster's "Libraries" tab and click "Install New".
2. Choose Maven and enter the coordinate com.crealytics:spark-excel_2.12:0.13.5, then install it on the cluster.
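Step 2: Read the Excel file into a DataFrame. A minimal sketch, assuming a hypothetical file path and the option names used by spark-excel 0.13+ (earlier releases used useHeader instead of header):

df = spark.read.format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("dbfs:/mnt/customer_data.xlsx")
df.show()
df.printSchema()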
• Explanation: The spark-excel library allows you to read .xlsx files in Databricks. You
can specify whether the file has a header and whether to infer the schema.
Conclusion
By understanding and utilizing these different options, you can handle a wide variety of CSV
files and other formats like text and Excel in Databricks. Whether you need to deal with
malformed data, read multiple files at once, or handle complex delimiters, PySpark in
Databricks provides robust tools for efficient data processing.