Handling Corrupted Records in PySpark
List of questions:
1. What are corrupted records?
2. What are the different read modes to handle corrupted records?
3. How do we store correct and corrupted records in different locations?
In PySpark, when you're working with CSV data, a record can be considered corrupted if it
doesn't follow the expected structure or format. Specifically, a CSV file might have the
following types of corrupted records:
1. Field Count Mismatch: When the number of fields (columns) in a row doesn't match
the expected number of columns. This can happen due to missing values or extra
delimiters in the data (a small illustration follows this list).
2. Malformed Data: Data that doesn't conform to the expected data type for a column.
For example, a numeric column containing alphabetic characters.
3. Invalid Encoding: Records may be corrupted if the encoding of the data doesn't
match the specified encoding, resulting in unreadable or unexpected characters.
4. Quoted Fields Issues: If the CSV file contains fields enclosed in quotes, there might be
issues with missing or unbalanced quotes, which can cause problems in parsing the
data.
5. Delimiter Issues: If the delimiter used in the CSV file doesn't match the one specified,
or if it is inconsistent across records, it can cause corrupted records.
6. Null or Empty Records: Records that are entirely empty or contain only null values
might also be considered corrupted.
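To make the first case concrete, here is a hypothetical snippet (the values are made up for illustration only; the column names match the schema used later in this article) in which the second data row carries an extra delimiter and therefore one field too many:
Agent_ID,Name,Age,Salary,Address,Insurance_Type
1,John,35,50000,London,Life
2,Mary,29,45000,Paris,Health,ExtraValue
The second data row has seven fields against the six the header declares, so Spark treats it as malformed.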
If you look at the sample file used in this article (AgentRecords.txt), you will see that
records 2, 4 and 5 are corrupted, because they contain extra fields.
Read modes to handle corrupted records
To handle corrupted records in PySpark when reading a CSV file, you can set the mode
option on the DataFrame reader (or the mode parameter of spark.read.csv()); a short sketch follows this list:
1. PERMISSIVE: The default mode. Malformed records are kept: fields that cannot be
parsed are set to null, and if the schema contains a _corrupt_record column, the raw
malformed line is stored there.
2. DROPMALFORMED: In this mode, any records that are malformed (not conforming to
the schema) are dropped.
3. FAILFAST: If any record is malformed, the entire read operation will fail immediately.
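As a minimal sketch (assuming Agent_schema and file_path are the ones defined later in this article), each mode is passed the same way; only the value of the mode option changes:

# PERMISSIVE: malformed fields become null, the rows are kept
permissive_df = spark.read.format('csv')\
    .option('header', 'true')\
    .option('mode', 'PERMISSIVE')\
    .schema(Agent_schema)\
    .load(file_path)

# DROPMALFORMED: malformed rows are silently dropped
dropmalformed_df = spark.read.format('csv')\
    .option('header', 'true')\
    .option('mode', 'DROPMALFORMED')\
    .schema(Agent_schema)\
    .load(file_path)

# FAILFAST: the job fails with an exception when a malformed row is encountered
failfast_df = spark.read.format('csv')\
    .option('header', 'true')\
    .option('mode', 'FAILFAST')\
    .schema(Agent_schema)\
    .load(file_path)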
To store only the well-formed records in a specific location, refer to the code below, which
uses the DROPMALFORMED read mode.
First, we need to define the schema and use it while creating the DataFrame.
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

Agent_schema = StructType(
    [
        StructField("Agent_ID", IntegerType(), True),
        StructField("Name", StringType(), True),
        StructField("Age", IntegerType(), True),
        StructField("Salary", IntegerType(), True),
        StructField("Address", StringType(), True),
        StructField("Insurance_Type", StringType(), True)
    ]
)
file_path = 'dbfs:/FileStore/tables/AgentRecords.txt'
Agent_df = spark.read.format('csv')\
    .option('header', 'true')\
    .option('inferSchema', 'false')\
    .option('mode', 'DROPMALFORMED')\
    .schema(Agent_schema)\
    .load(file_path)
Agent_df.write.format('csv').save('dbfs:/FileStore/tables/AgentRecord/Record')
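To sanity-check what was written, the saved files can be read back (a quick sketch; since the write above did not set a header option, the output is read without one):

clean_df = spark.read.format('csv')\
    .schema(Agent_schema)\
    .load('dbfs:/FileStore/tables/AgentRecord/Record')
clean_df.show(truncate=False)   # only the well-formed rows should appear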
Store corrupted records.
For this, we first need to make a change to the Agent schema defined earlier: we add
StructField('_corrupt_record', StringType(), True). Below is the new schema:
Agent_schema = StructType(
    [
        StructField("Agent_ID", IntegerType(), True),
        StructField("Name", StringType(), True),
        StructField("Age", IntegerType(), True),
        StructField("Salary", IntegerType(), True),
        StructField("Address", StringType(), True),
        StructField("Insurance_Type", StringType(), True),
        StructField("_corrupt_record", StringType(), True)
    ]
)
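As an aside, once _corrupt_record is part of the schema, the bad rows can also be kept inside the DataFrame itself by reading in PERMISSIVE mode and filtering on that column. A minimal sketch, reusing the file_path defined earlier:

permissive_df = spark.read.format('csv')\
    .option('header', 'true')\
    .option('mode', 'PERMISSIVE')\
    .schema(Agent_schema)\
    .load(file_path)

# Spark's docs advise caching (or saving) the parsed result before queries
# that reference only the corrupt-record column
permissive_df.cache()

corrupted_df = permissive_df.filter(permissive_df['_corrupt_record'].isNotNull())
good_df = permissive_df.filter(permissive_df['_corrupt_record'].isNull())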
To write the bad records out to a separate location instead, we define the Agent DataFrame
again, this time adding the option .option('badRecordsPath', bad_record_file_path):
file_path = 'dbfs:/FileStore/tables/AgentRecords.txt'
bad_record_file_path = 'dbfs:/FileStore/tables/bad_records'
Agent_df = spark.read.format('csv')\
    .option('header', 'true')\
    .option('inferSchema', 'false')\
    .schema(Agent_schema)\
    .option('badRecordsPath', bad_record_file_path)\
    .load(file_path)
Now, let's check out the bad records that were stored.
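On Databricks, badRecordsPath writes the rejected rows as JSON files under timestamped sub-folders of the given path (the exact layout can vary by runtime version), so one way to inspect them is a sketch like this:

bad_records_df = spark.read.json(bad_record_file_path + '/*/bad_records/*')
# each entry typically holds the source file path, the raw record and the failure reason
bad_records_df.show(truncate=False)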