Handling corrupted records in PySpark

PySpark interview questions on handling corrupted records.

List of questions:
1. What are corrupted records?
2. What are the different read modes to handle corrupted records?
3. How do you store correct and corrupted records in different locations?

In PySpark, when you're working with CSV data, a record can be considered corrupted if it
doesn't follow the expected structure or format. Specifically, a CSV file might have the
following types of corrupted records:
1. Field Count Mismatch: When the number of fields (columns) in a row doesn't match
the expected number of columns. This might happen due to missing values or extra
delimiters in the data.
2. Malformed Data: Data that doesn't conform to the expected data type for a column.
For example, a numeric column containing alphabetic characters.
3. Invalid Encoding: Records may be corrupted if the encoding of the data doesn't
match the specified encoding, resulting in unreadable or unexpected characters.
4. Quoted Fields Issues: If the CSV file contains fields enclosed in quotes, there might be
issues with missing or unbalanced quotes, which can cause problems in parsing the
data.
5. Delimiter Issues: If the delimiter used in the CSV file doesn't match the one specified,
or if it is inconsistent across records, it can cause corrupted records.
6. Null or Empty Records: Records that are entirely empty or contain only null values
might also be considered corrupted.

Let's create some dummy data to see corrupted-record handling in PySpark.

Agent_id, Name, Age, Salary, Address, Insurance_Type


1, Tarun, 33, 45000, MadhyaPradesh, Car&Property
2, Manshi, 35, 55000, Delhi, UttarPradesh, Property
3, Avinash, 45, 150000, Delhi, India, GeneralInsurance
4, Mona, 18, 200000,Kolkata,India,LifeInsurance
5,Vikash,31,300000,, Car&Property

If you look at the data, you can see that records 2, 3 and 4 are corrupted, as they contain
extra fields.
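
If you want to follow along, one way to create this file is a minimal sketch using Databricks' dbutils utility (this assumes a Databricks workspace; the path matches the file_path used in the examples below):

# Write the sample CSV (including the malformed rows) to DBFS.
# dbutils is available in Databricks notebooks; adjust the path for other environments.
sample_data = """Agent_id,Name,Age,Salary,Address,Insurance_Type
1,Tarun,33,45000,MadhyaPradesh,Car&Property
2,Manshi,35,55000,Delhi,UttarPradesh,Property
3,Avinash,45,150000,Delhi,India,GeneralInsurance
4,Mona,18,200000,Kolkata,India,LifeInsurance
5,Vikash,31,300000,,Car&Property
"""

dbutils.fs.put('dbfs:/FileStore/tables/AgentRecords.txt', sample_data, overwrite=True)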
Read modes to handle corrupted records

To handle corrupted records in PySpark when reading a CSV file, you can specify the mode
parameter in the spark.read.csv() function:
1. PERMISSIVE: The default mode. Malformed records are not dropped: fields that cannot be
parsed are set to null, and the whole raw record can be captured in a _corrupt_record
column if you add one to the schema (see the sketch after this list).

2. DROPMALFORMED: In this mode, any records that are malformed (not conforming to
the schema) are dropped.
3. FAILFAST: If any record is malformed, the entire read operation will fail immediately.
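
Here is a minimal sketch of PERMISSIVE mode, assuming the same file path and a schema that includes a _corrupt_record column (this is Spark's default column name; it can be changed with the columnNameOfCorruptRecord option):

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Schema with an extra _corrupt_record column to capture the raw text of bad rows.
permissive_schema = StructType([
    StructField("Agent_ID", IntegerType(), True),
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("Salary", IntegerType(), True),
    StructField("Address", StringType(), True),
    StructField("Insurance_Type", StringType(), True),
    StructField("_corrupt_record", StringType(), True)
])

permissive_df = spark.read.format('csv')\
    .option('header', 'true')\
    .option('mode', 'PERMISSIVE')\
    .schema(permissive_schema)\
    .load('dbfs:/FileStore/tables/AgentRecords.txt')

# Good rows have _corrupt_record = null; bad rows keep their raw line there.
permissive_df.show(truncate=False)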

Store uncorrupted and corrupted records in different locations.

Store uncorrupted records.

To store the uncorrupted records in a specific location, refer to the code below, which uses
the DROPMALFORMED read mode.

First, we need to define a schema and use it while creating the DataFrame.

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

Agent_schema = StructType([
    StructField("Agent_ID", IntegerType(), True),
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("Salary", IntegerType(), True),
    StructField("Address", StringType(), True),
    StructField("Insurance_Type", StringType(), True)
])

file_path = 'dbfs:/FileStore/tables/AgentRecords.txt'

Agent_df = spark.read.format('csv')\
    .option('header', 'true')\
    .option('inferSchema', 'false')\
    .option('mode', 'DROPMALFORMED')\
    .schema(Agent_schema)\
    .load(file_path)

Agent_df.write.format('csv').save('dbfs:/FileStore/tables/AgentRecord/Record')
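
To confirm that the malformed rows were dropped, you can display the DataFrame and read back the saved output; with the sample data above, only the well-formed rows should remain (a quick check, assuming the same paths as earlier):

# Only rows that match the 6-column schema survive DROPMALFORMED.
Agent_df.show(truncate=False)

# Read back what was written to verify the saved output.
saved_df = spark.read.format('csv')\
    .schema(Agent_schema)\
    .load('dbfs:/FileStore/tables/AgentRecord/Record')
saved_df.show(truncate=False)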
Store corrupted records.

For this, we first need to change the Agent_schema we defined earlier by adding
StructField('_corrupt_record', StringType(), True). Below is the new schema:

Agent_schema = StructType([
    StructField("Agent_ID", IntegerType(), True),
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("Salary", IntegerType(), True),
    StructField("Address", StringType(), True),
    StructField("Insurance_Type", StringType(), True),
    StructField("_corrupt_record", StringType(), True)
])

Then we define the Agent DataFrame again, this time adding the option
.option('badRecordsPath', bad_record_file_path). Note that badRecordsPath is a
Databricks-specific option; the corrupted records are written to this path as JSON files.

file_path = 'dbfs:/FileStore/tables/AgentRecords.txt'
bad_record_file_path = 'dbfs:/FileStore/tables/bad_records'

Agent_df = spark.read.format('csv')\
    .option('header', 'true')\
    .option('inferSchema', 'false')\
    .schema(Agent_schema)\
    .option('badRecordsPath', bad_record_file_path)\
    .load(file_path)
Now, let's check out the bad records that were stored.
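
On Databricks, the bad records are written under the badRecordsPath as JSON files inside timestamped subdirectories. A minimal sketch for inspecting them (this assumes the {badRecordsPath}/<timestamp>/bad_records/ layout Databricks uses; adjust the glob if your runtime lays files out differently):

# Each bad-record JSON entry contains the source path, the raw record and the failure reason.
bad_records_df = spark.read.json(f'{bad_record_file_path}/*/bad_records/*')
bad_records_df.show(truncate=False)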
