Handling Corrupted Records in PySpark
List of questions:
1. What are corrupted records?
2. What are the different read modes to handle corrupted records?
3. How do we store correct and corrupted records in different locations?
In PySpark, when you're working with CSV data, a record can be considered corrupted if it
doesn't follow the expected structure or format. Specifically, a CSV file might have the
following types of corrupted records:
1. Field Count Mismatch: When the number of fields (columns) in a row doesn't match
the expected number of columns. This can happen due to missing values or extra
delimiters in the data (a small illustration follows this list).
2. Malformed Data: Data that doesn't conform to the expected data type for a column.
For example, a numeric column containing alphabetic characters.
3. Invalid Encoding: Records may be corrupted if the encoding of the data doesn't
match the specified encoding, resulting in unreadable or unexpected characters.
4. Quoted Fields Issues: If the CSV file contains fields enclosed in quotes, there might be
issues with missing or unbalanced quotes, which can cause problems in parsing the
data.
5. Delimiter Issues: If the delimiter used in the CSV file doesn't match the one specified,
or if it is inconsistent across records, it can cause corrupted records.
6. Null or Empty Records: Records that are entirely empty or contain only null values
might also be considered corrupted.
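To make the first case concrete, here is a hypothetical snippet (the values are made up for illustration only; the column names match the schema used later in this article) in which the second data row carries an extra delimiter and therefore one field too many:
Agent_ID,Name,Age,Salary,Address,Insurance_Type
1,John,35,50000,London,Life
2,Mary,29,45000,Paris,Health,ExtraValue
The second data row has seven fields against the six the header declares, so Spark treats it as malformed.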
If you look at the sample file used in this article (AgentRecords.txt), you will see that
records 2, 4 and 5 are corrupted, because they contain extra fields.
Read modes to handle corrupted records
To handle corrupted records in PySpark when reading a CSV file, you can set the mode
option on the DataFrame reader (or the mode parameter of spark.read.csv()); a short sketch follows this list:
1. PERMISSIVE: The default mode. Malformed records are kept: fields that cannot be
parsed are set to null, and if the schema contains a _corrupt_record column, the raw
malformed line is stored there.
2. DROPMALFORMED: In this mode, any records that are malformed (not conforming to
the schema) are dropped.
3. FAILFAST: If any record is malformed, the entire read operation will fail immediately.
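As a minimal sketch (assuming Agent_schema and file_path are the ones defined later in this article), each mode is passed the same way; only the value of the mode option changes:

# PERMISSIVE: malformed fields become null, the rows are kept
permissive_df = spark.read.format('csv')\
    .option('header', 'true')\
    .option('mode', 'PERMISSIVE')\
    .schema(Agent_schema)\
    .load(file_path)

# DROPMALFORMED: malformed rows are silently dropped
dropmalformed_df = spark.read.format('csv')\
    .option('header', 'true')\
    .option('mode', 'DROPMALFORMED')\
    .schema(Agent_schema)\
    .load(file_path)

# FAILFAST: the job fails with an exception when a malformed row is encountered
failfast_df = spark.read.format('csv')\
    .option('header', 'true')\
    .option('mode', 'FAILFAST')\
    .schema(Agent_schema)\
    .load(file_path)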
To store only the well-formed records in a specific location, refer to the code below, which
uses the DROPMALFORMED read mode.
First, we need to define the schema and use it while creating the DataFrame.
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

Agent_schema = StructType(
    [
        StructField("Agent_ID", IntegerType(), True),
        StructField("Name", StringType(), True),
        StructField("Age", IntegerType(), True),
        StructField("Salary", IntegerType(), True),
        StructField("Address", StringType(), True),
        StructField("Insurance_Type", StringType(), True)
    ]
)
file_path = 'dbfs:/FileStore/tables/AgentRecords.txt'
Agent_df = spark.read.format('csv')\
    .option('header', 'true')\
    .option('inferSchema', 'false')\
    .option('mode', 'DROPMALFORMED')\
    .schema(Agent_schema)\
    .load(file_path)
Agent_df.write.format('csv').save('dbfs:/FileStore/tables/AgentRecord/Record')
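To sanity-check what was written, the saved files can be read back (a quick sketch; since the write above did not set a header option, the output is read without one):

clean_df = spark.read.format('csv')\
    .schema(Agent_schema)\
    .load('dbfs:/FileStore/tables/AgentRecord/Record')
clean_df.show(truncate=False)   # only the well-formed rows should appear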
Store corrupted records.
For this, we first need to make a change to the Agent schema defined earlier: we add
StructField('_corrupt_record', StringType(), True). Below is the new schema:
Agent_schema = StructType(
    [
        StructField("Agent_ID", IntegerType(), True),
        StructField("Name", StringType(), True),
        StructField("Age", IntegerType(), True),
        StructField("Salary", IntegerType(), True),
        StructField("Address", StringType(), True),
        StructField("Insurance_Type", StringType(), True),
        StructField("_corrupt_record", StringType(), True)
    ]
)
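As an aside, once _corrupt_record is part of the schema, the bad rows can also be kept inside the DataFrame itself by reading in PERMISSIVE mode and filtering on that column. A minimal sketch, reusing the file_path defined earlier:

permissive_df = spark.read.format('csv')\
    .option('header', 'true')\
    .option('mode', 'PERMISSIVE')\
    .schema(Agent_schema)\
    .load(file_path)

# Spark's docs advise caching (or saving) the parsed result before queries
# that reference only the corrupt-record column
permissive_df.cache()

corrupted_df = permissive_df.filter(permissive_df['_corrupt_record'].isNotNull())
good_df = permissive_df.filter(permissive_df['_corrupt_record'].isNull())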
To write the bad records out to a separate location instead, we define the Agent DataFrame
again, this time adding the option .option('badRecordsPath', bad_record_file_path):
file_path = 'dbfs:/FileStore/tables/AgentRecords.txt'
bad_record_file_path = 'dbfs:/FileStore/tables/bad_records'
Agent_df = spark.read.format('csv')\
    .option('header', 'true')\
    .option('inferSchema', 'false')\
    .schema(Agent_schema)\
    .option('badRecordsPath', bad_record_file_path)\
    .load(file_path)
Now, let's check out the bad records that were stored.
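On Databricks, badRecordsPath writes the rejected rows as JSON files under timestamped sub-folders of the given path (the exact layout can vary by runtime version), so one way to inspect them is a sketch like this:

bad_records_df = spark.read.json(bad_record_file_path + '/*/bad_records/*')
# each entry typically holds the source file path, the raw record and the failure reason
bad_records_df.show(truncate=False)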