Pyspark (Error Handling)

Permissive Mode--when we use permissive mode, any value that cannot be parsed against the
schema is loaded as null, and in addition the complete row containing that unparsed value
is stored in a separate column, the corrupt record column. So the record is not lost:
whatever row has an unparsable value is kept in full in that corrupt record column.

Dropmalformed--any record that has an unparsable value is dropped entirely. If you do not
want to load the bad data, and you do not even want to separate it out, but just want to
drop it, you go for dropmalformed.

Failfast--whenever such a parsing error occurs, execution stops automatically and the
pipeline fails immediately.

Bad Records Path--in case you do not want to drop the bad rows, and you do not want to
load them as nulls either, but just want to redirect those complete rows with unparsed
values, we can go for the bad records path. With the bad records path we specify a
separate location, the error records are written there into a separate file, and we can
share that file with the business to get those records rectified and sent back to us. In
such cases we go for the bad records path.
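All four behaviours are selected through options on spark.read. A minimal sketch of the pattern (the file paths here are just placeholders, and mySchema refers to the schema defined further below):

# choose one of the three modes (PERMISSIVE is the default)
df = (spark.read
      .option("mode", "PERMISSIVE")        # or "DROPMALFORMED" / "FAILFAST"
      .csv("dbfs:/path/to/file.csv", header=True, schema=mySchema))

# or, on Databricks, redirect the bad rows to a separate location instead
df = (spark.read
      .option("badRecordsPath", "dbfs:/path/to/bad_records/")
      .csv("dbfs:/path/to/file.csv", header=True, schema=mySchema))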

Error Handling/Handle with bad records:


--------------------------------------
we are going to read some files from a source, and while reading files from a source we
usually need to handle errors such as the malformed records below.

empid,empname
1001, nsr
1002,sub73n
10sa,subbu
1003,subhan

df = spark.read.format("csv").option("header",True).load("file_path")

dbutils.fs.help()   # help for the Databricks file system utilities (dbutils)
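If the sample file does not exist in DBFS yet, one way to create it is dbutils.fs.put (a sketch; the path is the same one used in the examples below, and the contents are the sample rows above):

dbutils.fs.put(
    "/FileStore/tables/Errorhandling.csv",
    "empid,empname\n1001, nsr\n1002,sub73n\n10sa,subbu\n1003,subhan",
    True)   # overwrite the file if it already exists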

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# schema matching the two columns of the sample file, plus the extra
# _corrupt_record column that PERMISSIVE mode fills in for bad rows
mySchema = StructType([
    StructField("Empid", IntegerType(), False),
    StructField("EmpName", StringType(), False),
    StructField("_corrupt_record", StringType(), True),
])
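By default Spark calls this extra column _corrupt_record; the name can be changed with the columnNameOfCorruptRecord option, as long as the schema uses the same name. A sketch with a hypothetical column name bad_row:

df_bad = (spark.read
          .option("mode", "PERMISSIVE")
          .option("columnNameOfCorruptRecord", "bad_row")   # hypothetical custom name
          .csv("/FileStore/tables/Errorhandling.csv", header=True,
               schema=StructType([
                   StructField("Empid", IntegerType(), False),
                   StructField("EmpName", StringType(), False),
                   StructField("bad_row", StringType(), True),
               ])))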

PERMISSIVE:
-----------
df1 = (spark.read
.option("mode",'PERMISSIVE')
.csv('/FileStore/tables/Errorhandling.csv',header = "True", schema=mySchema))
input:
empid,empname
1001, nsr
1002,sub73n
10sa,subbu
1003,subhan

output:
Empid,EmpName,_corrupt_record
1001,nsr,null
1002,sub73n,null
null,subbu,(10sa,subbu)
1003,subhan,null
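Once the file is loaded in PERMISSIVE mode, the bad rows can be separated out by filtering on the corrupt record column. A sketch (Spark's documentation recommends caching the DataFrame first, since queries that reference only the internal corrupt record column of a raw CSV are not allowed):

from pyspark.sql.functions import col

df1.cache()   # cache so the corrupt record column can be queried reliably
bad_rows = df1.filter(col("_corrupt_record").isNotNull())
good_rows = df1.filter(col("_corrupt_record").isNull()).drop("_corrupt_record")
bad_rows.display()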

DROPMALFORMED:
--------------
df2 = (spark.read
.option("mode",'DROPMALFORMED')
.csv('/FileStore/tables/Errorhandling.csv',header = "True", schema=mySchema))
df2.display()

input:
empid,empname
1001, nsr
1002,sub73n
10sa,subbu
1003,subhan

output:
Empid,EmpName
1001, nsr
1002,sub73n
1003,subhan
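To see how many rows were dropped, one option is to compare the count against a plain read of the same file (a sketch):

raw_count = spark.read.csv('/FileStore/tables/Errorhandling.csv', header=True).count()
kept_count = df2.count()
print(f"dropped {raw_count - kept_count} malformed row(s) out of {raw_count}")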

FAILFAST:
---------
df3 = (spark.read
.option("mode",'FAILFAST')
.csv('/FileStore/tables/Errorhandling.csv',header = "True", schema=mySchema))
df3.display()
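In FAILFAST mode the display above throws an exception as soon as the first malformed record is hit. A sketch of catching it so the notebook cell reports the failure instead of simply aborting (Spark is lazy, so the error is raised when the data is actually read, not when df3 is defined):

try:
    df3 = (spark.read
           .option("mode", "FAILFAST")
           .csv('/FileStore/tables/Errorhandling.csv', header=True, schema=mySchema))
    df3.display()   # the malformed row 10sa,subbu triggers the exception here
except Exception as e:
    print("Load failed on a malformed record:", e)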

BadRecordsPath:
----------------
df4 = (spark.read
.option("badRecordsPath", "dbfs:/FileStore/tables/t8adb_badlogs/")
.csv('/FileStore/tables/Errorhandling.csv',header = "True",
schema=mySchema))

df4.display()
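On Databricks, the redirected rows are written under the badRecordsPath in timestamped sub-folders (typically <badRecordsPath>/<timestamp>/bad_records/) as JSON files. A sketch of reading them back so the file can be shared with the business:

bad_records = spark.read.json("dbfs:/FileStore/tables/t8adb_badlogs/*/bad_records/*")
bad_records.display()   # each row typically holds the source path, the bad record and the reason it failed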
