Pyspark (Error Handling)
Permissive (default)--values that fail to parse against the schema are stored as null, and the complete row that contained the unparsable value is also captured in a separate column, the corrupt record column (_corrupt_record). So no data is lost: the bad fields become null and the whole raw record is kept in the corrupt record column.
Dropmalformed--the whole record that contains an unparsable value is dropped. If you don't want to load the bad data as nulls, and you don't even want to separate it out, but simply want to discard it, you go for DROPMALFORMED.
Bad Records Path--in case you don't want to drop the bad records or load null values, and instead want to redirect the complete rows with unparsable values, you can go for badRecordsPath. With badRecordsPath we specify a separate path where the error records are written to a separate file; we can then share that file with our business users to get those records rectified and sent back to us.
empid,empname
1001, nsr
1002,sub73n
10sa,subbu
1003,subhan
df = spark.read.format("csv").option("header",True).load("file_path")
PERMISSIVE:
-----------
df1 = (spark.read
.option("mode",'PERMISSIVE')
.csv('/FileStore/tables/Errorhandling.csv',header = "True", schema=mySchema))
input:
empid,empname
1001, nsr
1002,sub73n
10sa,subbu
1003,subhan
output:
empid,empname,_corrupt_record
1001,nsr,null
1002,null,(1002,sub73n)
null,subbu,(10sa,subbu)
1003,subhan,null
DROPMALFORMED:
--------------
df2 = (spark.read
.option("mode",'DROPMALFORMED')
.csv('/FileStore/tables/Errorhandling.csv',header = "True", schema=mySchema))
df2.display()
input:
empid,empname
1001, nsr
1002,sub73n
10sa,subbu
1003,subhan
output:
empid,empname
1001, nsr
1003,subhan
FAILFAST:
---------
Failfast--the load fails immediately: Spark raises an exception as soon as it encounters the first record with an unparsable value, and nothing is loaded.
df3 = (spark.read
.option("mode",'FAILFAST')
.csv('/FileStore/tables/Errorhandling.csv',header = "True", schema=mySchema))
df3.display()
BadRecordsPath:
----------------
df4 = (spark.read
.option("badRecordsPath", "dbfs:/FileStore/tables/t8adb_badlogs/")
.csv('/FileStore/tables/Errorhandling.csv',header = "True", schema=mySchema))
df4.display()