Column Renaming in PySpark
from pyspark.sql.functions import *
from pyspark.sql.types import *
df = spark.read.format('csv').option('header', True).option('inferSchema', True) \
    .load('dbfs:/FileStore/dbfs/Rename_columns_data.csv')
display(df)
old_columns = df.columns
print(old_columns)
mapping_list = ["id","First_Name","Last_Name","Email_New","gender","Job_Title"]
rename_df = df.toDF(*mapping_list)
rename_df.show()
display(rename_df)
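When only a few columns need new names, withColumnRenamed is an alternative to rebuilding the whole header with toDF. A minimal sketch; the old names used here (first_name, email) are assumptions about the source file:
# Rename selected columns; columns not mentioned keep their original names
rename_df2 = (df
              .withColumnRenamed('first_name', 'First_Name')   # assumed original name
              .withColumnRenamed('email', 'Email_New'))        # assumed original name
display(rename_df2)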
………………………………………………………………………………………………………………………..
Validate Column Order in PySpark
# Define the correct order
expected_columns = ['id', 'Name', 'age', 'address']
# Create a DataFrame
df = spark.createDataFrame(
    [('Alice', 1, 30, 'street1'), ('Bob', 2, 40, 'street3'), ('Kumar', 3, 35, 'street5')],
    ['Name', 'id', 'age', 'address'])
display(df)
# Get the actual columns
actual_columns = df.columns
# Check whether the column order is correct
if expected_columns == actual_columns:
    print('Column order is correct')
else:
    print(f'Column order is incorrect. Expected: {expected_columns}, but got: {actual_columns}')
# Correct the order of the columns in the dataset
df_correct_order = df.select(*expected_columns)
display(df_correct_order)
…………………………………………………………………………………………………….
How to check Nulls in all columns in PySpark
from pyspark.sql.functions import *
null_counts = df.select([(count(when(col(c).isNull(), c))).alias(c) for c in df.columns])
display(null_counts)
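To make the result concrete, a small sketch with a made-up sample DataFrame; the column names and values are assumptions for illustration only:
# Hypothetical sample data: None marks the missing values
sample_df = spark.createDataFrame(
    [(1, 'Alice', None), (2, None, 'NY'), (3, 'Cara', 'LA')],
    ['id', 'name', 'city'])
null_counts = sample_df.select([count(when(col(c).isNull(), c)).alias(c) for c in sample_df.columns])
display(null_counts)   # id: 0 nulls, name: 1 null, city: 1 null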
…………………………………………………………………………………………………………….
How to read CSV and JSON file
Df = spark.read.option(“header”,True)\
.option(“InferSchema”,True)\
.option(“mode” , “permissive/DROPMAlFormed/Failfast”)\
.option(badRecordPath,”Path”)\
.schema(schema)\
.load(“Path”)
Write CSV:
Df = df.write.mode(SavingMode,Overwrite/Append/Ignore/errorIfexist).CSV(“Path”)
Json☹READ)
Df = Spark.read.option(“multiline”,”True”)\
.option(“mode”,”PERMISSIVE/DROPMALFORMED/FAILFAST”)\
.schema(schema)
.json(“Path”)
Write😊
DF2 = df.write.mode(savingmode, append/overwrite/ignore).json(“PATH”)
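As a concrete illustration of the CSV read template, a runnable sketch with an explicit schema; the file path and column names here are made-up assumptions:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Hypothetical schema for an employees file
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("salary", IntegerType(), True)])

df = spark.read.option("header", True) \
    .option("mode", "DROPMALFORMED") \
    .schema(schema) \
    .csv("dbfs:/FileStore/employees.csv")   # hypothetical path
display(df)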
……………………………………………………………………………………………………………………………………..
How to check Duplicate values in PySpark
from pyspark.sql import functions as F
from pyspark.sql.window import Window
window_spec = Window.partitionBy("col1", "col2").orderBy("col1")
df = df.withColumn("duplicates", F.row_number().over(window_spec))
# Rows with a row number greater than 1 are the duplicate occurrences
df_filtered = df.filter(F.col("duplicates") > 1)
display(df_filtered)
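An alternative sketch that simply lists the duplicated key combinations and how often each occurs, assuming the same col1/col2 keys:
dup_keys = (df.groupBy("col1", "col2")
              .count()
              .filter(F.col("count") > 1))
display(dup_keys)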
…………………………………………………………………………………………..
Ntile function to divide a large dataset into smaller batches
from pyspark.sql import functions as F
from pyspark.sql.window import Window
df = spark.createDataFrame(data, schema)   # data and schema are assumed to be defined
# Order all rows by a generated id so ntile can split them into 3 roughly equal batches
window_spec = Window.orderBy(F.monotonically_increasing_id())
df_batch = df.withColumn("batch", F.ntile(3).over(window_spec))
display(df_batch)
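A self-contained usage sketch with hypothetical data and a loop over the resulting batches; the row values and the per-batch processing step are assumptions:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Hypothetical input: nine rows, so ntile(3) yields three batches of three rows each
data = [(i, f"row_{i}") for i in range(1, 10)]
df = spark.createDataFrame(data, ["id", "value"])
df_batch = df.withColumn("batch", F.ntile(3).over(Window.orderBy(F.monotonically_increasing_id())))

for b in range(1, 4):
    batch_df = df_batch.filter(F.col("batch") == b)
    print(f"Batch {b}: {batch_df.count()} rows")   # placeholder per-batch processing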
………………………………………………………………………………………
Optimize and compact data after an update
from delta.tables import DeltaTable
deltaTable = DeltaTable.forName(spark, "TableName")
deltaTable.optimize().executeZOrderBy("id")
Clean up old snapshots with VACUUM
deltaTable.vacuum()
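By default vacuum() retains files needed for the last 7 days (168 hours) of table history; a minimal sketch of passing the retention window explicitly, assuming older versions are no longer needed:
# Remove data files no longer referenced by table versions older than 168 hours
deltaTable.vacuum(168)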
……………………………………………………………………………………………………….
Given data = 1, 10, 100, 1000, 10000, how to get the output as:
CB_00001,CB_00010,CB_00100,CB_01000,CB_10000
Ans:
# List of data
data = [1, 10, 100, 1000, 10000]
# Use list comprehension to format the numbers as per the desired output
formatted_data = [f"CB_{x:05d}" for x in data]
# Join the formatted data into a single string, separated by commas
output = ",".join(formatted_data)
# Print the result
print(output)
Explanation:
f"CB_{x:05d}": This is Python's f-string formatting. It formats the number x as a 5-digit string, padded with leading zeros (05d).
- 5 means the total width of the string will be 5 characters.
- d formats the number as a decimal integer.
- The leading zeros are automatically added to fill the width.
",".join(formatted_data): This joins the formatted values into a single string, separated by commas.
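A quick check of the format spec on a single arbitrary value:
print(f"CB_{7:05d}")   # prints CB_00007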