Column Renaming in PySpark

The document provides various examples of data manipulation in PySpark, including column renaming, validating column order, checking for null values, reading and writing CSV and JSON files, detecting duplicates, and optimizing data. It also demonstrates how to format numbers into a specific string format. Additionally, it covers the use of the Ntile function for batching data and Delta Lake operations for data optimization and cleanup.

Column Renaming in PySpark:

# Read the source CSV file with a header row and an inferred schema
from pyspark.sql.functions import *
from pyspark.sql.types import *

df = spark.read.format('csv').option('header', True).option('inferSchema', True) \
    .load('dbfs:/FileStore/dbfs/Rename_columns_data.csv')
display(df)

# Replacing spaces in column names with _

new_columns = [c.replace(" ", "_") for c in df.columns]
print(new_columns)
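The cleaned-up names can be applied straight back to the DataFrame; a small sketch using toDF with the automatically generated list (the manual-list variant is shown further below):

# Apply the underscore-separated names in one step
df_clean = df.toDF(*new_columns)
df_clean.printSchema()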

# Getting the old column names

old_columns = df.columns
print(old_columns)

# Renaming all columns at once with a Python list:

mapping_list = ["id","First_Name","Last_Name","Email_New","gender","Job_Title"]
rename_df = df.toDF(*mapping_list)
rename_df.show()

# Renaming all columns with a Python dictionary

mapping_dict = {
    "id": "New_ID",
    "First Name": "First_Name",
    "Last Name": "Last_Name",
    "email": "Email_New",
    "gender": "gender",
    "Job Title": "Job_Title"
}

# Iterate through the mapping dictionary and apply the renaming

df2 = df
for old_col, new_col in mapping_dict.items():
    df2 = df2.withColumnRenamed(old_col, new_col)

# Display the DataFrame with renamed columns

display(df2)
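On Spark 3.4 and later, the loop can be replaced by a single withColumnsRenamed call that accepts the whole mapping; a minimal sketch reusing mapping_dict from above:

# Rename several columns in one call (requires PySpark 3.4+)
df3 = df.withColumnsRenamed(mapping_dict)
display(df3)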

………………………………………………………………………………………………………………………..
Validate column order in PySpark

# Define the correct (expected) column order
expected_columns = ['id', 'Name', 'age', 'address']

# Create a sample DataFrame whose columns are in a different order
df = spark.createDataFrame([('Alice', 1, 30, 'street1'), ('Bob', 2, 40, 'street3'),
                            ('kumar', 3, 35, 'street5')], ['Name', 'id', 'age', 'address'])
display(df)

# Get the actual columns
actual_columns = df.columns

# Check whether the column order is correct
if expected_columns == actual_columns:
    print('column order is correct')
else:
    print(f'column order is incorrect, expected column order is: {expected_columns}, '
          f'but got columns: {actual_columns}')

# To correct the order of the columns in the dataset
df_correct_order = df.select(*expected_columns)
display(df_correct_order)

# Check the number of columns
if len(expected_columns) != len(actual_columns):
    print(f'column count mismatch: expected {len(expected_columns)}, got {len(actual_columns)}')
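Beyond the order check, it can also help to report exactly which columns are missing or unexpected; a small sketch using set differences on the same expected_columns and actual_columns lists:

# Columns that are required but absent from the DataFrame
missing_columns = set(expected_columns) - set(actual_columns)
# Columns that are present but not expected
extra_columns = set(actual_columns) - set(expected_columns)
print(f'missing: {missing_columns}, extra: {extra_columns}')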
…………………………………………………………………………………………………….
How to check for nulls in all columns in PySpark

from pyspark.sql.functions import *

# Count the null values of every column in a single pass
null_counts = df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns])
display(null_counts)
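To work with the result in plain Python, the single-row output can be collected into a dictionary; a small sketch continuing from null_counts above:

# Pull the one-row result back to the driver as {column: null_count}
null_counts_dict = null_counts.collect()[0].asDict()
cols_with_nulls = [c for c, n in null_counts_dict.items() if n > 0]
print(cols_with_nulls)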
…………………………………………………………………………………………………………….
How to read and write CSV and JSON files

Read CSV:

# mode can be PERMISSIVE, DROPMALFORMED or FAILFAST;
# badRecordsPath is a Databricks-specific option for capturing corrupt records
df = spark.read.format("csv") \
    .option("header", True) \
    .option("inferSchema", True) \
    .option("mode", "PERMISSIVE") \
    .option("badRecordsPath", "Path") \
    .schema(schema) \
    .load("Path")

Write CSV:

# save mode can be overwrite, append, ignore or errorifexists
df.write.mode("overwrite").csv("Path")

Read JSON:

# mode can be PERMISSIVE, DROPMALFORMED or FAILFAST
df = spark.read.option("multiLine", True) \
    .option("mode", "PERMISSIVE") \
    .schema(schema) \
    .json("Path")

Write JSON:

# save mode can be append, overwrite, ignore or errorifexists
df.write.mode("append").json("Path")
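A concrete sketch of the CSV read with an explicit schema; the schema, column names and file path below are hypothetical:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Hypothetical schema for a simple people file
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("city", StringType(), True)
])

df = spark.read.format("csv") \
    .option("header", True) \
    .option("mode", "DROPMALFORMED") \
    .schema(schema) \
    .load("dbfs:/FileStore/sample/people.csv")   # hypothetical path
display(df)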
……………………………………………………………………………………………………………………………………..
How to check for duplicate values in PySpark

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Number the rows inside each (col1, col2) group
window_spec = Window.partitionBy("col1", "col2").orderBy("col1")
df = df.withColumn("duplicates", F.row_number().over(window_spec))

# Every row numbered above 1 is a duplicate of an earlier row
df_filtered = df.filter(F.col("duplicates") > 1)
display(df_filtered)
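If the goal is simply to drop the duplicates rather than inspect them, dropDuplicates does it directly; a small sketch on the same col1/col2 keys:

# Keep only one row per (col1, col2) combination
df_dedup = df.dropDuplicates(["col1", "col2"])
display(df_dedup)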
…………………………………………………………………………………………..
Ntile function to divide a large dataset into smaller batches

df = spark.createDataFrame(data, schema)

# Order the rows by a monotonically increasing id and split them into 3 roughly equal batches
window_spec = Window.orderBy(F.monotonically_increasing_id())
df_batch = df.withColumn("Batch", F.ntile(3).over(window_spec))
display(df_batch)
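Once the batch number is assigned, each batch can be pulled out and processed on its own; a small sketch looping over the three batches created above:

# Process each batch separately (3 matches the ntile(3) above)
for batch_no in range(1, 4):
    batch_df = df_batch.filter(F.col("Batch") == batch_no)
    print(f"batch {batch_no}: {batch_df.count()} rows")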
………………………………………………………………………………………
Optimize and compact data after an update

from delta.tables import DeltaTable

deltaTable = DeltaTable.forName(spark, "TableName")
deltaTable.optimize().executeZOrderBy("id")

Clean up old snapshots with VACUUM

deltaTable.vacuum()
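Called with no arguments, vacuum() keeps the default 7-day retention; a hedged sketch of passing an explicit retention period in hours:

# Remove data files no longer referenced by versions older than 168 hours (7 days);
# shorter retention risks breaking time travel and is blocked by a safety check by default
deltaTable.vacuum(168)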
……………………………………………………………………………………………………….
Given data = 1, 10, 100, 1000, 10000, how do we get the output as:
CB_00001,CB_00010,CB_00100,CB_01000,CB_10000
Ans:
# List of data
data = [1, 10, 100, 1000, 10000]
# Use list comprehension to format the numbers as per the desired output
formatted_data = [f"CB_{x:05d}" for x in data]
# Join the formatted data into a single string, separated by commas
output = ",".join(formatted_data)
# Print the result
print(output)
Explanation:
- f"CB_{x:05d}": This is Python's f-string formatting. It formats the number x as a 5-digit string, padded with leading zeros (05d).
  o 5 means the total width of the string will be 5 characters.
  o d formats the number as a decimal integer.
  o Leading zeros are added automatically to fill the width.
- ",".join(formatted_data): This joins the formatted values into a single string, separated by commas.
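Since the rest of these notes are PySpark-focused, the same formatting can also be applied to a DataFrame column; a small sketch using format_string (the DataFrame and column names here are made up):

from pyspark.sql import functions as F

# Hypothetical single-column DataFrame holding the numbers
df_nums = spark.createDataFrame([(1,), (10,), (100,), (1000,), (10000,)], ["value"])

# printf-style formatting: zero-pad the value to 5 digits and prefix it with CB_
df_formatted = df_nums.withColumn("code", F.format_string("CB_%05d", F.col("value")))
display(df_formatted)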
