H AHA2
Uploaded by PRANJAY ROHILLA

Data Preprocessing Techniques
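The snippets below operate on a DataFrame named df1 that the worksheet never shows. A minimal sketch of what it could look like, with column names inferred from the snippets and purely made-up values:

```python
import pandas as pd

# Hypothetical sample data: the worksheet's real df1 is not shown, so these
# rows and values are assumptions; only the column names come from the snippets.
df1 = pd.DataFrame({
    "Category": ["Electronics", "Clothing", "Electronics", "Groceries"],
    "Region": ["North", "South", "North", "East"],
    "Quantity Sold": [5, 3, 8, 2],
    "Sales Amount": [450.0, 120.0, 300.0, 75.0],
    "Transaction Date": ["2023-01-15", "2023-02-20", "2023-04-10", "2023-07-05"],
})
print(df1)
```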

Q1. Calculate the total sales generated by each product.

ts = df1.groupby("Category")["Sales Amount"].sum()
ts = ts.reset_index()
ts.columns = ["Product", "Total Sales"]
print("The total sales generated by each product:\n", ts)

Q2. Calculate the average quantity sold in each region.

avg = df1.groupby("Region")["Quantity Sold"].mean()
avg = avg.reset_index()
avg.columns = ["Region", "Average Quantity Sold"]
print("The average quantity sold in each region:\n", avg)

Q3. Count the total number of transactions across every month.

df1["Transaction Date"] = pd.to_datetime(df1["Transaction Date"])
t = df1.groupby(df1["Transaction Date"].dt.month)["Category"].count()
t = t.reset_index()
t.columns = ["Month", "Transaction Count"]
print("The total number of transactions across every month:\n", t)
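The month-wise grouping above hinges on pd.to_datetime plus the .dt.month accessor; a minimal standalone check with made-up dates:

```python
import pandas as pd

# Assumed dates purely for illustration: two in January, one in March
dates = pd.to_datetime(pd.Series(["2023-01-15", "2023-01-20", "2023-03-05"]))

# .dt.month extracts the month number, which groupby can then bucket on
print(dates.dt.month.tolist())  # → [1, 1, 3]
```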
Q4. Display the regions generating maximum and minimum sales.

rs = df1.groupby("Region")["Sales Amount"].sum()
print("The region generating maximum sales is", rs.idxmax())
print("The region generating minimum sales is", rs.idxmin())

Q5. Display the total sales of every quarter.

df1["Transaction Date"] = pd.to_datetime(df1["Transaction Date"])
sq = df1.groupby(df1["Transaction Date"].dt.quarter)["Sales Amount"].sum()
sq = sq.reset_index()
sq.columns = ["Quarter", "Total Sales"]
print("The total sales of every quarter:\n", sq)

Q6. Normalize the Sales column.

df = df1.copy()
df["Normalized Sales"] = (df["Sales Amount"] - df["Sales Amount"].min()) / (df["Sales Amount"].max() - df["Sales Amount"].min())
print(df[["Sales Amount", "Normalized Sales"]])
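Min-max normalization maps the column into [0, 1], with the minimum landing on 0 and the maximum on 1. A quick standalone check with assumed values:

```python
import pandas as pd

# Assumed toy sales values purely for illustration
s = pd.Series([75.0, 120.0, 300.0, 450.0], name="Sales Amount")

# Min-max scaling: (x - min) / (max - min)
normalized = (s - s.min()) / (s.max() - s.min())
print(normalized.tolist())  # smallest value → 0.0, largest → 1.0
```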
Q7. Apply log transformation on the Sales column to minimize its variance.

import numpy as np
df = df1.copy()
df["Log Transformed Sales"] = np.log(df["Sales Amount"])
print(df[["Sales Amount", "Log Transformed Sales"]])
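As a quick sanity check that the log transform actually compresses variance, here is a sketch with made-up skewed values (not the worksheet's data):

```python
import numpy as np

# Assumed right-skewed sales figures with one large outlier
sales = np.array([50.0, 120.0, 300.0, 450.0, 2000.0])
log_sales = np.log(sales)

# The log-transformed series spans a far narrower range,
# so its variance is much smaller than the raw variance
print(sales.var(), log_sales.var())
```

Note that np.log requires strictly positive values; rows with zero sales would need np.log1p or filtering first.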

Q8. Perform binning on the Sales column in order to categorize the sales into low, medium and high.

df = df1.copy()
bins = [0, 150, 350, 500]
labels = ["Low", "Medium", "High"]
df["Sales Category"] = pd.cut(df["Sales Amount"], bins, labels=labels)
print(df[["Sales Amount", "Sales Category"]])
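pd.cut assigns each value to the bin whose half-open interval contains it (right edges inclusive by default). A small standalone sketch with assumed values, one per bin:

```python
import pandas as pd

# Assumed values chosen to fall in (0,150], (150,350], and (350,500]
sales = pd.Series([90, 200, 480])
cats = pd.cut(sales, bins=[0, 150, 350, 500], labels=["Low", "Medium", "High"])
print(cats.tolist())  # → ['Low', 'Medium', 'High']
```

Values outside the outermost bin edges (e.g. a sale above 500) would come back as NaN, so the bin edges should cover the column's actual range.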
Q9. Find out highly correlated columns of the given data set.

df = df1.copy()
df["Category"] = df["Category"].astype("category").cat.codes
df["Transaction Date"] = df["Transaction Date"].astype("category").cat.codes
df["Region"] = df["Region"].astype("category").cat.codes
threshold = 0.8  # assumed cutoff; the original snippet uses threshold without defining it
c = df.corr()
cp = c.unstack().reset_index()
cp.columns = ["Column1", "Column2", "Correlation"]
cp = cp[(cp["Correlation"] > threshold) & (cp["Column1"] != cp["Column2"])]
cp = cp.drop_duplicates(subset=["Correlation"])
print("Highly correlated columns of this dataset are:\n", cp)
