Data Preprocessing Technique
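The answers below operate on a DataFrame named df1 whose loading step is not shown. The following is only a stand-in sketch for trying the snippets out: the column names are taken from the answers themselves, but every value is invented for illustration and the real dataset will differ.

```python
import pandas as pd

# Hypothetical stand-in for df1. Column names match those used in the answers;
# all values are made up for illustration.
df1 = pd.DataFrame({
    "Transaction Date": ["2024-01-05", "2024-01-20", "2024-02-11",
                         "2024-04-02", "2024-07-15", "2024-10-30"],
    "Category": ["Electronics", "Clothing", "Electronics",
                 "Grocery", "Clothing", "Grocery"],
    "Region": ["North", "South", "North", "East", "South", "East"],
    "Quantity Sold": [2, 5, 1, 10, 3, 8],
    "Sales Amount": [400.0, 120.0, 250.0, 90.0, 310.0, 60.0],
})

print(df1.head())
```

With the real data, df1 would typically come from a call such as pd.read_csv on the actual file instead.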
Q1. Calculate the total sales generated by each product.
ts = df1.groupby("Category")["Sales Amount"].sum()
ts = ts.reset_index()
ts.columns = ["Product", "Total sales"]
print("The total sales generated by each product:\n", ts)
Q2. Calculate the average quantity sold in each region.
avg = df1.groupby("Region")["Quantity Sold"].mean()
avg = avg.reset_index()
avg.columns = ["Region", "Average quantity sold"]
print("The average quantity sold in each region:\n", avg)
Q3. Count the total number of transactions in each month.
import pandas as pd
# Parse the dates once so the .dt accessor is available
df1["Transaction Date"] = pd.to_datetime(df1["Transaction Date"])
t = df1.groupby(df1["Transaction Date"].dt.month)["Category"].count()
t = t.reset_index()
t.columns = ["Month", "Transaction Count"]
print("The total number of transactions in each month:\n", t)
Q4. Display the regions generating maximum and minimum sales.
rs = df1.groupby("Region")["Sales Amount"].sum()
print("The region generating maximum sales is", rs.idxmax())
print("The region generating minimum sales is", rs.idxmin())
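A note on the monthly count in Q3: grouping by dt.month pools the same calendar month from different years into one row. If the data spans more than one year, grouping by a year-month period keeps them apart. A minimal sketch, using a small made-up frame (the name sales and all of its values are hypothetical):

```python
import pandas as pd

# Hypothetical data spanning two years; values invented for illustration.
sales = pd.DataFrame({
    "Transaction Date": pd.to_datetime(
        ["2023-01-10", "2024-01-12", "2024-01-25", "2024-02-03"]
    ),
    "Category": ["A", "B", "A", "C"],
})

# dt.month pools January 2023 together with January 2024.
by_month = sales.groupby(sales["Transaction Date"].dt.month).size()

# dt.to_period("M") keeps each year-month combination separate.
by_year_month = sales.groupby(sales["Transaction Date"].dt.to_period("M")).size()

print(by_month)
print(by_year_month)
```

Which grouping is appropriate depends on whether the question is about seasonality (calendar month) or about a timeline (year and month).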
Q5. Display the total sales of every quarter.
df1["Transaction Date"] = pd.to_datetime(df1["Transaction Date"])
sq = df1.groupby(df1["Transaction Date"].dt.quarter)["Sales Amount"].sum()
sq = sq.reset_index()
sq.columns = ["Quarter", "Total sales"]
print("The total sales of every quarter:\n", sq)
Q6. Normalize the Sales column.
df = df1.copy()
# Min-max scaling: maps the column onto the range [0, 1]
df["Normalized Sales"] = (df["Sales Amount"] - df["Sales Amount"].min()) / (df["Sales Amount"].max() - df["Sales Amount"].min())
print(df[["Sales Amount", "Normalized Sales"]])
Q7. Apply a log transformation to the Sales column to reduce its variance.
import numpy as np
df = df1.copy()
df["Log Transformed Sales"] = np.log(df["Sales Amount"])
print(df[["Sales Amount", "Log Transformed Sales"]])
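Two edge cases are worth keeping in mind for Q6 and Q7: min-max scaling divides by (max - min), which is zero for a constant column, and np.log is undefined at zero. The sketch below uses made-up values and swaps in np.log1p, which computes log(1 + x), as a zero-safe alternative; that swap is an assumption for illustration, not what the answer above uses.

```python
import numpy as np
import pandas as pd

# Hypothetical sales values, including a zero; invented for illustration.
sales = pd.Series([0.0, 50.0, 150.0, 500.0], name="Sales Amount")

# Min-max scaling: (x - min) / (max - min). Assumes max != min.
span = sales.max() - sales.min()
normalized = (sales - sales.min()) / span

# log1p(x) = log(1 + x), so a zero sale maps to 0 instead of -inf.
log_sales = np.log1p(sales)

print(pd.DataFrame({"sales": sales, "normalized": normalized, "log1p": log_sales}))
```

After scaling, the smallest value is 0 and the largest is 1; the log transform compresses the gap between large and small values.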
Q8. Perform binning on the Sales column in order to categorize the sales into low, medium and high.
df = df1.copy()
bins = [0, 150, 350, 5001]
labels = ["Low", "Medium", "High"]
df["Sales Category"] = pd.cut(df["Sales Amount"], bins, labels=labels)
print(df[["Sales Amount", "Sales Category"]])
Q9. Find the highly correlated columns of the given dataset.
df = df1.copy()
# Encode non-numeric columns as integer codes so corr() can use them
df["Category"] = df["Category"].astype("category").cat.codes
df["Transaction Date"] = df["Transaction Date"].astype("category").cat.codes
df["Region"] = df["Region"].astype("category").cat.codes
threshold = 0.8  # assumed cutoff; the original does not define a value
c = df.corr()
cp = c.unstack().reset_index()
cp.columns = ["Column1", "Column2", "Correlation"]
cp = cp[(cp["Correlation"] > threshold) & (cp["Column1"] != cp["Column2"])]
cp = cp.drop_duplicates(subset=["Correlation"])
print("Highly correlated columns of this dataset are:\n", cp)
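In Q9, removing mirrored pairs with drop_duplicates on the correlation value can also drop two different pairs that happen to share the same coefficient. One alternative, sketched here under stated assumptions, is to keep only the upper triangle of the correlation matrix so that each pair appears exactly once. The frame data, the cutoff of 0.8, and the use of the absolute value (so strong negative correlations are also reported) are all illustrative choices, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data; values invented for illustration.
data = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],  # exactly 2 * a, so strongly correlated with a
    "c": [3, 1, 4, 1, 5],   # only weakly correlated with a and b
})
threshold = 0.8  # assumed cutoff

corr = data.corr()

# Boolean mask that is True strictly above the diagonal: each pair once,
# and no column paired with itself.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

pairs = corr.where(mask).stack().reset_index()
pairs.columns = ["Column1", "Column2", "Correlation"]

# Masked-out cells (if present as NaN) compare as False and fall out here.
high = pairs[pairs["Correlation"].abs() > threshold]
print(high)
```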