9/7/23, 1:49 PM Assign 4-Samana Tatheer 20U00323 .ipynb - Colaboratory
Name: Samana Tatheer ID: 20U00323 Assign 4
import pandas as pd
Q1. Import Dataset
df323 = pd.read_csv("/content/train.csv")
df323
        User_ID  Product_ID  Gender  Age    Occupation  City_Category  Stay_In_Current_Ci
0       1000001  P00069042   F       0-17   10          A
1       1000001  P00248942   F       0-17   10          A
2       1000001  P00087842   F       0-17   10          A
3       1000001  P00085442   F       0-17   10          A
4       1000002  P00285442   M       55+    16          C
...     ...      ...         ...     ...    ...         ...
550063  1006033  P00372445   M       51-55  13          B
550064  1006035  P00375436   F       26-35  1           C
550065  1006036  P00375436   F       26-35  15          B
550066  1006038  P00375436   F       55+    1           C
550067  1006039  P00371644   F       46-50  0           B
550068 rows × 12 columns
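As a quick sanity check after loading, the shape and first rows can be inspected. The sketch below assumes a small in-memory CSV (`sample_csv`, a hypothetical stand-in for /content/train.csv with only five of the twelve columns); the Purchase values are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for /content/train.csv (assumption, not the real file).
# IDs and age groups mimic the rows shown above; Purchase values are made up.
sample_csv = io.StringIO(
    "User_ID,Product_ID,Gender,Age,Purchase\n"
    "1000001,P00069042,F,0-17,100\n"
    "1000002,P00285442,M,55+,200\n"
    "1006039,P00371644,F,46-50,300\n"
)

df_sample = pd.read_csv(sample_csv)

# shape gives (rows, columns), the same figures printed under the full table
n_rows, n_cols = df_sample.shape
print(n_rows, "rows x", n_cols, "columns")

# head() shows the first rows without printing the whole frame
print(df_sample.head(2))
```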
Q2. Data Profiling: summarizes the dataset, e.g. the number of variables and observations, missing cells, duplicate rows and variable types
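The headline figures a profile reports can be approximated with plain pandas. This is only a rough sketch on a made-up four-row frame (`toy` and its values are assumptions), not how ydata_profiling computes its report.

```python
import numpy as np
import pandas as pd

# Made-up frame for illustration; column names mimic the dataset
toy = pd.DataFrame({
    "Gender": ["F", "M", "M", "F"],
    "Product_Category_2": [np.nan, 6.0, np.nan, 14.0],
    "Purchase": [100, 200, 300, 400],
})

missing_cells = int(toy.isna().sum().sum())  # NaN count over all cells

overview = {
    "Number of variables": toy.shape[1],
    "Number of observations": toy.shape[0],
    "Missing cells": missing_cells,
    "Missing cells (%)": round(100 * missing_cells / toy.size, 1),
    "Duplicate rows": int(toy.duplicated().sum()),
}

for name, value in overview.items():
    print(f"{name}: {value}")
```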
pip install ydata_profiling
Collecting ydata_profiling
Downloading ydata_profiling-4.5.1-py2.py3-none-any.whl (357 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 357.3/357.3 kB 5.1 MB/s eta 0:00:00
Requirement already satisfied: scipy<1.12,>=1.4.1 in /usr/local/lib/python3.10/dist-packages (from ydata_profili
Requirement already satisfied: pandas!=1.4.0,<2.1,>1.1 in /usr/local/lib/python3.10/dist-packages (from ydata_pr
Requirement already satisfied: matplotlib<4,>=3.2 in /usr/local/lib/python3.10/dist-packages (from ydata_profili
Collecting pydantic<2,>=1.8.1 (from ydata_profiling)
Downloading pydantic-1.10.12-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.1/3.1 MB 43.3 MB/s eta 0:00:00
Requirement already satisfied: PyYAML<6.1,>=5.0.0 in /usr/local/lib/python3.10/dist-packages (from ydata_profili
Requirement already satisfied: jinja2<3.2,>=2.11.1 in /usr/local/lib/python3.10/dist-packages (from ydata_profil
Collecting visions[type_image_path]==0.7.5 (from ydata_profiling)
Downloading visions-0.7.5-py3-none-any.whl (102 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 102.7/102.7 kB 10.4 MB/s eta 0:00:00
Requirement already satisfied: numpy<1.24,>=1.16.0 in /usr/local/lib/python3.10/dist-packages (from ydata_profil
Collecting htmlmin==0.1.12 (from ydata_profiling)
Downloading htmlmin-0.1.12.tar.gz (19 kB)
Preparing metadata (setup.py) ... done
https://colab.research.google.com/drive/146Saq0pZI_Bj-nEicmWsnkoA83FDynR_#scrollTo=konJxjGp1Al5&printMode=true
Collecting phik<0.13,>=0.11.1 (from ydata_profiling)
Downloading phik-0.12.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (679 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 679.5/679.5 kB 50.7 MB/s eta 0:00:00
Requirement already satisfied: requests<3,>=2.24.0 in /usr/local/lib/python3.10/dist-packages (from ydata_profil
Requirement already satisfied: tqdm<5,>=4.48.2 in /usr/local/lib/python3.10/dist-packages (from ydata_profiling)
Requirement already satisfied: seaborn<0.13,>=0.10.1 in /usr/local/lib/python3.10/dist-packages (from ydata_prof
Collecting multimethod<2,>=1.4 (from ydata_profiling)
Downloading multimethod-1.9.1-py3-none-any.whl (10 kB)
Requirement already satisfied: statsmodels<1,>=0.13.2 in /usr/local/lib/python3.10/dist-packages (from ydata_pro
Collecting typeguard<3,>=2.13.2 (from ydata_profiling)
Downloading typeguard-2.13.3-py3-none-any.whl (17 kB)
Collecting imagehash==4.3.1 (from ydata_profiling)
Downloading ImageHash-4.3.1-py2.py3-none-any.whl (296 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 296.5/296.5 kB 25.4 MB/s eta 0:00:00
Requirement already satisfied: wordcloud>=1.9.1 in /usr/local/lib/python3.10/dist-packages (from ydata_profiling
Collecting dacite>=1.8 (from ydata_profiling)
Downloading dacite-1.8.1-py3-none-any.whl (14 kB)
Requirement already satisfied: PyWavelets in /usr/local/lib/python3.10/dist-packages (from imagehash==4.3.1->yda
Requirement already satisfied: pillow in /usr/local/lib/python3.10/dist-packages (from imagehash==4.3.1->ydata_p
Requirement already satisfied: attrs>=19.3.0 in /usr/local/lib/python3.10/dist-packages (from visions[type_image
Requirement already satisfied: networkx>=2.4 in /usr/local/lib/python3.10/dist-packages (from visions[type_image
Collecting tangled-up-in-unicode>=0.0.4 (from visions[type_image_path]==0.7.5->ydata_profiling)
Downloading tangled_up_in_unicode-0.2.0-py3-none-any.whl (4.7 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4.7/4.7 MB 85.2 MB/s eta 0:00:00
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2<3.2,>=2.1
Requirement already satisfied: contourpy>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib<4,>=
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.10/dist-packages (from matplotlib<4,>=3.2-
Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib<4,>
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib<4,>
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib<4,>=3
Requirement already satisfied: pyparsing>=2.3.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib<4,>=
Requirement already satisfied: python-dateutil>=2.7 in /usr/local/lib/python3.10/dist-packages (from matplotlib<
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas!=1.4.0,<2.1,
Requirement already satisfied: joblib>=0.14.1 in /usr/local/lib/python3.10/dist-packages (from phik<0.13,>=0.11.
Requirement already satisfied: typing-extensions>=4.2.0 in /usr/local/lib/python3.10/dist-packages (from pydanti
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from request
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests<3,>=2.24.0
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests<3,>=
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests<3,>=
from ydata_profiling import ProfileReport
profile = ProfileReport(df323, title="Data profile")
profile
Summarize dataset: 100% 58/58 [00:34<00:00, 1.70it/s, Completed]
Generate report structure: 100% 1/1 [00:09<00:00, 9.52s/it]
Render HTML: 100% 1/1 [00:01<00:00, 1.81s/it]
Overview
Dataset statistics
Number of variables 12
Number of observations 550068
Missing cells 556885
Missing cells (%) 8.4%
Duplicate rows 0
Duplicate rows (%) 0.0%
Total size in memory 50.4 MiB
Average record size in memory 96.0 B
Variable types
Numeric 6
Text 1
Categorical 5
Alerts
Product_Category_2 is highly overall correlated with Product_Category_3 High correlation
There are 12 variables in the dataset
df323.dtypes
User_ID int64
Product_ID object
Gender object
Age object
Occupation int64
City_Category object
Stay_In_Current_City_Years object
Marital_Status int64
Product_Category_1 int64
Product_Category_2 float64
Product_Category_3 float64
Purchase int64
dtype: object
Q3. Convert the variables into categorical and numerical data types
Making a list of the columns to convert
cols=['User_ID', 'Product_ID', 'Gender', 'Age', 'Occupation', 'City_Category','Stay_In_Current_City_Years','Marital_Stat
Converting all the listed variables to the categorical type
df323[cols] = df323[cols].astype("category")
Converting the Purchase variable to float
df323["Purchase"]=df323["Purchase"].astype(float)
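The two conversions can be checked on a small frame. A minimal sketch, assuming a made-up `toy` DataFrame and a hypothetical `cat_cols` list in place of the full column list:

```python
import pandas as pd

# Made-up frame for illustration
toy = pd.DataFrame({
    "Gender": ["F", "M", "M", "F"],
    "City_Category": ["A", "C", "B", "A"],
    "Purchase": [100, 200, 300, 400],
})

# Columns to treat as categorical (hypothetical shortened list)
cat_cols = ["Gender", "City_Category"]
toy[cat_cols] = toy[cat_cols].astype("category")

# Integer purchase amounts become floats, which also lets the column hold NaN later
toy["Purchase"] = toy["Purchase"].astype(float)

print(toy.dtypes)
# The distinct levels of a categorical column are available through .cat.categories
print(toy["City_Category"].cat.categories.tolist())
```

Converting Purchase to float before the outlier step matters because an integer column cannot store NaN.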
Q4. Identification of outliers
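import numpy as np
# First and third quartiles of Purchase
Q1,Q3=np.percentile(df323["Purchase"],[25,75])
# Interquartile range
IQR=Q3-Q1
IQR
6231.0
# Row positions above the upper fence and below the lower fence
upper=np.where(df323["Purchase"]>(Q3+1.5*IQR))
lower=np.where(df323["Purchase"]<(Q1-1.5*IQR))
Replace outliers with missing values
# Series.replace matches values, not row positions, so assign NaN by position instead
df323.loc[df323.index[upper[0]],"Purchase"]=np.nan
df323.loc[df323.index[lower[0]],"Purchase"]=np.nan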
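# Count missing values per column. sum must be called with parentheses; the
# output recorded below came from df323.isnull().sum without them, which
# displays the bound method and the full Boolean mask instead of the counts.
df323.isnull().sum()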
<bound method NDFrame._add_numeric_operations.<locals>.sum of User_ID Product_ID Gender Age
Occupation City_Category \
0 False False False False False False
1 False False False False False False
2 False False False False False False
3 False False False False False False
4 False False False False False False
... ... ... ... ... ... ...
550063 False False False False False False
550064 False False False False False False
550065 False False False False False False
550066 False False False False False False
550067 False False False False False False
Stay_In_Current_City_Years Marital_Status Product_Category_1 \
0 False False False
1 False False False
2 False False False
3 False False False
4 False False False
... ... ... ...
550063 False False False
550064 False False False
550065 False False False
550066 False False False
550067 False False False
Product_Category_2 Product_Category_3 Purchase
0 True True False
1 False False False
2 True True False
3 False True False
4 True True False
... ... ... ...
550063 True True False
550064 True True False
550065 True True False
550066 True True False
550067 True True False
[550068 rows x 12 columns]>
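The IQR rule from Q4 can be traced on a handful of values. This is a minimal sketch with a made-up `purchase` Series; it uses a Boolean mask and `Series.mask` rather than `np.where` positions, which is one possible way to set outliers to NaN.

```python
import numpy as np
import pandas as pd

# Made-up values: 95.0 and 1.0 are planted as obvious outliers
purchase = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 95.0, 11.0, 1.0], name="Purchase")

q1, q3 = np.percentile(purchase, [25, 75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# True where a value falls outside either fence
is_outlier = (purchase < lower_fence) | (purchase > upper_fence)

# mask() replaces the flagged entries with NaN and leaves the rest unchanged
cleaned = purchase.mask(is_outlier)

print(f"IQR={iqr}, fences=({lower_fence}, {upper_fence})")
print(cleaned)
```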
Q5. Dropping Variables
df323b = df323.drop(["User_ID","Product_ID"], axis=1)
df323b.shape
(550068, 10)
Q6. Replacing missing values with the median (Purchase) and the mode (Product_Category_2, Product_Category_3)
df323b["Purchase"]=df323b["Purchase"].fillna(df323b["Purchase"].median())
df323b["Product_Category_2"]=df323b["Product_Category_2"].fillna(df323b["Product_Category_2"].mode()[0])
df323b["Product_Category_3"]=df323b["Product_Category_3"].fillna(df323b["Product_Category_3"].mode()[0])
df323b.isnull().sum()
Gender 0
Age 0
Occupation 0
City_Category 0
Stay_In_Current_City_Years 0
Marital_Status 0
Product_Category_1 0
Product_Category_2 0
Product_Category_3 0
Purchase 0
dtype: int64
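The same median and mode imputation can be followed on a tiny example. A minimal sketch, assuming a made-up `toy` frame with one missing value per column:

```python
import numpy as np
import pandas as pd

# Made-up frame with one NaN in each column
toy = pd.DataFrame({
    "Purchase": [100.0, np.nan, 300.0, 500.0],
    "Product_Category_2": [8.0, 8.0, np.nan, 14.0],
})

# Continuous variable: fill with the median, which is robust to extreme values
toy["Purchase"] = toy["Purchase"].fillna(toy["Purchase"].median())

# Category code: fill with the most frequent value; mode() returns a Series, so take [0]
toy["Product_Category_2"] = toy["Product_Category_2"].fillna(
    toy["Product_Category_2"].mode()[0]
)

print(toy)
print(toy.isnull().sum())
```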
Q7. Descriptive Statistics
df323b.describe(include='all')
        Gender  Age     Occupation  City_Category  Stay_In_Current_City_Years  Marital_
count   550068  550068  550068.0    550068         550068                      55
unique  2       7       21.0        3              5
top     M       26-35   4.0         B              1
freq    414259  219587  72308.0     231173         193821                      32
mean    NaN     NaN     NaN         NaN            NaN
std     NaN     NaN     NaN         NaN            NaN
min     NaN     NaN     NaN         NaN            NaN
25%     NaN     NaN     NaN         NaN            NaN
50%     NaN     NaN     NaN         NaN            NaN
75%     NaN     NaN     NaN         NaN            NaN
max     NaN     NaN     NaN         NaN            NaN
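With include='all', categorical columns show NaN for mean, std and the quantiles, and numeric columns show NaN for unique, top and freq. One option, sketched here on a made-up `toy` frame, is to describe each kind of column separately:

```python
import pandas as pd

# Made-up frame with one categorical and one numeric column
toy = pd.DataFrame({
    "Gender": pd.Categorical(["F", "M", "M", "F", "M"]),
    "Purchase": [100.0, 200.0, 300.0, 400.0, 500.0],
})

# count, mean, std, min, quartiles and max for numeric columns only
numeric_summary = toy.describe(include=["number"])

# count, unique, top and freq for categorical columns only
categorical_summary = toy.describe(include=["category"])

print(numeric_summary)
print(categorical_summary)
```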