DCS CSED
Machine Learning Project Workflow
1. Import Libraries and Load the dataset
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, PowerTransformer
import scipy.stats as stats
import warnings
warnings.filterwarnings('ignore')
data = pd.read_csv('path/to/your/[Link]')
2. Data Exploration
1. Initial Data Inspection: Examine the dataset's shape and columns.
data.head()
data.info()
<.info() lists each column's dtype and non-null count, and its dtype summary
gives a direct count of numeric and categorical variables>
<variables/attributes are columns, records are rows>
5 point summary:
data.describe()
numeric:
<min, max values>
<50 percentile/median>
<25,75>
<std, mean>
DYSMECH COMPETENCY SERVICES PVT. LTD.
categorical:
data.describe(include='O')
data.describe(include=object)   # equivalent to include='O'
<number of categories present in the variable>
<the top category with highest freq>
<freq of the top category>
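The inspection calls above can be tried on a small frame before running them on the real dataset. This is a minimal sketch: the column names and values are made up for illustration, and the categorical column is forced to the object dtype so that include='object' selects it.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset
data = pd.DataFrame({
    "age": [25, 32, 47, 51, 38],
    "city": pd.Series(["Pune", "Delhi", "Pune", "Mumbai", "Pune"], dtype="object"),
})

print(data.shape)    # (number of records, number of variables)
print(data.dtypes)   # dtype of each column

# Numeric columns: count, mean, std, min, 25%, 50%, 75%, max
numeric_summary = data.describe()

# Categorical columns: count, unique, top, freq
categorical_summary = data.describe(include="object")

print(numeric_summary)
print(categorical_summary)
```

The 50% row of the numeric summary is the median, and the top/freq rows of the categorical summary give the most frequent category and how often it occurs.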
3. Identify Missing Values: Check for missing values in each column.
- data.isnull().sum()
# column-wise count of missing values
- data.isnull().sum(axis=1)
# count of missing values in each record
Missing value treatment:
1. Drop:
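data.dropna(axis=0, how='any', subset=[col])   # drop records missing a value in col; how='all' drops only fully empty records
data.dropna(axis=1, thresh=num)   # keep only columns with at least num non-missing values
<how and thresh cannot be passed together>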
2. Impute:
- mean/median for numeric
tab[col] = tab[col].fillna(tab[col].median())   # or tab[col].mean()
- mode for categorical
tab[col] = tab[col].fillna(tab[col].mode()[0])
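The counting, dropping and imputing steps above can be sketched end to end on a toy frame. The column names and values here are hypothetical; the median is used for the numeric column and the mode for the categorical one, as listed above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in both a numeric and a categorical column
data = pd.DataFrame({
    "income": [40.0, np.nan, 55.0, 70.0, np.nan],
    "grade": pd.Series(["A", "B", None, "B", "B"], dtype="object"),
})

missing_per_column = data.isnull().sum()        # column-wise count
missing_per_row = data.isnull().sum(axis=1)     # record-wise count

# Option 1: drop every record that has at least one missing value
dropped = data.dropna(how="any")

# Option 2: impute instead of dropping
imputed = data.copy()
imputed["income"] = imputed["income"].fillna(imputed["income"].median())
imputed["grade"] = imputed["grade"].fillna(imputed["grade"].mode()[0])

print(missing_per_column)
print(dropped)
print(imputed)
```

Dropping is simple but discards three of the five records here, which is why imputation is usually preferred when missing values are spread across many rows.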
4. EDA: Follow EDA Cheat sheet for that
1. Measure of Central Tendency- Mean, Median, Mode
2. Distribution of Data – using Visualization technique
a. Univariate Analysis
b. Bivariate Analysis
c. Multivariate Analysis
3. Dispersion of Data- min, max, range, variance, standard deviation, coefficient of variation
4. Skewness and Kurtosis
5. Covariance and Correlation
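The numeric measures in the list above map onto pandas methods roughly as follows. This is a sketch on made-up values; the columns x and y are hypothetical, and pandas reports sample variance and excess kurtosis by default.

```python
import pandas as pd

# Hypothetical toy data
data = pd.DataFrame({
    "x": [2.0, 4.0, 4.0, 6.0, 9.0],
    "y": [1.0, 3.0, 2.0, 5.0, 8.0],
})

# 1. Central tendency
mean_x = data["x"].mean()
median_x = data["x"].median()
mode_x = data["x"].mode()[0]

# 3. Dispersion
value_range = data["x"].max() - data["x"].min()
variance_x = data["x"].var()        # sample variance (ddof=1)
std_x = data["x"].std()             # sample standard deviation
cv_x = std_x / mean_x               # coefficient of variation

# 4. Shape of the distribution
skewness_x = data["x"].skew()       # > 0 means a longer right tail
kurtosis_x = data["x"].kurt()       # excess kurtosis (normal = 0)

# 5. Relationship between two variables
covariance_xy = data["x"].cov(data["y"])
correlation = data.corr()           # Pearson correlation matrix

print(mean_x, median_x, mode_x, value_range, variance_x, cv_x)
print(skewness_x, kurtosis_x, covariance_xy)
print(correlation)
```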
5. Identify outliers
using box plot
Treatment for Outliers
q1 = data['column'].quantile(0.25)
q3 = data['column'].quantile(0.75)
iqr = q3 - q1
ul = q3 + 1.5 * iqr
ll = q1 - 1.5 * iqr
1. Drop
data = data[~((data['column'] < ll) | (data['column'] > ul))]
2. Capping
data['column'] = np.where(data['column'] > ul, ul, np.where(data['column'] < ll, ll, data['column']))
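The IQR rule and both treatments can be checked on a small series with one obvious outlier. The values below are made up, and Series.clip is used here as a shorter alternative to the nested np.where for capping.

```python
import pandas as pd

# Hypothetical toy data: 95 is far from the rest of the values
data = pd.DataFrame({"column": [10, 12, 11, 13, 12, 11, 95]})

q1 = data["column"].quantile(0.25)
q3 = data["column"].quantile(0.75)
iqr = q3 - q1
ul = q3 + 1.5 * iqr   # upper limit
ll = q1 - 1.5 * iqr   # lower limit

# Flag records outside [ll, ul]
outlier_mask = (data["column"] < ll) | (data["column"] > ul)

# 1. Drop: keep only the records inside the limits
dropped = data[~outlier_mask]

# 2. Capping: pull values outside the limits back to the nearest limit
capped = data.copy()
capped["column"] = capped["column"].clip(lower=ll, upper=ul)

print(ll, ul)
print(dropped)
print(capped)
```

Dropping shortens the dataset, while capping keeps every record and only changes the extreme values.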
6. Data Transformation
Log Transformation:
df['column'] = np.log(df['column'])   # values must be > 0; use np.log1p if zeros are present
Box-Cox Transformation:
pt = PowerTransformer(method='box-cox')   # requires strictly positive values
df['transformed'] = pt.fit_transform(df[['column']])
Yeo-Johnson Transformation:
pt = PowerTransformer(method='yeo-johnson')   # also accepts zero and negative values
df['transformed'] = pt.fit_transform(df[['column']])
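The three transformations can be compared side by side on a right-skewed column. This is a sketch on made-up positive values; the output column names are hypothetical, and PowerTransformer standardizes its output by default.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Hypothetical right-skewed, strictly positive data
df = pd.DataFrame({"column": [1.0, 2.0, 3.0, 5.0, 8.0, 13.0, 40.0, 120.0]})

# Log transformation (needs values > 0)
df["log"] = np.log(df["column"])

# Box-Cox (needs values > 0); output is standardized to mean 0, variance 1
pt_boxcox = PowerTransformer(method="box-cox")
df["boxcox"] = pt_boxcox.fit_transform(df[["column"]]).ravel()

# Yeo-Johnson (also works with zeros and negatives)
pt_yeojohnson = PowerTransformer(method="yeo-johnson")
df["yeojohnson"] = pt_yeojohnson.fit_transform(df[["column"]]).ravel()

# Compare skewness before and after
print(df.skew())
```

A skewness closer to 0 after transformation suggests the distribution has become more symmetric.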
7. Scaling
Follow EDA Cheat sheet for that
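As a quick illustration of this step, the sketch below applies two common scalers from scikit-learn. The choice of StandardScaler and MinMaxScaler is an assumption (the EDA cheat sheet may recommend others), and the data is made up.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical numeric features on different scales
data = pd.DataFrame({
    "height": [150.0, 160.0, 170.0, 180.0, 190.0],
    "weight": [50.0, 60.0, 65.0, 80.0, 95.0],
})

# Standardization: each column gets mean 0 and unit variance
standardized = pd.DataFrame(
    StandardScaler().fit_transform(data), columns=data.columns
)

# Min-max normalization: each column is rescaled to the range [0, 1]
normalized = pd.DataFrame(
    MinMaxScaler().fit_transform(data), columns=data.columns
)

print(standardized)
print(normalized)
```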
8. Encoding
Follow EDA Cheat sheet for that
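A minimal encoding sketch, assuming label encoding and one-hot encoding are the techniques meant here (the EDA cheat sheet may list others). The columns are hypothetical. Note that LabelEncoder assigns integers in alphabetical order of the categories, which carries no real ranking.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical data
data = pd.DataFrame({
    "size": ["S", "M", "L", "M"],
    "color": ["red", "blue", "red", "green"],
})

# Label encoding: one integer per category (sorted alphabetically)
le = LabelEncoder()
data["size_label"] = le.fit_transform(data["size"])

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(data["color"], prefix="color")

print(list(le.classes_))
print(data)
print(one_hot)
```

One-hot encoding is usually the safer choice for nominal features, because the integers from label encoding can be misread by a model as an order.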
9. Train-Test Split
Follow EDA Cheat sheet for that
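A minimal sketch of the split using scikit-learn's train_test_split. The feature and target names, the 80/20 ratio and the random_state are assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: two features and a binary target
data = pd.DataFrame({
    "feature_1": range(10),
    "feature_2": range(10, 20),
    "target": [0, 1] * 5,
})

X = data.drop(columns="target")   # independent variables
y = data["target"]                # target variable

# 80% train, 20% test; stratify keeps the class ratio the same in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
```

The stratify argument applies to classification targets; for a continuous target it would be left out.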
10. Feature Scaling Explanation
Follow EDA Cheat sheet for that
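One point worth illustrating for this step is where scaling sits relative to the split: a common practice is to fit the scaler on the training part only and reuse it on the test part, so that no information from the test data leaks into training. The sketch below assumes StandardScaler and randomly generated data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric features
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 2))

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # learn mean/std from train only
X_test_scaled = scaler.transform(X_test)         # reuse the train mean/std

print(scaler.mean_)
print(X_test_scaled.mean(axis=0))   # close to 0, but not exactly 0
```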
11. Apply the Algorithm according to target variable
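As a rough illustration of choosing by target type: a continuous target calls for a regression algorithm and a categorical target for a classification algorithm. The helper choose_model, its threshold of 10 unique values, and the two baseline models are all assumptions made for this sketch, not a fixed rule.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression, LogisticRegression


def choose_model(target: pd.Series):
    """Hypothetical helper: pick a baseline model from the target variable."""
    # Assumption: a numeric target with many distinct values is continuous
    if pd.api.types.is_numeric_dtype(target) and target.nunique() > 10:
        return LinearRegression()
    return LogisticRegression(max_iter=1000)


# Hypothetical data
X = pd.DataFrame({"x": range(1, 21)})
y_continuous = X["x"] * 2.0 + 1.0        # 20 distinct values -> regression
y_class = (X["x"] > 10).astype(int)      # 2 classes -> classification

reg = choose_model(y_continuous).fit(X, y_continuous)
clf = choose_model(y_class).fit(X, y_class)

print(type(reg).__name__, reg.coef_)
print(type(clf).__name__, clf.predict(pd.DataFrame({"x": [1, 20]})))
```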
NOTE: Apply the above steps as relevant to your project. If a step is
not essential, skip it and proceed to the next one.