Machine Learning Project Roadmap

The document outlines a machine learning project workflow, detailing steps such as importing libraries, data exploration, identifying and treating missing values, performing exploratory data analysis (EDA), and handling outliers. It emphasizes the importance of data transformation, scaling, encoding, and splitting the dataset into training and testing sets. The note at the end suggests applying the steps as relevant to the specific project, allowing for flexibility in the workflow.

Uploaded by

Karan Kosare

DCS CSED

Machine Learning Project Workflow


1. Import Libraries and Load the dataset
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, PowerTransformer  # PowerTransformer is used in step 6

import scipy.stats as stats

import warnings
warnings.filterwarnings('ignore')

data = pd.read_csv('path/to/your/data.csv')

2. Data Exploration
1. Initial Data Inspection: Examine the dataset's shape, columns and data types.
data.shape
data.head()
data.info()
<.info() lists each column's dtype and non-null count, and ends with a count of columns per dtype, which separates numeric from categorical variables>
<variables/attributes are columns, records are rows>

5 point summary:
data.describe()
numeric:
<min, max values>
<50th percentile/median>
<25th and 75th percentiles>
<count, mean and std are also reported>
DYSMECH COMPETENCY SERVICES PVT. LTD. 2

categorical:
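data.describe(include='O')      # 'O' is shorthand for the object dtype
data.describe(include=object)   # equivalent to the line above
<unique: number of categories present in the variable>
<top: the category with the highest frequency>
<freq: frequency of the top category>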
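As a rough illustration of the two summaries above, the sketch below runs describe() on a small made-up table. The DataFrame `toy` and its columns `age` and `city` are hypothetical and stand in for your own data.

```python
import pandas as pd

# Hypothetical toy dataset, used only for illustration.
toy = pd.DataFrame({
    "age": [22, 35, 47, 51, 35],
    "city": pd.Series(["Pune", "Mumbai", "Pune", "Delhi", "Pune"], dtype="object"),
})

# Numeric summary: count, mean, std, min, quartiles and max.
numeric_summary = toy.describe()

# Categorical summary: count, unique, top and freq.
categorical_summary = toy.describe(include="object")

# Pull out only the five-point summary for one numeric column.
five_point = numeric_summary.loc[["min", "25%", "50%", "75%", "max"], "age"]
print(five_point)
print(categorical_summary["city"])
```

The median is the row labelled 50%, so it can be read straight from the numeric summary without a separate call.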

3. Identify Missing Values: Check for missing values in each column.

data.isnull().sum()
# column-wise count of missing values
data.isnull().sum(axis=1)
# count of missing values in each record (row)
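The two counts above can be tried on a small made-up table. This is only a sketch: the DataFrame `toy` and its columns are hypothetical, and the percentage line is an optional extra that often helps when choosing between dropping and imputing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset with a few missing entries.
toy = pd.DataFrame({
    "height": [170.0, np.nan, 165.0, 180.0],
    "weight": [65.0, 72.0, np.nan, np.nan],
})

per_column = toy.isnull().sum()          # missing values in each column
per_row = toy.isnull().sum(axis=1)       # missing values in each record
pct_missing = toy.isnull().mean() * 100  # share of missing values per column, in percent

print(per_column)
print(per_row)
print(pct_missing)
```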

Missing value treatment:


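1. Drop:
data.dropna(axis=0, how='any')      # drop rows with any missing value; how='all' drops only fully empty rows
data.dropna(axis=1, thresh=num)     # keep columns with at least num non-missing values
data.dropna(subset=[col])           # drop rows where col is missing
<how and thresh cannot be combined in one call in recent pandas versions>

2. Impute:
-mean/median for numeric
data[col] = data[col].fillna(data[col].median())   # or .mean()
-mode for categorical
data[col] = data[col].fillna(data[col].mode()[0])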
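The same imputation can also be done with SimpleImputer, which is imported in step 1. The sketch below is one possible way to use it, not the only one. The DataFrame `toy` and its columns `income` and `grade` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy dataset: one numeric and one categorical column.
toy = pd.DataFrame({
    "income": [30.0, np.nan, 50.0, 40.0],
    "grade": pd.Series(["A", "B", np.nan, "A"], dtype="object"),
})

# Median imputation for the numeric column.
num_imputer = SimpleImputer(strategy="median")
toy["income"] = num_imputer.fit_transform(toy[["income"]]).ravel()

# Mode (most frequent value) imputation for the categorical column.
cat_imputer = SimpleImputer(strategy="most_frequent")
toy["grade"] = cat_imputer.fit_transform(toy[["grade"]]).ravel()

print(toy)
```

A fitted imputer remembers the value it learned, so the same object can later be applied to the test set with transform().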

4. EDA: Follow EDA Cheat sheet for that


1. Measure of Central Tendency- Mean, Median, Mode
2. Distribution of Data – using Visualization technique
a. Univariate Analysis
b. Bivariate Analysis
c. Multivariate Analysis

3. Dispersion of Data- min, max, range, variance, standard deviation, coefficient of variation
4. Skewness and Kurtosis
5. Covariance and Correlation
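Most of the measures listed above are one-line pandas calls. The sketch below computes them on made-up data. The DataFrame `toy` and its columns `x` and `y` are hypothetical, and `y` is deliberately set to twice `x` so the correlation is easy to check.

```python
import pandas as pd

# Hypothetical toy dataset; y is exactly 2 * x.
toy = pd.DataFrame({
    "x": [1.0, 2.0, 2.0, 3.0, 12.0],
    "y": [2.0, 4.0, 4.0, 6.0, 24.0],
})

# Central tendency
mean_x = toy["x"].mean()
median_x = toy["x"].median()
mode_x = toy["x"].mode()[0]   # mode() returns a Series; take the first mode

# Dispersion
range_x = toy["x"].max() - toy["x"].min()
var_x = toy["x"].var()        # sample variance (ddof=1)
std_x = toy["x"].std()        # sample standard deviation (ddof=1)
cv_x = std_x / mean_x         # coefficient of variation

# Shape
skew_x = toy["x"].skew()      # positive here: long right tail
kurt_x = toy["x"].kurt()      # excess kurtosis (0 for a normal distribution)

# Relationship between two variables
cov_xy = toy["x"].cov(toy["y"])
corr_xy = toy["x"].corr(toy["y"])   # Pearson correlation

print(mean_x, median_x, mode_x, range_x, cv_x, skew_x, kurt_x, cov_xy, corr_xy)
```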

5. Identify outliers
Use a box plot: points beyond the whiskers (more than 1.5 * IQR from the quartiles) are candidate outliers.
Treatment for Outliers:
q1 = data['column'].quantile(0.25)
q3 = data['column'].quantile(0.75)
iqr = q3 - q1
ul = q3 + 1.5 * iqr
ll = q1 - 1.5 * iqr

1. Drop
data = data[~((data['column'] < ll) | (data['column'] > ul))]
2. Capping
data['column'] = data['column'].clip(lower=ll, upper=ul)  # same result as the nested np.where capping
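The IQR steps above can be wrapped in a small helper so they are easy to reuse across columns. This is a minimal sketch: the function `iqr_bounds` and the DataFrame `toy` are hypothetical names introduced only for illustration.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower limit, upper limit) using the k * IQR rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical toy column with one obvious outlier (100.0).
toy = pd.DataFrame({"column": [10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 100.0]})

ll, ul = iqr_bounds(toy["column"])

# Option 1: drop rows outside [ll, ul].
dropped = toy[toy["column"].between(ll, ul)]

# Option 2: cap values at the limits instead of removing rows.
capped = toy["column"].clip(lower=ll, upper=ul)

print(ll, ul)
print(len(dropped))
print(capped.max())
```

Dropping shrinks the dataset, while capping keeps every row but changes the extreme values, so the choice depends on how much data you can afford to lose.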

6. Data Transformation
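Log Transformation (values must be positive; use np.log1p if zeros are present):
data['column'] = np.log(data['column'])
Box-Cox Transformation (strictly positive values only):
pt = PowerTransformer(method='box-cox')
data['transformed'] = pt.fit_transform(data[['column']]).ravel()
Yeo-Johnson Transformation (also handles zero and negative values):
pt = PowerTransformer(method='yeo-johnson')
data['transformed'] = pt.fit_transform(data[['column']]).ravel()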
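To see what these transformations do, it helps to compare skewness before and after. The sketch below uses a made-up right-skewed column. The DataFrame `toy` and the new column names are hypothetical, and np.log1p is used in place of np.log so that zeros would not break it.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Hypothetical right-skewed, strictly positive column.
toy = pd.DataFrame({"column": [1.0, 2.0, 3.0, 5.0, 8.0, 13.0, 21.0, 120.0]})

# Log transformation; log1p computes log(1 + x).
toy["log"] = np.log1p(toy["column"])

# Box-Cox: requires strictly positive input.
bc = PowerTransformer(method="box-cox")
toy["boxcox"] = bc.fit_transform(toy[["column"]]).ravel()

# Yeo-Johnson: also accepts zero and negative input.
yj = PowerTransformer(method="yeo-johnson")
toy["yeojohnson"] = yj.fit_transform(toy[["column"]]).ravel()

# Skewness closer to 0 means a more symmetric distribution.
print(toy.skew())
```

By default PowerTransformer also standardizes its output to zero mean and unit variance, which is why the transformed columns are centred on 0.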

7. Scaling
Follow EDA Cheat sheet for that
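The cheat sheet itself is not reproduced here. As a rough sketch of two common scaling choices in scikit-learn, assuming a made-up table `toy` with hypothetical columns `age` and `salary`:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical toy dataset with features on very different scales.
toy = pd.DataFrame({
    "age": [20.0, 30.0, 40.0, 50.0],
    "salary": [20000.0, 40000.0, 60000.0, 80000.0],
})

# Standardization: each column gets mean 0 and unit (population) standard deviation.
standardized = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)

# Min-max normalization: each column is rescaled to the range [0, 1].
normalized = pd.DataFrame(MinMaxScaler().fit_transform(toy), columns=toy.columns)

print(standardized)
print(normalized)
```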

8. Encoding
Follow EDA Cheat sheet for that
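The cheat sheet itself is not reproduced here. As one possible sketch, LabelEncoder (imported in step 1) can encode a target column, and pd.get_dummies can one-hot encode a nominal feature. The DataFrame `toy` and its columns `colour` and `label` are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy dataset: one nominal feature and one categorical target.
toy = pd.DataFrame({
    "colour": ["red", "green", "blue", "green"],
    "label": ["yes", "no", "yes", "no"],
})

# Label encoding: classes are sorted, so "no" -> 0 and "yes" -> 1.
le = LabelEncoder()
toy["label_encoded"] = le.fit_transform(toy["label"])

# One-hot encoding: one 0/1 column per category, with no implied order.
dummies = pd.get_dummies(toy["colour"], prefix="colour", dtype=int)

print(toy)
print(dummies)
```

Label encoding assigns integers that imply an order, so for unordered feature categories one-hot encoding is usually the safer choice.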

9. Train-Test Split
Follow EDA Cheat sheet for that
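The cheat sheet itself is not reproduced here. A minimal sketch of a split with scikit-learn follows, assuming made-up arrays `X` and `y`. The 80/20 ratio, the random_state value and the use of stratify are illustrative choices, not requirements.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 20 samples, 2 features, balanced binary target.
X = np.arange(40, dtype=float).reshape(20, 2)
y = np.array([0, 1] * 10)

# Hold out 20% for testing; stratify keeps the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the scaler on the training part only, then apply it to both parts,
# so that no information from the test set leaks into training.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train.shape, X_test.shape)
```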

10. Feature Scaling Explanation


Follow EDA Cheat sheet for that

11. Apply the Algorithm according to target variable
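One way to read this step: a categorical target calls for a classifier, a continuous target for a regressor. The sketch below is only a rough heuristic. The helper `pick_baseline_model` is hypothetical, and the two models are arbitrary baselines chosen for illustration.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression, LogisticRegression

def pick_baseline_model(target):
    """Hypothetical helper: regressor for numeric targets, classifier otherwise.

    Caveat: class labels stored as integers (for example 0/1) look numeric,
    so check the meaning of the target rather than relying on dtype alone.
    """
    is_numeric = pd.api.types.is_numeric_dtype(target)
    is_bool = pd.api.types.is_bool_dtype(target)
    if is_numeric and not is_bool:
        return LinearRegression()
    return LogisticRegression(max_iter=1000)

# Usage on two hypothetical targets.
price = pd.Series([10.5, 12.0, 9.75])
species = pd.Series(["cat", "dog", "cat"])

reg_model = pick_baseline_model(price)
clf_model = pick_baseline_model(species)
print(type(reg_model).__name__, type(clf_model).__name__)
```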

NOTE: Apply the above steps as relevant to your project. If a step is not essential, skip it and proceed to the next one.
