Machine Learning Project Roadmap
import warnings
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer
warnings.filterwarnings('ignore')
data = pd.read_csv('path/to/your/data.csv')
2. Data Exploration
1. Initial Data Inspection: Examine the dataset's shape and columns.
data.head()
data.info()
<.info() also gives a direct count of numeric and categorical variables>
<variables/attributes are columns, records are rows>
5 point summary:
data.describe()
numeric:
<min, max values>
<50th percentile (median)>
<25th and 75th percentiles>
<std, mean>
DYSMECH COMPETENCY SERVICES PVT. LTD. 2
categorical:
data.describe(include='O')
data.describe(include=object)
<unique: number of categories present in the variable>
<top: the most frequent category>
<freq: frequency of the top category>
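A minimal sketch of both summaries on a small made-up table. The column names and values are hypothetical, chosen only so the rows of each summary are easy to read.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
# dtype=object is set explicitly so include="object" selects the text column
# regardless of the pandas version's default string dtype.
data = pd.DataFrame({
    "age": [22, 35, 47, 51, 63],
    "city": pd.Series(["Pune", "Delhi", "Pune", "Mumbai", "Pune"], dtype=object),
})

# Numeric summary rows: count, mean, std, min, 25%, 50%, 75%, max
numeric_summary = data.describe()

# Categorical summary rows: count, unique, top, freq
categorical_summary = data.describe(include="object")

print(numeric_summary.loc[["min", "25%", "50%", "75%", "max"], "age"])
print(categorical_summary["city"])
```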
-data.isnull().sum()
# column-wise count of missing values.
-data.isnull().sum(axis=1)
# count of missing values in each record (row).
2. Impute:
-mean/median for numeric
data[col] = data[col].fillna(data[col].median())   # or .mean()
-mode for categorical
data[col] = data[col].fillna(data[col].mode()[0])
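The missing-value counts and the two imputation rules can be sketched together as follows. The table and its column names ("income", "segment") are hypothetical and exist only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps, used only for illustration.
data = pd.DataFrame({
    "income": [30.0, np.nan, 50.0, 70.0],
    "segment": pd.Series(["A", "B", None, "A"], dtype=object),
})

missing_per_column = data.isnull().sum()      # one count per column
missing_per_row = data.isnull().sum(axis=1)   # one count per record

# Numeric column: fill with the median (swap in .mean() if preferred).
data["income"] = data["income"].fillna(data["income"].median())

# Categorical column: fill with the most frequent category.
data["segment"] = data["segment"].fillna(data["segment"].mode()[0])
```

The median is usually the safer choice when the numeric column is skewed or has outliers, since the mean is pulled toward extreme values.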
1. Drop
data = data[~((data['column'] < ll) | (data['column'] > ul))]
2. Capping
data['column'] = np.where(data['column'] > ul, ul, np.where(data['column'] < ll, ll, data['column']))
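The snippets above use a lower limit `ll` and an upper limit `ul` without showing where they come from. One common choice, assumed here and not stated in this roadmap, is the 1.5 × IQR rule. A sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
data = pd.DataFrame({"column": [10, 12, 11, 13, 12, 95]})

# Assumption: limits come from the 1.5 * IQR rule.
q1 = data["column"].quantile(0.25)
q3 = data["column"].quantile(0.75)
iqr = q3 - q1
ll = q1 - 1.5 * iqr   # lower limit
ul = q3 + 1.5 * iqr   # upper limit

# Option 1: drop rows that fall outside [ll, ul].
dropped = data[~((data["column"] < ll) | (data["column"] > ul))]

# Option 2: cap values at the limits; clip() gives the same result
# as the nested np.where expression.
capped = data["column"].clip(lower=ll, upper=ul)
```

Dropping shrinks the dataset, while capping keeps every record but changes the extreme values; which one fits depends on how many rows are affected and whether the extremes are genuine.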
6. Data Transformation
Log Transformation:
# np.log requires strictly positive values; np.log1p handles zeros
df['column'] = np.log(df['column'])
Box-Cox Transformation:
pt = PowerTransformer(method='box-cox')   # requires strictly positive values
df['transformed'] = pt.fit_transform(df[['column']]).ravel()
Yeo-Johnson Transformation:
pt = PowerTransformer(method='yeo-johnson')   # also accepts zeros and negative values
df['transformed'] = pt.fit_transform(df[['column']]).ravel()
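To compare the three transformations side by side, a sketch along these lines may help. The data is a hypothetical right-skewed column, and the output column names ("log", "box_cox", "yeo_johnson") are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Hypothetical right-skewed, strictly positive data, used only for illustration.
df = pd.DataFrame({"column": [1.0, 2.0, 3.0, 5.0, 8.0, 40.0, 120.0]})

# Log transformation: only valid for strictly positive values.
df["log"] = np.log(df["column"])

# Box-Cox: also requires strictly positive input.
box_cox = PowerTransformer(method="box-cox")
df["box_cox"] = box_cox.fit_transform(df[["column"]]).ravel()

# Yeo-Johnson: works with zeros and negative values as well.
yeo_johnson = PowerTransformer(method="yeo-johnson")
df["yeo_johnson"] = yeo_johnson.fit_transform(df[["column"]]).ravel()

# Skewness closer to 0 indicates a more symmetric distribution.
print(df.skew())
```

Note that `PowerTransformer` standardizes its output by default (zero mean, unit variance), so the transformed columns are not on the same scale as the plain log column.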
7. Scaling
See the EDA cheat sheet for scaling.
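For quick reference, a minimal scaling sketch. This is not taken from the EDA cheat sheet; it assumes scikit-learn's `StandardScaler` and `MinMaxScaler` as two common options, applied to a hypothetical column.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical toy data, used only for illustration.
data = pd.DataFrame({"column": [10.0, 20.0, 30.0, 40.0]})

# Standardization: zero mean, unit variance.
data["standardized"] = StandardScaler().fit_transform(data[["column"]]).ravel()

# Min-max scaling: rescales values to the [0, 1] range.
data["min_max"] = MinMaxScaler().fit_transform(data[["column"]]).ravel()
```

In a real project the scaler is usually fit on the training split only and then applied to the test split, so that test data does not leak into the fitted parameters.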
8. Encoding
See the EDA cheat sheet for encoding.
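For quick reference, a minimal encoding sketch. This is not taken from the EDA cheat sheet; it assumes one-hot encoding via `pd.get_dummies` for unordered categories and an explicit mapping for ordered ones. The columns and the `size_order` mapping are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
data = pd.DataFrame({
    "city": ["Pune", "Delhi", "Pune", "Mumbai"],
    "size": ["S", "L", "M", "S"],
})

# One-hot encoding for a nominal (unordered) variable.
one_hot = pd.get_dummies(data, columns=["city"], dtype=int)

# Ordinal encoding for an ordered variable, using an explicit mapping.
size_order = {"S": 0, "M": 1, "L": 2}
data["size_encoded"] = data["size"].map(size_order)
```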
9. Train-Test Split
See the EDA cheat sheet for the train-test split.
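For quick reference, a minimal split sketch. This is not taken from the EDA cheat sheet; it assumes scikit-learn's `train_test_split` with an 80/20 split, and the "feature" and "target" columns are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data, used only for illustration.
data = pd.DataFrame({"feature": range(10), "target": [0, 1] * 5})

X = data.drop(columns="target")   # predictors
y = data["target"]                # label

# 80/20 split; stratify keeps the class proportions equal in both parts,
# and random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```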