Ads Exp2 C35
TUS3F202135
Mumbai University
TPCT’s, TERNA ENGINEERING COLLEGE (TEC), NAVI MUMBAI
PART A
(PART A: TO BE REFERRED BY STUDENTS)
Experiment No.02
A.1 Aim:
To implement data cleaning techniques (Data Imputation through mean, median and
mode).
A.2 Prerequisite:
Knowledge of Python, Dataset (Kaggle).
A.3 Outcome:
After successful completion of this experiment, students will be able to obtain a clean dataset.
A.4 Theory:
Introduction:
Data cleaning is one of the important parts of machine learning. It plays a significant part in building
a model. It surely isn’t the fanciest part of machine learning and at the same time, there aren’t any
hidden tricks or secrets to uncover. However, the success or failure of a project relies on proper data
cleaning. Professional data scientists usually invest a very large portion of their time in this step
because of the belief that “Better data beats fancier algorithms”. If we have a well-cleaned dataset, there is a good chance that we can achieve good results even with simple algorithms, which can prove very beneficial, especially in terms of computation, when the dataset is large.
Obviously, different types of data will require different types of cleaning. However, this
systematic approach can always serve as a good starting point.
Steps involved in Data Cleaning:
1. Removal of unwanted observations
Irrelevant observations are any type of data that is of no use to us and can be removed directly, as shown in the sketch below.
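A minimal sketch of this step in pandas, assuming a hypothetical data.csv in which an 'employee_id' column carries no useful information for the problem:

import pandas as pd

data = pd.read_csv('data.csv')

# Drop a feature that is irrelevant to the problem (hypothetical column name)
data = data.drop(columns=['employee_id'], errors='ignore')

# Drop exact duplicate rows, which most often arise during data collection
data = data.drop_duplicates()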
2. Fixing Structural errors
The errors that arise during measurement, transfer of data, or other similar situations are called
structural errors. Structural errors include typos in the name of features, the same attribute with a
different name, mislabeled classes, i.e. separate classes that should really be the same, or
inconsistent capitalization.
For example, the model will treat “America” and “america” as different classes or values even though they represent the same value, or treat red, yellow, and red-yellow as different classes or attributes even though one class can be included in the other two. So, these are some structural errors that make our model inefficient and give poor-quality results.
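A minimal sketch of fixing such inconsistencies in pandas; the 'country' column and its raw values here are assumed purely for illustration:

import pandas as pd

# Hypothetical 'country' column with inconsistent capitalization and stray spaces
data = pd.DataFrame({'country': ['America', 'america', ' AMERICA ', 'India']})

# Normalise whitespace and case so the same label is not treated as separate classes
data['country'] = data['country'].str.strip().str.lower()
print(data['country'].unique())   # ['america' 'india']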
3. Managing Unwanted outliers
Outliers can cause problems with certain types of models. For example, linear regression
models are less robust to outliers than decision tree models. Generally, we should not remove outliers unless we have a legitimate reason to remove them. Sometimes removing them improves performance, sometimes it does not. So, one must have a good reason to remove an outlier, such as suspicious measurements that are unlikely to be part of real data.
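A minimal sketch of one common way to flag such values, using the 1.5 × IQR rule on the hypothetical 'age' column of data.csv (both the rule and the column name are assumptions for illustration):

import pandas as pd

data = pd.read_csv('data.csv')

# Compute the interquartile range of the 'age' column
q1 = data['age'].quantile(0.25)
q3 = data['age'].quantile(0.75)
iqr = q3 - q1

# Keep only rows whose 'age' lies inside the 1.5 * IQR fences; drop the rest
# only if there is a legitimate reason to treat them as bad measurements
data = data[data['age'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]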
4. Handling missing data
Missing data is a deceptively tricky issue in machine learning. We cannot just ignore or remove the missing observations; they must be handled carefully, as they can be an indication of something important. The two most common ways to deal with missing data are:
• Dropping observations with missing values.
The fact that the value was missing may be informative in itself. Plus, in the real world, you often need to make predictions on new data even if some of the features are missing!
• Imputing the missing values from past observations.
Again, “missingness” is almost always informative in itself, and you should tell your algorithm if a value was missing. Even if you build a model to impute your values, you’re not adding any real information; you’re just reinforcing the patterns already provided by other features.
Missing data is like a missing puzzle piece. If you drop it, that’s like pretending the puzzle slot isn’t there. If you impute it, that’s like trying to squeeze in a piece from somewhere else in the puzzle.
So, missing data is almost always informative and an indication of something important. We must make our algorithm aware of missing data by flagging it. By using this technique of flagging and filling, you are essentially allowing the algorithm to estimate the optimal constant for missingness, instead of just filling it in with the mean. A minimal sketch of this flag-and-fill idea is given below.
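A minimal sketch of flagging and filling in pandas, again assuming the hypothetical data.csv with a numeric 'age' column:

import pandas as pd

data = pd.read_csv('data.csv')

# Flag missingness so the algorithm can learn from the fact that a value was absent
data['age_missing'] = data['age'].isna().astype(int)

# Then fill the gap (here with the mean) so the column has no holes left
data['age'] = data['age'].fillna(data['age'].mean())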
Data-wrangling tools such as Trifacta Wrangler can also be used to carry out many of these cleaning steps interactively.
PART B
(PART B: TO BE COMPLETED BY STUDENTS)
import pandas as pd

# Load the dataset (assumes a local data.csv with a numeric 'age' column)
data = pd.read_csv('data.csv')

# Impute missing values in the 'age' column with the column mean
data['age'] = data['age'].fillna(data['age'].mean())
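The aim also calls for imputation through the median and the mode. A minimal sketch of both, continuing with the same DataFrame; the choice of 'age' for the median and 'Department' (the categorical column encoded further below) for the mode is illustrative:

# Median imputation is preferred when the numeric column is skewed or has outliers
data['age'] = data['age'].fillna(data['age'].median())

# Mode (most frequent value) imputation is the usual choice for categorical columns
data['Department'] = data['Department'].fillna(data['Department'].mode()[0])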
from sklearn.preprocessing import StandardScaler

# Standardize the numeric features (zero mean, unit variance);
# assumes missing values have already been imputed as above
numeric_columns = data.select_dtypes(include='number').columns
scaler = StandardScaler()
X_scaled = scaler.fit_transform(data[numeric_columns])

# Replace the original numeric features with the scaled features in the DataFrame
data[numeric_columns] = X_scaled
import pandas as pd

# One-hot encode the categorical 'Department' column
data = pd.read_csv('data.csv')
data = pd.get_dummies(data, columns=['Department'])
import pandas as pd

# Remove exact duplicate rows from the dataset
data = pd.read_csv('data.csv')
data = data.drop_duplicates()
B.4 Conclusion:
Hence, we successfully implemented data cleaning techniques (Data Imputation through mean,
median and mode).