
Sarvesh Patil C42

TUS3F202135
Mumbai University
TPCT’s, TERNA ENGINEERING COLLEGE (TEC), NAVI MUMBAI
PART A
(PART A: TO BE REFERRED BY STUDENTS)
Experiment No.02

A.1 Aim:
To implement data cleaning techniques (Data Imputation through mean, median and
mode).
A.2 Prerequisite:
Knowledge of Python, Dataset (Kaggle).
A.3 Outcome:
After successful completion of this experiment, students will be able to produce a clean
dataset.
A.4 Theory:
Introduction:
Data cleaning is one of the important parts of machine learning and plays a significant role in
building a model. It is not the fanciest part of machine learning, and there are no hidden tricks or
secrets to uncover; however, the success or failure of a project relies on proper data cleaning.
Professional data scientists usually invest a very large portion of their time in this step because of
the belief that "better data beats fancier algorithms". With a well-cleaned dataset, even simple
algorithms can achieve good results, which is especially beneficial in terms of computation when
the dataset is large.
Different types of data will, of course, require different types of cleaning. However, the following
systematic approach can always serve as a good starting point.
Steps involved in Data Cleaning:

1. Removal of unwanted observations

This includes deleting duplicate, redundant, or irrelevant values from your dataset. Duplicate
observations most frequently arise during data collection, while irrelevant observations are those
that do not actually fit the specific problem you are trying to solve.
Redundant observations hurt efficiency to a great extent: because the same data repeats, it can
tip results towards either the correct or the incorrect side, producing unreliable conclusions.

Irrelevant observations are any type of data that is of no use to us and can be removed directly, as in the sketch below.
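
A minimal pandas sketch of this step; the 'remarks' column is a hypothetical example of an irrelevant feature, used only for illustration:

import pandas as pd

data = pd.read_csv('data.csv')

# Remove exact duplicate rows that slipped in during collection
data = data.drop_duplicates()

# Drop a hypothetical column that is irrelevant to the problem at hand
data = data.drop(columns=['remarks'])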
2. Fixing structural errors
The errors that arise during measurement, transfer of data, or other similar situations are called
structural errors. They include typos in feature names, the same attribute appearing under
different names, mislabeled classes (i.e. separate classes that should really be the same), and
inconsistent capitalization.
For example, a model will treat "America" and "america" as different classes or values even
though they represent the same value, or treat red, yellow, and red-yellow as three separate
classes even though one can be included in the other two. Structural errors like these make our
model inefficient and give poor-quality results; a short sketch of fixing them follows.
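
A minimal sketch of fixing such errors; the 'country' and 'color' column names are assumptions made for illustration:

import pandas as pd

data = pd.read_csv('data.csv')

# Normalize inconsistent capitalization: 'america', 'AMERICA' -> 'America'
data['country'] = data['country'].str.strip().str.title()

# Merge a mislabeled class into an existing one; mapping 'red-yellow'
# onto 'red' is one possible policy, chosen here only as an example
data['color'] = data['color'].replace({'red-yellow': 'red'})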
3. Managing unwanted outliers
Outliers can cause problems with certain types of models. For example, linear regression
models are less robust to outliers than decision tree models. Generally, we should not remove
outliers unless we have a legitimate reason: sometimes removing them improves performance,
sometimes not. A good reason might be suspicious measurements that are unlikely to be part of
real data. One common way to flag candidates for inspection is the interquartile-range (IQR)
rule, sketched below.
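
A minimal sketch of IQR-based flagging (the rule itself is an assumption of this example, not something prescribed above), using the 'age' column:

import pandas as pd

data = pd.read_csv('data.csv')

# Flag values outside 1.5 * IQR of 'age' for manual review
q1, q3 = data['age'].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = data[(data['age'] < lower) | (data['age'] > upper)]
print(outliers)  # inspect before deciding whether to drop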
4. Handling missing data
Missing data is a deceptively tricky issue in machine learning. We cannot simply ignore or
remove missing observations; they must be handled carefully, as they can be an indication of
something important. The two most common ways to deal with missing data are:

Dropping observations with missing values. The fact that a value was missing may be
informative in itself, and in the real world you often need to make predictions on new data
even when some of the features are missing, so dropping rows discards useful signal.

Imputing the missing values from other observations. Again, "missingness" is almost always
informative in itself, and you should tell your algorithm if a value was missing. Even if you
build a model to impute your values, you are not adding any real information; you are just
reinforcing the patterns already provided by other features.

Missing data is like a missing puzzle piece: if you drop it, that is like pretending the puzzle slot
is not there; if you impute it, that is like squeezing in a piece from somewhere else in the puzzle.
So missing data is always informative and an indication of something important, and we must
make our algorithm aware of it by flagging it. By using this technique of flagging and filling,
you essentially allow the algorithm to estimate the optimal constant for missingness, instead of
just filling it in with the mean.
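
A minimal sketch of this flag-and-fill technique, assuming a numeric 'age' column:

import pandas as pd

data = pd.read_csv('data.csv')

# Flag missingness so the algorithm can learn from it
data['age_missing'] = data['age'].isna().astype(int)

# Fill the hole with a constant; with the flag present, the model can
# estimate the optimal treatment of missing values itself
data['age'] = data['age'].fillna(0)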

Some data cleansing tools:

• OpenRefine
• Trifacta Wrangler

PART B
(PART B: TO BE COMPLETED BY STUDENTS)

Roll No: BE-C42 Name: Sarvesh Sandeep Patil


Class: BE-Comps Batch: C3
Date of Experiment: 16/01/2024 Date of Submission: 16/01/2024
Grade:

B.1 Software Code written by student:


import pandas as pd

# Drop every row that contains at least one missing value
data = pd.read_csv('data.csv')
data = data.dropna()

import pandas as pd

# Impute missing ages with the column mean; plain assignment avoids
# the chained-assignment pitfall of inplace=True on a column
data = pd.read_csv('data.csv')
data['age'] = data['age'].fillna(data['age'].mean())
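
The aim also covers imputation through the median and mode; a minimal sketch of both, reusing the 'age' and 'Department' columns from the snippets above:

import pandas as pd

data = pd.read_csv('data.csv')

# Median imputation is more robust to outliers than the mean
data['age'] = data['age'].fillna(data['age'].median())

# Mode imputation suits categorical columns; mode() can return
# several values, so take the first
data['Department'] = data['Department'].fillna(data['Department'].mode()[0])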

from sklearn.preprocessing import RobustScaler
import pandas as pd

# Load your data
data = pd.read_csv('data.csv')

# Identify numeric columns for scaling
numeric_columns = data.select_dtypes(include=['number']).columns

# Select the features (X) you want to scale
X = data[numeric_columns]

# Instantiate the RobustScaler
scaler = RobustScaler()

# Fit and transform the selected features
X_scaled = scaler.fit_transform(X)

# Replace the original numeric features with the scaled features in the DataFrame
data[numeric_columns] = X_scaled

# Now 'data' contains the scaled values in numeric columns

import pandas as pd

# One-hot encode the categorical 'Department' column
data = pd.read_csv('data.csv')
data = pd.get_dummies(data, columns=['Department'])

import pandas as pd

# Remove exact duplicate rows
data = pd.read_csv('data.csv')
data = data.drop_duplicates()

import pandas as pd
from sklearn.preprocessing import StandardScaler

# read the data into a pandas dataframe
df = pd.read_csv("data.csv")

# create a feature matrix and target vector
X = df.drop(["id", "Date of Joining"], axis=1)
y = df["Salary"]

# scale the numerical features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X[["age", "Salary"]])

# concatenate the scaled features with the categorical features;
# index=X.index keeps the rows aligned during the concat
gender_dummies = pd.get_dummies(X["Gender"], prefix="Gender")
X_processed = pd.concat(
    [gender_dummies,
     pd.DataFrame(X_scaled, columns=["age", "Salary"], index=X.index)],
    axis=1,
)
print(X_processed)

B.2 Input and Output:

B.3 Observations and learning:


Data cleaning is a crucial step in the data preprocessing pipeline that ensures the quality and
reliability of the data before analysis or model building. Real-world datasets often contain missing
values, outliers, and inconsistencies that can adversely affect the accuracy and effectiveness of any
downstream analysis.

B.4 Conclusion:
Hence, we successfully implemented data cleaning techniques (Data Imputation through mean,
median and mode).

B.5 Questions of Curiosity (Handwritten, any 3)


Q1: What is data cleaning? How can it be done in Python?
Q2: When using Python to clean a dataset, what are some of the common issues that arise, and
how do you deal with them?
Q3: What are missing values? How do you handle them?
Q4: How can data cleaning contribute to improving overall data quality and reliability for
decision-making purposes?
Q5: What challenges or ethical considerations might arise when handling missing data, and how
can these be addressed during the data cleaning process?
Q6: Explain the potential impact of data cleaning on the outcomes of statistical analysis and
machine learning models.
