Lab 3 DWM
Data Integration
Introduction
Cleaning Data
Data cleaning, or data cleansing, is very important from the perspective of building intelligent automated
systems. Data cleansing is a preprocessing step that improves data validity, accuracy, completeness,
consistency, and uniformity. It is essential for building reliable machine learning models that produce
good results; otherwise, no matter how good the model is, its results cannot be trusted. In short, data
cleaning means fixing bad data in your data set. Bad data could be:
1. Empty cells
2. Data in wrong format
3. Wrong data
4. Duplicates
The dataset that we are going to use is ‘rawdata.csv’. It has the following characteristics:
The data set contains some empty cells ("Date" in row 22, and "Calories" in rows 18 and 28).
Task: Load and view the provided dataset after importing the required libraries.
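A minimal sketch of this step (the DataFrame name df is our choice; the file name comes from the lab):

import pandas as pd

df = pd.read_csv('rawdata.csv')   # load the raw data set
print(df.to_string())             # to_string() prints the whole DataFrame, not just a preview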
Empty cells can potentially give a wrong result when analyzing data, so to deal with them we will
perform the following operations:
a. Remove Rows
One way to deal with empty cells is to remove the rows that contain them, using the dropna() method.
Since data sets can be very big, removing a few rows will usually not have a big impact on the result.
By default, the dropna() method returns a new DataFrame, and will not change the original. If you want
to change the original DataFrame, use the inplace = True argument.
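For example, a sketch assuming the DataFrame from the loading step is named df:

new_df = df.dropna()        # returns a new DataFrame with the empty-cell rows removed
df.dropna(inplace = True)   # alternatively, drop the rows in the original DataFrame itself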
b. Replace Empty Values
Another way of dealing with empty cells is to insert a new value instead, using the fillna() method. This
way you do not have to delete entire rows just because of some empty cells.
In the above method, we replace all empty cells in the whole DataFrame. To only replace empty
values in one column, specify the column name of the DataFrame, as shown in the sketch below.
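A sketch of both variants (the replacement value 130 is an arbitrary placeholder):

df = df.fillna(130)                           # replace every empty cell in the DataFrame with 130
df["Calories"] = df["Calories"].fillna(130)   # replace empty cells in the 'Calories' column only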
A common way to replace empty cells is to calculate the mean, median, or mode value of the column.
Pandas provides the mean(), median(), and mode() methods to calculate the respective value for a
specified column:
i. Mean:
Mean = the average value (the sum of all values divided by number of values).
ii. Median:
Median = the value in the middle, after you have sorted all values ascending.
iii. Mode:
Mode = the value that appears most frequently.
Tasks:
1. Calculate the Mean of ‘Calories’ and replace the missing values with it.
2. Calculate the Median of ‘Maxpulse’ and replace the missing values with it.
3. Calculate the mode of ‘Pulse’ and replace the missing values with it.
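One possible solution sketch for these tasks (column names are taken from the data set above; if a
column happens to have no empty cells, fillna() simply leaves it unchanged):

x = df["Calories"].mean()                  # Task 1: mean of 'Calories'
df["Calories"] = df["Calories"].fillna(x)

x = df["Maxpulse"].median()                # Task 2: median of 'Maxpulse'
df["Maxpulse"] = df["Maxpulse"].fillna(x)

x = df["Pulse"].mode()[0]                  # Task 3: mode() returns a Series, so take its first entry
df["Pulse"] = df["Pulse"].fillna(x)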
Data in Wrong Format
a) Converting Into a Correct Format
In our DataFrame, we have two cells with the wrong format. Check out rows 22 and 26: the 'Date' column
should be a string that represents a date.
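A sketch of the conversion using pandas' to_datetime() (depending on your pandas version and the mix
of date formats, you may need to pass a format argument such as format='mixed'):

df['Date'] = pd.to_datetime(df['Date'])   # cells that cannot be parsed, including empty ones, become NaT
print(df.to_string())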
You will see that the date in row 26 was fixed by the conversion, but the empty date in row 22 got a
NaT (Not a Time) value, in other words an empty value. One way to deal with empty values is simply
removing the entire row.
b) Removing Rows
The conversion in the example above gave us a NaT value, which can be handled as a NULL value, so
we can remove the row by using the dropna() method.
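For example:

df.dropna(subset = ['Date'], inplace = True)   # drop only the rows where 'Date' is NaT/empty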
"Wrong data" does not have to be "empty cells" or "wrong format", it can just be wrong, like if someone
registered "199" instead of "1.99". Sometimes you can spot wrong data by looking at the data set,
because you have an expectation of what it should be.
In our data set, you can see that in row 7 the duration is 450, but for all the other rows the duration is
between 30 and 60. It does not have to be wrong, but considering that this is the data set of
someone's workout sessions, we can conclude that this person did not work out for 450 minutes.
a) Replacing Values
One way to fix wrong values is to replace them with something else. In our test data, it is most likely a
typo, and the value should be "45" instead of "450".
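A sketch of the direct fix, using the row index 7 mentioned above:

df.loc[7, 'Duration'] = 45   # overwrite the wrong value in row 7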
For small data sets you might be able to replace the wrong data one by one, but not for big data sets. To
replace wrong data for larger data sets you can create some rules, e.g. set some boundaries for legal
values, and replace any values that are outside of the boundaries.
Task: Loop through all values in the ‘Duration’ column. If the value is higher than 120, set it to 120.
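One possible solution sketch, with 120 as the boundary for legal values:

for i in df.index:                    # loop over every row label
    if df.loc[i, "Duration"] > 120:   # rule: values above 120 are considered wrong
        df.loc[i, "Duration"] = 120   # cap them at the boundary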
b) Removing Rows
Another way of handling wrong data is to remove the rows that contain wrong data. This way you do
not have to find out what to replace them with, and there is a good chance you do not need them for
your analyses.
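A sketch using the same boundary rule as above, but dropping the offending rows instead:

for i in df.index:
    if df.loc[i, "Duration"] > 120:
        df.drop(i, inplace = True)   # remove the whole row instead of fixing the value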
Duplicates
Duplicate rows are rows that have been registered more than one time. By looking at our data set, we
can assume that rows 11 and 12 are duplicates.
To discover duplicates, we can use the duplicated() method. The duplicated() method returns a Boolean
value for each row:
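For example, together with its companion method drop_duplicates():

print(df.duplicated())               # prints True for every row that repeats an earlier row
df.drop_duplicates(inplace = True)   # remove the duplicated rows from the DataFrame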
Scale Features
When your data has different values, and even different measurement units, it can be difficult to
compare them. What is kilograms compared to meters? Or altitude compared to time?
The answer to this problem is scaling. We can scale data into new values that are easier to compare.
It can be difficult to compare the volume 1.0 with the weight 790, but if we scale them both into
comparable values, we can easily see how much one value is compared to the other.
There are different methods for scaling data; in this tutorial we will use a method called standardization:
z = (x - u) / s
Where z is the new value, x is the original value, u is the mean and s is the standard deviation.
You do not have to do this manually: the Python sklearn module provides the StandardScaler class
(in sklearn.preprocessing), whose objects have methods for transforming data sets.
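A sketch, assuming a DataFrame df with numeric 'Weight' and 'Volume' columns like the values
compared above:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()                                  # standardizes each column to mean 0, std 1
scaledX = scaler.fit_transform(df[['Weight', 'Volume']])
print(scaledX)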
Task1: Impute missing values in the Age column using the median.
Task2: Fill missing values in the Embarked column with the most frequent value.
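A possible solution sketch, assuming the columns live in a DataFrame called df:

df['Age'] = df['Age'].fillna(df['Age'].median())                   # Task 1: median imputation
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])   # Task 2: most frequent value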
Note: Here x contains the features (independent variables) and y contains the dependent variable
(label).
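The note assumes the data has already been split into features and a label, and then into training and
test sets. A minimal sketch of such a split (the label column name 'Survived' is our assumption, not
given in the lab):

from sklearn.model_selection import train_test_split

x = df.drop('Survived', axis = 1)   # features (all columns except the assumed label)
y = df['Survived']                  # label
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 42)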
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()      # create the model
lr.fit(x_train, y_train)       # train on the training split
y_pred = lr.predict(x_test)    # predict labels for the test split