Lab 3 DWM
Data Integration
Introduction
Cleaning Data
Data cleaning, or data cleansing, is very important from the perspective of building intelligent automated
systems. Data cleansing is a preprocessing step that improves data validity, accuracy, completeness,
consistency, and uniformity. It is essential for building reliable machine learning models that produce
good results; otherwise, no matter how good the model is, its results cannot be trusted. In short, data
cleaning means fixing bad data in your data set. Bad data could be:
1. Empty cells
2. Data in wrong format
3. Wrong data
4. Duplicates
The dataset that we are going to use is ‘rawdata.csv’. It has the following characteristics:
The data set contains some empty cells ("Date" in row 22, and "Calories" in rows 18 and 28).
Task: Load and view the provided dataset after importing the required libraries.
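A minimal sketch of this step (the DataFrame name df is our choice; the file name comes from the lab):

import pandas as pd

df = pd.read_csv('rawdata.csv')   # load the raw data set
print(df.to_string())             # to_string() prints the whole DataFrame, not just a preview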
Empty cells can potentially give a wrong result when analyzing data, so to deal with them we will
perform the following operations:
a. Remove Rows
One way to deal with empty cells is to remove the rows that contain them, using the dropna() method.
Since data sets can be very big, removing a few rows will usually not have a big impact on the result.
By default, the dropna() method returns a new DataFrame, and will not change the original. If you want
to change the original DataFrame, use the inplace = True argument.
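For example, a sketch assuming the DataFrame from the loading step is named df:

new_df = df.dropna()        # returns a new DataFrame with the empty-cell rows removed
df.dropna(inplace = True)   # alternatively, drop the rows in the original DataFrame itself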
b. Replace Empty Values
Another way of dealing with empty cells is to insert a new value instead, using the fillna() method. This
way you do not have to delete entire rows just because of some empty cells.
In the above method, we replace all empty cells in the whole DataFrame. To only replace empty
values in one column, specify the column name of the DataFrame, as shown in the sketch below.
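A sketch of both variants (the replacement value 130 is an arbitrary placeholder):

df = df.fillna(130)                           # replace every empty cell in the DataFrame with 130
df["Calories"] = df["Calories"].fillna(130)   # replace empty cells in the 'Calories' column only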
A common way to replace empty cells is to calculate the mean, median, or mode value of the column.
Pandas provides the mean(), median(), and mode() methods to calculate the respective value for a
specified column:
i. Mean:
Mean = the average value (the sum of all values divided by number of values).
ii. Median:
Median = the value in the middle, after you have sorted all values ascending.
iii. Mode:
Mode = the value that appears most frequently.
Tasks:
1. Calculate the Mean of ‘Calories’ and replace the missing values with it.
2. Calculate the Median of ‘Maxpulse’ and replace the missing values with it.
3. Calculate the mode of ‘Pulse’ and replace the missing values with it.
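One possible solution sketch for these tasks (column names are taken from the data set above; if a
column happens to have no empty cells, fillna() simply leaves it unchanged):

x = df["Calories"].mean()                  # Task 1: mean of 'Calories'
df["Calories"] = df["Calories"].fillna(x)

x = df["Maxpulse"].median()                # Task 2: median of 'Maxpulse'
df["Maxpulse"] = df["Maxpulse"].fillna(x)

x = df["Pulse"].mode()[0]                  # Task 3: mode() returns a Series, so take its first entry
df["Pulse"] = df["Pulse"].fillna(x)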
Data in Wrong Format
a) Converting Into a Correct Format
In our DataFrame, we have two cells with the wrong format. Check out rows 22 and 26: the 'Date' column
should be a string that represents a date.
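A sketch of the conversion using pandas' to_datetime() (depending on your pandas version and the mix
of date formats, you may need to pass a format argument such as format='mixed'):

df['Date'] = pd.to_datetime(df['Date'])   # cells that cannot be parsed, including empty ones, become NaT
print(df.to_string())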
You will see that the date in row 26 was fixed by the conversion, but the empty date in row 22 got a
NaT (Not a Time) value, in other words an empty value. One way to deal with empty values is simply
removing the entire row.
b) Removing Rows
The conversion in the example above gave us a NaT value, which can be handled as a NULL value, so
we can remove the row by using the dropna() method.
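For example:

df.dropna(subset = ['Date'], inplace = True)   # drop only the rows where 'Date' is NaT/empty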
"Wrong data" does not have to be "empty cells" or "wrong format", it can just be wrong, like if someone
registered "199" instead of "1.99". Sometimes you can spot wrong data by looking at the data set,
because you have an expectation of what it should be.
In our data set, you can see that in row 7 the duration is 450, but for all the other rows the duration is
between 30 and 60. It does not have to be wrong, but considering that this is the data set of
someone's workout sessions, we can conclude that this person did not work out for 450 minutes.
a) Replacing Values
One way to fix wrong values is to replace them with something else. In our test data, it is most likely a
typo, and the value should be "45" instead of "450".
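A sketch of the direct fix, using the row index 7 mentioned above:

df.loc[7, 'Duration'] = 45   # overwrite the wrong value in row 7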
For small data sets you might be able to replace the wrong data one by one, but not for big data sets. To
replace wrong data for larger data sets you can create some rules, e.g. set some boundaries for legal
values, and replace any values that are outside of the boundaries.
Task: Loop through all values in the ‘Duration’ column. If the value is higher than 120, set it to 120.
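One possible solution sketch, with 120 as the boundary for legal values:

for i in df.index:                    # loop over every row label
    if df.loc[i, "Duration"] > 120:   # rule: values above 120 are considered wrong
        df.loc[i, "Duration"] = 120   # cap them at the boundary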
b) Removing Rows
Another way of handling wrong data is to remove the rows that contain wrong data. This way you do
not have to find out what to replace them with, and there is a good chance you do not need them for
your analyses.
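A sketch using the same boundary rule as above, but dropping the offending rows instead:

for i in df.index:
    if df.loc[i, "Duration"] > 120:
        df.drop(i, inplace = True)   # remove the whole row instead of fixing the value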
Duplicates
Duplicate rows are rows that have been registered more than one time. By looking at our data set, we
can assume that rows 11 and 12 are duplicates.
To discover duplicates, we can use the duplicated() method. The duplicated() method returns a Boolean
value for each row:
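For example, together with its companion method drop_duplicates():

print(df.duplicated())               # prints True for every row that repeats an earlier row
df.drop_duplicates(inplace = True)   # remove the duplicated rows from the DataFrame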
Scale Features
When your data has different values, and even different measurement units, it can be difficult to
compare them. What is kilograms compared to meters? Or altitude compared to time?
The answer to this problem is scaling. We can scale data into new values that are easier to compare.
It can be difficult to compare the volume 1.0 with the weight 790, but if we scale them both into
comparable values, we can easily see how much one value is compared to the other.
There are different methods for scaling data; in this tutorial we will use a method called standardization:
z = (x - u) / s
Where z is the new value, x is the original value, u is the mean and s is the standard deviation.
You do not have to do this manually: the Python sklearn module provides the StandardScaler class
(in sklearn.preprocessing), whose objects have methods for transforming data sets.
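A sketch, assuming a DataFrame df with numeric 'Weight' and 'Volume' columns like the values
compared above:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()                                  # standardizes each column to mean 0, std 1
scaledX = scaler.fit_transform(df[['Weight', 'Volume']])
print(scaledX)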
Task1: Impute missing values in the Age column using the median.
Task2: Fill missing values in the Embarked column with the most frequent value.
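A possible solution sketch, assuming the columns live in a DataFrame called df:

df['Age'] = df['Age'].fillna(df['Age'].median())                   # Task 1: median imputation
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])   # Task 2: most frequent value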
Note: Here x contains the features (independent variables) and y contains the dependent variable
(label).
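The note assumes the data has already been split into features and a label, and then into training and
test sets. A minimal sketch of such a split (the label column name 'Survived' is our assumption, not
given in the lab):

from sklearn.model_selection import train_test_split

x = df.drop('Survived', axis = 1)   # features (all columns except the assumed label)
y = df['Survived']                  # label
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 42)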
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()      # create the model
lr.fit(x_train, y_train)       # train on the training split
y_pred = lr.predict(x_test)    # predict labels for the test split