Preprocessing techniques
Basics of ML Project Life Cycle:
https://www.analyticsvidhya.com/blog/2020/09/10-things-know-before-first-data-science-project/
What is preprocessing?
The dataset initially provided for training might not be in a ready-to-use state; for example, it might not be formatted
properly, or it may contain missing or null values.
Solving all of these problems using various methods is called Data Preprocessing.
A properly processed dataset increases the efficiency and accuracy of the models.
Steps in Data Preprocessing:
Data Preprocessing
Data Preprocessing is a technique that is used to convert the raw data into a clean data set.
Pre-processing refers to the transformations applied to our data before feeding it to the algorithm.
A dataset consists of data objects, each described by a number of features that capture the basic characteristics of the object.
• Fill the missing value manually: entering values manually is time consuming.
• Fill the missing value with a global constant: replace every missing value of the attribute with the same constant, such as the label "unknown".
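A minimal sketch of the global-constant strategy using pandas fillna() (the data below is made up purely for illustration):

import pandas as pd

# Illustrative data only: a tiny frame with missing values
df = pd.DataFrame({"Age": [25, None, 31], "City": ["Pune", "Delhi", None]})

# Fill every missing value of the attribute with the same constant label
df["City"] = df["City"].fillna("unknown")
print(df)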
● To fix rows with missing or wrongly formatted values, you have two options: remove the rows, or convert all cells in the affected columns into the same format.
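The snippets below operate on a DataFrame called health_data; a minimal, assumed load step (the file name is hypothetical, as the slides do not show it) would be:

import pandas as pd

# Assumed file name; the slides do not show the load step
health_data = pd.read_csv("health_data.csv", header=0, sep=",")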
Use the dropna() function to remove the NaNs. axis=0 means that we want to remove all rows that have a NaN value:

print(health_data)
health_data.dropna(axis=0, inplace=True)
print(health_data)
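Note that with inplace=True the DataFrame is modified in place; without it, dropna() returns a new DataFrame that you would assign back, e.g. health_data = health_data.dropna(axis=0).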
Data Categories
To analyze data, we also need to know the types of data we are dealing with.
Data Types
We can use the info() function to list the data types within our data set:
print(health_data.info())
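info() lists each column's name, its non-null count, and its data type; columns that were read in as text appear with the dtype object.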
We cannot perform calculations or analysis on columns of type object. We must convert the type
object to float64 (float64 is a 64-bit floating-point number, i.e. a number with a decimal point).
We can use the astype() function to convert the data into float64.
Convert "Average_Pulse" and "Max_Pulse" into data type float64 (the other variables are
already of data type float64):
health_data["Average_Pulse"] = health_data['Average_Pulse'].astype(float)
health_data["Max_Pulse"] = health_data["Max_Pulse"].astype(float)
print (health_data.info())
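If a cell in these columns contains text that cannot be interpreted as a number, astype(float) will raise an error, so such cells must be cleaned or removed first (see the options above).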
Analyze the data using the describe() function, which summarizes each numeric column (count, mean, standard deviation, min, quartiles, and max):
print(health_data.describe())
One Hot Encoding
Converting a categorical column into corresponding numeric values.
One way to do this is to have a column representing each group in the category.
For each column, the values will be 1 or 0, where 1 represents the inclusion of the group and 0 represents the exclusion. This transformation is called one hot encoding.
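For example, a Car column with the values Toyota, BMW and Ford becomes three columns, one per brand; each row has a 1 in exactly one of them and 0 in the others.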
The Python Pandas module has a function called get_dummies() which does one hot encoding.
import pandas as pd
cars = pd.read_csv('data.csv')
ohe_cars = pd.get_dummies(cars[['Car']])
print(ohe_cars.to_string())
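A minimal sketch (assuming data.csv has a Car column alongside the other columns) of how the dummy columns might be joined back onto the original frame:

import pandas as pd

cars = pd.read_csv('data.csv')
ohe_cars = pd.get_dummies(cars[['Car']])

# Replace the text column with its one-hot columns
cars_encoded = pd.concat([cars.drop(columns=['Car']), ohe_cars], axis=1)
print(cars_encoded.head())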
Feature Scaling
When your data has different values, and even different measurement units, it can
be difficult to compare them. What is kilograms compared to meters? Or altitude
compared to time?
The answer to this problem is scaling. We can scale data into new values that are
easier to compare.
There are different methods for scaling data; in this tutorial we will use a method
called standardization.
The standardization method uses this formula:
z = (x - u) / s
Where z is the new value, x is the original value, u is the mean and s is
the standard deviation.
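For example (illustrative numbers only): if a column has mean u = 1000 and standard deviation s = 200, an original value x = 790 is scaled to z = (790 - 1000) / 200 = -1.05.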
import pandas
from sklearn import linear_model
from sklearn.preprocessing import StandardScaler
scale = StandardScaler()
df = pandas.read_csv("data.csv")
X = df[['Weight', 'Volume']]
scaledX = scale.fit_transform(X)
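The scaled values are typically fed into a model. A hedged continuation of the snippet above, assuming data.csv also contains a 'CO2' target column (the column name is an assumption, and it explains why linear_model is imported):

import pandas
from sklearn import linear_model
from sklearn.preprocessing import StandardScaler

scale = StandardScaler()
df = pandas.read_csv("data.csv")

X = df[['Weight', 'Volume']]
y = df['CO2']                       # assumed target column name

scaledX = scale.fit_transform(X)    # standardize each column: z = (x - u) / s

regr = linear_model.LinearRegression()
regr.fit(scaledX, y)

# New observations must be transformed with the same fitted scaler before predicting
scaled_new = scale.transform([[2300, 1.3]])
print(regr.predict(scaled_new))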
https://www.analyticsvidhya.com/blog/2020/03/one-hot-encoding-vs-label-encoding-using-scikit-learn/
https://medium.com/@michaeldelsole/what-is-one-hot-encoding-and-how-to-do-it-f0ae272f1179