Unit 2 Data Preprocessing

Data preprocessing is essential for preparing raw data for machine learning, addressing issues like noise and missing values to enhance model accuracy. Key steps include importing datasets, handling missing data, encoding categorical data, and splitting datasets into training and test sets. Data transformation further improves data quality and facilitates integration, although it can be time-consuming and complex.


Unit 2 Data Pre-processing

Data Preprocessing
• It is the process of preparing raw data and making it suitable for a machine learning model.
• It is the first and most crucial step when creating a machine learning model.
• Real-world data generally contains noise and missing values, and may be in an unusable format that cannot be fed directly to machine learning models.
• Data preprocessing comprises the tasks needed to clean the data and make it suitable for a machine learning model, which also increases the model's accuracy and efficiency.

Steps in data preprocessing:
• Getting the dataset
• Importing libraries
• Importing datasets
• Finding missing data
• Encoding categorical data
• Splitting the dataset into training and test sets

Getting the dataset
• Data is a valuable asset to any company today.
• The first thing we need is a dataset, as an ML model works entirely on data.
• The data collected for a particular problem, in a proper format, is known as the dataset.
• Datasets may come in different formats for different purposes, e.g. Excel (xlsx), CSV, HTML, images, etc.

Importing libraries
• We need to import some predefined Python libraries; each library is used to perform specific jobs.
• import pandas as pd — for importing and managing datasets; an open-source data manipulation and analysis library.
• import numpy as np — for any type of mathematical operation in the code; the fundamental package for scientific computation in Python. It also supports large, multidimensional arrays and matrices.
• import matplotlib.pyplot as plt — for data visualization; matplotlib is a Python 2D plotting library, and we import its pyplot sub-library to plot any type of chart in Python.
• import seaborn as sns — for advanced data visualization.
• from sklearn.impute import SimpleImputer and from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler — scikit-learn provides preprocessing, cross-validation, and model-selection tools through a unified interface, and supports various machine-learning tasks such as classification, regression, clustering, and many more. (Note: SimpleImputer lives in sklearn.impute, not sklearn.preprocessing.)

Importing datasets
• The next key step is to load the data that will be utilized in the machine learning algorithm.
• Many companies start by storing data in warehouses that require data to pass through an ETL (extract, transform, load) pipeline.
• The problem with this method is that you never know in advance which data will be useful for an ML project.
• As a result, warehouses are commonly accessed through business intelligence interfaces in order to observe metrics that we already know we need to monitor.
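A minimal runnable sketch of the loading-and-inspecting step, using a small inline DataFrame instead of a real file (the column names and values here are made up for illustration; in practice the data would come from pd.read_csv or pd.read_excel):

```python
import numpy as np
import pandas as pd

# Tiny stand-in dataset; in practice: df = pd.read_excel("path/to/file.xlsx")
df = pd.DataFrame({
    "University": ["A", "B", "A", "C"],
    "Program": ["CS", "EE", "CS", "ME"],
    "Salary": [50000, np.nan, 62000, 58000],
})

print(df.head())   # first rows of the dataset
print(df.shape)    # (rows, columns) -> (4, 3)
df.info()          # column types and non-null counts
```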
Cont..
• # Load the dataset
  df = pd.read_excel('E:/Data/Salary.xlsx')
• # Inspect it
  print(df.head())  # or just df.head()
  df.info()
  df.describe()
• # If we want to delete columns from the dataset
  df = df.drop(["Unnamed: 5", "Unnamed: 6", "Unnamed: 7", "Unnamed: 8", "Unnamed: 9", "Unnamed: 10"], axis=1)

Finding Missing Data
• The next step of data preprocessing is to handle missing data in the dataset. If our dataset contains missing data, it may create a huge problem for our machine learning model.
• # Checking the null values in the dataset
  df.isna()
  df.isna().sum()
• To delete the rows containing null values: df.dropna()
• # Handling missing data (replacing missing data with the mean value)
  from sklearn.impute import SimpleImputer
  imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
  (The old sklearn.preprocessing.Imputer class has been removed from scikit-learn; SimpleImputer is its replacement, and it has no axis argument — it always imputes column-wise.)
• # Fitting the imputer object to the independent variables x
  imputer = imputer.fit(x[:, 1:3])
• # Replacing missing data with the calculated mean value
  x[:, 1:3] = imputer.transform(x[:, 1:3])
• Afterwards, check the missing values again.
• Check the dataset / actual dataset size.
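Put together as a self-contained sketch, with a made-up feature matrix standing in for the slides' x:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up feature matrix with missing values (stand-in for the slides' x)
x = np.array([[1.0, 20.0, 50000.0],
              [2.0, np.nan, 62000.0],
              [3.0, 30.0, np.nan]])

# Replace NaNs in columns 1..2 with each column's mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(x[:, 1:3])
x[:, 1:3] = imputer.transform(x[:, 1:3])

print(x)  # NaN at [1,1] becomes 25.0 (mean of 20, 30); [2,2] becomes 56000.0
```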

Noisy data
• Noisy data is meaningless data that cannot be interpreted by machines. It can be generated by faulty data collection, data entry errors, etc. Smoothing techniques:
• Binning method: works on sorted data in order to smooth it. The data is divided into segments (bins) of equal size, and each segment is handled separately: all values in a segment can be replaced by the segment's mean, or the boundary values can be used.
• Regression: data can be smoothed by fitting it to a regression function. The regression may be linear (one independent variable) or multiple (several independent variables).
• Clustering: groups similar data into clusters. Outliers either go undetected or fall outside the clusters.

Encoding Categorical Data
• Categorical data is data that takes a limited set of categories; in our dataset there are two categorical variables, University Name and Program. Other examples are gender and colour. Encoding converts these strings into integers.
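A sketch of the two encoders named earlier in the deck, on a made-up Program column (the real University Name / Program columns are assumed to look similar):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

programs = np.array(["CS", "EE", "CS", "ME"])

# LabelEncoder: string categories -> integer codes, in sorted order
# (CS=0, EE=1, ME=2)
le = LabelEncoder()
codes = le.fit_transform(programs)
print(codes)  # [0 1 0 2]

# OneHotEncoder: one binary column per category, which avoids implying
# an ordering between categories; .toarray() densifies the sparse result
ohe = OneHotEncoder()
onehot = ohe.fit_transform(programs.reshape(-1, 1)).toarray()
print(onehot)  # 4 rows x 3 columns of 0/1
```

LabelEncoder is convenient for target labels; for input features, one-hot encoding is usually preferred so the model does not treat the integer codes as ordered.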
Splitting dataset into training and test set
• We divide our dataset into a training set and a test set.
• This lets us evaluate, and thus improve, the performance of our machine learning model.
• Training set: the subset of the dataset used to train the machine learning model; the outputs are already known.
• Test set: the subset of the dataset used to test the machine learning model; the model predicts the outputs for the test set.
• from sklearn.model_selection import train_test_split
• x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

Data Integration
• The process of combining data from multiple sources into a single, unified view.
• This can involve cleaning and transforming the data, as well as resolving any inconsistencies or conflicts that may exist between the different sources.
• The goal of data integration is to make the data more useful and meaningful for the purposes of analysis and decision making.
• Techniques used in data integration include data warehousing, ETL (extract, transform, load) processes, and data federation.
• Data integration is a data preprocessing technique that combines data from multiple heterogeneous data sources into a coherent data store and provides a unified view of the data.
• Formally: G = global schema, S = heterogeneous source schemas, M = mapping between the queries of the source and global schemas.
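The split shown above can be run end-to-end on a small made-up x and y:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples, 2 features each
x = np.arange(20).reshape(10, 2)
y = np.arange(10)

# 80% training, 20% test; random_state fixes the shuffle so the
# split is reproducible across runs
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0)

print(x_train.shape, x_test.shape)  # (8, 2) (2, 2)
```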

Data Transformation
• Data transformation in data mining refers to the process of converting raw data into a format that is suitable for analysis and modeling.
• The goal of data transformation is to prepare the data for data mining so that it can be used to extract useful insights and knowledge.
• Common transformation techniques:
  • Data cleaning: removing or correcting errors, inconsistencies, and missing values in the data.
  • Data integration: combining data from multiple sources, such as databases and spreadsheets, into a single format.
  • Data normalization: scaling the data to a common range of values, such as between 0 and 1, to facilitate comparison and analysis.
  • Data reduction: reducing the dimensionality of the data by selecting a subset of relevant features or attributes.
  • Data discretization: converting continuous data into discrete categories or bins.
  • Data aggregation: combining data at different levels of granularity, such as by summing or averaging, to create new features or attributes.
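A sketch of the normalization step on a made-up salary column: MinMaxScaler rescales to the 0–1 range described above, while StandardScaler (imported earlier in the deck) standardizes to mean 0 and unit variance instead.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up single-feature column
x = np.array([[50000.0], [62000.0], [58000.0]])

# Min-max normalization: smallest value -> 0, largest -> 1
mm = MinMaxScaler()
x_01 = mm.fit_transform(x)
print(x_01.ravel())

# Standardization: subtract the mean, divide by the standard deviation
ss = StandardScaler()
x_std = ss.fit_transform(x)
print(x_std.mean())  # approximately 0
```

Min-max scaling preserves the shape of the distribution but is sensitive to outliers; standardization is the usual choice when features must be on comparable scales for distance-based or gradient-based models.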

Advantages of Data Transformation
• Improves data quality: removes errors, inconsistencies, and missing values from the data.
• Facilitates data integration: enables the integration of data from multiple sources, which can improve the accuracy and completeness of the data.
• Improves data analysis: prepares the data for analysis and modeling by normalizing, reducing dimensionality, and discretizing the data.
• Increases data security: can be used to mask sensitive data, or to remove sensitive information from the data.
• Enhances data mining algorithm performance: reducing the dimensionality of the data and scaling it to a common range of values can improve algorithm performance.

Disadvantages of Data Transformation
• Time-consuming: especially when dealing with large datasets.
• Complexity: requires specialized skills and knowledge to implement and to interpret the results.
• Data loss: e.g. when discretizing continuous data, or when removing attributes or features from the data.
• Biased transformation: if the data is not properly understood or used.
• High cost: can be an expensive process, requiring significant investments in hardware, software, and personnel.
