Data Preprocessing
• Noisy Data: Noisy data is meaningless data that machines cannot interpret. It can be generated by faulty data collection, data-entry errors, and similar problems. It can be handled in the following ways:
o Binning Method: This method smooths sorted data. The whole dataset is divided into segments (bins) of equal size, and each segment is handled separately: all values in a segment can be replaced by the segment's mean, or boundary values can be used instead. A minimal sketch of bin-mean smoothing follows this list.
o Regression: Data can be smoothed by fitting it to a regression function. The regression used may be linear (one independent variable) or multiple (several independent variables).
o Clustering: This approach groups similar data points into clusters. Outliers can then be detected because they fall outside the clusters.
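As a minimal sketch of the binning method above (the bin size and the sample prices are illustrative assumptions), sorted values can be smoothed by replacing each equal-size bin with its mean:

```python
import numpy as np

def smooth_by_bin_means(values, bin_size=3):
    """Smooth sorted values by replacing each equal-size bin with its mean."""
    data = np.sort(np.asarray(values, dtype=float))
    smoothed = data.copy()
    for start in range(0, len(data), bin_size):
        segment = data[start:start + bin_size]
        smoothed[start:start + bin_size] = segment.mean()
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]   # illustrative sample
print(smooth_by_bin_means(prices))            # [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]
```

Smoothing by bin boundaries works the same way, except each value in a segment is replaced by whichever of the segment's boundary values (minimum or maximum) is closer to it.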
2. Data Transformation: This step transforms the data into forms suitable for the mining process. It involves the following techniques:
• Normalization: Scales the data values into a specified range, such as -1.0 to 1.0 or 0.0 to 1.0 (see the sketch after this list).
• Attribute Selection: In this strategy, new attributes are constructed from the given set of
attributes to help the mining process.
• Discretization: Replaces the raw values of a numeric attribute with interval labels or conceptual labels.
• Concept Hierarchy Generation: Attributes are converted from a lower level to a higher level in a hierarchy. For example, the attribute “city” can be generalized to “country”.
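A minimal sketch of min-max normalization and discretization (the age values and the bin count are illustrative assumptions):

```python
import numpy as np

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly into [new_min, new_max]."""
    v = np.asarray(values, dtype=float)
    return (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

def discretize(values, bins):
    """Replace raw numeric values with interval indices."""
    v = np.asarray(values, dtype=float)
    edges = np.linspace(v.min(), v.max(), bins + 1)
    return np.digitize(v, edges[1:-1])  # interval index per value

ages = [18, 25, 32, 47, 51, 66]
print(min_max_normalize(ages))   # values scaled into [0.0, 1.0]
print(discretize(ages, bins=3))  # e.g. 0 = young, 1 = middle-aged, 2 = senior
```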
3. Data Reduction: Data reduction is a crucial step in the data mining process that involves
reducing the size of the dataset while preserving the important information. This is done to
improve the efficiency of data analysis and to avoid overfitting of the model. Some common
steps involved in data reduction are:
• Feature Selection: This involves selecting a subset of relevant features from the dataset, typically to remove irrelevant or redundant features. It can be done using techniques such as correlation analysis and mutual information.
• Feature Extraction: This involves transforming the data into a lower-dimensional space while preserving the important information. It is often used when the original features are high-dimensional and complex, and can be done using techniques such as principal component analysis (PCA), linear discriminant analysis (LDA), and non-negative matrix factorization (NMF). A sketch covering extraction, sampling, and clustering-based reduction follows this list.
• Sampling: This involves selecting a subset of data points from the dataset. Sampling is
often used to reduce the size of the dataset while preserving the important information. It
can be done using techniques such as random sampling, stratified sampling, and systematic
sampling.
• Clustering: This involves grouping similar data points together into clusters. Clustering is
often used to reduce the size of the dataset by replacing similar data points with a
representative centroid. It can be done using techniques such as k-means, hierarchical
clustering, and density-based clustering.
• Compression: This involves compressing the dataset while preserving the important information. Compression is often used to reduce the size of the dataset for storage and transmission purposes. It can be done using techniques such as wavelet compression, JPEG compression, and GIF compression.
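A minimal sketch of three of these reduction techniques on synthetic data, assuming scikit-learn is available; the array sizes, component count, and cluster count are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))   # synthetic dataset: 200 samples, 10 features

# Feature extraction: project onto the 3 directions of highest variance.
X_pca = PCA(n_components=3).fit_transform(X)
print(X_pca.shape)               # (200, 3)

# Random sampling: keep a quarter of the rows.
idx = rng.choice(len(X), size=len(X) // 4, replace=False)
print(X[idx].shape)              # (50, 10)

# Clustering-based reduction: replace points with 5 k-means centroids.
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_.shape)  # (5, 10)
```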
How is Data Preprocessing Used?
As noted earlier, this is one of the reasons data preprocessing is important in the early stages of developing machine learning and AI applications. In an AI context, data preprocessing is applied to optimize how data is cleansed, transformed, and structured, improving the accuracy of a new model while using less computing power.
A good data preprocessing step yields a set of components or tools that can be used to quickly prototype ideas or run experiments aimed at improving business processes or customer satisfaction. For instance, preprocessing can improve how data is organized for a recommendation engine by refining the customer age ranges used for categorisation.
It can also simplify the creation and enhancement of data for better business intelligence (BI), which benefits the business. For instance, customers of different sizes, categories, or regions may exhibit different behaviors across regions. Processing the data into the correct formats in the backend enables BI teams to integrate such findings into BI dashboards.
More broadly, data preprocessing is a sub-process of web mining used in customer relationship management (CRM). Web usage logs are typically preprocessed to derive meaningful datasets called user transactions, which are groups of URL references. Sessions may be stored to identify the user, the websites requested, and their sequence and time of use. Once extracted from the raw data, these yield more meaningful information that can be used, for instance, in consumer analysis, product promotion, or customization.
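As a minimal sketch of sessionizing web usage logs, assuming each log record is a (user, timestamp, URL) tuple and that a 30-minute inactivity gap separates sessions (both assumptions, not a standard):

```python
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)   # assumed inactivity timeout

def sessionize(log):
    """Group (user, timestamp, url) records into per-user sessions."""
    sessions = defaultdict(list)      # user -> list of sessions (URL lists)
    last_seen = {}
    for user, ts, url in sorted(log, key=lambda r: (r[0], r[1])):
        # Start a new session for a first-seen user or after a long gap.
        if user not in last_seen or ts - last_seen[user] > SESSION_GAP:
            sessions[user].append([])
        sessions[user][-1].append(url)
        last_seen[user] = ts
    return sessions

log = [("u1", datetime(2024, 1, 1, 9, 0), "/home"),
       ("u1", datetime(2024, 1, 1, 9, 5), "/products"),
       ("u1", datetime(2024, 1, 1, 11, 0), "/home")]   # new session after gap
print(dict(sessionize(log)))  # {'u1': [['/home', '/products'], ['/home']]}
```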
Conclusion
Data preprocessing plays a central role in both data quality inspection and data analysis. With these steps, the data mining process becomes effective and the results obtained are accurate. The exact process followed during data preprocessing may vary from one dataset to another, depending on the analysis that is needed.
Why Should We Use Data Processing?
In the modern era, most work relies on data, so large amounts of data are collected for different purposes: academic and scientific research, institutional use, personal and private use, commercial purposes, and more. The collected data must be processed so that it goes through all the above steps and is sorted, stored, filtered, presented in the required format, and analyzed.
The time consumed and the intricacy of processing depend on the required results. In situations where large amounts of data are acquired, processing is essential for obtaining authentic results, which is why data processing is indispensable in data mining and data research.
Manual Data Processing: In this method, data is processed manually. The entire procedure of data collection, filtering, sorting, calculation, and other logical operations is carried out with human intervention, without any electronic device or automation software. It is a low-cost methodology and needs few tools, but it produces many errors and requires high labor costs and a lot of time.
Mechanical Data Processing: Data is processed mechanically using devices and machines, such as calculators, typewriters, and printing presses. Simple data processing operations can be achieved with this method. It has far fewer errors than manual data processing, but the growth of data has made this method more complex and difficult.
1. Batch Processing: In this type of data processing, data is collected and processed in batches. It is used for large amounts of data; a payroll system is a typical example (a minimal batch-processing sketch appears after the examples below).
2. Single User Programming Processing: This is usually done by a single person for personal use. The technique is suitable even for small offices.
3. Multiple Programming Processing: This technique allows more than one program to be stored and executed simultaneously in the Central Processing Unit (CPU). Data is broken down into frames and processed using two or more CPUs within a single computer system, which is also known as parallel processing. Multiple programming increases the computer's overall working efficiency. Weather forecasting is a good example of multiple programming processing.
4. Real-time Processing: This technique gives the user direct contact with the computer system and eases data processing. Also known as the direct mode or interactive mode technique, it is developed exclusively to perform one task. It is a form of online processing that always remains under execution. Withdrawing money from an ATM is an example.
5. Online Processing: This technique allows data to be entered and executed directly, rather than being stored and accumulated first and processed later. It is designed to reduce data-entry errors, since it validates data at various points and ensures that only correct data is entered. The technique is widely used for online applications, such as barcode scanning.
6. Time-sharing Processing: This is another form of online data processing that allows several users to share the resources of an online computer system. It is adopted when results are needed swiftly, and, as the name suggests, the system is time-based. The major advantages of time-sharing processing include:
o Several users can be served simultaneously.
o All users get an almost equal amount of processing time.
o Users can interact with the running programs.
7. Distributed Processing: This is a specialized data processing technique in which various remotely located computers remain interconnected with a single host computer, forming a network of computers linked by a high-speed communication network. The central computer system maintains the master database and monitors the network accordingly, which facilitates communication between the computers.
Some real-life examples of data processing include:
o Stock trading software that converts millions of stock data into a simple graph.
o An e-commerce company uses the search history of customers to recommend similar
products.
o A digital marketing company uses demographic data of people to strategize location-
specific campaigns.
o A self-driving car uses real-time data from sensors to detect if there are pedestrians and
other cars on the road.
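As a minimal batch-processing sketch for the payroll example above (the file name payroll.csv, its salary column, and the chunk size are all hypothetical), a total could be computed over fixed-size batches with pandas:

```python
import pandas as pd

total = 0.0
# Process the hypothetical payroll file in batches of 10,000 rows
# instead of handling each record the moment it arrives.
for batch in pd.read_csv("payroll.csv", chunksize=10_000):
    total += batch["salary"].sum()    # "salary" is an assumed column name
print(f"Total payroll: {total:.2f}")
```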
The complexity of this process depends on the scope of data collection and the complexity of the required results. How time-consuming it is depends on the steps that must be performed on the collected data and the type of output file desired. This becomes a real issue when a large amount of data needs to be processed, which is why data mining is so widely used nowadays.
Once data is gathered, it needs to be stored. Data can be stored in physical form using paper-based documents, on laptops and desktop computers, or on other data storage devices. With the rise and rapid development of data mining and big data, the process of data collection has become more complicated and time-consuming, and many operations are necessary to conduct a thorough data analysis.
At present, data is mostly stored in digital form, which allows it to be processed faster and converted into different formats, so the user can choose the most suitable output.