Data Preprocessing
• Noisy Data: Noisy data is meaningless data that machines cannot interpret. It can be generated by faulty data collection, data-entry errors, and similar problems. It can be handled in the following ways:
o Binning Method: This method smooths sorted data. The whole dataset is divided into segments (bins) of equal size, and each segment is handled separately: all values in a segment can be replaced by the segment's mean, or boundary values can be used instead. A minimal sketch of bin-mean smoothing follows this list.
o Regression: Data can be smoothed by fitting it to a regression function. The regression used may be linear (one independent variable) or multiple (several independent variables).
o Clustering: This approach groups similar data points into clusters. Outliers can then be detected because they fall outside the clusters.
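As a minimal sketch of the binning method above (the bin size and the sample prices are illustrative assumptions), sorted values can be smoothed by replacing each equal-size bin with its mean:

```python
import numpy as np

def smooth_by_bin_means(values, bin_size=3):
    """Smooth sorted values by replacing each equal-size bin with its mean."""
    data = np.sort(np.asarray(values, dtype=float))
    smoothed = data.copy()
    for start in range(0, len(data), bin_size):
        segment = data[start:start + bin_size]
        smoothed[start:start + bin_size] = segment.mean()
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]   # illustrative sample
print(smooth_by_bin_means(prices))            # [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]
```

Smoothing by bin boundaries works the same way, except each value in a segment is replaced by whichever of the segment's boundary values (minimum or maximum) is closer to it.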
2. Data Transformation: This step transforms the data into forms suitable for the mining process. It involves the following techniques:
• Normalization: Scales the data values into a specified range, such as -1.0 to 1.0 or 0.0 to 1.0 (see the sketch after this list).
• Attribute Selection: In this strategy, new attributes are constructed from the given set of
attributes to help the mining process.
• Discretization: Replaces the raw values of a numeric attribute with interval labels or conceptual labels.
• Concept Hierarchy Generation: Attributes are converted from a lower level to a higher level in a hierarchy. For example, the attribute “city” can be generalized to “country”.
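A minimal sketch of min-max normalization and discretization (the age values and the bin count are illustrative assumptions):

```python
import numpy as np

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly into [new_min, new_max]."""
    v = np.asarray(values, dtype=float)
    return (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

def discretize(values, bins):
    """Replace raw numeric values with interval indices."""
    v = np.asarray(values, dtype=float)
    edges = np.linspace(v.min(), v.max(), bins + 1)
    return np.digitize(v, edges[1:-1])  # interval index per value

ages = [18, 25, 32, 47, 51, 66]
print(min_max_normalize(ages))   # values scaled into [0.0, 1.0]
print(discretize(ages, bins=3))  # e.g. 0 = young, 1 = middle-aged, 2 = senior
```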
3. Data Reduction: Data reduction is a crucial step in the data mining process that involves
reducing the size of the dataset while preserving the important information. This is done to
improve the efficiency of data analysis and to avoid overfitting of the model. Some common
steps involved in data reduction are:
• Feature Selection: This involves selecting a subset of relevant features from the dataset, typically to remove irrelevant or redundant features. It can be done using techniques such as correlation analysis and mutual information.
• Feature Extraction: This involves transforming the data into a lower-dimensional space while preserving the important information. It is often used when the original features are high-dimensional and complex, and can be done using techniques such as principal component analysis (PCA), linear discriminant analysis (LDA), and non-negative matrix factorization (NMF). A sketch covering extraction, sampling, and clustering-based reduction follows this list.
• Sampling: This involves selecting a subset of data points from the dataset. Sampling is
often used to reduce the size of the dataset while preserving the important information. It
can be done using techniques such as random sampling, stratified sampling, and systematic
sampling.
• Clustering: This involves grouping similar data points together into clusters. Clustering is
often used to reduce the size of the dataset by replacing similar data points with a
representative centroid. It can be done using techniques such as k-means, hierarchical
clustering, and density-based clustering.
• Compression: This involves compressing the dataset while preserving the important information. Compression is often used to reduce the size of the dataset for storage and transmission purposes. It can be done using techniques such as wavelet compression, JPEG compression, and GIF compression.
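A minimal sketch of three of these reduction techniques on synthetic data, assuming scikit-learn is available; the array sizes, component count, and cluster count are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))   # synthetic dataset: 200 samples, 10 features

# Feature extraction: project onto the 3 directions of highest variance.
X_pca = PCA(n_components=3).fit_transform(X)
print(X_pca.shape)               # (200, 3)

# Random sampling: keep a quarter of the rows.
idx = rng.choice(len(X), size=len(X) // 4, replace=False)
print(X[idx].shape)              # (50, 10)

# Clustering-based reduction: replace points with 5 k-means centroids.
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_.shape)  # (5, 10)
```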
How is Data Preprocessing Used?
As noted earlier, this is one of the reasons data preprocessing is important in the early stages of developing machine learning and AI applications. In an AI context, data preprocessing is applied to optimize how data is cleansed, transformed, and structured, improving the accuracy of a new model while using less computing power.
A good data preprocessing step yields a set of components or tools that can be used to quickly prototype ideas or run experiments aimed at improving business processes or customer satisfaction. For instance, preprocessing can improve how data is organized for a recommendation engine by refining the customer age ranges used for categorisation.
It can also simplify the creation and enhancement of data for better business intelligence (BI), which benefits the business. For instance, customers of different sizes, categories, or regions may exhibit different behaviors across regions. Processing the data into the correct formats in the backend enables BI teams to integrate such findings into BI dashboards.
More broadly, data preprocessing is a sub-process of web mining used in customer relationship management (CRM). Web usage logs are typically preprocessed to derive meaningful datasets called user transactions, which are groups of URL references. Sessions may be stored to identify the user, the websites requested, and their sequence and time of use. Once extracted from the raw data, these yield more meaningful information that can be used, for instance, in consumer analysis, product promotion, or customization.
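As a minimal sketch of sessionizing web usage logs, assuming each log record is a (user, timestamp, URL) tuple and that a 30-minute inactivity gap separates sessions (both assumptions, not a standard):

```python
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)   # assumed inactivity timeout

def sessionize(log):
    """Group (user, timestamp, url) records into per-user sessions."""
    sessions = defaultdict(list)      # user -> list of sessions (URL lists)
    last_seen = {}
    for user, ts, url in sorted(log, key=lambda r: (r[0], r[1])):
        # Start a new session for a first-seen user or after a long gap.
        if user not in last_seen or ts - last_seen[user] > SESSION_GAP:
            sessions[user].append([])
        sessions[user][-1].append(url)
        last_seen[user] = ts
    return sessions

log = [("u1", datetime(2024, 1, 1, 9, 0), "/home"),
       ("u1", datetime(2024, 1, 1, 9, 5), "/products"),
       ("u1", datetime(2024, 1, 1, 11, 0), "/home")]   # new session after gap
print(dict(sessionize(log)))  # {'u1': [['/home', '/products'], ['/home']]}
```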
Conclusion
Data preprocessing plays a central role in both data quality inspection and data analysis. With these steps, the data mining process becomes effective and the results obtained are accurate. The exact process followed during data preprocessing may vary from one dataset to another, depending on the analysis that is needed.
Why Should We Use Data Processing?
In the modern era, most work relies on data, so large amounts of data are collected for different purposes: academic and scientific research, institutional use, personal and private use, commercial purposes, and more. The collected data must be processed so that it goes through all the above steps and is sorted, stored, filtered, presented in the required format, and analyzed.
The time consumed and the intricacy of processing depend on the required results. In situations where large amounts of data are acquired, processing is essential for obtaining authentic results, which is why data processing is indispensable in data mining and data research.
Manual Data Processing: In this method, data is processed manually. The entire procedure of data collection, filtering, sorting, calculation, and other logical operations is carried out with human intervention, without any electronic device or automation software. It is a low-cost methodology and needs few tools, but it produces many errors and requires high labor costs and a lot of time.
Mechanical Data Processing: Data is processed mechanically using devices and machines, such as calculators, typewriters, and printing presses. Simple data processing operations can be achieved with this method. It has far fewer errors than manual data processing, but the growth of data has made this method more complex and difficult.
1. Batch Processing: In this type of data processing, data is collected and processed in batches. It is used for large amounts of data; a payroll system is a typical example (a minimal batch-processing sketch appears after the examples below).
2. Single User Programming Processing: This is usually done by a single person for personal use. The technique is suitable even for small offices.
3. Multiple Programming Processing: This technique allows more than one program to be stored and executed simultaneously in the Central Processing Unit (CPU). Data is broken down into frames and processed using two or more CPUs within a single computer system, which is also known as parallel processing. Multiple programming increases the computer's overall working efficiency. Weather forecasting is a good example of multiple programming processing.
4. Real-time Processing: This technique gives the user direct contact with the computer system and eases data processing. Also known as the direct mode or interactive mode technique, it is developed exclusively to perform one task. It is a form of online processing that always remains under execution. Withdrawing money from an ATM is an example.
5. Online Processing: This technique allows data to be entered and executed directly, rather than being stored and accumulated first and processed later. It is designed to reduce data-entry errors, since it validates data at various points and ensures that only correct data is entered. The technique is widely used for online applications, such as barcode scanning.
6. Time-sharing Processing: This is another form of online data processing that allows several users to share the resources of an online computer system. It is adopted when results are needed swiftly, and, as the name suggests, the system is time-based. The major advantages of time-sharing processing include:
o Several users can be served simultaneously.
o All users get an almost equal amount of processing time.
o Users can interact with the running programs.
7. Distributed Processing: This is a specialized data processing technique in which various remotely located computers remain interconnected with a single host computer, forming a network of computers linked by a high-speed communication network. The central computer system maintains the master database and monitors the network accordingly, which facilitates communication between the computers.
Some real-life examples of data processing include:
o Stock trading software that converts millions of stock data into a simple graph.
o An e-commerce company uses the search history of customers to recommend similar
products.
o A digital marketing company uses demographic data of people to strategize location-
specific campaigns.
o A self-driving car uses real-time data from sensors to detect if there are pedestrians and
other cars on the road.
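As a minimal batch-processing sketch for the payroll example above (the file name payroll.csv, its salary column, and the chunk size are all hypothetical), a total could be computed over fixed-size batches with pandas:

```python
import pandas as pd

total = 0.0
# Process the hypothetical payroll file in batches of 10,000 rows
# instead of handling each record the moment it arrives.
for batch in pd.read_csv("payroll.csv", chunksize=10_000):
    total += batch["salary"].sum()    # "salary" is an assumed column name
print(f"Total payroll: {total:.2f}")
```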
The complexity of this process depends on the scope of data collection and the complexity of the required results. How time-consuming it is depends on the steps that must be performed on the collected data and the type of output file desired. This becomes a real issue when a large amount of data needs to be processed, which is why data mining is so widely used nowadays.
Once data is gathered, it needs to be stored. Data can be stored in physical form using paper-based documents, on laptops and desktop computers, or on other data storage devices. With the rise and rapid development of data mining and big data, the process of data collection has become more complicated and time-consuming, and many operations are necessary to conduct a thorough data analysis.
At present, data is mostly stored in digital form, which allows it to be processed faster and converted into different formats, so the user can choose the most suitable output.