Data Mining - Unit - 3

The document discusses data processing and the steps involved. It describes the data processing cycle which consists of collecting raw data, preparing it, inputting it for processing, outputting the results, and storing the data. The key steps are collection, preparation by filtering and sorting data, input, processing using algorithms, output in a readable format, and storage. It also discusses different types of data processing like batch, real-time, online and parallel processing. Data preprocessing tasks like cleaning, integration, reduction and transformation are explained.


Data Processing in Data Mining
Unit-3
Data Processing
Data processing is the method of collecting raw
data and translating it into usable information. It is
usually performed in a step-by-step process.
The raw data is collected, filtered, sorted, processed,
analyzed, stored, and then presented in a readable
format.
Data processing is especially valuable in organisations.
By converting the data into a readable format like
graphs, charts, and documents, employees
throughout the organization can understand and use
the data.
Cycle in Data Processing
Cycles involved in Data Processing
The data processing cycle consists of a series of
steps where raw data (input) is fed into a process
(CPU) to produce actionable insights (output).
Step 1: Collection
Step 2: Preparation
Step 3: Input
Step 4: Data Processing
Step 5: Output
Step 6: Storage
Collection
Step 1: Collection
The collection of raw data is the first step of the
data processing cycle.
Raw data should be gathered from defined and
accurate sources.
Raw data should be valid and usable.
Raw data can include monetary figures, website
cookies, profit/loss statements of a company,
user behavior, etc.
Preparation in Data Processing
Step 2: Preparation
It is the process of sorting and filtering the raw
data to remove unnecessary and inaccurate data.
Raw data is checked and transformed into a
suitable form for further analysis and processing.
This is done to ensure that only the highest quality data is fed
into the processing unit.
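As a rough illustration (not from the original slides), the filtering and sorting described above could be sketched in Python with pandas; the file name raw_sales.csv and the amount and date columns are assumptions made for this example.

import pandas as pd  # assumes pandas is installed

# Load a hypothetical raw data file.
raw = pd.read_csv("raw_sales.csv")

# Filter: drop duplicate records and rows with clearly invalid values.
prepared = raw.drop_duplicates()
prepared = prepared[prepared["amount"] > 0]

# Sort: order records by date so later steps see a consistent sequence.
prepared = prepared.sort_values("date")

print(prepared.head())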
Input and Data Process
Step 3: Input
Raw data is converted into machine readable form and
fed into the processing unit.
It is done through a keyboard, scanner, or any other
input source.
Step 4: Data Processing
Raw data is subjected to various data processing
methods using machine learning and artificial
intelligence algorithms to generate a desirable output.
It may vary from process to process depending on the
source of data being processed (data lakes, online
databases, connected devices, etc.) and the intended
use of the output.
Output and Storage
Step 5: Output
The data is finally transmitted and displayed to the
user in a readable form like graphs, tables, vector
files, audio, video, documents, etc.
This output can be stored and further processed in
the next data processing cycle.
Step 6: Storage
Data and metadata are stored for further use.
This allows for quick access and retrieval of
information whenever needed, and it can be used
directly as input in the next data processing cycle.
Examples of Data Processing

I. Stock trading software that converts millions of
stock data points into a simple graph
II. An e-commerce company uses the search history
of customers to recommend similar products
III. A digital marketing company uses demographic
data of people to strategize location-specific
campaigns
IV. A self-driving car uses real-time data from sensors
to detect if there are pedestrians and other cars
on the road
Data Processing Methods

There are three main data processing methods –
manual,
mechanical
and electronic.
Data Processing Methods
Manual Processing:
Data is processed manually.
The entire data process is done with human intervention,
without the use of any other electronic device or automation
software.
Mechanical Data Processing:
Data is processed mechanically through the use of devices
and machines.
These can include simple devices such as calculators,
typewriters, printing presses, etc.
Electronic Data Processing:
Data is processed using data processing software and
programs.
A set of instructions is given to the software to process the
data and yield output.
Types of Data Processing

Batch Processing:
Data is collected and processed in batches. Used
for large amounts of data.
Eg: payroll system
Real-time Processing:
Data is processed within seconds when the input
is given.
Used for small amounts of data.
Eg: withdrawing money from ATM
Types of Data Processing

Online Processing:
Data is automatically fed into the CPU as soon as it
becomes available. Used for continuous processing
of data.
Eg: barcode scanning
Multiprocessing:
Data is broken down into frames and processed
using two or more CPUs within a single computer
system. Also known as parallel processing.
Eg: weather forecasting
Time-sharing:
Allocates computer resources and data in time slots
to several users simultaneously.
Data Preprocessing
It is the process of transforming raw data into an
understandable format.
It helps in checking the quality of the data before
applying machine learning or data mining
algorithms.
Need for Data preprocessing

Why is Data preprocessing important?
Preprocessing of data is mainly done to check the data
quality. The quality can be checked by the following:
Accuracy: To check whether the data entered is
correct or not.
Completeness: To check whether the data is
available and fully recorded.
Consistency: To check whether the same data is
kept consistently in all the places where it appears.
Timeliness: The data should be updated correctly.
Believability: The data should be trustworthy.
Interpretability: The understandability of the data.
Major Tasks in Data Preprocessing:
Data cleaning
Data integration
Data reduction
Data transformation
Data cleaning:
It is the process of removing incorrect, incomplete, and
inaccurate data from the datasets, and it also replaces
missing values.
Common techniques in data cleaning include handling
missing values, smoothing noisy data, regression, and
clustering.
Handling missing values
Standard values (such as "Not Available") can be used to replace
missing values.
Mean values suit attributes with a normal distribution.
Median values suit attributes with a non-normal distribution.
Most probable values can be estimated with regression and decision tree
algorithms.
All the above mentioned values are attribute values.
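A minimal sketch of these replacement strategies, assuming pandas is available; the age and city columns and their values are hypothetical.

import pandas as pd

df = pd.DataFrame({"age": [25, None, 40, 35, None],
                   "city": ["Delhi", "Mumbai", None, "Delhi", "Delhi"]})

df["age_mean"]   = df["age"].fillna(df["age"].mean())       # mean value
df["age_median"] = df["age"].fillna(df["age"].median())     # median value
df["city_mode"]  = df["city"].fillna(df["city"].mode()[0])  # most probable value
df["city_std"]   = df["city"].fillna("Not Available")       # standard value

print(df)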
Noisy Values
Noise generally means random error or unnecessary data points.
Methods to handle noisy data:
Binning:
It is a method to smooth or handle noisy data.
Data is sorted and then the sorted values are separated and
stored in the form of bins.
There are three methods for smoothing data in a bin.
Smoothing by bin mean: The values in the bin are replaced by
the mean value of the bin.
Smoothing by bin median: The values in the bin are replaced by
the median value of the bin.
Smoothing by bin boundary: The minimum and maximum of the bin
values are taken, and each value is replaced by the closest boundary
value.
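A minimal sketch of smoothing by bin mean with equal-frequency bins, in plain Python; the sample values are hypothetical.

values = [4, 8, 15, 21, 21, 24, 25, 28, 34]

sorted_vals = sorted(values)
bin_size = 3
smoothed = []
for i in range(0, len(sorted_vals), bin_size):
    bin_vals = sorted_vals[i:i + bin_size]
    mean = sum(bin_vals) / len(bin_vals)
    smoothed.extend([mean] * len(bin_vals))  # replace each value by its bin mean

print(smoothed)  # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]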
Data cleaning using noisy values
If you’re working with text data, for example, some
things you should consider when cleaning your data
are:
Remove URLs, symbols, emojis, etc., that aren’t
relevant to your analysis
Translate all text into the language you’ll be working
in
Remove HTML tags
Remove boilerplate email text
Remove unnecessary blank text between words
Remove duplicate data
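A minimal sketch of a few of the cleaning steps listed above, using only the Python standard library; the sample string is hypothetical.

import re

text = "<p>Great product!!  Visit https://example.com 😀</p>"

text = re.sub(r"<[^>]+>", " ", text)        # remove HTML tags
text = re.sub(r"https?://\S+", " ", text)   # remove URLs
text = re.sub(r"[^\w\s.,!?]", " ", text)    # remove symbols and emojis
text = re.sub(r"\s+", " ", text).strip()    # remove unnecessary blank text

print(text)  # "Great product!! Visit"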
Data cleaning using noisy values
Regression: This is used to smooth the data and
helps to handle data when unnecessary data is
present.
Regression also helps to decide which variables are
suitable for our analysis.
Clustering: This is used for finding outliers and
also for grouping the data. Clustering is generally
used in unsupervised learning.
Data integration:
The process of combining multiple sources into a
single dataset.
Problems to be considered during data integration:
Schema integration: Integrating metadata (a set of
data that describes other data) from different
sources.
Entity identification problem: Identifying entities
across multiple databases.
For example, the system should know that student_id in
one database and student_name in another
database belong to the same entity.
Data integration
Detecting and resolving data value conflicts:
The data taken from different databases may differ
while merging.
For instance, the attribute values from one database may
differ from another database.
For example, the date format may differ, like "MM/
DD/YYYY" versus "DD/MM/YYYY".
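A minimal sketch of resolving such a date-format conflict before merging, assuming pandas is available; the tables and their values are hypothetical.

import pandas as pd

db_a = pd.DataFrame({"student_id": [1, 2], "join_date": ["03/25/2023", "11/02/2023"]})  # MM/DD/YYYY
db_b = pd.DataFrame({"student_id": [3, 4], "join_date": ["25/03/2023", "02/11/2023"]})  # DD/MM/YYYY

# Normalise both sources to one datetime representation before integration.
db_a["join_date"] = pd.to_datetime(db_a["join_date"], format="%m/%d/%Y")
db_b["join_date"] = pd.to_datetime(db_b["join_date"], format="%d/%m/%Y")

combined = pd.concat([db_a, db_b], ignore_index=True)
print(combined)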
Data reduction:
 It helps in the reduction of the volume of the data
and produces the same or almost the same result.
It helps to reduce storage space.
The techniques in data reduction are:
Dimensionality reduction,
Numerosity reduction,
Data compression.
Data Reduction
Dimensionality reduction:
It is widely used in real-world applications.
The reduction of random variables or attributes is done so that
the dimensionality of the data set can be reduced.
It combines and merges the attributes of the data without losing
its original characteristics.
Numerosity Reduction:
Data is represented in a smaller way by reducing the volume.
There is no loss of data.
Data compression:
It is a compressed form of data.
It can be lossless or lossy.
When there is no loss of information during compression, it is
called lossless compression.
Lossy compression reduces the amount of information, but it removes
only the unnecessary information.
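A minimal sketch of dimensionality reduction using PCA (one common technique, assuming scikit-learn and numpy are available); the data is randomly generated for illustration.

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 10)   # 100 records with 10 attributes

pca = PCA(n_components=3)     # keep 3 combined attributes
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                      # (100, 3)
print(pca.explained_variance_ratio_.sum())  # share of variance preserved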
Data Transformation
The change made in the format or the
structure of the data is called data transformation.
Methods in data transformation:
Smoothing:
Aggregation:
Discretization:
Normalization:
Data Transformation

Smoothing:
Noise from the data set is removed.
Features of the dataset can be identified.
Changes that help in prediction can be identified easily.

Aggregation:
Data is stored and presented in the form of a summary.
Data from multiple sources is integrated into a
data analysis description.
The accuracy of the data depends on its quantity and
quality.
When the quality and the quantity of the data are good,
the results are more relevant.
Data Transformation
Discretization:
The continuous data is split into intervals.
It reduces the data size.
For example, rather than specifying the class time,
we can set an interval like (3 pm-5 pm, 6 pm-8 pm).
Normalization:
It is the method of scaling the data so that it can
be represented in a smaller range, for example
ranging from -1.0 to 1.0.
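A minimal sketch of discretization and min-max normalization, assuming pandas is available; the marks values and bin edges are hypothetical.

import pandas as pd

marks = pd.Series([35, 48, 62, 71, 88, 95])

# Discretization: split continuous values into intervals.
grades = pd.cut(marks, bins=[0, 50, 75, 100], labels=["low", "medium", "high"])

# Normalization: rescale values into the range -1.0 to 1.0.
normalized = 2 * (marks - marks.min()) / (marks.max() - marks.min()) - 1

print(grades.tolist())
print(normalized.round(2).tolist())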
Tuning in data warehouse
Optimization and tuning in data warehouses
are the processes of selecting adequate
optimization techniques in order to make queries
and updates run faster and to maintain their
performance by maximizing the use of data
warehouse system resources.
Data Preprocessing Examples
In this example, we have three variables: name,
age, and company. In the first example we can tell
that #2 and #3 have been assigned the incorrect
companies.
Name Age Company
Karen 57 CVS Health
Elon 49 Amazon
Jeff 57 Tesla
Tim 60 Apple
Data Preprocessing Examples
We can use data cleaning to simply remove these
rows, as we know the data was improperly entered
or is otherwise corrupted.
Name Age Company
Karen 57 CVS Health
Tim 60 Apple
Data Preprocessing Examples
Alternatively, we can perform data transformation, in this case
manually, in order to fix the problem:
Name Age Company
Karen Lynch 57 CVS Health
Elon Musk 49 Tesla
Jeff Bezos 57 Amazon
Tim Cook 60 Apple
Once the issue is fixed, we can perform data reduction, in this
case by sorting on age in descending order, to choose which age
range we want to focus on:
Name Age Company
Tim Cook 60 Apple
Karen Lynch 57 CVS Health
Jeff Bezos 57 Amazon
Elon Musk 49 Tesla
Inconsistent Data in Data Mining
Data inconsistency is a situation where there are
multiple tables within a database that deal with
the same data but may receive it from different
inputs.
Inconsistency is generally compounded by data
redundancy.
It refers to problems with the content of a
database.
Common Causes of Inconsistent Databases
• Operating system backups.
• Incorrect installation paths.
• Disabling of the logging/recovery system.
• Use of unsupported hardware configurations.
Inconsistencies Due to Operating System Backups.
Example for Inconsistent Data
An organisation is broken up into different
departments, each using their own tools and
systems, each following their own processes and
with their own interpretation of the data points
they are creating and using.
HOW TO MINIMISE DATA
INCONSISTENCY
There are two approaches to tackle the problem of
data inconsistency across applications.
A central semantic store:
It involves logging and storing
all the rules used by the database
integration process in a single centralised
repository.
So that when data sources are updated or new
ones are added, they do not fall outside the data
integration rules.
HOW TO MINIMISE DATA
INCONSISTENCY
A master reference store:
It focuses on centralisation.
It focuses on reference data, and rules are made for
syncing all secondary tables when a change is
triggered in the main one.
Eg: student_id_no in the Student table (primary key);
the same student_id_no values and type are used in
the Marks table as well (reference key).
It locks the data down into a single central process to keep
greater control over the most important data points,
even if it does come at a greater use of processing
resources.
A Trick for Finding Inconsistent Data
Finding and fixing the inconsistencies:
1) Create a filter.
The filter allows you to see all of the unique
values in the column, making it easier to isolate
the incorrect values.
On the Home tab, go to Sort & Filter > Filter. If your
worksheet already has filters, you can skip this
step.
2) Click the filter drop-down arrow in the desired column.
3) A drop-down menu will appear, showing a list of all of the unique values in
the column. Deselect all of the correct values, leaving all of the incorrect
values selected. When you're done, click OK.
4) The spreadsheet will now be filtered to only show the incorrect values. In our
example, there were only a few errors, so we'll fix them manually by typing the
correct values for each one.
5) Click the column's filter drop-down arrow again
and make sure all of the values listed are correct.
When you're done, click Select All, then
click OK to show all of the rows.
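The steps above describe a spreadsheet filter; a rough equivalent in Python (an illustration, not part of the slides) is to list the unique values of a column with pandas and fix the odd ones. The company column and its values are hypothetical.

import pandas as pd

df = pd.DataFrame({"company": ["Apple", "Apple", "Aple", "Tesla", "tesla"]})

# Show each unique value and how often it occurs; rare spellings stand out.
print(df["company"].value_counts())

# Filter the rows holding suspicious values, then type in the corrections.
mask = df["company"].isin(["Aple", "tesla"])
df.loc[mask, "company"] = ["Apple", "Tesla"]
print(df)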
Tasks in data mining

Data Mining deals with what kinds of patterns can
be mined.
On the basis of the kind of data to be mined, there are
two kinds of tasks involved in Data Mining, listed
below:
Descriptive
Classification and Prediction
Data mining is categorized as:
Descriptive data mining:
It provides certain knowledge about the data.
Eg: count, average.
It gives information about what is happening inside the data
without any previous idea.
It exhibits the common features in the data.
It provides the general properties of the data present in the
database.
Predictive data mining:
This helps in understanding characteristics that are not
explicitly available.
Eg: the prediction of business performance in the next quarter
based on the performance of the previous quarters.
Predictive analysis predicts these characteristics using the
previously available data.
Continue....
Descriptive
The descriptive function deals with general
properties of data in the database.
Here is the list of descriptive functions:
Class/Concept Description
Mining of Frequent Patterns
Mining of Associations
Mining of Correlations
Mining of Clusters
Class/Concept Description:
Characterization

Data is associated with classes or concepts so they
can be correlated with results.
For example, a new iPhone model is released in
three variants to address targeted customers
based on their requirements, like Pro, Pro Max, and
Plus.
It produces the characteristic rules for the target
class, like our iPhone buyers.
We can collect the data using simple SQL queries
and perform OLAP functions to generalize the data.
Continue…
It involves summarizing the generic data features.
It explains the specific rules that define a target
class.
To characterize the data, an attribute-oriented
induction technique can be used.
The resultant characterized data can be visualized
in the form of different types of graphs, charts, or
tables.
Mining frequent Patterns
Mining of Frequent Patterns
Frequent patterns are those patterns that occur
frequently in transactional data. Here is the list of
kinds of frequent patterns:
Frequent Item Set –
It refers to a set of items that frequently appear
together, for example milk and bread.
Frequent Subsequence –
A sequence of patterns that occur frequently, such
as purchasing a camera followed by a memory card.
Frequent Sub Structure –
Substructure refers to different structural forms,
such as graphs, trees, or lattices, which may be
combined with itemsets or subsequences.
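A minimal sketch of finding frequent item pairs by simple counting, in plain Python; the transactions and the support threshold of 2 are hypothetical.

from itertools import combinations
from collections import Counter

transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "eggs"},
]

pair_counts = Counter()
for items in transactions:
    for pair in combinations(sorted(items), 2):
        pair_counts[pair] += 1

min_support = 2
frequent_pairs = {pair: c for pair, c in pair_counts.items() if c >= min_support}
print(frequent_pairs)  # ('bread', 'milk') occurs 3 times, ('bread', 'butter') twice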
Data Mining Functionalities

Classification.
Association Analysis.
Cluster Analysis.
Data Characterization.
Data Discrimination.
Prediction.
Outlier Analysis.
Evolution Analysis.
Data discrimination

It compares the data between two classes.
It maps the target class with a predefined group or
class.
It compares and contrasts the characteristics of
the class with the predefined class using a set of
rules called discriminant rules.
The methods used in data discrimination are
similar to those used in data characterisation.
Classification
It uses data models to predict the trends in data.
For example, the spending chart that our internet
banking or mobile application shows based on our
spending patterns.
This is sometimes used to define our risk of getting
a new loan.
It uses methods like IF-THEN rules, decision trees,
mathematical formulae, or neural networks to predict
or analyse a model.
It uses training data to build a model against which
new instances are compared.
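A minimal sketch of the decision-tree approach, assuming scikit-learn is available; the spending features, labels, and values are hypothetical.

from sklearn.tree import DecisionTreeClassifier

# Training data: [monthly_spend, missed_payments] -> loan risk label.
X_train = [[200, 0], [150, 1], [900, 4], [700, 3], [300, 0], [850, 5]]
y_train = ["low", "low", "high", "high", "low", "high"]

model = DecisionTreeClassifier().fit(X_train, y_train)

# Classify a new instance against the learned model.
print(model.predict([[500, 2]]))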
Prediction
Prediction finds the missing numeric values in the data.
It uses regression analysis to find the unavailable data.
If the class label is missing, then the prediction is done
using classification.
Prediction is used more in business intelligence.
There are two ways one can predict data:
Predicting the unavailable or missing data using
prediction analysis
Predicting the class label using the previously built class
model.
It is a forecasting technique that allows us to find
values far into the future.
We need a large data set of past values to
predict future trends.
Association Analysis
It relates two or more attributes of the data.
It discovers the relationship between the data and
the rules that are binding them.
It finds its application widely in retail sales.
The suggestion that Amazon shows on the bottom,
“Customers who bought this also bought..” is a real-
time example of association analysis.
It associates attributes that are frequently
transacted together.
They find out what are called association rules and
are widely used in market basket analysis.
Example for Association Analysis
There are two measures used to associate the attributes.
One is confidence, which gives the probability of
both items being associated together.
The second is support, which tells how often the
association has occurred in the past.
For example, if mobile phones are bought
with headphones:
support is 2% and confidence is 40%.
This means that 2% of the time, customers
bought mobile phones together with headphones.
40% confidence is the probability of the same
association happening again.
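The 2% and 40% figures above come from the slide's own (unshown) data; a minimal sketch of how support and confidence are computed, with hypothetical transactions, is:

transactions = [
    {"mobile", "headphones"},
    {"mobile"},
    {"laptop"},
    {"mobile", "headphones", "case"},
    {"headphones"},
]

both   = sum(1 for t in transactions if {"mobile", "headphones"} <= t)
mobile = sum(1 for t in transactions if "mobile" in t)

support = both / len(transactions)  # how often both occur together
confidence = both / mobile          # probability of headphones given mobile

print(f"support = {support:.0%}, confidence = {confidence:.0%}")  # 40%, 67%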
Cluster Analysis
Unsupervised classification is called cluster analysis.
The data are grouped.
In cluster analysis, the class label is unknown.
Data are grouped based on clustering algorithms.
Objects that are similar are grouped under one
cluster.
There will be a large difference between one cluster
and another.
Grouping is done by maximizing the intraclass
similarity and minimizing the interclass similarity.
Clustering is applied in many fields like machine
learning, image processing, pattern recognition, and
bioinformatics.
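A minimal sketch of cluster analysis with k-means (one common clustering algorithm), assuming scikit-learn is available; the points are hypothetical.

from sklearn.cluster import KMeans

# Unlabeled 2-D points: no class label is known in advance.
points = [[1, 2], [1, 4], [2, 3], [8, 8], [9, 9], [8, 10]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

print(kmeans.labels_)           # cluster assignment for each point
print(kmeans.cluster_centers_)  # one centre per cluster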
Outlier Analysis
When data appears that cannot be grouped into any of the
classes, we use outlier analysis.
There will be occurrences of data that have attributes
different from any of the other classes or general models.
These outstanding data are called outliers.
They are usually considered noise or exceptions, and the
analysis of these outliers is called outlier mining.
These outliers may reveal valuable associations in many
applications, although they are usually discarded as noise.
They are also called exceptions or surprises, and
identifying them is significant.
The outliers are identified using statistical tests that
compute probabilities.
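A minimal sketch of a simple statistical outlier test (a z-score check), using only the Python standard library; the sample values are hypothetical.

import statistics

values = [10, 12, 11, 13, 12, 11, 95]  # 95 looks like an outlier

mean = statistics.mean(values)
stdev = statistics.stdev(values)

outliers = [v for v in values if abs(v - mean) / stdev > 2]  # > 2 standard deviations
print(outliers)  # [95]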
Evolution & Deviation Analysis
Deviants
Abnormalities
Discordant
Anomalies
Evolution & Deviation Analysis
With evolution analysis, we get time-related
clustering of data.
We can find trends and changes in behavior over a
period.
We can find features like time-series data,
periodicity, and similarity in trends with such distinct
analysis.
Data mining task primitive

Data Mining Task Primitives


We can specify a data mining task in the form of a data
mining query.
This query is input to the system.
The data mining query is defined in terms of data
mining task primitives.

Note: Using these primitives allows us to
communicate in an interactive manner with the data
mining system.
List of Data Mining Task Primitives:
1) Set of task relevant data to be mined
2) Kind of knowledge to be mined- Functions of
Data mining
3) Background knowledge to be used in discovery
process
4) Interestingness measures and thresholds for
pattern evaluation
5) Representation for visualizing the discovered
patterns
Continue...
1) Set of task relevant data to be mined
This is the portion of the database in which the user
is interested.
This portion includes the following:
Database Attributes
Data Warehouse dimensions of interest
Continue...

3) Background knowledge to be used in the discovery
process
It allows data to be mined at multiple levels of
abstraction.
For example, concept hierarchies are one kind of
background knowledge that allows data to be mined
at multiple levels of abstraction.
4) Interestingness measures and thresholds for
pattern evaluation
These are used to evaluate the patterns that are
discovered by the process of knowledge discovery.
There are different interestingness measures for
different kinds of knowledge.
Continue....
5) Representation for visualizing the discovered
patterns
This refers to the form in which discovered patterns
are to be displayed.
These representations may include the following:
Rules
Tables
Charts
Graphs
Decision Trees
Cubes
