Data Mining - Unit - 3
Data Processing
Data processing is the method of collecting raw
data and translating it into usable information. It is
usually performed in a step-by-step process.
The raw data is collected, filtered, sorted, processed,
analyzed, stored, and then presented in a readable
format.
Data processing is especially valuable in organizations.
By converting the data into readable formats like
graphs, charts, and documents, employees
throughout the organization can understand and use
the data.
The Data Processing Cycle
The data processing cycle consists of a series of
steps where raw data (input) is fed into a process
(CPU) to produce actionable insights (output).
Step 1: Collection
Step 2: Preparation
Step 3: Input
Step 4: Data Processing
Step 5: Output
Step 6: Storage
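The six steps above can be sketched as a chain of small Python functions. This is a minimal illustration, not a real pipeline: the helper names and the sample records are invented.

```python
def collect():
    # Step 1: gather raw data from a source (hard-coded here)
    return ["  42 ", "17", "bad", " 8"]

def prepare(raw):
    # Step 2: sort/filter out unusable records and strip whitespace
    return [r.strip() for r in raw if r.strip().isdigit()]

def process(records):
    # Steps 3-4: convert to machine-readable numbers, then compute
    numbers = [int(r) for r in records]
    return sum(numbers) / len(numbers)

def output_and_store(result, store):
    # Steps 5-6: present the result and keep it for the next cycle
    store.append(result)
    return f"average = {result}"

storage = []
report = output_and_store(process(prepare(collect())), storage)
```

The stored result in `storage` can feed directly into the next cycle, mirroring Step 6.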
Step 1: Collection
The collection of raw data is the first step of the
data processing cycle.
Raw data should be gathered from defined and
accurate sources.
Raw data should be valid and usable.
Raw data can include monetary figures, website
cookies, profit/loss statements of a company,
user behavior, etc.
Step 2: Preparation
It is the process of sorting and filtering the raw
data to remove unnecessary and inaccurate data.
Raw data is checked and transformed into a
suitable form for further analysis and processing.
It is done to ensure that only the highest-quality
data is fed into the processing unit.
Input and Data Processing
Step 3: Input
Raw data is converted into machine-readable form and
fed into the processing unit.
It is done through a keyboard, scanner or any other
input source.
Step 4: Data Processing
Raw data is subjected to various data processing
methods using machine learning and artificial
intelligence algorithms to generate a desirable output.
It may vary from process to process depending on the
source of data being processed (data lakes, online
databases, connected devices, etc.) and the intended
use of the output.
Output and Storage
Step 5: Output
The data is finally transmitted and displayed to the
user in a readable form like graphs, tables, vector
files, audio, video, documents, etc.
This output can be stored and further processed in
the next data processing cycle.
Step 6: Storage
Data and metadata are stored for further use.
This allows for quick access and retrieval of
information whenever needed, and the stored data
can be used directly as input in the next data
processing cycle.
Examples of Data Processing
Batch Processing:
Data is collected and processed in batches. Used
for large amounts of data.
Eg: payroll system
Real-time Processing:
Data is processed within seconds when the input
is given.
Used for small amounts of data.
Eg: withdrawing money from ATM
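The batch idea can be sketched in Python: records are accumulated and then processed together in a single run, rather than one at a time. The payroll records and rates here are invented for illustration.

```python
# Hypothetical payroll batch: each record is a worked-hours entry
payroll_batch = [
    {"name": "Asha", "hours": 40, "rate": 20.0},
    {"name": "Ravi", "hours": 35, "rate": 22.0},
]

def run_batch(batch):
    # Process every record in the batch in one pass
    return {rec["name"]: rec["hours"] * rec["rate"] for rec in batch}

pay = run_batch(payroll_batch)
```

A real-time system would instead process each record the moment it arrives, as in the ATM example.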
Types of Data Processing
Online Processing:
Data is automatically fed into the CPU as soon as it
becomes available. Used for continuous processing
of data.
Eg: barcode scanning
Multiprocessing:
Data is broken down into frames and processed
using two or more CPUs within a single computer
system. Also known as parallel processing.
Eg: weather forecasting
Time-sharing:
Allocates computer resources and data in time slots
to several users simultaneously.
Data Preprocessing
It is the process of transforming raw data into an
understandable format.
It helps in checking the quality of the data before
applying machine learning or data mining
algorithms.
Need for Data Preprocessing
Smoothing:
Noise is removed from the data set.
Important features of the data set can be identified.
Trends that help in prediction can be spotted more easily.
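One common smoothing technique is a moving average, which replaces each value with the mean of a small window around it, damping noise. The window size and sample data below are arbitrary choices for illustration.

```python
def moving_average(values, window=3):
    # Replace each value with the mean of its surrounding window;
    # windows are truncated at the edges of the list
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        chunk = values[lo:hi]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

noisy = [10, 12, 95, 11, 13]  # 95 is a noise spike
smoothed = moving_average(noisy)
```

After smoothing, the spike at 95 is spread across its neighbours, making the underlying level of the series easier to see.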
Aggregation:
Data is stored and presented in the form of a summary.
Data collected from multiple sources is integrated
into a single data set for analysis.
The accuracy of the results depends on the quantity
and quality of the data.
When both the quality and quantity of the data are
good, the results are more relevant.
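Aggregation can be sketched in a few lines of Python: records from two sources are combined and summarised per group. The region names and amounts are invented.

```python
from collections import defaultdict

# Two hypothetical data sources reporting sales per region
source_a = [("North", 100), ("South", 50)]
source_b = [("North", 25), ("South", 75)]

# Integrate both sources and aggregate totals per region
totals = defaultdict(int)
for region, amount in source_a + source_b:
    totals[region] += amount

summary = dict(totals)
```

The summary presents the combined data in compact form, which is exactly what this step feeds into later analysis.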
Data Transformation
Discretization:
The continuous data is split into intervals.
It reduces the data size.
For example, rather than specifying the exact class
time, we can set an interval like (3 pm-5 pm, 6 pm-8 pm).
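The class-time example can be sketched in Python, mapping continuous 24-hour times onto the coarse intervals above. The boundary choices are assumptions for illustration.

```python
def to_interval(hour):
    # Map a continuous time (24-hour float) to a coarse interval
    if 15 <= hour < 17:
        return "3 pm-5 pm"
    if 18 <= hour < 20:
        return "6 pm-8 pm"
    return "other"

times = [15.5, 16.25, 18.75, 9.0]
intervals = [to_interval(t) for t in times]
```

Many distinct times collapse into a handful of interval labels, which is how discretization reduces the data size.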
Normalization:
It is the method of scaling the data so that it can
be represented in a smaller range, for example
from -1.0 to 1.0.
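A sketch of min-max normalization into the [-1.0, 1.0] range mentioned above:

```python
def normalize(values, new_min=-1.0, new_max=1.0):
    # Linearly rescale values from their original range
    # [min, max] into [new_min, new_max]
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    return [new_min + (v - old_min) * (new_max - new_min) / span
            for v in values]

scaled = normalize([10, 20, 30])  # -> [-1.0, 0.0, 1.0]
```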
Tuning in Data Warehouses
Optimization and tuning in data warehouses
are the processes of selecting adequate
optimization techniques in order to make queries
and updates run faster and to maintain their
performance by maximizing the use of data
warehouse system resources.
Data Preprocessing Examples
In this example, we have three variables: name,
age, and company. In the first table we can tell
that rows #2 and #3 have been assigned the
incorrect companies.
Name Age Company
Karen 57 CVS Health
Elon 49 Amazon
Jeff 57 Tesla
Tim 60 Apple
We can use data cleaning to simply remove these
rows, as we know the data was improperly entered
or is otherwise corrupted.
Name Age Company
Karen 57 CVS Health
Tim 60 Apple
Alternatively, we can perform data transformation, in
this case manually, in order to fix the problem:
Name Age Company
Karen Lynch 57 CVS Health
Elon Musk 49 Tesla
Jeff Bezos 57 Amazon
Tim Cook 60 Apple
Once the issue is fixed, we can perform data reduction, in this
case by sorting in descending order of age, to choose which
age range we want to focus on:
Name Age Company
Tim Cook 60 Apple
Karen Lynch 57 CVS Health
Jeff Bezos 57 Amazon
Elon Musk 49 Tesla
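The preprocessing steps on these records can be sketched in plain Python; the correction mapping simply mirrors the manual fix described above.

```python
rows = [
    {"name": "Karen Lynch", "age": 57, "company": "CVS Health"},
    {"name": "Elon Musk",   "age": 49, "company": "Amazon"},  # wrong
    {"name": "Jeff Bezos",  "age": 57, "company": "Tesla"},   # wrong
    {"name": "Tim Cook",    "age": 60, "company": "Apple"},
]

# Transformation: apply the manual corrections
corrections = {"Elon Musk": "Tesla", "Jeff Bezos": "Amazon"}
for row in rows:
    if row["name"] in corrections:
        row["company"] = corrections[row["name"]]

# Reduction: order by descending age to focus on an age range
by_age = sorted(rows, key=lambda r: r["age"], reverse=True)
```

Python's sort is stable, so the two 57-year-olds keep their original relative order, matching the table above.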
Inconsistent Data in Data Mining
Data inconsistency is a situation where there are
multiple tables within a database that deal with
the same data but may receive it from different
inputs.
Inconsistency is generally compounded by data
redundancy.
It refers to problems with the content of a
database.
Common Causes of Inconsistent Databases
• Operating system backups.
• Incorrect installation paths.
• Disabling of the logging/recovery system.
• Use of unsupported hardware configuration.
Example for Inconsistent Data
An organization is broken up into different
departments, each using their own tools and
systems, each following their own processes, and
each with their own interpretation of the data points
they are creating and using.
HOW TO MINIMISE DATA INCONSISTENCY
There are two approaches to tackle the problem of
data inconsistency across applications.
A central semantic store:
It involves logging and storing all the rules used by
the database integration process in a single
centralised repository.
When data sources are updated or new ones are
added, they do not fall outside the data
integration rules.
A master reference store:
It focuses on centralising the reference data; rules
are defined so that all secondary tables are synced
whenever a change is triggered in the main one.
Eg: student_id_no in the Student table (primary key);
the same student_id_no values and type are used in
the Marks table (foreign key).
This locks the reference data down into a single
central process to keep greater control over the most
important data points, even if it comes at the cost of
greater processing resources.
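A toy sketch of the syncing rule using the student_id_no example. The table layout (dicts standing in for database tables) is invented; a real system would do this inside the database with foreign-key constraints or triggers.

```python
# Master table keyed by student_id_no, plus one secondary table
students = {1001: {"name": "Asha"}}
marks = [{"student_id_no": 1001, "score": 88}]

def change_student_id(old_id, new_id):
    # Update the master (reference) table first
    students[new_id] = students.pop(old_id)
    # Then sync every secondary table that references the key,
    # so no table is left pointing at the old value
    for rec in marks:
        if rec["student_id_no"] == old_id:
            rec["student_id_no"] = new_id

change_student_id(1001, 2001)
```

Because the change and the sync happen in one central routine, the two tables can never disagree about the key.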
A Trick for Finding Inconsistent Data
1) To find and fix the inconsistencies, create a filter
on the column. The filter allows you to see all of the
unique values in the column, making it easier to
isolate the incorrect values.
2) Click the column's filter drop-down arrow.
3) A drop-down menu will appear, showing a list of all of the unique values in
the column. Deselect all of the correct values, leaving all of the incorrect
values selected. When you're done, click OK.
4) The spreadsheet will now be filtered to only show the incorrect values. In our
example, there were only a few errors, so we'll fix them manually by typing the
correct values for each one.
5) Click the column's filter drop-down arrow again
and make sure all of the values listed are correct.
When you're done, click Select All, then
click OK to show all of the rows.
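The same trick can be expressed in code: collect the column's unique values, flag those outside a known-correct set, and fix them. The valid set and typo mapping below are invented for illustration.

```python
# A column of values with inconsistent entries for the same city
column = ["New York", "new york", "Boston", "NY"]
valid = {"New York", "Boston"}

# Step analogous to the filter: find the unique values and
# isolate the ones that are not in the known-correct set
unique_values = set(column)
incorrect = unique_values - valid

# Step analogous to typing the corrections: map bad values to
# their correct form, leaving valid values untouched
fixes = {"new york": "New York", "NY": "New York"}
cleaned = [fixes.get(v, v) for v in column]
```

As in the spreadsheet, working from the set of unique values means each incorrect spelling only has to be identified once, no matter how many rows contain it.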
Tasks in data mining