Data Mining
Unit 3
Data Pre-processing
Real-world datasets are raw, incomplete, inconsistent, and often unusable as-is. Data pre-processing is the process of converting such raw data into a format that is understandable and usable.
Data pre-processing is a technique used to improve the quality of data before mining is applied, so that data mining gives high-quality results.
Its four main tasks are data cleaning, data integration, data transformation, and data reduction.
Data: a collection of data objects and their attributes, i.e. how data objects and their attributes are stored.
A dataset is made up of data objects.
A data object represents an entity; it is also called a sample, example, instance, tuple, or object.
An attribute is a property or characteristic of an object, for example the eye colour of a person, or temperature.
Types of attributes
1) Qualitative (nominal, ordinal, binary)
2) Quantitative (numeric: discrete, continuous)
Qualitative
1) Nominal attribute: it provides enough information to differentiate one object from another. The values of a nominal attribute are names of things or some kind of symbols. It is a category or state and does not follow any order, for example hair colour (black, white, brown) or gender (male, female).
Quantitative attributes:
(1) Numeric attribute: it is quantitative because the quantity can be measured, and it can take integer or real values.
It has two types.
1) Interval-scaled:
• It can be measured on a scale of equal-sized units.
• It can have positive, zero, or negative values.
Similarity
• A numerical measure of how alike two objects are.
• The value is higher when the objects are more alike.
• The range is often [0, 1].
Dissimilarity
• A numerical measure of how different two objects are.
• The value is lower when the objects are more alike.
• The minimum is 0.
• The upper limit varies.
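The two measures above can be sketched in a few lines of Python. This is a minimal illustration, not the only definition: Euclidean distance is used here as the dissimilarity, and the function names and the 1/(1+d) transform into [0, 1] are choices of this sketch, not part of the notes.

```python
import math

def euclidean_dissimilarity(a, b):
    # Dissimilarity: 0 when the objects are identical,
    # grows without a fixed upper limit as they differ.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def similarity(a, b):
    # One common transform mapping dissimilarity into [0, 1]:
    # 1 for identical objects, approaching 0 as they grow apart.
    return 1.0 / (1.0 + euclidean_dissimilarity(a, b))

p, q = (0.0, 0.0), (3.0, 4.0)
print(euclidean_dissimilarity(p, p))   # identical objects -> 0.0
print(euclidean_dissimilarity(p, q))   # -> 5.0
print(similarity(p, p))                # identical objects -> 1.0
```

Note how the minimum dissimilarity is 0 and the similarity of identical objects is 1, matching the properties listed above.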
Data visualization
Data visualization is the representation of information and data in a graphical or pictorial format, such as charts, graphs, and maps.
# Data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data.
# Data visualization tools and technologies are essential for analysing large amounts of data (information) and making decisions.
Line graph: it shows the data as a series of points connected by straight line segments. It is used to show trends, development, or changes through time.
Bar chart: it shows rectangular bars whose lengths are proportional to the values they represent.
Scatter plot: a very basic and useful graphical form. It helps to find the relationship between two variables.
Pie chart: a circular statistical graph in which a single quantity is divided among several categories.
Bubble chart: a variant of the scatter plot where the size and colour of the bubbles, which represent the data points, provide extra information.
Heat map: it uses colours to denote values; great for seeing trends in a huge dataset.
Tree chart (treemap): an alternative to a table for presenting numerical data.
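The defining property of a bar chart, that bar length is proportional to the value represented, can be demonstrated without any plotting library. The sketch below renders a text-only bar chart; the function name and the quarterly sales figures are made up for illustration.

```python
def text_bar_chart(data, width=40):
    # Each bar's length in characters is proportional to its value,
    # scaled so the largest value fills the full width.
    peak = max(data.values())
    lines = []
    for label, value in data.items():
        bar = "#" * round(width * value / peak)
        lines.append(f"{label:>4} | {bar} {value}")
    return "\n".join(lines)

sales = {"Q1": 120, "Q2": 180, "Q3": 90, "Q4": 240}   # hypothetical data
print(text_bar_chart(sales))
```

In practice a plotting library (e.g. matplotlib) would be used, but the proportionality principle is the same.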
(4) Saves time: it is faster to extract information from data by using data visualization.
(5) It is used for competitive analysis.
(6) It helps to find relationships and patterns quickly.
Data Cleaning
Real-world databases are raw, incomplete, inconsistent, and unusable, so data cleaning cleans the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and removing inconsistencies in the data.
OR
Data cleaning is the process of fixing or removing incorrect, incomplete, corrupted, or duplicate data from a database.
(B) Noisy data: errors in the data that occur during data collection or data entry. It is inconsistent data.
It can be handled in the following ways.
1. Binning: first the data is sorted, then the sorted data is stored in bins. There are three methods to handle data in bins:
(i) Smoothing by bin mean
(ii) Smoothing by bin median
(iii) Smoothing by bin boundary
3. Clustering: similar data items are grouped together in one place, and dissimilar items fall outside the clusters.
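The first of the binning methods above, smoothing by bin mean, can be sketched as follows. This is a minimal illustration assuming equal-frequency bins; the function name and the sample values are made up.

```python
def smooth_by_bin_means(values, bin_size):
    # Step 1: sort the data, as binning requires.
    ordered = sorted(values)
    smoothed = []
    # Step 2: partition into equal-frequency bins and replace each
    # value with its bin's mean (smoothing by bin mean).
    for i in range(0, len(ordered), bin_size):
        bucket = ordered[i:i + bin_size]
        mean = sum(bucket) / len(bucket)
        smoothed.extend([mean] * len(bucket))
    return smoothed

noisy = [4, 8, 15, 21, 21, 24, 25, 28, 34]   # hypothetical sorted data
print(smooth_by_bin_means(noisy, 3))
# -> [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```

Smoothing by bin median or bin boundary follows the same pattern, only the replacement value inside each bin changes.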
Data Integration
It is the process of combining data from multiple sources into a single dataset.
OR
It is the process of merging data from different sources, i.e. databases, data cubes, and flat files, to avoid inconsistencies and redundancies so that the speed and accuracy of data mining improve.
It has two approaches.
(1) Tight coupling:
Data is combined together into a single physical location.
(2) Loose coupling:
In this approach the data remains only in the actual source databases.
# In this method, users are provided with an interface to input their queries. The interface transforms each query into a form the source database can understand, then sends the query to the source database to obtain the results.
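The tight-coupling approach, copying records from every source into one physical dataset, can be sketched in a few lines. The function name, the record layout, and the customer/order data are assumptions of this sketch; real integration also handles schema matching and conflict resolution.

```python
def integrate(*sources):
    # Tight-coupling sketch: copy records from every source into one
    # combined dataset, keyed by "id". Records sharing an id are merged
    # so that redundant copies of the same entity are avoided.
    combined = {}
    for source in sources:
        for record in source:
            combined.setdefault(record["id"], {}).update(record)
    return list(combined.values())

# Two hypothetical sources: a customer database and an orders flat file.
customers_db = [{"id": 1, "name": "Asha"}, {"id": 2, "name": "Ravi"}]
orders_file = [{"id": 1, "total": 250}, {"id": 3, "total": 90}]
print(integrate(customers_db, orders_file))
```

Under loose coupling, by contrast, no such combined structure is built; queries are rewritten and forwarded to each source instead.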
Data Reduction
Data reduction is a process applied to obtain a reduced representation of a dataset that is smaller in volume yet maintains the integrity of the original data.
# The volume of the data is reduced to make analysis easier.
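One simple reduction strategy, random sampling, keeps only a fraction of the rows while remaining representative of the whole. This is just one of several reduction techniques (others include dimensionality reduction and aggregation); the function name, the fixed seed, and the example data are choices of this sketch.

```python
import random

def reduce_by_sampling(dataset, fraction, seed=0):
    # Keep a random fraction of the rows: the reduced set is smaller
    # in volume but drawn uniformly from the original data.
    rng = random.Random(seed)   # fixed seed for reproducibility
    k = max(1, round(len(dataset) * fraction))
    return rng.sample(dataset, k)

rows = list(range(1000))              # hypothetical dataset of 1000 rows
sample = reduce_by_sampling(rows, 0.05)
print(len(sample))                    # 50: 5% of the original volume
```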
Data Transformation
It is a method used to transform the data into suitable forms, for example scaling values into a small range, so that the mining process is more efficient and easy.
Methods:
(1) Smoothing: it is used to remove noise from the data, e.g. clustering, binning.
(2) Attribute construction (attribute selection): new attributes are created from the given, older attributes.
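A standard example of transforming data into a small range is min-max normalization, which rescales each value into [0, 1] (or any chosen range). The function name and the income figures below are made up for illustration.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    # Rescale each value linearly into [new_min, new_max] so that
    # attributes measured on very different scales become comparable.
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    return [new_min + (v - old_min) * (new_max - new_min) / span
            for v in values]

incomes = [12000, 35000, 60000, 98000]   # hypothetical attribute values
print(min_max_normalize(incomes))        # smallest -> 0.0, largest -> 1.0
```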
Data discretization
Data discretization converts a large number of data values into a smaller number, so that data evaluation and data management become much easier.
Discretization is the process of putting values into buckets so that there are a limited number of possible states. The buckets themselves are treated as ordered, discrete values.
There are several methods for performing data discretization.
1. Supervised discretization: if the data is discretized using class information, it is referred to as supervised discretization.
2. Unsupervised discretization: if data values are reduced by substituting them with a limited number of interval labels, without using class information, it is referred to as unsupervised discretization.
3. Top-down discretization (splitting): the process starts by finding one or a few points to split the entire attribute range, and then repeats this recursively on the resulting intervals.
4. Bottom-up discretization (merging): the process starts by considering all of the continuous values as potential split points, then removes some by merging neighbouring values to form intervals.
Techniques of data discretization:
1. Histogram analysis
2. Binning
3. Correlation analysis
4. Clustering analysis
5. Decision tree analysis
6. Equal-width partitioning
7. Equal-depth partitioning
8. Entropy-based discretization
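Two of the unsupervised techniques listed above, equal-width and equal-depth partitioning, can be sketched directly. The function names and the age data are made up; each function returns the bucket index (0 to k-1) assigned to every value.

```python
def equal_width_bins(values, k):
    # Equal-width partitioning: split the value range into k intervals
    # of the same width and label each value with its interval index.
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_depth_bins(values, k):
    # Equal-depth (equal-frequency) partitioning: sort the values and
    # give each of the k buckets roughly the same number of values.
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    size = len(values) / k
    for rank, i in enumerate(order):
        labels[i] = min(int(rank / size), k - 1)
    return labels

ages = [22, 25, 29, 34, 41, 58, 63, 70]   # hypothetical attribute
print(equal_width_bins(ages, 3))   # wide gaps skew the bucket sizes
print(equal_depth_bins(ages, 3))   # buckets stay roughly equal in count
```

Comparing the two outputs on the same data shows the trade-off: equal-width keeps interval sizes uniform, while equal-depth keeps bucket populations uniform.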