
Unit 3

Data Mining: Data mining refers to extracting or mining knowledge from
large amounts of data (big data → useful data).
Put simply, it is the process of discovering knowledge from large amounts
of data.
The information or knowledge extracted through mining can be used for:
1. market analysis,
2. fraud detection,
3. production control,
4. science exploration.
Data mining is also called knowledge discovery in databases (KDD).
The knowledge discovery process includes data cleaning, data integration,
data selection, data transformation, data mining, and knowledge
representation.
Aim of data mining: The primary aim of data mining is to discover hidden
patterns and relationships in the data that can be used to make informed
decisions or predictions.


Kinds of data to be mined:

1. Relational Database,
2. Spatial Database,
3. Flat Files,
4. Time Series Database,
5. Transactional Database,
6. WWW,
7. Data Warehouse,
8. Multidimensional Database
1. Flat File: These are data files in text or binary form with a structure
that can be easily extracted by data mining algorithms. The data stored
in a flat file has no relationships or paths between records. A common
example is a CSV file.
2. Relational Database: It is defined as a collection of data organized in
tables with rows and columns. The physical schema of a relational
database defines the structure of the tables, while the logical schema
defines the relationships between tables. It is queried with SQL and used
in systems such as Oracle Database.
3. Transaction Database: It is a database collection organized by
timestamps, dates, and transactions. It follows the ACID properties of a
DBMS. Each transaction has a unique transaction ID and related fields
such as a seller ID. Applications: banking, online purchases.
4. Data Warehouse: It is a repository of data integrated from multiple
sources, cleaned and organized so that it can be queried and analysed.
5. WWW: It is a collection of documents and resources such as audio,
video, and text.
Applications: online shopping, job search.
6. Time Series: It contains stock exchange data and user-logged
activities. It requires real-time analysis.
7. Spatial Database: Stores geographical information; the data is stored
in the form of coordinates, lines, and polygons.
8. Multimedia: It consists of audio, video, image, and text media. These
can be stored in an object-oriented database.


Data Mining Functionalities

Data mining functionalities are used to specify the kinds of patterns to be
found in data mining tasks.
In general, data mining tasks can be divided into two categories:
1. descriptive and
2. predictive.
1. Descriptive mining. It involves analysing the data in a database to
understand its general properties and characteristics. It aims to provide a
comprehensive overview of the data without making any predictions about
future outcomes.
2. Predictive mining. This mining focuses on using the current data to make
predictions about future outcomes.
Mining functionalities
1. Class description.
• It is used to associate data with a class or concept.
• Characterization. It is one of the methods used in class description.
This method helps to connect data with a certain set of customers by
summarizing their general features.
• Discrimination. It is used to compare the characteristics of two
different classes of customers.
• Example. One of the best examples is the release of the same model
of mobile phone in different variants. This helps companies
satisfy the needs of different customer segments.
2. Frequent patterns. These are patterns that occur frequently in data.
There are many kinds of frequent patterns, including frequent itemsets,
frequent subsequences, and frequent substructures.
3. Classification. It is one of the most important data mining
functionalities; it builds models to predict the trend in the available data.
This method uses if-then rules, decision trees, mathematical
formulae, and neural networks to build the model.
4. Prediction. Finding missing data in a database is very important for the
accuracy of the analysis.
It is one of the most popular data mining functionalities: determining a
missing or unknown element in a data set. Linear regression models
based on previous data are used to make numeric predictions, which
help businesses forecast the outcome of a given event, positive or
negative.
There are two types of predictions:


a. Numeric prediction: predicts a missing or unknown numeric element in
a data set.
b. Class prediction: predicts the class label using a previously built
class model.
5. Cluster analysis. This data mining functionality is similar to
classification, but here the class label is unknown. In cluster analysis,
similar data objects are grouped into a cluster, and there are large
differences between one cluster and another. It is applied in fields like
machine learning, image processing, and pattern recognition.
6. Outlier analysis. It is used to handle data that does not fall under any
class. Data that has no similarity with the attributes of other classes is
called an outlier. Such occurrences are considered to be noise or
exceptions.
7. Association analysis. It is also called market basket analysis. Association
analysis helps to find relationships between elements that frequently occur
together. It is widely used in retail sales. It is a way of discovering the
relationship between various items.
For example, suppose we want to know which items are frequently
purchased together, say in the rule buys(X, computer) ⇒
buys(X, software), where X is a variable representing a customer. A
confidence of 50% means that if a customer buys a computer, there
is a 50% chance that they will buy software as well. A support of 1%
means that 1% of all transactions under analysis show that
computer and software are purchased together. (A worked sketch
follows this list.)
8. Correlation analysis. It is a mathematical technique that can show
whether and how strongly a pair of attributes are related to each other.
For example, taller people tend to have greater weight, so height and
weight are positively correlated. (See the second sketch below.)
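The support and confidence figures in the association example can be
computed directly from a transaction list. Below is a minimal sketch in
Python, assuming a small hypothetical set of transactions; the item names
and counts are illustrative, not from the source.

```python
# Sketch: computing support and confidence for the rule
# buys(X, computer) => buys(X, software), on toy transactions.

transactions = [
    {"computer", "software"},
    {"computer"},
    {"printer", "paper"},
    {"computer", "software", "printer"},
    {"paper"},
]

n = len(transactions)
both = sum(1 for t in transactions if {"computer", "software"} <= t)
computer = sum(1 for t in transactions if "computer" in t)

support = both / n            # fraction of all transactions containing both items
confidence = both / computer  # of those buying a computer, fraction also buying software

print(f"support = {support:.0%}, confidence = {confidence:.0%}")
# support = 40%, confidence = 67%
```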
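For correlation analysis, the strength of the height-weight relationship can
be measured with the Pearson coefficient. A minimal sketch with made-up
numbers; a real analysis would use a statistics library:

```python
import math

# Sketch: Pearson correlation between two attributes (hypothetical data).
height = [150, 160, 165, 170, 180]   # cm
weight = [50, 56, 63, 66, 74]        # kg

n = len(height)
mean_h = sum(height) / n
mean_w = sum(weight) / n

cov = sum((h - mean_h) * (w - mean_w) for h, w in zip(height, weight))
std_h = math.sqrt(sum((h - mean_h) ** 2 for h in height))
std_w = math.sqrt(sum((w - mean_w) ** 2 for w in weight))

r = cov / (std_h * std_w)      # +1: strong positive, 0: none, -1: strong negative
print(f"Pearson r = {r:.3f}")  # close to +1 here: taller people weigh more
```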


Technologies used in data mining:

a) Statistics
b) Machine learning
c) Data warehouse and database
d) Visualization
e) Information retrieval
f) Pattern recognition
g) Algorithms
1. Statistics. It is used for the collection, analysis, explanation, and
presentation of data. Data mining has an inherent connection with
statistics. A statistical model is used for data classes and data modeling;
it describes the behavior of the objects in a class and their probabilities.
Advantages.
It can be used to model noise and missing data values.
It is used for pattern mining.
Disadvantages.
When it is used on large data sets, it increases complexity and cost.
2. Machine Learning. It describes how a computer can learn based on data.
It is a fast-growing discipline which researches how computers
automatically learn from the given input data and make intelligent
decisions.
It has three types: supervised, unsupervised, and semi-supervised.
1) Supervised. It is a synonym of classification. It uses class labels
to predict information.
2) Unsupervised. It is a synonym of clustering. It does not use class
labels to predict information; instead, it discovers new classes within
the data.
3) Semi-supervised. It is a class of machine learning techniques
which uses both labeled and unlabeled data.
3. Database Systems and Data Warehouses. Database systems contribute
query languages, query processing, optimization, and data models. Data
warehousing combines data from multiple sources and various time
frames. It provides OLAP facilities in multi-dimensional databases to
promote multi-dimensional data mining, and it maintains both recent and
historical data.
4. Information retrieval. This technique searches for information in
documents, which may be text or multimedia, or may reside on the web. It
has two features:
1. the searched data is unstructured; 2. the queries are formed by
keywords and don't have complex structures.

5. Pattern Recognition. Patterns are everywhere in the physical world.

A pattern can either be seen physically or be observed
mathematically by applying algorithms; examples include the colours of
clothes, speech patterns, etc. Pattern recognition is the process of
recognizing patterns by using a machine learning algorithm.
6. Algorithms. An algorithm is a set of heuristics and calculations that
creates a model from data.
7. Visualization. It is the process of converting data and information into a
graphical form.
Applications of Data Mining:
1. Healthcare,
2. Market Basket Analysis,
3. Education,
4. Financial Banking,
5. Fraud Detection,
6. Customer Segmentation,
7. Risk Management and Lie Detection,
8. Manufacturing.
1. Healthcare: Data mining helps in improving the quality of the healthcare
system. It is used in healthcare to analyse patient data, identify
risk factors, and develop personalised treatment plans.
2. Market Basket Analysis: It involves the analysis of customer purchase
data to identify patterns and trends in customer behaviour.
3. Education: Data mining is used in education to analyse student
performance data and identify trends and patterns in student behaviour.
4. Finance and Banking: Finance and banking involve the management of
money and investments. Data mining is important in finance and banking
because it can help banks to identify fraud, recognise behaviour patterns,
and analyse customer behaviour.
5. Fraud Detection: Fraud detection involves identifying fraudulent
behaviour in various industries like banking and e-commerce. Data mining
provides meaningful patterns and turns data into information, which helps
in fraud detection.
6. Customer Segmentation: Data mining identifies the common
characteristics of customers who buy the same products from the
company.
7. Lie Detection: Law enforcement may use data mining techniques to bring
out the truth from criminals.


Major issues in data mining.

There are many issues in data mining. Some of these are
1. mining methodology and user interaction issues,
2. performance issues,
3. diverse data type issues.
1. Mining methodology and user interaction issues.
a. Mining different kinds of knowledge in databases: different users
need different kinds of knowledge, so it becomes difficult to cover a
large range of knowledge discovery tasks.
b. Interactive mining of knowledge at multiple levels of abstraction:
the data mining process should be interactive because it is difficult to
know in advance what can be discovered within a database.
c. Incorporation of background knowledge: background knowledge
is used to guide the discovery process.
d. Data mining query languages: the data mining query
language should be well matched with the query language of the data
warehouse.
e. Handling noisy data: data cleaning methods are required to
handle noisy data. If data cleaning methods are missing, the
accuracy of the discovered patterns will be poor.
2. Performance issues.
a. Efficiency and scalability: to effectively extract information from
the huge amount of data in databases, data mining algorithms must be
efficient and scalable.
b. Parallel and distributed mining algorithms: the huge size of many
databases, the wide distribution of data, and the complexity of data
mining methods motivate parallel and distributed data mining
algorithms. Such algorithms divide the data into partitions which are
processed in parallel, and the results are then merged.
3. Diverse data types.
a. Handling relational and complex data types:
many kinds of data are stored in databases, such as multimedia,
text, documents, etc. It is not possible for one system to mine all these
kinds of data.
b. Mining information from heterogeneous databases and global
information systems:
the data is available at different data sources on a LAN or WAN. These
data sources may be structured, semi-structured, or unstructured;
therefore, mining knowledge from them adds challenges to data mining.

Data Pre-processing
Real-world datasets are raw, incomplete, inconsistent, and often unusable.
Data pre-processing is the process of converting such raw data into a
format that is understandable and usable.
Data pre-processing is a technique used to improve the quality of data
before applying mining, so that data mining will give high-quality mining
results.

There are four stages in the pre-processing of data:
data cleaning → data integration → data transformation → data reduction.


Need for pre-processing (why pre-process?)
A. It improves accuracy and reliability.
B. It makes data consistent.
C. It increases the readability of data for algorithms.
Data pre-processing is needed to check the data quality.
The quality can be checked by the following:
A. Accuracy: to check whether the data entered is correct or not.
B. Completeness: to check whether the data is available and recorded.
C. Consistency: to check whether the same data kept in different places
matches or not.
D. Timeliness: the data should be updated correctly.
E. Believability: the data should be trustworthy.
F. Interpretability: the understandability of the data.


Data: a collection of data objects and their attributes; it is how the data
objects and their attributes are stored.
Datasets are made of data objects.
A data object represents an entity; it is also called a sample, example,
instance, tuple, or object.
Attribute: a property or characteristic of an object, e.g., the eye colour
of a person, temperature, etc.
Types of Attributes
1) Qualitative (Nominal, Ordinal, Binary)
2) Quantitative (Numeric: Discrete, Continuous)
Qualitative
1) Nominal Attribute: It provides enough information to differentiate
one object from another. The values of nominal attributes are names of
things or some kind of symbols; they represent categories or states and
do not follow any order. Examples: hair colour (black, white, brown);
gender (male, female).

2) Ordinal Attribute: This attribute contains values that have a meaningful
sequence, ranking, or order between them.
E.g., T-shirt size (XL, L, M, S); grade (O, A+, A, B, C).

3) Binary Attribute: It has two categories/values, 0 and 1; it is also called
a Boolean attribute,
where 0 means the absence of a feature and
1 means the presence of a feature.
It has two sub-categories:
a. Symmetric: both values are equally important, e.g., gender (M or F).
b. Asymmetric: the two values are not equally important, e.g., a
disease test (0, 1).

Quantitative attributes:
(1) Numeric Attribute: it is quantitative because the quantity can be
measured, and it can have integer or real values.
It has two types:
1) Interval-scaled:
• It is measured on a scale of equal-sized units.
• It can have positive, zero, or negative values.


• It allows comparison, such as temperature in °C or °F.

2) Ratio-scaled attribute:
• It is a measurable quantity.
• It has a true zero point.
• The values are ordered, and mean, median, and mode can be
calculated for them.
• E.g., age, length, and weight.
2. Discrete and Continuous attributes
Discrete: it has a finite (or countably infinite) set of possible values,
which may or may not be represented as integers. E.g., hair colour
(black, white).
Continuous: it has real-number (floating-point) values. E.g., years of
experience, salary, etc.
(A sketch of how these attribute types are represented follows.)
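The attribute types above differ in which operations are meaningful on
them. A minimal sketch, using a hypothetical record, of how each type is
typically represented before mining:

```python
# Sketch: representing attribute types (hypothetical record).
record = {"hair_colour": "black", "tshirt_size": "L", "has_disease": 0, "age": 34}

# Nominal: unordered labels; only equality tests are meaningful.
print(record["hair_colour"] == "brown")                      # False

# Ordinal: map to ranks so ordering comparisons become meaningful.
size_rank = {"S": 0, "M": 1, "L": 2, "XL": 3}
print(size_rank[record["tshirt_size"]] > size_rank["M"])     # True: L > M

# Binary (asymmetric): 1 = feature present, 0 = absent.
print("positive" if record["has_disease"] else "negative")   # negative

# Numeric (ratio-scaled): arithmetic such as means is meaningful.
print(record["age"] / 2)                                     # 17.0
```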

Similarity and Dissimilarity

In clustering techniques, similarity and dissimilarity are important
measurements.
Similarity: the similarity between two objects is a numerical measure of
the degree to which the two objects are alike.
# Similarity is higher for pairs of objects that look alike.
# Similarity is usually non-negative and often between 0 (not at all alike)
and 1 (completely alike).
Dissimilarity: the dissimilarity between two objects is a numerical
measure of the degree to which the two objects are different.
# The term distance is used as a synonym for dissimilarity.
# Proximity refers to either similarity or dissimilarity.

Similarity:
• numerical measure of how alike two objects are;
• the value is higher when the objects are alike;
• the range is often [0, 1].
Dissimilarity:
• numerical measure of how different two objects are;
• the value is lower when the objects are alike;
• the minimum is often 0;
• the upper limit varies (it may be ∞).
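A minimal sketch of these two measures, assuming numeric objects,
Euclidean distance as the dissimilarity, and one common way of converting
a distance into a [0, 1] similarity:

```python
import math

def dissimilarity(x, y):
    """Euclidean distance: 0 when identical, grows without bound."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def similarity(x, y):
    """Map distance into (0, 1]: 1 means identical, tends to 0 as objects differ."""
    return 1.0 / (1.0 + dissimilarity(x, y))

p, q = (1.0, 2.0), (4.0, 6.0)
print(dissimilarity(p, q))   # 5.0
print(similarity(p, q))      # ~0.167
print(similarity(p, p))      # 1.0 (complete similarity)
```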


Data Visualization
Data visualization is the representation of information and data in a
graphical or pictorial format.
Visualizations of data can be charts, graphs, and maps.
# Data visualization tools provide an accessible way to see and understand
trends, outliers, and patterns in data.
# Data visualization tools and technologies are essential for analysing
large amounts of data (information) and making decisions.

Types of data visualization

Chart: a graphical representation of information.

Line graph: it shows the data as a series of points connected by straight
line segments. It is used to show trends, developments, or changes through
time.
Bar chart: it shows colourful rectangular bars whose lengths are
proportional to the values represented.
Scatter plot: a very basic and useful graphical form; it helps to find the
relationship between two variables.
Pie chart: a circular statistical graph in which a single total is divided
into several categories (slices).
Bubble chart: a variant of the scatter plot where the size and colour of the
bubbles, which represent the data points, provide extra information.
Heat map: it uses colours to denote values; great for seeing trends in
huge datasets.
Table: an alternative to charts for presenting exact numerical data.
(A small sketch of two of these chart types follows.)
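A minimal sketch of two of the chart types above (line graph and bar chart)
using matplotlib; the library choice and the monthly figures are
assumptions for illustration, not from the source:

```python
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 150, 140, 180]           # hypothetical values

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

# Line graph: points joined by segments, good for trends over time.
ax1.plot(months, sales, marker="o")
ax1.set_title("Line graph: trend over time")

# Bar chart: bar length proportional to the value represented.
ax2.bar(months, sales)
ax2.set_title("Bar chart: value comparison")

plt.tight_layout()
plt.show()
```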

Why use it (advantages)

(1) Discover trends in data:
the most important thing data visualization does is discover trends in
data. It is much easier to observe data trends when the data is in visual
form than when it is in a table.
(2) Perspective on data:
visualization provides a perspective on data by showing its meaning. It
tells how a particular data point stands against the overall data picture.
(3) Put data in the correct context:
it is easy to understand the context of data with the help of data
visualization.


(4) Save time: it is faster to extract information from data by using data
visualization.
(5) It is used for competitive analysis.
(6) It helps to find relationships and patterns quickly.

Data Cleaning
In the real world, databases are raw, incomplete, inconsistent, and
unusable, so data cleaning cleans the data by filling in missing values,
smoothing noisy data, identifying or removing outliers, and removing
inconsistencies in the data.
OR:
Data cleaning is the process of fixing or removing incorrect, incomplete,
corrupted, or duplicate data from a database.

(A) Handling of missing data:

This situation arises when some data is missing in the data set. It can be
handled in the following ways:
(i) Ignore the tuple: it is suitable when the data set is large.
(ii) Fill in the missing value: by filling in the attribute mean or median,
or by filling in the most probable value. (A sketch follows.)
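A minimal sketch of option (ii), filling missing values with the attribute
mean; None marks the missing entries and the numbers are hypothetical:

```python
from statistics import mean, median

values = [12, None, 15, 20, None, 18]          # None = missing

known = [v for v in values if v is not None]
fill_mean, fill_median = mean(known), median(known)   # 16.25 and 16.5

filled = [v if v is not None else fill_mean for v in values]
print(filled)   # [12, 16.25, 15, 20, 16.25, 18]
```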

(B) Noisy data: it means errors in data that occur during data collection
or data entry.
It is inconsistent data.
It can be handled in the following ways:

1. Binning: first the data is sorted, then the sorted data is distributed
into bins. There are three methods to smooth data in bins (see the sketch
after this list):
(i) smoothing by bin means,
(ii) smoothing by bin medians,
(iii) smoothing by bin boundaries.

2. Regression: data values are smoothed by fitting them to a regression
function, which gives a numerical prediction of the data; it can be linear
or multiple regression.

3. Clustering: similar data items are grouped together in a cluster, and
dissimilar items fall outside the clusters (as outliers).
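A minimal sketch of binning, assuming equal-depth bins of size three and
hypothetical values; each bin is smoothed by its mean and by its
boundaries:

```python
# Sketch: smoothing sorted data in equal-depth bins (size 3).
data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bins = [data[i:i + 3] for i in range(0, len(data), 3)]

for b in bins:
    by_mean = [round(sum(b) / len(b), 1)] * len(b)
    # Boundary smoothing: replace each value by the nearer bin edge.
    lo, hi = b[0], b[-1]
    by_boundary = [lo if v - lo <= hi - v else hi for v in b]
    print(b, "-> mean:", by_mean, "boundary:", by_boundary)

# [4, 8, 15]   -> mean: [9.0, 9.0, 9.0]    boundary: [4, 4, 15]
# [21, 21, 24] -> mean: [22.0, 22.0, 22.0] boundary: [21, 21, 24]
# [25, 28, 34] -> mean: [29.0, 29.0, 29.0] boundary: [25, 25, 34]
```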


Data Integration
It is the process of combining data from multiple sources into a single
dataset.
OR:
It is the process of merging data from different sources, i.e., databases,
data cubes, and flat files, to avoid inconsistencies and redundancies, so
that the speed and accuracy of data mining improve.
It has two approaches:
(1) Tight coupling:
data is combined together into one physical location.
(2) Loose coupling:
in this, the data remains only in the actual source databases.

# In this method, users are provided with an interface to input their
queries. The interface then transforms each query into a form that the
source databases can understand and sends the query to the source
databases to obtain the results.
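A minimal sketch of the tight-coupling idea: records from two hypothetical
sources are merged into one dataset keyed by a shared customer ID, keeping
the first copy of each redundant field:

```python
# Sketch: tight coupling - combine two sources into one physical dataset.
sales_db = {101: {"name": "Asha", "city": "Pune"},
            102: {"name": "Ravi", "city": "Delhi"}}
crm_file = {101: {"segment": "retail"},
            103: {"name": "Meera", "segment": "corporate"}}

integrated = {}
for source in (sales_db, crm_file):
    for cust_id, fields in source.items():
        record = integrated.setdefault(cust_id, {})
        for key, value in fields.items():
            record.setdefault(key, value)  # keep first value; drop redundant copies

for cust_id, record in sorted(integrated.items()):
    print(cust_id, record)
# 101 {'name': 'Asha', 'city': 'Pune', 'segment': 'retail'}
# 102 {'name': 'Ravi', 'city': 'Delhi'}
# 103 {'name': 'Meera', 'segment': 'corporate'}
```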

Issues in data integration:

(1) Entity identification problem.
(2) Tuple duplication.
(3) Redundancy and inconsistency.
(4) Data value conflict detection and resolution.
(5) Reduced data quality.


Data Reduction
Data reduction is a process applied to obtain a reduced representation of
a dataset that is smaller in volume yet maintains the integrity of the
original data.
# In this, the volume of data is reduced to make analysis easier.

Methods of data reduction:

(1) Dimensionality reduction: it is the process of reducing the number of
variables in the data set, because a large number of variables leads to
poor performance.
(2) Data cube aggregation: in this, data is combined (aggregated) to
construct a data cube; redundant and noisy data are removed.
(3) Attribute subset selection: in this, only highly relevant attributes are
used and the others are discarded.
(4) Numerosity reduction: in this, we replace the original data volume by
an alternative, smaller form of data representation. (A sketch of one such
technique follows.)
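A minimal sketch of one simple numerosity-reduction technique, random
sampling without replacement; the record count and sample size are
illustrative:

```python
import random

# Sketch: numerosity reduction by simple random sampling without replacement.
original = list(range(1, 10_001))        # 10,000 hypothetical records

random.seed(42)                          # reproducible for the example
sample = random.sample(original, k=500)  # keep a 5% representative subset

print(len(original), "->", len(sample))  # 10000 -> 500
```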

Data Transformation
It is a method used to transform the data (e.g., into a smaller range) so
that the mining process can be more efficient and easy.
Methods:
(1) Smoothing: it is used to remove noise from the data, e.g., by
clustering or binning.
(2) Attribute selection/construction: in this, we create new attributes by
using the older attributes.

(3) Aggregation: in this, summary or aggregation operations are applied to
the data.

(4) Normalization: it is done in order to scale the data values into a
specific range (e.g., -1.0 to 1.0, or 0 to 1).

(5) Hierarchy generation: attributes are converted from a low level to a
high level, e.g., city → country.

(6) Data discretization: in this, raw values of numeric attributes are
replaced by interval labels or conceptual labels,
e.g., age (raw) → intervals (0-10, 18-30).
(A sketch of normalization and discretization follows this list.)
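A minimal sketch of method (4), min-max normalization into [0, 1], and
method (6), replacing raw ages with interval labels; the salary values and
interval edges are hypothetical:

```python
# Sketch: min-max normalization to [0, 1] (hypothetical salaries).
salaries = [30_000, 45_000, 60_000, 90_000]
lo, hi = min(salaries), max(salaries)
normalized = [(s - lo) / (hi - lo) for s in salaries]
print(normalized)    # [0.0, 0.25, 0.5, 1.0]

# Sketch: discretization of raw ages into interval labels.
def age_interval(age):
    if age <= 10:
        return "0-10"
    if age <= 30:
        return "11-30"
    return "31+"

print([age_interval(a) for a in [5, 22, 40]])   # ['0-10', '11-30', '31+']
```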


Data Discretization
Data discretization converts a large number of data values into a smaller
number, so that data evaluation and data management become very easy.
Discretization is the process of putting values into buckets so that there
is a limited number of possible states. The buckets themselves are treated
as ordered, discrete values.
There are different methods used for performing data discretization:
1. Supervised discretization: if the data is discretized using class
information, then it is referred to as supervised discretization.
2. Unsupervised discretization: if the data values are reduced by
substituting them with a limited number of interval labels, without using
class information, then it is referred to as unsupervised discretization.
3. Top-down discretization: if the process starts by first finding one or a
few points to split the entire attribute range and then repeats this
recursively on the resulting intervals, then it is called top-down
discretization, or splitting.
4. Bottom-up discretization: if the process starts by considering all of
the continuous values as potential split points and removes some by
merging neighbouring values to form intervals, then it is called bottom-up
discretization, or merging.
Techniques of data discretization:
1. Histogram analysis,
2. Binning,
3. Correlation analysis,
4. Clustering analysis,
5. Decision tree analysis,
6. Equal-width partitioning,
7. Equal-depth partitioning, and
8. Entropy-based discretization.
(A sketch of equal-width and equal-depth partitioning follows.)
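A minimal sketch contrasting equal-width partitioning (intervals of equal
size) with equal-depth partitioning (buckets holding roughly equal numbers
of values); the data is hypothetical:

```python
data = sorted([5, 7, 8, 12, 15, 18, 24, 30, 45])
k = 3  # number of buckets

# Equal-width: split the value range into k intervals of equal size.
width = (data[-1] - data[0]) / k
equal_width = [[v for v in data
                if data[0] + i * width <= v < data[0] + (i + 1) * width]
               for i in range(k)]
equal_width[-1].append(data[-1])     # include the maximum in the last bucket
print("equal-width:", equal_width)   # [[5, 7, 8, 12, 15, 18], [24, 30], [45]]

# Equal-depth: each bucket holds (roughly) the same number of values.
depth = len(data) // k
equal_depth = [data[i:i + depth] for i in range(0, len(data), depth)]
print("equal-depth:", equal_depth)   # [[5, 7, 8], [12, 15, 18], [24, 30, 45]]
```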
