
Padmashri Dr. Vitthalrao Vikhe Patil Institute of Technology & Engineering (POLYTECHNIC), Loni 0030


DWM 22621

Unit 4: Introduction to Data Mining (18 Marks)


Course Outcome (CO): Use Data Mining tools for various applications.

*Data Mining:
Data mining means searching for knowledge (interesting patterns or useful information) in data.
Data mining refers to the extraction of useful information from large amounts of data.
The data sources can include databases, data warehouses, the Web, other information
repositories, or data that are streamed into the system dynamically.
Data mining is the process of discovering interesting patterns and knowledge from large
amounts of data.
Companies use data mining to learn customer preferences, set prices for their products and
services, and analyse markets.
Data mining is also known as Knowledge Discovery in Databases (KDD).

Steps in the process of KDD:

Fig: Steps in KDD Process

1. Data cleaning:
Data cleaning removes noise and inconsistent data.
2. Data integration:
Multiple data sources may be combined.
3. Data selection:
The data relevant to the analysis task are retrieved from the database.
4. Data transformation:
The data are transformed and consolidated into forms appropriate for mining by performing
summary or aggregation operations.

That is, data of varied types from different sources can be converted into a single standard
format.
5. Data mining:
Data mining is the process in which intelligent methods or algorithms are applied to data
to extract useful patterns.
6. Pattern evaluation:
This process identifies the truly interesting patterns representing actual knowledge based
on user requirements for analysis.
7. Knowledge presentation:
In this process, visualization and knowledge representation techniques are used to present
mined knowledge to users for analysis.
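The following minimal, runnable Python sketch shows how these steps fit together on a toy
list of sales records; the data and the frequency-count "mining" step are invented purely
for illustration.

from collections import Counter

# Toy records from a single source (steps 2 and 4, integration and
# transformation, are trivial here because there is only one source).
raw = [
    {"item": "milk", "price": 25},
    {"item": "milk", "price": None},   # incomplete record
    {"item": "bread", "price": 15},
    {"item": "milk", "price": 25},
]

cleaned = [r for r in raw if r["price"] is not None]     # 1. data cleaning
selected = [r["item"] for r in cleaned]                  # 3. data selection
patterns = Counter(selected)                             # 5. data mining
frequent = {k: v for k, v in patterns.items() if v > 1}  # 6. pattern evaluation
print(frequent)                                          # 7. presentation: {'milk': 2}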

*What kind of data can be mined?


1. Flat Files:
 Flat files are defined as data files in text form or binary form with a structure that
can be easily extracted by data mining algorithms.
 Data stored in flat files have no relationships among themselves; for example, if
a relational database is exported to flat files, the relations between the tables
are lost.
 Flat files are described by a data dictionary. Eg: CSV file.
 Application: Used in data warehousing to store data, in carrying data to and
from servers, etc.

2. Relational Databases:
 A Relational database is defined as the collection of data organized in tables with
rows and columns.
 Physical schema in Relational databases is a schema which defines the structure
of tables.
 Logical schema in Relational databases is a schema which defines the
relationship among tables.
 The standard query language for relational databases is SQL.
 Application: Data Mining, ROLAP model, etc.


3. Data Warehouse:
 A data warehouse is defined as a collection of data integrated from multiple
sources that supports querying and decision making.
 There are three types of data warehouse: Enterprise Data Warehouse, Data Mart
and Virtual Warehouse.
 Application: Business decision making, Data mining, etc.

4. Transactional Databases:
 Transactional databases are collections of data organized by time stamps, dates,
etc., where each record represents a transaction.
 This type of database can roll back or undo an operation when a transaction is
not completed or committed.
 It is a highly flexible system where users can modify information without
changing any sensitive information.
 Follows the ACID properties of DBMS.
 Application: Banking, Distributed systems, Object databases, etc.

5. Multimedia Databases:
 Multimedia databases consist of audio, video, image and text media.
 They can be stored on Object-Oriented Databases.
 They are used to store complex information in a pre-specified format.
 Application: Digital libraries, video-on demand, news-on demand, musical
database, etc.

6. Spatial Database:
 Spatial databases store geographical information.
 They store data in the form of coordinates, topology, lines, polygons, etc.
 Application: Maps, Global positioning, etc.

7. Time-series Databases:
 Time-series databases contain data such as stock exchange data and user-logged
activities.
 They often require real-time analysis.
 Application: eXtremeDB, Graphite, InfluxDB, etc.

8. WWW:
 The WWW (World Wide Web) is a collection of documents and resources such as
audio, video and text, identified by Uniform Resource Locators (URLs), linked by
HTML pages, and accessed via the Internet through web browsers.
 It is the most heterogeneous repository, as it collects data from multiple
sources.
 It is dynamic in nature, as the volume of data is continuously increasing and
changing.
 Application: Online shopping, Job search, Research, studying, etc.


*Major Issues in Data Mining:


Data mining systems face many challenges and issues, such as:
A. Mining methodology and user interaction issues
B. Performance issues
C. Issues relating to the diversity of database types

A. Mining methodology and user interaction issues:


i. Mining different kinds of knowledge in databases:
Different users want different kinds of knowledge, presented in different ways.
It is therefore difficult for a single system to cover the vast range of knowledge that can
meet every user's requirements.
ii. Incorporation of background knowledge:
Background knowledge is used to guide the discovery process and to express the discovered
patterns. Knowing the background of the data therefore makes the mining process easier.
iii. Query languages and ad hoc mining:
Relational query languages allow users to pose ad hoc queries for data retrieval.
Similarly, a data mining query language should be developed and integrated with the query
language of the data warehouse.
iv. Handling noisy or incomplete data:
In a large database, many of the attribute values may be incorrect.
This may be due to human error or instrument failure.

B. Performance issues:
i. Efficiency and scalability of data mining algorithms:
To effectively extract information from a huge amount of data in databases, data mining
algorithms must be efficient and scalable.
ii. Parallel, distributed, and incremental mining algorithms:
The huge size of databases, the wide distribution of data, and the complexity of some data
mining methods should be considered during the development of parallel and distributed
data mining algorithms.

C. Issues relating to the diversity of database types:


i. Handling of relational and complex types of data:
There are many kinds of data stored in databases and data warehouses.
It is not possible for one system to mine all these kinds of data. So, different data mining
systems should be constructed for different kinds of data.


ii. Mining information from heterogeneous databases and global information systems:
Since data is fetched from different data sources over Local Area Networks (LAN) and Wide
Area Networks (WAN), the discovery of knowledge from different sources of structured,
semi-structured, and unstructured data is a great challenge to data mining.

*Data Objects and Attribute Types:


Data Objects:
Data sets are made up of data objects.
A data object represents an entity.
Example: in a sales database, the objects may be customers, store items, and sales; in a
medical database, the objects may be patients.
Data objects are typically described by attributes.
If the data objects are stored in a database, they are data tuples. That is, the rows of a
database correspond to the data objects, and the columns correspond to the attributes.

Attribute:
An attribute is a data field that represents a characteristic or feature of a data object.
For a customer object, attributes can be customer ID, address, etc.
A set of attributes is used to describe an object.

Types of attributes:

1. Qualitative Attributes: Nominal, Binary, Ordinal
2. Quantitative Attributes: Numeric, Discrete, Continuous

1. Qualitative Attributes:
a. Nominal Attributes (N):
These attributes are related to names.
The values of a nominal attribute are names of things or some kind of symbols.


Values of nominal attributes represent some category or state; that is why nominal
attributes are also referred to as categorical attributes. There is no order (rank,
position) among the values of a nominal attribute.
Example:
Attribute Values
Colors Black, Red, Green
Categorical Data Lecturer, Professor

b. Binary Attributes (B):


Binary data has only 2 values/states.
Example: yes or no, affected or unaffected, true or false.
i. Symmetric: Both values are equally important (e.g., Gender).
ii. Asymmetric: Both values are not equally important (e.g., Result).

Attribute Values
Gender Male, Female
Result Pass, Fail

c.Ordinal Attributes (O):


The ordinal attribute contains values that have a meaningful sequence or ranking (order)
between them.

Attribute Values
Grade A, B, C, D, E
Income low, medium, high
Age Teenage, young, old

2. Quantitative Attributes:
a. Numeric:
A numeric attribute is quantitative because it is a measurable quantity, represented in
integer or real values.

Attribute Values
Salary 2000, 3000
Units sold 10, 20
Age 5, 10, 20, ...


b. Discrete:
Discrete data have a finite or countably infinite set of values, which can be numerical or
categorical in form.
Example:

Attribute Values
Profession Teacher, Businessman, Peon
Zip Code 413736, 413713

c. Continuous:
Continuous data have an infinite number of possible values and are typically of float type.
For example, there can be infinitely many values between 2 and 3.

Example:

Attribute Values
Height 2.3, 3, 6.3, ...
Weight 40, 45.33, ...


*Data Preprocessing:
Data preprocessing is a data mining technique that involves transforming raw data into an
understandable format.
Real-world data is often incomplete and inconsistent and contains many errors.
Data preprocessing is a proven method of resolving such issues.
Data preprocessing prepares raw data for further processing.

Why preprocess the data?


Data Preprocessing is required because real world data are generally:
1. Incomplete:
When a dataset contains missing values, it is referred to as an incomplete dataset,
i.e. it may be missing attribute values, missing some important attributes, or may
contain only aggregate data.
2. Noisy:
Noisy data contains errors or outliers.
Noisy data is meaningless data: it includes any data that cannot be understood and
interpreted correctly by machines, such as unstructured data.
3. Inconsistent:
Data containing discrepancies in codes or names.
Data inconsistency is a situation where multiple tables within a database deal with the
same data but may receive it from different inputs.


*Major Tasks in Data Preprocessing:


Data goes through a series of tasks during preprocessing:
1. Data Cleaning: Data is cleansed through processes such as filling in missing
values, smoothing the noisy data, or resolving the inconsistencies in the data.
2. Data Integration: Data with different representations (formats) are put together and
conflicts within the data are resolved.
3. Data Transformation: Data is normalized, aggregated and generalized.
4. Data Reduction: This step aims to produce a reduced representation of the data that is
much smaller in volume while preserving the critical information.
5. Data Discretization: Data discretization is a method of converting attributes values of
continuous data into a finite set of intervals with minimum data loss.

Fig: Tasks in Data Preprocessing


1. Data Cleaning in Data Mining:


The quality of your data is important for the final analysis; any data which is incomplete,
noisy or inconsistent can affect the results.
Data cleaning in data mining is the process of detecting and removing corrupt or inaccurate
records from a record set, table or database.

Some data cleaning methods:


A. Handle Missing Values:
a. Ignore the tuple:
This is done when class label is missing.
Ignore the tuple only if maximum attributes have the missing values.
b. Fill in the missing value manually:
This approach is effective on small data sets with few missing values.
c. Replace all missing attribute values with a global constant:
Any global constant, such as "Unknown", can be used to replace the missing values.
d. Use the attribute mean to fill in the missing value:
The missing value is replaced by the average value of that column or attribute.
For example, if the average customer income is 25000, this value can be used to replace a
missing income value (see the sketch after this list).
e. Use the most probable value to fill in the missing value:
The missing value can be replaced by the most probable value, i.e. one that is consistent
with the rest of that attribute's values.
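The following minimal Python sketch illustrates strategies (c) and (d) above; the income
figures are invented for illustration.

from statistics import mean

incomes = [20000, None, 30000, None, 25000]  # None marks a missing value

# c. Replace all missing values with a global constant
filled_const = [x if x is not None else "Unknown" for x in incomes]

# d. Replace missing values with the attribute mean
avg = mean(x for x in incomes if x is not None)              # 25000.0
filled_mean = [x if x is not None else avg for x in incomes]
print(filled_mean)  # [20000, 25000.0, 30000, 25000.0, 25000]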

B. Cleaning the Noisy Data:


Noise is a random error or variance in a measured variable.
Noisy data may be due to faulty data collection instruments, data entry problems and
technology limitations.

Binning Method to clean the noisy data:


The binning method smooths sorted data values by consulting their "neighbourhood", that
is, the values around them.

Example:
Price = 4, 8, 15, 21, 21, 24, 25, 28, 34


Partition into (equal-frequency) bins:


Bin a: 4, 8, 15
Bin b: 21, 21, 24
Bin c: 25, 28, 34
In this example, the data for price are first sorted and then partitioned into equal-frequency
bins of size 3.

Smoothing by bin means:


Bin a: 9, 9, 9
Bin b: 22, 22, 22
Bin c: 29, 29, 29
In smoothing by bin means, each value in a bin is replaced by the mean value of the bin.

Smoothing by bin boundaries:


Bin a: 4, 4, 15
Bin b: 21, 21, 24
Bin c: 25, 25, 34
In smoothing by bin boundaries, each bin value is replaced by the closest boundary value
(the smallest and largest values in a bin are its boundaries), as in the sketch below.
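The following runnable Python sketch reproduces this binning example, computing the
equal-frequency bins and both smoothing results shown above.

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # already sorted
size = 3
bins = [prices[i:i + size] for i in range(0, len(prices), size)]

# Smoothing by bin means: every value becomes its bin's mean
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]
print(by_means)   # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]

# Smoothing by bin boundaries: every value snaps to the nearer of min/max
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
             for b in bins]
print(by_bounds)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]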

C. Regression:
Data can be smoothed by fitting it to a regression function.
Example:
If we measure a child's height every year and the child grows approximately 3 inches each
year, the fitted regression function may be: the child grows 3 inches per year.
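A small sketch of regression-based smoothing using NumPy's least-squares line fit; the
yearly height measurements are invented to mirror this example.

import numpy as np

ages = np.array([1, 2, 3, 4, 5])
heights = np.array([23.0, 25.5, 29.0, 31.5, 35.0])  # noisy measurements

slope, intercept = np.polyfit(ages, heights, deg=1)  # fit a straight line
smoothed = slope * ages + intercept                  # replace values by the fit
print(round(slope, 1))  # 3.0 -- about 3 inches of growth per year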

D. Clustering:
Outliers may be detected by clustering, where similar values are organized into groups, or
"clusters".
Values that fall outside of the set of clusters may be considered outliers. The outliers may
be ignored during data analysis.
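A minimal Python sketch of this idea: the clusters are assumed to have already been found
by some clustering algorithm, and the values and distance threshold are illustrative.

values = [4, 5, 6, 20, 21, 22, 95]
clusters = [[4, 5, 6], [20, 21, 22]]  # groups found by a clustering step
threshold = 10                        # maximum allowed distance to a cluster

means = [sum(c) / len(c) for c in clusters]
outliers = [v for v in values
            if all(abs(v - m) > threshold for m in means)]
print(outliers)  # [95] -- far from every cluster, so treated as an outlier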


2. Data Integration in Data Mining:


Data Integration is a data preprocessing technique that combines data from multiple data
sources and provides a unified view of these data to users.

Fig: Multiple data sources feed a data warehouse, which provides a unified view to users.

These sources may include multiple databases, data cubes, or flat files. One of the most
well-known implementations of data integration is building an enterprise's data warehouse.
A data warehouse enables a business to perform analyses based on the data it contains.
There are two major approaches to data integration:

1. Tight Coupling
In tight coupling, data from different sources is combined into a single physical location
through the process of ETL - Extraction, Transformation and Loading (see the sketch below,
after the two approaches).

2. Loose Coupling
In loose coupling, data remains only in the actual source databases. In this approach, an
interface is provided that takes a query from the user and sends it directly to the source
databases to obtain the result.
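The following small pandas sketch illustrates the tight-coupling (ETL) approach; the source
tables and column names are invented for illustration.

import pandas as pd

# Extraction: records pulled from two different sources
crm = pd.DataFrame({"cust_id": [1, 2], "name": ["Asha", "Ravi"]})
billing = pd.DataFrame({"customer": [1, 2], "amount": [250, 400]})

# Transformation: resolve the naming conflict (customer vs cust_id)
billing = billing.rename(columns={"customer": "cust_id"})

# Loading: the merged result would be stored in the warehouse
warehouse = crm.merge(billing, on="cust_id")
print(warehouse)  # one unified table with cust_id, name, amount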


3. Data Transformation in Data Mining:


In the data transformation process, data are transformed from one format to another that
is more appropriate for data mining.
Ex: Original data: 1.2, 3.2, 4.6, 123
Transformed data: 120, 320, 460, 123
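One common transformation of this kind is min-max normalization, which rescales values
into the range [0, 1]. The following minimal Python sketch applies it to the original data
above (the choice of min-max scaling here is illustrative, not the exact rescaling used in
the example).

data = [1.2, 3.2, 4.6, 123]
lo, hi = min(data), max(data)
normalized = [(x - lo) / (hi - lo) for x in data]
print([round(v, 3) for v in normalized])  # [0.0, 0.016, 0.028, 1.0]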

Some Data Transformation Strategies:


a. Smoothing:
Smoothing is the process of removing noise from the data (for an example, refer to the
binning method under Data Cleaning).

b. Aggregation:
Aggregation in data mining is the process of finding, collecting, and presenting data in a
summarized format to perform statistical analysis for business decisions.
Aggregated data help in finding useful information about a group after they are written as
reports.
Ex: finding the number of consumers in each country, as in the sketch below. This can help
a company increase sales in countries with many buyers and improve its marketing in
countries with few buyers. Here, instead of individual buyers, groups of buyers per
country are considered.
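A small pandas sketch of this kind of aggregation; the sales records are invented for
illustration.

import pandas as pd

sales = pd.DataFrame({
    "country": ["India", "India", "USA", "USA", "USA"],
    "amount":  [100, 150, 200, 120, 80],
})

summary = sales.groupby("country").agg(
    buyers=("amount", "size"),   # number of purchase records per country
    revenue=("amount", "sum"),   # total amount per country
)
print(summary)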


c. Generalization:
In generalization, low-level data are replaced with high-level data by climbing concept
hierarchies.
Example: Roll-up operation on Data Cube


4. Data Reduction in Data Mining:


A database or data warehouse may store large amounts of data, so performing data analysis
and mining on such huge amounts of data may take a very long time.
Data reduction techniques can be applied to obtain a reduced representation of the data
set that is much smaller in volume but still contains the critical information.

Data Reduction Strategies:


a. Data Cube Aggregation:
Aggregation operations are applied to the data in the construction of a data cube.
(For an example, refer to the Aggregation strategy under Data Transformation.)

b. Dimensionality Reduction:
In dimensionality reduction, redundant attributes are detected and removed, which reduces
the data set size, as in the sketch below.
Example: a duplicated attribute A1 is removed.

Before reduction:
A1 A2 A1 A3
10 11 11 21

After reduction:
A1 A2 A3
10 11 21
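A minimal pandas sketch of removing the duplicated attribute from the example above.

import pandas as pd

df = pd.DataFrame([[10, 11, 11, 21]], columns=["A1", "A2", "A1", "A3"])

# Keep only the first occurrence of each attribute name
reduced = df.loc[:, ~df.columns.duplicated()]
print(list(reduced.columns))  # ['A1', 'A2', 'A3']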

c. Discretization process:
(For the concept, refer to the section below.)


5. Data Discretization:
Data discretization techniques can be used to divide the range of a continuous attribute
into intervals (continuous values are divided into a finite set of discrete values).
That is, it divides a large range of values into smaller interval parts.
Numerous continuous attribute values are replaced by a small number of interval labels.
This leads to a concise, easy-to-use, knowledge-level representation of mining results.
Data mining on a reduced data set requires fewer input/output operations and is more
efficient than mining on a larger data set.
Because of these benefits, discretization techniques and concept hierarchies are typically
applied before data mining, rather than during mining.
Typical methods for Discretization and Concept Hierarchy Generation for Numerical Data:

a. Binning Method:
(Refer to the binning method under Data Cleaning.)
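As a small illustration, the following Python sketch discretizes a numeric attribute into
equal-width interval labels; the age values and the three-bin choice are invented.

ages = [3, 12, 25, 37, 48, 61, 74]
lo, hi, k = min(ages), max(ages), 3  # 3 equal-width bins
width = (hi - lo) / k

labels = [f"[{lo + i * width:.0f}, {lo + (i + 1) * width:.0f})" for i in range(k)]
binned = [labels[min(int((a - lo) / width), k - 1)] for a in ages]
print(binned)
# ['[3, 27)', '[3, 27)', '[3, 27)', '[27, 50)', '[27, 50)', '[50, 74)', '[50, 74)']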

b. Cluster Analysis:
Cluster analysis is a popular data discretization method.
In clustering, similar data objects are grouped together; one group forms a cluster. Data
sets are divided into different groups in cluster analysis based on the similarity of the
data.
A clustering algorithm can be applied to discretize a numeric attribute A by partitioning
the values of A into clusters or groups.
Each initial cluster or partition may be further decomposed into several subclusters,
forming a lower level of the hierarchy.


Assignment 4

1. Define the term Data cleaning with example. (2)


2. Define the term Data mining. (2)
3. List methods of data preprocessing. (2)
4. Describe any four Challenges of Data mining. (4)
5. Explain Data Cleaning Process. (4)
6. Describe the need of data preprocessing. (4)
7. Explain Data preprocessing techniques in data mining. (6)
8. Explain steps involved in KDD process with diagram. (6)

