Module-1 C1-C2

Uploaded by

abhishek365ngp
Data Mining

By: S. K. Sahu
M. Tech, Ph.D Scholar VNIT, Nagpur
Module 1: (Book Ch1 & Book Ch2)

1. Introduction
2. Data Mining Roots
3. Data Mining Process
4. Large Data Sets
5. Data Warehouse for Data Mining
6. Business Aspects of Data Mining
7. Preparing Data

Data mining is the process of extracting and discovering
patterns in large data sets involving methods at the intersection
of machine learning, statistics, and database systems.

Data mining, also known as knowledge discovery in data
(KDD), is the process of uncovering patterns and other
valuable information from large data sets.

Data mining is a process of discovering various models,
summaries, and derived values from a given collection of data.

Data mining is the search for new, valuable, and nontrivial
information in large volumes of data.

Primary goals of data mining
In practice, the two primary goals of data mining tend to be
prediction and description.

Prediction involves using some variables or fields in the data
set to predict unknown or future values of other variables of
interest.

Description, on the other hand, focuses on finding patterns
describing the data that can be interpreted by humans.
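
The two goals can be contrasted on a toy data set. The sketch below is only an illustration: the (x, y) pairs are invented, and the choice of summary statistics for description and of a least-squares line for prediction is an assumption, not a method named in these slides.

```python
from statistics import mean

# Hypothetical (x, y) samples, invented for illustration only.
samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
xs = [x for x, _ in samples]
ys = [y for _, y in samples]

# Description: summarize the data in a form a human can interpret.
description = {
    "mean_x": mean(xs),
    "mean_y": mean(ys),
    "min_y": min(ys),
    "max_y": max(ys),
}

# Prediction: fit y = a*x + b by ordinary least squares,
# then use the fitted line to estimate y for an unseen x.
x_bar, y_bar = mean(xs), mean(ys)
a = sum((x - x_bar) * (y - y_bar) for x, y in samples) / sum(
    (x - x_bar) ** 2 for x in xs
)
b = y_bar - a * x_bar


def predict(x):
    """Estimate y for a new x using the fitted line."""
    return a * x + b
```

Here `description` answers "what does the data look like?", while `predict(5.0)` answers "what value should we expect for a case we have not seen?".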
Primary Data Mining Task
DATA-MINING ROOTS

• The roots of data mining are its basic, widely
accepted foundations.

• Data mining has its origins in various disciplines,
of which the two most important are statistics and
machine learning.

• Control theory, applied to engineering systems and
industrial processes, is another root. Building a model
of an unknown system by observing its input-output
data pairs is generally referred to as system
identification.
DATA MINING PROCESS
The general experimental procedure adapted to data-mining
problems involves the following steps:
LARGE DATA SETS
DATA Warehouse for DATA MINING

What is Data Warehousing?

Data warehousing (DW) is the process of collecting
and managing data from varied sources to provide
meaningful business insights.

The data warehouse is the core of the BI system,
which is built for data analysis and reporting.
Representation of Data
Business Aspects of Data Mining

For businesses, data mining is used to discover
patterns and relationships in the data in order to
help make better business decisions.

Data mining can help spot sales trends, develop
smarter marketing campaigns, and accurately
predict customer loyalty.
REPRESENTATION OF RAW DATA
The two most common types of data are numeric and categorical.
Numeric values include real-valued or integer variables such
as age, speed, or length. A feature with numeric values has
two important properties: its values have an order relation
(2 < 5 and 5 < 7) and a distance relation (d(2.3, 4.2) = 1.9).

In contrast, categorical (often called symbolic) variables have
neither of these two relations. Two values of a categorical
variable can only be either equal or not equal.

A categorical variable with n values can be converted into
n binary numeric variables, namely, one binary variable for
each categorical value.
For example, if the variable eye color has four values
(black, blue, green, and brown), they can be coded with
four binary digits.
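
The eye-color coding can be sketched in a few lines. This is a minimal illustration, assuming the category order given in the text; the helper name `one_hot` is hypothetical and not part of any library.

```python
# Category order follows the example in the text.
EYE_COLORS = ["black", "blue", "green", "brown"]


def one_hot(value, categories):
    """Return one 0/1 flag per category, with exactly one flag set to 1."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories]


# One binary variable per categorical value.
codes = {color: one_hot(color, EYE_COLORS) for color in EYE_COLORS}

for color, bits in codes.items():
    print(color, "".join(str(bit) for bit in bits))
```

Each value maps to a four-digit code with a single 1, so no artificial order or distance is introduced between the colors.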
Another way of classifying a variable, based on its values, is
to look at it as a continuous variable or a discrete variable.

Continuous variables are also known as quantitative or
metric variables. They are measured using either an
interval scale or a ratio scale.

Discrete variables are also called qualitative variables.
Such variables are measured, or their values defined,
using one of two kinds of nonmetric scales: nominal
or ordinal.
CHARACTERISTICS OF RAW DATA

Raw data sets initially prepared for data mining are
often large; many are related to human beings and
have the potential for being messy.

A priori, one should expect to find missing values,
distortions, misrecording, inadequate sampling, and
so on in these initial data sets.

Many experts in data mining will agree that one of
the most critical steps in a data-mining process is
the preparation and transformation of the initial
data set.
There are two central tasks for the preparation of
data:

1. Organizing data into a standard form that is ready
for processing by data-mining and other computer-
based tools (a standard form is a relational table),
and

2. Preparing data sets that lead to the best data-
mining performances.
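
The first task can be pictured as turning loosely structured records into rows with a fixed set of columns. The sketch below is illustrative only: the records, the column names, and the helper `to_table` are all hypothetical, and marking a missing value with `None` is an assumption rather than a rule from these slides.

```python
# Hypothetical raw records with inconsistent fields.
raw_records = [
    {"id": 1, "age": 34, "eye_color": "blue"},
    {"id": 2, "eye_color": "brown"},  # age was never recorded
    {"id": 3, "age": 51, "eye_color": "green", "note": "extra field"},
]

# The fixed column layout of the relational table (assumed for this example).
COLUMNS = ["id", "age", "eye_color"]


def to_table(records, columns):
    """Build one row per record, one cell per column; missing cells become None."""
    return [tuple(record.get(column) for column in columns) for record in records]


table = to_table(raw_records, COLUMNS)

for row in table:
    print(row)
```

Every row now has the same shape, which is what most data-mining tools expect; the missing age shows up explicitly instead of being silently absent.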
TRANSFORMATION OF RAW DATA
Data Smoothing

A numeric variable may take many distinct values. For many
data-mining techniques, minor differences among these values
are not significant and may degrade the performance of the
method and the final results.

They may be considered as random variations of the same
underlying value. Hence, it can be advantageous sometimes to
smooth the values of the variable.
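
One simple way to smooth a variable is to round its values to a chosen precision. The sketch below assumes that rounding is an acceptable smoother for the feature at hand; the sample values and the helper name `smooth_by_rounding` are made up for illustration.

```python
def smooth_by_rounding(values, decimals=0):
    """Smooth a numeric feature by rounding each value to `decimals` places."""
    return [round(value, decimals) for value in values]


# Hypothetical measurements that scatter around 1, 3, and 5.
raw = [0.93, 1.01, 1.001, 3.02, 2.99, 5.03, 5.01, 4.98]
smoothed = smooth_by_rounding(raw)

print(smoothed)
```

After smoothing, the eight distinct raw values collapse to three, which reduces the number of values a mining method has to distinguish.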
MISSING DATA
OUTLIER ANALYSIS
Very often in large data sets, there exist samples
that do not comply with the general behavior of the
data model. Such samples, which are significantly
different or inconsistent with the remaining set of
data, are called outliers.

Outliers can be caused by measurement error, or
they may be the result of inherent data variability. If,
for example, a person's age is recorded in the
database as −1, the value is obviously not correct.
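
Two simple checks illustrate the idea. This is a sketch under stated assumptions, not the detection scheme of these slides: the age list is invented, the valid range 0 to 120 is an assumed domain rule, and flagging values more than k standard deviations from the mean is one common statistical heuristic.

```python
from statistics import mean, stdev


def invalid_by_domain(values, low, high):
    """Return values outside the range that is possible for the feature."""
    return [value for value in values if not (low <= value <= high)]


def outliers_by_threshold(values, k=2.0):
    """Return values farther than k sample standard deviations from the mean."""
    center, spread = mean(values), stdev(values)
    return [value for value in values if abs(value - center) > k * spread]


# Hypothetical ages, including a recording error (-1) and an extreme value (130).
ages = [23, 31, 27, 45, 38, -1, 29, 34, 41, 130]

bad_ages = invalid_by_domain(ages, low=0, high=120)
far_ages = outliers_by_threshold(ages, k=2.0)

print("outside valid range:", bad_ages)
print("far from the mean:", far_ages)
```

On this data the domain check flags both −1 and 130, while the statistical threshold flags only 130, because the extreme value inflates the standard deviation. This is one reason domain knowledge and statistical checks are usually combined.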

The main types of outlier detection schemes are


Thank you
