Assignment 2
DATA MINING:
Data mining refers to extracting or mining knowledge from large amounts of data. Thus, data mining might more appropriately have been named knowledge mining, which emphasizes mining knowledge from large amounts of data. It is the computational process of discovering patterns in large data sets using methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. It is a process of discovering various models, summaries, and derived values from a given collection of data. Data mining is a rapidly growing field concerned with developing techniques that help managers and decision-makers make intelligent use of huge data repositories.
2. Collect data –
This step is concerned with how the data are generated and collected. In general, there are two distinct possibilities. The first is when the data-generation process is under the control of an expert; this approach is known as a designed experiment. The second possibility is when the expert cannot influence the data-generation process; this is referred to as the observational approach. An observational setting, namely random data generation, is assumed in most data-mining applications. Typically, the sampling distribution is completely unknown after the data are collected, or it is only partially and implicitly given within the data-collection procedure. It is important, however, to understand how data collection affects the theoretical distribution, since such prior knowledge is very useful for modeling and, later, for the final interpretation of results. It is also important to make sure that the data used for estimating a model and the data used later for testing and applying the model come from the same, unknown, sampling distribution; a short sketch of such a split is given below. If this is not the case, the estimated model cannot be successfully used in the final application of the results.
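As an illustration of the last point, the following minimal sketch (assuming Python with scikit-learn and its built-in iris toy dataset, which are not part of this assignment) splits one collected sample into training and test parts so that both come from the same underlying sampling distribution.

```python
# Minimal sketch: training and test data drawn from the same collected sample,
# so both share the same (unknown) sampling distribution.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Random, stratified split of one collected sample into train and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```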
3. Data Preprocessing
In the observational setting, data are usually “collected” from existing databases, data warehouses, and data marts. Data preprocessing usually includes at least two common tasks (see the data cleaning and data transformation steps described later in this assignment).
4. Estimate model
The selection and implementation of the appropriate data-mining technique is the main task in this phase. This process is not straightforward; usually, in practice, the implementation is based on several models, and selecting the best one is an additional task, as sketched below.
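The following minimal sketch (assuming Python with scikit-learn and its iris toy dataset; the candidate models are chosen only for illustration) estimates several models and selects the best one by cross-validated accuracy.

```python
# Minimal sketch: estimate several candidate models and keep the best performer.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

candidates = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "knn": KNeighborsClassifier(n_neighbors=5),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

# Estimate each model with 5-fold cross-validation and select the best one.
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}
best = max(scores, key=scores.get)
print(scores, "-> best model:", best)
```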
Online Transaction Processing (OLTP) vs Data Warehousing:
OLTP deals with detailed day-to-day transaction data that keeps changing every day, whereas Data Warehousing gathers and collects data from different sources into a central repository.
OLTP is designed for business transaction processing, whereas a data warehouse is designed for the decision-making process.
OLTP holds current data, whereas a data warehouse stores large amounts of data or historical data.
OLTP is used for running the business, whereas a data warehouse is used for analyzing the business.
In Online Transaction Processing, the size of the database is around 10 MB-100 GB, whereas in Data Warehousing the size of the database is around 100 GB-2 TB.
In Online Transaction Processing there is no data redundancy, whereas in Data Warehousing data redundancy is present.
Real-world data are generally incomplete: missing attribute values, missing certain attributes of importance, or having only aggregate data.
They are noisy: containing errors or outliers.
They are inconsistent: containing discrepancies in codes or names.
Data Cleaning:
The data can have many irrelevant and missing parts. To handle this, data cleaning is performed. It involves handling missing data, noisy data, etc.
(a). Missing Data:
This situation arises when some values are missing in the data. It can be handled in various ways.
i) Ignore the tuples: This approach is suitable only when the dataset we have is quite large and multiple values are missing within a tuple.
ii) Fill the missing values: There are various ways to do this task. You can choose to fill the missing values manually, by the attribute mean, or by the most probable value; a short sketch follows below.
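A minimal sketch of both options (assuming Python with pandas and NumPy; the small table of ages, incomes, and cities is invented purely for illustration):

```python
# Minimal sketch: ignore tuples with missing values, or fill missing values
# with the attribute mean / most probable value.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age":    [25, np.nan, 47, 51, np.nan],
    "income": [40000, 52000, np.nan, 61000, 58000],
    "city":   ["Pune", "Mumbai", None, "Pune", "Pune"],
})

dropped = df.dropna()                                             # i) ignore the tuples
filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].mean())        # ii) fill with attribute mean
filled["income"] = filled["income"].fillna(filled["income"].mean())
filled["city"] = filled["city"].fillna(filled["city"].mode()[0])  # most probable value
print(dropped, filled, sep="\n\n")
```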
(b). Noisy Data:
i) Binning method: This method works on sorted data in order to smooth it. The whole data is divided into segments of equal size, and then various methods are performed to complete the task. Each segment is handled separately. One can replace all data in a segment by its mean, or boundary values can be used to complete the task (see the sketch after this list).
ii) Regression: Here data can be made smooth by fitting it to a regression function. The regression used may be linear (having one independent variable) or multiple (having multiple independent variables).
iii) Clustering: This approach groups similar data into clusters. The outliers may go undetected, or they will fall outside the clusters.
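A minimal sketch of the binning method (assuming Python with NumPy; the list of prices is invented for illustration), where every value in a segment is replaced by the segment mean:

```python
# Minimal sketch: smoothing by bin means on equal-size segments of sorted data.
import numpy as np

prices = np.array([15, 21, 24, 8, 28, 4, 34, 25, 9])
sorted_prices = np.sort(prices)      # binning works on sorted data
bins = sorted_prices.reshape(3, 3)   # 3 segments of equal size

# Replace every value in a segment by the segment mean.
smoothed = np.repeat(bins.mean(axis=1), bins.shape[1])
print(sorted_prices)   # [ 4  8  9 15 21 24 25 28 34]
print(smoothed)        # [ 7.  7.  7. 20. 20. 20. 29. 29. 29.]
```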
2. Data Transformation:
This step is taken in order to transform the data into forms suitable for the mining process. This involves the following ways:
i) Normalization: It is done in order to scale the data values into a specified range (-1.0 to 1.0 or 0.0 to 1.0); see the sketch after this list.
ii) Attribute Selection: New attributes are constructed from the given set of attributes to help the mining process.
iii) Discretization: This is done to replace the raw values of a numeric attribute by interval levels or conceptual levels.
iv) Concept Hierarchy Generation: Here attributes are converted from a lower level to a higher level in the hierarchy.
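A minimal sketch of normalization (assuming Python with scikit-learn; the attribute values are invented for illustration), scaling values into the range 0.0 to 1.0:

```python
# Minimal sketch: min-max normalization into the range [0.0, 1.0].
import numpy as np
from sklearn.preprocessing import MinMaxScaler

values = np.array([[200.0], [300.0], [400.0], [600.0], [1000.0]])
scaled = MinMaxScaler(feature_range=(0.0, 1.0)).fit_transform(values)
print(scaled.ravel())   # [0.    0.125 0.25  0.5   1.   ]
```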
3. Data Reduction:
Data mining is a technique that is used to handle huge amounts of data. While working with a huge volume of data, analysis becomes harder. To deal with this, we use data reduction techniques, which aim to increase storage efficiency and reduce data storage and analysis costs.
The various steps of data reduction are:
i) Data Cube Aggregation: Aggregation operations are applied to the data for the construction of the data cube.
ii) Attribute Subset Selection: Only the highly relevant attributes should be used; the rest can be discarded. For performing attribute selection, one can use the level of significance and the p-value of the attribute. An attribute having a p-value greater than the significance level can be discarded.
iii) Numerosity Reduction: This enables storing a model of the data instead of the whole data, for example regression models.
iv) Dimensionality Reduction: This reduces the size of the data by encoding mechanisms. It can be lossy or lossless. If the original data can be retrieved after reconstruction from the compressed data, the reduction is called lossless; otherwise it is called lossy. Two effective methods of dimensionality reduction are wavelet transforms and PCA (Principal Component Analysis); a short sketch using PCA follows below.
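A minimal sketch of dimensionality reduction with PCA (assuming Python with scikit-learn and its iris toy dataset), encoding four attributes into two principal components:

```python
# Minimal sketch: lossy dimensionality reduction with PCA.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)   # (150, 4) -> (150, 2)
print(pca.explained_variance_ratio_)    # variance kept by each component
```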
i) Discretization: Reduce the number of values for a given continuous attribute by dividing the range of the continuous attribute into intervals. Interval labels can then be used to replace the actual data values.
ii) Concept Hierarchies: Reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) by higher-level concepts (such as young, middle-aged, or senior). A short sketch of both steps is given below.
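A minimal sketch of discretization and a simple concept hierarchy (assuming Python with pandas; the ages and interval boundaries are invented for illustration):

```python
# Minimal sketch: replace raw numeric ages by interval labels, then by
# higher-level concepts (young, middle-aged, senior).
import pandas as pd

ages = pd.Series([19, 23, 31, 45, 52, 67, 70])

# i) Discretization: divide the continuous range into intervals.
intervals = pd.cut(ages, bins=[0, 30, 60, 100])

# ii) Concept hierarchy: replace intervals by higher-level concepts.
concepts = pd.cut(ages, bins=[0, 30, 60, 100],
                  labels=["young", "middle-aged", "senior"])
print(pd.DataFrame({"age": ages, "interval": intervals, "concept": concepts}))
```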