Assignment 2


1. DATA MINING:
Data mining refers to extracting or mining knowledge from large amounts of data. Thus, data mining could more appropriately have been named knowledge mining, which emphasizes mining from large amounts of data. It is the computational process of discovering patterns in large data sets, involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. It is a process of discovering various models, summaries, and derived values from a given collection of data. Data mining is a rapidly growing field concerned with developing techniques that assist managers and decision-makers in making intelligent use of the huge amounts of data stored in repositories.

A few properties of data mining are as follows:

Automatic discovery of patterns
Prediction of likely outcomes
Creation of actionable information
Focus on large data sets and databases

Steps involved in the data mining process:

1. State the problem and formulate the hypothesis

In this step, a modeler usually specifies a group of variables for the unknown dependency and, if possible, a general form of this dependency as an initial hypothesis. Several hypotheses may be formulated for one problem at this stage. This first step requires the combined expertise of the application domain and of data-mining modeling. In practice, it means an in-depth interaction between the data-mining expert and the application expert. In successful data-mining applications, this cooperation does not stop in the initial phase; it continues throughout the whole data-mining process.

2. Collect data

This step is concerned with how the data are generated and collected. In general, there are two distinct possibilities. The first is when the data-generation process is under the control of an expert; this approach is known as a designed experiment. The second possibility is when the expert cannot influence the data-generation process; this is often referred to as the observational approach. An observational setting, namely random data generation, is assumed in most data-mining applications. Typically, the sampling distribution is completely unknown after the data are collected, or it is partially and implicitly given in the data-collection procedure. It is vital, however, to understand how data collection affects the theoretical distribution, since such prior knowledge is often useful for modeling and, later, for the final interpretation of results. It is also important to make sure that the data used for estimating a model and the data used later for testing and applying the model come from the same unknown sampling distribution. If this is not the case, the estimated model cannot be successfully used in the final application of the results.
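As a minimal sketch of that last point (assuming NumPy is available; the array contents and the 70/30 split ratio are illustrative assumptions, not part of the assignment), shuffling the collected samples before splitting them keeps the training and test subsets on the same unknown sampling distribution:

import numpy as np

def random_split(data, test_fraction=0.3, seed=42):
    # Shuffle the rows so that the training and test subsets are
    # drawn from the same (unknown) sampling distribution.
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(data))
    cut = int(len(data) * (1.0 - test_fraction))
    return data[indices[:cut]], data[indices[cut:]]

# 100 hypothetical observations with 3 features each.
samples = np.random.default_rng(0).normal(size=(100, 3))
train, test = random_split(samples)
print(train.shape, test.shape)   # (70, 3) (30, 3)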
3. Data Preprocessing
In the observational setting, data are usually "collected" from existing databases, data warehouses, and data marts. Data preprocessing usually includes at least two common tasks:

a) Outlier detection and removal:

Outliers are unusual data values that are not consistent with most of the other observations. Commonly, outliers result from measurement errors, coding and recording errors, and, sometimes, natural abnormal values. Such non-representative samples can seriously affect the model produced later. There are two strategies for handling outliers: detect and eventually remove the outliers as a part of the preprocessing phase, or develop robust modeling methods that are insensitive to outliers.
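A minimal sketch of the first strategy, using the interquartile-range rule on a single numeric attribute (the 1.5 factor and the sample temperatures are illustrative assumptions, not part of the assignment):

import numpy as np

def remove_outliers(values, factor=1.5):
    # Interquartile-range rule: values outside [Q1 - factor*IQR, Q3 + factor*IQR]
    # are treated as outliers and removed.
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - factor * iqr, q3 + factor * iqr
    return values[(values >= lower) & (values <= upper)]

temperatures = np.array([24.0, 25.5, 23.8, 26.1, 25.0, 24.7, 98.0])  # 98.0 looks like a recording error
print(remove_outliers(temperatures))  # the extreme value is dropped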

b) Scaling, encoding, and selecting features:

Data preprocessing includes several steps such as variable scaling and different types of encoding. Application-specific encoding methods usually achieve dimensionality reduction by providing a smaller number of informative features for subsequent data modeling. Data-preprocessing steps should not be considered completely independent from the other data-mining phases. In every iteration of the data-mining process, all activities, together, could define new and improved data sets for subsequent iterations. Generally, a good preprocessing method provides an optimal representation for a data-mining technique by incorporating prior knowledge in the form of application-specific scaling and encoding.
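A minimal sketch of two such steps, z-score scaling of a numeric feature and one-hot encoding of a categorical feature (the helper names and the sample values are illustrative assumptions):

import numpy as np

def standardize(column):
    # Scale a numeric feature to zero mean and unit variance (z-scores).
    column = np.asarray(column, dtype=float)
    return (column - column.mean()) / column.std()

def one_hot_encode(column):
    # Encode a categorical feature as binary indicator columns.
    categories = sorted(set(column))
    codes = np.array([[1 if value == c else 0 for c in categories] for value in column])
    return codes, categories

print(standardize([30_000, 45_000, 60_000, 120_000]))   # zero-mean, unit-variance salaries

codes, labels = one_hot_encode(["red", "blue", "red"])
print(labels)   # ['blue', 'red']
print(codes)    # [[0 1] [1 0] [0 1]]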

4. Estimate model
The selection and implementation of the appropriate data-mining technique is the main task in this phase. This process is not straightforward; in practice, the implementation is usually based on several models, and selecting the best one is an additional task.
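One common way to compare several candidate models is cross-validation. The sketch below assumes scikit-learn is available; the two candidate models and the iris data set are illustrative choices, not requirements of the assignment:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Several candidate models are estimated; the best-scoring one would be kept.
candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(max_depth=3),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)   # 5-fold cross-validated accuracy
    print(name, round(scores.mean(), 3))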

5. Interpret model and draw conclusions

In most cases, data-mining models should support decision making. Hence, such models need to be interpretable in order to be useful, because humans are unlikely to base their decisions on complex "black-box" models. Usually, simpler models are more interpretable, but they are also less accurate. Modern data-mining methods are expected to yield highly accurate results using high-dimensional models. The problem of interpreting these models is considered a separate task, with specific techniques to validate the results.
2. DIFFERENCE BETWEEN OLTP AND DATA WAREHOUSE

OLTP is a technique used for detailed day-to-day transaction data, which keeps changing every day, whereas a data warehouse is a technique that gathers data from different sources into a central repository.

OLTP is designed for business transaction processing, whereas a data warehouse is designed for the decision-making process.

OLTP holds current data, whereas a data warehouse stores large amounts of data, including historical data.

OLTP is used for running the business, whereas a data warehouse is used for analyzing the business.

In online transaction processing, the size of the database is around 10 MB-100 GB, whereas in data warehousing, the size of the database is around 100 GB-2 TB.

In online transaction processing, normalized data is present, whereas in data warehousing, denormalized data is present.

OLTP uses transaction processing, whereas a data warehouse uses query processing.

OLTP is application-oriented, whereas a data warehouse is subject-oriented.

In online transaction processing, there is no data redundancy, whereas in data warehousing, data redundancy is present.

4. PREPROCESS THE DATA:

Data preprocessing is a data mining technique used to transform raw data into a useful and efficient format. Data preprocessing is required because:

Real-world data are generally incomplete: missing attribute values, missing certain attributes of importance, or containing only aggregate data
They are noisy: containing errors or outliers
They are inconsistent: containing discrepancies in codes or names

Steps Involved in Data Preprocessing:

1. Data Cleaning:
The data can have many irrelevant and missing parts. To handle this, data cleaning is done. It involves handling of missing data, noisy data, etc.
(a). Missing Data:
This situation arises when some values are missing from the data. It can be handled in various ways:

i) Ignore the tuples: This approach is suitable only when the dataset is quite large and multiple values are missing within a tuple.

ii) Fill in the missing values: There are various ways to do this. The missing values can be filled in manually, by the attribute mean, or by the most probable value.
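A minimal sketch of filling missing values with the attribute mean (the attribute name and sample values are illustrative assumptions; NaN marks a missing entry):

import numpy as np

def fill_missing_with_mean(column):
    # Replace NaN entries of a numeric attribute with the attribute mean,
    # computed over the observed (non-missing) values only.
    column = np.asarray(column, dtype=float)
    mean = np.nanmean(column)
    return np.where(np.isnan(column), mean, column)

incomes = [50_000, np.nan, 62_000, 48_000, np.nan]
print(fill_missing_with_mean(incomes))   # NaNs replaced by the mean of the observed incomes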

(b). Noisy Data:

Noisy data is meaningless data that cannot be interpreted by machines. It can be generated due to faulty data collection, data entry errors, etc. It can be handled in the following ways:

i) Binning Method: This method works on sorted data in order to smooth it. The whole data is divided into segments of equal size, and then various methods are applied to complete the task. Each segment is handled separately. One can replace all the data in a segment by its mean, or boundary values can be used to complete the task (a sketch of smoothing by bin means is given after this list).

ii) Regression: Here data can be smoothed by fitting it to a regression function. The regression used may be linear (having one independent variable) or multiple (having multiple independent variables).

iii) Clustering: This approach groups similar data into clusters. The outliers may go undetected, or they will fall outside the clusters.
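A minimal sketch of smoothing by bin means with equal-size segments (the number of bins and the sample prices are illustrative assumptions):

import numpy as np

def smooth_by_bin_means(values, bins=3):
    # Sort the data, split it into segments of (roughly) equal size,
    # and replace every value in a segment by that segment's mean.
    values = np.sort(np.asarray(values, dtype=float))
    smoothed = []
    for segment in np.array_split(values, bins):
        smoothed.extend([segment.mean()] * len(segment))
    return np.array(smoothed)

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, bins=3))
# [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]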

2. Data Transformation:
This step is taken in order to transform the data into forms appropriate for the mining process. It involves the following ways:

i) Normalization: It is done in order to scale the data values into a specified range, such as -1.0 to 1.0 or 0.0 to 1.0 (a min-max normalization sketch is given after this list).

ii) Attribute Selection: New attributes are constructed from the given set of attributes to help the mining process.

iii) Discretization: This is done to replace the raw values of a numeric attribute by interval levels or conceptual levels.

iv) Concept Hierarchy Generation: Here attributes are converted from a lower level to a higher level in the hierarchy.
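A minimal sketch of min-max normalization into a chosen range (the target ranges and the sample marks are illustrative assumptions):

import numpy as np

def normalize(values, new_min=-1.0, new_max=1.0):
    # Min-max normalization: map the values into the [new_min, new_max] range.
    values = np.asarray(values, dtype=float)
    old_min, old_max = values.min(), values.max()
    scaled = (values - old_min) / (old_max - old_min)
    return scaled * (new_max - new_min) + new_min

marks = [35, 50, 75, 90]
print(normalize(marks))                              # values scaled into [-1.0, 1.0]
print(normalize(marks, new_min=0.0, new_max=1.0))    # values scaled into [0.0, 1.0]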

3. Data Reduction:
Data mining is a technique used to handle huge amounts of data, and analysis becomes harder when working with such volumes. To deal with this, we use data reduction techniques. They aim to increase storage efficiency and reduce data storage and analysis costs.
The various steps of data reduction are:

i) Data Cube Aggregation: Aggregation operations are applied to the data for the construction of the data cube.

ii) Attribute Subset Selection: Only the highly relevant attributes should be used; all the rest can be discarded. For performing attribute selection, one can use the level of significance and the p-value of the attribute: an attribute having a p-value greater than the significance level can be discarded.

iii) Numerosity Reduction: This enables storing a model of the data instead of the whole data, for example regression models.

iv) Dimensionality Reduction: This reduces the size of the data by encoding mechanisms. It can be lossy or lossless. If the original data can be retrieved after reconstruction from the compressed data, such reduction is called lossless reduction; otherwise it is called lossy reduction. Two effective methods of dimensionality reduction are wavelet transforms and PCA (Principal Component Analysis); a small PCA sketch is given after this list.
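A minimal PCA sketch, assuming scikit-learn is available; the iris data set and the choice of two components are illustrative assumptions:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)     # 150 samples described by 4 numeric attributes
pca = PCA(n_components=2)             # keep only the 2 most informative components
reduced = pca.fit_transform(X)

print(X.shape, "->", reduced.shape)          # (150, 4) -> (150, 2)
print(pca.explained_variance_ratio_)         # share of the variance retained by each component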

4. Discretization and Concept Hierarchy Generation:

i) Discretization: Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace the actual data values.

ii) Concept Hierarchies: Reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) by higher-level concepts (such as young, middle-aged, or senior).
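A minimal sketch combining both ideas for an age attribute (the cut points 30 and 60 and the concept labels are illustrative assumptions):

def discretize_age(age):
    # Discretization: replace the raw numeric age with an interval label.
    # Concept hierarchy: replace the interval with a higher-level concept.
    if age < 30:
        return "[0, 30)", "young"
    elif age < 60:
        return "[30, 60)", "middle-aged"
    else:
        return "[60, +)", "senior"

for age in [22, 41, 67]:
    print(age, discretize_age(age))
# 22 ('[0, 30)', 'young')
# 41 ('[30, 60)', 'middle-aged')
# 67 ('[60, +)', 'senior')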
