Assignment 2
DATA MINING:
Data mining refers to extracting or mining knowledge from large amounts of data. Thus, data mining might more appropriately have been named knowledge mining, which emphasizes mining knowledge from large amounts of data. It is the computational process of discovering patterns in large data sets using methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. It is a process of discovering various models, summaries, and derived values from a given collection of data. Data mining is a rapidly growing field concerned with developing techniques that help managers and decision-makers make intelligent use of huge data repositories.
2. Collect data –
This step is concerned with how the data are generated and collected. In general, there are two distinct possibilities. The first is when the data-generation process is under the control of an expert; this approach is known as a designed experiment. The second possibility is when the expert cannot influence the data-generation process; this is referred to as the observational approach. An observational setting, namely random data generation, is assumed in most data-mining applications. Typically, the sampling distribution is completely unknown after the data are collected, or it is only partially and implicitly given within the data-collection procedure. It is important, however, to understand how data collection affects the theoretical distribution, since such prior knowledge is very useful for modeling and, later, for the final interpretation of results. It is also important to make sure that the data used for estimating a model and the data used later for testing and applying the model come from the same, unknown, sampling distribution; a short sketch of such a split is given below. If this is not the case, the estimated model cannot be successfully used in the final application of the results.
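As an illustration of the last point, the following minimal sketch (assuming Python with scikit-learn and its built-in iris toy dataset, which are not part of this assignment) splits one collected sample into training and test parts so that both come from the same underlying sampling distribution.

```python
# Minimal sketch: training and test data drawn from the same collected sample,
# so both share the same (unknown) sampling distribution.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Random, stratified split of one collected sample into train and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```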
3. Data Preprocessing
In the observational setting, data are usually “collected” from existing databases, data warehouses, and data marts. Data preprocessing usually includes at least two common tasks (see the data cleaning and data transformation steps described later in this assignment).
4. Estimate model
The selection and implementation of the appropriate data-mining technique is the main task in this phase. This process is not straightforward; usually, in practice, the implementation is based on several models, and selecting the best one is an additional task, as sketched below.
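The following minimal sketch (assuming Python with scikit-learn and its iris toy dataset; the candidate models are chosen only for illustration) estimates several models and selects the best one by cross-validated accuracy.

```python
# Minimal sketch: estimate several candidate models and keep the best performer.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

candidates = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "knn": KNeighborsClassifier(n_neighbors=5),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

# Estimate each model with 5-fold cross-validation and select the best one.
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}
best = max(scores, key=scores.get)
print(scores, "-> best model:", best)
```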
Online Transaction Processing (OLTP) vs Data Warehousing:
OLTP deals with detailed day-to-day transaction data that keeps changing every day, whereas Data Warehousing gathers and collects data from different sources into a central repository.
OLTP is designed for business transaction processing, whereas a data warehouse is designed for the decision-making process.
OLTP holds current data, whereas a data warehouse stores large amounts of data or historical data.
OLTP is used for running the business, whereas a data warehouse is used for analyzing the business.
In Online Transaction Processing, the size of the database is around 10 MB-100 GB, whereas in Data Warehousing the size of the database is around 100 GB-2 TB.
In Online Transaction Processing there is no data redundancy, whereas in Data Warehousing data redundancy is present.
Real-world data are generally incomplete: missing attribute values, missing certain attributes of importance, or having only aggregate data.
They are noisy: containing errors or outliers.
They are inconsistent: containing discrepancies in codes or names.
Data Cleaning:
The data can have many irrelevant and missing parts. To handle this, data cleaning is performed. It involves handling missing data, noisy data, etc.
(a). Missing Data:
This situation arises when some values are missing in the data. It can be handled in various ways.
i) Ignore the tuples: This approach is suitable only when the dataset we have is quite large and multiple values are missing within a tuple.
ii) Fill the missing values: There are various ways to do this task. You can choose to fill the missing values manually, by the attribute mean, or by the most probable value; a short sketch follows below.
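A minimal sketch of both options (assuming Python with pandas and NumPy; the small table of ages, incomes, and cities is invented purely for illustration):

```python
# Minimal sketch: ignore tuples with missing values, or fill missing values
# with the attribute mean / most probable value.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age":    [25, np.nan, 47, 51, np.nan],
    "income": [40000, 52000, np.nan, 61000, 58000],
    "city":   ["Pune", "Mumbai", None, "Pune", "Pune"],
})

dropped = df.dropna()                                             # i) ignore the tuples
filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].mean())        # ii) fill with attribute mean
filled["income"] = filled["income"].fillna(filled["income"].mean())
filled["city"] = filled["city"].fillna(filled["city"].mode()[0])  # most probable value
print(dropped, filled, sep="\n\n")
```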
(b). Noisy Data:
i) Binning method: This method works on sorted data in order to smooth it. The whole data is divided into segments of equal size, and then various methods are performed to complete the task. Each segment is handled separately. One can replace all data in a segment by its mean, or boundary values can be used to complete the task (see the sketch after this list).
ii) Regression: Here data can be made smooth by fitting it to a regression function. The regression used may be linear (having one independent variable) or multiple (having multiple independent variables).
iii) Clustering: This approach groups similar data into clusters. The outliers may go undetected, or they will fall outside the clusters.
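A minimal sketch of the binning method (assuming Python with NumPy; the list of prices is invented for illustration), where every value in a segment is replaced by the segment mean:

```python
# Minimal sketch: smoothing by bin means on equal-size segments of sorted data.
import numpy as np

prices = np.array([15, 21, 24, 8, 28, 4, 34, 25, 9])
sorted_prices = np.sort(prices)      # binning works on sorted data
bins = sorted_prices.reshape(3, 3)   # 3 segments of equal size

# Replace every value in a segment by the segment mean.
smoothed = np.repeat(bins.mean(axis=1), bins.shape[1])
print(sorted_prices)   # [ 4  8  9 15 21 24 25 28 34]
print(smoothed)        # [ 7.  7.  7. 20. 20. 20. 29. 29. 29.]
```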
2. Data Transformation:
This step is taken in order to transform the data into forms suitable for the mining process. This involves the following ways:
i) Normalization: It is done in order to scale the data values into a specified range (-1.0 to 1.0 or 0.0 to 1.0); see the sketch after this list.
ii) Attribute Selection: New attributes are constructed from the given set of attributes to help the mining process.
iii) Discretization: This is done to replace the raw values of a numeric attribute by interval levels or conceptual levels.
iv) Concept Hierarchy Generation: Here attributes are converted from a lower level to a higher level in the hierarchy.
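A minimal sketch of normalization (assuming Python with scikit-learn; the attribute values are invented for illustration), scaling values into the range 0.0 to 1.0:

```python
# Minimal sketch: min-max normalization into the range [0.0, 1.0].
import numpy as np
from sklearn.preprocessing import MinMaxScaler

values = np.array([[200.0], [300.0], [400.0], [600.0], [1000.0]])
scaled = MinMaxScaler(feature_range=(0.0, 1.0)).fit_transform(values)
print(scaled.ravel())   # [0.    0.125 0.25  0.5   1.   ]
```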
3. Data Reduction:
Data mining is a technique that is used to handle huge amounts of data. While working with a huge volume of data, analysis becomes harder. To deal with this, we use data reduction techniques, which aim to increase storage efficiency and reduce data storage and analysis costs.
The various steps of data reduction are:
i) Data Cube Aggregation: Aggregation operations are applied to the data for the construction of the data cube.
ii) Attribute Subset Selection: Only the highly relevant attributes should be used; the rest can be discarded. For performing attribute selection, one can use the level of significance and the p-value of the attribute. An attribute having a p-value greater than the significance level can be discarded.
iii) Numerosity Reduction: This enables storing a model of the data instead of the whole data, for example regression models.
iv) Dimensionality Reduction: This reduces the size of the data by encoding mechanisms. It can be lossy or lossless. If the original data can be retrieved after reconstruction from the compressed data, the reduction is called lossless; otherwise it is called lossy. Two effective methods of dimensionality reduction are wavelet transforms and PCA (Principal Component Analysis); a short sketch using PCA follows below.
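A minimal sketch of dimensionality reduction with PCA (assuming Python with scikit-learn and its iris toy dataset), encoding four attributes into two principal components:

```python
# Minimal sketch: lossy dimensionality reduction with PCA.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)   # (150, 4) -> (150, 2)
print(pca.explained_variance_ratio_)    # variance kept by each component
```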
i) Discretization: Reduce the number of values for a given continuous attribute by dividing the range of the continuous attribute into intervals. Interval labels can then be used to replace the actual data values.
ii) Concept Hierarchies: Reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) by higher-level concepts (such as young, middle-aged, or senior). A short sketch of both steps is given below.
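A minimal sketch of discretization and a simple concept hierarchy (assuming Python with pandas; the ages and interval boundaries are invented for illustration):

```python
# Minimal sketch: replace raw numeric ages by interval labels, then by
# higher-level concepts (young, middle-aged, senior).
import pandas as pd

ages = pd.Series([19, 23, 31, 45, 52, 67, 70])

# i) Discretization: divide the continuous range into intervals.
intervals = pd.cut(ages, bins=[0, 30, 60, 100])

# ii) Concept hierarchy: replace intervals by higher-level concepts.
concepts = pd.cut(ages, bins=[0, 30, 60, 100],
                  labels=["young", "middle-aged", "senior"])
print(pd.DataFrame({"age": ages, "interval": intervals, "concept": concepts}))
```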