Module_III_data_mining
Methods of data reduction:
These are explained below.
1. Data Cube Aggregation:
This technique aggregates data into a simpler form. For example, suppose the data you gathered
for your analysis covers the years 2012 to 2014 and records the revenue of your company every
three months. If you are interested in the annual sales rather than the quarterly figures, the
data can be summarized so that the result gives the total sales per year instead of per quarter.
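As a minimal sketch of the aggregation described above, the following Python snippet rolls quarterly revenue up to yearly totals. The revenue figures and the names `quarterly_sales` and `aggregate_by_year` are made up for illustration and are not part of the notes.

```python
from collections import defaultdict

# Hypothetical quarterly revenue records: (year, quarter, revenue).
# The figures are invented purely for illustration.
quarterly_sales = [
    (2012, "Q1", 224), (2012, "Q2", 408), (2012, "Q3", 350), (2012, "Q4", 586),
    (2013, "Q1", 300), (2013, "Q2", 420), (2013, "Q3", 380), (2013, "Q4", 600),
    (2014, "Q1", 310), (2014, "Q2", 450), (2014, "Q3", 400), (2014, "Q4", 640),
]


def aggregate_by_year(records):
    """Sum the revenue of all quarters belonging to the same year."""
    totals = defaultdict(int)
    for year, _quarter, revenue in records:
        totals[year] += revenue
    return dict(totals)


annual_sales = aggregate_by_year(quarterly_sales)
print(annual_sales)
```

Twelve quarterly rows are reduced to three yearly rows, which is the size reduction the technique aims for.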
2. Dimension reduction:
Whenever we come across attributes that are only weakly important, we keep only the attributes
required for our analysis. This reduces data size by eliminating outdated or redundant features.
• Step-wise Forward Selection –
The selection begins with an empty set of attributes. At each step, the best of the remaining
original attributes is added to the set, based on its relevance. In statistics, this relevance is
commonly judged by a p-value.
Suppose the data set has the following attributes, a few of which are redundant.
Initial attribute Set: {X1, X2, X3, X4, X5, X6}
Initial reduced attribute set: { }
Step-1: {X1}
Step-2: {X1, X2}
Step-3: {X1, X2, X5}
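The steps above can be sketched as a greedy loop. This is only an illustration under stated assumptions: the `relevance` values, the per-attribute cost, and the `toy_score` function are hypothetical stand-ins for a real relevance measure (such as a significance test), chosen so that the run reproduces the three steps listed above.

```python
def forward_selection(attributes, score, min_gain=0.0):
    """Greedy step-wise forward selection.

    Starts from an empty set and repeatedly adds the attribute that
    improves `score` the most, stopping when no attribute helps.
    """
    selected = []
    steps = []
    remaining = list(attributes)
    best_score = score(selected)
    while remaining:
        # Score every candidate when added to the current set.
        candidates = [(score(selected + [a]), a) for a in remaining]
        top_score, top_attr = max(candidates, key=lambda pair: pair[0])
        if top_score - best_score <= min_gain:
            break  # no remaining attribute improves the score
        selected.append(top_attr)
        remaining.remove(top_attr)
        best_score = top_score
        steps.append(list(selected))
    return selected, steps


# Hypothetical relevance of each attribute (assumption, not from the notes).
relevance = {"X1": 0.9, "X2": 0.7, "X3": 0.05, "X4": 0.02, "X5": 0.5, "X6": 0.0}
COST_PER_ATTRIBUTE = 0.1  # hypothetical penalty for each attribute kept


def toy_score(subset):
    """Toy criterion: total relevance minus a cost per selected attribute."""
    return sum(relevance[a] for a in subset) - COST_PER_ATTRIBUTE * len(subset)


selected, steps = forward_selection(["X1", "X2", "X3", "X4", "X5", "X6"], toy_score)
for number, attribute_set in enumerate(steps, start=1):
    print(f"Step-{number}: {attribute_set}")
```

In practice the scoring function would be replaced by whatever relevance test the analysis uses. The loop itself stays the same.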
Binning: Binning is a data smoothing technique that groups a large number of continuous values
into a smaller number of bins. For example, if we have data about a group of students, we can
arrange their marks into a smaller number of mark intervals by making bins of grades: one bin
for grade A, one for grade B, one for C, one for D, and one for grade F.
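A small sketch of the grade example follows. The grade boundaries in `GRADE_BINS` and the sample marks are assumptions made for illustration, since the notes do not specify the intervals.

```python
from collections import Counter

# Hypothetical lower bounds for each grade bin (assumption, not from the notes).
# Listed from highest to lowest so the first match wins.
GRADE_BINS = [(90, "A"), (80, "B"), (70, "C"), (60, "D"), (0, "F")]


def grade_bin(mark):
    """Return the grade bin whose interval contains the given mark."""
    for lower_bound, grade in GRADE_BINS:
        if mark >= lower_bound:
            return grade
    raise ValueError(f"mark {mark} is below every bin")


marks = [95, 82, 67, 74, 58, 88, 91, 45]  # made-up sample marks
binned = [grade_bin(mark) for mark in marks]
bin_counts = Counter(binned)

print(binned)
print(dict(bin_counts))
```

Eight distinct marks collapse into five bins, so later analysis works with a handful of categories instead of raw scores.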
Bottom-up mapping
Bottom-up mapping starts from the bottom with specialized concepts and moves up to the
generalized concept.
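As an illustration of moving from specialized to generalized concepts, the sketch below climbs a small location hierarchy (city, then state, then country). The hierarchy in `PARENT` and the helper names are hypothetical examples, not taken from the notes.

```python
# Hypothetical concept hierarchy: each concept maps to its more general parent.
PARENT = {
    "Mumbai": "Maharashtra",
    "Pune": "Maharashtra",
    "Chennai": "Tamil Nadu",
    "Maharashtra": "India",
    "Tamil Nadu": "India",
}


def generalize(concept, levels=1):
    """Move a concept up the hierarchy by the given number of levels."""
    for _ in range(levels):
        if concept not in PARENT:
            break  # already at the most general concept
        concept = PARENT[concept]
    return concept


def path_to_top(concept):
    """List the concepts visited from the specialized value to the top."""
    path = [concept]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


print(path_to_top("Pune"))
print([generalize(city) for city in ["Mumbai", "Pune", "Chennai"]])
```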
To find a numerical output, prediction is used. The training dataset contains the inputs and
numerical output values. According to the training dataset, the algorithm generates a model or
predictor. When fresh data is provided, the model should find a numerical output. This approach,
unlike classification, does not have a class label. A continuous-valued function or ordered value is
predicted by the model.
In most cases, regression is utilized to make predictions. For example: Predicting the worth of a
home based on facts like the number of rooms, total area, and so on.
Consider the following scenario: A marketing manager needs to forecast how much a specific
consumer will spend during a sale. In this scenario, we need to forecast a numerical value,
so a model or predictor that forecasts a continuous-valued function or ordered value will be
built.
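To make the idea of a predictor concrete, here is a minimal sketch of simple linear regression fitted by least squares on a single input. The training data (house area against price) is invented for illustration, and the helper names `fit_line` and `predict` are assumptions. A real model would typically use several inputs, such as the number of rooms.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) model to a new input."""
    slope, intercept = model
    return slope * x + intercept


# Made-up training data: total area (inputs) and price (numerical outputs).
areas = [50, 70, 90, 110]
prices = [150, 200, 250, 300]

model = fit_line(areas, prices)
print(predict(model, 100))  # predicted price for an unseen area
```

The output is a continuous value rather than a class label, which is what separates prediction from classification.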
Prediction Issues:
Preparing the data for prediction is the most pressing challenge. The following activities are
involved in data preparation:
• Data Cleaning: Cleaning data includes reducing noise and treating missing values. Smoothing
techniques remove noise, and the problem of missing values is solved by replacing a missing
value with the most frequently occurring value for that attribute.
• Relevance Analysis: Irrelevant attributes may also be present in the database. Correlation
analysis is used to determine whether two attributes are related.
• Data Transformation and Reduction: Any of the methods listed below can be used to transform
the data.
• Normalization: Normalization is the process of scaling all values for a given attribute
so that they lie within a narrow range. It is performed when neural networks or methods
involving distance measurements are used in the learning process.
• Generalization: The data can also be transformed by generalizing it to a higher-level
concept. Concept hierarchies can be used for this.
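The preparation steps listed above can be sketched in a few lines of Python. This is an illustrative sketch under assumptions: min-max scaling is used as one common form of normalization, Pearson correlation as one common correlation measure, and the function names and sample values are made up.

```python
from collections import Counter
from math import sqrt


def fill_missing_with_mode(values):
    """Data cleaning: replace missing entries (None) with the most frequent value."""
    present = [v for v in values if v is not None]
    mode = Counter(present).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


def pearson(xs, ys):
    """Relevance analysis: Pearson correlation between two numeric attributes."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Normalization: scale values linearly into [new_min, new_max]."""
    low, high = min(values), max(values)
    if high == low:
        return [new_min for _ in values]  # constant attribute, nothing to scale
    scale = (new_max - new_min) / (high - low)
    return [new_min + (v - low) * scale for v in values]


colours = fill_missing_with_mode(["red", None, "blue", "red", None])
correlation = pearson([1, 2, 3], [2, 4, 6])
scaled = min_max_normalize([10, 20, 30])

print(colours)
print(correlation)
print(scaled)
```

A correlation close to 1 or -1 suggests that one of the two attributes is redundant and can be dropped before building the predictor.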
What is a Prediction?
Prediction is the second form of data mining analysis, alongside classification. It is used to find a numerical output.
As in classification, the training dataset contains the inputs and the corresponding numerical output values. Based on
the training dataset, the algorithm derives a model or predictor.
When new data is given, the model should produce a numerical output. Unlike classification, this
method does not have a class label. The model predicts a continuous-valued function or ordered value.
Regression is used for prediction in most cases. Predicting the price of a house based on facts such as the
number of rooms, the total area, and so on is an example of prediction. Similarly, an organization may want to
predict the amount of money a customer will spend during a sale.