
Module_III_data_mining

The document discusses various methods of data reduction, including data cube aggregation, dimension reduction, and data discretization, which help simplify and manage large datasets. It also explains the data mining process, outlining steps from understanding the business and data to implementing changes based on analysis results. Additionally, it highlights the differences between classification and prediction in data mining, emphasizing their respective roles in identifying groups and estimating numerical outputs.


Unit III

Methods of data reduction:
These are explained below.
1. Data Cube Aggregation:
This technique aggregates data into a simpler form. For example, imagine that the data you
gathered for your analysis for the years 2012 to 2014 includes your company's revenue every
three months. If the analysis concerns annual sales rather than quarterly figures, the data
can be summarized so that the result shows total sales per year instead of per quarter,
reducing the volume of data without losing the information needed.
2. Dimension reduction:
Whenever we come across data that is only weakly relevant, we keep just the attributes required
for our analysis. This reduces data size by eliminating outdated or redundant features.
• Step-wise Forward Selection –
The selection begins with an empty set of attributes. At each step, we add the best of the
remaining original attributes, judged by its relevance to the analysis (in statistics, this
relevance is often measured with a p-value).
Suppose the data set has the following attributes, of which a few are redundant.
Initial attribute Set: {X1, X2, X3, X4, X5, X6}
Initial reduced attribute set: { }

Step-1: {X1}
Step-2: {X1, X2}
Step-3: {X1, X2, X5}

Final reduced attribute set: {X1, X2, X5}


• Step-wise Backward Selection –
This selection starts with the complete set of attributes in the original data and, at each
step, eliminates the worst remaining attribute from the set.
Suppose the data set has the following attributes, of which a few are redundant.
Initial attribute Set: {X1, X2, X3, X4, X5, X6}
Initial reduced attribute set: {X1, X2, X3, X4, X5, X6 }

Step-1: {X1, X2, X3, X4, X5}


Step-2: {X1, X2, X3, X5}
Step-3: {X1, X2, X5}

Final reduced attribute set: {X1, X2, X5}
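As a minimal sketch, both selection strategies above can be expressed in a few lines of Python. The relevance scores here are invented for illustration; in practice they would come from a statistical test such as a p-value.

```python
# Sketch of step-wise forward and backward selection over attribute
# relevance scores (the score values below are made-up assumptions).

def forward_selection(scores, k):
    """Greedily add the k highest-scoring attributes, one per step."""
    selected = []
    remaining = dict(scores)
    for _ in range(k):
        best = max(remaining, key=remaining.get)  # best remaining attribute
        selected.append(best)
        del remaining[best]
    return selected

def backward_selection(scores, k):
    """Start with all attributes and drop the worst one per step until k remain."""
    remaining = dict(scores)
    while len(remaining) > k:
        worst = min(remaining, key=remaining.get)  # worst remaining attribute
        del remaining[worst]
    return sorted(remaining, key=scores.get, reverse=True)

scores = {"X1": 0.9, "X2": 0.8, "X3": 0.2, "X4": 0.1, "X5": 0.7, "X6": 0.3}
print(forward_selection(scores, 3))   # ['X1', 'X2', 'X5']
print(backward_selection(scores, 3))  # ['X1', 'X2', 'X5']
```

Both directions arrive at the same reduced set {X1, X2, X5} here, matching the worked example, though in general the two strategies can disagree.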


Data discretization converts a large number of data values into a smaller number of values, so
that data evaluation and data management become much easier.
Data discretization example

we have an attribute of age with the following values.


Age: 10, 11, 13, 14, 17, 19, 30, 31, 32, 38, 40, 42, 70, 72, 73, 75

Table: Age before and after discretization
Age values           | 10, 11, 13, 14, 17, 19 | 30, 31, 32, 38, 40, 42 | 70, 72, 73, 75
After discretization | Young                  | Mature                 | Old
Another example is website visitor data, which can be discretized into the visitors'
countries.
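The age example above can be sketched directly in code. The cut-points at 30 and 70 are read off the table; treat them as assumptions to be adjusted for your own data.

```python
def discretize_age(age):
    """Map a raw age to a discrete label using the cut-points from the
    table above (the boundaries 30 and 70 are illustrative assumptions)."""
    if age < 30:
        return "Young"
    elif age < 70:
        return "Mature"
    else:
        return "Old"

ages = [10, 11, 13, 14, 17, 19, 30, 31, 32, 38, 40, 42, 70, 72, 73, 75]
labels = [discretize_age(a) for a in ages]
print(labels)  # six 'Young', six 'Mature', four 'Old'
```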

What are some famous techniques of data discretization?


1. Histogram analysis: A histogram is a plot that presents the underlying frequency
distribution of a set of continuous data. It helps inspect the shape of the
distribution, for example normality, outliers, and skewness.

2. Binning: Binning is a data smoothing technique that helps group a huge number of
continuous values into a smaller number of bins. For example, if we have data about a group of
students, we can arrange their marks into a smaller number of intervals by making one bin per
grade: one for grade A, one for B, one for C, one for D, and one for F.

3. Cluster analysis: Cluster analysis is commonly known as clustering. Clustering
is the task of grouping similar objects into one group, commonly called a cluster;
dissimilar objects are placed in different clusters.
4. Decision tree analysis
5. Equal-width partitioning
6. Equal-depth partitioning
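The last two techniques in the list can be sketched as follows, assuming unsupervised partitioning of a small invented data set: equal-width splits the value range into intervals of equal size, while equal-depth puts (roughly) the same number of values in each bin.

```python
def equal_width_bins(values, k):
    """Split the value range into k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in values:
        i = min(int((v - lo) / width), k - 1)  # clamp the maximum into the last bin
        bins[i].append(v)
    return bins

def equal_depth_bins(values, k):
    """Split the sorted values into k bins holding (roughly) the same count."""
    s = sorted(values)
    n = len(s)
    return [s[i * n // k:(i + 1) * n // k] for i in range(k)]

data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(equal_width_bins(data, 3))  # bins of width 10: [4,14), [14,24), [24,34]
print(equal_depth_bins(data, 3))  # three bins of four values each
```

Note how the equal-width bins end up with very different counts (3, 3, and 6 values), which is exactly the skew equal-depth partitioning avoids.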

Data discretization and concept hierarchy generation

A concept hierarchy represents a sequence of mappings between a set of low-level, specialized
concepts and higher-level, more general concepts. The mapping can run in either direction:
top-down or bottom-up.
Let’s see an example of a concept hierarchy for the dimension location.
Each city can be mapped with the country to which the given city belongs. For example,
Mianwali can be mapped to Pakistan and Pakistan can be mapped to Asia.
Top-down mapping
Top-down mapping starts from the top with general concepts and moves to the bottom to
the specialized concepts.
Bottom-up mapping
Bottom-up mapping starts from the Bottom with specialized concepts and moves to the top
to the generalized concepts.
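A minimal sketch of the location hierarchy described above, as nested lookup tables (the extra cities besides Mianwali are illustrative assumptions):

```python
# Concept hierarchy for the "location" dimension:
# city (specialized) -> country -> continent (general).
city_to_country = {"Mianwali": "Pakistan", "Karachi": "Pakistan", "Delhi": "India"}
country_to_continent = {"Pakistan": "Asia", "India": "Asia"}

def generalize(city):
    """Bottom-up mapping: climb from the specialized concept (city)
    to the more general concepts (country, then continent)."""
    country = city_to_country[city]
    continent = country_to_continent[country]
    return city, country, continent

print(generalize("Mianwali"))  # ('Mianwali', 'Pakistan', 'Asia')
```

Top-down mapping would simply invert these tables, walking from a continent down to its countries and cities.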


What Is Data Mining?


Data mining is a process used by companies to turn raw data into useful information. By
using software to look for patterns in large batches of data, businesses can learn more
about their customers to develop more effective marketing strategies, increase sales and
decrease costs. Data mining depends on effective data collection, warehousing, and
computer processing.

How Data Mining Works


Data mining involves exploring and analyzing large blocks of information to glean
meaningful patterns and trends. It can be used in a variety of ways, such as database
marketing, credit risk management, fraud detection, spam Email filtering, or even to
discern the sentiment or opinion of users.
The data mining process breaks down into five steps. First, organizations collect data and
load it into their data warehouses. Next, they store and manage the data, either on in-
house servers or the cloud. Business analysts, management teams, and information
technology professionals access the data and determine how they want to organize it.
Then, application software sorts the data based on the user's results, and finally, the end-
user presents the data in an easy-to-share format, such as a graph or table.

Data Mining Techniques


Data mining uses algorithms and various techniques to convert large collections of data
into useful output. The most popular types of data mining techniques include:

• Association rules, also referred to as market basket analysis, search for


relationships between variables. This relationship in itself creates additional value
within the data set as it strives to link pieces of data. For example, association rules
would search a company's sales history to see which products are most commonly
purchased together; with this information, stores can plan, promote, and forecast
accordingly.
• Classification assigns objects to predefined classes. These classes describe
characteristics of items or represent what the data points have in common with
each other. This data mining technique allows the underlying data to be more neatly
categorized and summarized across similar features or product lines.
• Clustering is similar to classification. However, clustering identifies similarities
between objects, then groups those items based on what makes them different from
other items. While classification may result in groups such as "shampoo",
"conditioner", "soap", and "toothpaste", clustering may identify groups such as "hair
care" and "dental health".
• Decision trees are used to classify or predict an outcome based on a set list of
criteria or decisions. A decision tree is used to ask for input of a series of cascading
questions that sort the dataset based on responses given. Sometimes depicted as a
tree-like visual, a decision tree allows for specific direction and user input when
drilling deeper into the data.
• K-Nearest Neighbor (KNN) is an algorithm that classifies data based on its proximity
to other data. The basis for KNN is the assumption that data points close to each
other are more similar to each other than to more distant points. This
non-parametric, supervised technique is used to predict features of a group based on
individual data points.
• Neural networks process data through the use of nodes. Each node is composed
of inputs, weights, and an output. Data is mapped through supervised learning
(similar to how the human brain is interconnected). Threshold values can be set
to assess a model's accuracy.
• Predictive analysis strives to leverage historical information to build graphical or
mathematical models to forecast future outcomes. Overlapping with regression
analysis, this data mining technique aims at estimating an unknown future figure
based on the data currently on hand.
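The KNN idea described above can be illustrated in a few lines: label a new point by a majority vote of its k closest labelled points. The points and labels here are invented toy data.

```python
# Toy sketch of K-Nearest Neighbor classification (made-up 2-D data).
from collections import Counter
import math

def knn_classify(points, labels, query, k=3):
    """Classify `query` by the majority label among its k nearest points."""
    nearest = sorted(
        range(len(points)),
        key=lambda i: math.dist(points[i], query),  # Euclidean distance
    )
    votes = Counter(labels[i] for i in nearest[:k])
    return votes.most_common(1)[0][0]

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = ["A", "A", "A", "B", "B", "B"]
print(knn_classify(points, labels, (2, 2)))  # 'A'
print(knn_classify(points, labels, (8, 7)))  # 'B'
```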

The Data Mining Process
To be most effective, data analysts generally follow a certain flow of tasks along the data
mining process. Without this structure, an analyst may encounter an issue in the middle of
their analysis that could have easily been prevented had they prepared for it earlier. The
data mining process is usually broken into the following steps.

Step 1: Understand the Business


Before any data is touched, extracted, cleaned, or analyzed, it is important to understand
the underlying entity and the project at hand. What are the goals the company is trying to
achieve by mining data? What is their current business situation? What are the findings of
a SWOT analysis? Before looking at any data, the mining process starts by understanding
what will define success at the end of the process.

Step 2: Understand the Data


Once the business problem has been clearly defined, it's time to start thinking about data.
This includes what sources are available, how the data will be securely stored, how information
will be gathered, and what the final outcome or analysis may look like. This step also
critically considers what limits there are to data, storage, security, and collection and
assesses how these constraints will impact the data mining process.

Step 3: Prepare the Data


It's now time to get our hands on information. Data is gathered, uploaded, extracted, or
calculated. It is then cleaned, standardized, scrubbed for outliers, assessed for mistakes,
and checked for reasonableness. During this stage of data mining, the data may also be
checked for size as an overbearing collection of information may unnecessarily slow
computations and analysis.

Step 4: Build the Model


With our clean data set in hand, it's time to crunch the numbers. Data scientists use the
types of data mining above to search for relationships, trends, associations, or sequential
patterns. The data may also be fed into predictive models to assess how previous bits of
information may translate into future outcomes.

Step 5: Evaluate the Results


The data-centered aspect of data mining concludes by assessing the findings of the data
model(s). The outcomes from the analysis may be aggregated, interpreted, and presented
to decision-makers that have largely been excluded from the data mining process to this
point. In this step, organizations can choose to make decisions based on the findings.

Step 6: Implement Change and Monitor


The data mining process concludes with management taking steps in response to the
findings of the analysis. The company may decide the information was not strong enough
or the findings were not relevant to change course. Alternatively, the company may
strategically pivot based on the findings. In either case, management reviews the ultimate
business impact and restarts the data mining loop by identifying new
business problems or opportunities.
Module 4

To find a numerical output, prediction is used. The training dataset contains the inputs and
numerical output values. According to the training dataset, the algorithm generates a model or
predictor. When fresh data is provided, the model should find a numerical output. This approach,
unlike classification, does not have a class label. A continuous-valued function or ordered value is
predicted by the model.
In most cases, regression is utilized to make predictions. For example: predicting the worth of a
home based on features like the number of rooms, total area, and so on.
Consider the following scenario: a marketing manager needs to forecast how much a specific
consumer will spend during a sale. In this scenario, we are required to forecast a numerical value.
In this situation, a model or predictor that forecasts a continuous or ordered value function will be
built.
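As a minimal sketch of prediction via regression, a least-squares line can be fitted to a tiny training set and then used on fresh data. The areas and prices below are invented figures, deliberately chosen to be exactly linear.

```python
# Sketch of prediction with simple linear regression: fit price = a*area + b
# on a small made-up training set, then predict for an unseen area.
def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return a, b

areas = [50, 80, 100, 120]     # total area (training inputs)
prices = [100, 160, 200, 240]  # price (numerical outputs; here exactly 2 * area)
a, b = fit_line(areas, prices)
print(a * 90 + b)  # predicted price for an unseen 90-unit home -> 180.0
```

Unlike a classifier, the fitted predictor returns a continuous value rather than a class label, which is exactly the distinction the text draws.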

Prediction Issues:

Preparing the data for prediction is the most pressing challenge. The following activities are
involved in data preparation:
• Data Cleaning: Cleaning the data includes reducing noise and treating missing values. Smoothing
techniques remove noise, and missing values are handled by replacing a missing
value with the most commonly occurring value for that attribute.
• Relevance Analysis: The irrelevant attributes may also be present in the database. The
correlation analysis method is used to determine whether two attributes are connected.
• Data Transformation and Reduction: Any of the methods listed below can be used to transform
the data.
• Normalization: Normalization is used to transform the data. Normalization is the
process of scaling all values for a given attribute so that they lie within a narrow
range. When neural networks or methods requiring measurements are utilized in the
learning process, normalization is performed.
• Generalization: The data can also be transformed by generalizing it to higher-level
concepts. We can use concept hierarchies for this.
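The normalization step above can be sketched with min-max scaling, which maps every value of an attribute into [0, 1]. The raw values are illustrative assumptions.

```python
# Sketch of min-max normalization: rescale an attribute into [0, 1].
def min_max_normalize(values):
    """Scale each value linearly so min -> 0.0 and max -> 1.0."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

incomes = [20, 35, 50, 80]  # made-up raw attribute values
print(min_max_normalize(incomes))  # [0.0, 0.25, 0.5, 1.0]
```

Scaling like this keeps any one attribute from dominating distance-based learners such as neural networks or KNN, which is why the text says normalization matters when measurement-based methods are used.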

What is a Prediction?
The second way to operate on data in mining is prediction. As in classification, the training
data set holds the inputs together with the corresponding numerical output values. Based on the
training data set, the algorithm derives a model or predictor.

When new data is given, the model should produce a numerical output. Unlike classification, this
procedure does not have a class label. The model estimates a continuous-valued function or an
ordered value.

Regression is used for prediction in most cases. Predicting the price of a house based on
attributes such as the number of rooms and the total area is an illustration of prediction; an
organization can likewise estimate the amount of money a customer will spend during a sale.

• Determines missing or unknown values in a data set.
• A model (predictor) is built to estimate the outcome.
• Does not depend on a class label.
• Predictions are made using both regression and classification models.

3) Difference between Classification & Prediction.


• Classification is the method of identifying the group to which a new observation belongs, on the basis of a training
data set containing observations whose group membership is known.
• Prediction is the method of identifying the missing or unavailable numerical value for a new observation.
• A classifier is built to assign categorical class labels.
• A predictor is built to estimate a continuous or ordered value.
• In classification, accuracy depends on detecting the class label correctly.
• In prediction, accuracy depends on how well a given predictor can guess the value of the predicted attribute
for new data.
• In classification, the model can be called the classifier.
• In prediction, the model can be called the predictor.
