
Module_III_data_mining

The document discusses various methods of data reduction, including data cube aggregation, dimension reduction, and data discretization, which help simplify and manage large datasets. It also explains the data mining process, outlining steps from understanding the business and data to implementing changes based on analysis results. Additionally, it highlights the differences between classification and prediction in data mining, emphasizing their respective roles in identifying groups and estimating numerical outputs.


Unit III

Methods of data reduction:
These are explained below.
1. Data Cube Aggregation:
This technique aggregates data into a simpler form. For example, imagine that the data you
gathered for your analysis for the years 2012 to 2014 includes your company's revenue every
three months. If the analysis concerns annual sales rather than quarterly figures, the data
can be summarized so that the result shows total sales per year instead of per quarter,
reducing the volume of data without losing the information needed.
2. Dimension reduction:
Whenever we come across data that is only weakly relevant, we keep just the attributes required
for our analysis. This reduces data size by eliminating outdated or redundant features.
• Step-wise Forward Selection –
The selection begins with an empty set of attributes. At each step, we add the best of the
remaining original attributes, judged by its relevance to the analysis (in statistics, this
relevance is often measured with a p-value).
Suppose the data set has the following attributes, of which a few are redundant.
Initial attribute Set: {X1, X2, X3, X4, X5, X6}
Initial reduced attribute set: { }

Step-1: {X1}
Step-2: {X1, X2}
Step-3: {X1, X2, X5}

Final reduced attribute set: {X1, X2, X5}


• Step-wise Backward Selection –
This selection starts with the complete set of attributes in the original data and, at each
step, eliminates the worst remaining attribute from the set.
Suppose the data set has the following attributes, of which a few are redundant.
Initial attribute Set: {X1, X2, X3, X4, X5, X6}
Initial reduced attribute set: {X1, X2, X3, X4, X5, X6 }

Step-1: {X1, X2, X3, X4, X5}


Step-2: {X1, X2, X3, X5}
Step-3: {X1, X2, X5}

Final reduced attribute set: {X1, X2, X5}
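As a minimal sketch, both selection strategies above can be expressed in a few lines of Python. The relevance scores here are invented for illustration; in practice they would come from a statistical test such as a p-value.

```python
# Sketch of step-wise forward and backward selection over attribute
# relevance scores (the score values below are made-up assumptions).

def forward_selection(scores, k):
    """Greedily add the k highest-scoring attributes, one per step."""
    selected = []
    remaining = dict(scores)
    for _ in range(k):
        best = max(remaining, key=remaining.get)  # best remaining attribute
        selected.append(best)
        del remaining[best]
    return selected

def backward_selection(scores, k):
    """Start with all attributes and drop the worst one per step until k remain."""
    remaining = dict(scores)
    while len(remaining) > k:
        worst = min(remaining, key=remaining.get)  # worst remaining attribute
        del remaining[worst]
    return sorted(remaining, key=scores.get, reverse=True)

scores = {"X1": 0.9, "X2": 0.8, "X3": 0.2, "X4": 0.1, "X5": 0.7, "X6": 0.3}
print(forward_selection(scores, 3))   # ['X1', 'X2', 'X5']
print(backward_selection(scores, 3))  # ['X1', 'X2', 'X5']
```

Both directions arrive at the same reduced set {X1, X2, X5} here, matching the worked example, though in general the two strategies can disagree.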


Data discretization converts a large number of data values into a smaller number of values, so
that data evaluation and data management become much easier.
Data discretization example

we have an attribute of age with the following values.


Age: 10, 11, 13, 14, 17, 19, 30, 31, 32, 38, 40, 42, 70, 72, 73, 75

Table: Age before and after discretization
Age values           | 10, 11, 13, 14, 17, 19 | 30, 31, 32, 38, 40, 42 | 70, 72, 73, 75
After discretization | Young                  | Mature                 | Old
Another example is website visitor data, which can be discretized into the visitors'
countries.
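The age example above can be sketched directly in code. The cut-points at 30 and 70 are read off the table; treat them as assumptions to be adjusted for your own data.

```python
def discretize_age(age):
    """Map a raw age to a discrete label using the cut-points from the
    table above (the boundaries 30 and 70 are illustrative assumptions)."""
    if age < 30:
        return "Young"
    elif age < 70:
        return "Mature"
    else:
        return "Old"

ages = [10, 11, 13, 14, 17, 19, 30, 31, 32, 38, 40, 42, 70, 72, 73, 75]
labels = [discretize_age(a) for a in ages]
print(labels)  # six 'Young', six 'Mature', four 'Old'
```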

What are some famous techniques of data discretization?


1. Histogram analysis: A histogram is a plot that presents the underlying frequency
distribution of a set of continuous data. It helps inspect the shape of the
distribution, for example normality, outliers, and skewness.

2. Binning: Binning is a data smoothing technique that helps group a huge number of
continuous values into a smaller number of bins. For example, if we have data about a group of
students, we can arrange their marks into a smaller number of intervals by making one bin per
grade: one for grade A, one for B, one for C, one for D, and one for F.

3. Cluster analysis: Cluster analysis is commonly known as clustering. Clustering
is the task of grouping similar objects into one group, commonly called a cluster;
dissimilar objects are placed in different clusters.
4. Decision tree analysis
5. Equal-width partitioning
6. Equal-depth partitioning
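The last two techniques in the list can be sketched as follows, assuming unsupervised partitioning of a small invented data set: equal-width splits the value range into intervals of equal size, while equal-depth puts (roughly) the same number of values in each bin.

```python
def equal_width_bins(values, k):
    """Split the value range into k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in values:
        i = min(int((v - lo) / width), k - 1)  # clamp the maximum into the last bin
        bins[i].append(v)
    return bins

def equal_depth_bins(values, k):
    """Split the sorted values into k bins holding (roughly) the same count."""
    s = sorted(values)
    n = len(s)
    return [s[i * n // k:(i + 1) * n // k] for i in range(k)]

data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(equal_width_bins(data, 3))  # bins of width 10: [4,14), [14,24), [24,34]
print(equal_depth_bins(data, 3))  # three bins of four values each
```

Note how the equal-width bins end up with very different counts (3, 3, and 6 values), which is exactly the skew equal-depth partitioning avoids.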

Data discretization and concept hierarchy generation

A concept hierarchy represents a sequence of mappings between a set of low-level, specialized
concepts and higher-level, more general concepts. The mapping can run in either direction:
top-down or bottom-up.
Let’s see an example of a concept hierarchy for the dimension location.
Each city can be mapped with the country to which the given city belongs. For example,
Mianwali can be mapped to Pakistan and Pakistan can be mapped to Asia.
Top-down mapping
Top-down mapping starts from the top with general concepts and moves to the bottom to
the specialized concepts.
Bottom-up mapping
Bottom-up mapping starts from the Bottom with specialized concepts and moves to the top
to the generalized concepts.
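A minimal sketch of the location hierarchy described above, as nested lookup tables (the extra cities besides Mianwali are illustrative assumptions):

```python
# Concept hierarchy for the "location" dimension:
# city (specialized) -> country -> continent (general).
city_to_country = {"Mianwali": "Pakistan", "Karachi": "Pakistan", "Delhi": "India"}
country_to_continent = {"Pakistan": "Asia", "India": "Asia"}

def generalize(city):
    """Bottom-up mapping: climb from the specialized concept (city)
    to the more general concepts (country, then continent)."""
    country = city_to_country[city]
    continent = country_to_continent[country]
    return city, country, continent

print(generalize("Mianwali"))  # ('Mianwali', 'Pakistan', 'Asia')
```

Top-down mapping would simply invert these tables, walking from a continent down to its countries and cities.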


What Is Data Mining?


Data mining is a process used by companies to turn raw data into useful information. By
using software to look for patterns in large batches of data, businesses can learn more
about their customers to develop more effective marketing strategies, increase sales and
decrease costs. Data mining depends on effective data collection, warehousing, and
computer processing.

How Data Mining Works


Data mining involves exploring and analyzing large blocks of information to glean
meaningful patterns and trends. It can be used in a variety of ways, such as database
marketing, credit risk management, fraud detection, spam Email filtering, or even to
discern the sentiment or opinion of users.
The data mining process breaks down into five steps. First, organizations collect data and
load it into their data warehouses. Next, they store and manage the data, either on in-
house servers or the cloud. Business analysts, management teams, and information
technology professionals access the data and determine how they want to organize it.
Then, application software sorts the data based on the user's results, and finally, the end-
user presents the data in an easy-to-share format, such as a graph or table.

Data Mining Techniques


Data mining uses algorithms and various techniques to convert large collections of data
into useful output. The most popular types of data mining techniques include:

• Association rules, also referred to as market basket analysis, search for


relationships between variables. This relationship in itself creates additional value
within the data set as it strives to link pieces of data. For example, association rules
would search a company's sales history to see which products are most commonly
purchased together; with this information, stores can plan, promote, and forecast
accordingly.
• Classification assigns objects to predefined classes. These classes describe
characteristics of items or represent what the data points have in common with
each other. This data mining technique allows the underlying data to be more neatly
categorized and summarized across similar features or product lines.
• Clustering is similar to classification. However, clustering identifies similarities
between objects, then groups those items based on what makes them different from
other items. While classification may result in groups such as "shampoo",
"conditioner", "soap", and "toothpaste", clustering may identify groups such as "hair
care" and "dental health".
• Decision trees are used to classify or predict an outcome based on a set list of
criteria or decisions. A decision tree is used to ask for input of a series of cascading
questions that sort the dataset based on responses given. Sometimes depicted as a
tree-like visual, a decision tree allows for specific direction and user input when
drilling deeper into the data.
• K-Nearest Neighbor (KNN) is an algorithm that classifies data based on its proximity
to other data. The basis for KNN is the assumption that data points close to each
other are more similar to each other than to more distant points. This
non-parametric, supervised technique is used to predict features of a group based on
individual data points.
• Neural networks process data through the use of nodes. Each node is composed
of inputs, weights, and an output. Data is mapped through supervised learning
(similar to how the human brain is interconnected). Threshold values can be set
to assess a model's accuracy.
• Predictive analysis strives to leverage historical information to build graphical or
mathematical models to forecast future outcomes. Overlapping with regression
analysis, this data mining technique aims at estimating an unknown future figure
based on the data currently on hand.
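The KNN idea described above can be illustrated in a few lines: label a new point by a majority vote of its k closest labelled points. The points and labels here are invented toy data.

```python
# Toy sketch of K-Nearest Neighbor classification (made-up 2-D data).
from collections import Counter
import math

def knn_classify(points, labels, query, k=3):
    """Classify `query` by the majority label among its k nearest points."""
    nearest = sorted(
        range(len(points)),
        key=lambda i: math.dist(points[i], query),  # Euclidean distance
    )
    votes = Counter(labels[i] for i in nearest[:k])
    return votes.most_common(1)[0][0]

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = ["A", "A", "A", "B", "B", "B"]
print(knn_classify(points, labels, (2, 2)))  # 'A'
print(knn_classify(points, labels, (8, 7)))  # 'B'
```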

The Data Mining Process
To be most effective, data analysts generally follow a certain flow of tasks along the data
mining process. Without this structure, an analyst may encounter an issue in the middle of
their analysis that could have easily been prevented had they prepared for it earlier. The
data mining process is usually broken into the following steps.

Step 1: Understand the Business


Before any data is touched, extracted, cleaned, or analyzed, it is important to understand
the underlying entity and the project at hand. What are the goals the company is trying to
achieve by mining data? What is their current business situation? What are the findings of
a SWOT analysis? Before looking at any data, the mining process starts by understanding
what will define success at the end of the process.

Step 2: Understand the Data


Once the business problem has been clearly defined, it's time to start thinking about data.
This includes what sources are available, how the data will be securely stored, how information
will be gathered, and what the final outcome or analysis may look like. This step also
critically considers what limits there are to data, storage, security, and collection and
assesses how these constraints will impact the data mining process.

Step 3: Prepare the Data


It's now time to get our hands on information. Data is gathered, uploaded, extracted, or
calculated. It is then cleaned, standardized, scrubbed for outliers, assessed for mistakes,
and checked for reasonableness. During this stage of data mining, the data may also be
checked for size as an overbearing collection of information may unnecessarily slow
computations and analysis.

Step 4: Build the Model


With our clean data set in hand, it's time to crunch the numbers. Data scientists use the
types of data mining above to search for relationships, trends, associations, or sequential
patterns. The data may also be fed into predictive models to assess how previous bits of
information may translate into future outcomes.

Step 5: Evaluate the Results


The data-centered aspect of data mining concludes by assessing the findings of the data
model(s). The outcomes from the analysis may be aggregated, interpreted, and presented
to decision-makers that have largely been excluded from the data mining process to this
point. In this step, organizations can choose to make decisions based on the findings.

Step 6: Implement Change and Monitor


The data mining process concludes with management taking steps in response to the
findings of the analysis. The company may decide the information was not strong enough
or the findings were not relevant to change course. Alternatively, the company may
strategically pivot based on the findings. In either case, management reviews the ultimate
business impact and restarts the data mining loop by identifying new
business problems or opportunities.
Module 4

To find a numerical output, prediction is used. The training dataset contains the inputs and
numerical output values. According to the training dataset, the algorithm generates a model or
predictor. When fresh data is provided, the model should find a numerical output. This approach,
unlike classification, does not have a class label. A continuous-valued function or ordered value is
predicted by the model.
In most cases, regression is utilized to make predictions. For example: predicting the worth of a
home based on features like the number of rooms, total area, and so on.
Consider the following scenario: a marketing manager needs to forecast how much a specific
consumer will spend during a sale. In this scenario, we are required to forecast a numerical value.
In this situation, a model or predictor that forecasts a continuous or ordered value function will be
built.
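As a minimal sketch of prediction via regression, a least-squares line can be fitted to a tiny training set and then used on fresh data. The areas and prices below are invented figures, deliberately chosen to be exactly linear.

```python
# Sketch of prediction with simple linear regression: fit price = a*area + b
# on a small made-up training set, then predict for an unseen area.
def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return a, b

areas = [50, 80, 100, 120]     # total area (training inputs)
prices = [100, 160, 200, 240]  # price (numerical outputs; here exactly 2 * area)
a, b = fit_line(areas, prices)
print(a * 90 + b)  # predicted price for an unseen 90-unit home -> 180.0
```

Unlike a classifier, the fitted predictor returns a continuous value rather than a class label, which is exactly the distinction the text draws.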

Prediction Issues:

Preparing the data for prediction is the most pressing challenge. The following activities are
involved in data preparation:
• Data Cleaning: Cleaning the data includes reducing noise and treating missing values. Smoothing
techniques remove noise, and missing values are handled by replacing a missing
value with the most commonly occurring value for that attribute.
• Relevance Analysis: The irrelevant attributes may also be present in the database. The
correlation analysis method is used to determine whether two attributes are connected.
• Data Transformation and Reduction: Any of the methods listed below can be used to transform
the data.
• Normalization: Normalization is used to transform the data. Normalization is the
process of scaling all values for a given attribute so that they lie within a narrow
range. When neural networks or methods requiring measurements are utilized in the
learning process, normalization is performed.
• Generalization: The data can also be transformed by generalizing it to higher-level
concepts. We can use concept hierarchies for this.
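The normalization step above can be sketched with min-max scaling, which maps every value of an attribute into [0, 1]. The raw values are illustrative assumptions.

```python
# Sketch of min-max normalization: rescale an attribute into [0, 1].
def min_max_normalize(values):
    """Scale each value linearly so min -> 0.0 and max -> 1.0."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

incomes = [20, 35, 50, 80]  # made-up raw attribute values
print(min_max_normalize(incomes))  # [0.0, 0.25, 0.5, 1.0]
```

Scaling like this keeps any one attribute from dominating distance-based learners such as neural networks or KNN, which is why the text says normalization matters when measurement-based methods are used.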

What is a Prediction?
The second way to operate on data in mining is prediction. As in classification, the training
data set holds the inputs together with the corresponding numerical output values. Based on the
training data set, the algorithm derives a model or predictor.

When new data is given, the model should produce a numerical output. Unlike classification, this
procedure does not have a class label. The model estimates a continuous-valued function or an
ordered value.

Regression is used for prediction in most cases. Predicting the price of a house based on
attributes such as the number of rooms and the total area is an illustration of prediction; an
organization can likewise estimate the amount of money a customer will spend during a sale.

• Determines missing or unknown values in a data set.
• A model (predictor) is built to estimate the outcome.
• Does not depend on a class label.
• Predictions are made using both regression and classification models.

3) Difference between Classification & Prediction.


• Classification is the method of identifying the group to which a new observation belongs, on the basis of a training
data set containing observations whose group membership is known.
• Prediction is the method of identifying the missing or unavailable numerical value for a new observation.
• A classifier is built to assign categorical class labels.
• A predictor is built to estimate a continuous or ordered value.
• In classification, accuracy depends on detecting the class label correctly.
• In prediction, accuracy depends on how well a given predictor can guess the value of the predicted attribute
for new data.
• In classification, the model can be called the classifier.
• In prediction, the model can be called the predictor.
