
Machine Learning Notes

DATA MINING
Define Data Mining:
• A process used by companies to turn raw data into useful information.
• Data Mining is a technology that blends traditional data analysis methods with
sophisticated algorithms for processing large volumes of data
• Data mining is an automatic or semi-automatic technical process that analyses
large amounts of scattered information to make sense of it and turn it into
knowledge. It looks for anomalies, patterns or correlations among millions of
records to predict results.

Knowledge Discovery in Databases (KDD) Process (a sketch in code follows the steps):

1. Setting up an objective.


2. Data Preparation.
3. Applying the algorithm on the collected data.
4. Evaluation of the result.
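
As a rough illustration, here is a minimal Python sketch mapping the four KDD steps onto a standard scikit-learn workflow; the dataset and model choices are assumptions made only for illustration.

# A minimal sketch of the four KDD steps using scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1. Setting up an objective: predict the species of an iris flower.
X, y = load_iris(return_X_y=True)

# 2. Data preparation: split and scale the data.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# 3. Applying the algorithm on the collected data.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# 4. Evaluation of the result.
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))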

-----------------------------------------------------------------------------------------------------------------


Motivating Challenges in Data Mining:

1. Scalability and Efficiency of the Algorithms : Data mining algorithms should be scalable and efficient so that individual records can be extracted efficiently. Data mining is difficult because of the enormous size of the databases involved.

2. High Dimensionality : High-dimensional data are data in which the number of features (observed variables), p, is close to or larger than the number of observations (data points), n. High-dimensional datasets are common in the biological sciences; subjects like genomics and medical sciences often use both tall (in terms of n) and wide (in terms of p) datasets.

3. Heterogeneous and Complex Data : Real-world data is heterogeneous, and it may be media data, including natural language text, time series, spatial data, temporal data, complex data, audio or video, images, etc. It is truly hard to deal with these various types of data and extract the necessary information.

4. Data Ownership and Distribution : Sometimes the data needed for an analysis is not stored in one location or owned by one organization. Instead, it is geographically distributed.

----------------------------------------------------------------------------------------------------------------


Data Mining Tasks : Data mining tasks are mainly classified into two types: predictive data mining tasks and descriptive data mining tasks.

Predictive data mining tasks come up with a model from the available data set that is helpful in predicting unknown or future values of another data set of interest. Example : a medical practitioner trying to diagnose a disease based on the medical test results of a patient can be considered a predictive data mining task.

Descriptive data mining tasks usually find data-describing patterns and come up with new, significant information from the available data set. Example : a retailer trying to identify products that are purchased together can be considered a descriptive data mining task.

Predictive data mining tasks :

a) Classification : Classification derives a model to determine the class of an object based on its attributes. A collection of records will be available, each record with a set of attributes. One of the attributes will be the class attribute, and the goal of the classification task is to assign a class label to new records as accurately as possible.

Classification can be used in direct marketing to reduce marketing costs by targeting the set of customers who are likely to buy a new product. Using the available data, it is possible to know which customers purchased similar products in the past and which did not.
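
As a hedged sketch of this direct-marketing use, the snippet below fits a classifier to toy customer records; the features, values, and model choice are illustrative assumptions, not real data.

# Each record: [age, income_in_thousands]; class 1 = bought a similar product.
from sklearn.linear_model import LogisticRegression

X = [[25, 30], [47, 85], [35, 60], [52, 90], [23, 25], [40, 70]]
y = [0, 1, 1, 1, 0, 1]

clf = LogisticRegression().fit(X, y)

# Predict whether a new customer is likely to buy, to decide on targeting.
print(clf.predict([[30, 55]]))   # e.g. array([1]) -> worth targeting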

b) Prediction : The prediction task predicts the possible values of missing or future data. Prediction involves developing a model based on the available data, and this model is used in predicting future values of a new data set of interest. For example, a model can
predict the income of an employee based on education, experience and other
demographic factors like place of stay, gender etc. Also prediction analysis is used in
different areas including medical diagnosis, fraud detection etc.

c) Time - Series Analysis : Time series is a sequence of events where the next event
is determined by one or more of the preceding events. Time series reflects the process
being measured and there are certain components that affect the behavior of a process.
Time series analysis includes methods to analyze time-series data in order to extract
useful patterns, trends, rules and statistics. Stock market prediction is an important
application of time-series analysis.
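
As a minimal sketch of extracting a trend from a time series with pandas, the snippet below smooths a synthetic daily price series; the data itself is an assumption for illustration.

import pandas as pd
import numpy as np

dates = pd.date_range("2024-01-01", periods=30, freq="D")
prices = pd.Series(100 + np.cumsum(np.random.randn(30)), index=dates)

# A rolling mean smooths daily noise and exposes the underlying trend.
trend = prices.rolling(window=7).mean()
print(trend.tail())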

Descriptive data mining tasks :

d) Association : Association discovers the association or connection among a set of


items. Association identifies the relationships between objects. Association analysis is
used for commodity management, advertising, catalog design, direct marketing etc. A
retailer can identify the products that normally customers purchase together or even find
the customers who respond to the promotion of the same kind of products. If a retailer finds that beer and nappies are mostly bought together, he can put nappies on sale to promote the sale of beer.
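
A sketch of this kind of association analysis is shown below using the mlxtend library (an assumed third-party dependency, pip install mlxtend); the basket data is an illustrative assumption.

import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# One row per transaction; True means the item was in the basket.
baskets = pd.DataFrame(
    {"beer":    [True, True, False, True],
     "nappies": [True, True, False, True],
     "milk":    [False, True, True, False]}
)

frequent = apriori(baskets, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])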

e) Clustering : Clustering is used to identify data objects that are similar to one
another. The similarity can be decided based on a number of factors like purchase
behavior, responsiveness to certain actions, geographical locations and so on. For
example, an insurance company can cluster its customers based on age, residence,
income etc. This group information will be helpful to understand the customers better
and hence provide better customized services.
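
Below is a minimal sketch of clustering insurance customers with k-means; the feature values and the number of clusters are illustrative assumptions.

from sklearn.cluster import KMeans

# Each row: [age, annual_income_in_thousands]
customers = [[25, 30], [27, 35], [45, 80], [48, 85], [60, 40], [62, 45]]

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)   # cluster assignment for each customer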

f) Summarization : Summarization is the generalization of data. A set of relevant data is summarized, resulting in a smaller set that gives aggregated information about the data. For example, the shopping done by a customer can be summarized into total products, total spending, offers used, etc. Such high-level summarized information can be useful to sales or customer relationship teams for detailed analysis of customers and purchase behavior. Data can be summarized at different abstraction levels and from different angles.
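
A sketch of such a per-customer summary with pandas follows; the column names and values are illustrative assumptions.

import pandas as pd

orders = pd.DataFrame({
    "customer": ["A", "A", "B", "B", "B"],
    "amount":   [120, 80, 40, 60, 55],
    "offer_used": [True, False, False, True, True],
})

# Summarize each customer's shopping into aggregated figures.
summary = orders.groupby("customer").agg(
    total_products=("amount", "count"),
    total_spending=("amount", "sum"),
    offers_used=("offer_used", "sum"),
)
print(summary)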

-----------------------------------------------------------------------------------------------------------------


Concept Learning : Inferring a Boolean-valued function from training examples of its input and output. (To infer means to conclude something from evidence and reasoning rather than from explicit statements.)

 A concept is an idea of something formed by combining all the features or attributes that construct the given concept. Every concept has two components:
o Attributes: features that one must look for to decide whether a data instance is a positive instance of the concept.
o A rule: denotes what conjunction of constraints on the attributes will qualify as a positive instance of the concept.

Concept learning Tasks :

---------------------------------------------------------------------------------------------------------------------

FIND-S Algorithm : The FIND-S algorithm is a basic concept learning algorithm in machine learning. The FIND-S algorithm finds the most specific hypothesis that fits all the positive examples.
Representation :

1. ? indicates that any value is acceptable for the attribute.
2. A single required value (e.g., Cold) indicates that only that value is acceptable for the attribute.
3. Φ indicates that no value is acceptable.
4. The most general hypothesis is represented by: {?, ?, ?, ?, ?, ?}
5. The most specific hypothesis is represented by: {Φ, Φ, Φ, Φ, Φ, Φ}

Algorithm Involved In FIND-S (a Python sketch follows the steps) :

1. Initialize h to the most specific hypothesis in H.
2. For each positive training instance x:
For each attribute constraint ai in h:
If the constraint ai is satisfied by x,
then do nothing;
else replace ai in h by the next more general constraint that is satisfied by x.
3. Output hypothesis h.
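
Below is a minimal Python sketch of FIND-S; the training tuples are a shortened version of the classic EnjoySport data, included only for illustration.

def find_s(examples):
    """Return the most specific hypothesis consistent with the positive examples."""
    hypothesis = None
    for attributes, label in examples:
        if label != "Yes":          # FIND-S ignores negative examples
            continue
        if hypothesis is None:      # first positive example: copy it verbatim
            hypothesis = list(attributes)
        else:                       # generalize mismatching constraints to '?'
            hypothesis = [h if h == a else "?"
                          for h, a in zip(hypothesis, attributes)]
    return hypothesis

examples = [
    (("Sunny", "Warm", "Normal", "Strong"), "Yes"),
    (("Sunny", "Warm", "High",   "Strong"), "Yes"),
    (("Rainy", "Cold", "High",   "Strong"), "No"),
]
print(find_s(examples))   # ['Sunny', 'Warm', '?', 'Strong']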


Step 1 : Initialize h to the most specific hypothesis. Step 2 (Iteration 1 onwards) : Generalize h over each positive training example. (The worked trace appears as figures in the original notes; the sketch above reproduces it in code.)

------------------------------------------------------------------------------------------------------------------


Try it yourself : FIND-S algorithm :


Well-Posed Learning Problem – A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
Any problem can be classified as a well-posed learning problem if it has three traits –
 Task
 Performance Measure
 Experience

Some examples that clearly define the well-posed learning problem are –
1. To better filter emails as spam or not
a. Task – Classifying emails as spam or not
b. Performance Measure – The fraction of emails accurately classified as
spam or not spam
c. Experience – Observing you label emails as spam or not spam
2. Handwriting Recognition Problem
a. Task – Recognizing handwritten words within images.
b. Performance Measure – percent of words accurately classified
c. Experience – a directory of handwritten words with given classifications

----------------------------------------------------------------------------------------------------------------

Common issues in Machine Learning

Some common issues in Machine Learning that professionals face while developing ML skills and creating applications from scratch are described below.

1. Inadequate and poor quality of training data : The major issue that arises while using machine learning algorithms is a lack of quality as well as quantity of data. Although data plays a vital role in machine learning, many data scientists claim that inadequate data, noisy data, and unclean data severely hamper machine learning algorithms. Data quality can be affected by factors such as the following:

o Noisy data - It is responsible for inaccurate predictions that affect decisions as well as accuracy in classification tasks.
o Incorrect data - It is also responsible for faulty results obtained from machine learning models. Hence, incorrect data may affect the accuracy of the results as well.

Noisy data, incomplete data, inaccurate data, and unclean data lead to less
accuracy in classification and low-quality results. Hence, data quality can also be
considered as a major common problem while processing machine learning
algorithms.


2. Non-representative training data : To make sure our model generalizes well, we have to ensure that the sample training data is representative of the new cases we need to generalize to. The training data must cover all the cases that have already occurred as well as those that may occur.

A machine learning model is said to be ideal if it predicts well for generalized cases and provides accurate decisions. If there is too little training data, there will be sampling noise in the model; such a training set is called a non-representative training set, and it won't be accurate in its predictions. The model will be biased toward one class or group. Hence, we should use representative data in training to protect against bias and to make accurate predictions without any drift.

3. Overfitting and Underfitting

Overfitting: Overfitting is one of the most common issues faced by Machine Learning engineers and data scientists. Whenever a machine learning model is trained on a huge amount of data, it starts capturing the noise and inaccuracies present in the training data set. This negatively affects the performance of the model. We can counter overfitting by using simpler linear and parametric models, among other remedies.

Methods to reduce overfitting (a sketch of the regularization items follows this list):

o Increase training data in a dataset.


o Reduce model complexity by simplifying the model by selecting one with fewer
parameters
o Ridge Regularization and Lasso Regularization
o Early stopping during the training phase
o Reduce the noise
o Reduce the number of attributes in training data.
o Constraining the model.
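
As a hedged sketch of two of the remedies above, Ridge and Lasso regularization, the snippet below fits both on synthetic data (the data itself is an assumption).

import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))               # many features, few samples
y = X[:, 0] * 3.0 + rng.normal(size=50)     # only one feature matters

ridge = Ridge(alpha=1.0).fit(X, y)   # shrinks all coefficients toward 0
lasso = Lasso(alpha=0.1).fit(X, y)   # drives irrelevant coefficients to 0

print("nonzero ridge coefs:", np.sum(ridge.coef_ != 0))
print("nonzero lasso coefs:", np.sum(lasso.coef_ != 0))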

Underfitting: Underfitting is just the opposite of overfitting. Whenever a machine learning model is trained with too little data, it produces incomplete and inaccurate predictions and destroys the accuracy of the machine learning model.

Underfitting occurs when our model is too simple to capture the underlying structure of the data, just like an undersized pair of pants. This generally happens when we have limited data in the data set and try to build a linear model with non-linear data. In such scenarios the model is not complex enough, its rules are too simple to apply to the data set, and it starts making wrong predictions as well.

Methods to reduce underfitting (a sketch of the first item follows this list):

o Increase model complexity


o Remove noise from the data


o Train on more and better features
o Reduce the constraints
o Increase the number of epochs to get better results.
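
The sketch below illustrates the first remedy, increasing model complexity by adding polynomial features when a linear fit underfits non-linear data; the synthetic parabola is an assumption.

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

X = np.linspace(-3, 3, 40).reshape(-1, 1)
y = X.ravel() ** 2 + np.random.default_rng(0).normal(0, 0.5, 40)

linear = LinearRegression().fit(X, y)                 # underfits a parabola
curved = make_pipeline(PolynomialFeatures(2),
                       LinearRegression()).fit(X, y)  # captures the curve

print("linear R^2:", linear.score(X, y))
print("degree-2 R^2:", curved.score(X, y))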

4. Monitoring and maintenance : As we know, generalized output is mandatory for any machine learning model; hence, regular monitoring and maintenance become compulsory. As requirements and data change over time, the model's code and the resources for monitoring it must be updated as well.

5. Getting bad recommendations : A machine learning model operates in a specific context; when that context shifts, it produces bad recommendations and suffers concept drift. For example, at a specific time a customer is looking for some gadgets; the customer's requirements change over time, but the machine learning model keeps showing the same recommendations even though the customer's expectations have changed. This incident is called data drift. It generally occurs when new data is introduced or the interpretation of the data changes. However, we can overcome this by regularly monitoring and updating the data according to the expectations.
6. Lack of skilled resources : Although Machine Learning and Artificial Intelligence are continuously growing in the market, these industries are still young in comparison to others. The absence of skilled manpower is also an issue. Hence, we need people with in-depth knowledge of mathematics, science, and technology for developing and managing scientific substance for machine learning.
7. Process complexity of Machine Learning : The machine learning process is very complex, which is another major issue faced by machine learning engineers and data scientists. Machine Learning and Artificial Intelligence are very new technologies, still in an experimental phase and continuously changing over time.
8. Data bias : Data bias is also a big challenge in Machine Learning. These errors occur when certain elements of the dataset are weighted more heavily or given more importance than others. Biased data leads to inaccurate results, skewed outcomes, and other analytical errors. However, we can resolve this error by determining where the data is actually biased in the dataset, and then taking the necessary steps to reduce the bias.

Methods to remove Data Bias:


o Research more for customer segmentation.
o Be aware of your general use cases and potential outliers.
o Combine inputs from multiple sources to ensure data diversity.
o Include bias testing in the development process.
o Analyze data regularly and keep tracking errors to resolve them easily.
o Review the collected and annotated data.


9. Lack of explainability : This basically means that the outputs cannot be easily comprehended, as the model is programmed in specific ways to deliver outputs for certain conditions. Hence, the lack of explainability found in machine learning algorithms reduces the credibility of the algorithms.

10. Slow implementations and results : This issue is also very commonly seen in machine learning models. Machine learning models are highly efficient at producing accurate results but are time-consuming. Slow programs, excessive requirements, and overloaded data take more time than expected to provide accurate results. This calls for continuous maintenance and monitoring of the model to deliver accurate results.

===========================================================

Data Preprocessing : Data preprocessing consists of different techniques and strategies that are interrelated in complex ways. The different data preprocessing techniques are :

1. Aggregation
2. Sampling
3. Dimensionality reduction
4. Feature subset selection
5. Feature creation
6. Discretization and Binarization
7. Variable Transformation

1. Aggregation : The combination of two or more objects into a single object is called aggregation. Aggregation in data mining is the process of finding, collecting, and presenting data in a summarized format to perform statistical analysis of business schemes or of human patterns. When data is collected from various datasets, it is crucial to gather accurate data to produce significant results. Example of aggregate data: finding the average age of customers buying a particular product, which can help in identifying the target age group for that product. Instead of dealing with individual customers, the average age of the customers is calculated.
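
A minimal pandas sketch of this average-age aggregation follows; the product and age values are illustrative assumptions.

import pandas as pd

purchases = pd.DataFrame({
    "product": ["phone", "phone", "laptop", "laptop", "phone"],
    "customer_age": [22, 31, 45, 39, 27],
})

# Aggregate individual customers into one average age per product.
print(purchases.groupby("product")["customer_age"].mean())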

2. Sampling : Sampling is the practice of selecting a subset of individuals from a population in order to study the whole population. Every sampling type falls under two broad categories, sketched in code below:
 Probability sampling - Random selection techniques are used to select the sample.
 Non-probability sampling - Non-random selection techniques based on certain criteria are used to select the sample.
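
The sketch below contrasts the two categories with pandas; the population DataFrame is an illustrative assumption.

import pandas as pd

population = pd.DataFrame({"id": range(100), "age": range(18, 118)})

# Probability sampling: every row has an equal, random chance of selection.
random_sample = population.sample(n=10, random_state=0)

# Non-probability sampling: selection by a fixed criterion, not by chance.
convenience_sample = population[population["age"] < 30]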
3. Dimensionality reduction : Data sets can have a large number of features. Dimensionality reduction is the process of reducing the number of random variables under consideration by obtaining a set of principal variables. It can be divided into feature selection and feature extraction. A key benefit of dimensionality reduction in machine learning and predictive modeling is that data mining algorithms work better when the dimensionality (the number of attributes) is lower.
Curse of Dimensionality: The curse of dimensionality describes the explosive nature of increasing data dimensions and the resulting exponential increase in the computational effort required for their processing and/or analysis. The term was first introduced by Richard E. Bellman. In machine learning, a feature of an object can be an attribute or a characteristic that defines it. Each feature represents a dimension, and a group of dimensions creates a data point. This is the feature vector that defines the data point to be used by a machine learning algorithm. An increase in dimensionality implies an increase in the number of features used to describe the data. For example, in the field of breast cancer research, age and the number of cancerous nodes can be used as features to define the prognosis of a patient. These features constitute the dimensions of a feature vector. But other factors like past surgeries, patient history, and type of tumor help a doctor to better determine the prognosis; by adding such features, we are increasing the dimensionality of our data.
As the dimensionality increases, the number of data points required for good
performance of any machine learning algorithm increases exponentially.
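
Below is a minimal sketch of dimensionality reduction with PCA, a feature extraction method; the synthetic data is an assumption for illustration.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))          # 100 points in 50 dimensions

pca = PCA(n_components=5)               # keep 5 principal variables
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                  # (100, 5)
print(pca.explained_variance_ratio_)    # variance retained per component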

4. Feature subset selection : Another way to reduce the dimensionality is to use only a subset of the features. Redundant features duplicate much or all of the information contained in one or more other attributes. For example, the purchase price of a product and the amount of sales tax paid contain much of the same information, so we can keep either the sales tax paid or the purchase price in the data set.
5. Feature creation: Feature engineering is a process of using domain
knowledge to create/extract new features from a given dataset by using data
mining techniques. It helps machine learning algorithms to understand data
and determine patterns that can improve the performance of machine learning
algorithms. Three related methodologies for creating new attributes are
described next: feature extraction, mapping the data to a new space, and
feature construction.
6. Discretization and Binarization : Data discretization is a method of
converting attributes values of continuous data into a finite set of intervals with
minimum data loss. In contrast, data binarization is used to transform the
continuous and discrete attributes into binary attributes.
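
A sketch of both operations with scikit-learn follows; the age values are illustrative assumptions.

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer, Binarizer

ages = np.array([[6], [12], [20], [27], [35], [54], [70]])

# Discretization: continuous ages -> 3 ordinal intervals with minimal loss.
disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
print(disc.fit_transform(ages).ravel())

# Binarization: 1 if age exceeds the threshold, else 0.
print(Binarizer(threshold=18).fit_transform(ages).ravel())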
7. Variable Transformation : A variable transformation is a transformation that is applied to all the values of a variable. In other words, for every object, the transformation is applied to the value of the variable for that object. For instance, if only the magnitude of a variable is important, then the values of the variable can be transformed by taking the absolute value.
------------------------------------------------------------------------------------------------------------


Measure of similarity and Dissimilarity:

Proximity: The term proximity is used to refer to either similarity or dissimilarity. The proximity between two objects is a function of the proximity between the corresponding attributes of the two objects.

Similarity : Similarity between two objects is a numerical measure of the degree to


which the two objects are alike. Consequently, similarities are higher for pairs of
objects that are more alike. Similarities are usually non-negative and are often
between 0 (no similarity) and 1 (complete similarity).

Dissimilarity: The dissimilarity between two objects is a numerical measure of the


degree to which the two objects are different. Dissimilarities are lower for more similar
pairs of objects. Dissimilarities sometimes fall in the interval [0,1], but it is also
common for them to range from 0 to ∞.

---------------------------------------------------------------------------------------------------------------

Data similarity measures between objects that contain only binary attributes are called similarity coefficients, and typically have values between 0 and 1. A value of 1 indicates that the two objects are completely similar, while a value of 0 indicates that the objects are not at all similar. There are many rationales for why one coefficient is better than another in specific instances.

Let x and y be two objects that consist of n binary attributes. The comparison of two such objects, i.e., two binary vectors, leads to the following four quantities (frequencies):

f00 : the number of attributes where x is 0 and y is 0

f01 : the number of attributes where x is 0 and y is 1

f10 : the number of attributes where x is 1 and y is 0

f11 : the number of attributes where x is 1 and y is 1

Simple Matching Coefficient : One commonly used similarity coefficient is the simple matching coefficient (SMC), which is defined as

number of matching attribute values

SMC = -------------------------------------------------------------

number of attributes

= (f11 + f00) / (f01 + f10 + f11 + f00)


Jaccard coefficient : The Jaccard coefficient handles the asymmetric case, i.e., it only considers matches of 1's and ignores the 0's in both the numerator and the denominator:

J = number of 1-1 matches / number of non-zero attributes = f11 / (f01 + f10 + f11)

Example (SMC vs. Jaccard) :

x = 1 0 1 1 0 1 0 0 0 1
y = 1 1 0 1 0 0 0 0 1 1

f01 = 2, f10 = 2, f00 = 3, f11 = 3

SMC = (f11 + f00) / (f01 + f10 + f11 + f00) = (3 + 3) / 10 = 6/10

J = f11 / (f01 + f10 + f11) = 3 / (2 + 2 + 3) = 3/7
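
A short Python check of this example, using the vectors from the text:

x = [1, 0, 1, 1, 0, 1, 0, 0, 0, 1]
y = [1, 1, 0, 1, 0, 0, 0, 0, 1, 1]

f11 = sum(a == 1 and b == 1 for a, b in zip(x, y))
f00 = sum(a == 0 and b == 0 for a, b in zip(x, y))
f10 = sum(a == 1 and b == 0 for a, b in zip(x, y))
f01 = sum(a == 0 and b == 1 for a, b in zip(x, y))

smc = (f11 + f00) / (f01 + f10 + f11 + f00)   # 6/10 = 0.6
jaccard = f11 / (f01 + f10 + f11)             # 3/7  ≈ 0.43
print(smc, jaccard)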


Cosine Similarity : Cosine similarity is useful when each instance is described by a set of quantified attributes; it is an alternative to Minkowski distances.

If d1 and d2 are two document vectors, then cos(d1, d2) = <d1, d2> / (||d1|| ||d2||), where <d1, d2> denotes the inner (dot) product d1'd2 of the vectors d1 and d2, and ||d|| is the length of vector d.

Example:

d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2

<d1, d2> = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
|| d1 || = (3*3+2*2+0*0+5*5+0*0+0*0+0*0+2*2+0*0+0*0)^(1/2) = (42)^(1/2) = 6.481
|| d2 || = (1*1+0*0+0*0+0*0+0*0+0*0+0*0+1*1+0*0+2*2)^(1/2) = (6)^(1/2) = 2.449

cos(d1, d2 ) = 0.3150
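
A numpy check of this worked example, using the same vectors:

import numpy as np

d1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0])
d2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2])

cos = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos, 4))   # 0.315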

Euclidean Distance Formula : For two points (x1, y1) and (x2, y2),

d = sqrt( (x2 − x1)^2 + (y2 − y1)^2 )

Example: the distance between the two points (–3, 2) and (3, 5) is
d = sqrt( (3 − (−3))^2 + (5 − 2)^2 ) = sqrt(36 + 9) = sqrt(45) ≈ 6.708
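
A one-line check of the distance computed above:

import math

d = math.dist((-3, 2), (3, 5))   # sqrt(6^2 + 3^2) = sqrt(45)
print(round(d, 3))               # 6.708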


• Classification is the task of learning a target function that maps each attribute set x to one of the predefined class labels y.

• Descriptive Modeling : A classification model can serve as an explanatory tool to distinguish between objects of different classes. Example : a biologist using such a model to summarize which features define each class of vertebrates.

• Predictive Modeling : A classification model can be used to predict the class


label of unknown records.

General Approach for classification Problem :

• The systematic approach for learning a classification model given a training


set is known as a learning algorithm. The process of using a learning
algorithm to build a classification model from the training data is known as
induction. This process is also often described as “learning a model” or
“building a model.”

• The process of applying a classification model to unseen test instances to predict their class labels is known as deduction.

The process of classification involves two steps:

 Applying a learning algorithm to training data to learn a model,


 Applying that model to assign labels to unlabeled instances.


Error Rate = (50 + 60) / (560 + 60 + 50 + 330) = 110 / 1000 = 0.11

(from a confusion matrix with 560 and 330 correctly classified records and 50 and 60 misclassified records)

How a Decision Tree Works :


Consider a simpler version of the vertebrate classification problem described in the
previous section. Instead of classifying the vertebrates into five distinct groups of
species, we assign them to two categories: mammals and non-mammals.

We can solve a classification problem by asking a series of carefully crafted


questions about the attributes of the test record. Each time we receive an answer, a
follow-up question is asked until we reach a conclusion about the class label of the
record. The series of questions and their possible answers can be organized in the
form of a decision tree, which is a hierarchical structure consisting of nodes and
directed edges. Figure shows the decision tree for the mammal classification
problem. The tree has three types of nodes:

 A root node that has no incoming edges and zero or more outgoing edges.
 Internal nodes, each of which has exactly one incoming edge and two or more
outgoing edges.
 Leaf or terminal nodes, each of which has exactly one incoming edge and no
outgoing edges.

In a decision tree, each leaf node is assigned a class label. The nonterminal nodes,
which include the root and other internal nodes, contain attribute test conditions to
separate records that have different characteristics.
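
As a hedged illustration, the sketch below trains a tiny scikit-learn decision tree on two encoded vertebrate attributes; the feature encoding and the records are assumptions made for illustration, not the textbook's data set.

from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [gives_birth, warm_blooded]; 1 = yes, 0 = no.
X = [[1, 1], [0, 1], [0, 0], [1, 1], [0, 0], [1, 0]]
y = ["mammal", "non-mammal", "non-mammal",
     "mammal", "non-mammal", "non-mammal"]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(export_text(tree, feature_names=["gives_birth", "warm_blooded"]))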


-----------------------------------------------------------------------------------------------------------------


Methods for Expressing Attribute Test Conditions

Decision tree induction algorithms must provide a method for expressing an attribute
test condition and its corresponding outcomes for different attribute types.

1. Binary Attributes : The test condition for a binary attribute generates two potential outcomes.

2. Nominal Attributes : A nominal attribute can have many values, so its test condition can be expressed in two ways: a multiway split or a binary split. For a multiway split, the number of outcomes depends on the number of distinct values of the corresponding attribute.

For example, if an attribute such as marital status has three distinct values - single, married, or divorced - its test condition will produce a three-way split. On the other hand, some decision tree algorithms, such as CART, produce only binary splits by considering all 2^(k-1) − 1 ways of creating a binary partition of k attribute values. Figure 4.9(b) illustrates three different ways of grouping the attribute values for marital status into two subsets.
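
To make the 2^(k-1) − 1 count concrete, here is a small Python check for the k = 3 marital-status values from the example above:

from itertools import combinations

values = ["single", "married", "divorced"]
splits = set()
for r in range(1, len(values)):
    for group in combinations(values, r):
        rest = tuple(v for v in values if v not in group)
        # {A} vs {B,C} and {B,C} vs {A} are the same binary split
        splits.add(frozenset([group, rest]))
print(len(splits))   # 3 = 2**(3-1) - 1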


3. Ordinal Attributes : Ordinal attributes can also produce binary or multiway


splits. Ordinal attribute values can be grouped as long as the grouping does
not violate the order property of the attribute values.

4. Continuous Attributes : For continuous attributes, the test condition can be expressed as a comparison test (A < v) or (A ≥ v) with binary outcomes, or as a range query with outcomes of the form vi ≤ A < vi+1, for i = 1, ..., k, producing a multiway split.


--------------------------------------------------------------------------------------------------------------------------------
