
Machine Learning Notes

DATA MINING
Define Data Mining:
• A process used by companies to turn raw data into useful information.
• Data Mining is a technology that blends traditional data analysis methods with
sophisticated algorithms for processing large volumes of data
• Data mining is an automatic or semi-automatic technical process that analyses
large amounts of scattered information to make sense of it and turn it into
knowledge. It looks for anomalies, patterns or correlations among millions of
records to predict results.

Knowledge Discovery in Databases (KDD) Process (a sketch in code follows the steps):

1. Setting up an objective.


2. Data Preparation.
3. Applying the algorithm on the collected data.
4. Evaluation of the result.
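
As a rough illustration, here is a minimal Python sketch mapping the four KDD steps onto a standard scikit-learn workflow; the dataset and model choices are assumptions made only for illustration.

# A minimal sketch of the four KDD steps using scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1. Setting up an objective: predict the species of an iris flower.
X, y = load_iris(return_X_y=True)

# 2. Data preparation: split and scale the data.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# 3. Applying the algorithm on the collected data.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# 4. Evaluation of the result.
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))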

-----------------------------------------------------------------------------------------------------------------


Motivating Challenges in Data Mining:

1. Scalability and Efficiency of the Algorithms : Data mining algorithms should be scalable and efficient so that individual records can be extracted efficiently. Data mining is difficult because of the enormous size of the databases involved.

2. High Dimensionality : High-dimensional data are data in which the number of features (observed variables), p, is close to or larger than the number of observations (data points), n. High-dimensional datasets are common in the biological sciences; subjects like genomics and medical sciences often use both tall (in terms of n) and wide (in terms of p) datasets.

3. Heterogeneous and Complex Data : Real-world data is heterogeneous, and it may be media data, including natural language text, time series, spatial data, temporal data, complex data, audio or video, images, etc. It is truly hard to deal with these various types of data and extract the necessary information.

4. Data Ownership and Distribution : Sometimes the data needed for an analysis is not stored in one location or owned by one organization. Instead, it is geographically distributed.

----------------------------------------------------------------------------------------------------------------


Data Mining Tasks : Data mining tasks are mainly classified into two types: predictive data mining tasks and descriptive data mining tasks.

Predictive data mining tasks come up with a model from the available data set that is helpful in predicting unknown or future values of another data set of interest. Example : a medical practitioner trying to diagnose a disease based on the medical test results of a patient can be considered a predictive data mining task.

Descriptive data mining tasks usually find data-describing patterns and come up with new, significant information from the available data set. Example : a retailer trying to identify products that are purchased together can be considered a descriptive data mining task.

Predictive data mining tasks :

a) Classification : Classification derives a model to determine the class of an object based on its attributes. A collection of records will be available, each record with a set of attributes. One of the attributes will be the class attribute, and the goal of the classification task is to assign a class label to new records as accurately as possible.

Classification can be used in direct marketing to reduce marketing costs by targeting the set of customers who are likely to buy a new product. Using the available data, it is possible to know which customers purchased similar products in the past and which did not.
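
As a hedged sketch of this direct-marketing use, the snippet below fits a classifier to toy customer records; the features, values, and model choice are illustrative assumptions, not real data.

# Each record: [age, income_in_thousands]; class 1 = bought a similar product.
from sklearn.linear_model import LogisticRegression

X = [[25, 30], [47, 85], [35, 60], [52, 90], [23, 25], [40, 70]]
y = [0, 1, 1, 1, 0, 1]

clf = LogisticRegression().fit(X, y)

# Predict whether a new customer is likely to buy, to decide on targeting.
print(clf.predict([[30, 55]]))   # e.g. array([1]) -> worth targeting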

b) Prediction : The prediction task predicts the possible values of missing or future data. Prediction involves developing a model based on the available data, and this model is used in predicting future values of a new data set of interest. For example, a model can
predict the income of an employee based on education, experience and other
demographic factors like place of stay, gender etc. Also prediction analysis is used in
different areas including medical diagnosis, fraud detection etc.

c) Time - Series Analysis : Time series is a sequence of events where the next event
is determined by one or more of the preceding events. Time series reflects the process
being measured and there are certain components that affect the behavior of a process.
Time series analysis includes methods to analyze time-series data in order to extract
useful patterns, trends, rules and statistics. Stock market prediction is an important
application of time-series analysis.
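
As a minimal sketch of extracting a trend from a time series with pandas, the snippet below smooths a synthetic daily price series; the data itself is an assumption for illustration.

import pandas as pd
import numpy as np

dates = pd.date_range("2024-01-01", periods=30, freq="D")
prices = pd.Series(100 + np.cumsum(np.random.randn(30)), index=dates)

# A rolling mean smooths daily noise and exposes the underlying trend.
trend = prices.rolling(window=7).mean()
print(trend.tail())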

Descriptive data mining tasks :

d) Association : Association discovers the association or connection among a set of


items. Association identifies the relationships between objects. Association analysis is
used for commodity management, advertising, catalog design, direct marketing etc. A
retailer can identify the products that normally customers purchase together or even find
the customers who respond to the promotion of the same kind of products. If a retailer finds that beer and nappies are mostly bought together, he can put nappies on sale to promote the sale of beer.
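
A sketch of this kind of association analysis is shown below using the mlxtend library (an assumed third-party dependency, pip install mlxtend); the basket data is an illustrative assumption.

import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# One row per transaction; True means the item was in the basket.
baskets = pd.DataFrame(
    {"beer":    [True, True, False, True],
     "nappies": [True, True, False, True],
     "milk":    [False, True, True, False]}
)

frequent = apriori(baskets, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])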

e) Clustering : Clustering is used to identify data objects that are similar to one
another. The similarity can be decided based on a number of factors like purchase
behavior, responsiveness to certain actions, geographical locations and so on. For
example, an insurance company can cluster its customers based on age, residence,
income etc. This group information will be helpful to understand the customers better
and hence provide better customized services.
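
Below is a minimal sketch of clustering insurance customers with k-means; the feature values and the number of clusters are illustrative assumptions.

from sklearn.cluster import KMeans

# Each row: [age, annual_income_in_thousands]
customers = [[25, 30], [27, 35], [45, 80], [48, 85], [60, 40], [62, 45]]

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)   # cluster assignment for each customer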

f) Summarization : Summarization is the generalization of data. A set of relevant data is summarized, resulting in a smaller set that gives aggregated information about the data. For example, the shopping done by a customer can be summarized into total products, total spending, offers used, etc. Such high-level summarized information can be useful to sales or customer relationship teams for detailed analysis of customers and purchase behavior. Data can be summarized at different abstraction levels and from different angles.
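
A sketch of such a per-customer summary with pandas follows; the column names and values are illustrative assumptions.

import pandas as pd

orders = pd.DataFrame({
    "customer": ["A", "A", "B", "B", "B"],
    "amount":   [120, 80, 40, 60, 55],
    "offer_used": [True, False, False, True, True],
})

# Summarize each customer's shopping into aggregated figures.
summary = orders.groupby("customer").agg(
    total_products=("amount", "count"),
    total_spending=("amount", "sum"),
    offers_used=("offer_used", "sum"),
)
print(summary)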

-----------------------------------------------------------------------------------------------------------------


Concept Learning : Inferring a Boolean-valued function from training examples of its input and output. (To infer means to conclude something from evidence and reasoning rather than from explicit statements.)

 A concept is an idea of something formed by combining all the features or attributes that construct the given concept. Every concept has two components:
o Attributes: features that one must look for to decide whether a data instance is a positive instance of the concept.
o A rule: denotes what conjunction of constraints on the attributes will qualify as a positive instance of the concept.

Concept learning Tasks :

---------------------------------------------------------------------------------------------------------------------

FIND-S Algorithm : The FIND-S algorithm is a basic concept learning algorithm in machine learning. The FIND-S algorithm finds the most specific hypothesis that fits all the positive examples.
Representation :

1. ? indicates that any value is acceptable for the attribute.
2. A single required value (e.g., Cold) indicates that only that value is acceptable for the attribute.
3. Φ indicates that no value is acceptable.
4. The most general hypothesis is represented by: {?, ?, ?, ?, ?, ?}
5. The most specific hypothesis is represented by: {Φ, Φ, Φ, Φ, Φ, Φ}

Algorithm Involved In FIND-S (a Python sketch follows the steps) :

1. Initialize h to the most specific hypothesis in H.
2. For each positive training instance x:
For each attribute constraint ai in h:
If the constraint ai is satisfied by x,
then do nothing;
else replace ai in h by the next more general constraint that is satisfied by x.
3. Output hypothesis h.
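
Below is a minimal Python sketch of FIND-S; the training tuples are a shortened version of the classic EnjoySport data, included only for illustration.

def find_s(examples):
    """Return the most specific hypothesis consistent with the positive examples."""
    hypothesis = None
    for attributes, label in examples:
        if label != "Yes":          # FIND-S ignores negative examples
            continue
        if hypothesis is None:      # first positive example: copy it verbatim
            hypothesis = list(attributes)
        else:                       # generalize mismatching constraints to '?'
            hypothesis = [h if h == a else "?"
                          for h, a in zip(hypothesis, attributes)]
    return hypothesis

examples = [
    (("Sunny", "Warm", "Normal", "Strong"), "Yes"),
    (("Sunny", "Warm", "High",   "Strong"), "Yes"),
    (("Rainy", "Cold", "High",   "Strong"), "No"),
]
print(find_s(examples))   # ['Sunny', 'Warm', '?', 'Strong']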


Step 1 : Initialize h to the most specific hypothesis. Step 2 (Iteration 1 onwards) : Generalize h over each positive training example. (The worked trace appears as figures in the original notes; the sketch above reproduces it in code.)

------------------------------------------------------------------------------------------------------------------


Try it yourself : FIND-S algorithm :


Well-Posed Learning Problem – A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
Any problem can be classified as a well-posed learning problem if it has three traits –
 Task
 Performance Measure
 Experience

Some examples that clearly define the well-posed learning problem are –
1. To better filter emails as spam or not
a. Task – Classifying emails as spam or not
b. Performance Measure – The fraction of emails accurately classified as
spam or not spam
c. Experience – Observing you label emails as spam or not spam
2. Handwriting Recognition Problem
a. Task – Recognizing handwritten words within images.
b. Performance Measure – percent of words accurately classified
c. Experience – a directory of handwritten words with given classifications

----------------------------------------------------------------------------------------------------------------

Common issues in Machine Learning

Some common issues in Machine Learning that professionals face while developing ML skills and creating applications from scratch are described below.

1. Inadequate and poor quality of training data : The major issue that arises while using machine learning algorithms is a lack of quality as well as quantity of data. Although data plays a vital role in machine learning, many data scientists claim that inadequate data, noisy data, and unclean data severely hamper machine learning algorithms. Data quality can be affected by factors such as the following:

o Noisy data - It is responsible for inaccurate predictions that affect decisions as well as accuracy in classification tasks.
o Incorrect data - It is also responsible for faulty results obtained from machine learning models. Hence, incorrect data may affect the accuracy of the results as well.

Noisy data, incomplete data, inaccurate data, and unclean data lead to less
accuracy in classification and low-quality results. Hence, data quality can also be
considered as a major common problem while processing machine learning
algorithms.


2. Non-representative training data : To make sure our model generalizes well, we have to ensure that the sample training data is representative of the new cases we need to generalize to. The training data must cover all the cases that have already occurred as well as those that may occur.

A machine learning model is said to be ideal if it predicts well for generalized cases and provides accurate decisions. If there is too little training data, there will be sampling noise in the model; such a training set is called a non-representative training set, and it won't be accurate in its predictions. The model will be biased toward one class or group. Hence, we should use representative data in training to protect against bias and to make accurate predictions without any drift.

3. Overfitting and Underfitting

Overfitting: Overfitting is one of the most common issues faced by Machine Learning engineers and data scientists. Whenever a machine learning model is trained on a huge amount of data, it starts capturing the noise and inaccuracies present in the training data set. This negatively affects the performance of the model. We can counter overfitting by using simpler linear and parametric models, among other remedies.

Methods to reduce overfitting (a sketch of the regularization items follows this list):

o Increase training data in a dataset.


o Reduce model complexity by simplifying the model by selecting one with fewer
parameters
o Ridge Regularization and Lasso Regularization
o Early stopping during the training phase
o Reduce the noise
o Reduce the number of attributes in training data.
o Constraining the model.
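
As a hedged sketch of two of the remedies above, Ridge and Lasso regularization, the snippet below fits both on synthetic data (the data itself is an assumption).

import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))               # many features, few samples
y = X[:, 0] * 3.0 + rng.normal(size=50)     # only one feature matters

ridge = Ridge(alpha=1.0).fit(X, y)   # shrinks all coefficients toward 0
lasso = Lasso(alpha=0.1).fit(X, y)   # drives irrelevant coefficients to 0

print("nonzero ridge coefs:", np.sum(ridge.coef_ != 0))
print("nonzero lasso coefs:", np.sum(lasso.coef_ != 0))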

Underfitting: Underfitting is just the opposite of overfitting. Whenever a machine learning model is trained with too little data, it produces incomplete and inaccurate predictions and destroys the accuracy of the machine learning model.

Underfitting occurs when our model is too simple to capture the underlying structure of the data, just like an undersized pair of pants. This generally happens when we have limited data in the data set and try to build a linear model with non-linear data. In such scenarios the model is not complex enough, its rules are too simple to apply to the data set, and it starts making wrong predictions as well.

Methods to reduce underfitting (a sketch of the first item follows this list):

o Increase model complexity


o Remove noise from the data


o Train on more and better features
o Reduce the constraints
o Increase the number of epochs to get better results.
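
The sketch below illustrates the first remedy, increasing model complexity by adding polynomial features when a linear fit underfits non-linear data; the synthetic parabola is an assumption.

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

X = np.linspace(-3, 3, 40).reshape(-1, 1)
y = X.ravel() ** 2 + np.random.default_rng(0).normal(0, 0.5, 40)

linear = LinearRegression().fit(X, y)                 # underfits a parabola
curved = make_pipeline(PolynomialFeatures(2),
                       LinearRegression()).fit(X, y)  # captures the curve

print("linear R^2:", linear.score(X, y))
print("degree-2 R^2:", curved.score(X, y))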

4. Monitoring and maintenance : As we know, generalized output is mandatory for any machine learning model; hence, regular monitoring and maintenance become compulsory. As requirements and data change over time, the model's code and the resources for monitoring it must be updated as well.

5. Getting bad recommendations : A machine learning model operates in a specific context; when that context shifts, it produces bad recommendations and suffers concept drift. For example, at a specific time a customer is looking for some gadgets; the customer's requirements change over time, but the machine learning model keeps showing the same recommendations even though the customer's expectations have changed. This incident is called data drift. It generally occurs when new data is introduced or the interpretation of the data changes. However, we can overcome this by regularly monitoring and updating the data according to the expectations.
6. Lack of skilled resources : Although Machine Learning and Artificial Intelligence are continuously growing in the market, these industries are still young in comparison to others. The absence of skilled manpower is also an issue. Hence, we need people with in-depth knowledge of mathematics, science, and technology for developing and managing scientific substance for machine learning.
7. Process complexity of Machine Learning : The machine learning process is very complex, which is another major issue faced by machine learning engineers and data scientists. Machine Learning and Artificial Intelligence are very new technologies, still in an experimental phase and continuously changing over time.
8. Data bias : Data bias is also a big challenge in Machine Learning. These errors occur when certain elements of the dataset are weighted more heavily or given more importance than others. Biased data leads to inaccurate results, skewed outcomes, and other analytical errors. However, we can resolve this error by determining where the data is actually biased in the dataset, and then taking the necessary steps to reduce the bias.

Methods to remove Data Bias:


o Research more for customer segmentation.
o Be aware of your general use cases and potential outliers.
o Combine inputs from multiple sources to ensure data diversity.
o Include bias testing in the development process.
o Analyze data regularly and keep tracking errors to resolve them easily.
o Review the collected and annotated data.


9. Lack of explainability : This basically means that the outputs cannot be easily comprehended, as the model is programmed in specific ways to deliver outputs for certain conditions. Hence, the lack of explainability found in machine learning algorithms reduces the credibility of the algorithms.

10. Slow implementations and results : This issue is also very commonly seen in machine learning models. Machine learning models are highly efficient at producing accurate results but are time-consuming. Slow programs, excessive requirements, and overloaded data take more time than expected to provide accurate results. This calls for continuous maintenance and monitoring of the model to deliver accurate results.

===========================================================

Data Preprocessing : Data preprocessing consists of different techniques and strategies that are interrelated in complex ways. The different data preprocessing techniques are :

1. Aggregation
2. Sampling
3. Dimensionality reduction
4. Feature subset selection
5. Feature creation
6. Discretization and Binarization
7. Variable Transformation

1. Aggregation : The combination of two or more objects into a single object is called aggregation. Aggregation in data mining is the process of finding, collecting, and presenting data in a summarized format to perform statistical analysis of business schemes or of human patterns. When data is collected from various datasets, it is crucial to gather accurate data to produce significant results. Example of aggregate data: finding the average age of customers buying a particular product, which can help in identifying the target age group for that product. Instead of dealing with individual customers, the average age of the customers is calculated.
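
A minimal pandas sketch of this average-age aggregation follows; the product and age values are illustrative assumptions.

import pandas as pd

purchases = pd.DataFrame({
    "product": ["phone", "phone", "laptop", "laptop", "phone"],
    "customer_age": [22, 31, 45, 39, 27],
})

# Aggregate individual customers into one average age per product.
print(purchases.groupby("product")["customer_age"].mean())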

2. Sampling : Sampling is the practice of selecting a subset of individuals from a population in order to study the whole population. Every sampling type falls under two broad categories, sketched in code below:
 Probability sampling - Random selection techniques are used to select the sample.
 Non-probability sampling - Non-random selection techniques based on certain criteria are used to select the sample.
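
The sketch below contrasts the two categories with pandas; the population DataFrame is an illustrative assumption.

import pandas as pd

population = pd.DataFrame({"id": range(100), "age": range(18, 118)})

# Probability sampling: every row has an equal, random chance of selection.
random_sample = population.sample(n=10, random_state=0)

# Non-probability sampling: selection by a fixed criterion, not by chance.
convenience_sample = population[population["age"] < 30]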
3. Dimensionality reduction : Data sets can have a large number of features. Dimensionality reduction is the process of reducing the number of random variables under consideration by obtaining a set of principal variables. It can be divided into feature selection and feature extraction. A key benefit of dimensionality reduction in machine learning and predictive modeling is that data mining algorithms work better when the dimensionality (the number of attributes) is lower.
Curse of Dimensionality: The curse of dimensionality describes the explosive nature of increasing data dimensions and the resulting exponential increase in the computational effort required for their processing and/or analysis. The term was first introduced by Richard E. Bellman. In machine learning, a feature of an object can be an attribute or a characteristic that defines it. Each feature represents a dimension, and a group of dimensions creates a data point. This is the feature vector that defines the data point to be used by a machine learning algorithm. An increase in dimensionality implies an increase in the number of features used to describe the data. For example, in the field of breast cancer research, age and the number of cancerous nodes can be used as features to define the prognosis of a patient. These features constitute the dimensions of a feature vector. But other factors like past surgeries, patient history, and type of tumor help a doctor to better determine the prognosis; by adding such features, we are increasing the dimensionality of our data.
As the dimensionality increases, the number of data points required for good
performance of any machine learning algorithm increases exponentially.
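
Below is a minimal sketch of dimensionality reduction with PCA, a feature extraction method; the synthetic data is an assumption for illustration.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))          # 100 points in 50 dimensions

pca = PCA(n_components=5)               # keep 5 principal variables
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                  # (100, 5)
print(pca.explained_variance_ratio_)    # variance retained per component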

4. Feature subset selection : Another way to reduce the dimensionality is to use only a subset of the features. Redundant features duplicate much or all of the information contained in one or more other attributes. For example, the purchase price of a product and the amount of sales tax paid contain much of the same information, so we can keep either the sales tax paid or the purchase price in the data set.
5. Feature creation: Feature engineering is a process of using domain
knowledge to create/extract new features from a given dataset by using data
mining techniques. It helps machine learning algorithms to understand data
and determine patterns that can improve the performance of machine learning
algorithms. Three related methodologies for creating new attributes are
described next: feature extraction, mapping the data to a new space, and
feature construction.
6. Discretization and Binarization : Data discretization is a method of
converting attributes values of continuous data into a finite set of intervals with
minimum data loss. In contrast, data binarization is used to transform the
continuous and discrete attributes into binary attributes.
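
A sketch of both operations with scikit-learn follows; the age values are illustrative assumptions.

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer, Binarizer

ages = np.array([[6], [12], [20], [27], [35], [54], [70]])

# Discretization: continuous ages -> 3 ordinal intervals with minimal loss.
disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
print(disc.fit_transform(ages).ravel())

# Binarization: 1 if age exceeds the threshold, else 0.
print(Binarizer(threshold=18).fit_transform(ages).ravel())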
7. Variable Transformation : A variable transformation is a transformation that is applied to all the values of a variable. In other words, for every object, the transformation is applied to the value of the variable for that object. For instance, if only the magnitude of a variable is important, then the values of the variable can be transformed by taking the absolute value.
------------------------------------------------------------------------------------------------------------


Measure of similarity and Dissimilarity:

Proximity: The term proximity is used to refer to either similarity or dissimilarity. The proximity between two objects is a function of the proximity between the corresponding attributes of the two objects.

Similarity : Similarity between two objects is a numerical measure of the degree to


which the two objects are alike. Consequently, similarities are higher for pairs of
objects that are more alike. Similarities are usually non-negative and are often
between 0 (no similarity) and 1 (complete similarity).

Dissimilarity: The dissimilarity between two objects is a numerical measure of the


degree to which the two objects are different. Dissimilarities are lower for more similar
pairs of objects. Dissimilarities sometimes fall in the interval [0,1], but it is also
common for them to range from 0 to ∞.

---------------------------------------------------------------------------------------------------------------

Data similarity measures between objects that contain only binary attributes are called similarity coefficients, and typically have values between 0 and 1. A value of 1 indicates that the two objects are completely similar, while a value of 0 indicates that the objects are not at all similar. There are many rationales for why one coefficient is better than another in specific instances.

Let x and y be two objects that consist of n binary attributes. The comparison of two such objects, i.e., two binary vectors, leads to the following four quantities (frequencies):

f00 : the number of attributes where x is 0 and y is 0

f01 : the number of attributes where x is 0 and y is 1

f10 : the number of attributes where x is 1 and y is 0

f11 : the number of attributes where x is 1 and y is 1

Simple Matching Coefficient : One commonly used similarity coefficient is the simple matching coefficient (SMC), which is defined as

number of matching attribute values

SMC = -------------------------------------------------------------

number of attributes

= (f11 + f00) / (f01 + f10 + f11 + f00)


Jaccard coefficient : The Jaccard coefficient handles the asymmetric case, i.e., it only considers matches of 1's and ignores the 0's in both the numerator and the denominator:

J = number of 1-1 matches / number of non-zero attributes = f11 / (f01 + f10 + f11)

Example (SMC vs. Jaccard) :

x = 1 0 1 1 0 1 0 0 0 1
y = 1 1 0 1 0 0 0 0 1 1

f01 = 2, f10 = 2, f00 = 3, f11 = 3

SMC = (f11 + f00) / (f01 + f10 + f11 + f00) = (3 + 3) / 10 = 6/10

J = f11 / (f01 + f10 + f11) = 3 / (2 + 2 + 3) = 3/7
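
A short Python check of this example, using the vectors from the text:

x = [1, 0, 1, 1, 0, 1, 0, 0, 0, 1]
y = [1, 1, 0, 1, 0, 0, 0, 0, 1, 1]

f11 = sum(a == 1 and b == 1 for a, b in zip(x, y))
f00 = sum(a == 0 and b == 0 for a, b in zip(x, y))
f10 = sum(a == 1 and b == 0 for a, b in zip(x, y))
f01 = sum(a == 0 and b == 1 for a, b in zip(x, y))

smc = (f11 + f00) / (f01 + f10 + f11 + f00)   # 6/10 = 0.6
jaccard = f11 / (f01 + f10 + f11)             # 3/7  ≈ 0.43
print(smc, jaccard)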


Cosine Similarity : Cosine similarity is useful when each instance is described by a set of quantified attributes; it is an alternative to Minkowski distances.

If d1 and d2 are two document vectors, then cos(d1, d2) = <d1, d2> / (||d1|| ||d2||), where <d1, d2> denotes the inner (dot) product d1'd2 of the vectors d1 and d2, and ||d|| is the length of vector d.

Example:

d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2

<d1, d2> = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
|| d1 || = (3*3+2*2+0*0+5*5+0*0+0*0+0*0+2*2+0*0+0*0)^(1/2) = (42)^(1/2) = 6.481
|| d2 || = (1*1+0*0+0*0+0*0+0*0+0*0+0*0+1*1+0*0+2*2)^(1/2) = (6)^(1/2) = 2.449

cos(d1, d2 ) = 0.3150
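
A numpy check of this worked example, using the same vectors:

import numpy as np

d1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0])
d2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2])

cos = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos, 4))   # 0.315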

Euclidean Distance Formula : For two points (x1, y1) and (x2, y2),

d = sqrt( (x2 − x1)^2 + (y2 − y1)^2 )

Example: the distance between the two points (–3, 2) and (3, 5) is
d = sqrt( (3 − (−3))^2 + (5 − 2)^2 ) = sqrt(36 + 9) = sqrt(45) ≈ 6.708
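
A one-line check of the distance computed above:

import math

d = math.dist((-3, 2), (3, 5))   # sqrt(6^2 + 3^2) = sqrt(45)
print(round(d, 3))               # 6.708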


• Classification is the task of learning a target function that maps each attribute set x to one of the predefined class labels y.

• Descriptive Modeling : A classification model can serve as an explanatory tool to distinguish between objects of different classes. Example : a biologist using such a model to summarize which features define each class of vertebrates.

• Predictive Modeling : A classification model can be used to predict the class


label of unknown records.

General Approach for classification Problem :

• The systematic approach for learning a classification model given a training


set is known as a learning algorithm. The process of using a learning
algorithm to build a classification model from the training data is known as
induction. This process is also often described as “learning a model” or
“building a model.”

• The process of applying a classification model to unseen test instances to predict their class labels is known as deduction.

The process of classification involves two steps:

 Applying a learning algorithm to training data to learn a model,


 Applying that model to assign labels to unlabeled instances.


Error Rate = (50 + 60) / (560 + 60 + 50 + 330) = 110 / 1000 = 0.11

(from a confusion matrix with 560 and 330 correctly classified records and 50 and 60 misclassified records)

How a Decision Tree Works :


Consider a simpler version of the vertebrate classification problem described in the
previous section. Instead of classifying the vertebrates into five distinct groups of
species, we assign them to two categories: mammals and non-mammals.

We can solve a classification problem by asking a series of carefully crafted


questions about the attributes of the test record. Each time we receive an answer, a
follow-up question is asked until we reach a conclusion about the class label of the
record. The series of questions and their possible answers can be organized in the
form of a decision tree, which is a hierarchical structure consisting of nodes and
directed edges. Figure shows the decision tree for the mammal classification
problem. The tree has three types of nodes:

 A root node that has no incoming edges and zero or more outgoing edges.
 Internal nodes, each of which has exactly one incoming edge and two or more
outgoing edges.
 Leaf or terminal nodes, each of which has exactly one incoming edge and no
outgoing edges.

In a decision tree, each leaf node is assigned a class label. The nonterminal nodes,
which include the root and other internal nodes, contain attribute test conditions to
separate records that have different characteristics.
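
As a hedged illustration, the sketch below trains a tiny scikit-learn decision tree on two encoded vertebrate attributes; the feature encoding and the records are assumptions made for illustration, not the textbook's data set.

from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [gives_birth, warm_blooded]; 1 = yes, 0 = no.
X = [[1, 1], [0, 1], [0, 0], [1, 1], [0, 0], [1, 0]]
y = ["mammal", "non-mammal", "non-mammal",
     "mammal", "non-mammal", "non-mammal"]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(export_text(tree, feature_names=["gives_birth", "warm_blooded"]))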


-----------------------------------------------------------------------------------------------------------------


Methods for Expressing Attribute Test Conditions

Decision tree induction algorithms must provide a method for expressing an attribute
test condition and its corresponding outcomes for different attribute types.

1. Binary Attributes : The test condition for a binary attribute generates two potential outcomes.

2. Nominal Attributes : A nominal attribute can have many values, so its test condition can be expressed in two ways: a multiway split or a binary split. For a multiway split, the number of outcomes depends on the number of distinct values of the corresponding attribute.

For example, if an attribute such as marital status has three distinct values - single, married, or divorced - its test condition will produce a three-way split. On the other hand, some decision tree algorithms, such as CART, produce only binary splits by considering all 2^(k-1) − 1 ways of creating a binary partition of k attribute values. Figure 4.9(b) illustrates three different ways of grouping the attribute values for marital status into two subsets.
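
To make the 2^(k-1) − 1 count concrete, here is a small Python check for the k = 3 marital-status values from the example above:

from itertools import combinations

values = ["single", "married", "divorced"]
splits = set()
for r in range(1, len(values)):
    for group in combinations(values, r):
        rest = tuple(v for v in values if v not in group)
        # {A} vs {B,C} and {B,C} vs {A} are the same binary split
        splits.add(frozenset([group, rest]))
print(len(splits))   # 3 = 2**(3-1) - 1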


3. Ordinal Attributes : Ordinal attributes can also produce binary or multiway


splits. Ordinal attribute values can be grouped as long as the grouping does
not violate the order property of the attribute values.

4. Continuous Attributes : For continuous attributes, the test condition can be expressed as a comparison test (A < v) or (A ≥ v) with binary outcomes, or as a range query with outcomes of the form vi ≤ A < vi+1, for i = 1, ..., k, producing a multiway split.


--------------------------------------------------------------------------------------------------------------------------------
