Data Mining Tasks Notes Given
DATA MINING
Define Data Mining:
• A process used by companies to turn raw data into useful information.
• Data Mining is a technology that blends traditional data analysis methods with
sophisticated algorithms for processing large volumes of data
• Data mining is an automatic or semi-automatic technical process that analyses
large amounts of scattered information to make sense of it and turn it into
knowledge. It looks for anomalies, patterns or correlations among millions of
records to predict results.
-----------------------------------------------------------------------------------------------------------------
Machine Learning Notes
common in the biological sciences. Subjects like genomics and medical sciences
often use both tall (in terms of n) and wide (in terms of p) datasets.
4. Data Ownership and Distribution: Sometimes the data needed for an analysis
is not stored in one location or owned by one organization. Instead, it is
geographically distributed.
----------------------------------------------------------------------------------------------------------------
Data Mining Tasks : Data mining tasks are mainly classified into two types: predictive
data mining tasks and descriptive data mining tasks.
b) Prediction : The prediction task predicts the possible values of missing or future data.
Prediction involves developing a model based on the available data; this model is
then used to predict values for a new data set of interest. For example, a model can
predict the income of an employee based on education, experience and other
demographic factors such as place of residence and gender. Prediction is also used in
areas such as medical diagnosis and fraud detection.
c) Time-Series Analysis : A time series is a sequence of events in which the next event
is determined by one or more of the preceding events. A time series reflects the process
being measured, and certain components affect the behavior of that process.
Time-series analysis includes methods for analyzing time-series data in order to extract
useful patterns, trends, rules and statistics. Stock market prediction is an important
application of time-series analysis.
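One simple way to expose a trend in time-series data is a moving average. The sketch below is only an illustration: the function name `moving_average` and the price values are made up for this example and do not come from the notes.

```python
def moving_average(series, window):
    """Return the simple moving average of `series` over `window` points."""
    if window <= 0 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [
        sum(series[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(series))
    ]


# Made-up daily closing prices, for illustration only.
prices = [10.0, 11.0, 12.0, 11.0, 13.0, 14.0]
trend = moving_average(prices, window=3)
print(trend)
```

Each output value averages the current point with the two before it, so short-term fluctuations are smoothed and the upward trend is easier to see.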
e) Clustering : Clustering is used to identify data objects that are similar to one
another. The similarity can be decided based on a number of factors like purchase
behavior, responsiveness to certain actions, geographical locations and so on. For
example, an insurance company can cluster its customers based on age, residence,
income etc. This grouping helps the company understand its customers better
and hence provide better customized services.
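As a rough sketch of how such grouping can be done, the code below applies k-means, one common clustering algorithm. The choice of k-means is an assumption made for this example, and the customer records (age, income in thousands) are made up. The initial centroids are simply the first k points, which is a naive choice.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, max_iter=100):
    """Plain k-means; returns (labels, centroids)."""
    centroids = [tuple(p) for p in points[:k]]  # naive init: first k points
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids


# Hypothetical customers as (age, income in thousands).
customers = [(25, 30), (27, 32), (24, 28), (55, 80), (58, 85), (60, 82)]
labels, centroids = k_means(customers, k=2)
print(labels)
```

In practice the attributes would usually be scaled first, so that one attribute with a large numeric range does not dominate the distance.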
-----------------------------------------------------------------------------------------------------------------
Some examples that illustrate a well-posed learning problem are:
1. Filtering emails as spam or not spam
a. Task – Classifying emails as spam or not spam
b. Performance Measure – The fraction of emails correctly classified as
spam or not spam
c. Experience – Observing you label emails as spam or not spam
2. Handwriting Recognition Problem
a. Task – Recognizing and classifying handwritten words within images
b. Performance Measure – The percentage of words correctly classified
c. Experience – A database of handwritten words with given classifications
----------------------------------------------------------------------------------------------------------------
1. Inadequate and poor quality of training data : A major issue when
using machine learning algorithms is the lack of both quality and quantity of data.
Because data plays a vital role in machine learning algorithms,
many data scientists report that inadequate, noisy, and unclean data
severely degrade them. Data quality can be affected
by the following factors:
noisy data, incomplete data, inaccurate data, and unclean data, all of which lead to lower
accuracy in classification and low-quality results. Hence, data quality is
a major common problem when working with machine learning
algorithms.
2. Non-representative training data : For a trained model to generalize
well, the sample training data must be representative of
the new cases we want to generalize to. The training data should cover the cases that
have already occurred as well as those that are occurring now.
A machine learning model is considered ideal if it predicts well on new cases
and provides accurate decisions. If there is too little training data, the sample will contain
sampling noise, and the result is a non-representative training set. A model trained on it will not be
accurate in its predictions and will be biased toward one class or
group. Hence, we should use representative data in training to guard against
bias and to make accurate predictions without drift.
Overfitting: Overfitting is one of the most common issues faced by machine learning
engineers and data scientists. It occurs when a model that is too complex for the
available training data fits that data too closely and starts capturing its noise and
inaccurate values as if they were real patterns. This hurts the performance of the model
on new data. Overfitting can be reduced by using simpler models (for example, linear
and parametric algorithms), by applying regularization, or by gathering more training data.
Underfitting: Underfitting occurs when the model is too simple to capture the underlying structure of
the data, much like undersized trousers that do not fit. This generally happens when we have limited
data in the data set, or when we try to fit a linear model to non-linear data. In such
scenarios, the model is not complex enough, its rules are too simple
for the data set, and the model makes
wrong predictions as well.
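The contrast can be sketched with two deliberately extreme models. A straight line fitted to roughly quadratic data is too simple and leaves a large training error (underfitting), while a 1-nearest-neighbour model memorizes every training point, noise included, and so reaches zero training error without that guaranteeing good results on new points. The data values and function names below are made up for illustration.

```python
def fit_line(points):
    """Least-squares straight line through (x, y) points; returns a predictor."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_nearest_neighbour(points):
    """1-nearest-neighbour: memorizes the training set, noise included."""
    return lambda x: min(points, key=lambda p: abs(p[0] - x))[1]


def mean_squared_error(model, points):
    return sum((model(x) - y) ** 2 for x, y in points) / len(points)


# Made-up data: roughly y = x**2 with small hand-picked noise.
train = [(0, 0.1), (1, 0.9), (2, 4.2), (3, 8.8), (4, 16.3), (5, 24.7)]
held_out = [(0.5, 0.25), (1.5, 2.25), (2.5, 6.25), (3.5, 12.25), (4.5, 20.25)]

line = fit_line(train)
memorizer = fit_nearest_neighbour(train)

for name, model in [("straight line", line), ("1-nearest-neighbour", memorizer)]:
    print(name,
          "train MSE:", round(mean_squared_error(model, train), 3),
          "held-out MSE:", round(mean_squared_error(model, held_out), 3))
```

Comparing the training error with the error on held-out points is the usual way to tell the two problems apart: underfitting shows up as a high error on both, overfitting as a gap between them.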
10. Slow implementations and results : This issue is also very common in
machine learning models. Machine learning models can be highly effective at
producing accurate results, but they are often time-consuming. Slow programs, excessive
requirements and overloaded data mean that results take longer than
expected. The model therefore needs continuous maintenance and monitoring to keep
delivering accurate results.
===========================================================
---------------------------------------------------------------------------------------------------------------
Similarity measures between objects that contain only binary attributes are
called similarity coefficients, and typically have values between 0 and 1. A value of 1
indicates that the two objects are completely similar, while a value of 0 indicates that
the objects are not at all similar. There are many rationales for why one coefficient is
better than another in specific instances.
Let x and y be two objects that consist of n binary attributes. The comparison of two
such objects, i.e., two binary vectors, leads to the following four quantities
(frequencies):
f00 = the number of attributes where x is 0 and y is 0
f01 = the number of attributes where x is 0 and y is 1
f10 = the number of attributes where x is 1 and y is 0
f11 = the number of attributes where x is 1 and y is 1
Simple Matching Coefficient : One commonly used similarity coefficient is the simple
matching coefficient (SMC), which is defined as
SMC = (number of matching attribute values) / (number of attributes)
    = (f11 + f00) / (f01 + f10 + f11 + f00)
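The Jaccard coefficient handles the asymmetric setting, i.e., it considers only the matching 1's and ignores
the matching 0's, in the denominator as well:
J = (number of attributes where both x and y are 1) / (number of attributes where x and y are not both 0)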
Example 2:
x=1011010001
y=1101000011
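The two coefficients can be checked on the vectors of Example 2 with a short sketch. Here f11 counts positions where both vectors are 1 and f00 counts positions where both are 0. The function name `binary_similarity` is made up for this example, and returning 0.0 for the Jaccard coefficient when there are no 1's at all is an assumed convention (the ratio is undefined in that case).

```python
def binary_similarity(x, y):
    """Return (SMC, Jaccard) for two equal-length strings of '0' and '1'."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    f11 = sum(a == "1" and b == "1" for a, b in zip(x, y))
    f00 = sum(a == "0" and b == "0" for a, b in zip(x, y))
    mismatches = len(x) - f11 - f00  # f01 + f10
    smc = (f11 + f00) / len(x)
    # Assumed convention: Jaccard is 0.0 when neither vector has a 1.
    jaccard = f11 / (f11 + mismatches) if (f11 + mismatches) else 0.0
    return smc, jaccard


smc, jaccard = binary_similarity("1011010001", "1101000011")
print(smc, round(jaccard, 4))
```

For these vectors f11 = 3, f00 = 3 and there are 4 mismatches, so SMC = 6/10 = 0.6 and the Jaccard coefficient is 3/7, about 0.4286.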
Cosine Similarity : Cosine similarity applies when each instance is described by a set of
numeric attributes; it is an alternative to Minkowski distances.
If d1 and d2 are two document vectors, then cos(d1, d2) = <d1, d2> / (||d1|| ||d2||),
where <d1, d2> indicates the inner product or vector dot product d1'd2 of
vectors d1 and d2, and ||d|| is the length of vector d.
Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
<d1, d2> = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
|| d1 || = (3*3+2*2+0*0+5*5+0*0+0*0+0*0+2*2+0*0+0*0)^(1/2) = (42)^(1/2) = 6.481
|| d2 || = (1*1+0*0+0*0+0*0+0*0+0*0+0*0+1*1+0*0+2*2)^(1/2) = (6)^(1/2) = 2.449
cos(d1, d2) = 5 / (6.481 * 2.449) = 0.3150
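The same calculation can be reproduced in a few lines. This is a minimal sketch; the function name `cosine_similarity` is chosen for this example, and raising an error for a zero-length vector is an assumed design choice, since the cosine is undefined there.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
print(round(cosine_similarity(d1, d2), 4))
```

Because the result depends only on the angle between the vectors, scaling either document vector (for example, doubling every term count) leaves the similarity unchanged.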
• Classification is the task of learning a target function f that maps each attribute
set x to one of the predefined class labels y.
Error Rate = (50 + 60) / (560 + 60 + 50 + 330) = 110 / 1000 = 0.11
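The error-rate arithmetic can be sketched as below. The split of the four counts is an assumption read off the formula: 50 and 60 are taken as the misclassified records because they appear in the numerator, and 560 and 330 as the correctly classified ones. The function name `error_rate` is made up for this example.

```python
def error_rate(correct, incorrect):
    """Fraction of records that were misclassified."""
    total = sum(correct) + sum(incorrect)
    if total == 0:
        raise ValueError("no records to evaluate")
    return sum(incorrect) / total


# Assumed split: 560 and 330 correct, 50 and 60 misclassified.
rate = error_rate(correct=[560, 330], incorrect=[50, 60])
accuracy = 1 - rate
print(round(rate, 2), round(accuracy, 2))
```

Accuracy is the complement of the error rate, so these counts give an error rate of 0.11 and an accuracy of 0.89.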
A decision tree has three types of nodes:
• A root node, which has no incoming edges and zero or more outgoing edges.
• Internal nodes, each of which has exactly one incoming edge and two or more
outgoing edges.
• Leaf or terminal nodes, each of which has exactly one incoming edge and no
outgoing edges.
In a decision tree, each leaf node is assigned a class label. The nonterminal nodes,
which include the root and other internal nodes, contain attribute test conditions to
separate records that have different characteristics.
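The node roles described above can be sketched as a small data structure: nonterminal nodes hold an attribute test and one child per outcome, and leaf nodes hold only a class label. The tree itself (home ownership, an income threshold of 80, and the risk labels) is entirely hypothetical and serves only to show how a record is routed from the root to a leaf.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class Node:
    label: Optional[str] = None                   # set only on leaf nodes
    test: Optional[Callable[[dict], str]] = None  # attribute test on nonterminal nodes
    children: Dict[str, "Node"] = field(default_factory=dict)


def classify(node, record):
    """Follow test outcomes from the root until a leaf is reached."""
    while node.label is None:
        outcome = node.test(record)
        node = node.children[outcome]
    return node.label


# Hypothetical tree: the root tests home ownership, an internal node tests income.
income_node = Node(
    test=lambda r: "high" if r["income"] >= 80 else "low",
    children={"high": Node(label="low risk"), "low": Node(label="high risk")},
)
root = Node(
    test=lambda r: "yes" if r["owns_home"] else "no",
    children={"yes": Node(label="low risk"), "no": income_node},
)

print(classify(root, {"owns_home": False, "income": 45}))
```

Classifying a record costs one attribute test per level, so the work is bounded by the depth of the tree.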
-----------------------------------------------------------------------------------------------------------------
Decision tree induction algorithms must provide a method for expressing an attribute
test condition and its corresponding outcomes for different attribute types.
1. Binary Attributes : The test condition for a binary attribute generates two
potential outcomes.
2. Nominal Attributes : Since a nominal attribute can have many values, its test
condition can be expressed in two ways: as a multiway split or as a binary split.
For a multiway split, the number of
outcomes depends on the number of distinct values for the corresponding
attribute.
For example, if an attribute such as marital status has three distinct values (single,
married, or divorced), its test condition will produce a three-way split. On the other
hand, some decision tree algorithms, such as CART, produce only binary splits by
considering all 2^(k-1) - 1 ways of creating a binary partition of k attribute values. Figure
4.9(b) illustrates three different ways of grouping the attribute values for marital status
into two subsets.
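The count of 2^(k-1) - 1 binary partitions can be verified by enumerating them. In the sketch below, the function name `binary_partitions` is made up for this example and the attribute values are assumed to be distinct. Pinning the first value to the left subset ensures that each unordered split is generated exactly once.

```python
from itertools import combinations


def binary_partitions(values):
    """List every way to split distinct values into two non-empty subsets."""
    values = list(values)
    first, rest = values[0], values[1:]
    partitions = []
    # Keep `first` on the left so that {A}|{B} and {B}|{A} are not both produced.
    for r in range(len(rest) + 1):
        for extra in combinations(rest, r):
            left = {first, *extra}
            right = set(values) - left
            if right:  # skip the trivial split with an empty side
                partitions.append((left, right))
    return partitions


splits = binary_partitions(["single", "married", "divorced"])
for left, right in splits:
    print(sorted(left), "|", sorted(right))
```

For k = 3 this yields the three groupings of marital status, and for k = 4 it yields 7, in line with the formula.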
4. Continuous Attributes :
--------------------------------------------------------------------------------------------------------------------------------