Digital Data Mining Nostos - FP

Data mining is the process of discovering patterns in large data sets. It involves techniques like regression, classification, clustering, and association rule mining. Regression analysis examines relationships between variables to make predictions. It fits a linear regression line to observed data points to minimize errors between estimated and actual values. Data mining is used in many applications like recommendations, fraud detection, and market basket analysis.

Data Mining

Data is ubiquitous. Instead of consuming data unnecessarily, we must use it to distill
information and turn raw data into knowledge.

The world is data-rich, but poor in information.

Data Mining is a magic wand that does wonders with data.

Data Mining

Try understanding it by breaking down the term into chunks.

 Data is present everywhere. It comprises the facts and statistics collected together for
reference and analysis.
 Mining is a vivid term portraying the process of extracting small nuggets of value from a
large volume of raw material.

Definition

In the book Data Mining: Concepts and Techniques, Data Mining is defined as

The process of discovering interesting patterns (non-trivial, potentially useful, previously
unknown) and knowledge from large amounts of data.

Why Data Mining?

 The explosive growth of data from terabytes to petabytes
 Prominent sources of voluminous data
o Business: Web, Transactions, Stock
o Science: Satellites, Sensors, Bio-informatics, Medical Diagnosis
o Social Media: News, YouTube, Facebook, Twitter, Instagram

We are drowning in data, but starving for knowledge.

As the saying goes, necessity is the mother of invention: the need to extract knowledge from
massive data gave rise to automated analysis. Thus, Data Mining came into existence.

Trivia
Data Mining is also termed Knowledge Discovery in Databases (KDD).

Data Mining - An Interdisciplinary Term

Data mining is an interdisciplinary field that spans the domains portrayed in the above image.

Why not Traditional Data Analysis?

 Tremendous volume of data generation
 High dimensionality of data
 High complexity of data
 Emergence of new and sophisticated applications

Data Mining in a Nutshell

Here is an expert talk demystifying the concept of data mining and explaining why it will
continue to grow in popularity.


KDD Process
Many people look upon data mining as a synonym of Knowledge Discovery in Databases, or
KDD. Others consider Data Mining a vital step within the KDD process.

The above image unveils the steps of the KDD process.

Steps in KDD Process

1. Data Cleaning - Removal of noisy and inconsistent data.
2. Data Integration - Data from diverse sources are unified.
3. Data Selection - Relevant data is retrieved.
4. Data Transformation - Data is transformed into forms appropriate for mining.
5. Data Mining - Intelligent methods are applied to extract knowledge and patterns.
6. Pattern Evaluation - Genuinely interesting patterns are identified.
7. Knowledge Presentation - Visualization and presentation of the extracted
knowledge and the identified patterns.
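To make steps 1-4 concrete, here is a minimal sketch of a KDD-style pipeline using pandas. The file names, column names, and thresholds are hypothetical, chosen only for illustration.

import pandas as pd

# 1. Data Cleaning - drop missing values and inconsistent records
#    ("sales.csv" and its columns are hypothetical)
sales = pd.read_csv("sales.csv").dropna()
sales = sales[sales["amount"] > 0]

# 2. Data Integration - unify data from a second source
customers = pd.read_csv("customers.csv")
data = sales.merge(customers, on="customer_id")

# 3. Data Selection - retrieve only the attributes relevant to the task
data = data[["customer_id", "age", "amount"]]

# 4. Data Transformation - scale the amount into a form suitable for mining
data["amount_scaled"] = (data["amount"] - data["amount"].mean()) / data["amount"].std()

# Steps 5-7 (mining, pattern evaluation, presentation) would follow on `data`.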
Types of Mined Data

Thomas A. Edison said,

The value of an idea lies in the using of it.

Likewise, the value of mined knowledge relies on using it at the right place and the right time.

Basic forms of data that can be mined are listed as follows:

1. Database-oriented data
o Database data
o Data warehouse data
o Transactional data
2. Advanced datasets
o Spatial data
o Multimedia data
o Data streams
o Sensor data
o World Wide Web (WWW)

Patterns that can be Mined

Having seen the types of data on which the process can be applied, let us now examine the
types of patterns that can be mined:

 Characterization and Discrimination
 Mining Frequent Patterns, Associations, and Correlations
 Classification
 Clustering
 Regression Analysis
 Outlier Detection
 Trend and Evolution Analysis
Data Mining in Daily Life

Amazon Recommendations

Have you ever used Amazon.com? If so, you would have come across a pop-up with the
phrase:

People who bought 'this' also bought...

This is the magic of Data Mining: algorithms run in the back-end to give us valuable
recommendations.

Facebook Suggestions

Facebook often surprises us with suggestions of friends whom we actually know. This, too, is
part of the data mining process.

Common characteristics between users, such as the school they studied in, are processed by
algorithms, which in turn suggest friends.

Data Mining in Daily Life

Market Basket Analysis

Market Basket Analysis identifies the purchase patterns of potential customers by analyzing
the data gathered from their transactions. The gathered data is then used to predict future
buying trends and forecast supply demand.

Data Mining Techniques

Just as the charm of a rainbow lies in its seven colors, the charm of data mining lies in
its seven techniques.

Data Mining Techniques

Data mining encompasses the following techniques:

1. Regression
2. Classification
3. Clustering
4. Association Rules
5. Anomaly Detection
6. Data Visualization
7. Statistics

Let's start this colorful journey by appreciating each technique of data mining!

Regression

Prelude

Our curiosity to find associations between things dates back to the nineteenth century.

The term Regression was coined (by Francis Galton) to describe a biological phenomenon
that yielded a fascinating insight: the heights of offspring of tall ancestors tend to
regress down towards a normal average.

Let us now delve into the first technique of Data Mining!

Regression

Regression Analysis

Regression Analysis is defined as

The process of examining the functional relationships among the variables.

 It is a predictive modeling technique.
 The functional relationship is represented in the form of an equation containing:
o Response or dependent variable
o Explanatory or predictor variable(s)
 Regression Analysis is used in prediction and forecasting applications.

Regression Equation

Let us assume:

 The response variable is represented by $Y$.
 Predictor variables are denoted by $X_1, X_2, X_3, \ldots, X_p$, where $p$ denotes the
number of independent or predictor variables.

A simple regression model equation is represented as

$Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \varepsilon$

where

$\beta_0, \beta_1, \ldots, \beta_p$ are regression parameters to be estimated from the data, and

$\varepsilon$ is the error term, a measure of the discrepancy between model and data.

The sign of a regression coefficient signifies the direction of the relationship (positive or
negative) between a predictor and the response variable.
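As a quick illustration, here is a minimal sketch that estimates the parameters with scikit-learn; the data points below are invented for the example.

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: one predictor X and a response Y (p = 1)
X = np.array([[1], [2], [3], [4], [5]])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

model = LinearRegression().fit(X, Y)

print("beta_0 (intercept):", model.intercept_)
print("beta_1 (slope):", model.coef_[0])
print("prediction for X = 6:", model.predict([[6]])[0])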

An Example

Let's see a sample relationship between two linearly related variables.

More on Regression!

This video explains Regression using a diagrammatic representation.

Transcript:

This tutorial is an introduction to regression. There is an X variable and a Y variable: the
independent variable is on the x-axis and the dependent variable is on the y-axis, and we try
to form a relationship between these two variables by drawing a line, in this case a straight
line. What we try to understand is: as the independent variable changes, what happens to the
dependent variable? Does it go up or down? If they move in the same direction, so that the
dependent variable increases as the independent variable increases, we say there is a positive
relationship. If, on the other hand, the dependent variable decreases as the independent
variable increases, we say there is a negative relationship, and the line slopes downward. In
linear regression we fit a line; the key word is "linear", a straight line. You can also fit
curved lines, but this topic is all straight lines.

To actually conduct regression, I take observations, plot them, and try to find a straight line
that fits through all these different points. This is called my regression line, and it is based
upon the least squares method: in the end, I want to minimize the difference between the
estimated value and the actual value. I want to minimize my errors, or make them as small as
possible.

Now let's imagine I put study time on the x-axis, making it my independent variable, and the
dependent variable becomes grades or GPA. As study time increases, grades should go up: a
positive relationship. In regression we develop equations like this: y-hat is estimated grades,
and it is equal to b0 plus b1 times x, where x is study time. b0 is derived mathematically and
is the y-intercept; b1 is also derived mathematically, and it is the slope of the line, which in
this case is positive.

If I instead put time on Facebook on the x-axis, we see a negative relationship: more time on
Facebook, and grades will suffer and go down. What we are estimating is still grades:
estimated grades equal b0 minus b1 times x, where x is time on Facebook. b0 is still the
y-intercept, a calculated value, and the slope of the line is negative because it slopes
downward. The x is the independent variable and the y is the dependent variable: x is what
we control, manipulate, and change, and the dependent variable is the outcome. Study time is
what we control, and your grades depend upon how much you study.
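The study-time example from the video can be reproduced with NumPy's least-squares fit; the hours and grades below are made up for illustration.

import numpy as np

# Hypothetical observations: hours studied per day vs. GPA
study_time = np.array([1, 2, 3, 4, 5, 6])
gpa = np.array([1.8, 2.2, 2.9, 3.1, 3.6, 3.9])

# Fit y-hat = b0 + b1 * x by the least squares method
b1, b0 = np.polyfit(study_time, gpa, 1)  # slope, intercept
errors = gpa - (b0 + b1 * study_time)    # residuals whose squares the fit minimizes

print(f"GPA is estimated as {b0:.2f} + {b1:.2f} * study_time")
print("sum of squared errors:", np.sum(errors ** 2))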

To Know More!

Explore these courses to know more about Regression.

 Regression Analysis
 Advanced Regression Analysis


Backdrop of Classification

Our practice of classification goes back to the era of ancient civilizations, when the
concept of Taxonomy emerged.

The word Taxonomy has its origin in the Ancient Greek words

 taxis - arrangement
 nomia - method

Taxonomy is the science of naming and classifying groups of biological species based on
shared characteristics. The taxonomic categories are domain, kingdom, phylum, class, order,
family, genus, and species.

Let us check the next technique of data mining.

Classification

Definition

Classification is the problem of identifying the category to which a new observation belongs,
based on a training set of data containing observations whose categories are already known.

 It is a data analysis task.
 It follows a two-step process, namely:
o Learning Step - Training phase where a model is constructed.
o Classification Step - Predicting the class labels and testing the same for accuracy.
 It predicts the value of a categorical variable.
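Here is a minimal sketch of the two-step process with scikit-learn's decision tree on the bundled Iris dataset; the split size and classifier choice are illustrative assumptions.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Learning step: construct a model from observations with known labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)

# Classification step: predict class labels and test them for accuracy
y_pred = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))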

Learn with Super-Hero Characters

Delve More!
This video unveils the concept of classification with an example.

Transcript:

So what kind of prediction tasks will we cover in this course? We will basically cover three:
classification, regression, and clustering. I am going to show them pictorially, just to get
everyone warmed up.

In classification you have a collection of individuals, maybe the people we have seen before,
represented in some way. In a classification task, in addition to this collection of
individuals, somebody comes along and gives us labels for some of these individuals: maybe
this one is an F, this one is an M, and so on. As a learning algorithm you actually have no
idea what those labels mean, and you really have no idea what these things are, but somebody
attaches labels to some data points.

So what does a classification algorithm try to do in a situation like that? It builds a
predictor, and in classification the predictor takes a particularly simple, reduced form: the
whole predictor takes the form of something we call a decision boundary. A decision boundary
is an imaginary line that goes through our space, or surface, and cuts it into two parts: one
part is where our algorithm thinks the Ms live, and the other part is where the Fs live. The
algorithm tries to draw a boundary in the space of data points such that all the Fs are on one
side and all the Ms are on the other. That is the decision boundary, a fundamental concept in
classification, and we will dwell quite a bit on what decision boundaries look like
geometrically for the different learning algorithms, but that will come in a few lectures.

For now, you just have some sort of decision boundary: Ms on one side, Fs on the other. That
red line is the function; there is nothing more to it. That is your predictor, and how does it
predict? If you fall on one side of the boundary it will produce an M for you, and if you fall
on the other side it will predict an F. Maybe M is the market going up and F is the market
going down, or maybe M means the individual is male and F means the individual is female: you
have just built a predictor to detect gender based on however you represent the individual.

The important thing to keep in mind is that this decision boundary only looks that way because
of the labels that came with the data. We could take exactly the same data set, exactly the
same set of individuals, and put on different labels, maybe a couple of yeses and a no, and
what you hope is that your learning algorithm will produce a different predictor, a different
decision boundary. Maybe this decision boundary reflects whether you are going to loan money
to that particular individual or not: you had some examples of people who paid up and an
example of a person who did not pay up, and that is how you decide to draw the boundary.
Again, the predictor is just the decision boundary, nothing else; the prediction is which side
of the boundary you fall on. So that is classification.

Notable Algorithms
Notable classification algorithms encompass the following.

Base Classifiers

1. Neural Networks
2. Deep Learning
3. Decision Tree based Methods
4. Naive Bayes and Bayesian Belief Networks
5. Support Vector Machines
6. Rule-based Methods
7. Nearest-neighbor

Ensemble Classifiers

1. Boosting
2. Bagging
3. Random Forests

Consider an example of an apartment: the number of bedrooms, bathrooms, and the floor of an
apartment determines its price. Which is the dependent variable in this example?

Price

Which step of classification contributes to the construction of the learning model?

Learning Step

Consider the same apartment example: the number of bedrooms, bathrooms, and the floor of
an apartment determines its price.

Which is/are the predictor variable(s) in this example?

All of them (bedrooms, bathrooms, and floor)


**********************************************

Which term portrays the process of discovering small pieces from a large volume of raw
material?

Mining

Classification is the problem of identifying a category to which a new observation belongs,
based on a training set of data containing observations whose categories are already known.

The response variable is also known as the dependent variable.


Preface

Clusters are everywhere! From atoms to galaxies.

Atoms cluster together to form compounds.

Galaxies love to cluster together!

Marketing companies love to cluster customers!

It is time to admire the beauty of the third color of the Data Mining rainbow!

Clustering

Clustering

Clustering is the task of grouping a set of objects such that objects in the same cluster are
more similar to each other than to objects in other clusters.

 Distance measure plays a significant role in clustering.
 It is an unsupervised learning method.
 Common distance measures are listed below.

Numeric Datasets

- Manhattan distance
- Minkowski distance
- Hamming distance

Non-Numeric Datasets

- Jaccard index
- Cosine Similarity
- Dice Coefficient
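Most of the measures listed above are available in SciPy; here is a small sketch with made-up vectors.

from scipy.spatial import distance

a, b = [1, 2, 3], [4, 0, 3]  # hypothetical numeric points
print("Manhattan:", distance.cityblock(a, b))
print("Minkowski (p=3):", distance.minkowski(a, b, p=3))

u, v = [1, 0, 1, 1], [1, 1, 0, 1]  # hypothetical binary vectors
print("Hamming:", distance.hamming(u, v))  # fraction of disagreeing positions
print("Jaccard:", distance.jaccard(u, v))
print("Cosine distance:", distance.cosine(u, v))  # 1 minus cosine similarity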

An Ozone Example
Oxygen atoms cluster together to form the ozone molecule.


A Peek into Applications

Recommendation Engines

Recommendation Engines cluster potential customers on the basis of their behavior (past
search patterns), then filter the data and recommend relevant results to the users.

Example: Bing, Google

Anomaly Detection

Anomaly Detection is the identification of observations that do not conform to the expected
pattern of clusters or groups. It is also known as Outlier Detection.

Example: Finding fraudulent credit card transactions

A Peek into Applications

Marketing

 Marketers employ a plethora of analysis techniques to appeal to new customers and existing
potential customers.
 Customers are clustered into segments based on their demographic information, age,
purchase behavior, and so on.
 Based on the clustered segments, marketers tailor their advertising strategies to the
various customer subsets.
Delve More into Algorithms

This video explores the types of clustering algorithms.

Transcript:

Hello, I'm Luis Serrano, and this video is about clustering. We are going to learn two very
important algorithms: k-means clustering and hierarchical clustering. Clustering is a type of
unsupervised learning, and it basically consists of grouping data: if your data looks like it
is all over the place, the algorithm will say, okay, you have got a group here, a group here,
a group there, and so on.

Let's start with an application in marketing, in particular customer segmentation. The
situation is the following: we have an app and we want to market it. We have looked at our
budget and we can make three marketing strategies, so the goal is to look at the potential
customer base and split it into three well-defined groups. Looking at the customer base, we
have two types of information: age in years (demographic) and engagement with a certain page
in days per week (behavioral), a number from 0 to 7. Our potential customer base is eight
people, each with an age and an engagement. By eyeballing the list we might guess that two
people with similar ages and similar engagements belong in the same group, but there has to be
something more mechanical that a computer can do automatically. One of the first things to do
with data is to plot it: put age on the horizontal axis and engagement on the vertical axis,
and now three groups become clear. The first is people around the age of 20 with low
engagement (two to four days a week), the second is people in their late 30s and early 40s
with high engagement, and the third is people around their 50s with very low engagement.
Those are our three marketing strategies. That is pretty much what clustering is: if the data
is a bunch of points like this, the algorithm says, I don't know much about your data, but I
can tell you it is split into these groups. What we learn in this video is how the computer
identifies these groups, because for a human this small case is easy, but with many points and
many columns, or many dimensions, it is not.

Let's start with k-means clustering. When I imagine points in the plane, I imagine places in a
city, and I imagine placing pizza parlors. Say we are the owners of a pizza business and we
want to put three pizza parlors in the city so that they serve our clientele in the best
possible way, and we know where the clientele lives. A human can eyeball three good locations,
but a computer has a harder time, so the computer does what most things in machine learning
do: start at a random spot and get better and better. First it places three pizza parlors at
random locations. Then we apply a series of slightly obvious logical statements that, put
together, get us to a better place. The first: everyone should go to the pizza parlor closest
to them, so we assign every house to its nearest parlor. The second: if all the red houses go
to the red pizza parlor, it would make sense to put that parlor at the center of those houses,
so we move each parlor to the center of the houses it serves. Now apply the first statement
again: with the parlors moved, some houses are now closer to a different parlor, so the
assignments change. Then apply the second statement again and move each parlor to the center
of the houses it now serves. We keep alternating these two steps, getting better and better,
until nothing changes, and we are done. A computer can do this easily, because it can find the
center of a bunch of points by averaging their coordinates, and it can determine which center
a point is closest to by applying the Pythagorean theorem (the distance formula) and comparing
numbers. We managed to think like a computer and not like a human, which is basically the main
idea in machine learning. That is the k-means clustering algorithm.

You may have noticed that one decision seemed to be taken by a human: we decided that there
were three clusters. So how do we know how many clusters to pick? There are a few methods,
but I am going to show you the elbow method, which basically says: try a bunch of numbers and
then be a little smart about picking the best one. Run the algorithm with one cluster, with
two, with three, and so on up to, say, six. (Note that the algorithm actually depends on where
the initial points start; sometimes it gives different answers.) We need a measure of how good
a clustering is, and the following makes sense: the diameter of a clustering, the largest
possible distance between two points of the same color, which roughly tells us how big each
group is. Compute the diameter for each number of clusters, then graph the number of clusters
on the horizontal axis against the diameter on the vertical axis and join the points. This is
where a human can intervene: look at this graph and pick the "elbow", the point where the
curve stops dropping sharply (there are also automatic methods, but at some point in a machine
learning pipeline it is good to have a consistency check, since you may have an idea of the
maximum or minimum number of clusters you want). Another important point: even if the data has
many columns and the points live in very high dimensions, the elbow graph is always
two-dimensional, so it stays easy to read. That is how we decided that three clusters are
best, and that is the k-means clustering algorithm in a nutshell.

Now let's go to our second method, hierarchical clustering, on a similar problem. Think of the
two closest points in the data set: it is sensible to guess that they belong to the same
group, so join them. Then take the next two closest points and join them too. Keep going in
this direction: when the two closest points belong to existing groups, join the groups, so a
group of two and a group of three become a group of five. At some point the next pair to join
is just too far apart, so if we have a measure of how much is too far, we stop there, and the
groups we have are our clusters. That is hierarchical clustering; it is pretty simple.

Again there seems to be a human decision here: why did we decide on that distance, or on that
number of clusters? An educated way to make this decision is to build something called a
dendrogram. Put the points in a row, say one up to eight, and graph distance on the vertical
axis. Join the two closest points, say four and five, with a little arc whose height is the
distance between them. Then join the next closest pair, say one and two, the same way. Keep
going: when six and eight are the next pair, we join the group containing six and seven with
eight; when groups merge, we connect their arcs. Notice that the dendrogram grows upward,
because the joining distances increase: every new joint is higher than the previous one. The
dendrogram holds a lot of information about the data set, and now deciding where to cut is
easy. Cutting at one height might give two clusters, say {1, 2, 3, 4, 5} and {6, 7, 8};
cutting lower might give four clusters, say {1, 2}, {3}, {4, 5}, and {6, 7, 8}. The decision
is based either on how much distance is too far or on how many clusters we want to obtain.
These decisions are taken by a human, but think about it again: even with billions of points
living in a thousand-dimensional space, the dendrogram is still a two-dimensional figure, and
we can easily make decisions on it. A combination of a computer algorithm and some smart human
decisions is what gives us the best clustering. That is hierarchical clustering in a nutshell.

Clustering has some very interesting applications. In genetics and evolutionary biology, the
genome carries a lot of information about a species, and if you manage to cluster genomes you
get to understand a lot about species and how they evolved into what they are now. Recommender
systems use a lot of clustering: the way you may have gotten this video recommended involved
methods that include clustering users, grouping them into similar users. And that brings us to
social networks, another place where clustering is used a lot, in an example very similar to
the one we did: social networks use these methods to group users based on demographics and
behavior, and then target information to them or suggest friends who are similar to you.
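A minimal sketch of k-means and the elbow idea with scikit-learn; the customer data (age, engagement) is invented to mirror the video's example, and inertia (within-cluster sum of squares) stands in for the video's "diameter" measure.

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customers: [age, engagement in days/week]
customers = np.array([
    [19, 2], [21, 3], [20, 4],   # around 20, low engagement
    [38, 6], [41, 7], [39, 6],   # late 30s/early 40s, high engagement
    [50, 1], [52, 0],            # around 50, very low engagement
])

# Elbow method: fit k-means for several k and watch the within-cluster
# sum of squares (inertia) drop; the "elbow" suggests a good k
for k in range(1, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(customers)
    print(k, round(km.inertia_, 1))

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(customers)
print("cluster labels:", labels)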
To Know More!

Explore this course to know more about Clustering!

1. Clustering - Ensemble

Prelude

Probably, we have all felt at some point that IF statements are the easiest part of
programming!

Imagine those IF statements doing magic in increasing the sales of supermarkets. Yes!
Association Rule Mining does that.

Let us stop beating around the bush with programming paradigms. It is time to check the next
technique of data mining.

Association Rule Mining

The next technique of data mining is Association Rule Mining.

Association Rule Mining

Association Rule Mining aids in identifying the associations, correlations, and frequent
patterns in data.

The derived relationships are represented in the form of Association Rules.

Types and Formats

Rule Format

IF {Set of Items} $\Rightarrow$ THEN {Set of Items}

The IF part is termed the Antecedent, while the THEN part is termed the Consequent.

Antecedent and Consequent are disjoint sets.

Types of Association Rules

 Multilevel Association Rule
 Multidimensional Association Rule
 Quantitative Association Rule

An Example

The above image portrays the process of Market Basket Analysis.

Association Measures

The important measures that aid in forming Association Rules are as follows.

Support

It indicates how frequently the itemset appears in the dataset.

Confidence

It measures how often the extracted IF-THEN rule is found to be valid, i.e., how often
transactions containing the antecedent also contain the consequent.

Lift

It compares the observed confidence with the expected confidence (the confidence if the
antecedent and consequent were independent), indicating the strength of the rule.
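These three measures can be computed directly; here is a minimal sketch on a made-up set of transactions for the rule {bread} ⇒ {butter}.

# Hypothetical transactions
transactions = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"bread", "eggs"},
    {"milk", "butter"},
    {"bread", "butter", "eggs"},
]
n = len(transactions)

def support(itemset):
    # fraction of transactions that contain every item in the itemset
    return sum(itemset <= t for t in transactions) / n

antecedent, consequent = {"bread"}, {"butter"}
supp = support(antecedent | consequent)
conf = supp / support(antecedent)
lift = conf / support(consequent)
print(f"support={supp:.2f} confidence={conf:.2f} lift={lift:.2f}")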

Walmart customers who purchase Barbie dolls have a 60% likelihood of buying one of three
types of candy bars.

Source: Forbes

Notable Algorithms

Some key algorithms that generate Association Rules are:

 AIS
 SETM
 Apriori
ARM in Action with Apriori

Check out the video to see how Apriori works.

Transcript:

Hello everyone, and welcome to this session on the Apriori algorithm. Many of us have visited
retail shops such as Walmart or Target for our household needs. Say we are planning to buy a
new iPhone from Target: what we would typically do is search for the model in the mobile
section of the store, select the product, and head towards the billing counter. But in today's
world the goal of the organization is to increase revenue, and that cannot be done by pitching
just one product at a time to the customer. Hence organizations began mining data relating to
frequently bought items. Market basket analysis is one of the key techniques used by large
retailers to uncover associations between items. Examples: customers who purchase bread have
a 60% likelihood of also purchasing jam; customers who purchase laptops are more likely to
purchase laptop bags as well. Retailers try to find associations between items and products
that can be sold together, which assists in the right product placement. Typically the
analysis figures out which products are being bought together, and organizations can place
products accordingly. For example, people who buy bread also tend to buy butter, so the
marketing team at a retail store could target customers who buy bread and butter with an offer
on a third item, say eggs: if a customer buys bread and butter and sees a discount on eggs, he
will be encouraged to spend more and buy the eggs. That is what market basket analysis is all
about, and it is what this session covers: association rule mining and the Apriori algorithm.

An association rule can be thought of as an if-then relationship: if an item A is bought by
the customer, then the chance of item B being picked by the same customer, under the same
transaction ID, is found out. You need to understand that this is not causality; rather, it is
a co-occurrence pattern that comes to the fore. There are two elements to a rule. The IF part
is known as the antecedent: an item or group of items typically found in the itemset. The THEN
part is called the consequent: an item that comes along with the antecedent or group of
antecedents. So the rule A ⇒ B means that if a person buys item A, he will also, most
probably, buy item B. The bread-and-butter example is small, but imagine thousands and
thousands of items: with the right rules and the right placement of items, a data scientist
can generate a lot of insight and profit, which is why association rule mining is such a
useful technique for business.

Let's see how the algorithm works. Association rule mining is all about building rules, and we
have just seen one: if you buy A, there is a chance you might buy B. A relationship between
two items like this is known as single cardinality. But what if the customer who bought A and
B also buys C, or the customer who bought A, B, and C also buys D? The cardinality increases,
and we can have an enormous number of combinations: with ten thousand or more items, imagine
how many rules would be created for each product. That is why association rule mining has
measures, so that we do not end up creating tens of thousands of rules, and that is where the
Apriori algorithm comes in. Before we get to it, let's understand the math behind it. There
are three types of metrics that help measure the association: support, confidence, and lift.
Support is the frequency of item A, or of the combination of items A and B: basically how
often the items, or their combinations, are bought. With it we can filter out items that are
bought infrequently. Confidence tells us how often items A and B occur together, given the
number of times A occurs. This also helps with another problem: if people buy A and B together
and rarely buy C, we can rule out C at that point, since we do not need to analyze items
people barely buy. So we define a minimum support and confidence, put these values into the
algorithm, filter the data, and create the rules. But suppose even after filtering we still
have five thousand rules for every item; acting on that is practically impossible, so we need
the third metric, lift. Lift is basically the strength of a rule. Look at the denominator of
its formula: it contains the independent support values of A and B, which give the probability
of A and B co-occurring purely at random. There is a big difference between random
co-occurrence and genuine association: if the denominator of the lift is large, the
co-occurrence is mostly randomness rather than association. So lift is the final verdict on
whether a rule is worth spending time on.

Let's look at a simple example of association rule mining. Suppose we have a set of items A,
B, C, D, and E, and a set of transactions T1 to T5: T1 = {A, B, C}, T2 = {A, C, D},
T3 = {B, C, D}, T4 = {A, D, E}, and T5 = {B, C, E}. We can create association rules such as
A ⇒ D, C ⇒ A, A ⇒ C, and B ∧ C ⇒ A; the last one means that if a person buys B and C, he is
most likely to buy A as well. For each rule we can then calculate its support, confidence, and
lift values.

Now let's discuss Apriori. The Apriori algorithm uses frequent itemsets to generate the
association rules, and it is based on the concept that a subset of a frequent itemset must
itself be a frequent itemset. That raises the question: what exactly is a frequent itemset? A
frequent itemset is an itemset whose support value is greater than a specified threshold, the
minimum support. For example, if {A, B} is a frequent itemset, then {A} and {B} must also be
frequent itemsets individually.

To make things simpler, consider five transactions over the items 1 to 5: T1 = {1, 3, 4},
T2 = {2, 3, 5}, T3 = {1, 2, 3, 5}, T4 = {2, 5}, and T5 = {1, 3, 5}, with a minimum support
count of 2. The first step is to build itemsets of size 1 and calculate their support values
(support being frequency over the total number of transactions). In table C1, itemset {1} has
support count 3, since it appears in T1, T3, and T5; itemset {4} has support count 1, as it
occurs only in transaction T1, which is below the minimum support of 2, so it is eliminated.
The resulting table F1 contains the itemsets {1}, {2}, {3}, {5} with support counts 3, 3, 4,
and 4. The next step is to create itemsets of size 2 from all combinations of F1 and calculate
their supports, giving table C2: {1,2}, {1,3}, {1,5}, {2,3}, {2,5}, {3,5}. Here {1,2} has
support count 1, again below the threshold, so it is discarded, and F2 contains {1,3}, {1,5},
{2,3}, {2,5}, and {3,5}. We move on to itemsets of size 3, built from F2, and before
calculating supports we perform pruning: after the candidate combinations in C3 are made, we
check each for a subset whose support is below the minimum, since every subset of a frequent
itemset must itself be frequent. The candidate {1,2,3} contains the subset {1,2}, which was
discarded in the previous step, so the whole itemset is discarded; the same goes for {1,2,5}.
That leaves only {1,3,5} and {2,3,5}, each with support count 2. If we build table C4, the
only candidate is {1,2,3,5}, which appears only once in the transaction table, so its support
of 1 is below 2, and we stop and return to the previous itemsets. The frequent itemsets are
therefore {1,3,5} and {2,3,5}.

Now assume a minimum confidence of 60%, and generate all non-empty subsets of each frequent
itemset. For I = {1,3,5} the subsets are {1,3}, {1,5}, {3,5}, {1}, {3}, {5}; similarly for
{2,3,5} they are {2,3}, {2,5}, {3,5}, {2}, {3}, {5}. The rule generation states that for every
subset S of I, the rule S ⇒ (I − S) is output if support(I) / support(S) is greater than or
equal to the minimum confidence. Applying this to {1,3,5}: Rule 1 is {1,3} ⇒ {5}, with
confidence = support({1,3,5}) / support({1,3}) = 2/3, which is about 66% and greater than 60%,
so Rule 1 is selected. Rule 2 is {1,5} ⇒ {3}, with confidence = support({1,3,5}) /
support({1,5}) = 2/2 = 100%, so Rule 2 is selected as well. By contrast, {3} ⇒ {1,5} has
confidence = support({1,3,5}) / support({3}) = 2/4 = 50%, which is less than the 60% target,
so that rule is rejected, and the same goes for the analogous rule from the other itemset. One
thing to keep in mind is that although {1,3} ⇒ {5} and {3} ⇒ {1,5} look similar, they are
not: it really matters what is on the left-hand side of the arrow and what is on the
right-hand side. It is the if-then relationship.

Finally, let's see how to implement the same in Python. Create a new Python file (the video
uses a Jupyter notebook, but any IDE will do). We will use the online transactional data of a
retail store to generate association rules. First, import the pandas and mlxtend libraries
and read the file: the data is an online retail .xlsx file, and from mlxtend we import apriori
and association_rules. The columns are invoice number, stock code, description, quantity,
invoice date, unit price, customer ID, and country. The next step is data cleanup, which
includes removing spaces from some of the descriptions, dropping the rows that do not have
invoice numbers, and removing the credit transactions, which are of no use to us; the output
has roughly 532,000 rows and 8 columns. After the cleanup, we consolidate the items into one
transaction per row with one column per product, and for the sake of keeping the dataset small
we only look at the sales for France. There are a lot of zeros in the data, and we also need
to make sure any positive values are converted to 1 and anything less than zero is set to 0
(one-hot encoding). Now that the data is structured properly, we generate frequent itemsets
with a support of at least 7%, a number chosen so that the result is manageable, and generate
the rules with their corresponding support, confidence, and lift. We can then add further
constraints on the rules, such as lift greater than 6 and confidence greater than 0.8. The
output shows the left-hand and right-hand sides of each association rule (the antecedents and
the consequents), together with the support, confidence, lift, leverage, and conviction. That
is how you create association rules using the Apriori algorithm, which helps a lot in the
marketing business and runs on the principle of market basket analysis, exactly what big
companies like Walmart, Reliance, Target, and IKEA do.
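A condensed sketch of the workflow described above, using pandas and mlxtend. The file name, column names, and thresholds follow the video's description and are assumptions, not fixed requirements.

import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Load the transactional data (hypothetical file, as in the video)
df = pd.read_excel("online_retail.xlsx")
df["Description"] = df["Description"].str.strip()
df = df.dropna(subset=["InvoiceNo"])
df["InvoiceNo"] = df["InvoiceNo"].astype(str)
df = df[~df["InvoiceNo"].str.contains("C")]  # drop credit transactions

# One transaction per row, one column per product, restricted to France
basket = (df[df["Country"] == "France"]
          .groupby(["InvoiceNo", "Description"])["Quantity"].sum()
          .unstack().fillna(0))
basket = (basket > 0).astype(int)  # positive quantities -> 1, rest -> 0

# Frequent itemsets with at least 7% support, then the rules
frequent = apriori(basket, min_support=0.07, use_colnames=True)
rules = association_rules(frequent, metric="lift", min_threshold=6)
print(rules[rules["confidence"] > 0.8])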

Applications
Delve More

Explore more in the course Association Rule Mining.

Prelude

There is an evolving danger to employees, customers, and organizations in the form of
intrusions, cyberattacks, and fraudulent transactions.

It has not been long since the Facebook-Cambridge Analytica data scandal happened.

The Outlier Detection technique can help minimize such attacks. Let us delve into the next
technique of data mining.

Outlier Detection

Outlier

Jiawei Han defines Outlier as

A data object that deviates significantly from the normal objects as if it were generated by a
different mechanism.

Types of Outlier

 Global Outlier
It significantly deviates from the entire dataset.

 Contextual Outlier

It significantly deviates based on the context selected.

 Collective Outlier

A subset of data objects collectively deviates from the entire dataset.

Diverse Methods

Outlier Detection Methods:

 Statistical approach
 Clustering-based approach
 Classification approach
 Proximity-based approach
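As an illustration of the statistical approach listed above, here is a minimal sketch that flags global outliers by z-score; the transaction amounts and the conventional threshold of 3 are assumptions for the example.

import numpy as np

# Hypothetical transaction amounts with one suspicious value
amounts = np.array([42, 38, 45, 40, 39, 41, 44, 37,
                    43, 40, 38, 42, 41, 39, 40, 500])

# A point is flagged if it lies more than 3 standard deviations from the mean
z_scores = (amounts - amounts.mean()) / amounts.std()
outliers = amounts[np.abs(z_scores) > 3]
print("flagged as outliers:", outliers)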

Applications

Prologue

According to the American mathematician John W. Tukey,

"The greatest value of a picture is when it forces us to notice what we never expected to see."

Images speak louder than words, and data mining embraces this idea in the form of Data
Visualization.

Let us check the next technique of data mining.

Data Visualization

Data Visualization

It is a technique that aids in representing information in a visual context, helping people
understand the significance of data.

Patterns and trends that would go unnoticed in text-based information can be easily spotted
with data visualization.

Why Data Visualization?

Read the following statement.

It was predicted that the price of pizza will increase by 25% next year, while the price of
burgers will decrease by 20% next year.

Confusing, right?

Now see the image.

The price charts state the fact clearly.
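A minimal matplotlib sketch of the price-change chart described above; the two percentages come from the statement, and everything else (colors, labels) is a presentation choice.

import matplotlib.pyplot as plt

items = ["Pizza", "Burger"]
change = [25, -20]  # predicted price change next year (%)

# Red for increases, green for decreases
colors = ["tab:red" if c > 0 else "tab:green" for c in change]
plt.bar(items, change, color=colors)
plt.axhline(0, color="black", linewidth=0.8)
plt.ylabel("Predicted price change next year (%)")
plt.title("Pizza up 25%, burger down 20%")
plt.show()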

A Glance

Check out this video to have an idea of data visualization.

Transcript:


Welcome to the Business Travel Academy. What is data visualization? Data is a hot topic:
people buy and wear t-shirts saying "data nerd" or "data is the new bacon" (true story, you
can check it out online). With the digitalization era, data went from scarce, expensive, and
difficult to find and collect, to abundant and cheap but very difficult to process and
understand. That is when the concept of big data emerged: incredible amounts of information so
vast that they were challenging to capture, store, understand, and analyze with traditional
software. However, all of this material is only as good as what we can make out of it as
individuals and businesses; terabytes of data sitting unused in a data center are a burden,
but correctly processed they can become digital gold. Big data is often combined with machine
learning to create predictive analytics and other analytics processes that bring the value of
the information to light. Still, if you do not own a PhD in data science, the raw details can
remain obscure. That is where data visualization comes in.

Data visualization is the process of taking raw data and transforming it into graphs, charts,
images, and even videos that explain the numbers and allow us to gain insights from them. It
changes the way we make sense of information, letting us create value out of it, discover new
patterns, and spot trends. Think about a simple example: how do you create a story to tell
your boss out of thousands of rows of data in an Excel spreadsheet? An easy way is to create a
chart, like a pie or a bar chart, of that same data. Now you have a visual representation and
can start analyzing it and integrating it into your business, giving meaning and purpose to
the original raw data.

In the business travel industry, data visualization truly empowers travel managers and
reporting users by providing clear and actionable insights into their programs. A data-driven
program brings value to all stakeholders, from the finance controller to the security manager,
the HR manager, and travelers themselves. It allows for better control and prediction of
travel spend and increases traveler security and satisfaction. Imagine if you could get a
visual representation of your past, current, and future travel spend to show to your CFO, or
if you could visualize the impact of each change you make to your travel policies or
negotiated rates. What about seeing where your travelers are frequently disrupted and what
the better alternatives would be? The possibilities are endless, and data visualization can
keep you ahead. This was a crash course on what data visualization is.

Did You Know?

Adam McCann created a data visualization of every single song recorded by Bruce
Springsteen. Using data from Spotify and other written sources, he was able to plot all
the aspects of the songs.

For more information, visit Adam McCann.

Tools
The following tools help in visualizing your data:

 Excel
 Tableau
 QlikView
 FusionCharts

Learn More

 Data Visualization
 Know the best Data Visualization tools
 Tableau
 QlikView
 Visualization with Matplotlib

Delve into Statistics

Statistics is the art of learning from data.

Quoting the famous mathematician Shakuntala Devi:

Numbers have life; they are not just symbols on paper.

Statistics is a powerful tool that works on numbers to provide us with inferences.

Let us dive into this technique to explore more!

Statistics

It is time to see the next technique of Data Mining!

Statistics can be defined as the science of collecting, analyzing, and interpreting data.

Two broad categories of statistics that help in data analysis are:

 Descriptive Statistics
 Inferential Statistics

Statistics - A Glance

Watch this video to have a basic idea on Statistics.

This is an auto-generated transcript

Statistics is often hailed as one of the most useful areas of mathematics. It helps us to make
educated guesses of the unknown, and find useful information in an ocean of data. But despite its
usefulness, many people struggle with statistics. What is it? How does it work? For many, statistics can feel like an unending collection of rules and formulas. Even worse, if it's misapplied, statistics can lead to false conclusions, causing some people to develop a mistrust of the subject altogether. So,
what is it all about then? Well, Statistics is the area of mathematics which primarily deals with
collecting and analyzing data. Some great examples might be keeping track of your favorite team as
they rack up wins and losses, or perhaps using data to try and predict the outcome of an election. To
be more specific statistics can be broken down into the areas of: Sampling Methods, Descriptive
Statistics, and Inferential Statistics. In Sampling Methods we cover how we collect our data. This is
arguably the most important area of all of statistics. It's important because it ensures that we collect
our data in just the right way so that we can make valid conclusions later on. It also helps us know
when we have just enough information for further analysis. Remember that even if you use the best
formulas and techniques, a study can quickly fall apart if you do not collect your data in just the right
way. Descriptive Statistics is about summarizing or highlighting important aspects of our data. When
people think of statistics they often think about this area. The tools here are about describing the
information you have collected; tasks such as using a graph or calculating a simple average fall
under the area of descriptive statistics. Inferential Statistics is about making predictions from our
data. Here the goal is to take just a small bit of information, analyze it carefully, and then see what
conclusions we can infer about the bigger picture. It is this part of statistics that can seem the most
mysterious, but in reality, is one of the most powerful and it allows us to find even more information
from the data we've already collected. Now when you take Statistics you may find that there are
many other topics that don't seem to fall into these three areas. That's because Statistics is not
complete without exploring why the formulas and methods actually work, or more importantly
when they should be used. For this reason, other topics like probability form a healthy part of any
statistics course. Think of these other topics as the framework and foundation that allows us to build
so many other useful tools. Lastly, when taking statistics, you may find that you need a little help. I'm
not talking about your friend who happens to be great at math. I'm talking about using a calculator
or computer. Although at some level many of the formulas can be very simple, using them for even a
handful of data points can quickly become tedious and cumbersome. The good news is technology
can really help with the required calculations. Remember even though you will certainly become
familiar with using technology, you will still need to know when a tool needs to be applied, or what
method is the most appropriate. Hopefully that sheds a little bit of light on the subject of Statistics
and motivates you to learn more about it. In this video we covered what statistics is, and its three
main areas of focus. We also pointed out that technology can be your ally, but in the end, you have
to make the decisions. Thanks for watching! Did you enjoy watching this video? Don't forget to hit
the like button and then subscribe to my channel. As always you can find all of my videos free of
charge at: MySecretMathTutor.com Again, thanks for watching!

Descriptive Statistics

Descriptive Statistics

 It provides summary statistics of data.
 It helps in quantitatively interpreting the features of data.
 It is used on sample datasets (not population datasets).

Measures

 Measures of Central Tendency

Focus on the average or central point of a dataset.

o Mean
o Median
o Mode
 Measures of Spread

Focus on the dispersion of data from the central point.

o Range
o Standard Deviation
o Variance
o Interquartile Range
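
A minimal Python sketch of these measures on a small hypothetical sample (the numbers are illustrative only):

```python
import numpy as np
from statistics import mode

# Hypothetical sample dataset
sample = np.array([2, 4, 4, 4, 5, 5, 7, 9])

# Measures of central tendency
print("mean:  ", np.mean(sample))    # 5.0
print("median:", np.median(sample))  # 4.5
print("mode:  ", mode(sample))       # 4

# Measures of spread
print("range: ", sample.max() - sample.min())
print("std:   ", np.std(sample, ddof=1))
print("var:   ", np.var(sample, ddof=1))
q1, q3 = np.percentile(sample, [25, 75])
print("IQR:   ", q3 - q1)
```

Note the `ddof=1` argument: since descriptive statistics here is applied to a sample rather than the full population, the sample variance divides by n - 1.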

Inferential Statistics

Inferential Statistics
 It makes inferences about the properties of a population from a sample.
 It makes propositions that extend beyond the observed data.

An Example

This video illustrates the difference between descriptive and inferential statistics with an
example.

This is an auto-generated transcript


Numbers are everywhere, especially in business, where your company's health and future depend on your ability to make good decisions. You can use statistical methods to analyze your data to support that work. Statistics is the practice of collecting, analyzing, and interpreting data. There are two major branches of statistics: descriptive statistics and inferential statistics. You use descriptive statistics when you know all of the values in a data set. For example, if you examine the olives on a tree, you can count the number of olives, in this case 4,000. Then you can find the number of green olives, 3,000, and the number of black olives, which are 1,000, and then find the proportion: in this case the green olives are 75% of the total and the black olives are 25% of the total. For inferential statistics, you take information from part of your population and extrapolate it to make guesses about your entire population. So, for example, let's say that you have a grove of 100 olive trees and you take a sample from five of those trees. You have 21,000 olives, of which 14,070 are green and 6,930 are black, and you see that the percentages are 67 percent green and 33 percent black. Now, that's true for those five trees, but it might not be, and in fact almost certainly will not be, exactly true for the entire grove of 100 trees. So what you need to ask is: how close am I to the true value, and how confident am I in that estimate? When you ask how close, you're asking how close your measured value is to the true value. So, for example, black olives make up 33 percent of your sample. Well, the total might be 33 percent, but most likely it varies a little bit; for example, it might be 33 percent plus or minus two percent, so the total percentage of black olives might be between 31 percent and 35 percent. Asking how close gives you a range. Next you ask: how confident am I in that range? You're asking what percentage of random samples taken from this population would fall within the range of, say, 31 percent to 35 percent. The range, stated together with how confident you are in it, is what's called a confidence interval. Now I'm going to switch from olives and use a more familiar example, one that deals with political polling. If we say that Governor A has an approval rating of 60%, with a margin of error on the survey of 3%, how confident are we of that? Most of the time, the confidence level that you will find in statistics is 95%. In other words, 95% of the samples that we take from the entire population should produce an approval rating of between 57 and 63 percent. The other 5% would generate values outside of that range, for example 55%, 54%, or, on the high side, 70% or 75%. But we know from inferential statistics that 95% of the time the approval rating for a survey will be within plus or minus 3% of the actual true approval rating. So the best way to phrase that statement is: we are 95% certain that Governor A has an approval rating of 60%, plus or minus 3%. When you read an article online or pick up a newspaper, try to find a story that doesn't refer to numbers in some way; you probably won't be able to do it. Numbers are everywhere, but once you master the techniques in this course, you'll have an easier time making sense of your own data and interpreting data that others give you.
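
The transcript's two questions, how close and how confident, can be made concrete with a normal-approximation confidence interval for a sample proportion. This sketch plugs in the olive counts from the video at the usual 95% level; note that the plus-or-minus 2% quoted in the video was only an illustrative figure, and a real analysis would also account for the olives being clustered on just five trees.

```python
import math

# Sample from five trees: 6,930 black olives out of 21,000 (from the video)
n = 21_000
p_hat = 6_930 / n  # observed proportion of black olives (0.33)

# 95% CI via the normal approximation:
#   p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n), with z = 1.96
z = 1.96
margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
print(f"point estimate: {p_hat:.2%}")
print(f"95% CI: [{p_hat - margin:.2%}, {p_hat + margin:.2%}]")
```

With such a large count, the margin comes out well under 1%; a smaller or less independent sample would widen the interval considerably.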

End of Rainbow

We just cherished all the colors of the Data Mining rainbow!

Hiding within those piles of data is the knowledge that could change the life of a patient, or
change the world.

Mine the knowledge with these techniques and savor the colorful insights. Data isn’t the new
oil; it’s the soil in which findings grow.
Continue this hued journey to know more colors of Data Mining!

The __________ association measure compares the confidence with the expected confidence.

Lift

Inferential statistics is used in __________ datasets.


Population

Derived relationships in Association Rule Mining are represented in the form of __________.

Rules

The __________ stage of the Data Science process helps in converting raw data to a machine-readable format.

Exploratory Data Analysis

Descriptive statistics is used on sample datasets.

True

Identify the algorithm that works based on the concept of clustering.


Choose the correct answer from below list
(1)K-Means
(2)SVM
(3)Decision Tree

Answer:-(1)K-Means

Which among the following is/are (an) outlier detection method(s)?


Choose the correct answer from below list
(1)All the options
(2)None of the options
(3)Proximity-based approach
(4)Clustering-based approach
(5)Classification approach
(6)Statistical approach

Answer:-(1)All the options

Distance measure(s) used in the clustering process of a numeric dataset is/are __________.


(1)Minkowski
(2)Hamming
(3)All the options
(4)Manhattan Distance

Answer:-(3)All the options


__________ aids in identifying associations, correlations, and frequent patterns in data.
Choose the correct answer from below list
(1)Association Rule Mining
(2)Classification
(3)Clustering

Answer:-(1)Association Rule Mining

Jaccard Index distance measure is used on __________.


Choose the correct answer from below list
(1)Numeric dataset
(2)Non-numeric dataset

Answer:-(2)Non-numeric dataset

__________ parameter of regression helps in identifying the direction of relationship between variables.
Choose the correct answer from below list
(1)Measure of Discrepancy
(2)Regression Coefficient

Answer:-(2)Regression Coefficient

Clustering process works on _________ measure.


Choose the correct answer from below list
(1)Lift
(2)Support
(3)Confidence
(4)Probability
(5)Distance

Answer:-(5)Distance

The science of collecting, interpreting, and analyzing data is known as __________.


Choose the correct answer from below list
(1)Statistics
(2)Probability
(3)Data Collection
(4)Data Description

Answer:-(1)Statistics

__________ statistics provides the summary statistics of the data.


Choose the correct answer from below list
(1)Descriptive
(2)Inferential

Answer:-(1)Descriptive

Which of the following helps in measuring the dispersion range of the data?
Choose the correct answer from below list
(1)Variance
(2)None of the options
(3)All the options
(4)Standard Deviation
(5)Range
(6)Interquartile range

Answer:-(3)All the options

The __________ stage of the data science process helps in exploring and determining patterns from the data.

Exploratory Data Analysis

In Association Rules, the Antecedent and Consequent form a disjoint set.

True
Classification predicts the value of __________ variable.
Choose the correct answer from below list
(1)Continuous
(2)Categorical

Answer:-(2)Categorical

Identify the Unsupervised Learning method.

Clustering
Classification is a __________ task.
Choose the correct answer from below list
(1)Data Analysis
(2)Data Transformation
(3)Data Integration
(4)Data Cleaning

Answer:-(1)Data Analysis
Which among the following is/are (an) Ensemble Classifier?
Choose the correct answer from below list
(1)All the options
(2)Random Forest
(3)Boosting
(4)Bagging
Answer:-(1)All the options

__________ stage of data science process helps in converting raw data into a machine-readable format.
Choose the correct answer from below list
(1)Data Description
(2)Data Cleaning
(3)Exploratory Data Analysis
(4)Data Gathering

Answer:-(3)Exploratory Data Analysis

