
Marks:5

1. Explain Apriori algorithm with proper example. 5

Answer:

2. Describe Association rule with proper examples. Explain Outlier analysis.


3+2=5
ans:

Association Rule Learning


Association rule learning is an unsupervised learning technique that checks for the dependency of one data item on another and maps them accordingly so that the relationship can be exploited profitably. It tries to find interesting relations or associations among the variables of a dataset, using different rules to discover these relations between variables in the database.
Association rule learning can be divided into three types of algorithms:
• Apriori
• Eclat
• F-P Growth Algorithm
We will understand these algorithms in later chapters.

How does Association Rule Learning work?


Association rule learning works on the concept of an If-Then statement, such as "if A then B".

Here the "if" element is called the antecedent, and the "then" element is called the consequent. Relationships in which we find an association between two single items are known as single cardinality. Rule mining is all about creating rules, and as the number of items increases, the cardinality increases accordingly. So, to measure the associations between thousands of data items, there are several metrics. These metrics are given below:
• Support
• Confidence
• Lift
Let's understand each of them:

Support
Support is the frequency of an itemset, i.e., how frequently it appears in the dataset. It is defined as the fraction of the transactions T that contain the itemset X. For an itemset X and a set of transactions T, it can be written as:

Support(X) = (Number of transactions containing X) / (Total number of transactions T)

Confidence
Confidence indicates how often the rule has been found to be true, i.e., how often the items X and Y occur together in the dataset given that X has already occurred. It is the ratio of the number of transactions that contain both X and Y to the number of transactions that contain X:

Confidence(X → Y) = Support(X ∪ Y) / Support(X)

Lift
Lift is the strength of a rule, defined by the formula below:

Lift(X → Y) = Support(X ∪ Y) / (Support(X) × Support(Y))

It is the ratio of the observed support to the support expected if X and Y were independent of each other. It has three possible ranges of values: Lift = 1 means X and Y are independent, Lift > 1 means they are positively correlated, and Lift < 1 means they are negatively correlated.
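As a minimal sketch of how these three metrics can be computed directly from raw transactions (the item names and the helper function below are illustrative, not from the original text):

def support(itemset, transactions):
    # Fraction of transactions that contain every item in the itemset.
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
]

X, Y = {"bread"}, {"milk"}
sup_x  = support(X, transactions)
sup_y  = support(Y, transactions)
sup_xy = support(X | Y, transactions)

confidence = sup_xy / sup_x            # P(Y | X)
lift       = sup_xy / (sup_x * sup_y)  # observed support / expected support

print(f"support={sup_xy:.2f} confidence={confidence:.2f} lift={lift:.2f}")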
(ii) Outliers refer to data points that lie outside of what is expected. The important question about outliers is what you do with them.

Difference between outliers and noise

Noise is any unwanted error or random variance in a measured variable. Before finding the outliers present in a data set, it is recommended to first remove the noise.

Types of Outliers

Outliers are divided into three different types

• Global or point outliers
• Collective outliers
• Contextual or conditional outliers
Global Outliers

Global outliers are also called point outliers and are taken as the simplest form of outliers. When a data point deviates from all the rest of the data points in a given data set, it is known as a global outlier. In most cases, outlier detection procedures are aimed at finding global outliers; in a typical scatter plot, a single isolated point far from every other point is the global outlier.

Collective Outliers

In a given data set, when a group of data points deviates from the rest of the data, it is called a collective outlier. The individual data objects in the group may not be outliers on their own, but taken together they behave as outliers. Identifying these types of outliers requires background information about the relationship between the behaviours of the different data objects. For example, in an intrusion detection system, a single denial-of-service (DoS) packet sent from one system to another is taken as normal behaviour; if this happens from various computers simultaneously, it is considered abnormal behaviour, and as a whole the packets form a collective outlier.

Contextual Outliers

As the name suggests, "Contextual" means this outlier introduced within a context. For
example, in the speech recognition technique, the single background noise. Contextual
outliers are also known as Conditional outliers. These types of outliers happen if a data
object deviates from the other data points because of any specific condition in a given data
set. As we know, there are two types of attributes of objects of data: contextual attributes
and behavioral attributes. Contextual outlier analysis enables the users to examine outliers
in different contexts and conditions, which can be useful in various applications. For
example, A temperature reading of 45 degrees Celsius may behave as an outlier in a rainy
season. Still, it will behave like a normal data point in the context of a summer season. In the
given diagram, a green dot representing the low-temperature value in June is a contextual
outlier since the same value in December is not an outlier.
Outlier Analysis

Outliers are often discarded when data mining is applied, but outlier analysis is still used in many applications such as fraud detection and medicine. This is because events that occur rarely can carry much more significant information than events that occur regularly.

Other applications where outlier detection plays a vital role are given below:

• Any unusual response that occurs due to medical treatment can be analysed through outlier analysis in data mining.
• Fraud detection in the telecom industry.
• In market analysis, outlier analysis enables marketers to identify unusual customer behaviours.
• In the field of medical analysis.
• Fraud detection in banking and finance, such as credit cards, the insurance sector, etc.

The process in which the behaviour of the outliers is identified in a data set is called outlier analysis. It is also known as "outlier mining", and the process is a significant task of data mining.

3. Out of 4000 transactions, 400 contain Biscuits and 600 contain Chocolate, and 200 of these transactions include both Biscuits and Chocolate. Using this data, find the support, confidence, and lift. What is data aggregation? 3+2=5

ans:

Support
Support refers to the default popularity of any product. You find the support by dividing the number of transactions containing that product by the total number of transactions. Hence, we get

Support (Biscuits) = (Transactions relating biscuits) / (Total transactions)

= 400/4000 = 10 percent.
Confidence

Confidence refers to the likelihood that the customers who bought biscuits also bought chocolates. So, you divide the number of transactions that contain both biscuits and chocolates by the number of transactions that contain biscuits to get the confidence. Hence,

Confidence = (Transactions containing both Biscuits and Chocolate) / (Transactions containing Biscuits)

= 200/400

= 50 percent.

It means that 50 percent of customers who bought biscuits also bought chocolates.

Lift

Continuing the above example, lift refers to the increase in the ratio of the sale of chocolates when biscuits are sold. Using the definition given earlier (observed support divided by the support expected under independence), the lift is

Lift = Confidence (Biscuits → Chocolates) / Support (Chocolates)

= 50 / 15 ≈ 3.33,  where Support (Chocolates) = 600/4000 = 15 percent.

It means that customers are about 3.3 times more likely to buy biscuits and chocolates together than would be expected if the two purchases were independent. If the lift value is below one, it suggests that people are unlikely to buy both items together; the larger the value, the better the combination.
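A quick arithmetic check of these figures in Python (the variable names are purely illustrative):

total      = 4000
biscuits   = 400
chocolates = 600
both       = 200

support_biscuits   = biscuits / total        # 0.10
support_chocolates = chocolates / total      # 0.15
support_both       = both / total            # 0.05

confidence = support_both / support_biscuits                          # 0.50
lift       = support_both / (support_biscuits * support_chocolates)   # ≈ 3.33

print(confidence, round(lift, 2))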

Data aggregation refers to the process of collecting information from different sources and presenting it in a summarized format so that business analysts can perform statistical analyses of business schemes. The collected information may be gathered from various data sources and summarized into a draft for data analysis. This is a major step for any business organization, because the accuracy of the insights from data analysis depends largely on the quality of the data used. It is necessary to collect quality data in large amounts to produce relevant outcomes. Data aggregation plays a vital role in finance, product, operations, and marketing strategies in any business organization. Aggregated data is kept in the data warehouse, which enables one to solve various issues and answer queries over the data sets.

4. What is clustering? Describe different types of clustering. 2+3=5

ans:
Clustering is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to each other than to data points in other groups. It is basically a collection of objects grouped on the basis of similarity and dissimilarity between them.

For example, data points that lie close together in a scatter plot can be classified into one single group; in a typical illustration we can visually distinguish three such clusters.

A cluster is a group of objects that belong to the same class; in other words, similar objects are grouped in one cluster and dissimilar objects are grouped in another cluster.

• A cluster is a subset of similar objects.
• A subset of objects such that the distance between any two objects in the cluster is less than the distance between any object in the cluster and any object not located inside it.
• A connected region of a multidimensional space with a comparatively high density of objects.

Different types of Clustering

Cluster analysis separates data into groups, usually known as clusters. If meaningful groups are the objective, then the clusters should capture the natural structure of the data. Sometimes cluster analysis is only a useful starting point for other purposes, such as data summarization. Whether for understanding or utility, cluster analysis has long played a significant role in a wide range of areas such as biology, psychology, statistics, pattern recognition, machine learning, and data mining.
What is Cluster Analysis?

Cluster analysis groups data objects based only on information found in the data that describes the objects and their relationships. The goal is that objects within a group be similar to one another and different from the objects in other groups.

Figure 1 (not reproduced here) illustrates different ways of clustering the same set of points.

In many applications, the notion of a cluster is not precisely defined. To better understand the difficulty of deciding what constitutes a cluster, the figure shows twenty points and three different ways of dividing them into clusters; the shape of the markers indicates cluster membership. The figures divide the data into two and six clusters, respectively. However, the division of each of the two larger clusters into three subclusters may simply be an artifact of the human visual system, and it may not be unreasonable to say that the points form four clusters. The example shows that the definition of a cluster is imprecise, and the best definition depends on the nature of the data and the desired results.

Cluster analysis is related to other techniques that are used to divide data objects into groups. For example, clustering can be viewed as a form of classification in that it creates a labelling of objects with class (cluster) labels; however, it derives these labels only from the data. In contrast, in supervised classification new, unlabelled objects are assigned a class label using a model developed from objects with known class labels. For this reason, cluster analysis is sometimes referred to as unsupervised classification. When the term classification is used without any qualification within data mining, it typically refers to supervised classification.

The terms segmentation and partitioning are sometimes used as synonyms for clustering, but these terms are more commonly used for techniques outside the traditional bounds of cluster analysis. For example, the term partitioning is often used in connection with techniques that divide graphs into subgraphs and that are not strongly connected to clustering. Segmentation often refers to the division of data into groups using simple techniques; for example, an image can be split into segments based on pixel intensity and colour, or people can be divided into groups based on their annual income. Nonetheless, some work in graph partitioning and market segmentation is related to cluster analysis.

Different types of Clusters

Clustering aims to discover useful groups of objects (clusters), where usefulness is defined by the goals of the data analysis. Not surprisingly, there are several notions of a cluster that prove useful in practice. In order to visually illustrate the differences between these kinds of clusters, we use two-dimensional points, but the types of clusters described here are equally valid for other sorts of data.

• Well-separated cluster

A cluster is a set of objects in which each object is closer (or more similar) to every other object in the cluster than to any object not in the cluster. Sometimes a threshold is used to specify that all the objects in a cluster must be sufficiently close or similar to one another. This definition of a cluster is satisfied only when the data contains natural clusters that are quite far from one another. The figure (not shown) gives an example of well-separated clusters consisting of groups of two-dimensional points. Well-separated clusters do not need to be spherical and can have any shape.

• Prototype-Based cluster

A cluster is a set of objects in which each object is closer (more similar) to the prototype that characterizes the cluster than to the prototype of any other cluster. For data with continuous attributes, the prototype of a cluster is usually a centroid, i.e., the average (mean) of all the points in the cluster. When a centroid is not meaningful, for example when the data has categorical attributes, the prototype is often a medoid, the most representative point of the cluster. For many sorts of data, the prototype can be regarded as the most central point, and in such cases we commonly refer to prototype-based clusters as centre-based clusters. As one might expect, such clusters tend to be spherical. The figure (not shown) gives an example of centre-based clusters.
• Graph-Based cluster

If the data is represented as a graph, where the nodes are objects and the links represent connections among objects, then a cluster can be defined as a connected component: a group of objects that are connected to one another but that have no connection to objects outside the group. An important example of graph-based clusters is contiguity-based clusters, where two objects are connected only if they are within a specified distance of each other. This implies that each object in a contiguity-based cluster is closer to some other object in the cluster than to any point in a different cluster. Figures (not shown) demonstrate an example of such clusters for two-dimensional points. This definition of a cluster is useful when clusters are irregular or intertwined, but it can have trouble when noise is present, since a small bridge of points can merge two distinct clusters. Other kinds of graph-based clusters are also possible. One such approach defines a cluster as a clique, i.e., a set of nodes in a graph that are completely connected to each other. Specifically, if we add connections between objects in order of their distance from one another, a cluster is formed when a set of objects forms a clique. Like prototype-based clusters, such clusters tend to be spherical.


• Density-Based Cluster

A cluster is a dense region of objects that is surrounded by a region of low density. Two spherical clusters in the example figure (not shown) are not merged, because the bridge between them fades into the noise; likewise, a curve present in the figure fades into the noise and does not form a cluster. A density-based definition of a cluster is often employed when the clusters are irregular or intertwined, and when noise and outliers are present. On the other hand, a contiguity-based definition of a cluster would not work well for such data, since the noise would tend to form bridges between clusters.

• Shared-property or Conceptual Clusters

More generally, we can define a cluster as a set of objects that share some property. The objects in a centre-based cluster share the property that they are all closest to the same centroid or medoid. However, the shared-property approach also includes new types of cluster. Consider the clusters shown in the figure (not shown): a triangular area (cluster) is adjacent to a rectangular one, and there are two intertwined circles (clusters). In both cases, a clustering algorithm would need a very specific concept of a cluster to successfully detect these clusters. The process of discovering such clusters is called conceptual clustering.

Clustering is an unsupervised machine learning technique that groups data points into clusters so that the objects within the same group are similar to one another.

Clustering helps to split data into several subsets. Each of these subsets contains data similar to each other, and these subsets are called clusters. Once the data from a customer base is divided into clusters, for example, we can make an informed decision about which customers are best suited for a given product.

What is Clustering?

Clustering is the process of grouping abstract objects into classes of similar objects.

Points to Remember
• A cluster of data objects can be treated as one group.
• While doing cluster analysis, we first partition the set of data into groups based on data similarity and then assign labels to the groups.
• The main advantage of clustering over classification is that it is adaptable to changes and helps single out useful features that distinguish different groups.

Why Clustering?
Clustering is very important as it determines the intrinsic grouping among the unlabelled data present. There are no universal criteria for a good clustering; it depends on the user and the criteria they use to satisfy their needs. For instance, we could be interested in finding representatives for homogeneous groups (data reduction), in finding "natural clusters" and describing their unknown properties ("natural" data types), in finding useful and suitable groupings ("useful" data classes), or in finding unusual data objects (outlier detection). Each algorithm must make some assumptions about what constitutes the similarity of points, and each assumption produces different, equally valid clusters.


Let's understand this with an example. Suppose we are a marketing manager, and we have a tempting new product to sell. We are sure that the product would bring enormous profit, as long as it is sold to the right people. So, how can we tell who is best suited for the product from our company's huge customer base? Clustering, which falls under the category of unsupervised machine learning, is one way machine learning algorithms solve this kind of problem.

Clustering uses only the input data to determine patterns, anomalies, or similarities in that data.

A good clustering algorithm aims to obtain clusters in which:

• The intra-cluster similarity is high, meaning that the data inside each cluster is similar to one another.
• The inter-cluster similarity is low, meaning that each cluster holds data that is not similar to the data in other clusters.
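As a small, hedged sketch of these two goals in practice with scikit-learn (the synthetic dataset and parameter values are purely illustrative):

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic 2-D data with three natural groups.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Group the unlabeled points into three clusters.
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# The silhouette score is high when intra-cluster similarity is high
# and inter-cluster similarity is low (values close to 1 are good).
print(silhouette_score(X, labels))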

Applications of Cluster Analysis

• Clustering analysis is broadly used in many applications such as market research, pattern recognition, data analysis, and image processing.
• Clustering can also help marketers discover distinct groups in their customer base, and they can characterize their customer groups based on purchasing patterns.
• In the field of biology, it can be used to derive plant and animal taxonomies, categorize genes with similar functionalities, and gain insight into structures inherent to populations.
• Clustering also helps in identifying areas of similar land use in an earth observation database, and in identifying groups of houses in a city according to house type, value, and geographic location.
• Clustering also helps in classifying documents on the web for information discovery.
• Clustering is also used in outlier detection applications such as detection of credit card fraud.
• As a data mining function, cluster analysis serves as a tool to gain insight into the distribution of data and to observe the characteristics of each cluster.

Requirements of Clustering in Data Mining

The following points throw light on why clustering is required in data mining −

• Scalability − We need highly scalable clustering algorithms to deal with large databases.
• Ability to deal with different kinds of attributes − Algorithms should be capable of being applied to any kind of data, such as interval-based (numerical) data, categorical data, and binary data.
• Discovery of clusters with arbitrary shape − The clustering algorithm should be capable of detecting clusters of arbitrary shape. It should not be bounded to distance measures that tend to find only small spherical clusters.
• High dimensionality − The clustering algorithm should be able to handle not only low-dimensional data but also high-dimensional spaces.
• Ability to deal with noisy data − Databases contain noisy, missing, or erroneous data. Some algorithms are sensitive to such data and may produce poor-quality clusters.
• Interpretability − The clustering results should be interpretable, comprehensible, and usable.

5. What is a decision tree? Explain with a proper example.

ans:

Decision Tree Algorithm

In a decision tree, which resembles a flowchart, an inner node represents a variable (or a feature)
of the dataset, a tree branch indicates a decision rule, and every leaf node indicates the outcome
of the specific decision. The first node from the top of a decision tree diagram is the root node.
We can split up data based on the attribute values that correspond to the independent
characteristics.

The Decision Tree Algorithm: How Does It Operate?

Every decision tree algorithm's fundamental principle is as follows:

To divide the data based on target variables, choose the best feature employing Attribute
Selection Measures (ASM).

Then it will divide the dataset into smaller sub-datasets and designate that feature as a decision
node for that branch.

The procedure is then repeated recursively for every child node to continue building the tree, until one of the following conditions matches:

All of the tuples have the same attribute value.

There are no more attributes left.

There are no more instances.


Decision Tree Regression

To predict future events and generate meaningful output of a continuous data type, the decision tree regression algorithm analyses an object's attributes and trains the machine learning model as a tree. The output is not discrete, since it is not entirely defined by a predetermined set of discrete numbers.

A cricket match prediction model that predicts whether a certain team will win or lose a match illustrates discrete output.

A sales forecasting model that predicts the range within which a firm's profit will lie throughout a fiscal year, based on the company's preliminary figures, illustrates continuous output.

A decision tree regression algorithm is utilized in this instance to forecast continuous values.

After talking about sklearn decision trees, let's look at how they are implemented step-by-step.
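Since the notes stop before the implementation, here is a minimal sketch using scikit-learn's DecisionTreeClassifier on its bundled iris dataset (the dataset choice and parameter values are illustrative assumptions, not the original author's example):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Load a small labelled dataset and hold out a test split.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit a shallow tree; max_depth limits over-complex trees (see the
# discussion of overfitting in the disadvantages below).
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
clf.fit(X_train, y_train)

print("test accuracy:", clf.score(X_test, y_test))
print(export_text(clf, feature_names=load_iris().feature_names))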

Some advantages of decision trees are:

Simple to understand and to interpret. Trees can be visualized.

Requires little data preparation. Other techniques often require data normalization, dummy
variables need to be created and blank values to be removed. Note however that this module
does not support missing values.

The cost of using the tree (i.e., predicting data) is logarithmic in the number of data points used
to train the tree.

Able to handle both numerical and categorical data. However, the scikit-learn implementation
does not support categorical variables for now. Other techniques are usually specialized in
analyzing datasets that have only one type of variable. See algorithms for more information.

Able to handle multi-output problems.

Uses a white box model. If a given situation is observable in a model, the explanation for the
condition is easily explained by boolean logic. By contrast, in a black box model (e.g., in an
artificial neural network), results may be more difficult to interpret.

Possible to validate a model using statistical tests. That makes it possible to account for the
reliability of the model.

Performs well even if its assumptions are somewhat violated by the true model from which the
data were generated.
The disadvantages of decision trees include:

Decision-tree learners can create over-complex trees that do not generalize the data well. This is
called overfitting. Mechanisms such as pruning, setting the minimum number of samples
required at a leaf node or setting the maximum depth of the tree are necessary to avoid this
problem.

Decision trees can be unstable because small variations in the data might result in a completely
different tree being generated. This problem is mitigated by using decision trees within an
ensemble.

Predictions of decision trees are neither smooth nor continuous, but piecewise constant approximations. Therefore, they are not good at extrapolation.

The problem of learning an optimal decision tree is known to be NP-complete under several
aspects of optimality and even for simple concepts. Consequently, practical decision-tree learning
algorithms are based on heuristic algorithms such as the greedy algorithm where locally optimal
decisions are made at each node. Such algorithms cannot guarantee to return the globally
optimal decision tree. This can be mitigated by training multiple trees in an ensemble learner,
where the features and samples are randomly sampled with replacement.

There are concepts that are hard to learn because decision trees do not express them easily, such
as XOR, parity or multiplexer problems.

Decision tree learners create biased trees if some classes dominate. It is therefore recommended
to balance the dataset prior to fitting with the decision tree.

6. (i) Construct an FP tree for the following:

Transaction   List of items
T1            I1, I2, I3
T2            I2, I3, I4
T3            I4, I5
T4            I1, I2, I4
T5            I1, I2, I3, I5
T6            I1, I2, I3, I4

What is correlation coefficient?

ans:

Mining of FP-tree is summarized below:

The lowest node item, I5, is not considered as it does not have a min support count. Hence it is
deleted.

The next lower node is I4. I4 occurs in 2 branches: {I2,I1,I3,I4: 1} and {I2,I3,I4: 1}. Therefore, considering I4 as the suffix, the prefix paths will be {I2, I1, I3: 1} and {I2, I3: 1}; this forms the conditional pattern base.

The conditional pattern base is considered a transaction database, and an FP tree is constructed.
This will contain {I2:2, I3:2}, I1 is not considered as it does not meet the min support count.
This path will generate all combinations of frequent patterns : {I2,I4:2},{I3,I4:2},{I2,I3,I4:2}

For I3, the prefix paths would be {I2,I1: 3} and {I2: 1}; this will generate a 2-node conditional FP-tree {I2:4, I1:3}, and the frequent patterns generated are {I2,I3:4}, {I1,I3:3}, {I2,I1,I3:3}.

For I1, the prefix path would be: {I2:4} this will generate a single node FP-tree: {I2:4} and frequent
patterns are generated: {I2, I1:4}.

Item   Conditional Pattern Base       Conditional FP-tree   Frequent Patterns Generated
I4     {I2,I1,I3:1}, {I2,I3:1}        {I2:2, I3:2}          {I2,I4:2}, {I3,I4:2}, {I2,I3,I4:2}
I3     {I2,I1:3}, {I2:1}              {I2:4, I1:3}          {I2,I3:4}, {I1,I3:3}, {I2,I1,I3:3}
I1     {I2:4}                         {I2:4}                {I2,I1:4}

A diagram (not reproduced here) would depict the conditional FP tree associated with the conditional node I3.
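As a hedged cross-check, the frequent itemsets for the transaction table given in the question can be computed with the mlxtend library's FP-Growth implementation (the 50 percent minimum support used here is an assumption, since the question does not state one):

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

transactions = [
    ["I1", "I2", "I3"],        # T1
    ["I2", "I3", "I4"],        # T2
    ["I4", "I5"],              # T3
    ["I1", "I2", "I4"],        # T4
    ["I1", "I2", "I3", "I5"],  # T5
    ["I1", "I2", "I3", "I4"],  # T6
]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
df = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)

# Mine frequent itemsets with FP-Growth (assumed min support of 50%).
print(fpgrowth(df, min_support=0.5, use_colnames=True))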

(ii) The correlation coefficient is a statistical concept which helps in establishing a relation between predicted and actual values obtained in a statistical experiment. The calculated value of the correlation coefficient explains how closely the predicted values match the actual values. The correlation coefficient always lies between -1 and +1. If the correlation coefficient is positive, the two variables move in the same direction; if it is negative, they move in opposite directions. The covariance of two variables divided by the product of their standard deviations gives Pearson's correlation coefficient. It is usually represented by ρ (rho).

ρ (X,Y) = cov (X,Y) / σX.σY.

Here cov is the covariance. σX is the standard deviation of X and σY is the standard deviation of Y.
The correlation coefficient can also be expressed in terms of means and expectations:

ρ (X,Y) = E[(X - μx)(Y - μy)] / σX.σY

where μx and μy are the means of X and Y respectively, and E is the expectation.


7. Describe Kmeans Algorithm. 5

ans:

K-Means Clustering is an Unsupervised Learning algorithm, which groups the unlabeled dataset
into different clusters. Here K defines the number of pre-defined clusters that need to be
created in the process, as if K=2, there will be two clusters, and for K=3, there will be three
clusters, and so on.

It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of
this algorithm is to minimize the sum of distances between the data point and their
corresponding clusters.

The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and repeats the process until it finds the best clusters. The value of k should be predetermined in this algorithm.

The k-means clustering algorithm mainly performs two tasks:

Determines the best value for K center points or centroids by an iterative process.

Assigns each data point to its closest k-center. Those data points which are near to the particular
k-center, create a cluster.

The working of the K-Means algorithm is explained in the below steps:

Step-1: Select the number K to decide the number of clusters.

Step-2: Select K random points as the centroids. (They can be points other than those in the input dataset.)

Step-3: Assign each data point to their closest centroid, which will form the predefined K clusters.

Step-4: Calculate the variance and place a new centroid of each cluster.

Step-5: Repeat the third step, which means reassign each data point to the new closest centroid of its cluster.

Step-6: If any reassignment occurs, then go to step-4 else go to FINISH.

Step-7: The model is ready.
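A minimal sketch of these steps in plain NumPy (the data, the value of K, and the iteration cap are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))          # unlabeled 2-D data (illustrative)
K = 3

# Step 1-2: choose K and pick K random points from the data as initial centroids.
centroids = X[rng.choice(len(X), size=K, replace=False)]

for _ in range(100):                   # safety cap on iterations
    # Step 3/5: assign every point to its closest centroid.
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)

    # Step 4: place the new centroid of each cluster at its mean
    # (keep the old centroid if a cluster happens to be empty).
    new_centroids = np.array([
        X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
        for k in range(K)
    ])

    # Step 6-7: stop when no centroid (and hence no assignment) changes.
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print(centroids)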

8. What is Hierarchical clustering?

ans: Hierarchical clustering is another unsupervised machine learning algorithm, which is used to
group the unlabeled datasets into a cluster and also known as hierarchical cluster analysis or
HCA. In this algorithm, we develop the hierarchy of clusters in the form of a tree, and this tree-
shaped structure is known as the dendrogram. Sometimes the results of K-means clustering and
hierarchical clustering may look similar, but they differ in how they work. Unlike the K-Means algorithm, there is no requirement to predetermine the number of clusters.

The hierarchical clustering technique has two approaches:

Agglomerative: Agglomerative is a bottom-up approach, in which the algorithm starts by taking all data points as single clusters and merges them until one cluster is left.

Divisive: The divisive algorithm is the reverse of the agglomerative algorithm, as it is a top-down approach.

The agglomerative hierarchical clustering algorithm is a popular example of HCA. To group the
datasets into clusters, it follows the bottom-up approach. It means, this algorithm considers each
dataset as a single cluster at the beginning, and then start combining the closest pair of clusters
together. It does this until all the clusters are merged into a single cluster that contains all the
datasets.

Step-1: Create each data point as a single cluster. Let's say there are N data points, so the number
of clusters will also be N.

Step-2: Take two closest data points or clusters and merge them to form one cluster. So, there
will now be N-1 clusters.

Step-3: Again, take the two closest clusters and merge them together to form one cluster. There
will be N-2 clusters.

Step-4: Repeat Step 3 until only one cluster is left.

Step-5: Once all the clusters are combined into one big cluster, develop the dendrogram to divide
the clusters as per the problem.

Single Linkage: It is the shortest distance between the closest points of two clusters.

Complete Linkage: It is the farthest distance between two points of two different clusters. It is one of the popular linkage methods as it forms tighter clusters than single linkage.

Average Linkage: It is the linkage method in which the distance between each pair of points, one from each cluster, is added up and then divided by the total number of pairs to calculate the average distance between two clusters. It is also one of the most popular linkage methods.

Centroid Linkage: It is the linkage method in which the distance between the centroids of the clusters is calculated.

From the above approaches, we can apply any of them according to the type of problem or business requirement.

Working of the Dendrogram in Hierarchical Clustering

The dendrogram is a tree-like structure that is mainly used to record each merge step that the HC algorithm performs. In the dendrogram plot, the Y-axis shows the Euclidean distances between the data points (clusters), and the X-axis shows all the data points of the given dataset.
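A small, hedged sketch of agglomerative clustering and its dendrogram with SciPy (the data and the linkage choice are illustrative):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (10, 2)), rng.normal(5, 0.5, (10, 2))])

# Bottom-up (agglomerative) merging using average linkage.
Z = linkage(X, method="average")

# Cut the tree into two flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)

# Plot the dendrogram: the y-axis shows merge distances, the x-axis the points.
dendrogram(Z)
plt.show()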

9. 1) Find association rules using the Apriori algorithm where minsup is 50% and minconfidence is 70%.

Tid   Rice   Pulse   Oil   Milk   Apple
t1    1      1       1     0      0
t2    0      1       1     1      0
t3    0      0       0     1      1
t4    1      1       0     1      0
t5    1      1       1     0      1
t6    1      1       1     1      1

ans:

Consider a Big Bazar scenario where the product set is P = {Rice, Pulse, Oil, Milk, Apple}. The
database comprises six transactions where 1 represents the presence of the product and 0
represents the absence of the product.

Tid Rice Pulse Oil Milk Apple

t1 1 1 1 0 0

t2 0 1 1 1 0

t3 0 0 0 1 1

t4 1 1 0 1 0

t5 1 1 1 0 1

t6 1 1 1 1 1

The Apriori Algorithm makes the given assumptions

All subsets of a frequent itemset must be frequent.

All supersets of an infrequent itemset must be infrequent.

Fix a threshold support level. In our case, we have fixed it at 50 percent, i.e., an itemset must appear in at least 3 of the 6 transactions.

Step 1

Make a frequency table of all the products that appear in the transactions. Now, shortlist only those products whose support meets the 50 percent threshold, i.e., products that appear in at least 3 of the 6 transactions. We get the following frequency table.

Product     Frequency (Number of transactions)
Rice (R)    4
Pulse (P)   5
Oil (O)     4
Milk (M)    4

The above table indicates the products frequently bought by the customers. (Apple, with a frequency of 3, also meets the 50 percent threshold, but every pair containing Apple occurs in only 2 transactions, so it contributes no larger frequent itemsets and is omitted from the steps below.)

Step 2

Create pairs of products such as RP, RO, RM, PO, PM, OM. You will get the given frequency table.
Itemset Frequency (Number of transactions)

RP 4

RO 3

RM 2

PO 4

PM 3

OM 2

Step 3

Applying the same support threshold of 50 percent, keep the pairs that appear in at least 3 transactions.

Thus, we get RP, RO, PO, and PM.

Step 4

Now, look for a set of three products that the customers buy together. We get the given
combination.

RP and RO give RPO

PO and PM give POM

Step 5

Calculate the frequency of the two itemsets, and you will get the following frequency table.

Itemset   Frequency (Number of transactions)
RPO       3
POM       2

Applying the 50 percent threshold (at least 3 transactions), the customers' frequent set of three products is RPO; POM is dropped.

Finally, generate association rules from the frequent itemset RPO = {Rice, Pulse, Oil} (support 3/6 = 50%) and keep those whose confidence is at least the 70 percent minimum:

R → PO: 3/4 = 75% (accepted)
P → RO: 3/5 = 60% (rejected)
O → RP: 3/4 = 75% (accepted)
RP → O: 3/4 = 75% (accepted)
RO → P: 3/3 = 100% (accepted)
PO → R: 3/4 = 75% (accepted)
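As a hedged cross-check of this worked example with the mlxtend library (the DataFrame below encodes the same six transactions):

import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

data = {
    "Rice":  [1, 0, 0, 1, 1, 1],
    "Pulse": [1, 1, 0, 1, 1, 1],
    "Oil":   [1, 1, 0, 0, 1, 1],
    "Milk":  [0, 1, 1, 1, 0, 1],
    "Apple": [0, 0, 1, 0, 1, 1],
}
df = pd.DataFrame(data).astype(bool)

# Frequent itemsets with min support 50%, then rules with min confidence 70%.
itemsets = apriori(df, min_support=0.5, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])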

10. What is graph mining?


ans:

Graph mining is a process in which the mining techniques are used in finding a pattern or
relationship in the given real-world collection of graphs. By mining the graph, frequent
substructures and relationships can be identified which helps in clustering the graph sets, finding
a relationship between graph sets, or discriminating or characterizing graphs. Predicting these
patterning trends can help in building models for the enhancement of any application that is
used in real-time. To implement the process of graph mining, one must learn to mine frequent
subgraphs.

Frequent Subgraph Mining

Let us consider a graph h with an edge set E(h) and a vertex set V(h). Let us consider the
existence of subgraph isomorphism from h to h’ in such a way that h is a subgraph of h’. A label
function is a function that plots either the edges or vertices to a label. Let us consider a labeled
graph dataset, F = {H1, H2, H3....Hn} Let us consider s(h) as the support which means the
percentage of graphs in F where h is a subgraph. A frequent graph has support that will be no less
than the minimum support threshold. Let us denote it as min_support.

Steps in finding frequent subgraphs:

There are two steps in finding frequent subgraphs.

The first step is to generate frequent substructure candidates.

The second step is to find the support of each candidate. The first step should be optimized as much as possible, because the second step involves subgraph isomorphism testing, which is NP-complete and therefore computationally expensive.

The Apriori-based approach: This approach starts from graphs of small size and advances in a bottom-up manner, generating candidates that have one extra vertex or edge. The algorithm is called AprioriGraph.

Input:

F = a graph data set
min_support = minimum support threshold

Output:

Q1, Q2, Q3, ..., Qk, the frequent substructure sets of sizes 1 to k.

Q1 <- all the frequent 1-subgraphs in F;
k <- 2;
while Qk-1 ≠ ∅ do
    Qk <- ∅;
    Gk <- candidate_generation(Qk-1);
    foreach candidate l ∈ Gk do
        l.count <- 0;
        foreach Fi ∈ F do
            if subgraph_isomorphism(l, Fi) then
                l.count <- l.count + 1;
            end
        end
        if l.count ≥ min_support(F) and l ∉ Qk then
            Qk <- Qk ∪ {l};
        end
    end
    k <- k + 1;
end
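As a hedged illustration of the support-counting step (the expensive subgraph-isomorphism check in the pseudocode above), here is a small sketch using NetworkX; the graph data set and the candidate pattern are invented for the example:

import networkx as nx
from networkx.algorithms import isomorphism

def support(candidate, graph_dataset):
    # Fraction of graphs in the dataset containing the candidate as an
    # (induced) subgraph, checked with NetworkX's GraphMatcher.
    hits = sum(
        1 for g in graph_dataset
        if isomorphism.GraphMatcher(g, candidate).subgraph_is_isomorphic()
    )
    return hits / len(graph_dataset)

# Illustrative graph data set F and a small candidate substructure.
F = [nx.path_graph(4), nx.cycle_graph(4), nx.star_graph(3)]
candidate = nx.path_graph(3)   # a simple 2-edge path

print(support(candidate, F))   # 1.0: every graph above contains a 2-edge path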

The pattern-growth approach: This approach can use both BFS (Breadth First Search) and DFS (Depth First Search); DFS is preferred due to its lower memory consumption.

Constraint-based substructure mining

The constraints specified by the user change the mining process accordingly. However, if we generalize and categorize them into specific constraint types, the mining process can be handled more easily by pushing them into the existing frameworks. A constraint-pushing strategy is used in pattern-growth mining tasks. Let's look at some important constraint categories.

11. What is Time series analysis?

ans:

Time series is a sequence of observations recorded at regular time intervals. Depending on the
frequency of observations, a time series may typically be hourly, daily, weekly, monthly, quarterly
and annual. Sometimes, you might have seconds and minute-wise time series as well, like,
number of clicks and user visits every minute etc.
Time series analysis is the preparatory step before you develop a forecast of the series. Time series forecasting also has enormous commercial significance because quantities that are important to a business, such as demand and sales, the number of visitors to a website, or a stock price, are essentially time series data.
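A brief, hedged sketch of handling time series at different frequencies with pandas (the dates and values are invented for illustration):

import numpy as np
import pandas as pd

# Daily observations over one quarter (illustrative data).
idx = pd.date_range("2023-01-01", "2023-03-31", freq="D")
ts = pd.Series(np.random.default_rng(0).normal(100, 10, len(idx)), index=idx)

# Re-aggregate the daily series to a monthly frequency (mean per month).
print(ts.resample("M").mean())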

12. State Bayes theorem. How can it be applied for data classification? 5

ans:

Bayesian classification uses Bayes' theorem to predict the occurrence of any event. Bayesian classifiers are statistical classifiers grounded in Bayesian probability. The theorem expresses how a degree of belief, expressed as a probability, should be updated to account for evidence.

Bayes' theorem is named after Thomas Bayes, who first used conditional probability to provide an algorithm that uses evidence to calculate limits on an unknown parameter. Bayes' theorem is expressed mathematically by the following equation:

P(X/Y) = P(Y/X) . P(X) / P(Y)

where X and Y are events and P(Y) ≠ 0.

P(X/Y) is a conditional probability that describes the occurrence of event X given that Y is true.

P(Y/X) is a conditional probability that describes the occurrence of event Y given that X is true.

P(X) and P(Y) are the probabilities of observing X and Y independently of each other. This is
known as the marginal probability.

Bayesian interpretation:

In the Bayesian interpretation, probability measures a "degree of belief," and Bayes' theorem connects the degree of belief in a hypothesis before and after accounting for evidence. For example, consider tossing a coin. If we toss a coin, we get either heads or tails, and the probability of either outcome is 50%. If the coin is flipped a number of times and the outcomes are observed, the degree of belief may rise, fall, or remain the same depending on the outcomes.

For proposition X and evidence Y,

P(X), the prior, is the initial degree of belief in X.

P(X/Y), the posterior, is the degree of belief having accounted for Y.

The quotient P(Y/X) / P(Y) represents the support Y provides for X.

Bayes' theorem can be derived from conditional probability:

P(X/Y) = P(X⋂Y) / P(Y)   and   P(Y/X) = P(X⋂Y) / P(X)

where P(X⋂Y) is the joint probability of both X and Y being true. Equating the two expressions for P(X⋂Y) gives Bayes' theorem.
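For data classification, Bayes' theorem is typically applied through a naive Bayes classifier, which picks the class Y that maximizes P(Y/X) ∝ P(X/Y) . P(Y) for a given feature vector X. A minimal, hedged sketch with scikit-learn (the iris dataset is only an illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# GaussianNB estimates P(X/Y) per class and combines it with the prior P(Y).
model = GaussianNB().fit(X_train, y_train)
print("accuracy:", model.score(X_test, y_test))
print("class priors P(Y):", model.class_prior_)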

13. With example explain Bayesian belief network.

ans:

Bayesian network:

A Bayesian network is a probabilistic graphical model (PGM) used to compute uncertainties by utilizing the concept of probability. Also known as belief networks, Bayesian networks represent uncertainties using Directed Acyclic Graphs (DAGs).

A Directed Acyclic Graph is used to show a Bayesian Network, and like some other statistical
graph, a DAG consists of a set of nodes and links, where the links signify the connection between
the nodes.
The nodes here represent random variables, and the edges define the relationship between
these variables.

A DAG models the uncertainty of an event taking place based on the conditional probability distribution (CPD) of each random variable. A conditional probability table (CPT) is used to represent the CPD of each variable in the network.
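Since the question asks for an example, here is a small, hedged sketch of a two-node network Rain → WetGrass with made-up probabilities, computing a posterior by enumeration:

# Prior for the parent node.
P_rain = {True: 0.2, False: 0.8}

# CPT for WetGrass given Rain (values are illustrative).
P_wet_given_rain = {True: 0.9, False: 0.1}   # P(WetGrass=True | Rain)

# Joint distribution from the DAG factorization: P(R, W) = P(R) * P(W | R).
def joint(rain, wet):
    p_wet = P_wet_given_rain[rain]
    return P_rain[rain] * (p_wet if wet else 1 - p_wet)

# Inference by enumeration: P(Rain=True | WetGrass=True).
numerator = joint(True, True)
evidence = joint(True, True) + joint(False, True)
print(numerator / evidence)   # 0.18 / (0.18 + 0.08) ≈ 0.692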

14. What is aggregation?

ans:

Data aggregation refers to a process of collecting information from different sources and
presenting it in a summarized format so that business analysts can perform statistical analyses of
business schemes. The collected information may be gathered from various data sources to
summarize these data sources into a draft for data analysis.

How does data aggregation work?

Data aggregation is needed when a raw dataset contains detail that cannot be used directly for analysis. In data aggregation, the datasets are summarized into significant information, which helps attain the desired outcomes and improves the user experience. Data aggregation provides summary measurements such as sum, average, and count. The collected, summarized data helps business analysts perform demographic studies of customers and their behaviour. Aggregated data helps in determining significant information about a specific group. With the help of data aggregation, we can also compute counts of non-numeric data. Generally, data aggregation is performed on data sets, not on individual records.
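A tiny, hedged sketch of these summary measurements with pandas (the sales table below is invented for illustration):

import pandas as pd

sales = pd.DataFrame({
    "region":  ["East", "East", "West", "West", "West"],
    "product": ["A", "B", "A", "A", "B"],
    "amount":  [120, 80, 200, 150, 90],
})

# Aggregate: sum, average, and count of the amount per region.
summary = sales.groupby("region")["amount"].agg(["sum", "mean", "count"])
print(summary)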

Data aggregators
Data aggregators refer to systems used in data mining to collect data from various sources, process the data, and present the extracted information in a summarized form. They play a vital role in enriching customer data by acting as an agent. They also help in the query and delivery procedure whereby the customer requests data instances about a specific product. The marketing team uses data aggregation to personalize messaging, offers, and more in the user's digital experiences with the brand. It also helps the product management team of an organization know which products generate more revenue and which do not. The aggregated data is also used by financial and company executives, helping them decide how to allocate budget towards marketing or product development strategies.

Working of data aggregators

The working of data aggregators can be performed in three stages

Collection of data

Processing of data

Presentation of data

15. What is regression? Describe different types of regression. 2+3=5

ans:

Regression refers to a type of supervised machine learning technique that is used to predict any
continuous-valued attribute. Regression helps any business organization to analyze the target
variable and predictor variable relationships.

Types of Regression

Regression is divided into five different types

Linear Regression

Logistic Regression

Lasso Regression

Ridge Regression

Polynomial Regression

Linear Regression

Linear regression is the type of regression that forms a relationship between the target variable
and one or more independent variables utilizing a straight line. The given equation represents
the equation of linear regression
Y = a + b*X + e.

Where,

a represents the intercept

b represents the slope of the regression line

e represents the error

X and Y represent the predictor and target variables, respectively.

If X is made up of more than one variable, the model is termed multiple linear regression.

In linear regression, the best fit line is achieved utilizing the least squared method, and it
minimizes the total sum of the squares of the deviations from each data point to the line of
regression. Here, the positive and negative deviations do not get canceled as all the deviations
are squared.

Polynomial Regression

If the power of the independent variable is more than 1 in the regression equation, it is termed a polynomial equation. The example below illustrates the concept of polynomial regression:

Y = a + b*X^2

In polynomial regression, the best fit line is not a straight line as in linear regression; instead, it is a curve fitted to all the data points.

Applying polynomial regression can lead to overfitting when you are tempted to reduce your errors by making the curve more complex, so always try to fit a curve that generalizes well to the problem.

Logistic Regression

When the dependent variable is binary in nature, i.e., 0 and 1, true or false, success or failure, the
logistic regression technique comes into existence. Here, the target value (Y) ranges from 0 to 1,
and it is primarily used for classification-based problems. Unlike linear regression, it does not
need any independent and dependent variables to have a linear relationship.

Ridge Regression

Ridge regression refers to a process that is used to analyse regression data that suffer from multicollinearity. Multicollinearity is the existence of a linear correlation between two independent variables.

Ridge regression is used when the least squares estimates are unbiased but have high variance, so they can be quite different from the real values. By adding a degree of bias to the estimated regression values, ridge regression reduces these errors.

Lasso Regression

The term LASSO stands for Least Absolute Shrinkage and Selection Operator. Lasso regression is a linear type of regression that utilizes shrinkage: the coefficients are shrunk towards a central point, such as the mean. The lasso procedure is best suited for simple, sparse models with fewer parameters than other regressions, and it is also well suited for models that suffer from multicollinearity.
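A compact, hedged sketch of three of these regression types with scikit-learn (the synthetic data and alpha values are illustrative):

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, 0.0, 0.0, 1.5, -2.0]) + rng.normal(scale=0.5, size=200)

linear = LinearRegression().fit(X, y)
ridge  = Ridge(alpha=1.0).fit(X, y)    # adds a small bias to reduce variance
lasso  = Lasso(alpha=0.1).fit(X, y)    # shrinks some coefficients exactly to zero

print("linear:", linear.coef_.round(2))
print("ridge: ", ridge.coef_.round(2))
print("lasso: ", lasso.coef_.round(2))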

16. What is a Data Warehouse?

ans:

A Data Warehouse (DW) is a relational database that is designed for query and analysis rather
than transaction processing. It includes historical data derived from transaction data from single
and multiple sources.

A Data Warehouse provides integrated, enterprise-wide, historical data and focuses on providing
support for decision-makers for data modeling and analysis.

A Data Warehouse is a group of data specific to the entire organization, not only to a particular
group of users.

It is not used for daily operations and transaction processing but used for making decisions.

A Data Warehouse can be viewed as a data system with the following attributes:

It is a database designed for investigative tasks, using data from various applications.

It supports a relatively small number of clients with relatively long interactions.

It includes current and historical data to provide a historical perspective of information.

Its usage is read-intensive.

It contains a few large tables.

"A Data Warehouse is a subject-oriented, integrated, and time-variant store of information in support of management's decisions."

Characteristics of Data Warehouse

Subject-Oriented

A data warehouse targets the modeling and analysis of data for decision-makers. Therefore, data warehouses typically provide a concise and straightforward view of a particular subject, such as customer, product, or sales, instead of the organization's global ongoing operations. This is done by excluding data that are not useful for the subject and including all data needed by the users to understand the subject.

Integrated

A data warehouse integrates various heterogeneous data sources like RDBMS, flat files, and
online transaction records. It requires performing data cleaning and integration during data
warehousing to ensure consistency in naming conventions, attributes types, etc., among different
data sources.

Time-Variant

Historical information is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months, or even older from a data warehouse. This contrasts with a transaction system, where often only the most current data is kept.

Non-Volatile

The data warehouse is a physically separate data storage, which is transformed from the source
operational RDBMS. The operational updates of data do not occur in the data warehouse, i.e.,
update, insert, and delete operations are not performed. It usually requires only two procedures
in data accessing: Initial loading of data and access to data. Therefore, the DW does not require
transaction processing, recovery, and concurrency capabilities, which allows for substantial
speedup of data retrieval. Non-volatile means that, once data is entered into the warehouse, it should not change.

17. What is data redundancy? Describe the correlation coefficient.

ans:

In data mining, during data integration, many data stores are used. It may lead to data
redundancy. An attribute is known as redundant if it can be derived from any set of attributes. Let
us consider we have a set of data where there are 20 attributes. Now suppose that out of 20, an
attribute can be derived from some of the other set of attributes. Such attributes that can be
derived from other sets of attributes are called Redundant attributes. Inconsistencies in attribute
or dimension naming may lead to redundancies in the set of data.

The correlation coefficient for Numeric data:

In the case of numeric data, this test is used. In this test, the relation between the A attribute and
B attribute is computed by Pearson's product-moment coefficient, also called the correlation
coefficient. A correlation coefficient measures the extent to which the value of one variable
changes with another. The best known are Pearson's and Spearman's rank-order. The first is used
where both variables are continuous, the second where at least one represents a rank.

There are several different correlation coefficients, each of which is appropriate for different
types of data. The most common is the Pearson r, used for continuous variables. It is a statistic
that measures the degree to which one variable varies in tandem with another. It ranges from -1
to +1. A +1 correlation means that as one variable rises, the other rises proportionally; a -1
correlation means that as one rises, the other falls proportionally. A 0 correlation means that
there is no relationship between the movements of the two variables.

The formula used for numeric data is given below:

r(A,B) = Σ (ai - mean(A)) (bi - mean(B)) / (n . σA . σB)

where

n = number of tuples

ai = value of attribute A in tuple i

bi = value of attribute B in tuple i

and mean(A), mean(B) are the means of A and B, while σA, σB are their standard deviations.

From the above discussion, we can say that the greater the correlation coefficient, the more strongly the attributes are correlated to each other, and we can ignore one of them (either A or B). If the value of the correlation coefficient is zero, the attributes are independent. If the value is negative, one attribute discourages the other: as the value of one attribute increases, the value of the other decreases.
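A quick, hedged check of this with NumPy (the attributes below are invented so that B is fully derivable from A, i.e., redundant):

import numpy as np

a = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
b = 3 * a + 1              # B is fully derivable from A -> redundant attribute
c = np.array([5.0, 1.0, 4.0, 2.0, 3.0])

# Pearson correlation coefficients; values near +1 or -1 flag redundancy.
print(np.corrcoef(a, b)[0, 1])   # 1.0 -> drop one of A, B
print(np.corrcoef(a, c)[0, 1])   # -0.3, a much weaker relationship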
18. What is data binning? Define its types.

ANS:

Data binning, also called discrete binning or bucketing, is a data pre-processing technique used to reduce the effects of minor observation errors. It is a form of quantization. The original data values are divided into small intervals known as bins, and they are then replaced by a general value calculated for that bin. This has a smoothing effect on the input data and may also reduce the chance of overfitting in the case of small datasets.

Statistical data binning is a way to group numbers of more or less continuous values into a
smaller number of "bins". It can also be used in multivariate statistics, binning in several
dimensions simultaneously. For example, if you have data about a group of people, you might
want to arrange their ages into a smaller number of age intervals, such as grouping every five
years together.

Binning can dramatically improve resource utilization and model build response time without
significant loss in model quality. Binning can improve model quality by strengthening the
relationship between attributes.

Supervised binning is a form of intelligent binning in which important characteristics of the data
are used to determine the bin boundaries. In supervised binning, the bin boundaries are
identified by a single-predictor decision tree that considers the joint distribution with the target.
Supervised binning can be used for both numerical and categorical attributes.

1. Equal Frequency Binning: Bins have an equal frequency.

For example, equal frequency:

Input: [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]

Output:

[5, 10, 11, 13]

[15, 35, 50, 55]

[72, 92, 204, 215]


2. Equal Width Binning: Bins have equal width; the boundaries of the bins are defined as [min + w], [min + 2w], ..., [min + nw], where w = (max - min) / (number of bins).

For example, equal Width:

Input: [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]

Output:

[5, 10, 11, 13, 15, 35, 50, 55, 72]

[92]

[204, 215]
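A hedged sketch of both binning types with pandas, using the same input values as above (bin edges computed by pandas may differ slightly from the hand-worked ranges):

import pandas as pd

values = pd.Series([5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215])

# Equal-frequency binning: each of the 3 bins holds roughly 4 values.
print(pd.qcut(values, q=3))

# Equal-width binning: each bin spans w = (215 - 5) / 3 = 70 units.
print(pd.cut(values, bins=3))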

19. Describe Density-based clustering using example.

ans:

Density-based clustering refers to one of the most popular unsupervised learning methodologies used in model building and machine learning algorithms. Data points lying in low-density regions between clusters are considered noise. The surroundings within a radius ε of a given object are known as the ε-neighborhood of the object. If the ε-neighborhood of an object contains at least a minimum number of objects, MinPts, then the object is called a core object.

There are two parameters used in density-based clustering:

Eps: the maximum radius of the neighborhood.

MinPts: the minimum number of points in an Eps-neighborhood of a point.

NEps(i) = { k belongs to D and dist(i, k) <= Eps }

Directly density reachable:

A point i is directly density-reachable from a point k with respect to Eps and MinPts if

i belongs to NEps(k), and

k satisfies the core point condition: |NEps(k)| >= MinPts

Density reachable:

A point i is density reachable from a point j with respect to Eps and MinPts if there is a chain of
points i1, ..., in with i1 = j and in = i such that each point in the chain is directly density
reachable from the previous one.

Density connected:
A point i refers to density connected to a point j with respect to Eps, MinPts if there is a point o
such that both i and j are considered as density reachable from o with respect to Eps and MinPts.

Working of Density-Based Clustering

Suppose a set of objects is denoted by D'. We can say that an object i is directly density reachable
from the object j only if it is located within the ε neighborhood of j and j is a core object.

An object i is density reachable from the object j with respect to ε and MinPts in a given set of
objects, D', only if there is a chain of objects i1, ..., in with i1 = j and in = i such that each
object in the chain is directly density reachable from the previous one with respect to ε and MinPts.

An object i is density connected to an object j with respect to ε and MinPts in a given set of
objects, D', only if there is an object o belonging to D' such that both i and j are density
reachable from o with respect to ε and MinPts.

Major Features of Density-Based Clustering

The primary features of Density-based clustering are given below.

It requires only a single scan of the input data.

It requires density parameters as a termination condition.


It is used to manage noise in data clusters.

Density-based clustering is used to identify clusters of arbitrary size.

marks:15

1.

Find entropy and information gain.

answer:

Decision trees are upside down which means the root is at the top and then this root is split into
various several nodes. Decision trees are nothing but a bunch of if-else statements in layman
terms. It checks if the condition is true and if it is then it goes to the next node attached to that
decision.

In the below diagram the tree will first ask what is the weather? Is it sunny, cloudy, or rainy? If
yes then it will go to the next feature which is humidity and wind. It will again check if there is a
strong wind or weak, if it’s a weak wind and it’s rainy then the person may go and play.
Entropy

Entropy is nothing but the uncertainty in our dataset or measure of disorder.

In a decision tree, the output is mostly “yes” or “no”

The formula for Entropy is shown below:

Entropy(S) = -p+ log2(p+) - p- log2(p-)

Here p+ is the probability of the positive class,

p- is the probability of the negative class, and

S is the subset of the training examples.

Entropy basically measures the impurity of a node. Impurity is the degree of randomness; it tells
how random our data is. A pure sub-split means that either you should be getting "yes", or you
should be getting "no".

Suppose a feature has 8 "yes" and 4 "no" initially; after the first split the left node gets 5 "yes"
and 2 "no" whereas the right node gets 3 "yes" and 2 "no".

We see here the split is not pure, why? Because we can still see some negative classes in both the
nodes. In order to make a decision tree, we need to calculate the impurity of each split, and
when the purity is 100%, we make it as a leaf node.

To check the impurity of feature 2 and feature 3 we will take the help of the Entropy formula. The
left node has lower entropy (more purity) than the right node, since the left node has a greater
number of "yes" and it is easy to decide here.

Information Gain

Information gain measures the reduction of uncertainty given some feature and it is also a
deciding factor for which attribute should be selected as a decision node or root node.


To understand this better let’s consider an example:Suppose our entire population has a total of
30 instances. The dataset is to predict whether the person will go to the gym or not. Let’s say 16
people go to the gym and 14 people don’t
Now we have two features to predict whether he/she will go to the gym or not.

Feature 1 is “Energy” which takes two values “high” and “low”

Feature 2 is “Motivation” which takes 3 values “No motivation”, “Neutral” and “Highly
motivated”.

Let’s see how our decision tree will be made using these 2 features. We’ll use information gain to
decide which feature should be the root node and which feature should be placed after the split.

Let's calculate the entropy of the parent node (16 "yes" and 14 "no" out of 30):

E(Parent) = -(16/30) log2(16/30) - (14/30) log2(14/30) ≈ 0.99

To get the weighted average of the entropy of each node after splitting on "Energy", each child's
entropy is weighted by the fraction of instances that reach it:

E(Parent|Energy) = Σ (ni / n) × E(child i)

Now that we have the values of E(Parent) and E(Parent|Energy), the information gain is:

Information Gain = E(Parent) - E(Parent|Energy)

Our parent entropy was near 0.99, and after looking at this value of information gain, we can say
that the entropy of the dataset will decrease by 0.37 if we make "Energy" our root node.
Similarly, we will do this with the other feature “Motivation” and calculate its information gain.

Let’s calculate the entropy here:

To see the weighted average of entropy of each node we will do as follows:

Now that we have E(Parent) and E(Parent|Motivation), the information gain for "Motivation" follows
from the same formula; the feature with the larger information gain is selected as the root node.
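The mechanics above can be reproduced with a short Python sketch. The parent counts (16 "yes", 14 "no") come from the example; the per-value counts for "Energy" are not given in the text, so the split used below is purely hypothetical and only illustrates the calculation.

import math

def entropy(pos, neg):
    # Entropy of a node containing `pos` positive and `neg` negative examples.
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            result -= p * math.log2(p)
    return result

parent = entropy(16, 14)   # E(Parent) for 16 "yes" / 14 "no" -> about 0.997

# Hypothetical split on "Energy": high -> 12 yes / 1 no, low -> 4 yes / 13 no.
children = [(12, 1), (4, 13)]
n = sum(sum(c) for c in children)
weighted = sum(sum(c) / n * entropy(*c) for c in children)   # E(Parent|Energy)

print("E(Parent)        =", round(parent, 3))
print("E(Parent|Energy) =", round(weighted, 3))
print("Information Gain =", round(parent - weighted, 3))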
2.
Using the k-means algorithm divide the dataset into two clusters.

ID X Y
1 1 1
2 1.5 2
3 3 4
4 5 7
5 3.5 5
6 4.5 5
7 3.5 4.5

ans:

Taking points 1 and 4 as the initial centroids, c1 = (1, 1) and c2 = (5, 7), and reassigning points
until the clusters no longer change, the final clusters are:

c1 = {1, 2}
c2 = {3, 4, 5, 6, 7}
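A small scikit-learn sketch (an addition, not part of the original answer) that reproduces this clustering; it assumes points 1 and 4 are taken as the initial centroids, which is one common choice for this exercise.

import numpy as np
from sklearn.cluster import KMeans

# Points 1..7 from the question, in ID order.
X = np.array([[1, 1], [1.5, 2], [3, 4], [5, 7], [3.5, 5], [4.5, 5], [3.5, 4.5]])

# Start from points 1 and 4 as the initial centroids and iterate until stable.
init = np.array([[1.0, 1.0], [5.0, 7.0]])
km = KMeans(n_clusters=2, init=init, n_init=1).fit(X)

for c in range(2):
    ids = [i + 1 for i, lbl in enumerate(km.labels_) if lbl == c]
    print(f"Cluster {c + 1}: IDs {ids}, final centroid {km.cluster_centers_[c]}")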

3.

S1 5 7
S2 8 4
S3 3 3
S4 4 4
S5 3 7
S6 6 7
S7 6 1
S8 5 5

Using DBSCAN clustering, find the core and noise points, where epsilon (ε) = 3 and MinPts = 3.

ans:
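No worked answer is given above; as a hedged sketch only, the points can be fed to scikit-learn's DBSCAN. Note that Euclidean distance is assumed and that sklearn's min_samples counts a point toward its own neighborhood, so the core/noise labels may differ from a hand computation that uses a different convention.

import numpy as np
from sklearn.cluster import DBSCAN

# S1..S8 from the question.
X = np.array([[5, 7], [8, 4], [3, 3], [4, 4], [3, 7], [6, 7], [6, 1], [5, 5]])

db = DBSCAN(eps=3, min_samples=3).fit(X)

core = [f"S{i + 1}" for i in db.core_sample_indices_]
noise = [f"S{i + 1}" for i, lbl in enumerate(db.labels_) if lbl == -1]
print("Core points :", core)
print("Noise points:", noise)   # points not density reachable from any core point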
4.

Apply Agglomerative algorithm using single, complete, and average linkage. 15

Data Value
A 1
B 3
C 5
D 6
E 9
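No worked answer is given for this question; the sketch below, assuming SciPy is available, prints the agglomerative merge history for all three linkage criteria on the one-dimensional data A..E.

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

labels = ["A", "B", "C", "D", "E"]
values = np.array([[1], [3], [5], [6], [9]])

dists = pdist(values)                      # pairwise |x - y| distances
for method in ("single", "complete", "average"):
    # Each row of Z is one merge: [cluster i, cluster j, merge distance, new cluster size].
    Z = linkage(dists, method=method)
    print(f"\n{method} linkage merge history:\n{Z}")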

5. Using the k-means algorithm divide the dataset into two clusters.

ID X Y
1 185 72
2 170 56
3 168 60
4 179 68
5 182 72
6 188 72
7 180 71
8 180 70
9 183 84
10 180 88
11 180 67
12 177 76

Ans:

C1 = {1, 4, 5, 6, 7, 8, 9, 10, 11, 12}
C2 = {2, 3}
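As a quick check (an addition to the answer above), the same scikit-learn call can be run on this dataset; with k = 2 the two shorter and lighter records (IDs 2 and 3) separate from the rest, although the exact assignment of a borderline point can depend on the initial centroids used in the exercise.

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[185, 72], [170, 56], [168, 60], [179, 68], [182, 72], [188, 72],
              [180, 71], [180, 70], [183, 84], [180, 88], [180, 67], [177, 76]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
for c in range(2):
    print(f"Cluster {c + 1}:", [i + 1 for i, lbl in enumerate(km.labels_) if lbl == c])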

6. Illustrate FP Growth Algorithm with proper example. 10+5=15


Describe the maximal frequent itemset.

Ans:

The FP-Growth Algorithm is an alternative way to find frequent item sets without using
candidate generation, thus improving performance. To do so, it uses a divide-and-conquer
strategy. The core of this method is the usage of a special data structure named
frequent-pattern tree (FP-tree), which retains the item set association information.

This algorithm works as follows:


o First, it compresses the input database creating an FP-tree instance to represent
frequent items.

o After this first step, it divides the compressed database into a set of conditional
databases, each associated with one frequent pattern.

o Finally, each such database is mined separately.

Using this strategy, the FP-Growth reduces the search costs by recursively looking for short
patterns and then concatenating them into the long frequent patterns.

In large databases, holding the FP tree in the main memory is impossible. A strategy to cope
with this problem is to partition the database into a set of smaller databases (called projected
databases) and then construct an FP-tree from each of these smaller databases.

FP-Tree

The frequent-pattern tree (FP-tree) is a compact data structure that stores quantitative
information about frequent patterns in a database. Each transaction is read and then
mapped onto a path in the FP-tree. This is done until all transactions have been read.
Different transactions with common subsets allow the tree to remain compact because their
paths overlap.

A frequent Pattern Tree is made with the initial item sets of the database. The purpose of the
FP tree is to mine the most frequent pattern. Each node of the FP tree represents an item of
the item set.

The root node represents null, while the lower nodes represent the item sets. The
associations of the nodes with the lower nodes, that is, the item sets with the other item sets,
are maintained while forming the tree.

Han defines the FP-tree as the tree structure given below:

1. One root is labelled as "null" with a set of item-prefix subtrees as children and a
frequent-item-header table.

2. Each node in the item-prefix subtree consists of three fields:

o Item-name: registers which item is represented by the node;

o Count: the number of transactions represented by the portion of the path


reaching the node;

o Node-link: links to the next node in the FP-tree carrying the same item name
or null if there is none.

3. Each entry in the frequent-item-header table consists of two fields:

o Item-name: as the same to the node;


o Head of node-link: a pointer to the first node in the FP-tree carrying the item
name.

Additionally, the frequent-item-header table can have the count support for an item. The
below diagram is an example of a best-case scenario that occurs when all transactions have
the same item set; the size of the FP-tree will be only a single branch of nodes.
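As an illustrative sketch only (the transactions below are made up, and the third-party mlxtend package is assumed to be installed), FP-Growth can be run on a one-hot encoded transaction table:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

# A small, invented transaction database.
transactions = [
    ["milk", "bread", "butter"],
    ["bread", "butter"],
    ["milk", "bread"],
    ["milk", "bread", "butter", "eggs"],
    ["bread", "eggs"],
]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# fpgrowth builds and mines the FP-tree internally, with no candidate generation.
frequent = fpgrowth(onehot, min_support=0.6, use_colnames=True)
print(frequent.sort_values("support", ascending=False))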

(ii)
A maximal frequent itemset is a frequent itemset for which none of its direct supersets is
frequent. In an itemset lattice, the itemsets are broken into two groups: those that are frequent
and those that are infrequent, separated by a frequent itemset border (usually drawn as a dashed
line in lattice diagrams).

Every itemset situated above the border is frequent, while those located under the border (the
shaded nodes in such diagrams) are infrequent. Among the itemsets residing near the border,
{a, d}, {a, c, e}, and {b, c, d, e} are considered maximal frequent itemsets because their direct
supersets are infrequent.

The itemset {a, d} is maximal frequent because all of its direct supersets, {a, b, d}, {a, c, d},
and {a, d, e}, are infrequent. In contrast, {a, c} is non-maximal because one of its direct
supersets, {a, c, e}, is frequent.

Maximal frequent itemsets provide a compact description of the frequent itemsets.

In other terms, they form the smallest set of itemsets from which all frequent itemsets can be
derived. For instance, the frequent itemsets can be broken into two groups as follows:
 Frequent itemsets that start with item a and that can include items c, d, or e.
This group contains itemsets such as {a}, {a, c}, {a, d}, {a, e}, and {a, c, e}.
 Frequent itemsets that start with items b, c, d, or e. This group contains
itemsets such as {b}, {b, c}, {c, d}, {b, c, d, e}, etc.
Frequent itemsets that belong to the first group are subsets of either {a, c, e} or {a, d}, while
those that belong to the second group are subsets of {b, c, d, e}. Therefore, the maximal
frequent itemsets {a, c, e}, {a, d}, and {b, c, d, e} provide a compact description of the
frequent itemsets.

Maximal frequent itemsets provide a valuable representation for data sets that can produce very
long frequent itemsets, as there are exponentially many frequent itemsets in such data. This
approach is practical only if an efficient algorithm exists to explicitly discover the maximal
frequent itemsets without having to enumerate all their subsets.

Despite supporting a compact description, maximal frequent itemsets do not include the
support data of their subsets. For instance, the support of the maximal frequent itemsets
{a,c,e}, {a,d}, and {b,c,d,e} do not give any idea about the support of their subsets.

An additional pass over the data set is required to decide the support counts of the non-
maximal frequent itemsets. In some cases, it can be desirable to have a minimal description
of frequent itemsets that preserves the support data.
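The idea can be checked with a small brute-force Python sketch. The transactions below are invented so that the maximal frequent itemsets come out as {a, c, e}, {a, d}, and {b, c, d, e}, matching the example in the text; the minimum support of 2 transactions is likewise an assumption.

from itertools import combinations

transactions = [{"a", "c", "e"}, {"a", "d"}, {"b", "c", "d", "e"},
                {"a", "c", "e"}, {"a", "d"}, {"b", "c", "d", "e"}]
min_sup = 2
items = sorted(set().union(*transactions))

# Brute-force support counting (only sensible for tiny examples).
frequent = []
for size in range(1, len(items) + 1):
    for cand in combinations(items, size):
        support = sum(set(cand) <= t for t in transactions)
        if support >= min_sup:
            frequent.append(frozenset(cand))

# Maximal frequent itemsets: frequent itemsets with no frequent proper superset.
maximal = [s for s in frequent if not any(s < other for other in frequent)]
print([sorted(s) for s in maximal])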

7. What is data cleaning? Explain the methods of data cleaning.

7+8=15

Ans:

Data cleaning is a crucial process in Data Mining. It plays an important part in the building of a
model. Data cleaning is a necessary process, but it is often neglected. Data quality is the main
issue in quality information management; data quality problems occur everywhere in information
systems, and these problems are solved by data cleaning.
Data cleaning is fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or
incomplete data within a dataset. If data is incorrect, outcomes and algorithms are unreliable,
even though they may look correct. When combining multiple data sources, there are many
opportunities for data to be duplicated or mislabeled.

Steps of Data Cleaning

While the techniques used for data cleaning may vary according to the types of data your
company stores, you can follow these basic steps to cleaning your data, such as:

1. Remove duplicate or irrelevant observations

Remove unwanted observations from your dataset, including duplicate observations or


irrelevant observations. Duplicate observations will happen most often during data
collection. When you combine data sets from multiple places, scrape data, or receive data
from clients or multiple departments, there are opportunities to create duplicate data. De-
duplication is one of the largest areas to be considered in this process. Irrelevant
observations are when you notice observations that do not fit into the specific problem you
are trying to analyze.

For example, if you want to analyze data regarding millennial customers, but your dataset
includes older generations, you might remove those irrelevant observations. This can make
analysis more efficient, minimize distraction from your primary target, and create a more
manageable and performable dataset.

2. Fix structural errors

Structural errors are when you measure or transfer data and notice strange naming
conventions, typos, or incorrect capitalization. These inconsistencies can cause mislabeled
categories or classes. For example, you may find "N/A" and "Not Applicable" in any sheet,
but they should be analyzed in the same category.

3. Filter unwanted outliers

Often, there will be one-off observations where, at a glance, they do not appear to fit within
the data you are analyzing. If you have a legitimate reason to remove an outlier, like
improper data entry, doing so will help the performance of the data you are working with.

However, sometimes, the appearance of an outlier will prove a theory you are working on.
And just because an outlier exists doesn't mean it is incorrect. This step is needed to
determine the validity of that number. If an outlier proves to be irrelevant for analysis or is a
mistake, consider removing it.

4. Handle missing data

You can't ignore missing data because many algorithms will not accept missing values. There
are a couple of ways to deal with missing data. Neither is optimal, but both can be
considered, such as:

You can drop observations with missing values, but this will drop or lose
information, so be careful before removing it.
You can input missing values based on other observations; again, there is an
opportunity to lose the integrity of the data because you may be operating from
assumptions and not actual observations.

You might alter how the data is used to navigate null values effectively.
5. Validate and QA

At the end of the data cleaning process, you should be able to answer these questions as a
part of basic validation, such as:

o Does the data make sense?

o Does the data follow the appropriate rules for its field?

o Does it prove or disprove your working theory or bring any insight to light?

o Can you find trends in the data to help you for your next theory?

o If not, is that because of a data quality issue?


Because of incorrect or noisy data, false conclusions can inform poor business strategy and decision-
making. False conclusions can lead to an embarrassing moment in a reporting meeting when you
realize your data doesn't stand up to study. Before you get there, it is important to create a culture of
quality data in your organization. To do this, you should document the tools you might use to create
this strategy.

Methods of Data Cleaning


There are many data cleaning methods through which the data should be run. The methods
are described below:

1. Ignore the tuples: This method is not very feasible, as it is only useful when a tuple has
several attributes with missing values.

2. Fill in the missing value: This approach is also not very effective or feasible. Moreover,
it can be a time-consuming method. In this approach, one has to fill in the missing
value. This is usually done manually, but it can also be done using the attribute mean or
the most probable value.

3. Binning method: This approach is very simple to understand. The smoothing of
sorted data is done using the values around it. The data is then divided into several
segments of equal size. After that, the different methods are executed to complete
the task.

4. Regression: The data is made smooth with the help of a regression function.
The regression can be linear or multiple. Linear regression has only one independent
variable, and multiple regression has more than one independent variable.

5. Clustering: This method mainly operates on groups. Clustering groups the data into
clusters. Then, the outliers are detected with the help of clustering. Next, the similar
values are arranged into a "group" or a "cluster".
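The basic cleaning steps described in this answer (removing duplicates, fixing structural errors, filtering outliers, and filling missing values) can be sketched with pandas on a tiny, made-up table; the column names and the 0-120 age range are assumptions for illustration only.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name":   ["Ann", "Ann", "Bob", "Cara", "Dan"],
    "status": ["N/A", "N/A", "Not Applicable", "active", "active"],
    "age":    [25, 25, np.nan, 32, 300],               # a missing value and an impossible outlier
})

df = df.drop_duplicates()                               # remove duplicate observations
df["status"] = df["status"].replace({"Not Applicable": "N/A"})   # fix structural errors
df = df[df["age"].between(0, 120) | df["age"].isna()]   # filter an unwanted outlier
df["age"] = df["age"].fillna(df["age"].mean())          # fill missing values with the mean
print(df)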

8. What is a decision tree? Describe the Naïve Bayes Classifier Algorithm. Why is it called Naïve
Bayes?

5+5+5= 15

ans:

Decision Tree is a supervised learning method used in data mining for classification and
regression methods. It is a tree that helps us in decision-making purposes. The decision
tree creates classification or regression models as a tree structure. It separates a data
set into smaller subsets, and at the same time, the decision tree is steadily developed.
The final tree is a tree with decision nodes and leaf nodes. A decision node has at
least two branches. The leaf nodes show a classification or decision; we cannot split
leaf nodes any further. The uppermost decision node in a tree, which corresponds to the
best predictor, is called the root node. Decision trees can deal with both categorical
and numerical data.

Entropy:
Entropy refers to a common way to measure impurity. In the decision tree, it measures the
randomness or impurity in data sets.

Information Gain:
Information Gain refers to the decline in entropy after the dataset is split. It is also called
Entropy Reduction. Building a decision tree is all about discovering attributes that return the
highest data gain.

Decision tree Algorithm:

The decision tree algorithm may appear long, but the basic technique is quite simple and proceeds
as follows:

The algorithm is based on three parameters: D, attribute_list, and Attribute _selection_method.

Generally, we refer to D as a data partition.


Initially,D is the entire set of training tuples and their related class level (input training data).

The parameter attribute_list is a set of attributes defining the tuples.

Attribute_selection_method specifies a heuristic process for choosing the attribute that
"best" discriminates the given tuples according to class.

The Attribute_selection_method procedure applies an attribute selection measure, such as
information gain or the Gini index.

Advantages of using decision trees:


A decision tree does not need scaling of information.

Missing values in data also do not influence the process of building a choice tree to any
considerable extent.

A decision tree model is automatic and simple to explain to the technical team as well as
stakeholders.

Compared to other algorithms, decision trees need less exertion for data preparation during
pre-processing.

A decision tree does not require a standardization of data.

(ii)

Naïve Bayes Classifier Algorithm

Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes


theorem and used for solving classification problems.

It is mainly used in text classification that includes a high-dimensional training


dataset.

Naïve Bayes Classifier is one of the simple and most effective Classification
algorithms which helps in building the fast machine learning models that can make
quick predictions.

It is a probabilistic classifier, which means it predicts on the basis of the probability


of an object.

Some popular examples of Naïve Bayes Algorithm are spam filtration, Sentimental
analysis, and classifying articles.
(iii)
Why is it called Naïve Bayes?
The Naïve Bayes algorithm is comprised of two words Naïve and Bayes, Which can be
described as:

Naïve: It is called Naïve because it assumes that the occurrence of a certain feature
is independent of the occurrence of other features. Such as if the fruit is identified on
the bases of color, shape, and taste, then red, spherical, and sweet fruit is recognized
as an apple. Hence each feature individually contributes to identify that it is an apple
without depending on each other.

Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.

9. What is Bayes theorem?? Describe Working of Naïve Bayes' Classifier. What is Data
preprocessing?? 5+5+5=15

ans:

Bayes' Theorem:

Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to
determine the probability of a hypothesis with prior knowledge. It depends on the
conditional probability.

The formula for Bayes' theorem is given as:

P(A|B) = ( P(B|A) × P(A) ) / P(B)

Where,

P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.

P(B|A) is Likelihood probability: Probability of the evidence given that the probability of a
hypothesis is true.

P(A) is Prior Probability: Probability of hypothesis before observing the evidence.


P(B) is Marginal Probability: Probability of Evidence.

(ii)

Working of Naïve Bayes' Classifier:


Working of Naïve Bayes' Classifier can be understood with the help of the below example:
Suppose we have a dataset of weather conditions and corresponding target variable "Play".
So using this dataset we need to decide that whether we should play or not on a particular
day according to the weather conditions. So to solve this problem, we need to follow the
below steps:

1. Convert the given dataset into frequency tables.

2. Generate Likelihood table by finding the probabilities of given features.

3. Now, use Bayes' theorem to calculate the posterior probability.
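A minimal pandas walk-through of these three steps, using a made-up weather/play table (the records and the resulting probabilities below are illustrative only):

import pandas as pd

data = pd.DataFrame({
    "Outlook": ["Sunny", "Sunny", "Overcast", "Rainy", "Rainy",
                "Overcast", "Sunny", "Rainy", "Sunny", "Overcast"],
    "Play":    ["No", "No", "Yes", "Yes", "No",
                "Yes", "Yes", "Yes", "No", "Yes"],
})

# Step 1: frequency table of Outlook against Play.
freq = pd.crosstab(data["Outlook"], data["Play"])
print(freq)

# Step 2: likelihoods P(Outlook | Play) and priors P(Play).
likelihood = pd.crosstab(data["Outlook"], data["Play"], normalize="columns")
prior = data["Play"].value_counts(normalize=True)

# Step 3: Bayes' theorem, P(Play | Sunny) is proportional to P(Sunny | Play) * P(Play).
posterior = likelihood.loc["Sunny"] * prior
posterior /= posterior.sum()
print(posterior)    # probability of playing vs not playing on a sunny day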


(iii)

Data preprocessing is a process of preparing the raw data and making it suitable for a
machine learning model. It is the first and crucial step while creating a machine learning
model.

When creating a machine learning project, we do not always come across clean and formatted data.
And while doing any operation with data, it is mandatory to clean it and put it in a formatted
way. For this, we use the data preprocessing task.

Why do we need Data Preprocessing?


Real-world data generally contains noise and missing values, and may be in an unusable
format which cannot be directly used for machine learning models. Data preprocessing is a
required task for cleaning the data and making it suitable for a machine learning model,
which also increases the accuracy and efficiency of the model.

It involves below steps:

Getting the dataset

Importing libraries

Importing datasets

Finding Missing Data

Encoding Categorical Data

Splitting dataset into training and test set

Feature scaling
1) Get the Dataset
To create a machine learning model, the first thing we required is a dataset as a machine
learning model completely works on data. The collected data for a particular problem in a
proper format is known as the dataset.

Datasets may come in different formats for different purposes. For example, if we want to create a
machine learning model for a business purpose, the dataset will be different from the dataset
required for a liver patient diagnosis. So each dataset is different from another dataset. To use
the dataset in our code, we usually put it into a CSV file. However, sometimes we may also need
to use an HTML or xlsx file.

2) Importing Libraries
In order to perform data preprocessing using Python, we need to import some predefined
Python libraries. These libraries are used to perform some specific jobs.

3) Importing the Datasets


Now we need to import the datasets which we have collected for our machine learning
project. But before importing a dataset, we need to set the current directory as a working
directory

4) Handling Missing data:


The next step of data preprocessing is to handle missing data in the datasets. If our dataset
contains some missing data, then it may create a huge problem for our machine learning
model. Hence it is necessary to handle missing values present in the dataset.

Ways to handle missing data:

There are mainly two ways to handle missing data, which are:

By deleting the particular row: This is commonly used to deal with null values. We simply delete
the specific row or column which consists of null values. But this way is not very efficient, and
removing data may lead to a loss of information, which will not give accurate output.

By calculating the mean: In this way, we calculate the mean of the column or row which contains a
missing value and put it in the place of the missing value. This strategy is useful for features
which have numeric data, such as age, salary, year, etc. Here, we will use this approach.
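A compact sketch of these preprocessing steps with pandas and scikit-learn; the file name data.csv and the columns age, salary, and purchased are assumptions made only for this example.

import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("data.csv")                           # importing the dataset (assumed file)
X = df[["age", "salary"]]
y = df["purchased"]

# Handling missing data: replace NaNs with the column mean.
X = pd.DataFrame(SimpleImputer(strategy="mean").fit_transform(X), columns=X.columns)

# Splitting the dataset into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Feature scaling: fit the scaler on the training set only, then apply it to both.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)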

10.
The following is an example of Divisive Clustering.
Distance a b c d e
a 0 2 6 10 9
b 2 0 5 9 8
c 6 5 0 4 5
d 10 9 4 0 3
e 9 8 5 3 0

What is MST?? 10+5=15

Ans:
Step 1.Split whole data into 2 clusters

1. Who hates other members the most? (in Average)


• a to others: mean(2,6,10,9)=6.75 →a goes out! (Divide a into a new cluster)
• b to others: mean(2,5,9,8)=6.0
• c to others: mean(6,5,4,5)=5.0
• d to others: mean(10,9,4,3)=6.5
• e to others: mean(9,8,5,3)=6.25
2. Everyone in the old party asks himself: “In average, do I hate others in old party more than
hating the members in the new party?”
• If the answer is “Yes”, then he will also go to the new party.
      α = average distance to the old party    β = distance to the new party    α - β
b     (5+9+8)/3 = 7.33                          2                                > 0 (b also goes out!)
c     (5+4+5)/3 = 4.67                          6                                < 0
d     (9+4+3)/3 = 5.33                          10                               < 0
e     (8+5+3)/3 = 5.33                          9                                < 0
3. Everyone in the old party ask himself the same question as above again and again until

everyone’s got the answer “No”.


      α = average distance to the old party    β = distance to the new party    α - β
c     …                                         …                                < 0
d     …                                         …                                < 0
e     …                                         …                                < 0
Step 2. Choose a current cluster and split it as in Step 1.

1.Choose a current cluster


• If split the cluster with the largest number of members, then the cluster {c,d,e} will be split.
• If split the cluster with the largest diameter, then the cluster {c,d,e} will be split.
cluster diameter
{a,b} 2
{c,d,e} 5
2.Split the chosen cluster as in Step 1.
Step 3. Repeat Step 2. until each cluster contains a point (or there are k clusters)

(ii)
Building MST (Minimum Spanning Tree) is a method for constructing hierarchy of

clusters.

It starts with a tree that consists of a point p. In successive steps, look for the closest

pair of points (p,q) such that p is in the current tree but q is not. With this closest pair of

points (p,q), add q to the tree and put an edge between p and q.

The procedure of constructing hierarchy of clusters using MST would be as follows:

Construct a MST as a proximity graph


repeat

Split a cluster by breaking the inconsistent edge.

until Only singleton clusters remain


Note that the inconsistent edge is the link with the largest distance (smallest similarity).

The definition of inconsistency varies. For example, we can also use local inconsistency:

remove edges significantly larger than their neighborhood edges.
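A short SciPy sketch (added for illustration) that builds the MST for the distance matrix used in the divisive-clustering question above; breaking the heaviest (most inconsistent) edge of this tree reproduces the {a, b} / {c, d, e} split.

import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

labels = ["a", "b", "c", "d", "e"]
D = np.array([[0, 2, 6, 10, 9],
              [2, 0, 5, 9, 8],
              [6, 5, 0, 4, 5],
              [10, 9, 4, 0, 3],
              [9, 8, 5, 3, 0]])

mst = minimum_spanning_tree(D).toarray()   # matrix holding the chosen edge weights
edges = [(labels[i], labels[j], mst[i, j])
         for i in range(len(labels)) for j in range(len(labels)) if mst[i, j] > 0]
print(sorted(edges, key=lambda e: e[2]))   # the heaviest edge (b-c, weight 5) is cut first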

11. Compare and contrast closeness centrality and betweenness centrality. Explain the
significance of the betweenness centrality measure and focus on at least two applications.
7+8=15

Ans:
Comparison between Closeness Centrality and Betweenness Centrality:
• Closeness centrality measures how easily an actor (node) in a social network can interact with
all other actors. It is based on closeness, and closeness can be measured using distance.
Betweenness centrality, in contrast, measures how much control an actor (node) has over the
interaction between two other actors when it lies on the path between them.
• The shortest distances from an actor to all other actors are used to measure closeness
centrality, whereas the number of shortest paths between pairs of actors that pass through a
given actor is used to measure betweenness centrality.
• Closeness centrality can be computed only for a connected graph, but betweenness centrality
can be calculated for both connected and disconnected graphs. Both centralities can be computed
for directed and undirected graphs.
• In closeness centrality the same equation can be used for both directed and undirected graphs,
whereas in standardized betweenness centrality the equation differs for directed and undirected
graphs. For directed graphs the standardized betweenness centrality is two times that of the
undirected case.

Significance of Betweenness Centrality:

Betweenness centrality is very significant in analyzing networks. The network may be a social
network like Twitter or Facebook, or a different type of network such as a network used in cancer
diagnosis. Betweenness centrality is a widely used measure that captures a person's role in
allowing information to pass from one part of the network to the other.

A node in a social network with high betweenness is likely to be aware of what is going on in
multiple social circles. For example, consider a Twitter account. Betweenness captures how
important a node is in the flow of information from one part of the network to another. Being
"between" can have many meanings. One meaning may be that a user with high betweenness is followed
by many others who don't follow the same people as the user, which can indicate that the user is
well followed. Alternatively, the user may have fewer followers but connect them to many accounts
that are otherwise distant, which would indicate that the user acts as a bridge between parts of
the network.
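For a concrete feel (an illustrative addition, assuming the networkx library), both measures can be computed on a small made-up graph in which nodes c and d bridge two otherwise separate triangles:

import networkx as nx

G = nx.Graph([("a", "b"), ("b", "c"), ("a", "c"),
              ("c", "d"),
              ("d", "e"), ("e", "f"), ("d", "f")])

print("closeness  :", nx.closeness_centrality(G))
print("betweenness:", nx.betweenness_centrality(G))   # highest for the bridging nodes c and d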

12. Describe different types of Social Networks Analysis. Difference between Ego
network analysis and Complete network analysis. 10+5=15

ans:

Social networks are networks that depict the relations between people in the form
of a graph for different kinds of analysis. The graph used to store the relationships of people
is known as a sociogram. All the graph points and lines are stored in a matrix data
structure called a sociomatrix. The relationships can be of any kind, such as kinship,
friendship, enmity, acquaintance, colleagues, neighbors, disease transmission, etc.

Social Network Analysis (SNA) is the process of exploring or examining the social structure
by using graph theory. It is used for measuring and analyzing the structural properties of the
network. It helps to measure relationships and flows between groups, organizations, and
other connected entities. We need specialized tools to study and analyze social networks.
Basically, there are two types of social networks:
• Ego network Analysis
• Complete network Analysis
1. Ego Network Analysis
Ego network analysis is the one that finds the relationships among people. The analysis is
done for a particular sample of people chosen from the whole population; this sampling is
done randomly to analyze the relationships. The attributes involved in ego network
analysis include the size and diversity of a person's network.

This analysis is done using traditional surveys. In the surveys, people are asked with whom
they interact and what their relationship with each of those people is. It is not focused on
finding the relationship between everyone in the sample; rather, it is an effort to find the
density of the network in those samples. The hypothesis is then tested using statistical
hypothesis testing techniques.

The following functions are served by Ego Networks:


• Propagation of information efficiently.
• Sensemaking from links, For example, Social links, relationships.
• Access to resources, efficient connection path generation.
• Community detection, identification of the formation of groups.
• Analysis of the ties among individuals for social support.
2. Complete Network Analysis
Complete network analysis is the analysis used in most network analyses. It analyses the
relationships among the sample of people chosen from the large population. Subgroup analysis,
centrality measures, and equivalence analysis are based on complete network analysis. This
analysis helps the organization or the company to make decisions with the help of the
relationships it reveals. Testing the sample will show the relationships in the whole network,
since the sample is taken from a single set of domains.

(ii)

Difference between Ego network analysis and Complete network analysis:

The difference between ego and complete network analysis is that ego network analysis focuses on
collecting the relationships of the people in the sample with the outside world, whereas complete
network analysis focuses on finding the relationships among the samples themselves.

The majority of network analyses are done only for a particular domain or one organization; they
are not focused on the relationships between organizations. So many of the social network
analysis measures use only complete network analysis.

13. What is Frequent pattern mining? Describe the Classification algorithm.

7+8=15

ans:

Frequent pattern mining in data mining is the process of identifying patterns or


associations within a dataset that occur frequently. This is typically done by analyzing
large datasets to find items or sets of items that appear together frequently.

There are several different algorithms used for frequent pattern mining, including:

1. Apriori algorithm: This is one of the most commonly used algorithms for frequent pattern
mining. It uses a "bottom-up" approach to identify frequent itemsets and then generates
association rules from those itemsets.

2. ECLAT algorithm: This algorithm uses a "depth-first search" approach to identify
frequent itemsets. It is particularly efficient for datasets with a large number of items.

3. FP-growth algorithm: This algorithm uses a "compression" technique to find frequent
patterns efficiently. It is particularly efficient for datasets with a large number of
transactions.

Frequent pattern mining has many applications, such as Market Basket Analysis,
Recommender Systems, Fraud Detection, and many more.

Advantages:

1.It can find useful information which is not visible in simple data browsing
2.It can find interesting association and correlation among data items

Disadvantages:

1. It can generate a large number of patterns


2. With high dimensionality, the number of patterns can be very large, making it difficult to
interpret the results.

The increasing power of computer technology creates large amounts of data and storage.
Databases are growing rapidly; in this computerized world everything is shifting online, and data
is becoming a new currency. Data comes in different shapes and sizes and is collected in different
ways. Data mining brings many benefits: it helps us to improve a particular process, and in some
cases it leads to cost savings or revenue generation. Data mining is commonly used to search large
amounts of data for patterns and trends, and it does not stop at searching; it uses the findings to
develop actionable processes.

Data mining is the process of converting raw data into suitable patterns based
on trends.
Data mining has different types of patterns, and frequent pattern mining is one of them. This
concept was introduced for mining transaction databases. Frequent patterns are patterns (such as
items, subsequences, or substructures) that appear frequently in the database. Frequent pattern
mining is an analytical process that finds frequent patterns, associations, or causal structures
in various databases. This process aims to find the frequently occurring items in transactions.
Using frequent patterns, we can identify items that are strongly correlated with each other and
identify similar characteristics and associations among them. After frequent pattern mining we
can go further with clustering and association analysis.

Frequent pattern mining is a major concern in data mining: it plays a major role in finding
associations and correlations and discloses an intrinsic and important property of a dataset.
It can be done by using association rules with particular algorithms such as the Eclat and Apriori
algorithms. Frequent pattern mining searches for recurring relationships in a data set and also
helps to find the inherent regularities in the data.

(ii)

The Classification algorithm is a Supervised Learning technique that is used to identify the
category of new observations on the basis of training data. In Classification, a program learns
from the given dataset or observations and then classifies new observation into a number of
classes or groups. Such as, Yes or No, 0 or 1, Spam or Not Spam, cat or dog, etc. Classes
can be called as targets/labels or categories.

Unlike regression, the output variable of Classification is a category, not a value, such as
"Green or Blue", "fruit or animal", etc. Since the Classification algorithm is a Supervised
learning technique, hence it takes labeled input data, which means it contains input with the
corresponding output.

In classification algorithm, a discrete output function(y) is mapped to input variable(x).

y=f(x), where y = categorical output

The best example of an ML classification algorithm is Email Spam Detector.

The main goal of the Classification algorithm is to identify the category of a given dataset,
and these algorithms are mainly used to predict the output for the categorical data.

Classification algorithms can be better understood using the below diagram. In the below
diagram, there are two classes, class A and Class B. These classes have features that are
similar to each other and dissimilar to other classes.
The algorithm which implements the classification on a dataset is known as a classifier. There
are two types of Classifications:

Binary Classifier: If the classification problem has only two possible outcomes, then
it is called as Binary Classifier.
Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.

Multi-class Classifier: If a classification problem has more than two outcomes, then
it is called as Multi-class Classifier.
Example: Classifications of types of crops, Classification of types of music.

14. Difference between Lazy Learners and Eager Learners. Describe K-Nearest Neighbor(KNN)
Algorithm for Machine Learning.

10+5=15

Ans:

Learners in Classification Problems:


In the classification problems, there are two types of learners:

1. Lazy Learners: Lazy Learner firstly stores the training dataset and wait until it
receives the test dataset. In Lazy learner case, classification is done on the basis of the
most related data stored in the training dataset. It takes less time in training but more
time for predictions.
Example: K-NN algorithm, Case-based reasoning

2. Eager Learners:Eager Learners develop a classification model based on a training


dataset before receiving a test dataset. Opposite to Lazy learners, Eager Learner takes
more time in learning, and less time in prediction. Example: Decision Trees, Naïve
Bayes, ANN.
Lazy Versus Eager Learning
• Radial Basis Function networks are eager; all the other instance-based learners are lazy.
• Lazy methods use less time in training and more during prediction.
• Lazy methods may consider the query instance when deciding how to generalize.
• Lazy learners have the advantage of a richer hypothesis space because they use many different
local functions to form their implicit global approximation.
• Eager learners, even RBFs, cannot customize themselves to unknown future query instances.

(ii)

K-Nearest Neighbor(KNN) Algorithm for Machine Learning

K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on


Supervised Learning technique.

K-NN algorithm assumes the similarity between the new case/data and available
cases and put the new case into the category that is most similar to the available
categories.

K-NN algorithm stores all the available data and classifies a new data point based
on the similarity. This means when new data appears then it can be easily classified
into a well suite category by using K- NN algorithm.

K-NN algorithm can be used for Regression as well as for Classification but mostly it
is used for the Classification problems.

K-NN is a non-parametric algorithm, which means it does not make any


assumption on underlying data.

It is also called a lazy learner algorithm because it does not learn from the training
set immediately instead it stores the dataset and at the time of classification, it
performs an action on the dataset.

KNN algorithm at the training phase just stores the dataset and when it gets new
data, then it classifies that data into a category that is much similar to the new data.

Example: Suppose, we have an image of a creature that looks similar to cat and
dog, but we want to know either it is a cat or dog. So for this identification, we can
use the KNN algorithm, as it works on a similarity measure. Our KNN model will find
the similar features of the new data set to the cats and dogs images and based on
the most similar features it will put it in either cat or dog category.
Why do we need a K- NN Algorithm?
Suppose there are two categories, Category A and Category B, and we have a new data point x1.
In which of these categories will this data point lie? To solve this type of problem, we need a
K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a
particular dataset.

How does K-NN work?


The K-NN working can be explained on the basis of the below algorithm:

Step-1: Select the number K of the neighbors

Step-2: Calculate the Euclidean distance of K number of neighbors

Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.

Step-4: Among these k neighbors, count the number of the data points in each
category.

Step-5: Assign the new data points to that category for which the number of the
neighbor is maximum.

Step-6: Our model is ready.
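These steps map directly onto scikit-learn's KNeighborsClassifier; the tiny two-class dataset below is invented purely to show the calls.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 7], [8, 6]])
y = np.array(["A", "A", "A", "B", "B", "B"])

# Step 1: choose K; Steps 2-5 (distances, neighbours, vote, assignment) happen inside predict().
knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(X, y)                        # "training" just stores the data (lazy learner)

print(knn.predict([[3, 4]]))         # nearest neighbours are mostly class "A"
print(knn.predict([[7, 6]]))         # nearest neighbours are mostly class "B"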

16. What is Sequential pattern mining? What is Linear Regression??


10+5=15

ans:

Sequential pattern mining is the mining of frequently occurring series of events or
subsequences as patterns. An instance of a sequential pattern is that customers who purchase a
Canon digital camera are likely to purchase an HP color printer within a month.

For retail information, sequential patterns are beneficial for shelf placement and promotions.
This industry, as well as telecommunications and other businesses, can also use sequential
patterns for targeted marketing, customer retention, and several other tasks.

There are several areas in which sequential patterns can be used such as Web access
pattern analysis, weather prediction, production processes, and web intrusion detection.

Given a set of sequences, where each sequence consists of a list of events (or elements) and
each event consists of a set of items, and given a user-specified minimum support threshold
min_sup, sequential pattern mining discovers all frequent subsequences, i.e., the subsequences
whose occurrence frequency in the set of sequences is no less than min_sup.

Let I = {I1, I2, ..., Ip} be the set of all items. An itemset is a nonempty set of items. A
sequence is an ordered list of events. A sequence s is denoted <e1, e2, e3, ..., el>, where
event e1 occurs before e2, which occurs before e3, and so on. Event ej is also known as an
element of s.

In the case of customer purchase information, an event corresponds to a shopping trip in which a
customer purchases items at a specific store. The event is an itemset, i.e., an unordered list
of items that the customer purchased during the trip. The itemset (or event) is denoted
(x1 x2 ··· xq), where xk is an item.

An item can appear at most once in an event of a sequence, but it can appear multiple times in
different events of a sequence. The number of instances of items in a sequence is known as the
length of the sequence. A sequence with length l is called an l-sequence.

A sequence database, S, is a set of tuples (SID, s), where SID is a sequence ID and s is a
sequence. For instance, S may include one sequence for each customer of the store. A tuple
(SID, s) is said to contain a sequence α if α is a subsequence of s.

This formulation of sequential pattern mining is an abstraction of customer-shopping sequence
analysis, and scalable techniques exist for mining sequential patterns in such records. However,
there are several sequential pattern mining applications that cannot be covered by this
formulation. For instance, when analyzing Web clickstream sequences, gaps between clicks become
essential if one needs to predict what the next click will be.

In DNA sequence analysis, approximate patterns become helpful because DNA sequences
can include (symbol) insertions, deletions, and mutations. Such diverse requirements can
be considered as constraint relaxation or application.

(ii)

Linear regression is one of the easiest and most popular Machine Learning algorithms. It is a
statistical method that is used for predictive analysis. Linear regression makes predictions for
continuous/real or numeric variables such as sales, salary, age, product price, etc.

The linear regression algorithm shows a linear relationship between a dependent (y) variable and
one or more independent (x) variables, hence the name linear regression. Since linear regression
shows a linear relationship, it finds how the value of the dependent variable changes according to
the value of the independent variable.

The linear regression model provides a sloped straight line representing the relationship between
the variables.
Mathematically, we can represent a linear regression as:

y= a0+a1x+ ε

Here,

Y= Dependent Variable (Target Variable)


X= Independent Variable (predictor Variable)
a0= intercept of the line (Gives an additional degree of freedom)
a1 = Linear regression coefficient (scale factor to each input value).
ε = random error

The values for x and y variables are training datasets for Linear Regression model
representation.

Types of Linear Regression


Linear regression can be further divided into two types of the algorithm:

Simple Linear Regression:


If a single independent variable is used to predict the value of a numerical dependent
variable, then such a Linear Regression algorithm is called Simple Linear Regression.

Multiple Linear regression:


If more than one independent variable is used to predict the value of a numerical
dependent variable, then such a Linear Regression algorithm is called Multiple Linear
Regression.

Linear Regression Line


A linear line showing the relationship between the dependent and independent variables is
called a regression line. A regression line can show two types of relationship:

Positive Linear Relationship:


If the dependent variable increases on the Y-axis and independent variable increases
on X-axis, then such a relationship is termed as a Positive linear relationship.

Negative Linear Relationship:


If the dependent variable decreases on the Y-axis and independent variable increases
on the X-axis, then such a relationship is called a negative linear relationship.
Finding the best fit line:
When working with linear regression, our main goal is to find the best fit line that means the
error between predicted values and actual values should be minimized. The best fit line will
have the least error.

The different values for weights or the coefficient of lines (a0, a1) gives a different line of
regression, so we need to calculate the best values for a0 and a1 to find the best fit line, so to
calculate this we use cost function.

Cost function-

The different values for weights or coefficient of lines (a0, a1) gives the different line
of regression, and the cost function is used to estimate the values of the coefficient
for the best fit line.

Cost function optimizes the regression coefficients or weights. It measures how a


linear regression model is performing.

We can use the cost function to find the accuracy of the mapping function, which
maps the input variable to the output variable. This mapping function is also known
as Hypothesis function.
For Linear Regression, we use the Mean Squared Error (MSE) cost function, which is the
average of the squared errors between the predicted values and the actual values. For the above
linear equation, MSE can be calculated as:

MSE = (1/N) * Σ (Yi - (a1xi + a0))²

Where,

N = total number of observations
Yi = actual value
(a1xi + a0) = predicted value

Residuals: The distance between the actual value and the predicted value is called the residual.
If the observed points are far from the regression line, the residuals will be high, and so the
cost function will be high. If the scatter points are close to the regression line, the residuals
will be small, and hence the cost function will be small.
Gradient Descent:

Gradient descent is used to minimize the MSE by calculating the gradient of the
cost function.

A regression model uses gradient descent to update the coefficients of the line by
reducing the cost function.

It is done by a random selection of values of coefficient and then iteratively update


the values to reach the minimum cost function.
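A minimal scikit-learn sketch of the ideas above (the experience/salary numbers are made up): LinearRegression estimates a0 and a1, and the MSE of the fitted line is the cost that gradient descent would minimise.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

x = np.array([[1], [2], [3], [4], [5], [6]])     # years of experience
y = np.array([30, 35, 42, 48, 55, 61])           # salary, roughly linear with some noise

model = LinearRegression().fit(x, y)             # estimates intercept a0 and coefficient a1
pred = model.predict(x)

print("a0 =", round(model.intercept_, 2), " a1 =", round(model.coef_[0], 2))
print("MSE =", round(mean_squared_error(y, pred), 3))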

17. Describe K-Medoids clustering. Difference between K-Means and K-Medoids. 10+5=15
Ans:

K-Medoids and K-Means are two types of clustering mechanisms in partition clustering. Clustering
is the process of breaking down an abstract group of data points/objects into classes of similar
objects, such that all the objects in one cluster have similar traits: a group of n objects is
broken down into k clusters based on their similarities.

Two statisticians, Leonard Kaufman, and Peter J. Rousseeuw came up with this method. This
tutorial explains what K-Medoids do, their applications, and the difference between K-Means
and K-Medoids.

K-medoids is an unsupervised method with unlabelled data to be clustered. It is an


improvised version of the K-Means algorithm mainly designed to deal with outlier data
sensitivity. Compared to other partitioning algorithms, the algorithm is simple, fast, and easy
to implement.

The partitioning will be carried out such that:

1. Each cluster must have at least one object.

2. An object must belong to only one cluster.

Here is a small recap on K-Means clustering:


In the K-Means algorithm, given the value of k and unlabelled data:

1. Choose k number of random points (Data point from the data set or some other
points). These points are also called "Centroids" or "Means".

2. Assign all the data points in the data set to the closest centroid by applying any
distance formula like Euclidian distance, Manhattan distance, etc.
3. Now, choose new centroids by calculating the mean of all the data points in the
clusters and goto step 2

4. Continue step 3 until no data point changes classification between two iterations.

The problem with the K-Means algorithm is that it does not handle outlier data well. An outlier is
a point very different from the rest of the points. Outlier data points tend to end up in their
own cluster and pull other clusters toward them, shifting the mean of a cluster substantially.
Hence, K-Means clustering is highly affected by outlier data.

K-Medoids:
Medoid: A Medoid is a point in the cluster from which the sum of distances to other data
points is minimal.

(or)

A Medoid is a point in the cluster from which dissimilarities with all the other points in the
clusters are minimal.

Instead of centroids as reference points in K-Means algorithms, the K-Medoids algorithm


takes a Medoid as a reference point.

There are three types of algorithms for K-Medoids Clustering:

1. PAM (Partitioning Around Medoids)

2. CLARA (Clustering Large Applications)

3. CLARANS (Clustering Large Applications based upon Randomized Search)

PAM is the most powerful algorithm of the three algorithms but has the disadvantage of
time complexity. The following K-Medoids are performed using PAM. In the further parts,
we'll see what CLARA and CLARANS are.

Algorithm:
Given the value of k and unlabelled data:

1. Choose k number of random points from the data and assign these k points to k
number of clusters. These are the initial medoids.

2. For all the remaining data points, calculate the distance from each medoid and
assign it to the cluster with the nearest medoid.

3. Calculate the total cost (Sum of all the distances from all the data points to the
medoids)
4. Select a random point as the new medoid and swap it with the previous medoid.
Repeat 2 and 3 steps.

5. If the total cost of the new medoid is less than that of the previous medoid, make
the new medoid permanent and repeat step 4.

6. If the total cost of the new medoid is greater than the cost of the previous medoid,
undo the swap and repeat step 4.

7. The Repetitions have to continue until no change is encountered with new medoids
to classify data points.
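A hedged sketch of PAM in practice, assuming the optional scikit-learn-extra package (which provides a KMedoids estimator) is installed; the data, including the deliberate outlier at (25, 25), is invented for illustration only.

import numpy as np
from sklearn_extra.cluster import KMedoids    # from the scikit-learn-extra package (assumption)

X = np.array([[2, 6], [3, 4], [3, 8], [4, 7],
              [6, 2], [6, 4], [7, 3], [7, 4],
              [25, 25]])                      # an outlier that would drag a K-Means centroid

km = KMedoids(n_clusters=2, metric="manhattan", method="pam", random_state=0).fit(X)
print("labels :", km.labels_)
print("medoids:", X[km.medoid_indices_])      # medoids are always actual data points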

(ii)

Common to both K-Means and K-Medoids: both are types of partition clustering, both are
unsupervised iterative algorithms, both have to deal with unlabelled data, both group n objects
into k clusters based on similar traits where k is pre-defined, and both take unlabelled data and
the value of k as inputs.

K-Means | K-Medoids
Metric of similarity: Euclidean distance | Metric of similarity: Manhattan distance
Clustering is done based on distance from centroids. | Clustering is done based on distance from medoids.
A centroid can be a data point or some other point in the cluster. | A medoid is always a data point in the cluster.
Cannot cope with outlier data. | Can manage outlier data too.
Sometimes, outlier sensitivity can turn out to be useful. | Tends to ignore meaningful clusters in outlier data.


18. Describe Regression vs. Classification. ExplainLinear Regression vs Logistic
Regression.

10+5 =15

ans:

Regression and Classification algorithms are Supervised Learning algorithms. Both the
algorithms are used for prediction in Machine learning and work with the labeled datasets.
But the difference between both is how they are used for different machine learning
problems.

The main difference between Regression and Classification algorithms that Regression
algorithms are used to predict the continuous values such as price, salary, age, etc. and
Classification algorithms are used to predict/Classify the discrete values such as Male or
Female, True or False, Spam or Not Spam, etc.

Consider the below diagram:

Classification:
Classification is a process of finding a function which helps in dividing the dataset into
classes based on different parameters. In Classification, a computer program is trained on
the training dataset and, based on that training, it categorizes the data into different classes.
The task of classification is to find a mapping function to map the input (x) to the discrete
output (y).

Example: The best example to understand the Classification problem is Email Spam
Detection. The model is trained on the basis of millions of emails on different parameters,
and whenever it receives a new email, it identifies whether the email is spam or not. If the
email is spam, then it is moved to the Spam folder.

Types of ML Classification Algorithms:

Classification Algorithms can be further divided into the following types:

Logistic Regression

K-Nearest Neighbours

Support Vector Machines

Kernel SVM

Naïve Bayes

Decision Tree Classification

Random Forest Classification

Regression:
Regression is a process of finding the correlations between dependent and independent
variables. It helps in predicting the continuous variables such as prediction of Market
Trends, prediction of House prices, etc.

The task of the Regression algorithm is to find the mapping function to map the input
variable(x) to the continuous output variable(y).

Example: Suppose we want to do weather forecasting, so for this, we will use the Regression
algorithm. In weather prediction, the model is trained on the past data, and once the training
is completed, it can easily predict the weather for future days.

Types of Regression Algorithm:

Simple Linear Regression

Multiple Linear Regression

Polynomial Regression

Support Vector Regression


Decision Tree Regression

Random Forest Regression

Difference between Regression and Classification


Regression Algorithm | Classification Algorithm
In Regression, the output variable must be of continuous nature or real value. | In Classification, the output variable must be a discrete value.
The task of the regression algorithm is to map the input value (x) with the continuous output variable (y). | The task of the classification algorithm is to map the input value (x) with the discrete output variable (y).
Regression algorithms are used with continuous data. | Classification algorithms are used with discrete data.
In Regression, we try to find the best fit line, which can predict the output more accurately. | In Classification, we try to find the decision boundary, which can divide the dataset into different classes.
Regression algorithms can be used to solve regression problems such as weather prediction, house price prediction, etc. | Classification algorithms can be used to solve classification problems such as identification of spam emails, speech recognition, identification of cancer cells, etc.
The regression algorithm can be further divided into Linear and Non-linear Regression. | The classification algorithms can be divided into Binary Classifier and Multi-class Classifier.
(ii)

Linear Regression and Logistic Regression are two well-known Machine Learning algorithms that come under the supervised learning technique. Since both algorithms are supervised in nature, they use labeled datasets to make predictions. The main difference between them is how they are used: Linear Regression is used for solving regression problems, whereas Logistic Regression is used for solving classification problems. The description of both algorithms is given below, along with a difference table.

Linear Regression:

Linear Regression is one of the simplest Machine Learning algorithms. It comes under the Supervised Learning technique and is used for solving regression problems.

It is used for predicting a continuous dependent variable with the help of independent variables.

The goal of Linear Regression is to find the best-fit line that can accurately predict the output for the continuous dependent variable.

If a single independent variable is used for prediction, it is called Simple Linear Regression; if there is more than one independent variable, it is called Multiple Linear Regression.

By finding the best-fit line, the algorithm establishes the relationship between the dependent variable and the independent variables, and this relationship should be linear in nature. The output of Linear Regression should only be continuous values such as price, age, salary, etc. For example, with salary as the dependent variable (on the Y-axis) and experience as the independent variable (on the X-axis), the regression line can be written as:

y = a0 + a1x + ε

where a0 is the intercept, a1 is the coefficient of x, and ε is the error term.
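A small sketch of how a0 and a1 can be estimated by least squares with NumPy is shown below; the experience/salary numbers are invented for illustration.

```python
# Fitting the line y = a0 + a1*x by least squares (illustrative values).
import numpy as np

x = np.array([1, 2, 3, 4, 5])        # experience (years)
y = np.array([30, 35, 42, 48, 55])   # salary (in thousands)

a1, a0 = np.polyfit(x, y, deg=1)     # np.polyfit returns slope first, then intercept
print(f"y = {a0:.2f} + {a1:.2f} * x")
print(a0 + a1 * 6)                   # predicted salary at 6 years of experience
```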

Logistic Regression:

Logistic Regression is one of the most popular Machine Learning algorithms. It comes under the Supervised Learning technique.

It can be used for classification as well as regression problems, but it is mainly used for classification problems.

Logistic Regression is used to predict a categorical dependent variable with the help of independent variables.

The output of a Logistic Regression problem can only be between 0 and 1.

Logistic Regression can be used where the probability of one of two classes is required, such as whether it will rain today or not, either 0 or 1, true or false, etc.

Logistic Regression is based on the concept of Maximum Likelihood Estimation. According to this estimation, the observed data should be the most probable.

In Logistic Regression, we pass the weighted sum of inputs through an activation function that maps values to between 0 and 1. This activation function is known as the sigmoid function, and the curve obtained is called the sigmoid curve or S-curve.

The equation for logistic regression can be written as:

log[y / (1 - y)] = b0 + b1x1 + b2x2 + ... + bnxn

where y is the predicted probability of the positive class, x1, ..., xn are the independent variables, and b0, ..., bn are the coefficients.
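The sketch below (plain Python, with arbitrary illustrative coefficients b0 and b1) shows how the sigmoid maps the weighted sum of inputs into the (0, 1) range, which is then thresholded to obtain a class.

```python
# The sigmoid (logistic) function maps any weighted sum of inputs into (0, 1).
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Arbitrary illustrative coefficients b0, b1 and one input x
b0, b1 = -4.0, 1.5
x = 3.0
z = b0 + b1 * x          # weighted sum (the linear part)
p = sigmoid(z)           # probability of the positive class
print(p)                 # about 0.62; predict class 1 if p >= 0.5, else class 0
```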

Difference between Linear Regression and Logistic Regression:

• Linear Regression is used to predict the continuous dependent variable using a given set of independent variables. Logistic Regression is used to predict the categorical dependent variable using a given set of independent variables.

• Linear Regression is used for solving regression problems. Logistic Regression is used for solving classification problems.

• In Linear Regression, we predict the values of continuous variables. In Logistic Regression, we predict the values of categorical variables.

• In Linear Regression, we find the best-fit line, by which we can easily predict the output. In Logistic Regression, we find the S-curve, by which we can classify the samples.

• In Linear Regression, the least squares estimation method is used to estimate the coefficients. In Logistic Regression, the maximum likelihood estimation method is used.

• The output of Linear Regression must be a continuous value, such as price, age, etc. The output of Logistic Regression must be a categorical value, such as 0 or 1, Yes or No, etc.

• In Linear Regression, the relationship between the dependent variable and the independent variables must be linear. In Logistic Regression, a linear relationship between the dependent and independent variables is not required.

• In Linear Regression, there may be collinearity between the independent variables. In Logistic Regression, there should not be collinearity between the independent variables.

19. What is Stream Processing? Why is it important? 10+5=15


ans:

What is Stream Processing?


Stream Processing is the act of taking action on a set of data as it is being created.
Historically, data professionals used the term "real-time processing" to refer to data that
was processed as frequently as a given use case required. However, with the introduction
and adoption of stream processing technologies and frameworks, as well as lower RAM
prices, "Stream Processing" has become a more specific term.
In Stream Processing, multiple jobs are frequently run on an incoming sequence of data
(the "data stream"), either serially, in parallel, or both. This workflow is known as a Stream
Processing Pipeline, and it includes the generation of stream data, data processing, and data
delivery to a final destination.
Stream Processing performs actions on data such as Aggregations (e.g., sum, mean, standard
deviation), Analytics (e.g., predicting a future event based on patterns in the data),
Transformations (e.g., changing a number into a date format), Enrichment (e.g., combining the
data point with other data sources to create more context and meaning), and Ingestion
(e.g., inserting the data into a database).
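As a minimal sketch of such per-event processing, the following Python example (a stand-in for a real streaming framework, with made-up sensor readings) computes a running aggregation, a sliding-window mean, as each event arrives.

```python
# Sketch of stream aggregation: a running mean over a sliding window of the
# last N events. Plain Python stands in for a real streaming framework.
from collections import deque

def sensor_stream():
    # Stand-in for an unbounded source (IoT sensor, payment events, logs, ...)
    for reading in [21.0, 21.5, 22.1, 35.0, 22.3, 22.4]:
        yield reading

def windowed_mean(stream, window_size=3):
    window = deque(maxlen=window_size)
    for value in stream:
        window.append(value)
        yield sum(window) / len(window)   # aggregation emitted per event

for avg in windowed_mean(sensor_stream()):
    print(round(avg, 2))
```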

How Does Stream Processing Work?


Data from IoT sensors, Payment Processing Systems, and Server and Application
Logs are all examples of data that can benefit from Stream Processing. Publisher/subscriber
(also known as pub/sub) and source/sink are two prevalent paradigms. A publisher or source
generates data and events, which are delivered to a Stream Processing application, where
they are enriched, tested against fraud detection algorithms, or otherwise transformed before
being sent to a subscriber or sink. On the technical side, Apache Kafka, Hadoop, TCP
connections, and in-memory Data Grids are some of the most common sources and sinks.
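The following is a minimal sketch of the source → processing → sink flow in plain Python; in practice the source and sink would be systems such as Apache Kafka topics, and the fraud rule here is a made-up placeholder.

```python
# Sketch of the source -> processing -> sink flow. In production the source and
# sink would be systems such as Apache Kafka topics; here they are plain Python.

def source():
    # Publisher: emits payment events (amounts are made-up examples)
    for event in [{"card": "A", "amount": 25}, {"card": "B", "amount": 9400}]:
        yield event

def process(event):
    # Processing stage: enrich the event with a naive fraud flag (illustrative rule)
    event["suspected_fraud"] = event["amount"] > 5000
    return event

def sink(event):
    # Subscriber: here we just print; a real sink might be a database or alert queue
    print(event)

for e in source():
    sink(process(e))
```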

Why do you need a Stream Processing Architecture?


Big Data demonstrated the usefulness of insights obtained from data processing. Not all of
these insights are created equal: some insights are most valuable just after they occur, and
their value fades rapidly with time. Stream Processing makes such scenarios possible by
providing insights faster, frequently within milliseconds to seconds of the trigger.
Some of the secondary reasons for employing Stream Processing are listed below.
Multiple Streams
Some data comes in the form of an unending stream of events. To do batch processing, you
must first store the data, pause data collection for a period of time, and then process it.
You then have to worry about running the next batch and aggregating results across numerous
batches. Streaming, on the other hand, easily and naturally accommodates never-ending data
streams. Patterns can be detected, results can be inspected, several levels of focus can be
examined, and data from multiple streams can be viewed simultaneously.
Time-series data and spotting trends over time are obvious fits for stream processing.
Suppose, for example, you are trying to determine the length of a web session in a
never-ending stream (this is an example of trying to detect a sequence). It is difficult to
accomplish with batches, because some sessions will be split across two batches. Stream
processing can readily handle this, as the sketch below illustrates.
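Here is a minimal sketch of such session detection on a stream: a session for a user is closed when the gap between consecutive events exceeds an inactivity timeout. The timestamps and the 30-minute timeout are invented for illustration.

```python
# Sketch of session detection on a click stream: a session ends when the gap
# between consecutive events from the same user exceeds a timeout (here 30 min).
SESSION_TIMEOUT = 30 * 60  # seconds

last_seen = {}      # user -> timestamp of their previous event
session_start = {}  # user -> timestamp when the current session began

def on_event(user, ts):
    if user in last_seen and ts - last_seen[user] > SESSION_TIMEOUT:
        length = last_seen[user] - session_start[user]
        print(f"user {user}: session ended, length {length} s")
        session_start[user] = ts
    session_start.setdefault(user, ts)
    last_seen[user] = ts

# Made-up click stream: (user, unix timestamp)
for user, ts in [("u1", 0), ("u1", 600), ("u1", 4000), ("u1", 4100)]:
    on_event(user, ts)
```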
When you take a step back and think about it, time-series data is the most common form of
continuous data: traffic sensors, health sensors, transaction logs, activity logs, and so on.
Almost all IoT data is in the form of a time series. As a result, using a programming model
that fits it naturally is logical.
Processing Time and Storage
Batch processing lets data accumulate and then attempts to process it all at once, whereas
stream processing processes data as it arrives, spreading the processing out over time. As a
result, stream processing typically requires far less hardware than batch processing. Stream
processing also allows for approximate query processing through systematic load reduction,
so it is well suited to applications where only approximate results are required.
Sometimes data is so large that it is impossible to store it all. Stream processing lets you
handle a firehose of data while retaining only the most important information.
Accessibility
Finally, a lot of streaming data is available (for example, customer transactions, activities,
and website visits), and it will continue to grow as IoT use cases become more prevalent
(all kinds of sensors). Streaming is a far more natural way to think about and program those
scenarios.
What are the Key Use Cases of Stream Processing?
In most use scenarios, event data is generated as a result of some activity, and some action
should be taken right away. The following are some examples of real-time Stream
Processing applications:
• Fraud and Anomaly Detection in real time. Thanks to fraud and anomaly
detection powered by Stream Processing, one of the world's leading credit card
companies was able to cut fraud write-downs by $800 million per year. Delays in
credit card processing are inconvenient for both the end customer and the store
attempting to process the card (and any other customers in line). Credit card
organizations used to run their time-consuming fraud detection operations in batches
after each transaction. With Stream Processing, businesses can run extensive
algorithms to spot and stop fraudulent payments as soon as you swipe your card, as
well as trigger alarms for odd charges that require further investigation, without
making their (non-fraudulent) customers wait.
• Edge analytics for the Internet of Things (IoT). Stream Processing is used by
organizations in manufacturing, oil and gas, and transportation, as well as those
designing smart cities and smart buildings, to keep up with data from billions of
"things." One example of IoT data analysis is detecting abnormalities in manufacturing
that signal a problem that needs to be corrected in order to improve operations and
increase yields. With real-time Stream Processing, a manufacturer can see whether a
production line is producing too many anomalies as they happen, rather than
waiting until the end of the shift to discover a whole defective batch. By pausing the
line for rapid repairs, they can save a lot of money and avoid a lot of waste.
• Personalization, Marketing, and Advertising in real-time. Companies may
provide customized, contextual experiences for their customers via real-time Stream
Processing. This could be a discount on something you put in your shopping basket
but didn’t buy right away, a referral to connect with a newly registered friend on a
social network, or an advertisement for a product.
Before Stream Processing: A Batched At-rest Data Framework
This paradigm is turned on its head with Stream Processing: application logic, analytics, and
queries exist in real-time, and data flows through them in real-time. A Stream Processing
program reacts to an event received from the stream by triggering an action, updating an
aggregate or other statistic, or “remembering” the event for future reference.
Streaming Computations can process numerous data streams at the same time, and each
computation over the event data stream can result in the creation of new event data
streams.
20. What is Data Aggregation? Describe regression and the correlation coefficient.

21. What is an association rule? Describe with examples.

22. Describe the Apriori algorithm with a proper example.

23. Describe the FP-Growth Algorithm. Describe the Partitioning algorithm.

24. Describe the types of Hierarchical algorithms.
25. What is Time series analysis?
26. What is Data cleaning? Describe the process of Data preprocessing.
27. What is data selection? What is Bayes' theorem?
28. What is a Spanning Tree? Describe any algorithm for finding an MST.
29. What is Regression? Describe a data warehouse.
30. Describe different types of correlation coefficients. What is data integration?
