Studentdata 1
Answer:
• Apriori
• Eclat
• FP-Growth Algorithm
We will understand these algorithms in later chapters.
Here the If element is called the antecedent, and the Then statement is called the consequent. These types of
relationships, where we can find some association or relation between two items, are known as
single cardinality. Association rule mining is all about creating rules, and as the number of items increases,
the cardinality also increases. So, to measure the associations between thousands of data items, there are
several metrics. These metrics are given below:
• Support
• Confidence
• Lift
Let's understand each of them:
Support
Support is the frequency of A, or how frequently an item appears in the dataset. It is defined as the
fraction of the transactions T that contain the itemset X. For an itemset X and transactions T, it can be written as:
Support(X) = (Number of transactions containing X) / (Total number of transactions T)
Confidence
Confidence indicates how often the rule has been found to be true, i.e., how often the items X and Y
occur together in the dataset given that X occurs. It is the ratio of the number of transactions that
contain both X and Y to the number of transactions that contain X:
Confidence(X → Y) = Support(X ∪ Y) / Support(X)
Lift
Lift measures the strength of a rule and is defined by the formula below:
Lift(X → Y) = Support(X ∪ Y) / (Support(X) × Support(Y))
It is the ratio of the observed support to the support expected if X and Y were independent of
each other. It has three possible ranges of values: Lift = 1 means X and Y are independent; Lift > 1 means
X and Y are positively correlated, so Y is likely to be bought when X is bought; Lift < 1 means X and Y are
negatively correlated.
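As a rough illustration (not part of the original question), the three metrics can be computed directly from a list of transactions. The sketch below assumes Python; the four transactions and item names are made up purely for illustration.

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter", "bread"},
    {"milk", "cola"},
]

def support(itemset, transactions):
    # Fraction of transactions that contain every item in `itemset`.
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(x, y, transactions):
    # Support of X union Y divided by support of X.
    return support(x | y, transactions) / support(x, transactions)

def lift(x, y, transactions):
    # Observed support of X union Y over the support expected under independence.
    return support(x | y, transactions) / (support(x, transactions) * support(y, transactions))

x, y = {"bread"}, {"milk"}
print(support(x | y, transactions))    # 0.5
print(confidence(x, y, transactions))  # 0.666...
print(lift(x, y, transactions))        # 0.888...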
(ii)
"outliers" refer to the data points that exist outside of what is to be expected. The major thing
about outliers is what you do with them.
Any unwanted error in a previously measured variable, or any variance in that variable, is called
noise. Before finding the outliers present in a data set, it is recommended to first remove the noise.
Types of Outliers
Global Outliers
Global outliers are also called point outliers. They are the simplest form of outliers. When a data
point deviates from all the rest of the data points in a given data set, it is known as a global outlier.
In most cases, outlier detection procedures are aimed at determining global outliers. In the
accompanying figure, the green data point is the global outlier.
Collective Outliers
In a given data set, when a group of data points deviates from the rest of the data set, the group is
called a collective outlier. Here, the individual data objects may not be outliers on their own, but
when considered as a whole, they behave as outliers. To identify these types of outliers, you need
background information about the relationship between the behaviors of the different data objects.
For example, in an intrusion detection system, a denial-of-service (DoS) packet sent from one system
to another is taken as normal behavior. However, if this happens across various computers
simultaneously, it is considered abnormal behavior, and as a whole the packets are called a collective
outlier. In the accompanying figure, the green data points as a whole represent the collective outlier.
Contextual Outliers
As the name suggests, "contextual" means the outlier is defined within a context. For example, in
speech recognition, a single burst of background noise is a contextual outlier. Contextual outliers are
also known as conditional outliers. These outliers occur when a data object deviates from the other
data points because of a specific condition in the data set. As we know, data objects have two types
of attributes: contextual attributes and behavioral attributes. Contextual outlier analysis enables users
to examine outliers in different contexts and conditions, which can be useful in various applications.
For example, a temperature reading of 45 degrees Celsius may behave as an outlier in the rainy
season, but it will behave like a normal data point in the context of the summer season. In the given
diagram, a green dot representing a low temperature value in June is a contextual outlier, since the
same value in December is not an outlier.
Outliers Analysis
Outliers are often discarded when data mining is applied, but outlier analysis is still used in many
applications such as fraud detection and medicine. This is because events that occur rarely can carry
much more significant information than events that occur regularly.
Other applications where outlier detection plays a vital role are given below.
Any unusual response that occurs due to medical treatment can be analyzed through outlier
analysis in data mining.
The process in which the behavior of outliers in a dataset is identified is called outlier analysis. Also
known as "outlier mining", this process is regarded as a significant task of data mining.
3. Out of 4000 transactions, 400 contain Biscuits and 600 contain Chocolate, and 200 of these
transactions include both Biscuits and Chocolate. Using this data, find the support, confidence,
and lift. What is data aggregation? 3+2=5
ans:
Support
Support refers to the default popularity of a product. You find the support as the quotient of the
number of transactions comprising that product divided by the total number of transactions.
Hence, we get
Support(Biscuits) = 400/4000 = 10 percent.
Confidence
Confidence refers to the likelihood that customers bought both biscuits and chocolates together.
To get the confidence, divide the number of transactions that comprise both biscuits and chocolates
by the number of transactions that contain biscuits. Hence,
Confidence(Biscuits → Chocolates) = 200/400
= 50 percent.
It means that 50 percent of the customers who bought biscuits also bought chocolates.
Lift
In the above example, lift refers to the increase in the ratio of the sale of chocolates when you sell
biscuits. Using the formula given earlier,
Lift(Biscuits → Chocolates) = Support(Biscuits ∪ Chocolates) / (Support(Biscuits) × Support(Chocolates))
= 0.05 / (0.10 × 0.15) ≈ 3.33
It means that people are about 3.33 times more likely to buy biscuits and chocolates together than
would be expected if the two purchases were independent. If the lift value is below one, it indicates
that people are unlikely to buy both items together; the larger the value, the better the combination.
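The arithmetic above can be double-checked with a few lines of Python; the figures are taken from the question, and the lift formula follows the definition given earlier.

total = 4000
biscuits = 400
chocolates = 600
both = 200

support_biscuits = biscuits / total        # 0.10 -> 10 percent
support_chocolates = chocolates / total    # 0.15 -> 15 percent
support_both = both / total                # 0.05 -> 5 percent
confidence = both / biscuits               # 0.50 -> 50 percent
lift = support_both / (support_biscuits * support_chocolates)  # ~3.33

print(support_biscuits, confidence, round(lift, 2))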
Data aggregation refers to a process of collecting information from different sources and presenting it
in a summarized format so that business analysts can perform statistical analyses of business schemes.
The collected information may be gathered from various data sources and summarized into a draft for
data analysis. This is a major step for any business organization because the accuracy of insights from
data analysis depends heavily on the quality of the data used. It is necessary to collect quality data in
large amounts so that the analysis produces relevant outcomes. Data aggregation plays a vital role in
finance, product, operations, and marketing strategies in any business organization. Aggregated data
is kept in the data warehouse, which enables one to solve various issues and answer queries over the
data sets.
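As a hedged illustration of aggregation in practice, the sketch below assumes Python with pandas and a made-up sales table; the column names and figures are not from the original text.

import pandas as pd

sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "South"],
    "product": ["A", "B", "A", "A", "B"],
    "revenue": [120.0, 80.0, 200.0, 150.0, 90.0],
})

# Summarize (aggregate) the raw records: sum, average, and count per region.
summary = sales.groupby("region")["revenue"].agg(["sum", "mean", "count"])
print(summary)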
4.
ans:
Clustering is the task of dividing the population or data points into a number of groups such
that data points in the same group are more similar to each other than to the data points in
other groups. It is basically a collection of objects grouped on the basis of similarity and
dissimilarity between them.
For example, the data points in the graph below that are clustered together can be classified into
one single group. We can distinguish the clusters, and we can identify that there are 3 clusters in
the picture below.
A cluster is a group of objects that belong to the same class. In other words, similar objects
are grouped in one cluster and dissimilar objects are grouped in another cluster.
Cluster analysis separates data into groups, usually known as clusters. If meaningful groups are the
objective, then the clusters capture the general structure of the data. Sometimes cluster analysis is
only a useful initial step for other purposes, such as data summarization. Whether for understanding
or utility, cluster analysis has long played a significant role in a wide range of areas such as biology,
psychology, statistics, pattern recognition, machine learning, and data mining.
What is Cluster Analysis?
Cluster analysis groups data objects based primarily on information found in the data that describes
the objects and their relationships. The objective is that the objects within a group be similar to one
another and different from the objects in other groups.
The given Figure 1 illustrates different ways of clustering the same set of points.
In many applications, the notion of a cluster is not precisely defined. To better understand the
difficulty of deciding what constitutes a cluster, Figure 1 shows twenty points and three different
ways of dividing them into clusters. The shapes of the markers indicate cluster membership. The
figures divide the data into two and six parts, respectively. The apparent division of each of the two
larger clusters into three subclusters may simply be an artifact of the human visual system. Also, it
may not be unreasonable to say that the points form four clusters. The figure illustrates that the
definition of a cluster is imprecise, and that the best definition depends on the nature of the data
and the desired results.
Cluster analysis is related to other techniques that are used to divide data objects into groups.
For example, clustering can be viewed as a form of classification in that it creates a labelling of
objects with cluster (class) labels; however, it derives these labels only from the data. In contrast,
classification in the usual sense assigns new, unlabelled objects a class label using a model
developed from objects with known class labels. For this reason, cluster analysis is sometimes
referred to as unsupervised classification. When the term classification is used without any
qualification within data mining, it typically refers to supervised classification.
The terms segmentation and partitioning are sometimes used as synonyms for clustering, but these
terms are more commonly used for techniques outside the traditional bounds of cluster analysis.
For example, the term partitioning is often used in connection with techniques that divide graphs
into subgraphs and that are not strongly connected to clustering. Segmentation often refers to the
division of data into groups using simple techniques; for example, an image can be split into
segments based on pixel intensity and colour, or people can be divided into groups based on their
annual income. Nonetheless, some work in graph partitioning and market segmentation is related
to cluster analysis.
Clustering aims to discover useful groups of objects (clusters), where usefulness is defined by the
goals of the data analysis. Of course, there are several notions of a cluster that prove useful in
practice. In order to visually illustrate the differences between these kinds of clusters, we use
two-dimensional points, as shown in the figure, but the types of clusters described here are
equally valid for other sorts of data.
• Well-separated cluster
A cluster is a set of objects in which each object is closer (or more similar) to every other object
in the cluster than to any object not in the cluster. Sometimes a threshold is used to specify that
all the objects in a cluster must be sufficiently close or similar to one another. This definition of a
cluster is satisfied only when the data contains natural clusters that are quite far from one another.
The figure illustrates an example of well-separated clusters consisting of two groups of points in a
two-dimensional space. Well-separated clusters do not need to be spherical; they can have any shape.
• Prototype-Based cluster
A cluster is a set of objects in which each object is closer (more similar) to the prototype that
characterizes the cluster than to the prototype of any other cluster. For data with continuous
attributes, the prototype of a cluster is usually a centroid, i.e., the average (mean) of all the points
in the cluster. When a centroid is not meaningful, for example when the data has categorical
attributes, the prototype is usually a medoid, i.e., the most representative point of the cluster. For
many sorts of data, the prototype can be viewed as the most central point, and in such cases we
commonly refer to prototype-based clusters as center-based clusters. As one might expect, such
clusters tend to be globular. The figure illustrates an example of center-based clusters.
• Graph-Based cluster
If the data is represented as a graph, where the nodes are the objects and the links represent
connections among objects, then a cluster can be defined as a connected component, i.e., a group
of objects that are connected to one another but that have no connection to objects outside the
group. An important example of graph-based clusters is contiguity-based clusters, where two
objects are connected only when they lie within a specified distance of each other. This implies
that each object in a contiguity-based cluster is closer to some other object in the cluster than to
any point in a different cluster. The figures demonstrate an example of such clusters for
two-dimensional points. This definition of a cluster is useful when clusters are irregular or
intertwined, but it can have trouble when noise is present, since, as illustrated by the two circular
clusters in the figure, a small bridge of points can merge two distinct clusters. Other types of
graph-based clusters are also possible. One such approach defines a cluster as a clique, i.e., a set
of nodes in a graph that are completely connected to each other. Specifically, if we add
connections between objects in order of their distance from one another, a cluster is formed when
a set of objects forms a clique. Like prototype-based clusters, such clusters tend to be globular.
• Density-Based Cluster
A cluster is a dense region of objects that is surrounded by a region of low density. The two
spherical clusters in the figure are not merged because the bridge between them fades into the
noise. Likewise, the curve present in the figure fades into the noise and does not form a cluster.
A density-based definition of a cluster is often employed when the clusters are irregular or
intertwined and when noise and outliers are present. In contrast, a contiguity-based definition of
a cluster would not work well for such data, since the noise would tend to form bridges between
clusters.
• Shared-Property (Conceptual) cluster
More generally, we can define a cluster as a set of objects that share some property. The objects
in a center-based cluster share the property that they are all closest to the same centroid or
medoid. However, the shared-property approach also includes new types of clusters. Consider the
clusters shown in the figure: a triangular area (cluster) is adjacent to a rectangular one, and there
are two intertwined circles (clusters). In both cases, a clustering algorithm would need a very
specific concept of a cluster to successfully detect these clusters. The process of finding such
clusters is called conceptual clustering.
Clustering helps to split data into several subsets. Each of these subsets contains data similar to
each other, and these subsets are called clusters. Once the data from our customer base is divided
into clusters, we can make an informed decision about who we think is best suited for this product.
What is Clustering?
Clustering is the process of making a group of abstract objects into classes of similar
objects.
Points to Remember
• A cluster of data objects can be treated as one group.
• While doing cluster analysis, we first partition the set of data into groups
based on data similarity and then assign the labels to the groups.
• The main advantage of clustering over classification is that it is adaptable to
changes and helps single out useful features that distinguish different groups.
Why Clustering?
Clustering is very important as it determines the intrinsic grouping among the unlabelled data
present. There are no universal criteria for a good clustering; it depends on the user and on which
criteria satisfy their need. For instance, we could be interested in finding representatives for
homogeneous groups (data reduction), in finding “natural clusters” and describing their unknown
properties (“natural” data types), in finding useful and suitable groupings (“useful” data classes),
or in finding unusual data objects (outlier detection). A clustering algorithm must make some
assumptions about what constitutes the similarity of points, and each assumption leads to
different but equally valid clusters.
Let's understand this with an example. Suppose we are a market manager, and we have a new tempting
product to sell. We are sure that the product will bring enormous profit, as long as it is sold to the
right people. So, how can we tell who is best suited for the product from our company's huge
customer base?
Clustering, which falls under the category of unsupervised machine learning, is one of the problems
that machine learning algorithms solve.
Clustering uses only the input data to determine patterns, anomalies, or similarities in that data.
A good clustering has the following properties:
• The intra-cluster similarity is high, which implies that the data inside a cluster is similar to one
another.
• The inter-cluster similarity is low, which means each cluster holds data that is not similar to the
data in other clusters.
ans:
In a decision tree, which resembles a flowchart, an inner node represents a variable (or a feature)
of the dataset, a tree branch indicates a decision rule, and every leaf node indicates the outcome
of the specific decision. The first node from the top of a decision tree diagram is the root node.
We can split up data based on the attribute values that correspond to the independent
characteristics.
To divide the data based on target variables, choose the best feature employing Attribute
Selection Measures (ASM).
Then it will divide the dataset into smaller sub-datasets and designate that feature as a decision
node for that branch.
This procedure is repeated recursively for every child node until one of the stopping conditions is
met, thereby growing the tree.
To predict future events using the decision tree algorithm and generate an insightful output of
continuous data type, the decision tree regression algorithm analyses an object's attributes and
trains this machine learning model as a tree. Since a predetermined set of discrete numbers
does not entirely define it, the output or outcome is not discrete.
This model illustrates a discrete output in the cricket match prediction that predicts whether a
certain team will win or lose a match.
A sales forecasting machine learning model that forecasts a firm's profit ranges will increase
throughout a fiscal year depending on the company's preliminary figures illustrates continuous
output.
A decision tree regression algorithm is utilized in this instance to forecast continuous values.
After talking about sklearn decision trees, let's look at how they are implemented step-by-step.
Requires little data preparation. Other techniques often require data normalization, dummy
variables need to be created and blank values to be removed. Note however that this module
does not support missing values.
The cost of using the tree (i.e., predicting data) is logarithmic in the number of data points used
to train the tree.
Able to handle both numerical and categorical data. However, the scikit-learn implementation
does not support categorical variables for now. Other techniques are usually specialized in
analyzing datasets that have only one type of variable. See algorithms for more information.
Uses a white box model. If a given situation is observable in a model, the explanation for the
condition is easily explained by boolean logic. By contrast, in a black box model (e.g., in an
artificial neural network), results may be more difficult to interpret.
Possible to validate a model using statistical tests. That makes it possible to account for the
reliability of the model.
Performs well even if its assumptions are somewhat violated by the true model from which the
data were generated.
The disadvantages of decision trees include:
Decision-tree learners can create over-complex trees that do not generalize the data well. This is
called overfitting. Mechanisms such as pruning, setting the minimum number of samples
required at a leaf node or setting the maximum depth of the tree are necessary to avoid this
problem.
Decision trees can be unstable because small variations in the data might result in a completely
different tree being generated. This problem is mitigated by using decision trees within an
ensemble.
Predictions of decision trees are neither smooth nor continuous, but piecewise constant
approximations as seen in the above figure.
The problem of learning an optimal decision tree is known to be NP-complete under several
aspects of optimality and even for simple concepts. Consequently, practical decision-tree learning
algorithms are based on heuristic algorithms such as the greedy algorithm where locally optimal
decisions are made at each node. Such algorithms cannot guarantee to return the globally
optimal decision tree. This can be mitigated by training multiple trees in an ensemble learner,
where the features and samples are randomly sampled with replacement.
There are concepts that are hard to learn because decision trees do not express them easily, such
as XOR, parity or multiplexer problems.
Decision tree learners create biased trees if some classes dominate. It is therefore recommended
to balance the dataset prior to fitting with the decision tree.
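As a hedged, minimal sketch of the scikit-learn implementation discussed above (assuming scikit-learn is installed), the example below trains a small classification tree on the bundled Iris dataset; the hyperparameter values are illustrative only.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# max_depth and min_samples_leaf act as pruning-style controls against overfitting.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, min_samples_leaf=5, random_state=0)
tree.fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))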
Transaction    List of items
T1             I1, I2, I3
T2             I2, I3, I4
T3             I4, I5
T4             I1, I2, I4
T5             I1, I2, I3, I5
T6             I1, I2, I3, I4
ans:
The lowest-count item, I5, is not considered as it does not meet the minimum support count, hence
it is deleted.
For I4, the conditional pattern base is treated as a transaction database and an FP-tree is constructed.
It will contain {I2:2, I3:2}; I1 is not considered as it does not meet the minimum support count.
This path generates the frequent patterns {I2,I4:2}, {I3,I4:2}, {I2,I3,I4:2}.
For I3, the prefix paths are {I2,I1:3} and {I2:1}; this generates a two-node FP-tree {I2:4, I1:3},
and the frequent patterns generated are {I2,I3:4}, {I1,I3:3}, {I2,I1,I3:3}.
For I1, the prefix path is {I2:4}; this generates a single-node FP-tree {I2:4}, and the frequent
pattern generated is {I2,I1:4}.
The diagram given below depicts the conditional FP tree associated with the conditional node I3.
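As a cross-check of the frequent patterns listed above, the sketch below assumes the mlxtend and pandas packages are available and runs their FP-Growth implementation on the same six transactions with a 50% minimum support; it is not part of the original answer.

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

transactions = [
    ["I1", "I2", "I3"],
    ["I2", "I3", "I4"],
    ["I4", "I5"],
    ["I1", "I2", "I4"],
    ["I1", "I2", "I3", "I5"],
    ["I1", "I2", "I3", "I4"],
]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# min_support=0.5 corresponds to a minimum support count of 3 out of 6 transactions.
frequent = fpgrowth(onehot, min_support=0.5, use_colnames=True)
print(frequent.sort_values("support", ascending=False))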
(ii) The correlation coefficient is a statistical concept which helps in establishing a relation between
predicted and actual values obtained in a statistical experiment. The calculated value of the
correlation coefficient explains how closely the predicted values match the actual values. The
correlation coefficient always lies between -1 and +1. If the correlation coefficient is positive, the
two variables move together in the same direction; otherwise it indicates dissimilarity between the
two variables. The covariance of two variables divided by the product of their standard deviations
gives Pearson's correlation coefficient, usually represented by ρ (rho):
ρ(X, Y) = cov(X, Y) / (σX σY)
Here cov is the covariance, σX is the standard deviation of X, and σY is the standard deviation of Y.
The equation can also be expressed in terms of means and expectations:
ρ(X, Y) = E[(X − μX)(Y − μY)] / (σX σY)
where μX and μY are the means of X and Y.
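A minimal sketch of the formula in Python with NumPy follows; the predicted and actual values are made up for illustration.

import numpy as np

actual    = np.array([3.0, 4.5, 6.0, 8.0, 10.0])
predicted = np.array([2.8, 4.9, 5.7, 8.3,  9.6])

# Direct use of the definition: cov(X, Y) / (sigma_X * sigma_Y)
cov = np.mean((actual - actual.mean()) * (predicted - predicted.mean()))
rho = cov / (actual.std() * predicted.std())

print(rho)
print(np.corrcoef(actual, predicted)[0, 1])  # same value via the built-in helper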
ans:
K-Means Clustering is an Unsupervised Learning algorithm, which groups the unlabeled dataset
into different clusters. Here K defines the number of pre-defined clusters that need to be
created in the process, as if K=2, there will be two clusters, and for K=3, there will be three
clusters, and so on.
It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of
this algorithm is to minimize the sum of distances between the data point and their
corresponding clusters.
The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and repeats
the process until it finds the best clusters. The value of k should be predetermined in this algorithm.
The k-means algorithm mainly performs two tasks:
• Determines the best values for the K center points or centroids by an iterative process.
• Assigns each data point to its closest k-center. The data points which are near a particular
k-center form a cluster.
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points or centroids (they need not come from the input dataset).
Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat Step-3, i.e., reassign each data point to the new closest centroid of each cluster.
Step-6: If any reassignment occurred, go back to Step-4; otherwise the clusters are final and the
model is ready.
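A minimal sketch of these steps using scikit-learn's KMeans (assuming the library is installed) is shown below; the toy 2-D points are illustrative only.

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

# n_clusters is the K that must be chosen in advance; n_init restarts the
# iterative centroid search several times and keeps the best result.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print("labels:   ", kmeans.labels_)
print("centroids:", kmeans.cluster_centers_)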
ans: Hierarchical clustering is another unsupervised machine learning algorithm, which is used to
group the unlabeled datasets into a cluster and also known as hierarchical cluster analysis or
HCA. In this algorithm, we develop the hierarchy of clusters in the form of a tree, and this tree-
shaped structure is known as the dendrogram. Sometimes the results of K-means clustering and
hierarchical clustering may look similar, but they differ depending on how they work. Unlike the
K-means algorithm, there is no requirement to predetermine the number of clusters in hierarchical
clustering.
The agglomerative hierarchical clustering algorithm is a popular example of HCA. To group the
datasets into clusters, it follows a bottom-up approach: the algorithm treats each data point as a
single cluster at the beginning and then starts combining the closest pairs of clusters. It does this
until all the clusters are merged into a single cluster that contains all the data points.
Step-1: Create each data point as a single cluster. Let's say there are N data points, so the number
of clusters will also be N.
Step-2: Take two closest data points or clusters and merge them to form one cluster. So, there
will now be N-1 clusters.
Step-3: Again, take the two closest clusters and merge them together to form one cluster. There
will be N-2 clusters.
Step-4: Repeat Step 3 until only one cluster is left. So, we will get the following clusters.
Step-5: Once all the clusters are combined into one big cluster, develop the dendrogram to divide
the clusters as per the problem.
Single Linkage: It is the Shortest Distance between the closest points of the clusters. Consider
the below image:
Hierarchical Clustering in Machine Learning
Complete Linkage: It is the farthest distance between the two points of two different clusters. It is
one of the popular linkage methods as it forms tighter clusters than single-linkage.
Average Linkage: It is the linkage method in which the distance between each pair of datasets is
added up and then divided by the total number of datasets to calculate the average distance
between two clusters. It is also one of the most popular linkage methods.
Centroid Linkage: It is the linkage method in which the distance between the centroid of the
clusters is calculated. Consider the below image:
From the above-given approaches, we can apply any of them according to the type of problem or
business requirement.
The dendrogram is a tree-like structure that is mainly used to store each step as a memory that
the HC algorithm performs. In the dendrogram plot, the Y-axis shows the Euclidean distances
between the data points, and the x-axis shows all the data points of the given dataset.
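As a hedged sketch of agglomerative clustering and a dendrogram, the example below assumes SciPy and matplotlib are installed and reuses the small 2-D point set from the K-means exercise above; the linkage method chosen is just one of the options described.

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

X = np.array([[1, 1], [1.5, 2], [3, 4], [5, 7], [3.5, 5], [4.5, 5], [3.5, 4.5]])

# method can be 'single', 'complete', 'average', or 'centroid' (the linkages above).
Z = linkage(X, method="average", metric="euclidean")

# Cut the tree into two clusters and draw the dendrogram.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)

dendrogram(Z)
plt.ylabel("Euclidean distance")
plt.show()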
9.1) Find the association rules using the Apriori algorithm, where minsup is 50% and minconfidence
is 70%.
Consider a Big Bazar scenario where the product set is P = {Rice, Pulse, Oil, Milk, Apple}. The
database comprises six transactions where 1 represents the presence of the product and 0
represents the absence of the product.
Transaction  Rice  Pulse  Oil  Milk  Apple
t1           1     1      1    0     0
t2           0     1      1    1     0
t3           0     0      0    1     1
t4           1     1      0    1     0
t5           1     1      1    0     1
t6           1     1      1    1     1
Step 1
Make a frequency table of all the products that appear in the transactions. Now, filter the
frequency table to keep only those products with a support level of over 50 percent. We get the
following frequency table.
Rice (R) 4
Pulse (P) 5
Oil (O) 4
Milk (M) 4
The above table indicates the products frequently bought by the customers.
Step 2
Create pairs of products such as RP, RO, RM, PO, PM, OM. You will get the given frequency table.
Itemset Frequency (Number of transactions)
RP 4
RO 3
RM 2
PO 4
PM 3
OM 2
Step 3
Apply the same support threshold and keep the pairs that appear in at least 3 of the 6 transactions
(50 percent). In our case these are RP, RO, PO, and PM.
Step 4
Now, look for sets of three products that the customers buy together. Joining the frequent pairs
gives the candidate combinations RPO and POM.
Step 5
Calculate the frequency of the three-item candidates, and you will get the following frequency table.
RPO 3
POM 2
Applying the 50 percent threshold (at least 3 of the 6 transactions), you can figure out that the
customers' frequent set of three products is RPO.
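As a plain-Python cross-check of the counts above (no external libraries assumed), the sketch below recounts itemset frequencies against the 50% support threshold and evaluates the confidence of one sample rule against the 70% threshold; note that, applied inclusively, the 50% threshold also admits Apple as a frequent single item.

from itertools import combinations

transactions = [
    {"Rice", "Pulse", "Oil"},
    {"Pulse", "Oil", "Milk"},
    {"Milk", "Apple"},
    {"Rice", "Pulse", "Milk"},
    {"Rice", "Pulse", "Oil", "Apple"},
    {"Rice", "Pulse", "Oil", "Milk", "Apple"},
]
n = len(transactions)
min_support = 0.5

def count(itemset):
    # Number of transactions containing every item of the itemset.
    return sum(1 for t in transactions if itemset <= t)

items = sorted(set().union(*transactions))
for size in (1, 2, 3):
    for combo in combinations(items, size):
        c = count(set(combo))
        if c / n >= min_support:
            print(size, combo, c)

# Confidence of a sample rule, e.g. {Rice, Pulse} -> {Oil}
conf = count({"Rice", "Pulse", "Oil"}) / count({"Rice", "Pulse"})
print("confidence({Rice,Pulse} -> {Oil}) =", conf)  # 3/4 = 0.75 >= 0.70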
Graph mining is a process in which the mining techniques are used in finding a pattern or
relationship in the given real-world collection of graphs. By mining the graph, frequent
substructures and relationships can be identified which helps in clustering the graph sets, finding
a relationship between graph sets, or discriminating or characterizing graphs. Predicting these
patterning trends can help in building models for the enhancement of any application that is
used in real-time. To implement the process of graph mining, one must learn to mine frequent
subgraphs.
Let us consider a graph h with an edge set E(h) and a vertex set V(h). There exists a subgraph
isomorphism from h to h' when h is isomorphic to a subgraph of h'. A label function is a function
that maps the edges or vertices to labels. Let us consider a labeled graph dataset F = {H1, H2, H3,
..., Hn}, and let s(h) be the support of h, which means the percentage of graphs in F in which h
occurs as a subgraph. A frequent graph has support no less than the minimum support threshold,
denoted min_support.
The first step of frequent subgraph mining is to generate candidate subgraphs; the second step is
to find the support of each candidate. We must optimize and enhance the first step, because the
second step requires subgraph isomorphism testing, which is NP-complete and therefore
computationally expensive.
The Apriori-based approach: the search for frequent graphs starts from graphs of small size and
advances in a bottom-up manner by generating candidates with an extra vertex or edge. This
algorithm is called AprioriGraph.
Input: F, a graph dataset; min_support, the minimum support threshold.
Output: Q1, Q2, Q3, ..., Qk, the sets of frequent substructures of size 1, 2, ..., k.
k <- 2;
while Qk-1 ≠ ∅ do
    Qk <- ∅;
    Gk <- candidate_generation(Qk-1);
    foreach candidate l ∈ Gk do
        l.count <- 0;
        foreach Fi ∈ F do
            if subgraph_isomorphism(l, Fi) then
                l.count <- l.count + 1;
            end
        end
        if l.count ≥ min_support and l ∉ Qk then
            Qk <- Qk ∪ {l};
        end
    end
    k <- k + 1;
end
return Q1 ∪ Q2 ∪ ... ∪ Qk;
The pattern-growth approach: this approach can use both BFS (Breadth First Search) and DFS (Depth
First Search); DFS is preferred because it consumes less memory. The constraints described change
according to the user's request during the mining process. But if we generalize and categorize them
into specific classes of constraints, the mining process can be handled easily by pushing them into
the existing frameworks. A constraint-pushing strategy is used in pattern-growth mining tasks. Let's
see some important constraint categories.
ans:
A time series is a sequence of observations recorded at regular time intervals. Depending on the
frequency of observations, a time series may typically be hourly, daily, weekly, monthly, quarterly
or annual. Sometimes, you might have second- and minute-wise time series as well, such as the
number of clicks and user visits every minute.
Analyzing a time series is the preparatory step before you develop a forecast of the series. Besides,
time series forecasting has enormous commercial significance because things that matter to a
business, like demand and sales, the number of visitors to a website, or stock prices, are essentially
time series data.
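As a hedged illustration of working with time series at different frequencies, the sketch below assumes Python with pandas and uses randomly generated daily values.

import numpy as np
import pandas as pd

idx = pd.date_range("2023-01-01", periods=90, freq="D")   # a daily series
daily = pd.Series(np.random.rand(90), index=idx)

monthly = daily.resample("M").mean()       # downsample to monthly averages
weekly_total = daily.resample("W").sum()   # weekly totals

print(monthly.head())
print(weekly_total.head())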
12. State Bayes theorem. How can it be applied for data classification? 5
ans:
Bayesian classification uses Bayes' theorem to predict the occurrence of an event. Bayesian
classifiers are statistical classifiers built on Bayesian probability. The theorem expresses how a
degree of belief, expressed as a probability, should change to account for evidence.
Bayes' theorem is named after Thomas Bayes, who first used conditional probability to provide an
algorithm that uses evidence to calculate limits on an unknown parameter. Bayes' theorem is
expressed mathematically by the following equation:
P(X|Y) = P(Y|X) P(X) / P(Y)
P(X|Y) is a conditional probability that describes the occurrence of event X given that Y is true.
P(Y|X) is a conditional probability that describes the occurrence of event Y given that X is true.
P(X) and P(Y) are the probabilities of observing X and Y independently of each other; these are
known as the marginal probabilities.
Bayesian interpretation:
The theorem follows from the definition of conditional probability as the quotient
P(X|Y) = P(X ∩ Y) / P(Y) and P(Y|X) = P(X ∩ Y) / P(X),
where P(X ∩ Y) is the joint probability of both X and Y being true; combining the two expressions
gives Bayes' theorem.
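A small numeric illustration of the theorem in Python follows; all the probabilities used are hypothetical and chosen only to show the mechanics.

# Hypothetical screening test: P(positive|condition) = 0.9,
# false-positive rate P(positive|no condition) = 0.05, prior P(condition) = 0.01.
p_cond = 0.01
p_pos_given_cond = 0.90
p_pos_given_no_cond = 0.05

# Marginal probability of a positive result (law of total probability).
p_pos = p_pos_given_cond * p_cond + p_pos_given_no_cond * (1 - p_cond)

# Bayes' theorem: P(condition | positive) = P(positive | condition) * P(condition) / P(positive)
p_cond_given_pos = p_pos_given_cond * p_cond / p_pos
print(round(p_cond_given_pos, 3))  # ~0.154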
ans:
Bayesian network:
A Bayesian Network falls under the classification of Probabilistic Graphical Modelling (PGM)
procedure that is utilized to compute uncertainties by utilizing the probability concept. Generally
known as Belief Networks, Bayesian Networks are used to show uncertainties using Directed
Acyclic Graphs (DAG)
A Directed Acyclic Graph is used to show a Bayesian Network, and like some other statistical
graph, a DAG consists of a set of nodes and links, where the links signify the connection between
the nodes.
The nodes here represent random variables, and the edges define the relationship between
these variables.
A DAG models the uncertainty of an event taking place based on the Conditional Probability
Distribution (CPD) of each random variable. A Conditional Probability Table (CPT) is used to
represent the CPD of each variable in the network.
ans:
Data aggregation refers to a process of collecting information from different sources and
presenting it in a summarized format so that business analysts can perform statistical analyses of
business schemes. The collected information may be gathered from various data sources to
summarize these data sources into a draft for data analysis.
Data aggregation is needed if a dataset has useless information that can not be used for analysis.
In data aggregation, the datasets are summarized into significant information, which helps attain
desirable outcomes and increases the user experience. Data aggregation provides accurate
measurements such as sum, average, and count. The collected, summarized data helps the
business analysts to perform the demographic study of customers and their behavior. Aggregated
data help in determining significant information about a specific group after they submit their
reports. With the help of data aggregation, we can also calculate the count of non-numeric data.
Generally, data aggregation is done for data sets, not for individual data.
Data aggregators
Data aggregators are systems used in data mining to collect data from various sources, then
process the data and extract it into useful, summarized information. They play a vital role in
enhancing customer data by acting as agents. They also help in the query and delivery process,
where a customer requests data instances about a specific product. The marketing team uses data
aggregation to personalize messaging, offers, and more in the user's digital experiences with the
brand. It also helps the product management team of an organization to know which products
generate more revenue and which do not. The aggregated data is also used by finance and company
executives, helping them decide how to allocate budget towards marketing or product development
strategies.
Data aggregators work in three stages:
• Collection of data
• Processing of data
• Presentation of data
ans:
Regression refers to a type of supervised machine learning technique that is used to predict any
continuous-valued attribute. Regression helps any business organization to analyze the target
variable and predictor variable relationships.
Types of Regression
Linear Regression
Logistic Regression
Lasso Regression
Ridge Regression
Polynomial Regression
Linear Regression
Linear regression is the type of regression that forms a relationship between the target variable
and one or more independent variables utilizing a straight line. The given equation represents
the equation of linear regression
Y = a + b*X + e
where a is the intercept of the line, b is the slope of the line, and e is the error term.
In linear regression, the best fit line is achieved utilizing the least squared method, and it
minimizes the total sum of the squares of the deviations from each data point to the line of
regression. Here, the positive and negative deviations do not get canceled as all the deviations
are squared.
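A minimal sketch of fitting Y = a + b*X with scikit-learn's least-squares linear regression (assuming the library is installed) is given below; the data points are made up for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.1, 4.3, 6.2, 7.9, 10.1])

model = LinearRegression().fit(X, y)
print("intercept a:", model.intercept_)
print("slope b:    ", model.coef_[0])
print("prediction at X=6:", model.predict(np.array([[6.0]]))[0])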
Polynomial Regression
If the power of the independent variable is more than 1 in the regression equation, it is termed a
polynomial equation. With the help of the example given below, we will understand the concept
of polynomial regression.
Y = a + b*X^2
In the particular regression, the best fit line is not considered a straight line like a linear equation;
however, it represents a curve fitted to all the data points.
Applying such regression techniques can lead to overfitting when you are tempted to minimize
your errors by making the curve more complex. Therefore, always try to fit a curve that
generalizes well to the problem.
Logistic Regression
When the dependent variable is binary in nature, i.e., 0 or 1, true or false, success or failure, the
logistic regression technique is used. Here, the target value (Y) ranges from 0 to 1, and it is
primarily used for classification-based problems. Unlike linear regression, it does not require the
independent and dependent variables to have a linear relationship.
Ridge Regression
Ridge regression refers to a process that is used to analyze regression data that suffer from the
issue of multicollinearity. Multicollinearity is the existence of a linear correlation between two
independent variables.
Ridge regression is used when the least squares estimates are unbiased but have high variance,
so they can be quite different from the real values. By adding a degree of bias to the estimated
regression values, ridge regression reduces these errors.
Lasso Regression
The term LASSO stands for Least Absolute Shrinkage and Selection Operator. Lasso regression is a
linear type of regression that utilizes shrinkage: all the coefficient estimates are shrunk towards a
central point, such as the mean. The lasso procedure is best suited for simple and sparse models
with fewer parameters than other regression models, and it is also well suited for models that
suffer from multicollinearity.
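As a hedged sketch combining the ideas above (assuming scikit-learn is installed), the example below fits polynomial features with ridge and lasso penalties on synthetic data; the alpha values are illustrative only.

import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 40).reshape(-1, 1)
y = 0.5 * X[:, 0] ** 2 + rng.normal(scale=0.3, size=40)   # quadratic trend + noise

# Polynomial regression: expand X to [X, X^2] before the regularized linear fit.
ridge_model = make_pipeline(PolynomialFeatures(degree=2), Ridge(alpha=1.0)).fit(X, y)
lasso_model = make_pipeline(PolynomialFeatures(degree=2), Lasso(alpha=0.1)).fit(X, y)

print("ridge R^2:", ridge_model.score(X, y))
print("lasso R^2:", lasso_model.score(X, y))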
ans:
A Data Warehouse (DW) is a relational database that is designed for query and analysis rather
than transaction processing. It includes historical data derived from transaction data from single
and multiple sources.
A Data Warehouse provides integrated, enterprise-wide, historical data and focuses on providing
support for decision-makers for data modeling and analysis.
A Data Warehouse is a group of data specific to the entire organization, not only to a particular
group of users.
It is not used for daily operations and transaction processing but used for making decisions.
A Data Warehouse can be viewed as a data system with the following attributes:
It is a database designed for investigative tasks, using data from various applications.
Subject-Oriented
A data warehouse targets the modeling and analysis of data for decision-makers. Therefore,
data warehouses typically provide a concise and straightforward view around a particular subject,
such as customer, product, or sales, instead of the global organization's ongoing operations. This
is done by excluding data that are not useful concerning the subject and including all data
needed by the users to understand the subject.
Integrated
A data warehouse integrates various heterogeneous data sources like RDBMS, flat files, and
online transaction records. It requires performing data cleaning and integration during data
warehousing to ensure consistency in naming conventions, attributes types, etc., among different
data sources.
Time-Variant
Historical information is kept in a data warehouse. For example, one can retrieve data from 3
months, 6 months, 12 months, or even older data from a data warehouse. This contrasts with a
transaction system, where often only the most current data is kept.
Non-Volatile
The data warehouse is a physically separate data storage, which is transformed from the source
operational RDBMS. The operational updates of data do not occur in the data warehouse, i.e.,
update, insert, and delete operations are not performed. It usually requires only two procedures
in data accessing: Initial loading of data and access to data. Therefore, the DW does not require
transaction processing, recovery, and concurrency capabilities, which allows for substantial
speedup of data retrieval. Non-volatile means that once data has entered the warehouse, it
should not change.
ans:
In data mining, during data integration, many data stores are used. It may lead to data
redundancy. An attribute is known as redundant if it can be derived from any set of attributes. Let
us consider we have a set of data where there are 20 attributes. Now suppose that out of 20, an
attribute can be derived from some of the other set of attributes. Such attributes that can be
derived from other sets of attributes are called Redundant attributes. Inconsistencies in attribute
or dimension naming may lead to redundancies in the set of data.
In the case of numeric data, a correlation test is used. In this test, the relation between attribute A
and attribute B is computed by Pearson's product-moment coefficient, also called the correlation
coefficient. A correlation coefficient measures the extent to which the value of one variable
changes with another. The best known are Pearson's and Spearman's rank-order coefficients. The
first is used where both variables are continuous, the second where at least one represents a rank.
There are several different correlation coefficients, each of which is appropriate for different
types of data. The most common is the Pearson r, used for continuous variables. It is a statistic
that measures the degree to which one variable varies in tandem with another. It ranges from -1
to +1. A +1 correlation means that as one variable rises, the other rises proportionally; a -1
correlation means that as one rises, the other falls proportionally. A 0 correlation means that
there is no relationship between the movements of the two variables.
For attributes A and B with n tuples, the coefficient is computed as
r(A, B) = Σ (ai − mean(A)) (bi − mean(B)) / (n · σA · σB)
Where,
n = number of tuples
ai = value of A in tuple i
bi = value of B in tuple i
mean(A) and mean(B) are the means of A and B, and σA and σB are their standard deviations.
From the above discussion, we can say that the greater the correlation coefficient, the more
strongly the attributes are correlated to each other, and we can ignore either one of them (A or B).
If the value of the correlation coefficient is zero, the attributes are independent. If the value is
negative, the attributes are negatively correlated: as the value of one attribute increases, the value
of the other attribute decreases.
18. What is data binning? Define its types.
ANS:
Data binning, also called discrete binning or bucketing, is a data pre-processing technique used to
reduce the effects of minor observation errors. It is a form of quantization. The original data
values are divided into small intervals known as bins, and then they are replaced by a general
value calculated for that bin. This has a smoothing effect on the input data and may also reduce
the chances of overfitting in the case of small datasets.
Statistical data binning is a way to group numbers of more or less continuous values into a
smaller number of "bins". It can also be used in multivariate statistics, binning in several
dimensions simultaneously. For example, if you have data about a group of people, you might
want to arrange their ages into a smaller number of age intervals, such as grouping every five
years together.
Binning can dramatically improve resource utilization and model build response time without
significant loss in model quality. Binning can improve model quality by strengthening the
relationship between attributes.
Supervised binning is a form of intelligent binning in which important characteristics of the data
are used to determine the bin boundaries. In supervised binning, the bin boundaries are
identified by a single-predictor decision tree that considers the joint distribution with the target.
Supervised binning can be used for both numerical and categorical attributes.
Example of equal-frequency binning (three bins with the same number of values):
Input: [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
Output:
[5, 10, 11, 13]
[15, 35, 50, 55]
[72, 92, 204, 215]
Example of equal-width binning (bin width = (215 − 5) / 3 = 70):
Input: [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
Output:
[5, 10, 11, 13, 15, 35, 50, 55, 72]
[92]
[204, 215]
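A plain-Python sketch of the two binning schemes above follows; it reproduces the equal-frequency and equal-width outputs for the same input list.

data = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
k = 3  # number of bins

# Equal-frequency (equal-depth) binning: the same number of values per bin.
size = len(data) // k
equal_freq = [data[i * size:(i + 1) * size] for i in range(k)]

# Equal-width binning: bins of width (max - min) / k.
width = (max(data) - min(data)) / k
equal_width = [[] for _ in range(k)]
for x in data:
    index = min(int((x - min(data)) / width), k - 1)  # clamp the max value into the last bin
    equal_width[index].append(x)

print(equal_freq)   # [[5, 10, 11, 13], [15, 35, 50, 55], [72, 92, 204, 215]]
print(equal_width)  # [[5, 10, 11, 13, 15, 35, 50, 55, 72], [92], [204, 215]]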
ans:
Density-Based Clustering refers to one of the most popular unsupervised learning methodologies
used in model building and machine learning algorithms. The data points in the low-density regions
that separate clusters are considered noise. The surroundings within a radius ε of a given object are
known as the ε-neighborhood of the object. If the ε-neighborhood of the object comprises at least
a minimum number of objects, MinPts, then it is called a core object.
There are two parameters used in density-based clustering:
Eps: the radius ε that defines the neighborhood of a point.
MinPts: the minimum number of points required in the Eps-neighborhood of a point.
A point i is considered directly density-reachable from a point k with respect to Eps and MinPts if
i belongs to NEps(k) and k is a core object, i.e., |NEps(k)| ≥ MinPts.
Density reachable:
A point i is density-reachable from a point j with respect to Eps and MinPts if there is a chain of
points i1, ..., in with i1 = j and in = i such that each point in the chain is directly density-reachable
from the previous one.
Density connected:
A point i is density-connected to a point j with respect to Eps and MinPts if there is a point o such
that both i and j are density-reachable from o with respect to Eps and MinPts.
Suppose a set of objects is denoted by D'. An object i is directly density-reachable from an object j
only if it is located within the ε-neighborhood of j and j is a core object.
An object i is density-reachable from an object j with respect to ε and MinPts in a given set of
objects D' only if there is a chain of objects i1, ..., in with i1 = j and in = i such that each object in
the chain is directly density-reachable from the previous one with respect to ε and MinPts.
An object i is density-connected to an object j with respect to ε and MinPts in a given set of objects
D' only if there is an object o in D' such that both i and j are density-reachable from o with respect
to ε and MinPts.
It is a one-scan method: density-based algorithms such as DBSCAN examine the database in a single scan.
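As a small sketch of the core-object definition above (plain Python, illustrative points only), the function below marks a point as a core point when its ε-neighborhood, including the point itself, contains at least MinPts points.

from math import dist  # Euclidean distance, Python 3.8+

def eps_neighborhood(p, points, eps):
    return [q for q in points if dist(p, q) <= eps]

def is_core(p, points, eps, min_pts):
    return len(eps_neighborhood(p, points, eps)) >= min_pts

points = [(1, 1), (1.5, 1.2), (1.2, 0.8), (8, 8)]  # made-up data
for p in points:
    print(p, "core" if is_core(p, points, eps=1.0, min_pts=3) else "not core")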
marks:15
1.
answer:
Decision trees are drawn upside down, which means the root is at the top, and this root is then
split into several nodes. In layman's terms, decision trees are nothing but a bunch of if-else
statements. A node checks whether a condition is true, and if it is, the flow moves to the next node
attached to that decision.
In the diagram below, the tree will first ask: what is the weather? Is it sunny, cloudy, or rainy? It
will then go to the next feature, which is humidity or wind. It will again check whether the wind is
strong or weak; if it is a weak wind and it is rainy, then the person may go and play.
Entropy
Entropy basically measures the impurity of a node. Impurity is the degree of randomness; it tells
how random our data is. A pure sub-split means that either you should be getting only "yes" or
only "no".
Suppose a feature has 8 "yes" and 4 "no" initially; after the first split the left node gets 5 "yes" and
2 "no", whereas the right node gets 3 "yes" and 2 "no".
We see here the split is not pure. Why? Because we can still see some negative classes in both
nodes. In order to make a decision tree, we need to calculate the impurity of each split, and when
the purity is 100%, we make it a leaf node.
To check the impurity of feature 2 and feature 3, we take the help of the entropy formula:
Entropy(S) = − p(yes) log2 p(yes) − p(no) log2 p(no)
The left node has lower entropy (more purity) than the right node, since the left node has a greater
proportion of "yes" and it is easier to decide there.
Information Gain
Information gain measures the reduction of uncertainty given some feature and it is also a
deciding factor for which attribute should be selected as a decision node or root node.
To understand this better, let's consider an example. Suppose our entire population has a total of
30 instances. The dataset is to predict whether a person will go to the gym or not. Let's say 16
people go to the gym and 14 people don't.
Now we have two features to predict whether he/she will go to the gym or not.
Feature 1 is "Energy", which takes two values, "high" and "low".
Feature 2 is "Motivation", which takes 3 values: "No motivation", "Neutral" and "Highly motivated".
Let’s see how our decision tree will be made using these 2 features. We’ll use information gain to
decide which feature should be the root node and which feature should be placed after the split.
Now we have the values of E(Parent) and E(Parent|Energy); the information gain is
Information Gain = E(Parent) − E(Parent|Energy)
Our parent entropy was near 0.99, and after looking at this value of information gain, we can say
that the entropy of the dataset will decrease by 0.37 if we make "Energy" our root node.
Similarly, we will do this with the other feature, "Motivation", and calculate its information gain:
Information Gain = E(Parent) − E(Parent|Motivation)
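A short Python sketch of the entropy and information-gain formulas follows; the parent counts (16/14) come from the example above, but the child split counts are hypothetical because the original figure is not reproduced here.

from math import log2

def entropy(pos, neg):
    # Binary entropy of a node with `pos` positive and `neg` negative instances.
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            result -= p * log2(p)
    return result

# Parent node: 16 "go to gym" vs 14 "don't" (from the example above).
e_parent = entropy(16, 14)          # ~0.997

# Hypothetical split on "Energy": left child 12/3, right child 4/11.
e_left, e_right = entropy(12, 3), entropy(4, 11)
e_children = (15 / 30) * e_left + (15 / 30) * e_right

info_gain = e_parent - e_children
print(round(e_parent, 3), round(info_gain, 3))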
2.
Using the k-means algorithm divide the dataset into two clusters.
ID X Y
1 1 1
2 1.5 2
3 3 4
4 5 7
5 3.5 5
6 4.5 5
7 3.5 4.5
ans:
C1 = {1, 2}
C2 = {3, 4, 5, 6, 7}
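As a hedged check of this split (assuming scikit-learn is installed), the sketch below runs KMeans with K=2 on the seven points; IDs 1-7 correspond to rows 0-6.

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [1.5, 2], [3, 4], [5, 7], [3.5, 5], [4.5, 5], [3.5, 4.5]])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for point_id, label in zip(range(1, 8), labels):
    print(point_id, "-> cluster", label)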
3.
Point  X  Y
S1     5  7
S2     8  4
S3     3  3
S4     4  4
S5     3  7
S6     6  7
S7     6  1
S8     5  5
Using DBSCAN clustering, find the core and noise points, where epsilon (ε) = 3 and MinPts = 3.
ans:
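A hedged sketch of how the core and noise points can be computed with scikit-learn's DBSCAN (assuming the library is installed) is given below; note that scikit-learn counts the point itself towards min_samples.

import numpy as np
from sklearn.cluster import DBSCAN

points = np.array([[5, 7], [8, 4], [3, 3], [4, 4], [3, 7], [6, 7], [6, 1], [5, 5]])
names = ["S1", "S2", "S3", "S4", "S5", "S6", "S7", "S8"]

db = DBSCAN(eps=3, min_samples=3).fit(points)
core_mask = np.zeros(len(points), dtype=bool)
core_mask[db.core_sample_indices_] = True

for name, label, is_core in zip(names, db.labels_, core_mask):
    kind = "core" if is_core else ("noise" if label == -1 else "border")
    print(name, "cluster", label, kind)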
4.
Data Value
A 1
B 3
C 5
D 6
E 9
5. Using the k-means algorithm divide the dataset into two clusters.
ID  X    Y
1   185  72
2   170  56
3   168  60
4   179  68
5   182  72
6   188  72
7   180  71
8   180  70
9   183  84
10  180  88
11  180  67
12  177  76
Ans:
C1 = {1, 4, 5, 6, 7, 8, 9, 10, 11, 12}
C2 = {2, 3}
Ans:
The FP-Growth Algorithm is an alternative way to find frequent item sets without using
candidate generation, thus improving performance. To do so, it uses a divide-and-conquer
strategy. The core of this method is the usage of a special data structure named the
frequent-pattern tree (FP-tree), which retains the item set association information.
o First, it compresses the input database into an FP-tree that represents the frequent items.
o After this first step, it divides the compressed database into a set of conditional
databases, each associated with one frequent pattern.
Using this strategy, the FP-Growth reduces the search costs by recursively looking for short
patterns and then concatenating them into the long frequent patterns.
In large databases, holding the FP tree in the main memory is impossible. A strategy to cope
with this problem is to partition the database into a set of smaller databases (called projected
databases) and then construct an FP-tree from each of these smaller databases.
FP-Tree
The frequent-pattern tree (FP-tree) is a compact data structure that stores quantitative
information about frequent patterns in a database. Each transaction is read and then
mapped onto a path in the FP-tree. This is done until all transactions have been read.
Different transactions with common subsets allow the tree to remain compact because their
paths overlap.
A frequent Pattern Tree is made with the initial item sets of the database. The purpose of the
FP tree is to mine the most frequent pattern. Each node of the FP tree represents an item of
the item set.
The root node represents null, while the lower nodes represent the item sets. The
associations of the nodes with the lower nodes, that is, the item sets with the other item sets,
are maintained while forming the tree.
1. One root is labelled as "null", with a set of item-prefix subtrees as children and a
frequent-item-header table.
2. Each node in the item-prefix subtree consists of three fields: item-name, count, and node-link.
o Node-link: links to the next node in the FP-tree carrying the same item name,
or null if there is none.
Additionally, the frequent-item-header table can have the count support for an item. The
below diagram is an example of a best-case scenario that occurs when all transactions have
the same item set; the size of the FP-tree will be only a single branch of nodes.
(ii)
A maximal frequent itemset is a frequent itemset for which none of its immediate supersets are
frequent. The itemsets in the lattice are divided into two groups: those that are frequent and those
that are infrequent. The frequent itemset border is shown by a dashed line in the figure.
Every itemset located above the border is frequent, while those located below the border (the
shaded nodes) are infrequent. Among the itemsets located near the border, {a, d}, {a, c, e}, and
{b, c, d, e} are considered maximal frequent itemsets because their immediate supersets are
infrequent.
An itemset such as {a, d} is maximal frequent because all of its immediate supersets, {a, b, d},
{a, c, d}, and {a, d, e}, are infrequent. In contrast, {a, c} is non-maximal because one of its
immediate supersets, {a, c, e}, is frequent.
Maximal frequent itemsets provide a compact representation for data sets that can produce very
long frequent itemsets, as there are exponentially many frequent itemsets in such data. This
approach is practical only if an efficient algorithm exists to explicitly discover the maximal frequent
itemsets without having to enumerate all of their subsets.
Despite providing a compact representation, maximal frequent itemsets do not contain the support
information of their subsets. For instance, the supports of the maximal frequent itemsets {a,c,e},
{a,d}, and {b,c,d,e} do not give any information about the supports of their subsets. An additional
pass over the data set is required to determine the support counts of the non-maximal frequent
itemsets. In some cases, it is desirable to have a minimal representation of the frequent itemsets
that preserves the support information.
7+8=15
Ans:
Data cleaning is a crucial process in data mining. It plays an important part in the building of a
model. Data cleaning is a necessary process, but it is often neglected. Data quality is the main
issue in quality information management; data quality problems occur everywhere in information
systems, and these problems are solved by data cleaning.
Data cleaning is fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or
incomplete data within a dataset. If data is incorrect, outcomes and algorithms are unreliable,
even though they may look correct. When combining multiple data sources, there are many
opportunities for data to be duplicated or mislabeled.
While the techniques used for data cleaning may vary according to the types of data your
company stores, you can follow these basic steps to cleaning your data, such as:
1. Remove irrelevant observations
For example, if you want to analyze data regarding millennial customers, but your dataset
includes older generations, you might remove those irrelevant observations. This can make
analysis more efficient, minimize distraction from your primary target, and create a more
manageable and performant dataset.
2. Fix structural errors
Structural errors arise when you measure or transfer data and notice strange naming
conventions, typos, or incorrect capitalization. These inconsistencies can cause mislabeled
categories or classes. For example, you may find "N/A" and "Not Applicable" in the same sheet,
but they should be analyzed as the same category.
3. Filter unwanted outliers
Often, there will be one-off observations where, at a glance, they do not appear to fit within
the data you are analyzing. If you have a legitimate reason to remove an outlier, like
improper data entry, doing so will help the performance of the data you are working with.
However, sometimes the appearance of an outlier will prove a theory you are working on,
and just because an outlier exists doesn't mean it is incorrect. This step is needed to
determine the validity of that number. If an outlier proves to be irrelevant for analysis or is a
mistake, consider removing it.
4. Handle missing data
You can't ignore missing data because many algorithms will not accept missing values. There
are a couple of ways to deal with missing data. Neither is optimal, but both can be
considered, such as:
You can drop observations with missing values, but this will drop or lose
information, so be careful before removing it.
You can input missing values based on other observations; again, there is an
opportunity to lose the integrity of the data because you may be operating from
assumptions and not actual observations.
You might alter how the data is used to navigate null values effectively.
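As a brief illustration of the drop-versus-impute options above, the sketch below assumes Python with pandas and a made-up table containing missing values.

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 31, 40], "income": [50_000, 62_000, np.nan, 58_000]})

dropped = df.dropna()                                  # option 1: drop observations with missing values
imputed = df.fillna(df.mean(numeric_only=True))        # option 2: impute from other observations (column mean)

print(dropped)
print(imputed)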
5. Validate and QA
At the end of the data cleaning process, you should be able to answer these questions as a
part of basic validation, such as:
o Does the data follow the appropriate rules for its field?
o Does it prove or disprove your working theory or bring any insight to light?
o Can you find trends in the data to help you for your next theory?
Some common data cleaning methods are:
• Ignore the tuples: This method is not very feasible, as it is only useful when a tuple
has several attributes with missing values.
• Fill in the missing value: This approach is also not very effective or feasible, and it
can be time-consuming. In this approach, one has to fill in the missing value. This is
usually done manually, but it can also be done using the attribute mean or the most
probable value.
• Clustering: This method mainly operates on groups. Clustering groups the data into
clusters, and the outliers are then detected with the help of clustering. Similar values
are arranged into a "group" or "cluster".
8. What is a decision tree? Describe the Naïve Bayes Classifier algorithm. Why is it called Naïve
Bayes?
5+5+5= 15
ans:
A Decision Tree is a supervised learning method used in data mining for classification and
regression tasks. It is a tree-like structure that helps us in decision-making. The decision
tree creates classification or regression models in the form of a tree. It separates a data
set into smaller and smaller subsets while, at the same time, the tree is incrementally
developed. The final tree has decision nodes and leaf nodes. A decision node has at
least two branches, while the leaf nodes represent a classification or decision; no further
splits can be made on leaf nodes. The uppermost decision node in a tree, which corresponds
to the best predictor, is called the root node. Decision trees can deal with both categorical
and numerical data.
Entropy:
Entropy refers to a common way to measure impurity. In the decision tree, it measures the
randomness or impurity in data sets.
Information Gain:
Information Gain refers to the decline in entropy after the dataset is split. It is also called
entropy reduction. Building a decision tree is all about discovering the attributes that return the
highest information gain.
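The relationship between the two measures can be made concrete with a small sketch; the tiny label lists below are made-up illustrative data, not from this answer.

```python
# A small sketch of entropy and information gain for a binary split.
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log2(p) for p in probs)

def information_gain(parent, left, right):
    """Entropy of the parent minus the weighted entropy of its children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

parent = ["yes", "yes", "yes", "no", "no", "no"]
left, right = ["yes", "yes", "yes"], ["no", "no", "no"]     # a perfect split
print(entropy(parent))                                      # 1.0 (maximum impurity)
print(information_gain(parent, left, right))                # 1.0 (impurity removed entirely)
```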
The decision tree algorithm may appear complex, but the basic technique behind it is quite
simple. Some advantages of decision trees are as follows:
• Missing values in the data do not influence the process of building a decision tree to any
considerable extent.
• A decision tree model is intuitive and simple to explain to the technical team as well as to
stakeholders.
• Compared to other algorithms, decision trees need less effort for data preparation during
pre-processing.
(ii)
Naïve Bayes Classifier is one of the simplest and most effective classification
algorithms, and it helps in building fast machine learning models that can make
quick predictions.
Some popular examples of the Naïve Bayes algorithm are spam filtering, sentiment
analysis, and classifying articles.
(iii)
Why is it called Naïve Bayes?
The Naïve Bayes algorithm comprises two words, Naïve and Bayes, which can be
described as:
Naïve: It is called naïve because it assumes that the occurrence of a certain feature
is independent of the occurrence of other features. For example, if a fruit is identified on
the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized
as an apple. Hence, each feature individually contributes to identifying it as an apple
without depending on the others.
Bayes: It is called Bayes because it depends on the principle of Bayes' theorem.
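A minimal sketch of that independence assumption, using the fruit example; every probability below is a made-up number chosen purely for illustration.

```python
# A toy illustration of the "naive" independence assumption in the fruit example.
p_given_apple  = {"red": 0.7, "spherical": 0.8, "sweet": 0.6}   # P(feature | apple)
p_given_orange = {"red": 0.1, "spherical": 0.9, "sweet": 0.5}   # P(feature | orange)
prior = {"apple": 0.5, "orange": 0.5}

def naive_score(features, likelihoods, prior_prob):
    """Multiply the per-feature likelihoods independently (the 'naive' step)."""
    score = prior_prob
    for f in features:
        score *= likelihoods[f]
    return score

observed = ["red", "spherical", "sweet"]
scores = {
    "apple":  naive_score(observed, p_given_apple, prior["apple"]),
    "orange": naive_score(observed, p_given_orange, prior["orange"]),
}
print(max(scores, key=scores.get))   # "apple" wins: 0.168 vs. 0.0225
```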
9. What is Bayes' theorem? Describe the working of the Naïve Bayes classifier. What is data
preprocessing? 5+5+5=15
ans:
Bayes' Theorem:
Bayes' theorem is also known as Bayes' rule or Bayes' law. It is used to
determine the probability of a hypothesis with prior knowledge, and it depends on
conditional probability. It can be written as:
P(A|B) = P(B|A) P(A) / P(B)
Where,
P(A|B) is the posterior probability: the probability of hypothesis A given the observed evidence B.
P(B|A) is the likelihood probability: the probability of the evidence given that the hypothesis is true.
P(A) is the prior probability: the probability of the hypothesis before observing the evidence.
P(B) is the marginal probability: the probability of the evidence.
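A tiny worked example of the formula, with made-up probabilities:

```python
# Bayes' theorem with made-up numbers: P(A|B) = P(B|A) * P(A) / P(B).
p_A = 0.01           # prior probability of the hypothesis
p_B_given_A = 0.9    # likelihood of the evidence if the hypothesis is true
p_B_given_not_A = 0.05
p_B = p_B_given_A * p_A + p_B_given_not_A * (1 - p_A)   # marginal probability of the evidence
p_A_given_B = p_B_given_A * p_A / p_B                   # posterior probability
print(round(p_A_given_B, 3))                            # 0.154
```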
(ii)
Data preprocessing is the process of preparing raw data and making it suitable for a
machine learning model. It is the first and a crucial step when creating a machine learning
model.
When creating a machine learning project, it is not always the case that we come across
clean and formatted data. And while doing any operation with data, it is necessary to clean
it and put it in a formatted way. For this, we use data preprocessing tasks.
It involves steps such as:
• Getting the dataset
• Importing libraries
• Importing datasets
• Handling missing data
• Feature scaling
1) Get the Dataset
To create a machine learning model, the first thing we require is a dataset, as a machine
learning model works entirely on data. The collected data for a particular problem, in a
proper format, is known as the dataset.
Datasets may come in different formats for different purposes; for example, a dataset created
for a business purpose will be different from the dataset required for a liver-patient problem,
so each dataset is different from another dataset. To use the dataset in our code, we usually
put it into a CSV file. However, sometimes we may also need to use an HTML or xlsx file.
2) Importing Libraries
In order to perform data preprocessing using Python, we need to import some predefined
Python libraries. These libraries are used to perform some specific jobs.
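A minimal sketch of the usual imports; the aliases (np, plt, pd) are conventional choices, and "Data.csv" is a placeholder file name, not a file referenced by this answer.

```python
# The libraries most commonly imported for preprocessing.
import numpy as np               # mathematical operations on arrays
import matplotlib.pyplot as plt  # plotting charts of the data
import pandas as pd              # importing and managing the dataset

dataset = pd.read_csv("Data.csv")   # placeholder dataset file
print(dataset.head())
```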
There are mainly two ways to handle missing data:
By deleting the particular row: This is the way commonly used to deal with null values. Here,
we just delete the specific row or column which consists of null values. However, this way is
not very efficient, and removing data may lead to a loss of information, which will not give
an accurate output.
By calculating the mean: In this way, we calculate the mean of the column or row which
contains a missing value and put it in place of the missing value. This strategy is
useful for features that have numeric data, such as age, salary, year, etc. Here, we will
use this approach.
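Both strategies can be sketched in a few lines of pandas; the DataFrame and its "age" and "salary" columns below are hypothetical examples.

```python
# Deleting rows with nulls vs. imputing with the column mean, on made-up data.
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, 30, np.nan, 40],
                   "salary": [40000, np.nan, 52000, 61000]})

dropped = df.dropna()                             # option 1: delete rows with nulls
imputed = df.fillna(df.mean(numeric_only=True))   # option 2: replace nulls with column means
print(imputed)
```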
10.
The following is an example of divisive clustering.
Distance a b c d e
a 0 2 6 10 9
b 2 0 5 9 8
c 6 5 0 4 5
d 10 9 4 0 3
e 9 8 5 3 0
Ans:
Step 1: Split the whole data into 2 clusters.
(ii)
Building an MST (Minimum Spanning Tree) is a method for constructing a hierarchy of
clusters.
It starts with a tree that consists of a single point p. In successive steps, it looks for the closest
pair of points (p,q) such that p is in the current tree but q is not, adds q to the tree, and puts
an edge between p and q.
Once the MST is built, clusters are obtained by removing the inconsistent edges, typically the
longest ones. The definition of inconsistency varies; for example, we can also use a local
notion of inconsistency that compares an edge with the edges near it.
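A small sketch of the tree-growing step described above, run on the 5-point distance matrix from question 10:

```python
# Growing the tree as described above (Prim-style), using the distance
# matrix of question 10.
dist = {
    "a": {"a": 0,  "b": 2, "c": 6, "d": 10, "e": 9},
    "b": {"a": 2,  "b": 0, "c": 5, "d": 9,  "e": 8},
    "c": {"a": 6,  "b": 5, "c": 0, "d": 4,  "e": 5},
    "d": {"a": 10, "b": 9, "c": 4, "d": 0,  "e": 3},
    "e": {"a": 9,  "b": 8, "c": 5, "d": 3,  "e": 0},
}

def build_mst(dist, start="a"):
    in_tree, edges = {start}, []
    while len(in_tree) < len(dist):
        # closest pair (p, q) with p in the current tree and q outside it
        p, q = min(((p, q) for p in in_tree for q in dist if q not in in_tree),
                   key=lambda pq: dist[pq[0]][pq[1]])
        edges.append((p, q, dist[p][q]))
        in_tree.add(q)
    return edges

print(build_mst(dist))   # [('a', 'b', 2), ('b', 'c', 5), ('c', 'd', 4), ('d', 'e', 3)]
```

Removing the longest edge of this tree (b-c, with weight 5) splits the points into {a, b} and {c, d, e}, which is one way to obtain the first two clusters for question 10.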
11. Compare and contrast closeness centrality and betweenness centrality. Explain the
significance of the betweenness centrality measure and focus on at least two applications.
7+8=15
Ans:
Comparison between Closeness Centrality and Betweenness Centrality:
• Closeness centrality measures how easily an actor (node) in a social network can
interact with all other actors. It is based on closeness, and closeness can be measured
using distance. Betweenness centrality, in contrast, measures how much control an
actor (node) that lies on the paths between two other actors has over the interaction
of those actors.
• The shortest distance between two actors is used to measure closeness centrality,
whereas the number of shortest paths between two actors that pass through a node is
used to measure betweenness centrality.
• Closeness centrality can be computed only for a connected graph, but betweenness
centrality can be calculated for both connected and disconnected graphs. Both
centralities can be computed for directed and undirected graphs.
• In closeness centrality, the same equation can be used for both directed and undirected
graphs, whereas in standardized betweenness centrality the equation differs for
directed and undirected graphs: for directed graphs the standardized
betweenness centrality is two times that for undirected graphs.
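Both measures are easy to compute in practice; below is a minimal sketch using the networkx library (assumed to be installed) on a small made-up friendship graph.

```python
# Closeness vs. betweenness centrality on a tiny, made-up graph.
import networkx as nx

G = nx.Graph()
G.add_edges_from([("A", "B"), ("B", "C"), ("C", "D"), ("B", "D"), ("D", "E")])

print(nx.closeness_centrality(G))    # based on shortest-path distances to every other node
print(nx.betweenness_centrality(G))  # based on how often a node lies on shortest paths
```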
12. Describe the different types of Social Network Analysis. Differentiate between Ego
network analysis and Complete network analysis. 10+5=15
ans:
Social networks are networks that depict the relations between people in the form
of a graph for different kinds of analysis. The graph used to store the relationships of people
is known as a sociogram. All the graph points and lines are stored in a matrix data
structure called a sociomatrix. The relationships can be of any kind, such as kinship,
friendship, enmity, acquaintance, colleagues, neighbors, disease transmission, etc.
Social Network Analysis (SNA) is the process of exploring or examining the social structure
by using graph theory. It is used for measuring and analyzing the structural properties of the
network. It helps to measure relationships and flows between groups, organizations, and
other connected entities. We need specialized tools to study and analyze social networks.
Basically, there are two types of social networks:
• Ego network Analysis
• Complete network Analysis
1. Ego Network Analysis
Ego network analysis is the one that finds the relationships among people. The analysis is
done for a particular sample of people chosen from the whole population; this sampling is
done randomly to analyze the relationships. The attributes involved in ego network
analysis are the size and diversity of a person's network, etc.
This analysis is usually done through traditional surveys, in which people are asked with
whom they interact and what kind of relationship they have with them.
It does not focus on finding the relationship between everyone in the sample; rather, it is an
effort to find the density of the network in those samples. Hypotheses about the network are
then tested using statistical hypothesis-testing techniques.
(ii)
In complete network analysis, the majority of the analysis is done for a particular domain or one
organization; it is not focused on the relationships between organizations. Many social network
analysis measures can only be computed with complete network analysis.
ans:
There are several different algorithms used for frequent pattern mining, including:
1. Apriori algorithm: This is one of the most commonly used algorithms for frequent pattern
mining. It uses a "bottom-up" approach to identify frequent itemsets and then generates
association rules from those itemsets.
2. ECLAT algorithm: This algorithm uses a "depth-first search" approach to identify
frequent itemsets. It is particularly efficient for datasets with a large number of items.
3. FP-growth algorithm: This algorithm uses a "compression" technique to find frequent
patterns efficiently. It is particularly efficient for datasets with a large number of
transactions.
Frequent pattern mining has many applications, such as Market Basket Analysis,
Recommender Systems, Fraud Detection, and many more.
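As a brute-force sketch of the frequent-itemset idea behind these algorithms (real Apriori/ECLAT/FP-growth implementations prune candidates instead of enumerating every combination), with a made-up four-transaction database:

```python
# Brute-force frequent-itemset counting over a made-up transaction database.
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
]
min_support = 0.5
items = sorted(set().union(*transactions))

frequent = {}
for size in range(1, len(items) + 1):
    for candidate in combinations(items, size):
        support = sum(set(candidate) <= t for t in transactions) / len(transactions)
        if support >= min_support:
            frequent[candidate] = support

print(frequent)   # every itemset appearing in at least half of the transactions
```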
Advantages:
1. It can find useful information which is not visible in simple data browsing.
2. It can find interesting associations and correlations among data items.
Disadvantages:
The increasing power of computer technology creates large amounts of data and storage.
Databases are growing rapidly; in this computerized world everything is shifting online,
and data is emerging as a new currency. Data comes in different shapes and sizes and is
collected in different ways. Data mining brings many benefits: it helps us to improve a
particular process and, in some cases, leads to cost savings or revenue generation.
Data mining is commonly used to search a large amount of data for patterns and trends, and
it does not stop at searching; it uses the data for further processing and to develop actionable
insights.
Data mining is the process of converting raw data into suitable patterns based
on trends.
Data mining deals with different types of patterns, and frequent pattern mining is one of them. This
concept was introduced for mining transaction databases. Frequent patterns are
patterns (such as items, subsequences, or substructures) that appear frequently in a
database. Frequent pattern mining is an analytical process that finds frequent patterns,
associations, or causal structures in various databases. The aim of this process is to find the
frequently occurring items in a transaction. Through frequent patterns, we can identify strongly
correlated items and the similar characteristics and associations among them. After
frequent pattern mining, we can go further into clustering and association analysis.
Frequent pattern mining plays a major role in finding associations and correlations, and it
discloses an intrinsic and important property of a dataset.
Frequent pattern mining can be done using association rules with particular algorithms, such as
the ECLAT and Apriori algorithms. It searches for recurring relationships in a data set and also
helps to find the inherent regularities in the data.
(ii)
The Classification algorithm is a Supervised Learning technique that is used to identify the
category of new observations on the basis of training data. In Classification, a program learns
from the given dataset or observations and then classifies new observations into a number of
classes or groups, such as Yes or No, 0 or 1, Spam or Not Spam, cat or dog, etc. Classes
can be called targets, labels, or categories.
Unlike regression, the output variable of Classification is a category, not a value, such as
"Green or Blue", "fruit or animal", etc. Since the Classification algorithm is a Supervised
learning technique, hence it takes labeled input data, which means it contains input with the
corresponding output.
The main goal of the Classification algorithm is to identify the category of a given dataset,
and these algorithms are mainly used to predict the output for the categorical data.
Classification algorithms can be better understood using the below diagram. In the below
diagram, there are two classes, Class A and Class B. These classes have features that are
similar within each class and dissimilar to the other class.
The algorithm which implements the classification on a dataset is known as a classifier. There
are two types of Classifications:
Binary Classifier: If the classification problem has only two possible outcomes, then
it is called a Binary Classifier.
Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.
Multi-class Classifier: If a classification problem has more than two outcomes, then
it is called a Multi-class Classifier.
Example: Classifications of types of crops, Classification of types of music.
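A small illustration of the two cases with scikit-learn (assumed to be available); the toy feature vectors and class labels are made up.

```python
# A toy binary and multi-class classifier.
from sklearn.tree import DecisionTreeClassifier

# Binary: two possible outcomes ("spam" / "not spam").
X_bin = [[0, 1], [1, 1], [5, 0], [6, 0]]
y_bin = ["not spam", "not spam", "spam", "spam"]
print(DecisionTreeClassifier().fit(X_bin, y_bin).predict([[5, 1]]))

# Multi-class: more than two outcomes (three crop types).
X_multi = [[1, 2], [2, 1], [6, 6], [7, 5], [1, 9], [2, 8]]
y_multi = ["wheat", "wheat", "rice", "rice", "maize", "maize"]
print(DecisionTreeClassifier().fit(X_multi, y_multi).predict([[6, 5]]))
```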
14. Difference between Lazy Learners and Eager Learners. Describe K-Nearest Neighbor(KNN)
Algorithm for Machine Learning.
10+5=15
Ans:
1. Lazy Learners: A lazy learner first stores the training dataset and waits until it
receives the test dataset. In the lazy learner's case, classification is done on the basis of the
most related data stored in the training dataset. It takes less time in training but more
time for predictions.
Example: K-NN algorithm, Case-based reasoning
2. Eager Learners: An eager learner develops a classification model based on the training
dataset before receiving the test dataset. Opposite to lazy learners, it takes more time in
learning and less time in prediction.
Example: Decision Trees, Naïve Bayes, ANN
(ii)
The K-NN algorithm assumes a similarity between the new case/data and the available
cases and puts the new case into the category that is most similar to the available
categories.
The K-NN algorithm stores all the available data and classifies a new data point based
on its similarity to the stored data. This means that when new data appears, it can be easily
classified into a well-suited category by using the K-NN algorithm.
K-NN algorithm can be used for Regression as well as for Classification but mostly it
is used for the Classification problems.
It is also called a lazy learner algorithm because it does not learn from the training
set immediately; instead, it stores the dataset and, at the time of classification,
performs an action on the dataset.
At the training phase, the KNN algorithm just stores the dataset, and when it gets new
data, it classifies that data into a category that is most similar to the new data.
Example: Suppose we have an image of a creature that looks similar to both a cat and a
dog, but we want to know whether it is a cat or a dog. For this identification, we can
use the KNN algorithm, as it works on a similarity measure. Our KNN model will find
the features of the new data that are most similar to the cat and dog images and, based
on the most similar features, will put it in either the cat or the dog category.
Why do we need a K-NN algorithm?
Suppose there are two categories, Category A and Category B, and we have a new data point
x1. In which of these categories will this data point lie? To solve this type of problem, we need
a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a
particular data point. Consider the below diagram:
Step-1: Select the number K of neighbors.
Step-2: Calculate the Euclidean distance between the new data point and the existing data
points.
Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
Step-4: Among these K neighbors, count the number of data points in each
category.
Step-5: Assign the new data point to the category for which the number of
neighbors is maximum.
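A minimal sketch of these steps with scikit-learn's KNeighborsClassifier (assumed to be available); the two-feature points for Category A and Category B are made up.

```python
# K-NN following the steps above on made-up two-feature data.
from sklearn.neighbors import KNeighborsClassifier

X = [[1, 2], [2, 1], [1, 1],      # Category A
     [6, 6], [7, 5], [6, 7]]      # Category B
y = ["A", "A", "A", "B", "B", "B"]

knn = KNeighborsClassifier(n_neighbors=3)   # Step 1: K = 3 (Euclidean distance by default)
knn.fit(X, y)                               # a lazy learner: this mostly just stores the data
print(knn.predict([[5, 5]]))                # Steps 2-5: the 3 nearest neighbours vote -> ['B']
```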
ans:
For retail data, sequential patterns are useful for shelf placement and promotions.
This industry, along with telecommunications and other businesses, can also use sequential
patterns for targeted marketing, customer retention, and other tasks.
There are several areas in which sequential patterns can be used such as Web access
pattern analysis, weather prediction, production processes, and web intrusion detection.
Given a set of sequences, where each sequence consists of a list of events (or elements) and
each event consists of a set of items, and given a user-specified minimum support
threshold min_sup, sequential pattern mining discovers all frequent subsequences, i.e., the
subsequences whose occurrence frequency in the set of sequences is no less than
min_sup.
Let I = {I1, I2, ..., Ip} be the set of all items. An itemset is a nonempty set of items. A
sequence is an ordered list of events. A sequence s is denoted ⟨e1, e2, e3, ..., el⟩, where
event e1 occurs before e2, which occurs before e3, and so on. Event ej is also known as an
element of s.
In the case of customer purchase data, an event represents a shopping trip in which a
customer purchased items at a particular store. The event is an itemset, i.e., an unordered list
of items that the customer purchased during the trip. An itemset (or event) is denoted
(x1x2···xq), where xk is an item.
An item can occur at most once in an event of a sequence, but it can occur several times in
different events of a sequence. The number of instances of items in a sequence is known as
the length of the sequence, and a sequence of length l is called an l-sequence.
A sequence database, S, is a set of tuples (SID, s), where SID is a sequence ID and s is
a sequence. For instance, S contains the sequences for all customers of the store. A tuple
(SID, s) is said to contain a sequence α if α is a subsequence of s.
In DNA sequence analysis, approximate patterns become helpful because DNA sequences
can include (symbol) insertions, deletions, and mutations. Such diverse requirements can
be considered as constraint relaxation or application.
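The containment test and support count defined above can be sketched directly in code; the tiny sequence database and the candidate subsequence below are made-up examples.

```python
# Counting the support of a candidate subsequence in a made-up sequence database.
def contains(sequence, candidate):
    """True if `candidate` is a subsequence of `sequence`: each candidate
    event must be a subset of some later event, in order."""
    i = 0
    for event in sequence:
        if i < len(candidate) and candidate[i] <= event:   # subset test
            i += 1
    return i == len(candidate)

database = {
    1: [{"bread"}, {"milk", "butter"}, {"beer"}],
    2: [{"bread", "milk"}, {"beer", "chips"}],
    3: [{"milk"}, {"bread"}],
}
candidate = [{"bread"}, {"beer"}]
support = sum(contains(s, candidate) for s in database.values())
print(support)   # 2 of the 3 sequences contain <{bread}, {beer}>
```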
(ii)
Linear regression is one of the easiest and most popular machine learning algorithms. It is a
statistical method that is used for predictive analysis. Linear regression makes predictions for
continuous/real or numeric variables such as sales, salary, age, product price, etc.
The linear regression algorithm shows a linear relationship between a dependent (y) variable and
one or more independent (x) variables, hence the name linear regression. Since linear regression
shows a linear relationship, it finds how the value of the dependent variable changes according to
the value of the independent variable.
The linear regression model provides a sloped straight line representing the relationship between
the variables. Consider the below image:
Mathematically, we can represent a linear regression as:
y = a0 + a1x + ε
Here, y is the dependent (target) variable, x is the independent (predictor) variable, a0 is the
intercept of the line, a1 is the linear regression coefficient (the slope of the line), and ε is the
random error.
The values for the x and y variables are the training dataset used for the linear regression model
representation.
Different values of the weights or line coefficients (a0, a1) give a different line of
regression, so we need to calculate the best values for a0 and a1 to find the best-fit line; to
do this, we use a cost function.
Cost function-
The different values of the weights or line coefficients (a0, a1) give different lines of
regression, and the cost function is used to estimate the values of the coefficients
for the best-fit line.
We can use the cost function to find the accuracy of the mapping function, which
maps the input variable to the output variable. This mapping function is also known
as Hypothesis function.
For linear regression, we use the Mean Squared Error (MSE) cost function, which is the
average of the squared errors between the predicted values and the actual values. It can be
written as:
MSE = (1/N) Σ (yi − (a0 + a1xi))²
Where N is the total number of observations, yi is the actual value, and (a0 + a1xi) is the
predicted value.
Residuals: The distance between an actual value and the predicted value is called a residual. If
the observed points are far from the regression line, the residuals will be high and so the
cost function will be high. If the scatter points are close to the regression line, the residuals
will be small and hence the cost function will be small.
Gradient Descent:
Gradient descent is used to minimize the MSE by calculating the gradient of the
cost function.
A regression model uses gradient descent to update the coefficients of the line by
reducing the cost function.
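A minimal gradient-descent sketch for simple linear regression, minimizing the MSE cost above on made-up data (the learning rate and iteration count are arbitrary choices):

```python
# Gradient descent for simple linear regression (y = a0 + a1*x) on made-up data.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.1, 10.8])   # roughly y = 1 + 2x

a0, a1 = 0.0, 0.0            # initial intercept and slope
learning_rate = 0.01

for _ in range(5000):
    error = (a0 + a1 * x) - y          # predicted minus actual
    grad_a0 = 2 * error.mean()         # d(MSE)/d(a0)
    grad_a1 = 2 * (error * x).mean()   # d(MSE)/d(a1)
    a0 -= learning_rate * grad_a0
    a1 -= learning_rate * grad_a1

print(round(a0, 2), round(a1, 2))      # approaches the least-squares intercept and slope
```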
K-Medoids and K-Means are two types of clustering mechanisms in partition clustering. Clustering
is the process of breaking down an abstract group of data points/objects into classes of similar
objects such that all the objects in one cluster have similar traits; in partition clustering, a group
of n objects is broken down into k clusters based on their similarities.
Two statisticians, Leonard Kaufman and Peter J. Rousseeuw, came up with the K-Medoids method.
The following explains what K-Medoids does, its applications, and the difference between K-Means
and K-Medoids.
1. Choose k number of random points (Data point from the data set or some other
points). These points are also called "Centroids" or "Means".
2. Assign all the data points in the data set to the closest centroid by applying any
distance formula like Euclidian distance, Manhattan distance, etc.
3. Now, choose new centroids by calculating the mean of all the data points in the
clusters and go to step 2.
4. Continue step 3 until no data point changes classification between two iterations.
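A compact sketch of these four steps with NumPy; the 2-D points and the choice k = 2 are made up for illustration.

```python
# The four K-Means steps above with NumPy on made-up 2-D points.
import numpy as np

points = np.array([[1.0, 1.0], [1.5, 2.0], [1.2, 0.8],    # one dense group
                   [8.0, 8.0], [8.5, 7.5], [9.0, 8.2]])   # another dense group
k = 2
rng = np.random.default_rng(0)
centroids = points[rng.choice(len(points), size=k, replace=False)]   # step 1

for _ in range(100):
    # Step 2: assign every point to its closest centroid (Euclidean distance).
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Step 3: recompute each centroid as the mean of the points assigned to it.
    new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
    # Step 4: stop once the assignments (and hence the centroids) no longer change.
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print(labels)      # cluster index of each point
print(centroids)   # final cluster means
```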
The problem with the K-Means algorithm is that it cannot handle outlier data well.
An outlier is a point different from the rest of the points. Outlier data points either end up in
their own cluster or pull the mean of an existing cluster far away from its dense region.
Hence, K-Means clustering is highly affected by outlier data.
K-Medoids:
Medoid: A Medoid is a point in the cluster from which the sum of distances to other data
points is minimal.
(or)
A Medoid is a point in the cluster from which dissimilarities with all the other points in the
clusters are minimal.
PAM is the most powerful of the three K-Medoids algorithms (PAM, CLARA, and CLARANS), but it
has the disadvantage of higher time complexity. The following K-Medoids procedure is performed
using PAM; CLARA and CLARANS are more scalable variants of it.
Algorithm:
Given the value of k and unlabelled data:
1. Choose k number of random points from the data and assign these k points to k
number of clusters. These are the initial medoids.
2. For all the remaining data points, calculate the distance from each medoid and
assign it to the cluster with the nearest medoid.
3. Calculate the total cost (Sum of all the distances from all the data points to the
medoids)
4. Select a random non-medoid point as a new medoid and swap it with a previous medoid,
then repeat steps 2 and 3.
5. If the total cost of the new medoid is less than that of the previous medoid, make
the new medoid permanent and repeat step 4.
6. If the total cost of the new medoid is greater than the cost of the previous medoid,
undo the swap and repeat step 4.
7. The repetitions continue until no swap of medoids changes how the data points are
classified.
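A compact sketch of this swap-based procedure on made-up one-dimensional data with k = 2; it keeps a swap only when it lowers the total cost, as in steps 4-6.

```python
# A PAM-style swap procedure on made-up one-dimensional data.
import numpy as np

points = np.array([1.0, 2.0, 3.0, 8.0, 9.0, 25.0])   # 25.0 is an outlier
k = 2
rng = np.random.default_rng(1)

def total_cost(medoid_idx):
    """Sum of distances from every point to its nearest medoid."""
    dists = np.abs(points[:, None] - points[medoid_idx][None, :])
    return dists.min(axis=1).sum()

medoids = list(rng.choice(len(points), size=k, replace=False))   # step 1
cost = total_cost(medoids)

improved = True
while improved:                              # steps 4-7: keep trying swaps
    improved = False
    for m in range(k):
        for candidate in range(len(points)):
            if candidate in medoids:
                continue
            trial = medoids.copy()
            trial[m] = candidate             # swap one medoid for a non-medoid point
            trial_cost = total_cost(trial)
            if trial_cost < cost:            # keep the swap only if it lowers the cost
                medoids, cost = trial, trial_cost
                improved = True

print(points[medoids], cost)   # the final medoids are actual data points
```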
(ii)
K-Means vs. K-Medoids:
• Both methods are types of partition clustering.
• Both algorithms group n objects into k clusters based on similar traits, where k is pre-defined.
• In K-Means, a centroid can be a data point or some other point, whereas in K-Medoids a
medoid is always an actual data point in the cluster.
• K-Means can't cope with outlier data, whereas K-Medoids can manage outlier data too.
• Sometimes, the outlier sensitivity of K-Means can turn out to be useful, whereas K-Medoids
has a tendency to ignore meaningful clusters in some data sets.
10+5 =15
ans:
Regression and Classification algorithms are both supervised learning algorithms. Both are
used for prediction in machine learning and work with labeled datasets. The difference
between them lies in how they are used for different machine learning problems.
The main difference between Regression and Classification algorithms is that Regression
algorithms are used to predict continuous values such as price, salary, age, etc., while
Classification algorithms are used to predict or classify discrete values such as Male or
Female, True or False, Spam or Not Spam, etc.
Classification:
Classification is the process of finding a function which helps in dividing the dataset into
classes based on different parameters. In classification, a computer program is trained on
the training dataset and, based on that training, categorizes the data into different classes.
The task of the classification algorithm is to find the mapping function that maps the input (x)
to the discrete output (y).
Example: The best example to understand the Classification problem is Email Spam
Detection. The model is trained on the basis of millions of emails on different parameters,
and whenever it receives a new email, it identifies whether the email is spam or not. If the
email is spam, then it is moved to the Spam folder.
Some popular classification algorithms are:
• Logistic Regression
• K-Nearest Neighbours
• Kernel SVM
• Naïve Bayes
Regression:
Regression is a process of finding the correlations between dependent and independent
variables. It helps in predicting the continuous variables such as prediction of Market
Trends, prediction of House prices, etc.
The task of the Regression algorithm is to find the mapping function to map the input
variable(x) to the continuous output variable(y).
Example: Suppose we want to do weather forecasting, so for this, we will use the Regression
algorithm. In weather prediction, the model is trained on the past data, and once the training
is completed, it can easily predict the weather for future days.
Polynomial Regression is one example of a regression algorithm.
• The task of the regression algorithm is to map the input value (x) to the continuous output
value (y), whereas the task of the classification algorithm is to map the input value (x) to the
discrete output value (y).
• Regression algorithms are used with continuous data, whereas classification algorithms are
used with discrete data.
• In regression, we try to find the best-fit line, which can predict the output more accurately,
whereas in classification, we try to find the decision boundary, which can divide the dataset
into different classes.
Linear Regression and Logistic Regression are two famous machine learning algorithms
that come under the supervised learning technique. Since both algorithms are supervised
in nature, they use labeled datasets to make predictions. The main difference between them
is how they are used: Linear Regression is used for solving regression problems, whereas
Logistic Regression is used for solving classification problems. The description of both
algorithms is given below, along with a comparison.
Linear Regression:
Linear Regression is one of the simplest machine learning algorithms. It comes
under the supervised learning technique and is used for solving regression problems.
It is used for predicting a continuous dependent variable with the help of
independent variables.
The goal of linear regression is to find the best-fit line that can accurately
predict the output for the continuous dependent variable.
If a single independent variable is used for prediction, it is called Simple Linear
Regression, and if there is more than one independent variable, the regression
is called Multiple Linear Regression.
By finding the best-fit line, the algorithm establishes the relationship between the dependent
variable and the independent variables, and this relationship should be linear in nature.
The output for Linear regression should only be the continuous values such as price,
age, salary, etc. The relationship between the dependent variable and independent
variable can be shown in below image:
In the above image, the dependent variable (salary) is on the Y-axis and the independent
variable (experience) is on the X-axis. The regression line can be written as:
y = a0 + a1x + ε
Logistic Regression:
Logistic regression is one of the most popular machine learning algorithms that
come under the supervised learning technique.
It can be used for classification as well as for regression problems, but it is mainly used
for classification problems.
Logistic regression is used to predict a categorical dependent variable with the
help of independent variables.
The output of a logistic regression problem can only be between 0 and 1.
Logistic regression can be used where probabilities between two classes are
required, such as whether it will rain today or not, either 0 or 1, true or false, etc.
• Linear regression is used for solving regression problems, whereas logistic regression is used
for solving classification problems.
• In linear regression, we predict the value of continuous variables, whereas in logistic
regression, we predict the values of categorical variables.
• In linear regression, we find the best-fit line, by which we can easily predict the output,
whereas in logistic regression, we find the S-curve, by which we can classify the samples.
• The least squares estimation method is used in linear regression for estimating the
coefficients, whereas the maximum likelihood estimation method is used in logistic regression.
• Linear regression requires a linear relationship between the dependent variable and the
independent variables, whereas logistic regression does not require a linear relationship
between the dependent and independent variables.
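A small side-by-side sketch with scikit-learn (assumed to be available): a linear model fit to a continuous salary target and a logistic model fit to a binary pass/fail target; all of the toy numbers are made up.

```python
# Linear vs. logistic regression on made-up data.
from sklearn.linear_model import LinearRegression, LogisticRegression

experience = [[1], [2], [3], [4], [5]]
salary = [30000, 35000, 41000, 45000, 52000]           # continuous target
lin = LinearRegression().fit(experience, salary)
print(lin.predict([[6]]))                              # predicts a continuous value

hours_studied = [[1], [2], [3], [7], [8], [9]]
passed = [0, 0, 0, 1, 1, 1]                            # categorical target (0 or 1)
log = LogisticRegression().fit(hours_studied, passed)
print(log.predict([[6]]), log.predict_proba([[6]]))    # a class label and its probability
```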