
UNIT 3

CLUSTERING

Clustering is the task of dividing the population or data points into a number of groups such that data
points in the same group are more similar to one another than to data points in other groups. It is
basically a grouping of objects on the basis of the similarity and dissimilarity between them.

For example, the data points in the graph below that cluster together can be classified into one single group.

Clustering Methods :

 Density-Based Methods: Density-based clustering refers to one of the most popular unsupervised
learning methodologies used in model building and machine learning algorithms. The data points in
the low-density region separating two clusters are considered noise. These methods
treat the clusters as dense regions that have some similarities to each other and differences from the lower-
density regions of the space. These methods have good accuracy and the ability to merge two
clusters. Examples: DBSCAN (Density-Based Spatial Clustering of Applications with
Noise), OPTICS (Ordering Points to Identify Clustering Structure), etc. A minimal usage sketch follows below.
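
Below is a minimal sketch of density-based clustering using scikit-learn's DBSCAN; the synthetic blobs and the eps and min_samples values are illustrative assumptions, not part of the original notes.

import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus a few scattered points that should end up as noise
rng = np.random.RandomState(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.3, size=(50, 2)),
    rng.uniform(low=-2, high=7, size=(10, 2)),
])

db = DBSCAN(eps=0.5, min_samples=5).fit(X)
labels = db.labels_  # cluster index per point; -1 marks noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters, "noise points:", list(labels).count(-1))

Points falling in the sparse region between the two blobs receive the label -1, matching the idea above that low-density regions are treated as noise.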

Partitioning Clustering
It is a type of clustering that divides the data into non-hierarchical groups. It is also known as
the centroid-based method. The most common example of partitioning clustering is the K-Means
Clustering algorithm.

In this type, the dataset is divided into a set of k groups, where k defines the number of pre-defined
groups. Each cluster centre is created in such a way that the distance between the data points of
one cluster and their own centroid is minimal compared with their distance to any other cluster centroid.

 Hierarchical Based Methods:

Hierarchical clustering is another unsupervised machine learning algorithm, which is used to group the
unlabeled datasets into clusters, and is also known as hierarchical cluster analysis or HCA. In this
algorithm, we develop the hierarchy of clusters in the form of a tree, and this tree-shaped structure is
known as the dendrogram. The clusters formed in this method form a tree-type structure based on the
hierarchy. New clusters are formed using the previously formed ones. It is divided into two categories:

https://www.kdnuggets.com/2019/09/hierarchical-clustering.html


 Agglomerative (bottom-up approach)


 Divisive (top-down approach)

1. Agglomerative:

Initially consider every data point as an individual Cluster and at every step, merge the nearest
pairs of the cluster. (It is a bottom-up method).
Algorithm for Agglomerative Hierarchical Clustering is:
 Calculate the similarity of one cluster with all the other clusters (calculate proximity matrix)
 Consider every data point as an individual cluster
 Merge the clusters which are highly similar or close to each other.
 Recalculate the proximity matrix for each cluster
 Repeat steps 3 and 4 until only a single cluster remains.
Let’s see the graphical representation of this algorithm using a dendrogram.

Figure – Agglomerative Hierarchical clustering
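
As a rough sketch (the sample points are assumptions), the agglomerative procedure and its dendrogram can be produced with SciPy:

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# Build the bottom-up merge history; 'ward' is one common linkage choice
Z = linkage(X, method="ward")

# Cut the tree to obtain 2 flat clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # e.g. [1 1 1 2 2 2]

# dendrogram(Z) draws the tree-shaped structure when a plotting backend is available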

2. Divisive:

We can say that Divisive Hierarchical clustering is precisely the opposite of
Agglomerative Hierarchical clustering. In Divisive Hierarchical clustering, we start with
all of the data points as a single cluster and, in every iteration, we separate out the data points
that are not comparable with the rest. In the end, we are left with N clusters.

Figure – Divisive Hierarchical clustering

Measure for the distance between two clusters

As we have seen, the closest distance between the two clusters is crucial for the hierarchical clustering.
There are various ways to calculate the distance between two clusters, and these ways decide the rule
for clustering. These measures are called Linkage methods. Some of the popular linkage methods are
given below:

1. Single Linkage: It is the Shortest Distance between the closest points of the clusters. Consider
the below image:
2. Complete Linkage: It is the farthest distance between the two points of two different clusters. It
is one of the popular linkage methods as it forms tighter clusters than single-linkage.

3. Average Linkage: It is the linkage method in which the distance between each pair of points
(one from each cluster) is added up and then divided by the total number of such pairs to calculate the
average distance between two clusters. It is also one of the most popular linkage methods.
4. Centroid Linkage: It is the linkage method in which the distance between the centroids of the
clusters is calculated. A short comparison sketch of these linkage options is given after this list.
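
The sketch below (toy data and cluster count are assumptions) shows how the linkage choice is passed to scikit-learn's AgglomerativeClustering; note that scikit-learn exposes single, complete, average and ward linkage, while centroid linkage is available through SciPy's linkage(..., method="centroid").

import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[0, 0], [0, 1], [1, 0],
              [5, 5], [5, 6], [6, 5]], dtype=float)

for linkage_method in ("single", "complete", "average", "ward"):
    model = AgglomerativeClustering(n_clusters=2, linkage=linkage_method)
    # fit_predict returns the cluster label of each point under this linkage rule
    print(linkage_method, model.fit_predict(X))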

 Partitioning Methods: These methods partition the objects into k clusters and each partition
forms one cluster. They are used to optimize an objective criterion (a similarity function), for
example when distance is the major parameter. Examples: K-means, CLARANS (Clustering Large
Applications based upon Randomized Search), etc.

K-MEANS
K-Means Clustering is an unsupervised learning algorithm that is used to solve the clustering problems
in machine learning or data science. Here K defines the number of pre-defined clusters that need to be
created in the process, as if K=2, there will be two clusters, and for K=3, there will be three clusters, and
so on. The main aim of this algorithm is to minimize the sum of distances between the data points and
their corresponding cluster centroids.

The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and
repeats the process until it finds the best clusters. The value of k should be predetermined in this
algorithm.
The below diagram explains the working of the K-means Clustering Algorithm:

How does the K-Means Algorithm Work?

The working of the K-Means algorithm is explained in the below steps:

Step-1: Select the number K to decide the number of clusters.

Step-2: Select K random points or centroids. (They need not be points from the input dataset.)

Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.

Step-4: Calculate the variance and place a new centroid of each cluster.

Step-5: Repeat the third step, which means reassign each data point to the new closest centroid of each
cluster.

Step-6: If any reassignment occurs, then go to step-4 else go to FINISH.

Step-7: The model is ready.
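
A compact sketch of this workflow with scikit-learn is shown below; the sample points and K=2 are assumptions chosen to mirror the two-cluster walk-through that follows.

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
              [8.0, 8.0], [8.5, 7.5], [7.8, 8.3]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster index assigned to each point
print(kmeans.cluster_centers_)  # final centroids after convergence
print(kmeans.inertia_)          # the sum of squared distances being minimized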

Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two variables is given
below:
o Let's take the number of clusters k, i.e., K=2, to identify the dataset and to put the points into different
clusters. It means here we will try to group these data points into two different clusters.
o We need to choose some random k points or centroids to form the clusters. These points can be
either points from the dataset or any other points. So, here we are selecting the below two
points as k points, which are not part of our dataset. Consider the below image:

o Now we will assign each data point of the scatter plot to its closest K-point or centroid. We will
compute it by applying some mathematics that we have studied to calculate the distance between
two points. So, we will draw a median between both the centroids. Consider the below image:

From the above image, it is clear that the points on the left side of the line are nearer to the K1 or blue centroid,
and the points to the right of the line are closer to the yellow centroid. Let's color them blue and yellow for
clear visualization.
o As we need to find the closest cluster, we will repeat the process by choosing new centroids.
To choose the new centroids, we will compute the centre of gravity (the mean) of the points in each cluster, and will
find the new centroids as below:

o Next, we will reassign each datapoint to the new centroid. For this, we will repeat the same
process of finding a median line. The median will be like below image:

From the above image, we can see that one yellow point is on the left side of the line, and two blue points
are to the right of the line. So, these three points will be assigned to new centroids.

As reassignment has taken place, so we will again go to the step-4, which is finding new centroids or K-
points.

o We will repeat the process by finding the centre of gravity of the points in each cluster, so the new centroids will
be as shown in the below image:

o As we have got the new centroids, we will again draw the median line and reassign the data points. So,
the image will be:

o We can see in the above image that there are no data points on the wrong side of the line,
which means our model is formed. Consider the below image:

As our model is ready, we can now remove the assumed centroids, and the two final clusters will be
as shown in the below image:
USE CASES OF K-MEANS

1. Identifying crime localities


2. Call record detail analysis: Call detail records, when used with customer demographics, provide
greater insight into customers' needs. We can cluster customer activities over 24 hours using the
unsupervised k-means clustering algorithm, which helps us understand segments of customers with
respect to their usage by hour.
3. Document Classification : Cluster documents in multiple categories based on tags, topics, and the
content of the document. This is a very standard classification problem and k-means is a highly
suitable algorithm for this purpose.
4. Crime Analysis

5. Insurance Fraud Detection

CLASSIFICATION
What is the Classification Algorithm?

The Classification algorithm is a Supervised Learning technique that is used to identify the category of
new observations on the basis of training data. In Classification, a program learns from the given dataset
or observations and then classifies new observations into a number of classes or groups, such as Yes or
No, 0 or 1, Spam or Not Spam, cat or dog, etc. Classes can be called targets/labels or categories.

The best example of an ML classification algorithm is Email Spam Detector.

The main goal of the Classification algorithm is to identify the category of a given dataset, and these
algorithms are mainly used to predict the output for the categorical data.

Classification algorithms can be better understood using the below diagram. In the below diagram, there
are two classes, class A and Class B. These classes have features that are similar to each other and
dissimilar to other classes.
The algorithm which implements the classification on a dataset is known as a classifier. There are two
types of Classifications:

o Binary Classifier: If the classification problem has only two possible outcomes, then it is called
a Binary Classifier.
Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.
o Multi-class Classifier: If a classification problem has more than two outcomes, then it is called a
Multi-class Classifier.
Examples: classification of types of crops, classification of types of music.

Learners in Classification Problems:

In the classification problems, there are two types of learners:

1. Lazy Learners: A lazy learner first stores the training dataset and waits until it receives the test
dataset. In the lazy learner's case, classification is done on the basis of the most related data stored in
the training dataset. It takes less time in training but more time for predictions.
Examples: K-NN algorithm, case-based reasoning.
2. Eager Learners: Eager learners develop a classification model based on a training dataset before
receiving a test dataset. Opposite to lazy learners, an eager learner takes more time in learning
and less time in prediction. Examples: Decision Trees, Naïve Bayes, ANN.

Types of ML Classification Algorithms:

Classification algorithms can be mainly divided into two categories:

o Linear Models
    o Logistic Regression
    o Support Vector Machines
o Non-linear Models
    o K-Nearest Neighbours
    o Kernel SVM
    o Naïve Bayes
    o Decision Tree Classification
    o Random Forest Classification
    o Artificial Neural Networks

Evaluating a Classification model:

Confusion Matrix:

o The confusion matrix provides us a matrix/table as output and describes the performance of the
model.
o It is also known as the error matrix.
o The matrix summarizes the prediction results, giving the total number of
correct predictions and incorrect predictions. The matrix looks like the below table:

                     Actual Positive     Actual Negative
Predicted Positive   True Positive       False Positive
Predicted Negative   False Negative      True Negative
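
A short sketch of producing this matrix with scikit-learn is given below; the two label vectors are made-up values purely for illustration. Note that scikit-learn places actual classes on the rows and predicted classes on the columns, i.e. the transpose of the table above.

from sklearn.metrics import confusion_matrix

y_actual    = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_predicted = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# With labels=[1, 0] the output is [[TP, FN], [FP, TN]]
print(confusion_matrix(y_actual, y_predicted, labels=[1, 0]))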

Use cases of Classification Algorithms

Classification algorithms can be used in different places. Below are some popular use cases of
Classification Algorithms:

o Email Spam Detection


o Speech Recognition
o Identifications of Cancer tumor cells.
o Drugs Classification
o Biometric Identification, etc.

SVM
SVM stands for Support Vector Machine. It is a machine learning approach used for classification
and regression analysis. It depends on supervised learning models trained by learning algorithms,
which analyze large amounts of data to identify patterns.
An SVM divides the two categories by a clear gap that should be as wide as possible. It does this partitioning
with a plane called a hyperplane. An SVM creates hyperplanes that have the largest margin in a high-
dimensional space to separate the given data into classes. The margin between the two classes is the
distance between the closest data points of those classes, which the SVM makes as large as possible.
The larger the margin, the lower the generalization error of the classifier.

SVM Algorithm

To understand the algorithm of SVM, consider two cases:

Separable case – infinitely many boundaries are possible that separate the data into two classes.
Non-separable case – the two classes are not separable but overlap with each other.
What is a separating hyperplane?

A boundary that separates the two classes (a line in two dimensions, a plane in three, and a hyperplane in general) is called a separating hyperplane and is depicted below:


What is the optimal separating hyperplane?
The fact that you can find a separating hyperplane does not mean it is the best one! In the example
below there are several separating hyperplanes. Each of them is valid as it successfully separates our data
set with men on one side and women on the other side.

There can be a lot of separating hyperplanes

Suppose we select the green hyperplane and use it to classify real-life data.

This hyperplane does not generalize well

This time, it makes some mistakes as it wrongly classifies three women. Intuitively, we can see that if we
select a hyperplane which is close to the data points of one class, then it might not generalize well.
So we will try to select a hyperplane as far as possible from the data points of each category:

This one looks better. When we use it with real-life data, we can see it still makes perfect classifications.
The black hyperplane classifies more accurately than the green one

That's why the objective of an SVM is to find the optimal separating hyperplane: because it correctly
classifies the training data and because it is the one which will generalize better with unseen data.
What is the margin and how does it help choosing the optimal hyperplane?

The margin of our optimal hyperplane

Given a particular hyperplane, we can compute the distance between the hyperplane and the closest data
point. Once we have this value, if we double it we will get what is called the margin.
Basically the margin is a no man's land. There will never be any data point inside the margin. (Note:
this can cause some problems when data is noisy, and this is why soft margin classifier will be
introduced later)
For another hyperplane, the margin will look like this :
As you can see, Margin B is smaller than Margin A.

We can make the following observations:

If a hyperplane is very close to a data point, its margin will be small.


The further a hyperplane is from a data point, the larger its margin will be.
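
A minimal sketch of fitting a maximum-margin linear SVM with scikit-learn follows; the toy points and the large C value (used to approximate a hard margin) are assumptions for illustration.

import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 2], [2, 3], [2, 1],   # class 0
              [6, 5], [7, 7], [8, 6]])  # class 1
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)

print(clf.support_vectors_)          # the closest points, which define the margin
print(clf.coef_, clf.intercept_)     # w and b of the hyperplane w·x + b = 0
print(clf.predict([[3, 2], [7, 6]]))

The support vectors printed above are exactly the points lying on the edge of the margin; moving any other point would not change the chosen hyperplane.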

NAÏVE BAYES

Naive Bayes classifiers are a collection of classification algorithms based on Bayes’ Theorem. It is
not a single algorithm but a family of algorithms where all of them share a common principle, i.e.
every pair of features being classified is independent of each other. Let us go through some of the
simple concepts of probability that we will use. Consider the following example of tossing two coins. If
we toss two coins and look at all the different possibilities, we have the sample space: {HH, HT, TH,
TT}

While calculating the math on probability, we usually denote probability as P. Some of the probabilities
in this event would be as follows:

 The probability of getting two heads = 1/4


 The probability of at least one tail = 3/4
 The probability of the second coin being head given the first coin is tail = 1/2
 The probability of getting two heads given the first coin is a head = 1/2

The Bayes theorem gives us the conditional probability of event A, given that event B has occurred. In
this case, the first coin toss will be B and the second coin toss A. This could be confusing because we've
reversed the order of them and go from B to A instead of A to B.

Naïve Bayes is a classification technique based on applying Bayes’ theorem with a strong
assumption that all the predictors are independent of each other. In simple words, the assumption is that
the presence of a feature in a class is independent of the presence of any other feature in the same class.
For example, a phone may be considered smart if it has a touch screen, internet facility, a good
camera, etc. Though all these features depend on each other, they contribute independently to the
probability that the phone is a smartphone.

o Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes theorem
and used for solving classification problems.
o It is mainly used in text classification that includes a high-dimensional training dataset.
o Naïve Bayes Classifier is one of the simplest and most effective classification algorithms; it
helps in building fast machine learning models that can make quick predictions.
o It is a probabilistic classifier, which means it predicts on the basis of the probability of an
object.
o Some popular examples of Naïve Bayes Algorithm are spam filtration, Sentimental analysis,
and classifying articles.

Why is it called Naïve Bayes?

The Naïve Bayes algorithm is comprised of two words Naïve and Bayes, Which can be described as:

o Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is
independent of the occurrence of other features. For example, if a fruit is identified on the basis of
color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple. Hence each
feature individually contributes to identifying it as an apple, without depending on the others.
o Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.

Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine the
probability of a hypothesis with prior knowledge. It depends on the conditional probability.
o The formula for Bayes' theorem is given as:

P(A|B) = [P(B|A) × P(A)] / P(B)

Where,

P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.

P(B|A) is Likelihood probability: Probability of the evidence given that the probability of a hypothesis
is true.

P(A) is Prior Probability: Probability of hypothesis before observing the evidence.

P(B) is Marginal Probability: Probability of Evidence.


Working of Naïve Bayes' Classifier:

Working of Naïve Bayes' Classifier can be understood with the help of the below example:

Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using
this dataset, we need to decide whether we should play or not on a particular day according to the
weather conditions. To solve this problem, we need to follow the below steps:

1. Convert the given dataset into frequency tables.


2. Generate Likelihood table by finding the probabilities of given features.
3. Now, use Bayes theorem to calculate the posterior probability.

Problem: If the weather is sunny, should the player play or not?

Solution: To solve this, first consider the below dataset:

Outlook Play

0 Rainy Yes

1 Sunny Yes

2 Overcast Yes

3 Overcast Yes

4 Sunny No

5 Rainy Yes

6 Sunny Yes

7 Overcast Yes

8 Rainy No

9 Sunny No

10 Sunny Yes

11 Rainy No

12 Overcast Yes

13 Overcast Yes
Frequency table for the weather conditions:

Weather     Yes    No
Overcast     5      0
Rainy        2      2
Sunny        3      2
Total       10      4
Likelihood table for the weather conditions:

Weather     No            Yes           P(Weather)
Overcast    0             5             5/14 = 0.35
Rainy       2             2             4/14 = 0.29
Sunny       2             3             5/14 = 0.35
All         4/14 = 0.29   10/14 = 0.71

Applying Bayes' theorem:

P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)

P(Sunny|Yes) = 3/10 = 0.3, P(Sunny) = 0.35, P(Yes) = 0.71

So P(Yes|Sunny) = 0.3 * 0.71 / 0.35 = 0.60

P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny)

P(Sunny|No) = 2/4 = 0.5

P(No) = 0.29, P(Sunny) = 0.35, So P(No|Sunny) = 0.5 * 0.29 / 0.35 = 0.41

As we can see from the above calculation, P(Yes|Sunny) > P(No|Sunny).

Hence, on a sunny day, the player can play the game.
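
The same conclusion can be reproduced with a few lines of plain Python, using the counts from the tables above (this is just the hand calculation written out, not a library implementation):

p_yes, p_no = 10 / 14, 4 / 14       # priors P(Yes), P(No)
p_sunny = 5 / 14                    # evidence P(Sunny)
p_sunny_given_yes = 3 / 10          # likelihood P(Sunny|Yes)
p_sunny_given_no = 2 / 4            # likelihood P(Sunny|No)

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
p_no_given_sunny = p_sunny_given_no * p_no / p_sunny

print(round(p_yes_given_sunny, 2))  # 0.6
print(round(p_no_given_sunny, 2))   # 0.4 (the text gets 0.41 from rounded intermediate values)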

Advantages of Naïve Bayes Classifier:

o Naïve Bayes is one of the fastest and easiest ML algorithms to predict the class of a dataset.
o It can be used for Binary as well as Multi-class Classifications.
o It performs well in Multi-class predictions as compared to the other Algorithms.
o It is the most popular choice for text classification problems.

Disadvantages of Naïve Bayes Classifier:

o Naive Bayes assumes that all features are independent or unrelated, so it cannot learn the
relationship between features.

Applications of Naïve Bayes Classifier:

o It is used for Credit Scoring.


o It is used in medical data classification.
o It can be used in real-time predictions because Naïve Bayes Classifier is an eager learner.
o It is used in Text classification such as Spam filtering and Sentiment analysis.

Types of Naïve Bayes Model:

There are three types of Naive Bayes Model, which are given below:

o Gaussian: The Gaussian model assumes that features follow a normal distribution. This means
if predictors take continuous values instead of discrete, then the model assumes that these values
are sampled from the Gaussian distribution.
o Multinomial: The Multinomial Naïve Bayes classifier is used when the data is multinomially
distributed. It is primarily used for document classification problems, i.e. deciding which
category a particular document belongs to, such as Sports, Politics, Education, etc.
The classifier uses the frequency of words as the predictors.
o Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier, but the predictor
variables are independent Boolean variables, such as whether a particular word is present or not in
a document. This model is also famous for document classification tasks.
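
A hedged sketch of the three scikit-learn variants is given below; the toy feature matrices are assumptions that merely show the kind of input each model expects.

import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

y = np.array([0, 0, 1, 1])

X_continuous = np.array([[1.2, 3.4], [0.8, 2.9], [5.1, 7.2], [4.8, 6.9]])  # real-valued features
X_counts     = np.array([[3, 0, 1], [2, 1, 0], [0, 4, 2], [1, 3, 3]])      # word counts
X_binary     = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 1, 1]])      # word present or not

print(GaussianNB().fit(X_continuous, y).predict([[5.0, 7.0]]))
print(MultinomialNB().fit(X_counts, y).predict([[0, 3, 2]]))
print(BernoulliNB().fit(X_binary, y).predict([[0, 1, 1]]))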
KNN

o K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on Supervised
Learning technique.
o K-NN algorithm assumes the similarity between the new case/data and the available cases and puts
the new case into the category that is most similar to the available categories.
o K-NN algorithm stores all the available data and classifies a new data point based on
similarity. This means that when new data appears, it can easily be classified into a well-suited
category by using the K-NN algorithm.
o K-NN algorithm can be used for Regression as well as for Classification but mostly it is used for
the Classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any assumption on
underlying data.
o It is also called a lazy learner algorithm because it does not learn from the training set
immediately; instead it stores the dataset, and at the time of classification it performs an action on
the dataset.
o The KNN algorithm just stores the dataset at the training phase, and when it gets new data it
classifies that data into the category that is most similar to the new data.
o Example: Suppose we have an image of a creature that looks similar to a cat and a dog, but we
want to know whether it is a cat or a dog. For this identification, we can use the KNN algorithm,
as it works on a similarity measure. Our KNN model will find the features of the new
image that are similar to the cat and dog images and, based on the most similar features, it will put
it in either the cat or the dog category.

Why do we need a K-NN Algorithm?

Suppose there are two categories, i.e., Category A and Category B, and we have a new data point x1;
in which of these categories will this data point lie? To solve this type of problem, we need a K-NN
algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point.
Consider the below diagram:
How does K-NN work?

The K-NN working can be explained on the basis of the below algorithm:

o Step-1: Select the number K of the neighbors


o Step-2: Calculate the Euclidean distance of K number of neighbors
o Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
o Step-4: Among these k neighbors, count the number of the data points in each category.
o Step-5: Assign the new data points to that category for which the number of the neighbor is
maximum.
o Step-6: Our model is ready.

Suppose we have a new data point and we need to put it in the required category. Consider the below
image:

o Firstly, we will choose the number of neighbors, so we will choose k=5.
o Next, we will calculate the Euclidean distance between the data points. The Euclidean distance
is the distance between two points, which we have already studied in geometry. For two points
(x1, y1) and (x2, y2) it can be calculated as:

Euclidean distance = √((x2 − x1)² + (y2 − y1)²)
o By calculating the Euclidean distance we got the nearest neighbors, as three nearest neighbors in
category A and two nearest neighbors in category B. Consider the below image:

o As we can see the 3 nearest neighbors are from category A, hence this new data point must
belong to category A.
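
A brief sketch of this procedure with scikit-learn and k=5 is shown below; the two-category toy data is an assumption that mirrors the Category A / Category B example above.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1, 1], [1, 2], [2, 1], [2, 2],   # Category A
              [6, 6], [6, 7], [7, 6], [7, 7]])  # Category B
y = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])

knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean").fit(X, y)
print(knn.predict([[2, 3]]))     # most of the 5 nearest neighbours are from A
print(knn.kneighbors([[2, 3]]))  # Euclidean distances and indices of those neighbours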
How to select the value of K in the K-NN Algorithm?
o There is no particular way to determine the best value for "K", so we need to try some values to
find the best out of them. The most preferred value for K is 5.
o A very low value for K such as K=1 or K=2, can be noisy and lead to the effects of outliers in
the model.
o Larger values of K reduce the effect of noise, but they can make the boundaries between categories less distinct.

Advantages of KNN Algorithm:


o It is simple to implement.
o It is robust to the noisy training data
o It can be more effective if the training data is large.

Disadvantages of KNN Algorithm:


o We always need to determine the value of K, which may sometimes be complex.
o The computation cost is high because of calculating the distance between the data points for all
the training samples.

SEGMENTATION

Image segmentation is a sub-domain of computer vision and digital image processing which aims at
grouping similar regions or segments of an image under their respective class labels.

Image segmentation is an extension of image classification where, in addition to classification, we
perform localization. Image segmentation is thus a superset of image classification, with the model
pinpointing where a corresponding object is present by outlining the object's boundary.

REGRESSION
REGRESSION NUMERICALS
Example 9.9

Calculate the regression coefficients and obtain the lines of regression for the following
data

Solution:

(i) Regression coefficient of X on Y: bxy = Σ(X − X̄)(Y − Ȳ) / Σ(Y − Ȳ)²

(ii) Regression equation of X on Y: X − X̄ = bxy (Y − Ȳ)

(iii) Regression coefficient of Y on X: byx = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)²

(iv) Regression equation of Y on X: Y − Ȳ = byx (X − X̄), which for this data gives

Y = 0.929X − 3.716 + 11

= 0.929X + 7.284

The regression equation of Y on X is Y = 0.929X + 7.284
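
Since the data table for this example is not reproduced here, the sketch below uses made-up x, y values simply to show how both regression coefficients and both lines of regression are computed for any small dataset.

import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([5.0, 7.0, 8.0, 11.0, 12.0])

x_mean, y_mean = x.mean(), y.mean()
sxy = np.sum((x - x_mean) * (y - y_mean))   # Σ(X − X̄)(Y − Ȳ)
sxx = np.sum((x - x_mean) ** 2)             # Σ(X − X̄)²
syy = np.sum((y - y_mean) ** 2)             # Σ(Y − Ȳ)²

b_yx = sxy / sxx   # regression coefficient of Y on X
b_xy = sxy / syy   # regression coefficient of X on Y

# Both regression lines pass through the point (x_mean, y_mean)
print(f"Y = {b_yx:.3f}(X - {x_mean}) + {y_mean}")
print(f"X = {b_xy:.3f}(Y - {y_mean}) + {x_mean}")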

ML Search: Indexing and Indexing Techniques,


Indexing is a way to optimize the performance of a database by minimizing the number of
disk accesses required when a query is processed. It is a data structure technique which is
used to quickly locate and access the data in a database.
Indexes are created using a few database columns.
 The first column is the Search key that contains a copy of the primary key or candidate
key of the table. These values are stored in sorted order so that the corresponding data
can be accessed quickly.
Note: The data may or may not be stored in sorted order.
 The second column is the Data Reference or Pointer which contains a set of pointers
holding the address of the disk block where that particular key value can be found.

The indexing has various attributes:


 Access Types: This refers to the type of access such as value based search, range access, etc.
 Access Time: It refers to the time needed to find particular data element or set of elements.
 Insertion Time: It refers to the time taken to find the appropriate space and insert new data.
 Deletion Time: Time taken to find an item and delete it as well as update the index structure.
 Space Overhead: It refers to the additional space required by the index.
In general, there are two types of file organization mechanism which are followed by the indexing
methods to store the data:
1. Sequential File Organization or Ordered Index File: In this, the indices are based on a sorted
ordering of the values. These are generally fast and are a more traditional type of storing mechanism.
This ordered or sequential file organization might store the data in a dense or sparse format:
(i) Dense Index:
 For every search key value in the data file, there is an index record.
 This record contains the search key and also a reference to the first data record with that search
key value.

(ii) Sparse Index:


 The index record appears only for a few items in the data file. Each item points to a block as
shown.
 To locate a record, we find the index record with the largest search key value less than or equal to
the search key value we are looking for.
 We start at the record pointed to by the index record and proceed along the pointers in the
file (that is, sequentially) until we find the desired record, as illustrated in the sketch given after this list.
2. Hash File organization: Indices are based on the values being distributed uniformly across a range
of buckets. The bucket to which a value is assigned is determined by a function called a hash
function.
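
As a simplified illustration of the sparse-index lookup described above (block size, sample keys and records are all assumptions, and each in-memory list stands in for a disk block):

import bisect

# Sorted data file split into blocks of 3 records each
blocks = [
    [("A01", "..."), ("A07", "..."), ("A12", "...")],
    [("B03", "..."), ("B09", "..."), ("C02", "...")],
    [("C08", "..."), ("D01", "..."), ("D05", "...")],
]
# Sparse index: one entry per block, holding the block's first search key
sparse_index = ["A01", "B03", "C08"]

def lookup(key):
    # Find the index record with the largest key <= the search key,
    # then scan that block sequentially.
    i = bisect.bisect_right(sparse_index, key) - 1
    if i < 0:
        return None
    for k, record in blocks[i]:
        if k == key:
            return record
    return None

print(lookup("B09"))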
There are primarily three methods of indexing:
 Clustered Indexing
 Non-Clustered or Secondary Indexing
 Multilevel Indexing
1. Clustered Indexing
When two or more related records are stored in the same file, this type of storage is known as cluster
indexing. By using cluster indexing we can reduce the cost of searching, since multiple
records related to the same thing are stored in one place, and it also allows the frequent joining of
two or more tables (records).
Clustering index is defined on an ordered data file. The data file is ordered on a non-key field. In
some cases, the index is created on non-primary-key columns, which may not be unique for each
record. In such cases, in order to identify the records faster, we group two or more columns
together to get unique values and create an index out of them. This method is known as the clustering
index. Basically, records with similar characteristics are grouped together and indexes are created for
these groups.
For example, students studying in each semester are grouped together. i.e. 1st Semester students,
2nd semester students, 3rd semester students etc. are grouped.
Clustered index sorted according to first name (Search key)
Primary Indexing:
This is a type of Clustered Indexing wherein the data is sorted according to the search key and the
primary key of the database table is used to create the index. It is a default format of indexing where it
induces sequential file organization. As primary keys are unique and are stored in a sorted manner, the
performance of the searching operation is quite efficient.
2. Non-clustered or Secondary Indexing
A non clustered index just tells us where the data lies, i.e. it gives us a list of virtual pointers or
references to the location where the data is actually stored. Data is not physically stored in the order
of the index. Instead, data is present in leaf nodes. For eg. the contents page of a book. Each entry
gives us the page number or location of the information stored. The actual data here(information on
each page of the book) is not organized but we have an ordered reference(contents page) to where the
data points actually lie. We can have only dense ordering in the non-clustered index as sparse
ordering is not possible because data is not physically organized accordingly.
It requires more time as compared to the clustered index because some amount of extra work is done
in order to extract the data by further following the pointer. In the case of a clustered index, data is
directly present in front of the index.
3. Multilevel Indexing
With the growth of the size of the database, indices also grow. As the index is stored in main
memory, a single-level index might become too large to store, requiring multiple disk accesses.
Multilevel indexing segregates the main block into various smaller blocks so that each can be stored
in a single block. The outer blocks are divided into inner blocks, which in turn point to the data
blocks. This can be easily stored in the main memory with fewer overheads.

Create inverted index using JAQL,


The inverted index is a data structure that allows efficient, full-text searches in the database. It is a very
important part of information retrieval systems and search engines that stores a mapping of words (or
any type of search terms) to their locations in the database table or document.

There are two types of inverted indexes: A record-level inverted index contains a list of references
to documents for each word. A word-level inverted index additionally contains the positions of each
word within a document. The latter form offers more functionality, but needs more processing power
and space to be created.

Advantage of Inverted Index are:


 The inverted index allows fast full-text searches, at the cost of increased processing when a
document is added to the database.
 It is easy to develop.
 It is the most popular data structure used in document retrieval systems, used on a large scale for
example in search engines.
The inverted index also has a disadvantage:
 Large storage overhead and high maintenance costs on update, delete and insert.
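
As an illustration of the idea (shown here in Python rather than JAQL), the sketch below builds a word-level inverted index mapping each word to {document id: [positions]}; the two sample documents are assumptions.

from collections import defaultdict

documents = {
    1: "big data needs fast search",
    2: "inverted index makes search fast",
}

inverted_index = defaultdict(dict)
for doc_id, text in documents.items():
    for position, word in enumerate(text.lower().split()):
        inverted_index[word].setdefault(doc_id, []).append(position)

print(dict(inverted_index["search"]))  # {1: [4], 2: [3]}
print(dict(inverted_index["fast"]))    # {1: [3], 2: [4]}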
Decision Tree
o Decision Tree is a Supervised learning technique that can be used for both classification and
Regression problems, but mostly it is preferred for solving Classification problems. It is a tree-
structured classifier, where internal nodes represent the features of a dataset, branches
represent the decision rules and each leaf node represents the outcome.
o In a Decision tree, there are two types of nodes, which are the Decision Node and the Leaf Node. Decision
nodes are used to make any decision and have multiple branches, whereas Leaf nodes are the
output of those decisions and do not contain any further branches.
o The decisions or the test are performed on the basis of features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a problem/decision
based on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node, which expands
on further branches and constructs a tree-like structure.

Decision Tree Terminologies

 Root Node: Root node is from where the decision tree starts. It represents the entire dataset, which
further gets divided into two or more homogeneous sets.
 Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further after
getting a leaf node.
 Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according
to the given conditions.
 Branch/Sub Tree: A tree formed by splitting the tree.
 Pruning: Pruning is the process of removing the unwanted branches from the tree.
 Parent/Child node: The root node of the tree is called the parent node, and other nodes are called
the child nodes.

Advantages of the Decision Tree


o It is simple to understand, as it follows the same process which a human follows while making
any decision in real life.
o It can be very useful for solving decision-related problems.
o It helps to think about all the possible outcomes for a problem.
o There is less requirement of data cleaning compared to other algorithms.

Disadvantages of the Decision Tree


o The decision tree contains lots of layers, which makes it complex.
o It may have an overfitting issue, which can be resolved using the Random Forest algorithm.
o For more class labels, the computational complexity of the decision tree may increase.

There are two popular techniques for ASM (Attribute Selection Measure), which are:

o Information Gain
o Gini Index

1. Information Gain:

o Information gain is the measurement of changes in entropy after the segmentation of a dataset
based on an attribute.
o It calculates how much information a feature provides us about a class.
o According to the value of information gain, we split the node and build the decision tree.
o A decision tree algorithm always tries to maximize the value of information gain, and a
node/attribute having the highest information gain is split first. It can be calculated using the
below formula:

Information Gain = Entropy(S) − [(Weighted Avg.) × Entropy(each feature)]

Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies randomness in
data. Entropy can be calculated as:

Entropy(S) = −P(yes) log2 P(yes) − P(no) log2 P(no)

Where,

o S= Total number of samples


o P(yes)= probability of yes
o P(no)= probability of no
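
A small numeric sketch of entropy and information gain follows; it assumes a parent node with 9 "yes" / 5 "no" samples that a feature splits into two child nodes, values chosen only for illustration.

import math

def entropy(pos, neg):
    # Entropy(S) = -P(yes) log2 P(yes) - P(no) log2 P(no)
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            result -= p * math.log2(p)
    return result

parent = entropy(9, 5)                      # ≈ 0.940
left, right = entropy(6, 2), entropy(3, 3)  # the two children after the split
weighted = (8 / 14) * left + (6 / 14) * right
print("information gain:", round(parent - weighted, 3))  # ≈ 0.048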

2. Gini Index:

o Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
o An attribute with a low Gini index should be preferred over one with a high Gini index.
o It only creates binary splits, and the CART algorithm uses the Gini index to create binary splits.
o Gini index can be calculated using the below formula:

Gini Index = 1 − Σj (Pj)²
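
A matching sketch for the Gini index, using the same assumed 9 "yes" / 5 "no" node as above:

def gini(pos, neg):
    # Gini = 1 - sum of squared class probabilities
    total = pos + neg
    p_pos, p_neg = pos / total, neg / total
    return 1.0 - (p_pos ** 2 + p_neg ** 2)

print(round(gini(9, 5), 3))  # ≈ 0.459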
