Fundamentals of Data Science
UNIT-I
What is Data?
Data is distinct pieces of information, usually formatted in a special way. Data
can be measured, collected, reported, and analyzed, whereupon it is often
visualized using graphs, images, or other analysis tools. Raw data ("unprocessed
data") may be a collection of numbers or characters before it has been "cleaned"
and corrected by researchers.
What is Information?
Information is data that has been processed, organized, or structured in a way
that makes it meaningful, valuable, and useful.
Categories of Data
Data can be categorized into two main parts:
Structured Data: This type of data is organized into a specific format, making
it easy to search, analyze, and process. Structured data is found in relational
databases and includes information like numbers, dates, and categories.
Unstructured Data: Unstructured data does not conform to a specific structure
or format. It may include text documents, images, videos, and other data
that is not easily organized or analyzed without additional processing.
What is Data Mining?
Definition: Data mining is the process of analyzing large datasets to discover
patterns, relationships, correlations, or meaningful insights that can help in
making informed decisions and predictions.
Purpose: The primary purpose of data mining is to extract valuable knowledge
and information from large volumes of data that might be hidden or not readily
apparent. It involves using advanced statistical and machine learning techniques
to identify patterns and trends.
Functions: Data mining algorithms and techniques are applied to the data to
identify associations, clusters, classifications, and anomalies. It helps in
understanding customer behavior, predicting trends, detecting fraud, and making
data-driven business decisions.
Usage: Data mining is widely used in areas such as marketing analysis, customer
segmentation, recommendation systems, fraud detection, healthcare research, and
financial forecasting.
Goals of Data Mining:
The goal of data mining is to extract useful information from large datasets
and use it to make predictions or inform decision-making.
Data mining is important because it allows organizations to uncover
insights and trends in their data that would be difficult or impossible to
discover manually.
This can help organizations make better decisions, improve their
operations, and gain a competitive advantage.
Data Mining History and Origins
One of the earliest and most influential pioneers of data mining was Dr. Herbert
Simon, a Nobel laureate in economics who is widely considered to be the father
of artificial intelligence. In the 1950s and 1960s, Simon and his colleagues
developed a number of algorithms and techniques for extracting useful
information and insights from data, including clustering, classification, and
decision trees.
In the 1980s and 1990s, the field of data mining continued to evolve, and new
algorithms and techniques were developed to address the challenges of working
with large and complex data sets. The development of data mining software and
platforms, such as SAS, SPSS, and RapidMiner, made it easier for organizations
to apply data mining techniques to their data.
In recent years, the availability of large data sets and the growth of cloud
computing and big data technologies have made data mining even more powerful
and widely used. Today, data mining is a crucial tool for many organizations and
industries and is used to extract valuable insights and information from data sets
in a wide range of domains.
Tasks of Data Mining
1. Classification: Categorizing data into predefined classes.
2. Clustering: Grouping similar data points together.
3. Regression: Predicting numerical values based on data relationships.
4. Association Rule Mining: Discovering interesting relationships between
variables.
based on the mining goals. The selected data may also undergo transformation to
better suit the mining algorithms.
4. Data Mining Engine: This is the core component where various data mining
algorithms are applied to the prepared data to discover patterns, trends, and
insights.
5. Pattern Evaluation: Once patterns are discovered, they need to be evaluated
for their relevance, validity, and usefulness. This step often involves statistical
techniques and domain expertise.
6. Knowledge Presentation: Finally, the discovered knowledge is presented to
users in a comprehensible format, such as reports, visualizations, or dashboards,
to aid in decision making.
Throughout this process, feedback loops may exist where insights gained from
the data mining results inform subsequent data selection, cleaning, or mining
steps, creating a continuous improvement cycle.
Data Mining Process
The data mining process typically involves several key stages:
1. Understanding the Business Problem: The first step is to clearly understand
the business problem or objective that data mining aims to address. This involves
collaborating closely with domain experts to identify key questions and goals.
2. Data Collection: In this stage, relevant data is gathered from various sources
such as databases, data warehouses, spreadsheets, or even web scraping. The data
collected should be comprehensive and representative of the problem domain.
3. Data Preprocessing: Raw data often requires preprocessing to ensure its quality
and suitability for analysis. This includes tasks such as cleaning data to remove
errors and inconsistencies, handling missing values, and transforming data into a
suitable format for analysis.
4. Exploratory Data Analysis (EDA): EDA involves examining the collected data
to understand its characteristics, identify patterns, and detect outliers or
anomalies. Techniques such as descriptive statistics, data visualization, and
clustering may be used during this stage.
5. Feature Selection and Engineering: Feature selection involves identifying the
most relevant variables (features) that will be used for analysis, while feature
engineering may involve creating new features or transforming existing ones to
enhance the predictive power of the model.
6. Model Selection and Training: Based on the nature of the problem and the
available data, suitable data mining algorithms or models are selected. These may
include techniques such as decision trees, neural networks, support vector
machines, or clustering algorithms. The selected models are then trained on the
prepared data.
7. Model Evaluation: Trained models need to be evaluated to assess their
performance and generalization ability. This involves using evaluation metrics
such as accuracy, precision, recall, or F1-score, and techniques such as cross-
validation to ensure robustness.
8. Model Deployment: Once a satisfactory model is obtained, it is deployed into
production to make predictions or generate insights on new, unseen data. This
may involve integrating the model into existing systems or workflows.
9. Monitoring and Maintenance: Deployed models should be regularly
monitored to ensure they continue to perform effectively over time. This may
involve monitoring for concept drift (changes in the underlying data distribution)
and updating the model or its parameters as necessary.
Throughout the entire data mining process, it's essential to maintain a clear focus
on the business objectives and involve domain experts at each stage to ensure that
the insights gained are relevant and actionable.
Classification of data mining
Classification Based on the mined Databases
A data mining system can be classified based on the types of databases that have
been mined. A database system can be further segmented based on distinct
principles, such as data models, types of data, etc., which further assist in
classifying a data mining system.
For example, if we want to classify a database based on the data model, we need
to select either relational, transactional, object-relational or data warehouse
mining systems.
Classification Based on the Type of Knowledge Mined
A data mining system categorized based on the kind of knowledge mined may have
the following functionalities:
1. Characterization
2. Discrimination
3. Association and Correlation Analysis
4. Classification
5. Prediction
6. Outlier Analysis
7. Evolution Analysis
Classification Based on the Techniques Utilized
A data mining system can also be classified based on the type of techniques that
are being incorporated.
These techniques can be assessed based on the degree of user interaction involved
or the methods of data analysis employed.
Classification Based on the Applications Adapted
Data mining systems classified based on the applications adapted are as follows:
1. Finance
2. Telecommunications
3. DNA
4. Stock Markets
5. E-mail
What is KDD (Knowledge Discovery in Databases)?
KDD is a computer science field specializing in extracting previously unknown
and interesting information from raw data. KDD is the whole process of trying to
make sense of data by developing appropriate methods or techniques. The
following steps are included in KDD process:
Data Cleaning
Data cleaning is defined as the removal of noisy and irrelevant data from the collection. It involves:
Cleaning in the case of missing values.
Cleaning noisy data, where noise is a random error or variance in a measured variable.
Cleaning with data discrepancy detection and data transformation tools.
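A minimal illustration of these cleaning steps (not part of the original notes) is sketched below in Python with pandas; the small table, column names, and fill strategy are invented assumptions.

import numpy as np
import pandas as pd

# Hypothetical raw data with a missing value, a noisy outlier, and a duplicate row
raw = pd.DataFrame({
    "age":    [25, 32, np.nan, 400, 32],           # 400 is an implausible (noisy) value
    "income": [30000, 52000, 41000, 45000, 52000],
})

clean = raw.drop_duplicates().copy()                        # remove duplicate records
clean["age"] = clean["age"].fillna(clean["age"].median())   # fill the missing value with the median
clean = clean[clean["age"].between(0, 120)]                 # drop rows with impossible (noisy) ages
print(clean)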
Data Integration
Data integration is defined as heterogeneous data from multiple sources combined
into a common source (data warehouse). Data integration is typically carried out using
data migration tools, data synchronization tools, and the ETL (Extract-Transform-Load) process.
Applications of OLAP
Database Marketing
Marketing and sales analysis
Data mining is used for future data prediction, whereas OLAP is used for analyzing past data.
New product development: Data mining can be used to identify new product
opportunities by analyzing customer purchase patterns and preferences.
Risk management: Data mining can be used to identify potential risks by
analyzing data on customer behavior, market conditions, and other factors.
Challenges and Issues in Data Mining
1]Data Quality
The quality of data used in data mining is one of the most significant challenges.
The accuracy, completeness, and consistency of the data affect the accuracy of
the results obtained. The data may contain errors, omissions, duplications, or
inconsistencies, which may lead to inaccurate results.
To address these challenges, data mining practitioners must apply data cleaning
and data preprocessing techniques to improve the quality of the data
2]Data Complexity
Data complexity refers to the vast amounts of data generated by various sources,
such as sensors, social media, and the internet of things (IoT). The complexity of
the data may make it challenging to process, analyze, and understand. In addition,
the data may be in different formats, making it challenging to integrate into a
single dataset.
To address this challenge, data mining practitioners use advanced techniques such
as clustering, classification, and association rule mining.
3]Data Privacy and Security
Data privacy and security is another significant challenge in data mining. As more
data is collected, stored, and analyzed, the risk of data breaches and cyber-attacks
increases. The data may contain personal, sensitive, or confidential information
that must be protected. Moreover, data privacy regulations such as GDPR, CCPA,
and HIPAA impose strict rules on how data can be collected, used, and shared.
To address this challenge, data mining practitioners must apply data
anonymization and data encryption techniques to protect the privacy and security
of the data. Data anonymization involves removing personally identifiable
information (PII) from the data, while data encryption involves using algorithms
to encode the data to make it unreadable to unauthorized users.
4]Scalability
Data mining algorithms must be scalable to handle large datasets efficiently. As
the size of the dataset increases, the time and computational resources required to
perform data mining operations also increase.
To address this challenge, data mining practitioners use distributed computing
frameworks such as Hadoop and Spark.
5]Interpretability
Data mining algorithms can produce complex models that are difficult to
interpret. This is because the algorithms use a combination of statistical and
mathematical techniques to identify patterns and relationships in the data.
To address this challenge, data mining practitioners use visualization techniques
to represent the data and the models visually.
Data Mining Applications
Data mining is used by a wide range of organizations and individuals across many
different industries and domains. Some examples of who uses data mining
include:
Businesses and Enterprises – Many businesses and enterprises use data mining
to extract useful insights and information from their data, in order to make better
decisions, improve their operations, and gain a competitive advantage. For
example, a retail company might use data mining to identify customer trends and
preferences or to predict demand for its products.
Government Agencies and Organizations – Government agencies and
organizations use data mining to analyze data related to their operations and the
population they serve, in order to make better decisions and improve their
services. For example, a health department might use data mining to identify
patterns and trends in public health data or to predict the spread of infectious
diseases.
Academic and Research Institutions – Academic and research institutions use
data mining to analyze data from their research projects and experiments, in order
to identify patterns, relationships, and trends in the data. For example, a university
might use data mining to analyze data from a clinical trial or to explore the
relationships between different variables in a social science study.
Individuals – Many individuals use data mining to analyze their own data, in
order to better understand and manage their personal information and activities.
For example, a person might use data mining to analyze their financial data and
identify patterns in their spending or to analyze their social media data and
understand their online behavior and interactions.
UNIT-II
What Is a Data Warehouse?
A Data Warehouse (DW) is a relational database that is designed for query and
analysis rather than transaction processing. It includes historical data derived
from transaction data from single and multiple sources.
A Data Warehouse provides integrated, enterprise-wide, historical data and
focuses on providing support for decision-makers for data modeling and analysis.
A Data Warehouse can be viewed as a data system with the following attributes:
It is a database designed for investigative tasks, using data from various
applications.
It supports a relatively small number of clients with relatively long
interactions.
It includes current and historical data to provide a historical perspective of
information.
Its usage is read-intensive.
It contains a few large tables.
Benefits of Data Warehouse
1.Understand business trends and make better forecasting decisions.
2.Data Warehouses are designed to perform well with enormous amounts of data.
3.The structure of data warehouses is more accessible for end-users to navigate,
understand, and query.
4.Queries that would be complex in many normalized databases could be easier
to build and maintain in data warehouses.
5.Data warehousing is an efficient method to manage demand for lots of
information from lots of users.
6.Data warehousing provides the capability to analyze a large amount of
historical data.
What is Multi-Dimensional Data Model?
The dimensions are the perspectives or entities with respect to which an organization
keeps records. For example, a shop may create a sales data warehouse to keep
records of the store's sales for the dimensions time, item, and location. These
dimensions allow the store to keep track of things such as the monthly sales of
items and the locations at which the items were sold. Each dimension has a table
related to it, called a dimensional table, which describes the dimension further.
For example, a dimensional table for an item may contain the attributes
item_name, brand, and type.
A multidimensional data model is organized around a central theme, for example,
sales. This theme is represented by a fact table. Facts are numerical measures.
The fact table contains the names of the facts or measures of the related
dimensional tables.
Consider the data of a shop for items sold per quarter in the city of Delhi. The
data is shown in the table. In this 2D representation, the sales for Delhi are shown
for the time dimension (organized in quarters) and the item dimension (classified
according to the types of items sold). The fact or measure displayed is
rupees_sold (in thousands).
Now, if we want to view the sales data with a third dimension, For example,
suppose the data according to time and item, as well as the location is considered
for the cities Chennai, Kolkata, Mumbai, and Delhi. These 3D data are shown in
the table. The 3D data of the table are represented as a series of 2D tables.
Data Cleaning:
Data cleaning removes noisy, irrelevant, and inconsistent data from the collection (see the KDD steps in Unit I).
Data Integration:
Data integration is one of the steps of data pre-processing that involves combining
data residing in different sources and providing users with a unified view of these
data.
• It merges the data from multiple data stores (data sources)
• It includes multiple databases, data cubes or flat files.
• Metadata, Correlation analysis, data conflict detection, and resolution of
semantic heterogeneity contribute towards smooth data integration.
• There are mainly 2 major approaches for data integration - commonly known as
"tight coupling approach" and "loose coupling approach".
Tight Coupling
o Here data is pulled over from different sources into a single physical location
through the process of ETL - Extraction, Transformation and Loading.
o The single physical location provides a uniform interface for querying the data.
o The ETL layer helps to map the data from the sources so as to provide a uniform data
warehouse. This approach is called tight coupling since in this approach the
data is tightly coupled with the physical repository at the time of query.
ADVANTAGES:
Independence (lower dependency on source systems since the data is
physically copied over)
Faster query processing
Complex query processing
Advanced data summarization and storage possible
High Volume data processing
For example, let's imagine that an electronics company is preparing to roll out a
new mobile device. The marketing department might want to retrieve customer
information from a sales department database and compare it to information from
the product department to create a targeted sales list. A good data integration
system would let the marketing department view information from both sources
in a unified way, leaving out any information that didn't apply to the search.
DATA TRANSFORMATION:
In data mining preprocessing, and especially in metadata and data warehousing, we
use data transformation in order to convert data from a source data format into
the destination data format.
Generalization:
• Here low-level (raw) data are replaced by higher-level concepts through the use of
concept hierarchies. For example, attributes like age may be mapped to higher-level
concepts such as youth, middle-aged, and senior.
• Generalization is a form of data reduction.
Normalization:
• Here the attribute data are scaled so as to fall within a small specified range,
such as -1.0 to 1.0, or 0.0 to 1.0.
• Normalization is particularly useful for classification algorithms involving
neural networks, or distance measurements such as nearest-neighbour
classification and clustering.
• For distance-based methods, normalization helps prevent attributes with initially
large ranges (e.g., income) from outweighing attributes with initially smaller
ranges (e.g., binary attributes).
• There are three methods for data normalization:
min-max normalization
z-score normalization
normalization by decimal scaling
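As an illustrative sketch (not from the notes), the three normalization methods can be written in a few lines of Python/NumPy; the income values below are invented.

import numpy as np

# Hypothetical attribute values (e.g., incomes)
x = np.array([12000.0, 35000.0, 47000.0, 73600.0, 98000.0])

# Min-max normalization: rescale values into the range [0, 1]
min_max = (x - x.min()) / (x.max() - x.min())

# Z-score normalization: subtract the mean and divide by the standard deviation
z_score = (x - x.mean()) / x.std()

# Decimal scaling: divide by 10^j, where j is the smallest integer such that max(|x'|) < 1
j = int(np.ceil(np.log10(np.abs(x).max())))
decimal_scaled = x / (10 ** j)

print(min_max, z_score, decimal_scaled, sep="\n")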
Attribute construction:
Here new attributes are constructed and added from the given set of attributes
to help the mining process.
Attribute construction helps to improve the accuracy and understanding of
structure in high-dimensional data.
By combining attributes, attribute construction can discover missing
information about the relationships between data attributes that can be useful
for knowledge discovery.
E.g.: The structure of stored data may vary between applications, requiring
semantic mapping prior to the transformation process. For instance, two
applications might store the same customer credit card information using slightly
different structures:
To ensure that critical data isn’t lost when the two applications are integrated,
information from Application A needs to be reorganized to fit the data structure
of Application B.
Data Reduction
Data reduction is a method of reducing the size of original data so that it may be
represented in a much smaller space. While reducing data, data reduction
techniques preserve data integrity.
Data Reduction Techniques
1.Dimensionality Reduction
Dimensionality reduction removes characteristics from the data set in question,
resulting in a reduction in the size of the original data. It shrinks data by removing
obsolete or superfluous characteristics.
1. Wavelet Transform: In the wavelet transform, a data vector A is transformed into a
numerically different data vector A' such that both A and A' are of the same
length. The transformed data can then be truncated: by keeping only a small
fraction of the strongest wavelet coefficients, a compressed approximation of
the original data is produced. Data cubes, sparse data, and skewed data can all
benefit from the wavelet transform.
2. Principal Component Analysis: Suppose we have a data set with n
attributes to be analyzed. Principal component analysis searches for k
orthogonal vectors (the principal components), with k <= n, that can best be
used to represent the data. The original data can thus be projected onto a
considerably smaller space, resulting in dimensionality reduction. Principal
component analysis can be used on data that is sparse or skewed.
2. Numerosity Reduction
The numerosity reduction decreases the size of the original data and expresses it
in a much more compact format. There are two sorts of numerosity reduction
techniques: parametric and non-parametric.
Parametric: Instead of keeping the original data, parametric numerosity reduction
stores only the parameters of a model fitted to the data. Regression and log-linear
models are examples of parametric numerosity reduction techniques.
Non-Parametric: There is no model in a non-parametric numerosity reduction
strategy. The non-parametric approach achieves a more uniform reduction,
regardless of data size, but it may not accomplish the same volume of data
reduction as the parametric technique. Histograms, clustering, sampling, data cube
aggregation, and data compression are common forms of non-parametric data
reduction techniques.
3.Data Cube Aggregation
This method is used to condense data into a more manageable format. Data Cube
Aggregation is a multidimensional aggregation that represents the original data
set by aggregating at multiple layers of a data cube, resulting in data reduction.
Aggregation provides you with the needed data, which is considerably smaller in
size, and data reduction is achieved without losing any information.
4.Data Compression
Data compression is the process of altering, encoding, or transforming the
structure of data in order to save space. By reducing duplication and encoding
data in binary form, data compression creates a compact representation of
information. Lossless compression refers to data that can be effectively recovered
from its compressed state. Lossy compression, on the other hand, occurs when
the original form cannot be restored from the compressed version. For data
compression, dimensionality and numerosity reduction methods are also utilised.
5.Discretization Operation
Data discretization is a technique for converting continuous attributes into
data with intervals. Several of the attributes' continuous values are replaced
with labels of small intervals, so that mining results are presented in
a clear and concise manner.
Data Discretization:
Data discretization is a technique for converting continuous attributes into
data with intervals. Several of the attributes' continuous values are replaced
with labels of small intervals, so that mining results are presented in
a clear and concise manner.
Discretization techniques can be categorized depending on how the discretization
is performed, such as whether it uses class information or which direction it proceeds
in (i.e., top-down vs. bottom-up).
If the process begins by first finding one or a few points (known as
split points or cut points) to split the whole attribute range, and then
continues this recursively on the resulting intervals, it is known as top-down
discretization or splitting.
In bottom-up discretization or merging, the process starts by considering all of
the continuous values as potential split-points, removes some by merging
neighbourhood values to form intervals, and then recursively applies this
process to the resulting intervals.
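A simple unsupervised (equal-width) discretization can be sketched with pandas as below; the age values and the interval labels are assumptions made only for illustration.

import pandas as pd

# Hypothetical continuous attribute: ages
ages = pd.Series([13, 19, 24, 33, 38, 45, 52, 61, 70])

# Split the range into three equal-width intervals and replace the
# continuous values with interval labels (a simple top-down style split)
labels = pd.cut(ages, bins=3, labels=["young", "middle-aged", "senior"])
print(labels)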
UNIT-III
Mining Frequent Patterns:
Frequent pattern mining is an essential task in data mining that aims to
uncover repetitive patterns or itemsets in a given dataset. It involves
identifying collections of items that occur together frequently in a
transactional or relational database. This process can offer valuable insight
into the relationships and associations among different items or attributes within
the data.
The technique of frequent pattern mining is built upon a number of fundamental
ideas.
Transactional and Relational Databases: The analysis is based on transaction
databases, which include records or transactions that represent collections of
objects. Items inside these transactions are grouped together as itemsets.
Support and Confidence: The importance of patterns is greatly
influenced by the support and confidence measures. Support quantifies how
frequently an itemset appears in the database, whereas confidence quantifies how
likely it is that a rule generated from the itemset is accurate.
The Apriori algorithm is one of the most well-known and widely used
algorithms for frequent pattern mining. It uses a breadth-first search
strategy to discover frequent itemsets efficiently. The algorithm works in
multiple iterations. It starts by finding frequent individual items by scanning
the database once and counting the occurrences of each item. It then generates
candidate itemsets of size 2 by combining the frequent itemsets of size 1.
The support of these candidate itemsets is calculated by scanning the database
again. The process continues iteratively, generating candidate itemsets of size k
and calculating their support until no more frequent itemsets can be found.
Support-based Pruning: During the Apriori algorithm's execution, support-based
pruning is used to reduce the search space and enhance efficiency. If an itemset
is found to be infrequent (i.e., its support is below the minimum support threshold),
then all its supersets are also guaranteed to be infrequent. Therefore, these supersets
are pruned from further consideration. This pruning step significantly decreases the
number of potential itemsets that need to be evaluated in subsequent iterations.
Association Rule Mining: Frequent item sets can be further examined to
discover association rules, which represent connections between different items.
An association rule consists of an antecedent (left-hand side) and a consequent (right-hand side),
both of which are itemsets. For instance, {milk, bread} => {eggs} is an
association rule indicating that customers who buy milk and bread together also tend to buy eggs.
Support
Confidence
Lift
Let's understand each of them:
Support
Support is the frequency of A, or how frequently an item appears in the dataset. It
is defined as the fraction of the transactions T that contain the itemset X. For a
set of transactions T, it can be written as:
Support(X) = (Number of transactions containing X) / (Total number of transactions)
Confidence
Confidence indicates how often the rule has been found to be true, i.e., how often
the items X and Y occur together in the dataset given that X already occurs. It is
the ratio of the number of transactions that contain both X and Y to the number
of transactions that contain X:
Confidence(X => Y) = Support(X ∪ Y) / Support(X)
Lift
It is the strength of any rule and can be defined by the formula below. It is the
ratio of the observed support to the expected support if X and Y were independent
of each other:
Lift(X => Y) = Support(X ∪ Y) / (Support(X) × Support(Y))
It has three possible values:
If Lift= 1: The probability of occurrence of antecedent and consequent is
independent of each other.
Lift>1: It determines the degree to which the two itemsets are dependent
to each other.
Lift<1: It tells us that one item is a substitute for other items, which means
one item has a negative effect on another.
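To make the three measures concrete, here is a small hand-written Python sketch that computes support, confidence, and lift for a rule X => Y over a toy set of transactions; the transactions themselves are invented for illustration.

# Invented toy transactions
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "eggs"},
]

def support(itemset):
    # Fraction of transactions that contain every item in `itemset`
    return sum(itemset <= t for t in transactions) / len(transactions)

X, Y = {"milk", "bread"}, {"eggs"}
supp_xy = support(X | Y)
confidence = supp_xy / support(X)                  # Confidence(X => Y)
lift = supp_xy / (support(X) * support(Y))         # Lift(X => Y)

print(f"support={supp_xy:.2f}, confidence={confidence:.2f}, lift={lift:.2f}")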
Apriori Algorithm:
Apriori algorithm refers to the algorithm which is used to calculate the association
rules between objects. It means how two or more objects are related to one
another. In other words, we can say that the Apriori algorithm is an association
rule learning algorithm that analyzes whether people who bought product A also bought
product B.
The primary objective of the apriori algorithm is to create the association rule
between different objects. The association rule describes how two or more objects
are related to one another. Apriori algorithm is also called frequent pattern
mining.
Components of Apriori algorithm
The given three components comprise the apriori algorithm.
Support
Confidence
Lift
Support
Support refers to the default popularity of any product. You find the support as the
quotient of the number of transactions comprising that product divided by the total
number of transactions. For example, suppose that out of 4,000 transactions in total,
400 involve biscuits and 200 involve both biscuits and chocolates. Hence, we get
Support (Biscuits) = (Transactions involving biscuits) / (Total transactions)
= 400/4000 = 10 percent.
Confidence
Confidence refers to the possibility that the customers bought both biscuits and
chocolates together. So, you need to divide the number of transactions that
comprise both biscuits and chocolates by the total number of transactions to get
the confidence.
Hence,
Confidence = (Transactions relating both biscuits and Chocolate) / (Total
transactions involving Biscuits)
= 200/400
= 50 percent.
It means that 50 percent of customers who bought biscuits bought chocolates also.
Lift
Consider the above example; lift refers to the increase in the ratio of the sale of
chocolates when you sell biscuits. The mathematical equation for lift is given
below.
Lift = Confidence (Biscuits → Chocolates) / Support (Biscuits)
= 50/10 = 5
It means that the probability of people buying both biscuits and chocolates
together is five times higher than that of purchasing the biscuits alone. If the lift
value is below one, it indicates that people are unlikely to buy both items
together. The larger the value, the better the combination.
How does the Apriori Algorithm work in Data Mining?
Consider the following dataset and we will find frequent itemsets and generate
association rules for them.
Step-1: K=1
(I) Create a table containing support count of each item present in dataset – Called
C1(candidate set)
(II) compare candidate set item’s support count with minimum support count(here
min_support=2 if support_count of candidate set items is less than min_support
then remove those items). This gives us itemset L1.
Step-2: K=2
Generate candidate set C2 using L1 (this is called join step). Condition of
joining Lk-1 and Lk-1 is that it should have (K-2) elements in common.
Check whether all subsets of an itemset are frequent or not, and if not frequent,
remove that itemset. (Example: the subsets of {I1, I2} are {I1} and {I2}; both are
frequent. Check this for each itemset.)
Now find support count of these itemsets by searching in dataset.
(II) compare candidate (C2) support count with minimum support count(here
min_support=2 if support_count of candidate set item is less than min_support
then remove those items) this gives us itemset L2.
Step-3:
Generate candidate set C3 using L2 (join step). Condition of joining Lk-1 and
Lk-1 is that it should have (K-2) elements in common. So here, for L2, first
element should match.
So the itemsets generated by joining L2 are {I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5},
{I2, I3, I4}, {I2, I4, I5}, {I2, I3, I5}.
Check if all subsets of these itemsets are frequent or not and if not, then
remove that itemset.(Here subset of {I1, I2, I3} are {I1, I2},{I2, I3},{I1, I3}
which are frequent. For {I2, I3, I4}, subset {I3, I4} is not frequent so remove
it. Similarly check for every itemset)
find support count of these remaining itemset by searching in dataset.
(II) Compare candidate (C3) support count with minimum support count(here
min_support=2 if support_count of candidate set item is less than min_support
then remove those items) this gives us itemset L3.
Step-4:
Generate candidate set C4 using L3 (join step). Condition of joining Lk-1 and
Lk-1 (K=4) is that, they should have (K-2) elements in common. So here, for
L3, first 2 elements (items) should match.
Check whether all subsets of these itemsets are frequent or not (here the itemset
formed by joining L3 is {I1, I2, I3, I5}, and its subset {I1, I3, I5} is not
frequent). So there is no itemset in C4.
We stop here because no frequent itemsets are found further.
Thus, we have discovered all the frequent item-sets. Now generation of strong
association rule comes into picture. For that we need to calculate confidence of
each rule.
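As a hedged sketch of how this procedure can be run in code, the mlxtend library provides an Apriori implementation (assuming mlxtend and pandas are installed). The transaction list below uses item labels I1–I5 but is an assumption, since the original data table in the notes is a figure that is not reproduced here.

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Assumed transactions with items I1..I5 (for illustration only)
transactions = [
    ["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"],
    ["I1", "I2", "I4"], ["I1", "I3"], ["I2", "I3"],
    ["I1", "I3"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3"],
]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)

# Frequent itemsets with minimum support of 2 out of 9 transactions
frequent = apriori(onehot, min_support=2 / 9, use_colnames=True)

# Strong association rules with minimum confidence 70%
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(frequent)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])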
Advantages of Apriori Algorithm
It is used to calculate large itemsets.
Simple to understand and apply.
ITEM FREQUENCY
A 1
C 2
D 1
E 4
I 1
K 5
M 3
N 2
O 4
U 1
Y 3
Let the minimum support be 3. A Frequent Pattern set is built which will contain
all the elements whose frequency is greater than or equal to the minimum support.
These elements are stored in descending order of their respective frequencies.
After insertion of the relevant items, the set L looks like this:-
L = {K : 5, E : 4, M : 3, O : 4, Y : 3}
Now, for each transaction, the respective Ordered-Item set is built. It is done by
iterating the Frequent Pattern set and checking if the current item is contained in
the transaction in question. If the current item is contained, the item is inserted in
the Ordered-Item set for the current transaction. The following table is built for
all the transactions:
Transaction ID Items Ordered-Item Set
T1 {E,K,M,N,O,Y} {K,E,M,O,Y}
T2 {D,E,K,N,O,Y} {K,E,O,Y}
T3 {A,E,K,M} {K,E,M}
T4 {C,K,M,U,Y} {K,M,Y}
T5 {C,E,K,O,O} {K,E,O}
Now, all the Ordered-Item sets are inserted into a Trie Data Structure.
a) Inserting the set {K, E, M, O, Y}:
Here, all the items are simply linked one after the other in the order of occurrence
in the set, and the support count of each item is initialized to 1.
d) Inserting the set {K, M, Y}: Similar to step b), first the support count of K is
increased, then new nodes for M and Y are initialized and linked accordingly.
e) Inserting the set {K, E, O}: Here simply the support counts of the respective
elements are increased. Note that the support count of the new node of item O is
increased.
Now, for each item, the Conditional Pattern Base is computed which is path
labels of all the paths which lead to any node of the given item in the frequent-
pattern tree. Note that the items in the below table are arranged in the ascending
order of their frequencies.
Now for each item, the Conditional Frequent Pattern Tree is built. It is done by
taking the set of elements that is common in all the paths in the Conditional
Pattern Base of that item and calculating its support count by summing the
support counts of all the paths in the Conditional Pattern Base.
From the Conditional Frequent Pattern tree, the Frequent Pattern rules are
generated by pairing the items of the Conditional Frequent Pattern Tree set with
the corresponding item, as given in the below table.
For each row, two types of association rules can be inferred; for example, for the
first row, which contains the item Y, the rules K -> Y and Y -> K can be inferred.
To determine the valid rule, the confidence of both the rules is calculated and the
one with confidence greater than or equal to the minimum confidence value is
retained.
UNIT-IV
Basic Concept of Classification
Classification is a task in data mining that involves assigning a class label to each
instance in a dataset based on its features. The goal of classification is to build a
model that accurately predicts the class labels of new instances based on their
features.
There are two main types of classification: binary classification and multi-class
classification. Binary classification involves classifying instances into two
classes, such as “spam” or “not spam”, while multi-class classification involves
classifying instances into more than two classes.
How does Classification Work?
There are two stages in a data classification system: classifier (model) creation
and applying the classifier for classification.
1.Developing the Classifier or model creation: This level is the learning stage
or the learning process. The classification algorithms construct the classifier in
this stage. A classifier is constructed from a training set composed of the records
of databases and their corresponding class names. Each category that makes up
the training set is referred to as a category or class. We may also refer to these
records as samples, objects, or data points.
2.Applying classifier for classification: The classifier is used for classification
at this level. The test data are used here to estimate the accuracy of the
classification rules.
The following decision tree is for the concept buy_computer that indicates
whether a customer at a company is likely to buy a computer or not. Each internal
node represents a test on an attribute. Each leaf node represents a class.
Key factors:
Entropy:
Entropy refers to a common way to measure impurity. In the decision tree, it
measures the randomness or impurity in data sets.
Information Gain:
Information Gain refers to the decline in entropy after the dataset is split on an
attribute; it is also called entropy reduction. Building a decision tree is all about
discovering the attributes that return the highest information gain.
The entropy for a subset of the original dataset having K classes can be defined as:
Entropy(S) = – Σ p(k) log2 p(k), summed over the classes k = 1, …, K
Where,
S is the dataset sample,
k is a particular class out of the K classes, and
p(k) is the proportion of the data points that belong to class k out of the total
number of data points in S.
The information gain obtained by splitting S on an attribute A is then:
Gain(S, A) = Entropy(S) – Σ (|Sv| / |S|) × Entropy(Sv), summed over the values v of attribute A
Where,
A is the specific attribute or class label,
|S| is the number of instances in the dataset sample S, and
|Sv| is the number of instances in the subset Sv of S that have the value v for
attribute A.
Gini Impurity or Index:
Gini Impurity is a score that evaluates how accurate a split is among the classified
groups. It evaluates a score in the range between 0 and 1, where 0 means all
observations belong to one class and values near 1 indicate a random distribution of
the elements across classes. It is computed as:
Gini(S) = 1 – Σ (p(k))^2, summed over the classes k = 1, …, K
Where p(k) is the proportion of data points in S that belong to class k.
The complete process can be better understood using the below algorithm:
Step-1: Begin the tree with the root node, says S, which contains the complete
dataset.
Step-2: Find the best attribute in the dataset using Attribute Selection Measure
(ASM).
Step-3: Divide S into subsets that contain the possible values of the best attribute.
Step-4: Generate the decision tree node, which contains the best attribute.
Step-5: Recursively make new decision trees using the subsets of the dataset
created in Step-3. Continue this process until a stage is reached where the nodes
cannot be classified further; such final nodes are called leaf nodes.
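In practice, libraries automate this splitting procedure. Below is a minimal, illustrative scikit-learn sketch (assuming scikit-learn is installed); the tiny job-offer style data set and its feature names are invented for demonstration and are not taken from these notes.

from sklearn.tree import DecisionTreeClassifier, export_text

# Invented toy data: [salary_in_lakhs, distance_km, cab_facility(0/1)] -> accept offer?
X = [
    [12, 5, 1], [15, 25, 1], [8, 10, 0], [20, 30, 0],
    [6, 3, 0], [18, 8, 1], [7, 28, 1], [14, 12, 0],
]
y = ["Accept", "Accept", "Decline", "Accept",
     "Decline", "Accept", "Decline", "Accept"]

# criterion="entropy" selects splits by information gain; "gini" would use the Gini index
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X, y)

print(export_text(tree, feature_names=["salary", "distance", "cab_facility"]))
print(tree.predict([[10, 6, 1]]))   # classify a new candidate's offer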
Example: Suppose there is a candidate who has a job offer and wants to decide
whether he should accept the offer or Not. So, to solve this problem, the decision
tree starts with the root node (Salary attribute by ASM). The root node splits
further into the next decision node (distance from the office) and one leaf node
based on the corresponding labels. The next decision node further gets split into
one decision node (Cab facility) and one leaf node. Finally, the decision node
splits into two leaf nodes (Accepted offers and Declined offer). Consider the
below diagram:
Tree Pruning
Tree pruning is performed in order to remove anomalies in the training data due
to noise or outliers. The pruned trees are smaller and less complex.
Tree Pruning Approaches
There are two common approaches to tree pruning:
Prepruning: the tree is pruned by halting its construction early, for example by
deciding not to further split a node once a goodness measure falls below a threshold;
the node then becomes a leaf.
Postpruning: subtrees are removed from a fully grown tree and replaced by leaf nodes.
The CART algorithm uses Gini Impurity to split the dataset into a decision tree. It
does that by searching for the best homogeneity for the sub-nodes, with the help
of the Gini index criterion.
Data set
INNAHAI ANUGRAHAM 51
FUNDAMENTALS OF DATASCIENCE RAJADHANI DEGREE COLLEGE
Gini index
Gini index is a metric for classification tasks in CART. It stores sum of squared
probabilities of each class. We can formulate it as illustrated below.
Gini = 1 – Σ (Pi)^2, for i = 1 to the number of classes
Outlook
Outlook is a nominal feature. It can be sunny, overcast or rain. The final decisions
(Yes/No) for the outlook feature are summarized below.
Outlook     Yes   No   Total
Sunny        2     3     5
Overcast     4     0     4
Rain         3     2     5
The weighted Gini index calculated for each feature is:
Feature        Gini index
Outlook        0.342
Temperature    0.439
Humidity       0.367
Wind           0.428
Since outlook has the lowest Gini index, it is selected as the root node. We then
apply the same principles to the resulting sub-datasets in the following steps and
end the algorithm when no further split is possible.
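The weighted Gini index of the outlook feature (0.342) can be reproduced with a few lines of Python using the class counts from the table above.

def gini(counts):
    # Gini impurity: 1 - sum(p_i^2) for a list of class counts
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# (Yes, No) counts for each outlook value, taken from the table above
outlook = {"Sunny": (2, 3), "Overcast": (4, 0), "Rain": (3, 2)}
n = sum(sum(c) for c in outlook.values())   # 14 instances in total

# Weighted Gini index of the outlook feature
weighted = sum(sum(c) / n * gini(c) for c in outlook.values())
print(weighted)   # 0.3428..., i.e. the 0.342 quoted above up to rounding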
Advantages of CART
Results are simple and easy to interpret.
Classification and regression trees are Nonparametric and Nonlinear.
Classification and regression trees implicitly perform feature selection.
Outliers have no meaningful effect on CART.
It requires minimal supervision and produces easy-to-understand models.
Limitations of CART
Overfitting.
High Variance.
low bias.
the tree structure may be unstable.
Applications of the CART algorithm
For quick Data insights.
In Blood Donors Classification.
For environmental and ecological data.
In the financial sectors.
Bayesian Classification
Bayesian classification uses Bayes' theorem to predict the occurrence of any
event. Bayesian classifiers are statistical classifiers based on Bayesian probability;
they can predict class membership probabilities, such as the probability that a given
record belongs to a particular class.
The arc in the diagram allows representation of causal knowledge. For example,
lung cancer is influenced by a person's family history of lung cancer, as well as
whether or not the person is a smoker. It is worth noting that the variable
PositiveXray is independent of whether the patient has a family history of lung
cancer or that the patient is a smoker, given that we know the patient has lung
cancer.
Conditional Probability Table
The conditional probability table for the values of the variable LungCancer (LC)
showing each possible combination of the values of its parent nodes,
FamilyHistory (FH), and Smoker (S) is as follows –
Bayes' Theorem
According to Bayes' theorem:
P(A|B) = [ P(B|A) × P(A) ] / P(B)
Where,
P(A|B) is Posterior probability: Probability of hypothesis A on the observed
event B.
P(B|A) is Likelihood probability: Probability of the evidence given that the
probability of a hypothesis is true.
P(A) is Prior Probability: Probability of hypothesis before observing the
evidence.
P(B) is Marginal Probability: Probability of Evidence.
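A tiny numeric illustration of the four terms (the probabilities are invented purely for demonstration):

# Invented example probabilities, only to illustrate the terms above
p_a = 0.01          # P(A): prior probability of the hypothesis
p_b_given_a = 0.9   # P(B|A): likelihood of the evidence given the hypothesis
p_b = 0.05          # P(B): marginal probability of the evidence

# Bayes' theorem: posterior = likelihood * prior / evidence
p_a_given_b = p_b_given_a * p_a / p_b
print(p_a_given_b)   # 0.18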
Example – weather data set (Outlook and the corresponding target Play):
      Outlook    Play
0     Rainy      Yes
1     Sunny      Yes
2     Overcast   Yes
3     Overcast   Yes
4     Sunny      No
5     Rainy      Yes
6     Sunny      Yes
7     Overcast   Yes
8     Rainy      No
9     Sunny      No
10    Sunny      Yes
11    Rainy      No
12    Overcast   Yes
13    Overcast   Yes
Frequency table of the weather conditions:
Weather     Yes   No
Overcast     5     0
Rainy        2     2
Sunny        3     2
Total       10     4
Likelihood table of the weather conditions:
Weather     No    Yes
Rainy        2     2     4/14 = 0.29
Sunny        2     3     5/14 = 0.35
o As we can see the 3 nearest neighbors are from category A, hence this new
data point must belong to category A.
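For completeness, a minimal scikit-learn K-nearest-neighbours sketch is given below (assuming scikit-learn is installed); the 2-D points for categories A and B are invented, and K = 3 as in the note above.

from sklearn.neighbors import KNeighborsClassifier

# Invented 2-D points for two categories
X = [[1, 1], [1, 2], [2, 1], [2, 2],     # category A
     [6, 6], [6, 7], [7, 6], [7, 7]]     # category B
y = ["A", "A", "A", "A", "B", "B", "B", "B"]

knn = KNeighborsClassifier(n_neighbors=3)    # K = 3 nearest neighbours
knn.fit(X, y)

print(knn.predict([[2, 3]]))   # its 3 nearest neighbours are all from category A -> ['A']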
What is accuracy?
Accuracy is a metric that measures how often a machine learning model correctly
predicts the outcome. You can calculate accuracy by dividing the number of
correct predictions by the total number of predictions.
In other words, accuracy answers the question: how often the model is right?
What is precision?
Precision is a metric that measures how often a machine learning model correctly
predicts the positive class. You can calculate precision by dividing the number of
correct positive predictions (true positives) by the total number of instances the
model predicted as positive (both true and false positives).
In other words, precision answers the question: how often the positive predictions
are correct?
What is recall?
Recall is a metric that measures how often a machine learning model correctly
identifies positive instances (true positives) from all the actual positive samples
in the dataset. You can calculate recall by dividing the number of true positives
by the number of positive instances. The latter includes true positives
(successfully identified cases) and false negative results (missed cases).
In other words, recall answers the question: can an ML model find all instances
of the positive class?
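These three metrics can be computed directly from the true/false positive and negative counts; the small label lists below are invented only to illustrate the formulas.

# Invented ground-truth and predicted labels (1 = positive class, 0 = negative class)
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # true negatives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

accuracy = (tp + tn) / (tp + tn + fp + fn)   # how often the model is right
precision = tp / (tp + fp)                   # how often positive predictions are correct
recall = tp / (tp + fn)                      # how many actual positives were found

print(accuracy, precision, recall)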
UNIT V
CLUSTERING
CLUSTER ANALYSIS:
Cluster analysis, also known as clustering, is a method of data mining that groups
similar data points together. The goal of cluster analysis is to divide a dataset into
groups (or clusters) such that the data points within each group are more similar
to each other than to data points in other groups.
Cluster Analysis is the process to find similar groups of objects in order to form
clusters. It is an unsupervised machine learning-based algorithm that acts on
unlabelled data. A group of data points would comprise together to form a cluster
in which all the objects would belong to the same group.
For example, consider a given dataset of vehicles that contains information
about different vehicles such as cars, buses, bicycles, etc. As this is unsupervised
learning, there are no class labels like Cars, Bikes, etc. for the vehicles; all the
data is combined and is not in a structured form.
Properties of Clustering :
1. Clustering Scalability: Nowadays there is a vast amount of data, and clustering
algorithms need to deal with huge databases. In order to handle extensive databases,
the clustering algorithm should be scalable so that it produces appropriate results.
2. High Dimensionality: The algorithm should be able to handle high
dimensional space along with the data of small size.
3. Algorithm Usability with multiple data kinds: Different kinds of data can be
used with algorithms of clustering. It should be capable of dealing with different
types of data like discrete, categorical and interval-based data, binary data etc.
4. Dealing with unstructured data: There would be some databases that contain
missing values, and noisy or erroneous data. If the algorithms are sensitive to such
data then it may lead to poor quality clusters. So it should be able to handle
unstructured data and give some structure to the data by organising it into groups
of similar data objects. This makes the job of the data expert easier in order to
process the data and discover new patterns.
5. Interpretability: The clustering outcomes should be interpretable,
comprehensible, and usable. The interpretability reflects how easily the data is
understood.
Applications Of Cluster Analysis:
It is widely used in image processing, data analysis, and pattern
recognition.
It helps marketers to find the distinct groups in their customer base and
they can characterize their customer groups by using purchasing patterns.
Algorithm: K mean:
The K means algorithm takes the input parameter K from the user and
partitions the dataset containing N objects into K clusters so that resulting
similarity among the data objects inside the group (intracluster) is high but
the similarity of data objects with the data objects from outside the cluster is
low (intercluster).
The similarity of a cluster is determined with respect to the mean value of
the cluster. K-means is a type of squared-error algorithm.
At the start randomly k objects from the dataset are chosen in which each of
the objects represents a cluster mean(centre).
For the rest of the data objects, they are assigned to the nearest cluster based
on their distance from the cluster mean.
The new mean of each of the cluster is then calculated with the added data
objects.
Algorithm:
Input:
K: The number of clusters in which the dataset has to be divided
D: A dataset containing N number of objects
Output:
A dataset of K clusters
Method:
1. Randomly assign K objects from the dataset (D) as cluster centres (C).
2. (Re)assign each object to the cluster whose mean it is most similar to, based
upon the mean values.
3. Update the cluster means, i.e., recalculate the mean of each cluster with the
updated assignments.
4. Repeat Steps 2 and 3 until no change occurs.
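A minimal scikit-learn sketch of the algorithm above (assuming scikit-learn is installed; the 2-D points are invented):

from sklearn.cluster import KMeans

# Invented 2-D data points forming two loose groups
X = [[1, 2], [1, 4], [2, 3], [8, 8], [9, 10], [10, 9]]

# K = 2 clusters; n_init controls how many random initialisations are tried
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(labels)                   # cluster index assigned to each point
print(kmeans.cluster_centers_)  # the final cluster means (centres)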
K-Medoids (PAM) Algorithm – Example:
Let's consider the following example. If a graph is drawn using the above data
points, we obtain the following:
Step 1: Randomly select k = 2 medoids; let C1 = (4, 5) and C2 = (8, 5) be the two
medoids.
Step 2: Calculate the cost. The dissimilarity of each non-medoid point with the
medoids is calculated and tabulated.
Here we have used the Manhattan distance formula to calculate the distances between
the medoids and the non-medoid points:
Distance = |X1 – X2| + |Y1 – Y2|
Each point is assigned to the cluster of that medoid whose dissimilarity is less.
Points 1, 2, and 5 go to cluster C1 and 0, 3, 6, 7, 8 go to cluster C2. The Cost =
(3 + 4 + 4) + (3 + 1 + 1 + 2 + 2) = 20
Step 3: randomly select one non-medoid point and recalculate the cost. Let
the randomly selected point be (8, 4). The dissimilarity of each non-medoid point
with the medoids – C1 (4, 5) and C2 (8, 4) is calculated and tabulated.
Each point is assigned to the cluster whose medoid has the smaller dissimilarity.
So, points 1, 2, and 5 go to cluster C1 and points 0, 3, 6, 7, 8 go to cluster C2.
The new cost = (3 + 4 + 4) + (2 + 2 + 1 + 3 + 3) = 22. Swap cost = new cost –
previous cost = 22 – 20 = 2, and 2 > 0. As the swap cost is not less than zero, we
undo the swap. Hence (4, 5) and (8, 5) are the final medoids.
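The clustering cost used in Steps 2 and 3 (the sum of the Manhattan distances of every point to its nearest medoid) can be written as a short helper function; the point list below is only a placeholder, since the actual data table is given as a figure in the notes.

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def total_cost(points, medoids):
    # Sum of each point's Manhattan distance to its closest medoid
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

# Placeholder points (the real ones are in the figure); medoids as in the example
points = [(3, 4), (5, 6), (7, 3), (8, 7), (6, 5)]
print(total_cost(points, [(4, 5), (8, 5)]))  # cost with medoids C1 = (4, 5), C2 = (8, 5)
print(total_cost(points, [(4, 5), (8, 4)]))  # cost after swapping C2 for (8, 4)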
CLARA (Clustering Large Applications) extends the k-medoids (PAM) approach to data
sets containing a large number of objects (more than several thousand observations)
in order to reduce computing time and RAM storage problems.
The algorithm is as follow:
1. Create randomly, from the original dataset, multiple subsets with fixed size
(sampsize)
2. Compute PAM algorithm on each subset and choose the corresponding k
representative objects (medoids). Assign each observation of the entire data
set to the closest medoid.
3. Calculate the mean (or the sum) of the dissimilarities of the observations
to their closest medoid. This is used as a measure of the goodness of the
clustering.
4. Retain the sub-dataset for which the mean (or sum) is minimal. A further
analysis is carried out on the final partition.
Steps:
Consider each alphabet as a single cluster and calculate the distance of one
cluster from all the other clusters.
In the second step, comparable clusters are merged together to form a
single cluster. Let’s say cluster (B) and cluster (C) are very similar to each
other therefore we merge them in the second step similarly to cluster (D)
and (E) and at last, we get the clusters [(A), (BC), (DE), (F)]
We recalculate the proximity according to the algorithm and merge the two
nearest clusters([(DE), (F)]) together to form new clusters as [(A), (BC),
(DEF)]
Repeating the same process; The clusters DEF and BC are comparable and
merged together to form a new cluster. We’re now left with clusters [(A),
(BCDEF)].
At last, the two remaining clusters are merged together to form a single
cluster [(ABCDEF)].
Once the group is split or merged then it can never be undone as it is a rigid
method and is not so flexible. The two approaches which can be used to improve
the Hierarchical Clustering Quality in Data Mining are: –
One should carefully analyze the linkages of the object at every partitioning
of hierarchical clustering.
One can use a hierarchical agglomerative algorithm for the integration of
hierarchical agglomeration. In this approach, first, the objects are grouped
into micro-clusters. After grouping data objects into microclusters, macro
clustering is performed on the microcluster.
Density-Based Method: The density-based method mainly focuses on density.
In this method, a given cluster will keep on growing as long as the density in the
neighbourhood exceeds some threshold, i.e., for each data point within a given
cluster, the neighbourhood of a given radius has to contain at least a minimum
number of points.
DBSCAN Algorithm
Parameters Required For DBSCAN Algorithm
Density-Based Clustering - Background
There are two different parameters to calculate the density-based clustering
EPS: It is considered as the maximum radius of the neighborhood.
MinPts: MinPts refers to the minimum number of points in an Eps neighborhood
of that point.
NEps(i) = { k | k belongs to D and dist(i, k) <= Eps }
Density reachable:
A point i is density-reachable from a point j with respect to Eps and MinPts if
there is a chain of points i1, ..., in with i1 = j and in = i such that each point
i(m+1) in the chain is directly density-reachable from i(m).
2. For each core point if it is not already assigned to a cluster, create a new
cluster.
3. Find recursively all its density-connected points and assign them to the
same cluster as the core point. A point a and b are said to be density
connected if there exists a point c which has a sufficient number of points
in its neighbors and both points a and b are within the eps distance. This
is a chaining process. So, if b is a neighbor of c, c is a neighbor of d,
and d is a neighbor of e, which in turn is neighbor of a implying that b is
a neighbor of a.
4. Iterate through the remaining unvisited points in the dataset. Those points
that do not belong to any cluster are noise.
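A minimal scikit-learn DBSCAN sketch corresponding to the steps above (the points are invented; eps and min_samples play the roles of Eps and MinPts):

from sklearn.cluster import DBSCAN

# Invented 2-D points: two dense groups plus one isolated (noise) point
X = [[1, 1], [1, 2], [2, 1], [2, 2],
     [8, 8], [8, 9], [9, 8], [9, 9],
     [25, 25]]

db = DBSCAN(eps=2.0, min_samples=3)   # Eps neighbourhood radius and MinPts
labels = db.fit_predict(X)

print(labels)   # cluster indices; -1 marks noise points (here the isolated point)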
OPTICS
OPTICS stands for Ordering Points To Identify the Clustering Structure. It gives
a significant order of database with respect to its density-based clustering
structure. The order of the cluster comprises information equivalent to the
density-based clustering related to a long range of parameter settings. OPTICS
methods are beneficial for both automatic and interactive cluster analysis,
including determining an intrinsic clustering structure.
DENCLUE
DENCLUE is a density-based clustering method proposed by Hinneburg and Keim. It
enables a compact mathematical description of arbitrarily shaped clusters in
high-dimensional data, and it works well for data sets with a large amount of noise.
Grid-Based Method: In the Grid-Based method a grid is formed using the object
together, i.e., the object space is quantized into a finite number of cells that form
a grid structure. One of the major advantages of the grid-based method is fast
processing time and it is dependent only on the number of cells in each dimension
in the quantized space.
Statistical Information Grid(STING):
A STING is a grid-based clustering technique. It uses a multidimensional grid
data structure that quantifies space into a finite number of cells. Instead of
focusing on data points, it focuses on the value space surrounding the data
points.
In STING, the spatial area is divided into rectangular cells and several levels
of cells at different resolution levels. High-level cells are divided into several
low-level cells.
In STING Statistical Information about attributes in each cell, such as mean,
maximum, and minimum values, are precomputed and stored as statistical
parameters. These statistical parameters are useful for query processing and
other data analysis tasks.
Working of CLIQUE Algorithm:
The CLIQUE algorithm first divides the data space into grids. This is done by
dividing each dimension into equal intervals called units. After that, it identifies
dense units. A unit is dense if the number of data points it contains exceeds the
threshold value.
Once the algorithm finds dense cells along one dimension, the algorithm tries to
find dense cells along two dimensions, and it works until all dense cells along the
entire dimension are found.
After finding all dense cells in all dimensions, the algorithm proceeds to find the
largest set (“cluster”) of connected dense cells. Finally, the CLIQUE algorithm
generates a minimal description of the cluster. Clusters are then generated from
all dense subspaces using the apriori approach.
Model based Methods
Model-based clustering is a statistical approach to data clustering. The observed
(multivariate) data is considered to have been created from a finite combination
of component models. Each component model is a probability distribution,
generally a parametric multivariate distribution.
There are the following types of model-based clustering are as follows −
1.Statistical approach − Expectation maximization is a popular iterative
refinement algorithm. An extension to k-means −
It can assign each object to a cluster according to weight (probability
distribution).
New means are computed based on weight measures.
Classification Tree
Silhouette Score
Each data point's silhouette score evaluates how well it fits into the cluster to
which it has been assigned, based on its closeness to other data points in that
cluster as well as to data points in other clusters. The silhouette score ranges
from -1 to 1: a score close to 1 means the data point is well clustered, whereas a
score close to -1 means it has likely been misclassified.
Calinski-Harabasz Index
A higher index value implies greater clustering performance. The Calinski-
Harabasz index evaluates the ratio of between-cluster variation to within-
cluster variance.
Davies-Bouldin index
A lower Davies-Bouldin index suggests greater clustering performance since
it gauges the average similarity between each cluster and its most
comparable cluster.
Rand Index
A higher Rand index denotes better clustering performance. It quantifies the
similarity between the anticipated grouping and the ground truth clustering.
Adjusted Mutual Information (AMI)
A higher index implies greater clustering performance. The AMI evaluates
the mutual information between the expected clustering and the ground truth
clustering, corrected for the chance.
Choosing the Right Evaluation Metric
The nature and objectives of a clustering problem will dictate the most
appropriate assessment measure to employ. If the goal of clustering is to group
similar data points together, the Calinski-Harabasz index or the silhouette score
can be beneficial. If the clustering results need to be compared to ground truth
clustering, however, the Rand index or AMI would be more appropriate. So, it is
important to consider the objectives and constraints of the clustering issue while
selecting the evaluation metric.
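A short scikit-learn sketch of computing some of these metrics for a K-means result on synthetic data (assuming scikit-learn is installed):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score

# Generate a small synthetic dataset with 3 well-separated blobs
X, _ = make_blobs(n_samples=150, centers=3, random_state=42)

labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

print("Silhouette score:       ", silhouette_score(X, labels))         # closer to 1 is better
print("Calinski-Harabasz index:", calinski_harabasz_score(X, labels))  # higher is better
print("Davies-Bouldin index:   ", davies_bouldin_score(X, labels))     # lower is better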