Foundations of Data Science
L T P C: 3 0 0 3
AIM
To explore the different techniques in Data Science for drawing useful conclusions
from large and diverse data sets through exploration, prediction, and inference.
OBJECTIVES
• To obtain comprehensive knowledge of various tools and techniques for data transformation and visualization
• To learn probability and probabilistic models of data science
• To learn basic statistics and hypothesis testing for specific problems
• To learn about prediction models
Subject Code: CS32A3    Subject Name: FOUNDATIONS OF DATA SCIENCE
Course Outcomes (COs) and Cognitive Skill
CO-01: Understand the fundamental concepts of data science – R, U
CO-02: Evaluate the data analysis techniques for applications handling large data – E
CO-03: Demonstrate the various machine learning algorithms used in the data science process – U, An
CO-04: Understand the basic statistics and testing hypothesis for specific problems and different prediction models – U
CO-05: Visualize and present the inference using various tools – A
R - Remember, U - Understand, A - Apply, An - Analyze, E - Evaluate, S - Synthesis
Mapping of Course Outcomes with Programme Outcomes
CO \ PO    PO-A  PO-B  PO-C  PO-D  PO-E  PO-F  PO-G
CO-01       M     M     N     N     N     L     L
CO-02       L     M     N     N     L     L     N
CO-03       M     S     N     N     L     L     L
CO-04       S     M     N     N     N     L     M
CO-05       L     L     N     N     N     L     N
S - Strong, M - Medium, L - Low, N - Not relevant
UNIT I INTRODUCTION 9
What is Data Science? Big Data and Data Science – Datafication - Current landscape of
perspectives - Skill sets needed; Matrices - Matrices to represent relations between data, and
necessary linear algebraic operations on matrices -Approximately representing matrices by
decompositions (SVD and PCA); Statistics: Descriptive Statistics: distributions and probability
- Statistical Inference: Populations and samples - Statistical modeling - probability distributions
- fitting a model - Hypothesis Testing - Intro to R/ Python.
UNIT II DATA PREPROCESSING 9
Data cleaning - Data integration - Data reduction - Data transformation and data discretization - Evaluation of
classification methods: Confusion matrix, Student's t-tests and ROC curves - Exploratory Data Analysis -
Basic tools (plots, graphs and summary statistics) of
EDA, Philosophy of EDA - The Data Science Process.
UNIT IV CLUSTERING 9
Choosing distance metrics - Different clustering approaches - hierarchical agglomerative
clustering, k-means (Lloyd's algorithm), - DBSCAN - Relative merits of each method -
clustering tendency and quality.
REFERENCES
1. Jure Leskovec, Anand Rajaraman and Jeffrey Ullman. Mining of Massive Datasets. v2.1,
Cambridge University Press. 2014.
2. Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. ISBN 0262018020.
2013.
3. Trevor Hastie, Robert Tibshirani and Jerome Friedman. Elements of Statistical Learning,
Second Edition. ISBN 0387952845. 2009.
4. Mohammed J. Zaki and Wagner Meira Jr. Data Mining and Analysis: Fundamental
Concepts and Algorithms. Cambridge University Press. 2014.
5. Jiawei Han, Micheline Kamber and Jian Pei. Data Mining: Concepts and Techniques, Third
Edition. ISBN 0123814790. 2011.
FOUNDATIONS OF DATA SCIENCE UNIT 2
______________________________________________________________
UNIT II DATA PREPROCESSING
Data cleaning - Data integration - Data reduction - Data transformation and data discretization - Evaluation
of classification methods: Confusion matrix, Student's t-tests and ROC curves - Exploratory Data Analysis
- Basic tools (plots, graphs and summary statistics) of EDA, Philosophy of EDA - The Data Science
Process.
DATA PREPROCESSING
Data preprocessing is an important step. It refers to cleaning, transforming, and integrating data in
order to make it ready for analysis. The goal of data preprocessing is to improve the quality of the data and
to make it more suitable for the specific data mining task. In data mining, preprocessing transforms raw
data into a useful and efficient format.
Some common steps in data preprocessing include:
Data Cleaning: This involves identifying and correcting errors or inconsistencies in the data, such as
missing values, outliers, and duplicates. Various techniques can be used for data cleaning, such as
imputation, removal, and transformation.
Data Integration: This involves combining data from multiple sources to create a unified dataset. Data
integration can be challenging as it requires handling data with different formats, structures, and semantics.
Techniques such as record linkage and data fusion can be used for data integration.
Data Transformation: This involves converting the data into a suitable format for analysis. Common
techniques used in data transformation include normalization, standardization, and discretization.
Normalization is used to scale the data to a common range, while standardization is used to transform the
data to have zero mean and unit variance. Discretization is used to convert continuous data into discrete
categories.
Data Reduction: This involves reducing the size of the dataset while preserving the important information.
Data reduction can be achieved through techniques such as feature selection and feature extraction. Feature
selection involves selecting a subset of relevant features from the dataset, while feature extraction involves
transforming the data into a lower-dimensional space while preserving the important information.
Data Discretization: This involves dividing continuous data into discrete categories or intervals.
Discretization is often used in data mining and machine learning algorithms that require categorical data.
Discretization can be achieved through techniques such as equal width binning, equal frequency binning,
and clustering.
Data Cleaning:
Data cleaning is an essential step in the data mining process and is crucial to the construction of a
model. Data cleaning is the process of correcting or deleting inaccurate, damaged, improperly formatted,
duplicated, or incomplete data from a dataset. Even if results and algorithms appear to be correct, they
are unreliable if the data is inaccurate. Data can easily be duplicated or incorrectly labeled when merging
multiple data sources. The data can also have many irrelevant and missing parts. Data cleaning handles
these problems, including missing data and noisy data.
Clustering:
This approach groups similar data into clusters. Outliers either fall outside the clusters or may go
undetected.
Steps for Cleaning Data
1. Remove duplicate or irrelevant observations
2. Fix structural errors
3. Filter unwanted outliers
4. Handle missing data
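The four steps above can be carried out with Pandas. The following is a minimal sketch on a small, hypothetical DataFrame (the column names and values are invented only for illustration):

import pandas as pd
import numpy as np

# Hypothetical customer DataFrame used only for illustration
df = pd.DataFrame({
    "age": [25, 25, 40, np.nan, 130],
    "city": ["chennai ", "chennai ", "Madurai", "Trichy", "Salem"],
    "income": [30000, 30000, 52000, 41000, 39000],
})

# 1. Remove duplicate or irrelevant observations
df = df.drop_duplicates()

# 2. Fix structural errors (stray whitespace, inconsistent capitalization)
df["city"] = df["city"].str.strip().str.title()

# 3. Filter unwanted outliers (e.g., an impossible age)
df = df[df["age"].isna() | (df["age"] <= 100)]

# 4. Handle missing data by imputing the median age
df["age"] = df["age"].fillna(df["age"].median())

print(df)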
Data Integration
Data Integration is a data preprocessing technique that combines data from multiple heterogeneous data
sources into a coherent data store and provides a unified view of the data. These sources may include
multiple data cubes, databases, or flat files.
The data integration approaches are formally defined as triple <G, S, M> where,
G stands for the global schema,
S stands for the heterogeneous source of schema,
M stands for mapping between the queries of source and global schema.
Data integration can be challenging due to the variety of data formats, structures, and semantics used by
different data sources. Different data sources may use different data types, naming conventions, and
schemas, making it difficult to combine the data into a single view. Data integration typically involves a
combination of manual and automated processes, including data profiling, data mapping, data
transformation, and data reconciliation.
There are mainly 2 major approaches
Tight Coupling:
This approach involves creating a centralized repository or data warehouse to store the integrated data. The
data is extracted from various sources, transformed and loaded into a data warehouse. Data is integrated in
a tightly coupled manner, meaning that the data is integrated at a high level, such as at the level of the
entire dataset or schema. This approach is also known as data warehousing, and it enables data consistency
and integrity, but it can be inflexible and difficult to change or update.
Here, a data warehouse is treated as an information retrieval component.
In this coupling, data is combined from different sources into a single physical location through the process
of ETL – Extraction, Transformation, and Loading.
Loose Coupling:
This approach involves integrating data at the lowest level, such as at the level of individual data elements
or records. Data is integrated in a loosely coupled manner, meaning that the data is integrated at a low
level, and it allows data to be integrated without having to create a central repository or data warehouse.
This approach is also known as data federation, and it enables data flexibility and easy updates, but it can
be difficult to maintain consistency and integrity across multiple data sources.
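As a small illustration of combining heterogeneous sources, the sketch below joins two hypothetical tables with different schemas using Pandas (the table and column names are assumptions made for this example):

import pandas as pd

# Two hypothetical sources describing the same customers with different schemas
crm = pd.DataFrame({"cust_id": [1, 2, 3], "name": ["Asha", "Ravi", "Mala"]})
billing = pd.DataFrame({"customer": [1, 2, 4], "amount": [250.0, 120.5, 90.0]})

# Schema mapping: align the key column names before combining
billing = billing.rename(columns={"customer": "cust_id"})

# Record linkage via a join on the shared key, giving a unified view
unified = crm.merge(billing, on="cust_id", how="outer")
print(unified)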
Data reduction
Data reduction is a technique used in data mining to reduce the size of a dataset while still preserving the
most important information. This can be beneficial in situations where the dataset is too large to be
processed efficiently, or where the dataset contains a large amount of irrelevant or redundant
information.
Techniques
Data Sampling: This technique involves selecting a subset of the data to work with, rather than using the
entire dataset. This can be useful for reducing the size of a dataset while still preserving the overall trends
and patterns in the data.
Data transformation
• Data transformation is the process of converting, cleansing, and structuring data into a usable
format that can be analyzed to support decision making processes, and to propel the growth of an
organization.
• Data transformation is used when data needs to be converted to match that of the destination
system. This can occur at two places of the data pipeline.
• The process of data transformation can be handled manually, automatically, or by a combination of
both.
• Transformation is an essential step in many processes, such as data integration, migration,
warehousing and wrangling. The process of data transformation can be:
✓ Constructive, where data is added, copied or replicated
✓ Destructive, where records and fields are deleted
✓ Aesthetic, where certain values are standardized, or
✓ Structural, which includes columns being renamed, moved and combined
1. Data Smoothing
• Data smoothing is a process used to remove noise from a dataset using some algorithm.
It allows important features present in the dataset to stand out and helps in predicting patterns.
When collecting data, the data can be processed to eliminate or reduce variance or any other form of
noise. The idea behind data smoothing is that it can identify simple changes that help
predict different trends and patterns.
• Binning: This method splits the sorted data into a number of bins and smooths the data values
in each bin by considering the neighbourhood values around them.
• Regression: This method identifies the relationship between two attributes so that, given one
attribute, it can be used to predict the other.
• Clustering: This method groups similar data values into clusters. The values that lie outside
a cluster are known as outliers.
2. Attribute Construction
• In the attribute construction method, new attributes are constructed from the existing attributes to
produce a data set that eases data mining. The new attributes are created and applied to assist the mining
process. This simplifies the original data and makes the mining more
efficient.
3. Data Aggregation
• Data aggregation is the method of storing and presenting data in a summary format.
The data may be obtained from multiple data sources and integrated for a data
analysis description. This is a crucial step, since the accuracy of data analysis insights is highly
dependent on the quantity and quality of the data used. Gathering accurate data of high quality and
in large enough quantity is necessary to produce relevant results. Aggregated data is useful for
everything from decisions concerning financing or business strategy of the product to pricing,
operations, and marketing strategies.
4. Data Normalization
Normalization scales the values of a numeric attribute to a smaller, common range such as [0, 1].
Consider a numeric attribute A with n observed values V1, V2, V3, …, Vn. Two common techniques are:
Min-max normalization: V' = (V − min(A)) / (max(A) − min(A))
Z-score normalization: V' = (V − mean(A)) / std(A)
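A minimal Pandas sketch of both techniques, on a hypothetical attribute A, is shown below:

import pandas as pd

# Hypothetical observed values of a numeric attribute A
a = pd.Series([12.0, 18.0, 25.0, 31.0, 50.0], name="A")

# Min-max normalization: rescale A to the range [0, 1]
min_max = (a - a.min()) / (a.max() - a.min())

# Z-score standardization: zero mean and unit variance
z_score = (a - a.mean()) / a.std()

print(pd.DataFrame({"A": a, "min_max": min_max, "z_score": z_score}))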
5. Data Discretization
• This is a process of converting continuous data into a set of data intervals. Continuous attribute
values are substituted by small interval labels. This makes the data easier to study and analyze. If a
data mining task handles a continuous attribute, then its discrete values can be replaced by constant
quality attributes. This improves the efficiency of the task. This method is also called a data
reduction mechanism, as it transforms a large dataset into a set of categorical data. Discretization
also uses decision tree-based algorithms to produce short, compact, and accurate results when using
discrete values. Data discretization can be classified into two types: supervised discretization, where
class information is used to choose the interval boundaries, and unsupervised discretization, where it
is not (for example, equal-width or equal-frequency binning).
6. Data Generalization
• It converts low-level data attributes to high-level data attributes using a concept hierarchy. This
conversion from a lower level to a higher conceptual level is useful to get a clearer picture of the
data. Data generalization can be divided into two approaches: the data cube (OLAP) approach and
the attribute-oriented induction approach.
Accuracy
➢ Accuracy simply measures how often the classifier predicts correctly. We can define accuracy as
the ratio of the number of correct predictions to the total number of predictions:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
When a model gives an accuracy of 99%, you might think that the model is performing very well, but
this is not always true and can be misleading in some situations.
Confusion Matrix
➢ Confusion Matrix is a performance measurement for the machine learning classification problems
where the output can be two or more classes. It is a table with combinations of predicted and actual
values.
➢ A confusion matrix is a tabular way of visualizing the performance of a prediction model. Each
entry in a confusion matrix denotes the number of predictions made by the model where it classified
the classes correctly or incorrectly.
➢ True Positive (TP): It refers to the number of predictions where the classifier correctly predicts the
positive class as positive.
➢ True Negative (TN): It refers to the number of predictions where the classifier correctly predicts
the negative class as negative.
➢ False Positive (FP): It refers to the number of predictions where the classifier incorrectly predicts
the negative class as positive.
➢ False Negative (FN): It refers to the number of predictions where the classifier incorrectly predicts
the positive class as negative.
➢ Import the necessary libraries like Numpy, confusion_matrix from sklearn.metrics, seaborn, and
matplotlib.
➢ Create the NumPy arrays for actual and predicted labels.
➢ Compute the confusion matrix.
➢ Plot the confusion matrix with the help of the seaborn heatmap, as in the sketch below.
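A minimal sketch that follows the steps above, using hypothetical actual and predicted labels, is given here:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix, accuracy_score

# Hypothetical actual and predicted labels for a binary classifier
actual = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
predicted = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 0])

# Compute the confusion matrix and overall accuracy
cm = confusion_matrix(actual, predicted)
print("Accuracy:", accuracy_score(actual, predicted))

# Plot the confusion matrix as a heatmap
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=["Negative", "Positive"],
            yticklabels=["Negative", "Positive"])
plt.xlabel("Predicted label")
plt.ylabel("Actual label")
plt.show()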
Precision
➢ Precision for a label is defined as the number of true positives divided by the number of predicted
positives.
➢ It indicates how many of the cases predicted as positive actually turned out to be positive. Precision
is useful in the cases where False Positive is a higher concern than False Negatives. The importance
of Precision is in music or video recommendation systems, e-commerce websites, etc. where wrong
results could lead to customer churn and this could be harmful to the business.
Recall (Sensitivity)
➢ Recall for a label is defined as the number of true positives divided by the total number of actual
positives.
➢ It explains how many of the actual positive cases we were able to predict correctly with our
model. Recall is a useful metric in cases where False Negative is of higher concern than False
Positive. It is important in medical cases where it doesn’t matter whether we raise a false alarm
but the actual positive cases should not go undetected!
F1 Score
➢ F1 Score is the harmonic mean of precision and recall.
➢ It gives a combined idea about Precision and Recall metrics. It is maximum when Precision is
equal to Recall.
➢ The F1 score punishes extreme values more. F1 Score could be an effective evaluation metric in
the following cases:
➢ When FP and FN are equally costly.
➢ Adding more data doesn’t effectively change the outcome
➢ True Negative is high
AUC-ROC
➢ The Receiver Operator Characteristic (ROC) is a probability curve that plots the TPR(True
Positive Rate) against the FPR(False Positive Rate) at various threshold values and separates the
‘signal’ from the ‘noise’.
➢ The Area Under the Curve (AUC) measures the ability of a classifier to distinguish
between classes. Graphically, it is the area enclosed between the ROC curve and the
X-axis.
The greater the AUC, the better the performance of the model at
different threshold points between positive and negative classes. This simply means that when
AUC is equal to 1, the classifier is able to perfectly distinguish between all Positive and Negative
class points. When AUC is equal to 0, the classifier would be predicting all Negatives as Positives
and vice versa. When AUC is 0.5, the classifier is not able to distinguish between the Positive and
Negative classes.
Working of AUC
➢ In a ROC curve, the X-axis shows the False Positive Rate (FPR) and the Y-axis shows the True
Positive Rate (TPR). A higher X value indicates a larger number of False Positives (FP) relative to
True Negatives (TN), while a higher Y value indicates a larger number of TP relative to FN. So
the choice of the threshold depends on the ability to balance FP and FN.
For a single sample with true label y∈{0,1} and a probability estimate p=Pr(y=1), the log loss is:
L(y, p) = −[ y·log(p) + (1 − y)·log(1 − p) ]
T-Test
• A t-test is an inferential statistic used to determine if there is a significant difference between the
means of two groups and how they are related. T-tests are used when the data sets follow a normal
distribution and have unknown variances, like the data set recorded from flipping a coin 100 times.
• A t-test is a type of inferential statistic used to determine the significant difference between the
means of two groups, which may be related to certain features. A t-test is used as a hypothesis
testing tool, which allows testing an assumption applicable to a population.
• Degrees of freedom refers to the values in a study that can vary and are essential for assessing the
null hypothesis's importance and validity. The computation of these values usually depends upon
the number of data records available in the sample set.
Types of T-Tests
• There are three types of t-tests we can perform based on the data, such as:
1. One-Sample t-test
In a one-sample t-test, we compare the average of one group against the set average. This set
average can be any theoretical value, or it can be the population mean.
t = (m − µ) / (s / √n)
Where,
t = t-statistic
m = mean of the group
µ = theoretical value or population mean
s = standard deviation of the group
n = group size or sample size
2. Unpaired (Independent Two-Sample) t-test
The unpaired t-test is used to compare the means of two different, independent groups of samples:
t = (m1 − m2) / √(s1²/n1 + s2²/n2)
Where,
m1, m2 = means of the two groups
s1, s2 = standard deviations of the two groups
n1, n2 = sizes of the two groups
3. Paired t-test
The paired sample t-test is quite intriguing. Here, we measure one group at two different times. We
compare different means for a group at two different times or under two different conditions.
A certain manager realized that the productivity level of his employees was trending significantly
downwards. This manager decided to conduct a training program for all his employees to increase
their productivity levels. The formula to calculate the t-statistic for a paired t-test is:
t = m_d / (s_d / √n)
Where,
t = t-statistic
m_d = mean of the paired differences
s_d = standard deviation of the paired differences
n = number of pairs
Perform a t-test
For all of the t-tests involving means, you perform the same steps in analysis:
• Define your null (H0) and alternative (Ha) hypotheses before collecting your data.
• Decide on the alpha value (or α value). This involves determining the risk you are willing to take
of drawing the wrong conclusion.
• Check the data for errors.
• Check the assumptions for the test.
• Perform the test and draw your conclusion. All t-tests for means involve calculating a test statistic.
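A minimal sketch of these steps using SciPy, on small hypothetical samples (the values and the chosen alpha are only for illustration), is shown below:

from scipy import stats
import numpy as np

# Hypothetical productivity scores used only for illustration
before = np.array([12.1, 11.5, 13.0, 12.7, 11.9, 12.4])
after = np.array([13.2, 12.0, 13.5, 13.1, 12.6, 13.0])
group_b = np.array([11.0, 11.8, 12.2, 11.4, 12.0, 11.6])

# One-sample t-test: is the mean of 'before' different from a claimed value of 12?
t1, p1 = stats.ttest_1samp(before, popmean=12.0)

# Unpaired t-test: do 'before' and 'group_b' have the same mean?
t2, p2 = stats.ttest_ind(before, group_b)

# Paired t-test: the same group measured before and after the training programme
t3, p3 = stats.ttest_rel(before, after)

alpha = 0.05  # chosen significance level
for name, p in [("one-sample", p1), ("unpaired", p2), ("paired", p3)]:
    print(name, "p-value =", round(p, 4), "reject H0" if p < alpha else "fail to reject H0")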
ROC Curve
• ROC stands for Receiver Operating Characteristics, and the ROC curve is the graphical
representation of the effectiveness of the binary classification model. It plots the true positive rate
(TPR) vs the false positive rate (FPR) at different classification thresholds.
AUC Curve:
• AUC stands for Area Under the Curve, and the AUC curve represents the area under the ROC
curve.
• It measures the overall performance of the binary classification model. As both TPR and FPR range
between 0 and 1, the area will always lie between 0 and 1, and a greater value of AUC denotes
better model performance.
• It represents the probability with which our model is able to distinguish between the two classes
present in our target.
• The ROC curve is a graph that shows the performance of a classification model at all possible
thresholds ( threshold is a particular value beyond which you say a point belongs to a particular
class). The curve is plotted between two parameters
import numpy as np
from sklearn.metrics import roc_auc_score

# Sample ground-truth labels and predicted probabilities
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [0.95, 0.90, 0.85, 0.81, 0.78, 0.70]

# Area under the ROC curve, rounded to 3 decimal places
auc = np.round(roc_auc_score(y_true, y_pred), 3)
print("AUC for our sample data is {}".format(auc))
Univariate EDA
✓ It involves looking at a single variable at a time. Univariate EDA can help you understand the
distribution of the data and identify any outliers.
✓ In univariate analysis, the output is a single variable and all data collected is for it. There is no
cause-and-effect relationship involved. In bivariate analysis, two variables are examined together;
for example, an employee's age may be compared with the salary earned or the expenses per
month.
✓ The analysis of data is done on variables that can be numerical or categorical. The result of the
analysis can be represented in numerical values, visualization, or graphical form.
The significant parameters which are estimated from a distribution point of view are as follows:
a) Univariate Non-Graphical
➢ Central Tendency:
This term refers to values located at the data's central position or middle zone. The three generally
estimated parameters of central tendency are mean, median, and mode. Mean is the average of all
values in data, while the mode is the value that occurs the maximum number of times. The Median
is the middle value with equal observations to its left and right.
➢ Range:
The range is the difference between the maximum and minimum value in the data, thus indicating
how much the data is away from the central value on the higher and lower side.
➢ Variance and Standard Deviation:
Two more useful parameters are standard deviation and variance. Variance is a measure of
dispersion that indicates the spread of all data points in a data set. It is the measure of dispersion
mostly used and is the mean squared difference between each data point and mean, while standard
deviation is the square root value of it. The larger the value of standard deviation, the farther the
spread of data, while a low value indicates more values clustering near the mean.
b) Univariate Graphical
➢ Stem-and-leaf Plots:
This is a very simple but powerful EDA method used to display quantitative data in a
shortened format. It displays the values in the data set, keeping each observation intact but
separating it into a stem (the leading digits) and the remaining or trailing digits as leaves.
Histograms are now mostly used in its place.
➢ Histograms (Bar Charts):
These plots are used to display both grouped and ungrouped data. Values of the variable
are plotted on the x-axis, while the number of observations or frequencies appears on the y-axis.
Histograms are a very simple way to quickly understand your data, revealing characteristics
such as central tendency, dispersion, and outliers.
✓ Bivariate EDA involves looking at two variables at a time. Bivariate EDA can help you
understand the relationship between two variables and identify any patterns that might exist.
The essential graphical EDA technique for two quantitative variables is the scatter plot:
one variable appears on the x-axis, the other on the y-axis, and every case in your dataset
appears as a point. This can be used for bivariate analysis, as in the sketch below.
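A minimal sketch of basic univariate and bivariate EDA with Pandas and Matplotlib, on a small hypothetical dataset, is given here:

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical dataset with two related numeric variables
df = pd.DataFrame({
    "experience": [1, 2, 3, 4, 5, 6, 7, 8],
    "salary": [25, 28, 33, 36, 42, 45, 51, 56],
})

# Univariate non-graphical EDA: central tendency, range, dispersion
print(df["salary"].describe())

# Univariate graphical EDA: histogram of a single variable
df["salary"].plot(kind="hist", title="Salary distribution")
plt.show()

# Bivariate graphical EDA: scatter plot of two quantitative variables
df.plot(kind="scatter", x="experience", y="salary", title="Experience vs Salary")
plt.show()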
B) Multivariate Chart
A Multivariate chart is a type of control chart used to monitor two or more interrelated
process variables. This is beneficial in situations such as process control, where engineers
are likely to benefit from using multivariate charts. These charts allow monitoring
multiple parameters together in a single chart.
C) Run Chart
A run chart is a line chart of data drawn over time. In other words, a run chart visually
illustrates the process performance or data values in a time sequence.
D) Bubble Chart
Bubble charts are scatter plots that display multiple circles (bubbles) in a two-dimensional
plot. They are used to assess the relationships between three or more numeric variables.
In a bubble chart, every single dot corresponds to one data point, and the values of the
variables for each point are indicated by horizontal position, vertical position,
dot size, and dot color.
E) Heat Map
A heat map displays data values as colors in a two-dimensional matrix, making it easy to spot
patterns, clusters, and correlations between variables.
Tools for Exploratory Data Analysis
1. Python
Python is one of the most widely used languages for EDA, with libraries such as Pandas, NumPy,
Matplotlib, and Seaborn for data manipulation, summary statistics, and visualization.
2. R
R programming language is a regularly used option to make statistical observations and analyze
data, i.e., perform detailed EDA by data scientists and statisticians. Like Python, R is also an open-
source programming language suitable for statistical computing and graphics. Apart from the
commonly used libraries like ggplot2, Leaflet, and Lattice, there are several powerful R libraries for
automated EDA, such as DataExplorer, SmartEDA, GGally, etc.
3. MATLAB
MATLAB is a well-known commercial tool among engineers since it has a very strong
mathematical calculation ability. Due to this, it is possible to use MATLAB for EDA but it requires
some basic knowledge of the MATLAB programming language.
THE DATA SCIENCE PROCESS
1. Discovery:
Discovery step involves acquiring data from all the identified internal & external sources, which
helps you answer the business question.
The data can be:
✓ Logs from webservers
✓ Data gathered from social media
✓ Census datasets
✓ Data streamed from online sources using APIs
2. Preparation:
Data can have many inconsistencies like missing values, blank columns, and incorrect data formats,
which need to be cleaned. You need to process, explore, and condition data before modelling. The
cleaner your data, the better your predictions.
Data Cleaning – Most of the real-world data is not structured and requires cleaning and conversion
into structured data before it can be used for any analysis or modeling.
Exploratory Data Analysis – This is the step in which we try to find the hidden patterns in the
data at hand. Also, we try to analyze different factors which affect the target variable and the extent
to which it does so. How the independent features are related to each other and what can be done
to achieve the desired results all these answers can be extracted from this process as well. This also
gives us a direction in which we should work to get started with the modeling process.
3. Model Planning:
In this stage, you need to determine the method and technique to draw the relation between input
variables. Planning for a model is performed by using different statistical formulas and visualization
tools. SQL analysis services, R, and SAS/access are some of the tools used for this purpose.
4. Model Building:
In this step, the actual model building process starts. Here, the data scientist splits the dataset into
training and testing sets. Techniques like association, classification, and clustering are applied to the
training data set. The model, once prepared, is tested against the “testing” dataset, as in the sketch below.
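A minimal sketch of this split-train-test workflow with scikit-learn, using a built-in dataset in place of project data and a decision tree chosen only for illustration, is shown here:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load a small built-in dataset in place of project data
X, y = load_iris(return_X_y=True)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a classification model on the training set only
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Test the prepared model against the held-out "testing" dataset
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))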
5. Operationalize:
You deliver the final baselined model with reports, code, and technical documents in this stage.
Model is deployed into a real-time production environment after thorough testing.
6. Communicate Results
In this stage, the key findings are communicated to all stakeholders. This helps you decide if the
project results are a success or a failure based on the inputs from the model.
FOUNDATIONS OF DATA SCIENCE UNIT 3
______________________________________________________________
Here, the "if" element is called the antecedent, and the "then" statement is called the consequent. These types of
relationships, where we can find some association or relation between two items, are known as single
cardinality.
Association rule mining is a procedure which aims to observe frequently occurring patterns,
correlations, or associations in datasets found in various kinds of databases such as relational databases,
transactional databases, and other forms of repositories.
An association rule has 2 parts:
• an antecedent (if) and
• a consequent (then)
An antecedent is something that’s found in the data, and a consequent is an item that is found in combination
with the antecedent. Example rule:
“If a customer buys bread, he is 70% likely to also buy milk.”
In the above association rule, bread is the antecedent and milk is the consequent.
Association rules are very useful for analyzing datasets.
Support
Support is the frequency of an itemset, i.e., how frequently an item appears in the dataset. It is defined as the
fraction of the transactions T that contain the itemset X:
Support(X) = (Number of transactions containing X) / (Total number of transactions)
Confidence
Confidence indicates how often the rule has been found to be true, i.e., how often the items X and Y occur
together in the dataset when the occurrence of X is already given. It is the ratio of the transactions that contain
both X and Y to the number of transactions that contain X:
Confidence(X → Y) = Support(X ∪ Y) / Support(X)
Lift
Lift is the strength of a rule, defined as the ratio of the observed support to the support expected if X and Y
were independent of each other:
Lift(X → Y) = Support(X ∪ Y) / (Support(X) × Support(Y))
It has three possible values:
o If Lift= 1: The probability of occurrence of antecedent and consequent is independent of each other.
o Lift>1: It determines the degree to which the two itemsets are dependent to each other.
o Lift<1: It tells us that one item is a substitute for other items, which means one item has a negative effect
on another.
Association Rule Mining is sometimes referred to as “Market Basket Analysis”, as it was the first application
area of association mining.
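A minimal sketch that computes support, confidence, and lift for a single rule (bread → milk) on a small, hypothetical transaction list is given below:

# Minimal illustration of support, confidence and lift for the rule bread -> milk
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "eggs"},
]

n = len(transactions)
support_bread = sum("bread" in t for t in transactions) / n
support_milk = sum("milk" in t for t in transactions) / n
support_both = sum({"bread", "milk"} <= t for t in transactions) / n

confidence = support_both / support_bread          # estimate of P(milk | bread)
lift = support_both / (support_bread * support_milk)

print(f"support={support_both:.2f} confidence={confidence:.2f} lift={lift:.2f}")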
Apriori Algorithm
Apriori algorithm is given for finding frequent item sets in a dataset for Boolean association rule. To improve
the efficiency of level-wise generation of frequent item sets, an important property is used called Apriori
property which helps by reducing the search space.
Apriori Property
All non-empty subsets of a frequent itemset must be frequent. The key concept of the Apriori algorithm is the
anti-monotonicity of the support measure.
Consider the following dataset; we will find frequent itemsets and generate association rules for them.
Step-1: K=1
(I) Create a table containing the support count of each item present in the dataset (called candidate set C1).
(II) Compare the support count of each candidate set item with the minimum support count (here min_support = 2;
if the support_count of a candidate set item is less than min_support, then remove that item). This gives us
itemset L1.
Step-2: K=2
• Generate candidate set C2 using L1 (this is called the join step). The condition for joining Lk−1 and Lk−1 is
that the itemsets should have (K−2) elements in common.
• Check whether all subsets of an itemset are frequent or not; if not, remove that itemset. (For example, the
subsets of {I1, I2} are {I1} and {I2}, which are frequent. Check this for each itemset.)
• Now find the support count of these itemsets by searching the dataset.
(II) Compare the support count of each candidate (C2) with the minimum support count (here min_support = 2;
if the support_count of a candidate set item is less than min_support, then remove that item). This gives us
itemset L2.
Step-3: K=3
• Generate candidate set C3 using L2 (join step). The condition for joining Lk−1 and Lk−1 is that the
itemsets should have (K−2) elements in common, so here, for L2, the first element should match.
The itemsets generated by joining L2 are {I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I4, I5},
{I2, I3, I5}.
• Check whether all subsets of these itemsets are frequent or not and, if not, remove that itemset.
(Here the subsets of {I1, I2, I3} are {I1, I2}, {I2, I3}, {I1, I3}, which are frequent. For {I2, I3, I4},
the subset {I3, I4} is not frequent, so remove it. Similarly check every itemset.)
• Find the support count of the remaining itemsets by searching the dataset.
(II) Compare the support count of each candidate (C3) with the minimum support count (here min_support = 2;
if the support_count of a candidate set item is less than min_support, then remove that item). This gives us
itemset L3.
Step-4: K=4
• Generate candidate set C4 using L3 (join step). The condition for joining Lk−1 and Lk−1 (K=4) is that
they should have (K−2) elements in common, so here, for L3, the first two elements (items) should
match.
• Check whether all subsets of these itemsets are frequent or not. (Here the itemset formed by joining L3 is
{I1, I2, I3, I5}, and its subsets include {I1, I3, I5}, which is not frequent.) So there is no itemset in C4.
• We stop here because no further frequent itemsets are found.
Confidence –
A confidence of 60% means that 60% of the customers who purchased milk and bread also bought butter.
• Confidence(A->B)=Support_count(A∪B)/Support_count(A)
• So here, by taking an example of any frequent itemset, we will show the rule generation.
Itemset {I1, I2, I3} //from L3
So rules can be
[I1^I2]=>[I3] //confidence = sup(I1^I2^I3)/sup(I1^I2) = 2/4*100=50%
[I1^I3]=>[I2] //confidence = sup(I1^I2^I3)/sup(I1^I3) = 2/4*100=50%
[I2^I3]=>[I1] //confidence = sup(I1^I2^I3)/sup(I2^I3) = 2/4*100=50%
[I1]=>[I2^I3] //confidence = sup(I1^I2^I3)/sup(I1) = 2/6*100=33%
[I2]=>[I1^I3] //confidence = sup(I1^I2^I3)/sup(I2) = 2/7*100=28%
[I3]=>[I1^I2] //confidence = sup(I1^I2^I3)/sup(I3) = 2/6*100=33%
So if minimum confidence is 50%, then first 3 rules can be considered as strong association rules.
Step 7: When the threshold criterion is applied again, you'll get the significant itemset.
Steps for Apriori Algorithm
The Apriori algorithm has the following steps (see the sketch after this list):
• Step 1: Set the minimum support and minimum confidence thresholds for the transactional database.
• Step 2: Take all the itemsets of the transactions whose support is greater than the chosen minimum
support value.
• Step 3: Find all the rules from these frequent itemsets whose confidence is greater than the chosen
minimum confidence threshold.
• Step 4: Sort the rules in decreasing order of lift (strength).
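The sketch below is a compact, illustrative implementation of the level-wise join, prune, and support-count steps described above, on a small transaction database whose support counts are consistent with the worked example (min_support here is an absolute count of 2):

from itertools import combinations

# Toy transaction database used only for illustration
transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
min_support = 2

def support_count(itemset):
    # Number of transactions that contain the given itemset
    return sum(itemset <= t for t in transactions)

# L1: frequent 1-itemsets
items = sorted({i for t in transactions for i in t})
frequent = [frozenset([i]) for i in items if support_count(frozenset([i])) >= min_support]

k = 2
while frequent:
    print(f"L{k - 1}:", [sorted(s) for s in frequent])
    # Join step: candidates of size k built from frequent (k-1)-itemsets
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
    # Prune step: keep candidates whose (k-1)-subsets are all frequent and whose support is high enough
    frequent = [c for c in candidates
                if all(frozenset(s) in frequent for s in combinations(c, k - 1))
                and support_count(c) >= min_support]
    k += 1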
REGRESSION ANALYSIS
Regression Analysis is a statistical process for estimating the relationships between the dependent
variables or criterion variables and one or more independent variables or predictors. Regression analysis is
generally used when we deal with a dataset that has the target variable in the form of continuous data.
Regression analysis explains the changes in criteria in relation to changes in select predictors. The conditional
expectation of the criteria is based on predictors where the average value of the dependent variables is given
when the independent variables are changed. Three major uses for regression analysis are determining the
strength of predictors, forecasting an effect, and trend forecasting.
What is the purpose of using Regression Analysis?
→ To analyze the effect of different independent features on the target (dependent) feature.
→ It helps us make decisions that can move the target variable in the desired direction.
→ Regression analysis is heavily based on statistics and hence gives quite reliable results.
→Along with the development of the machine learning domain regression analysis techniques have gained
popularity as well as developed manifold from just y = mx + c.
Linear Regression
Regression: It predicts continuous output variables based on independent input variables, for example the
prediction of house prices based on parameters like house age, distance from the main road, location,
area, etc. Linear regression is a type of supervised machine learning algorithm that computes the linear
relationship between a dependent variable and one or more independent features. When there is only one
independent feature, it is known as univariate (simple) linear regression.
Linear regression is used for predictive analysis. Linear regression is a linear approach for modeling the
relationship between the criterion or the scalar response and the multiple predictors or explanatory variables.
Linear regression focuses on the conditional probability distribution of the response given the values of the
predictors. For linear regression, there is a danger of overfitting. The formula for linear regression is:
Syntax:
y = θx + b
where,
• θ – It is the model weights or parameters
• b – It is known as the bias.
This is the most basic form of regression analysis and is used to model a linear relationship between a single
dependent variable and one or more independent variables.
In regression, a set of records with X and Y values is available, and these values are used to learn a function
so that, if you want to predict Y for an unknown X, this learned function can be used. In regression we have
to find the value of Y, so a function is required that predicts a continuous Y given X as the independent
feature.
Here Y is called the dependent or target variable and X is called the independent variable, also known as the
predictor of Y.
Linear regression performs the task of predicting a dependent variable value (y) based on a given independent
variable (x); hence the name Linear Regression. In the figure above, X (input) is the work experience and
Y (output) is the salary of a person.
Types of Linear Regression
Linear regression can be further divided into two types of the algorithm:
o Simple Linear Regression:
If a single independent variable is used to predict the value of a numerical dependent variable, then
such a Linear Regression algorithm is called Simple Linear Regression.
o Multiple Linear regression:
If more than one independent variable is used to predict the value of a numerical dependent variable,
then such a Linear Regression algorithm is called Multiple Linear Regression.
Linear Regression Line
A linear line showing the relationship between the dependent and independent variables is called a regression
line. A regression line can show two types of relationship:
o Positive Linear Relationship:
If the dependent variable increases on the Y-axis as the independent variable increases on the X-axis, then
such a relationship is termed a positive linear relationship.
o Negative Linear Relationship:
If the dependent variable decreases on the Y-axis as the independent variable increases on the X-axis, then
such a relationship is termed a negative linear relationship.
Cost function-
o The different values for the weights or coefficients of the line (a0, a1) give different lines of regression, and
the cost function is used to estimate the values of the coefficients for the best-fit line.
o Cost function optimizes the regression coefficients or weights. It measures how a linear regression model
is performing.
o We can use the cost function to find the accuracy of the mapping function, which maps the input
variable to the output variable. This mapping function is also known as Hypothesis function.
For Linear Regression, we use the Mean Squared Error (MSE) cost function, which is the average of the squared
errors between the predicted values and the actual values. For the linear equation y = a1x + a0, MSE can be
calculated as:
MSE = (1/N) ∑ (Yi − (a1xi + a0))²
Where,
N = Total number of observations
Yi = Actual value
(a1xi + a0) = Predicted value.
Residuals: The distance between the actual value and the predicted value is called the residual. If the observed
points are far from the regression line, the residuals will be high, and so the cost function will be high. If the
scatter points are close to the regression line, the residuals will be small and hence the cost function will be small.
Gradient Descent:
o Gradient descent is used to minimize the MSE by calculating the gradient of the cost function.
o A regression model uses gradient descent to update the coefficients of the line by reducing the cost
function.
o It starts with a random selection of coefficient values and then iteratively updates them to reach
the minimum of the cost function, as in the sketch below.
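A minimal NumPy sketch of gradient descent for the coefficients a1 and a0 of a simple linear regression, on small hypothetical data and with an illustrative learning rate and iteration count, is given here:

import numpy as np

# Hypothetical data: y is roughly 2x + 1 with a little noise
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.1, 10.8])

a1, a0 = 0.0, 0.0          # initial coefficient values
lr = 0.01                  # learning rate (illustrative choice)
n = len(x)

for _ in range(5000):
    y_pred = a1 * x + a0
    # Gradients of the MSE cost with respect to a1 and a0
    d_a1 = (-2.0 / n) * np.sum(x * (y - y_pred))
    d_a0 = (-2.0 / n) * np.sum(y - y_pred)
    a1 -= lr * d_a1
    a0 -= lr * d_a0

print("slope a1 =", round(a1, 3), "intercept a0 =", round(a0, 3))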
Model Performance:
The Goodness of fit determines how the line of regression fits the set of observations. The process of finding
the best model out of various models is called optimization. It can be achieved by below method:
1. R-squared method:
o R-squared is a statistical method that determines the goodness of fit.
o It measures the strength of the relationship between the dependent and independent variables on a scale
of 0-100%.
o A high value of R-squared indicates a small difference between the predicted values and actual values
and hence represents a good model.
o It is also called the coefficient of determination, or the coefficient of multiple determination for multiple
regression.
o It can be calculated from the formula:
R² = Explained variation / Total variation = 1 − (Sum of squared residuals / Total sum of squares)
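A minimal scikit-learn sketch that fits a simple linear regression and reports MSE and R-squared, on hypothetical experience-versus-salary data, is shown here:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical data: work experience (years) vs salary (in thousands)
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([25, 30, 34, 41, 45, 52])

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("MSE:", mean_squared_error(y, y_pred))
print("R-squared:", r2_score(y, y_pred))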
Logistic Regression
Logistic regression is a supervised machine learning algorithm mainly used for classification tasks, where the
goal is to predict the probability that an instance belongs to a given class. Although it is used for
classification, it is named logistic regression because it is based on the logistic function.
It is used for predicting the categorical dependent variable using a given set of independent variables.
• Logistic regression predicts the output of a categorical dependent variable. Therefore the outcome
must be a categorical or discrete value.
• Logistic Regression can be used to classify the observations using different types of data and can
easily determine the most effective variables used for the classification.
Logistic Function (Sigmoid Function):
• The sigmoid function is a mathematical function used to map the predicted values to probabilities:
σ(z) = 1 / (1 + e^(−z))
• It maps any real value into another value within the range 0 to 1. The value of the logistic
regression must be between 0 and 1, and since it cannot go beyond this limit, it forms a curve like the
“S” shape.
• The S-shaped curve is called the sigmoid function or the logistic function.
• In logistic regression, we use the concept of the threshold value, which defines the probability of
either 0 or 1. Such as values above the threshold value tends to 1, and a value below the threshold
values tends to 0.
Type of Logistic Regression:
On the basis of the categories, Logistic Regression can be classified into three types:
1. Binomial: In binomial Logistic regression, there can be only two possible types of the dependent
variables, such as 0 or 1, Pass or Fail, etc.
2. Multinomial: In multinomial Logistic regression, there can be 3 or more possible unordered types
of the dependent variable, such as “cat”, “dogs”, or “sheep”
3. Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of dependent
variables, such as “low”, “Medium”, or “High”.
Linear Regression vs Logistic Regression
1. Linear regression is used to predict the continuous dependent variable using a given set of independent
variables. | Logistic regression is used to predict the categorical dependent variable using a given set of
independent variables.
2. Linear regression is used for solving regression problems. | Logistic regression is used for solving
classification problems.
3. In linear regression we predict the value of continuous variables. | In logistic regression we predict the
values of categorical variables.
4. The least squares estimation method is used for estimation of accuracy. | The maximum likelihood
estimation method is used for estimation of accuracy.
5. There may be collinearity between the independent variables. | There should not be collinearity between
the independent variables.
o The logistic model starts from the linear equation y = b0 + b1x1 + b2x2 + … + bnxn. In logistic regression
y can be between 0 and 1 only, so we divide the above equation by (1 − y):
y / (1 − y), which is 0 for y = 0 and infinity for y = 1.
o But we need a range between −infinity and +infinity, so taking the logarithm of the equation, it becomes:
log[ y / (1 − y) ] = b0 + b1x1 + b2x2 + … + bnxn
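A minimal scikit-learn sketch of logistic regression on hypothetical hours-studied versus pass/fail data, showing the predicted probability and the class obtained with the default 0.5 threshold, is given here:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: hours studied vs pass (1) / fail (0)
X = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(X, y)

# Predicted probabilities come from the sigmoid of the linear combination;
# class labels are obtained with the default 0.5 threshold
print("P(pass | 2.2 hours):", clf.predict_proba([[2.2]])[0, 1])
print("Predicted class:", clf.predict([[2.2]])[0])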
CLASSIFIER
• A classifier in machine learning is an algorithm that automatically orders or categorizes
data into one or more of a set of “classes.”
• One of the most common examples is an email classifier that scans emails to filter them
by class label: Spam or Not Spam.
• A classifier is the algorithm itself – the rules used by machines to classify data.
• A classification model, on the other hand, is the end result of your classifier’s machine
learning.
• The model is trained using the classifier, so that the model, ultimately, classifies your
data.
Types of Classification Algorithms
✓ Decision Tree
✓ Naive Bayes Classifier
✓ K-Nearest Neighbors
✓ Support Vector Machines
✓ Artificial Neural Networks
Decision Tree
➢ Decision Tree is a Supervised learning technique that can be used for both classification
and Regression problems used to build models like the structure of a tree.
➢ It is a tree-structured classifier, where internal nodes represent the features of a dataset,
branches represent the decision rules and each leaf node represents the outcome.
➢ It is a graphical representation for getting all the possible solutions to a problem/decision
based on given conditions.
➢ A decision tree is a tree where each
✓ Node - a feature(attribute)
✓ Branch - a decision(rule)
✓ Leaf - an outcome(categorical or continuous)
➢ It classifies data into finer and finer categories: from “tree trunk,” to “branches,” to
“leaves.” It uses if-then rules to create sub-categories that fit into
broader categories and allows for precise, organic categorization.
For example, this is how a decision tree would categorize individual sports:
➢ In a decision tree, for predicting the class of the given dataset, the algorithm starts from
the root node of the tree. This algorithm compares the values of root attribute with the
record (real dataset) attribute and, based on the comparison, follows the branch and jumps
to the next node.
➢ For the next node, the algorithm again compares the attribute value with the other sub-
nodes and move further.
➢ It continues the process until it reaches the leaf node of the tree. The complete process
can be better understood using the below algorithm:
Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
Step-3: Divide S into subsets that contain the possible values of the best attribute.
Step-4: Generate the decision tree node, which contains the best attribute.
Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3.
Continue this process until a stage is reached where the nodes cannot be classified further; the final
nodes are called leaf nodes.
✓ CART creates only binary splits, and it uses the Gini index to choose them. The Gini index can be
calculated using the formula (see the sketch below):
Gini Index = 1 − ∑j pj²
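A minimal scikit-learn sketch of a CART-style decision tree that uses the Gini index for its binary splits, on a built-in dataset used only for illustration, is given here:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# CART-style tree with binary splits chosen by the Gini index
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X_train, y_train)

print("Test accuracy:", tree.score(X_test, y_test))
# Print the learned if-then rules as text
print(export_text(tree, feature_names=list(load_iris().feature_names)))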
ID3 algorithm
The ID3 algorithm, which stands for Iterative Dichotomiser 3, is a classification algorithm that follows
a greedy approach to building a decision tree by selecting, at each step, the attribute that
yields maximum Information Gain (IG) or minimum Entropy (H).
Entropy is a measure of the amount of uncertainty in the dataset S. Mathematically:
H(S) = ∑(c∈C) −p(c) · log₂ p(c)
Where,
• S - The current dataset for which entropy is being calculated (changes every iteration of
the ID3 algorithm).
• C - Set of classes in S {example - C ={yes, no}}
• p(c) - The proportion of the number of elements in class c to the number of elements in
set S.
In ID3, entropy is calculated for each remaining attribute. The attribute with the smallest entropy
is used to split the set S on that particular iteration.
Entropy = 0 implies it is of pure class, that means all are of same category.
Information Gain IG(A) tells us how much uncertainty in S was reduced after splitting set S on
attribute A. Mathematically:
IG(A, S) = H(S) − ∑(t∈T) p(t) · H(t)
Where
H(S) – entropy of set S,
T – the subsets created from splitting set S by attribute A, such that S = ⋃(t∈T) t,
p(t) – the proportion of the number of elements in t to the number of elements in set S,
H(t) – entropy of subset t.
The steps in ID3 algorithm are as follows:
Step 1: Data Preprocessing:
Clean and preprocess the data. Handle missing values and convert categorical variables
into numerical representations if needed.
Step 2: Selecting the Root Node:
Calculate the entropy of the target variable (class labels) based on the dataset. The
formula for entropy is:
Entropy(S) = -Σ (p_i * log2(p_i))
where p_i is the proportion of instances belonging to class i.
Step 3: Calculating Information Gain:
For each attribute A in the dataset, calculate the information gain obtained when the dataset is split on that
attribute. The formula for information gain is:
Gain(S, A) = Entropy(S) − ∑(v ∈ Values(A)) (|Sv| / |S|) · Entropy(Sv)
where Sv is the subset of S for which attribute A has value v. A small sketch of these calculations follows.
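The sketch below computes entropy and information gain for the Outlook attribute, using class counts consistent with the Play Tennis example discussed below (the data listing itself is only for illustration):

import math
from collections import Counter

def entropy(labels):
    # H(S) = sum over classes of -p(c) * log2 p(c)
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(attribute_values, labels):
    # IG(A, S) = H(S) - sum over values v of |S_v|/|S| * H(S_v)
    total = len(labels)
    remainder = 0.0
    for v in set(attribute_values):
        subset = [lab for a, lab in zip(attribute_values, labels) if a == v]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder

# Play Tennis-style data: Outlook values and the class label (Play)
outlook = ["Sunny", "Sunny", "Overcast", "Rainy", "Rainy", "Rainy", "Overcast",
           "Sunny", "Sunny", "Rainy", "Sunny", "Overcast", "Overcast", "Rainy"]
play = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
        "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]

print("Entropy of the whole set:", round(entropy(play), 3))
print("Information gain of Outlook:", round(information_gain(outlook, play), 3))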
Rules, individual error, and total error for the Outlook attribute:
Outlook = Sunny (5): Yes 2, No 3 → rule Sunny → No, error 2/5
Outlook = Overcast (4): Yes 4, No 0 → rule Overcast → Yes, error 0/4
Outlook = Rainy (5): Yes 3, No 2 → rule Rainy → Yes, error 2/5
Total error for Outlook = (2 + 0 + 2)/14 = 4/14
Rules, individual error, and total error for the Temperature attribute:
Temperature = Hot (4): Yes 2, No 2 → rule Hot → Yes, error 2/4
Temperature = Mild (6): Yes 4, No 2 → rule Mild → Yes, error 2/6
Temperature = Cold (4): Yes 3, No 1 → rule Cold → Yes, error 1/4
Total error for Temperature = (2 + 2 + 1)/14 = 5/14
Rules, individual error, and total error for the Humidity attribute:
Humidity = High (7): Yes 3, No 4 → rule High → No, error 3/7
Humidity = Normal (7): Yes 6, No 1 → rule Normal → Yes, error 1/7
Total error for Humidity = (3 + 1)/14 = 4/14
Rules, individual error, and total error for the Windy attribute:
Windy = False (8): Yes 6, No 2 → rule False → Yes, error 2/8
Windy = True (6): Yes 3, No 3 → rule True → No, error 3/6
Total error for Windy = (2 + 3)/14 = 5/14
From the above table, we can notice that the attributes Outlook and Humidity have the same
minimum error that is 4/14.
Hence we consider the individual attribute value errors.
The Outlook attribute has one rule that generates zero error, namely the rule Overcast → Yes.
Hence we consider Outlook as the splitting attribute.
Now we build the tree with Outlook as the root node. It has three branches for each possible
value of the outlook attribute. As the rule, Overcast → Yes generates zero error. When the
outlook attribute value is overcast we get the result as Yes. For the remaining two attribute
values we consider the subset of data and continue building the tree. Tree with Outlook as root
node is,
Now, for the left and right subtrees, we write all possible rules and find the total error. Based
on the total error table, we will construct the tree.
Left subtree,
Consolidated rules, errors for individual attributes values, and total error of the attribute are
given below.
From the above table, we can notice that Humidity has the lowest error. Hence Humidity is
considered as the splitting attribute. Also, when Humidity is High the answer is No as it
produces zero errors. Similarly, when Humidity is Normal the answer is Yes, as it produces
zero errors.
Right subtree,
Consolidated rules, errors for individual attributes values, and total error of the attribute are
given below.
From the above table, we can notice that Windy has the lowest error. Hence Windy is
considered as the splitting attribute. Also, when Windy is False the answer is Yes as it
produces zero errors. Similarly, when Windy is True the answer is No, as it produces zero
errors.
The final decision tree for the given Play Tennis data set therefore has Outlook as the root node, with
Humidity splitting the Sunny branch and Windy splitting the Rainy branch.
Also, from the above decision tree, the prediction for the new example is Yes.
Conditional Probability
• Conditional probability is a fundamental concept in probability theory that measures the
likelihood of an event occurring given that another event has already occurred.
• It helps us understand how the probability of one event is influenced by the presence or
knowledge of another event.
• Conditional probability is denoted as P(A | B), which reads as "the probability of event
A given event B."
Conditional Probability is defined as the probability of any event occurring when another
event has already occurred.
✓ it calculates the probability of one event happening given that a certain condition is
satisfied.
✓ P(A | B): This notation represents the conditional probability of event A occurring given
that event B has already occurred.
Conditional probability is calculated using the formula:
P(A|B) = P(A ∩ B) / P(B)
Where,
→P(A ∩ B) represents the probability of both events A and B occurring simultaneously, and
→P(B) represents the probability of event B occurring.
To calculate the conditional probability, we can use the following step-by-step method:
Step 1: Identify the Events. Let’s call them Event A and Event B.
Step 2: Determine the Probability of Event A i.e., P(A)
Step 3: Determine the Probability of Event B i.e., P(B)
Step 4: Determine the Probability of Event A and B i.e., P(A∩B).
Step 5: Apply the Conditional Probability Formula and calculate the required probability.
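A tiny worked example of these steps (drawing a card from a standard 52-card deck) is shown below:

# Worked example: P(King | Face card) from a standard 52-card deck
p_face = 12 / 52            # P(B): 12 face cards (J, Q, K)
p_king_and_face = 4 / 52    # P(A ∩ B): every king is also a face card

p_king_given_face = p_king_and_face / p_face
print("P(King | Face card) =", round(p_king_given_face, 3))   # 0.333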
Bayes' theorem can be derived using the product rule and the conditional probability of event X given
event Y. By the product rule,
P(X ∩ Y) = P(X | Y) · P(Y) = P(Y | X) · P(X)
Rearranging gives
P(X | Y) = [ P(Y | X) · P(X) ] / P(Y)
The above equation is called Bayes' Rule or Bayes' Theorem.
✓ P(X|Y) is called as posterior, which we need to calculate. It is defined as updated
probability after considering the evidence.
✓ P(Y|X) is called the likelihood. It is the probability of evidence when hypothesis is true.
✓ P(X) is called the prior probability, probability of hypothesis before considering the
evidence
✓ P(Y) is called marginal probability. It is defined as the probability of evidence under
any consideration.
Hence, Bayes Theorem can be written as:
posterior = likelihood * prior / evidence
Advantages of Naïve Bayes Classifier in Machine Learning:
✓ It is one of the simplest and most effective methods for calculating conditional probabilities and for
text classification problems.
✓ A Naïve Bayes classifier performs well when the assumption of independent predictors holds true.
✓ It is easier to implement than many other models.
✓ It requires only a small amount of training data to estimate the model parameters, which minimizes the
training time.
✓ It can be used for Binary as well as Multi-class Classifications.
Disadvantages of Naïve Bayes Classifier in Machine Learning:
✓ It relies on the assumption of independent predictors: it implicitly assumes that all
attributes are independent (unrelated), but in real life it is rarely feasible to obtain mutually
independent attributes.
Example: Predictively Classifying Customers of a Bookstore
We have the following dataset from a bookstore:
Age Income Student Credit_Rating Buys_Book
Based on these calculations, the Naïve Bayes classifier predicts that the customer with the above-mentioned
attributes will buy a book.
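Since the full bookstore table is not reproduced here, the following sketch shows how such a prediction could be made with scikit-learn's CategoricalNB on a small, made-up version of the data; the rows, categories and the new customer are assumptions used only for illustration:
import numpy as np
from sklearn.preprocessing import OrdinalEncoder
from sklearn.naive_bayes import CategoricalNB

# Hypothetical rows in the spirit of the bookstore table:
# columns are Age, Income, Student, Credit_Rating
X_raw = [["youth", "high", "no", "fair"],
         ["youth", "high", "no", "excellent"],
         ["middle", "high", "no", "fair"],
         ["senior", "medium", "no", "fair"],
         ["senior", "low", "yes", "fair"],
         ["senior", "low", "yes", "excellent"],
         ["middle", "low", "yes", "excellent"],
         ["youth", "medium", "no", "fair"]]
y = ["no", "no", "yes", "yes", "yes", "no", "yes", "no"]   # Buys_Book

enc = OrdinalEncoder()                              # map category strings to integer codes
X = enc.fit_transform(X_raw).astype(int)

model = CategoricalNB().fit(X, y)

# Predict for a new customer (must use the same categories as the training rows)
new_customer = enc.transform([["youth", "medium", "yes", "fair"]]).astype(int)
print(model.predict(new_customer))                  # e.g. ['yes'] -- depends on the toy data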
✓ The Elbow method is one of the most popular ways to find the optimal number of clusters.
This method uses the concept of WCSS value. WCSS stands for Within Cluster Sum of
Squares, which defines the total variations within a cluster. The formula to calculate the
value of WCSS (for 3 clusters) is given below:
WCSS = Σ(Pi in Cluster1) distance(Pi, C1)² + Σ(Pi in Cluster2) distance(Pi, C2)² + Σ(Pi in Cluster3) distance(Pi, C3)²
In the above formula of WCSS,
Σ(Pi in Cluster1) distance(Pi, C1)² is the sum of the squares of the distances between each data point Pi in
Cluster1 and its centroid C1, and the same applies to the other two terms.
The steps to be followed for the implementation are given below:
✓ Data Pre-processing
✓ Finding the optimal number of clusters using the elbow method
✓ Training the K-means algorithm on the training dataset
✓ Visualizing the clusters
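As a minimal sketch of these steps, the code below runs K-Means for several values of k on synthetic blob data (an assumption used only for illustration) and plots the WCSS, which scikit-learn exposes as the inertia_ attribute of a fitted model:
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data with 3 natural groups (illustrative only)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)            # inertia_ is the WCSS for this k

plt.plot(range(1, 11), wcss, marker='o')
plt.xlabel('Number of clusters k')
plt.ylabel('WCSS')
plt.title('Elbow method')
plt.show()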
K-Nearest Neighbor(KNN) Algorithm for Machine Learning
• K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on
Supervised Learning technique.
• The K-NN algorithm stores all the available data and classifies a new data point based on
similarity. This means that when new data appears, it can easily be assigned to a well-suited
category using the K-NN algorithm.
• The K-NN algorithm can be used for Regression as well as for Classification, but mostly it is
used for Classification problems.
• K-NN is a non-parametric algorithm, which means it does not make any assumption about the
underlying data.
• It is also called a lazy learner algorithm because it does not learn from the training set
immediately; instead, it stores the dataset and performs the work at the time of classification.
• Example: Suppose we have an image of a creature that looks similar to both a cat and a dog,
and we want to know whether it is a cat or a dog.
• Firstly, we will choose the number of neighbours; here we choose k = 5.
• Next, we will calculate the Euclidean distance between the data points. The Euclidean distance
is the straight-line distance between two points, which we have already studied in geometry.
For two points (x1, y1) and (x2, y2) it can be calculated as:
d = √((x2 − x1)² + (y2 − y1)²)
• By calculating the Euclidean distance we obtain the nearest neighbours: three nearest
neighbours in category A and two nearest neighbours in category B. Consider the below
image:
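A minimal sketch of the same K-NN procedure with scikit-learn is given below; the two-class toy data and k = 5 mirror the example above, and all the numbers are assumptions for illustration:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Toy two-class data standing in for the "cat vs dog" features
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# k = 5 neighbours, Euclidean distance (the default metric)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

print(knn.predict(X_test[:1]))      # predicted class of one new point
print(knn.score(X_test, y_test))    # overall accuracy on the test split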
Example: SVM can be understood with the example that we used for the KNN classifier. Suppose we see a
strange cat that also has some features of dogs, and we want a model that can accurately identify whether it
is a cat or a dog; such a model can be created using the SVM algorithm. We first train the model with many
images of cats and dogs so that it can learn their different features, and then we test it with this strange
creature. Because SVM creates a decision boundary between the two classes (cat and dog) and chooses the
extreme cases (the support vectors), it considers the extreme cases of cat and dog. On the basis of the
support vectors, it classifies the new creature as a cat. Consider the below diagram:
SVM algorithm can be used for Face detection, image classification, text categorization, etc.
Types of SVM
SVM can be of two types:
o Linear SVM: Linear SVM is used for linearly separable data. If a dataset can be classified
into two classes by a single straight line, the data is termed linearly separable, and the
classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data. If a dataset cannot
be classified by a straight line, the data is termed non-linear, and the classifier used is called
a Non-linear SVM classifier.
Support Vectors:
The data points or vectors that are closest to the hyperplane and which affect the position of
the hyperplane are termed support vectors. Since these vectors support the hyperplane, they are
called support vectors.
Margin in Support Vector Machine
We all know that the equation of a hyperplane is w·x + b = 0, where w is a vector normal to the hyperplane
and b is an offset.
To classify a point as negative or positive we need to define a decision rule. We can define the
decision rule as:
So we basically need to find X1², X2² and X1·X2; the original 2-dimensional input is thereby mapped
into a higher-dimensional (here 5-dimensional) feature space.
2. Sigmoid Kernel
We can use it as a proxy for neural networks. The equation is:
3. RBF Kernel
The RBF kernel creates non-linear combinations of the features to lift the samples onto a
higher-dimensional feature space, where a linear decision boundary can be used to separate
the classes. It is the most used kernel in SVM classification; the following formula explains it
mathematically:
where,
1. 'σ' is the variance, our hyperparameter
2. ||X₁ – X₂|| is the Euclidean distance between two points X₁ and X₂
5. Anova Kernel
It performs well on multidimensional regression problems. The formula for this kernel function
is:
Advantages of SVM
• SVM works well when the data is linearly separable
• It is effective in high-dimensional spaces
• With the help of the kernel trick, complex (non-linear) problems can be solved
• SVM is not very sensitive to outliers
• It is useful for image classification
Disadvantages of SVM
• Choosing a good kernel is not easy
• It does not show good results on very large datasets
• The main SVM hyperparameters are the cost C and gamma; they are not easy to fine-tune,
and it is hard to visualize their impact
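To make the discussion of kernels and hyperparameters concrete, here is a minimal sketch that trains a linear and an RBF-kernel SVM with scikit-learn; the toy dataset and the C and gamma values are assumptions chosen only for illustration:
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Non-linearly separable toy data
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel='linear', C=1.0).fit(X_train, y_train)
rbf_svm = SVC(kernel='rbf', C=1.0, gamma='scale').fit(X_train, y_train)

print("Linear SVM accuracy:", linear_svm.score(X_test, y_test))
print("RBF SVM accuracy:   ", rbf_svm.score(X_test, y_test))
print("Support vectors per class:", rbf_svm.n_support_)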
Ensemble Methods
• Ensemble methods are techniques that aim at improving the accuracy of results in models
by combining multiple models instead of using a single model.
• The combined models increase the accuracy of the results significantly. This has boosted
the popularity of ensemble methods in machine learning.
3. Stacking
• Stacking, another ensemble method, is often referred to as stacked generalization. This
technique works by training a combining model (a meta-learner) on the predictions of several
other learning algorithms.
This combination of multiple models is called an Ensemble. Ensembles use two main methods:
1. Bagging: Creating different training subsets from the sample training data with
replacement is called Bagging. The final output is based on majority voting.
2. Boosting: Combining weak learners into strong learners by creating sequential models
such that the final model has the highest accuracy is called Boosting. Examples: AdaBoost,
XGBoost.
Bagging: From the principle mentioned above, we can understand that Random Forest uses
the bagging technique. Now, let us understand this concept in detail. Bagging, also known
as Bootstrap Aggregation, is used by Random Forest. The process begins with the original
data, from which samples are drawn with replacement; each such sample is known as a
Bootstrap Sample, and this process is known as Bootstrapping. The models are then trained
individually on these samples, yielding different results. In the last step, all the results are
combined (aggregation), and the final output is based on majority voting. This overall
procedure is known as Bagging and is carried out using an ensemble classifier.
Example: Suppose there is a dataset that contains multiple fruit images. So, this dataset is given
to the Random forest classifier. The dataset is divided into subsets and given to each decision
tree. During the training phase, each decision tree produces a prediction result, and when a new
data point occurs, then based on the majority of results, the Random Forest classifier predicts
the final decision. Consider the below image:
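A minimal sketch of this bagging/Random Forest idea with scikit-learn is given below; the iris dataset and the parameter values are assumptions used only to illustrate training many trees on bootstrap samples and aggregating their votes:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 trees, each trained on a bootstrap sample; prediction is by majority vote
rf = RandomForestClassifier(n_estimators=100, bootstrap=True, random_state=0)
rf.fit(X_train, y_train)

print(rf.predict(X_test[:3]))      # predictions for a few new points
print(rf.score(X_test, y_test))    # accuracy from the aggregated votes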
Feature Generation
• Feature generation is the process of constructing new features from existing ones. The
goal of feature generation is to derive new combinations and representations of our data
that might be useful to the machine learning model.
• A feature (or column) represents a measurable piece of data like name, age or gender.
• It is the basic building block of a dataset.
• The quality of a feature can vary significantly and has an immense effect on model
performance.
• We can improve the quality of a dataset’s features in the pre-processing stage using
processes like Feature Generation and Feature Selection.
• Feature Generation (also known as feature construction, feature extraction or feature
engineering) is the process of transforming features into new features that better relate to
the target.
• This can involve mapping a feature into a new feature using a function like log, or creating
a new feature from one or multiple features using multiplication or addition.
• Feature Generation can improve model performance when there is a feature interaction.
• The generation of new flexible features is important as it allows us to use less complex
models that are faster to run and easier to understand and maintain.
Feature selection:
• Feature selection is a process that chooses a subset of features from the original features
so that the feature space is optimally reduced according to a certain criterion.
• Its goal is to find the best possible set of features for building a machine learning model.
• Chi-square test – The chi-square formula is χ² = Σ (O − E)² / E, where O is the observed frequency
and E is the expected frequency; features whose chi-square statistic with respect to the target is
large are considered more relevant.
• Fisher's Score – Fisher's Score selects each feature independently according to its score under
the Fisher criterion, which leads to a suboptimal set of features. The larger the Fisher's score,
the better the selected feature.
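A minimal filter-style sketch using the chi-square score in scikit-learn is given below; the iris dataset and k = 2 are assumptions, and note that chi2 requires non-negative feature values:
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)              # iris features are non-negative

selector = SelectKBest(score_func=chi2, k=2)   # keep the 2 best-scoring features
X_new = selector.fit_transform(X, y)

print("Chi-square scores:", selector.scores_)
print("Selected feature indices:", selector.get_support(indices=True))
print("Reduced shape:", X_new.shape)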
Given a predefined classifier, a typical wrapper model will perform the following steps:
Step 1: search a subset of features,
Step 2: evaluate the selected subset of features by the performance of the classifier,
Step 3: repeat Step 1 and Step 2 until the desired quality is reached.
Some techniques used are:
• Forward selection – An iterative approach in which we start with an empty set of features
and, after each iteration, add the feature that best improves the model. We stop when
adding a new variable no longer improves the performance of the model.
• Backward elimination – Also an iterative approach, but we start with all the features and,
after each iteration, remove the least significant feature. We stop when removing a
feature no longer improves the performance of the model.
• Bi-directional elimination – This method uses the forward selection and backward
elimination techniques simultaneously to reach one unique solution.
• Exhaustive selection – This technique is the brute-force approach to evaluating feature
subsets: it creates every possible subset, builds a model for each, and selects the subset
whose model performs best.
• Recursive elimination – This greedy optimization method selects features by recursively
considering smaller and smaller sets of features. An estimator is trained on the initial set
of features and their importance is obtained from the model's feature importance
attribute; the least important features are then removed from the current set until we are
left with the required number of features.
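As an illustration of the recursive elimination idea described above, a minimal sketch with scikit-learn's RFE wrapper is given below; the logistic-regression estimator, the iris data, and keeping 2 features are assumptions for illustration:
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

estimator = LogisticRegression(max_iter=1000)
rfe = RFE(estimator=estimator, n_features_to_select=2)   # drop features one by one
rfe.fit(X, y)

print("Selected features:", rfe.support_)   # boolean mask of kept features
print("Feature ranking:  ", rfe.ranking_)   # 1 = selected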
Embedded methods:
✓ In embedded methods, the feature selection algorithm is blended as part of the learning
algorithm, thus having its own built-in feature selection methods.
✓ Embedded methods overcome the drawbacks of filter and wrapper methods and merge
their advantages: they are fast like filter methods, more accurate than filter methods, and
also take combinations of features into consideration.
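A common embedded approach is L1-regularized (Lasso) regression, where the penalty drives some coefficients exactly to zero during training, so feature selection happens as part of model fitting; a minimal sketch with scikit-learn is shown below (the diabetes dataset and the alpha value are assumptions for illustration):
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)

lasso = Lasso(alpha=0.5)        # the L1 penalty drives some coefficients to zero
lasso.fit(X, y)

print("Coefficients:     ", lasso.coef_)
print("Selected features:", [i for i, c in enumerate(lasso.coef_) if c != 0])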
UNIT IV CLUSTERING
Choosing distance metrics - Different clustering approaches - hierarchical agglomerative clustering, k-
means (Lloyd's algorithm), - DBSCAN - Relative merits of each method - clustering tendency and quality.
Extending to n dimensions, where the points x and y are of the form x = (x1, x2, …, xn) and y = (y1, y2, …, yn),
we have the following equation for the Euclidean distance:
d(x, y) = √( (x1 − y1)² + (x2 − y2)² + … + (xn − yn)² )
Where,
n = number of dimensions
xi, yi = the coordinates of the data points
Computing Euclidean Distance in Python
from scipy.spatial import distance
We then initialize two points x and y like so:
x = [3,6,9]
y = [1,0,1]
We can use the euclidean convenience function to find the Euclidean distance between the points x and y:
print(distance.euclidean(x,y))
Output >> 10.198039027185569
Manhattan Distance
Manhattan distance is the sum of the absolute differences between two points across all the dimensions.
In n-dimensional space, where each point has n coordinates, the Manhattan distance between the points x and y is given by:
d(x, y) = |x1 − y1| + |x2 − y2| + … + |xn − yn|
Where,
n = number of dimensions
xi, yi = the coordinates of the data points
Computing Manhattan Distance in Python
from scipy.spatial import distance
x = [3,6,9]
y = [1,0,1]
To compute the Manhattan (or cityblock) distance, we can use the cityblock function:
print(distance.cityblock(x,y))
Output >> 16
Minkowski Distance
Minkowski distance is the generalized form of the Euclidean and Manhattan distances. The formula for the Minkowski distance is:
d(x, y) = ( |x1 − y1|^p + |x2 − y2|^p + … + |xn − yn|^p )^(1/p)
With p = 1 it reduces to the Manhattan distance and with p = 2 to the Euclidean distance.
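A short sketch with scipy, reusing the points x and y from the examples above; p = 1 reproduces the Manhattan distance and p = 2 the Euclidean distance:
from scipy.spatial import distance
x = [3, 6, 9]
y = [1, 0, 1]
print(distance.minkowski(x, y, p=1))   # 16.0, same as the Manhattan distance
print(distance.minkowski(x, y, p=2))   # 10.198..., same as the Euclidean distance
print(distance.minkowski(x, y, p=3))   # about 9.03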
Hamming Distance
Hamming distance is a metric for comparing two binary data strings. While comparing two binary strings of
equal length, Hamming distance is the number of bit positions in which the two bits are different.
The Hamming distance between two strings, a and b is denoted as d(a,b).
In order to calculate the Hamming distance between two strings a and b, we perform their XOR operation,
(a ⊕ b), and then count the total number of 1s in the resultant string.
Suppose there are two strings 11011001 and 10011101.
11011001 ⊕ 10011101 = 01000100. Since, this contains two 1s, the Hamming distance, d(11011001,
10011101) = 2.
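The same calculation can be sketched in Python as below; note that scipy's hamming function returns the fraction of differing positions, so it is multiplied by the string length to obtain the count:
a = "11011001"
b = "10011101"

# Count the positions where the corresponding bits differ
hamming = sum(bit_a != bit_b for bit_a, bit_b in zip(a, b))
print(hamming)    # 2

# Equivalent with scipy (returns a fraction, so scale by the string length)
from scipy.spatial import distance
print(distance.hamming([int(c) for c in a], [int(c) for c in b]) * len(a))   # 2.0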
Clustering Approaches
• Clustering is a set of techniques used to partition data into groups, or clusters.
• Clusters are defined as groups of data objects that are more similar to other objects in their cluster
than to data objects in other clusters.
• The clustering technique is commonly used for statistical data analysis.
• It is an unsupervised learning method, hence no supervision is provided to the algorithm, and it
deals with the unlabeled dataset.
Applications of Clustering in different fields:
Marketing: It can be used to characterize & discover customer segments for marketing purposes.
Biology: It can be used for classification among different species of plants and animals.
Libraries: It is used in clustering different books on the basis of topics and information.
Insurance: It is used to acknowledge the customers, their policies and identifying the frauds.
City Planning: It is used to make groups of houses and to study their values based on their geographical
locations and other factors present.
Earthquake studies: By learning the earthquake-affected areas we can determine the dangerous zones.
Image Processing: Clustering can be used to group similar images together, classify images based on
content, and identify patterns in image data.
Genetics: Clustering is used to group genes that have similar expression patterns and identify gene networks
that work together in biological processes.
Finance: Clustering is used to identify market segments based on customer behavior, identify patterns in
stock market data, and analyze risk in investment portfolios.
Customer Service: Clustering is used to group customer inquiries and complaints into categories, identify
common issues, and develop targeted solutions.
Manufacturing: Clustering is used to group similar products together, optimize production processes, and
identify defects in manufacturing processes.
Medical diagnosis: Clustering is used to group patients with similar symptoms or diseases, which helps in
making accurate diagnoses and identifying effective treatments.
Fraud detection: Clustering is used to identify suspicious patterns or anomalies in financial transactions,
which can help in detecting fraud or other financial crimes.
Traffic analysis: Clustering is used to group similar patterns of traffic data, such as peak hours, routes, and
speeds, which can help in improving transportation planning and infrastructure.
Social network analysis: Clustering is used to identify communities or groups within social networks,
which can help in understanding social behavior, influence, and trends.
Cybersecurity: Clustering is used to group similar patterns of network traffic or system behavior, which
can help in detecting and preventing cyberattacks.
Climate analysis: Clustering is used to group similar patterns of climate data, such as temperature,
precipitation, and wind, which can help in understanding climate change and its impact on the environment.
Sports analysis: Clustering is used to group similar patterns of player or team performance data, which can
help in analyzing player or team strengths and weaknesses and making strategic decisions.
Crime analysis: Clustering is used to group similar patterns of crime data, such as location, time, and type,
which can help in identifying crime hotspots, predicting future crime trends, and improving crime
prevention strategies.
Benefits of clustering
✓ It helps to visualize high-dimensional data
✓ It further enables data scientists to deal with different types of data like discrete, categorical, and
binary data
✓ It gives them some structure to unstructured data sets by organizing them into a group
✓ Helps to identify obscure patterns and relationships within a data set
✓ It helps to carry out exploratory data analysis
✓ It can also be used for market segmentation, customer profiling, and more
Clustering Methods:
• Density-Based Methods:
➢ The density-based clustering method connects the highly-dense areas into clusters, and the
arbitrarily shaped distributions are formed as long as the dense region can be connected.
➢ The algorithm does this by identifying the different clusters in the dataset and connecting
the areas of high density into clusters.
➢ The dense areas in data space are divided from each other by sparser areas.
➢ These algorithms can face difficulty in clustering the data points if the dataset has varying
densities and high dimensions.
➢ These methods have good accuracy and the ability to merge two clusters. Example DBSCAN
(Density-Based Spatial Clustering of Applications with Noise), OPTICS (Ordering Points to
Identify Clustering Structure), etc.
• Partitioning Methods:
It is a type of clustering that divides the data into non-hierarchical groups. It is also known as
the centroid-based method. The most common example of partitioning clustering is the K-Means
Clustering algorithm.
In this type, the dataset is divided into a set of k groups, where K is used to define the number of pre-defined
groups. The cluster center is created in such a way that the distance between the data points of one cluster
is minimum as compared to another cluster centroid.
Common Algorithms used in this method are,
• K-Means
• K-Medoids
• K-Modes
• Grid-based Methods:
In this method, the data space is formulated into a finite number of cells that form a grid-like
structure. All the clustering operations performed on these grids are fast and independent of the
number of data objects. Examples: STING (Statistical Information Grid), WaveCluster, CLIQUE
(Clustering In Quest), etc.
K-Means Clustering
• K-Means Clustering is an Unsupervised Learning algorithm, which groups the unlabeled dataset
into different clusters. Here K defines the number of pre-defined clusters that need to be created in
the process, as if K=2, there will be two clusters, and for K=3, there will be three clusters, and so
on.
• It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way
that each data point belongs to only one group, whose members share similar properties.
• It allows us to cluster the data into different groups and provides a convenient way to discover the
categories of groups in an unlabeled dataset on its own, without the need for any training.
• It is a centroid-based algorithm, where each cluster is associated with a centroid.
• The main aim of this algorithm is to minimize the sum of distances between the data point and their
corresponding clusters
• The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and repeats
the process until the best clusters are found (i.e. the assignments no longer change). The value of k
should be predetermined in this algorithm.
• The k-means clustering algorithm mainly performs two tasks:
✓ Determines the best value for K center points or centroids by an iterative process.
✓ Assigns each data point to its closest k-center. Those data points which are near to the particular k-
center, create a cluster.
→Hence each cluster has datapoints with some commonalities, and it is away from other clusters.
The below diagram explains the working of the K-means Clustering Algorithm:
The K-Means Clustering takes the input of dataset D and parameter k, and then divides a dataset D of n
objects into k groups.
→ Cluster similarity is measured with respect to the mean value of the objects in a cluster, which can be
viewed as the cluster's centroid.
→ K-Means iteratively relocates the cluster centers by computing the mean of each cluster.
→ The quality of the cluster assignments is determined by computing the sum of the squared error (SSE) after
the centroids converge, i.e. match the previous iteration's assignment.
→ The SSE is defined as the sum of the squared Euclidean distances of each point to its closest centroid.
Since this is a measure of error, the objective of k-means is to try to minimize this value.
import matplotlib.cm as cm
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
%matplotlib inline
sns.set_style('whitegrid')
plt.style.use('fivethirtyeight')
dataset = pd.read_csv("Data......csv", sep=",")
dataset.head()
dataset.info()
plt.figure(figsize=(10, 9))
sns.scatterplot(x='Annual Income (k$)', y='Spending Score (1-100)', data=dataset, alpha=0.9)
data_x = dataset.iloc[:, 3:5]
data_x.head()
x_array = np.array(data_x)
print(x_array)
x_scaled = StandardScaler().fit_transform(x_array)   # feature scaling
Sum_of_squared_distances = []
K = range(1, 15)
for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(x_scaled)
    Sum_of_squared_distances.append(km.inertia_)
→The process of transforming numerical features to use the same scale is known as feature scaling.
Advantages:
• Scalability
• Performs well on huge datasets
• K-Means is faster than many other clustering algorithms
Disadvantages:
• K-Means is sensitive to outliers.
• Cluster results vary with the value of k and the initial choice of cluster centers.
• The K-Means algorithm works well only for roughly spherical clusters and fails to perform well on
arbitrarily shaped data.
Hierarchical Clustering
➔ Hierarchical clustering is another unsupervised machine learning algorithm, which is used to group
the unlabeled datasets into a cluster and also known as hierarchical cluster analysis or HCA.
➔ In this algorithm, the hierarchy of clusters is developed in the form of a tree, and this tree-shaped
structure is known as the dendrogram.
➔ Dendrograms are tree diagrams frequently used to illustrate the arrangement of the clusters produced
by hierarchical clustering.
➔ A subset of similar data is created in a tree-like structure in which the root node corresponds to the entire
data set and branches are created from the root node to form several clusters.
➔ The optimal number of clusters can be read from the dendrogram: it is the number of vertical lines that a
horizontal line crosses when it is drawn through the largest vertical distance without intersecting any
cluster merge.
➔ Hierarchical clustering involves creating clusters that have a predetermined ordering from top to
bottom. There are two types of hierarchical clustering, Agglomerative and Divisive.
➔ Divisive Clustering: the type of hierarchical clustering that uses a top-down approach to make
clusters. It starts with a single cluster containing all the data and repeatedly splits the least similar
cluster into two, until every observation is in its own cluster (or a stopping criterion is met). Divisive
clustering is not commonly used in real life.
➔ Agglomerative Clustering: the type of hierarchical clustering that uses a bottom-up approach to make
clusters. It repeatedly merges the 2 most similar clusters until there is only one cluster. These steps
are how agglomerative hierarchical clustering works:
➢ Computing (dis)similarity information between every pair of objects in the data set.
➢ Using linkage function to group objects into hierarchical cluster tree, based on the distance
information generated at step 1. Objects/clusters that are in close proximity are linked together using
the linkage function.
➢ Determining where to cut the hierarchical tree into clusters. This creates a partition of the data.
steps:
✓ Consider each alphabet as a single cluster and calculate the distance of one cluster from all the
other clusters.
✓ In the second step, comparable clusters are merged together to form a single cluster. Let's say
cluster (B) and cluster (C) are very similar to each other, so we merge them in the second step;
similarly for clusters (D) and (E). At the end of this step we have the clusters [(A), (BC), (DE), (F)].
✓ We recalculate the proximity according to the algorithm and merge the two nearest clusters ([(DE),
(F)]) together to form the new clusters [(A), (BC), (DEF)].
✓ Repeating the same process; The clusters DEF and BC are comparable and merged together to
form a new cluster. We’re now left with clusters [(A), (BCDEF)].
✓ At last, the two remaining clusters are merged together to form a single cluster [(ABCDEF)].
Algorithm:
given a dataset (d1, d2, d3, ..., dN) of size N
# compute the distance matrix
for i = 1 to N:
    # as the distance matrix is symmetric about the primary diagonal,
    # we compute only the lower part of the primary diagonal
    for j = 1 to i:
        dis_mat[i][j] = distance[di, dj]
each data point is a singleton cluster
repeat
    merge the two clusters having minimum distance
    update the distance matrix
until only a single cluster remains
• Single Linkage: It is the Shortest Distance between the closest points of the clusters. Consider the
below image:
Complete Linkage: It is the farthest distance between the two points of two different clusters. It is one of
the popular linkage methods as it forms tighter clusters than single-linkage.
Average Linkage: It is the linkage method in which the distance between each pair of datasets is added up
and then divided by the total number of datasets to calculate the average distance between two clusters. It is
also one of the most popular linkage methods.
Centroid Linkage: It is the linkage method in which the distance between the centroid of the clusters is
calculated. Consider the below image:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
And then importing the dataset:
df = pd.read_csv("Data......csv", sep=",")
df.head()
plt.figure(figsize=(8,5))
plt.title("Annual income distribution",fontsize=15)
plt.xlabel ("Annual income (k$)",fontsize=13)
plt.grid(True)
plt.hist(df['Annual Income (k$)'],color='blue',edgecolor='k')
plt.show()
plt.figure(figsize=(8,5))
plt.title("Spending Score distribution",fontsize=15)
plt.xlabel ("Spending Score (1-100)",fontsize=14)
plt.grid(True)
plt.hist(df['Spending Score (1-100)'],color='brown',edgecolor='k')
plt.show()
plt.figure(figsize=(11,8))
plt.title("Annual Income and Spending Score Correlation",fontsize=18)
plt.xlabel ("Annual Income (k$)",fontsize=14)
plt.ylabel ("Spending Score (1-100)",fontsize=14)
plt.grid(True)
plt.scatter(df['Annual Income (k$)'],df['Spending Score (1-100)'],color='green',edgecolor='k',alpha=0.6, s=100)
plt.show()
X = df.iloc[:, [3, 4]].values
import scipy.cluster.hierarchy as sch
from scipy.cluster.hierarchy import dendrogram
from scipy.cluster import hierarchy
plt.figure(figsize=(17,10))
plt.title('Dendrogram')
plt.xlabel('Customers')
plt.ylabel('Euclidean distances')
#plt.grid(True)
dendrogram = sch.dendrogram(sch.linkage(X, method = 'ward'))
plt.show()
Pros
• No assumption of a particular number of clusters is needed (unlike k-means)
• The resulting hierarchy may correspond to meaningful taxonomies
Cons
• Once a decision is made to combine two clusters, it can't be undone
• Too slow for large data sets: O(n² log n)
Agglomerative vs. Divisive Clustering (computational cost):
• Agglomerative clustering is generally more computationally expensive, especially for large datasets,
because this approach requires the calculation of all pairwise distances between data points.
• Divisive clustering is comparatively less expensive, as it only requires the calculation of distances
between sub-clusters, which can reduce the computational burden.
DBSCAN
• DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density based
clustering algorithm.
• The main concept of DBSCAN algorithm is to locate regions of high density that are separated from
one another by regions of low density.
• DBSCAN can identify clusters in a large spatial dataset by looking at the local density of
corresponding elements.
• The advantage of the DBSCAN algorithm over the K-Means algorithm is that DBSCAN can
determine which data points are noise or outliers: it can identify points that are not part of any
cluster (which is very useful for outlier detection).
• It is slower than agglomerative clustering and k-means, but it still scales to relatively large datasets.
There are two parameters in DBSCAN: minPoints and eps :
eps: specifies how close points should be to each other to be considered a part of a cluster. It means that if
the distance between two points is lower or equal to this value (eps), these points are considered to be
neighbors.
minPoints: the minimum number of data points to form a dense region/ cluster. For example, if we set
the minPoints parameter as 5, then we need at least 5 points to form a dense region.
Based on the two parameters, the points are classified as Core point, Border point and Noise point.
Core Point
• A point p is a core point if it has at least minPoints points within its eps radius, i.e. |N(p)| ≥ minPoints.
• A core point always lies in a dense region.
Border Point
• A point is a border point if it has fewer than minPoints within eps, but is in the neighborhood of
a core point.
• For example, p is a border point if 'p' is not a core point, i.e. 'p' has fewer than minPoints points in its
eps radius, but 'p' belongs to the neighborhood of some core point 'q':
• p ∈ neighborhood of q and distance(p, q) ≤ eps.
Noise Point
• A noise point is any point that is not a core point or a border point.
Density Edge:
If p and q both are core points and distance between (p,q) ≤ eps then we can connect p, q vertex in a graph
and call it “Density Edge”.
Density Connected Points:
Two points p and q are said to be density connected if both p and q are core points and there exists a
path formed by density edges connecting point p to point q.
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
from sklearn.cluster import DBSCAN
import sklearn.utils
from sklearn.preprocessing import StandardScaler
%matplotlib inline
Next step is importing the dataset:
df = pd.read_csv('Data_Customer_Mall.csv', sep=',')
df.head()
Clus_dataSet = df[['Annual_Income','Spending_Score']]
Clus_dataSet = np.nan_to_num(Clus_dataSet)
Clus_dataSet = np.array(Clus_dataSet, dtype=np.float64)
Clus_dataSet = StandardScaler().fit_transform(Clus_dataSet)
# Compute DBSCAN
db = DBSCAN(eps=0.4, min_samples=5).fit(Clus_dataSet)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
df['Clus_Db'] = labels
realClusterNum = len(set(labels)) - (1 if -1 in labels else 0)
clusterNum = len(set(labels))
# A sample of clusters
print(df[['Annual_Income','Spending_Score']].head())
# Number of Labels
print("number of labels: ", set(labels))
K-Means vs. Hierarchical Clustering:
• K-Means clustering needs advance knowledge of K, i.e. the number of clusters into which you want to
divide your data. In hierarchical clustering, one can stop at any number of clusters and find an appropriate
number by interpreting the dendrogram.
• K-Means methods are normally less computationally intensive and are suited to very large datasets.
Agglomerative hierarchical methods build the clusters bottom-up, while divisive methods work in the
opposite direction, beginning with one cluster that includes all the records.
• K-Means clustering is found to work well when the structure of the clusters is hyper-spherical (like a circle
in 2D or a sphere in 3D). Hierarchical clustering does not work as well as K-Means when the shape of the
clusters is hyper-spherical.
Lloyd's Algorithm
We can reduce the distortion in two ways: by changing the points' cluster assignments and by moving the
cluster centers. Specifically, given a fixed set of points X, we are attempting to minimize the distortion
function
J(X, C) = Σi || xi − c(xi) ||²,
i.e. the sum of the squared distances from each point xi to its assigned cluster center c(xi).
Step 1 (choosing the initial centers) occurs only once, while steps 2(a) (assigning each point to its closest
center) and 2(b) (moving each center to the mean of its assigned points) alternate until the algorithm
converges. Convergence is guaranteed because steps 2(a) and 2(b) both reduce J(X, C), and there is only a
finite number of ways to partition the n points among k clusters.
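A compact NumPy sketch of these alternating steps is given below; the toy data, k = 3 and the fixed iteration count are assumptions, and a production implementation would also handle empty clusters and test for convergence:
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 2))                              # toy 2-D points (assumption)
k = 3
centers = X[rng.choice(len(X), k, replace=False)]     # step 1: pick initial centers

for _ in range(20):                                   # steps 2(a)/2(b) alternate
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)                     # 2(a): assign points to closest center
    centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])  # 2(b): move centers

# distortion J(X, C): sum of squared distances to the final closest centers
dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
print("J(X, C) =", (dists.min(axis=1) ** 2).sum())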
K-Means vs. DBSCAN:
• K-Means: the clusters formed are more or less spherical or convex in shape and must have the same
feature size. DBSCAN: the clusters formed can be arbitrary in shape and need not have the same size.
• K-Means clustering is more efficient for large datasets. DBSCAN clustering cannot efficiently handle
high-dimensional datasets.
• K-Means clustering does not work well with outliers and noisy datasets. DBSCAN clustering handles
outliers and noisy datasets efficiently.
DATA VISUALIZATION
• Data visualization may be described as graphically representing data. It is the act of
translating data into a visual context, which can be done using charts, plots, animations,
infographics, etc.
• Data visualization in data science refers to the process of generating graphical
representations of information; these graphical depictions are often known as plots or charts.
• Data visualization in data science is pivotal for effectively communicating insights.
• The purpose of data visualization is to help drive informed decision-making.
Examples of Data Visualization in Data Science
✓ Weather reports: Maps and other plot types are commonly used in weather reports.
✓ Internet websites: Social media analytics websites such as Social Blade and Google
Analytics use data visualization techniques to analyze and compare the performance of
websites.
✓ Astronomy: NASA uses advanced data visualization techniques in its reports and
presentations.
✓ Geography
✓ Gaming industry
✓ Python libraries include various features that allow users to create highly customized, classy,
and interactive plots. They are
✓ Matplotlib
✓ Seaborn
✓ Bokeh
✓ Plotly
Importing Matplotlib
• Matplotlib is a cross-platform, data visualization and graphical plotting library for Python
and its numerical extension NumPy.
• Matplotlib is a comprehensive library for creating static, animated, and interactive
visualizations in Python.
• Matplotlib is a Python visualization library for 2D array plots.
• Matplotlib is a library that contains several submodules such as pyplot.
• Matplotlib is a Python library that uses the NumPy library. Matplotlib includes a wide range
of plots, such as scatter, line, bar, histogram, and others, that can assist us in delving deeper
into trends, behavioral patterns, and correlations.
import matplotlib.pyplot as plt
plt.plot([1,2,3],[5,7,4])
plt.show()
→ We import matplotlib.pyplot under the alias plt, which is the alias used by convention for this
submodule.
→ plt.plot() will "draw" this plot in the background.
→ plt.show(): with that, the graph pops up.
Line Charts
→A Line chart is a graph that represents information as a series of data points connected by a
straight line. In line charts, each data point or marker is plotted and connected with a line or curve.
Using Matplotlib
Using Seaborn
• A grid can be added to a Matplotlib plot using the plt.grid() command. By default, the grid is
turned off. To turn on the grid use:
plt.grid(True)
• plt.grid(True) turns the grid on and plt.grid(False) turns it off.
Defining the Line Appearance and Working with Line Style
• Line styles help differentiate graphs by drawing the lines in various ways. Following line style is
used by Matplotlib.
• Matplotlib has an additional parameter to control the colour and style of the plot:
plt.plot(xa, ya, 'g')
• This will make the line green. Red, green, blue, cyan, magenta, yellow, white or black can be chosen
just by using the first character of the colour name in lower case (use "k" for black, as "b"
means blue).
• The line style can also be altered; for example, two dashes -- make a dashed line. This can be added
to the colour selector, like this:
plt.plot(xa, ya, 'r--')
• Use "-" for a solid line (the default), "-." for dash-dot lines, or ":" for a dotted line. Here is an
example:
from matplotlib import pyplot as plt
import numpy as np
xa = np.linspace(0, 5, 20)
ya = xa**2
plt.plot(xa, ya, 'g')
ya = 3*xa
plt.plot(xa, ya, 'r--')
plt.show()
Output:
Adding Markers
• Markers add a special symbol to each data point in a line graph. Unlike line style and color,
markers tend to be a little less susceptible to accessibility and printing issues.
• Basically, Matplotlib uses identifiers for the markers that look similar to the marker itself:
1. Triangle-shaped: v, <, >, ^
2. Cross-like: *, +, 1, 2, 3, 4
3. Circle-like: o, ., h, p, H, 8
• Having differently shaped markers is a great way to distinguish between different groups of data
points. If your control group is all circles and your experimental group is all X's the difference pops
out, even to colorblind viewers.
import numpy as np
import matplotlib.pyplot as plt
x, y = np.random.rand(30), np.random.rand(30)   # example data
fig, ax = plt.subplots()
N = x.size // 3
ax.scatter(x[:N], y[:N], marker="o")
ax.scatter(x[N: 2 * N], y[N: 2 * N], marker="x")
ax.scatter(x[2 * N:], y[2 * N:], marker="s")
• There's no way to specify multiple marker styles in a single scatter() call, but we can separate our
data out into groups and plot each marker style separately. Here we chopped our data up into three
equal groups.
Creating a legend
• There are several options available for customizing the appearance and position of the plot
legend. In Matplotlib a legend is not drawn automatically; it is added by calling the legend()
method, and its location can be controlled (for example with the loc parameter).
• A legend documents the individual elements of a plot. Each line is listed with a label so that
people can differentiate between the lines.
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(-10, 9, 20)
y = x ** 3
z = x ** 2
figure = plt.figure()
axes = figure.add_axes([0,0,1,1])
axes.plot(x, z, label="Square Function")
axes.plot(x, y, label="Cube Function")
axes.legend()
• In the script above we define two curves from x: a square (stored in z) and a cube (stored in y).
We first plot the square function and, for the label parameter, pass the value Square Function;
this is the text displayed in the legend for the square function. Next, we plot the cube function
and pass Cube Function as the value of the label parameter.
• The output looks likes this:
Bar Graphs
When you have categorical data, you can represent it with a bar graph. A bar graph plots data with
the help of bars, which represent value on the y-axis and category on the x-axis. Bar graphs use bars
with varying heights to show the data which belongs to a specific category.
Bar Chart
A bar plot or a bar chart is just a graph that uses rectangular bars which have lengths and heights
proportional to the data values they represent to represent a category of data.
import pandas as pd
import matplotlib.pyplot as plt
# reading the csv data set
dataset = pd.read_csv("tips.csv")
# Plotting a bar chart of total_bill vs tip
plt.bar(dataset['total_bill'], dataset['tip'])
# Giving our plot a title
plt.title("Bar Chart")
# Giving the x and y labels names
plt.xlabel('Total Bill')
plt.ylabel('Tip')
plt.show()
Seaborn
→Seaborn is a Python library for creating statistical representations based on datasets. It is built on
top of matplotlib and is used to create various visualizations. It's built on top of pandas' data
structures. The library conducts the necessary modeling and aggregation internally to create
insightful visuals.
# importing the required packages
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# reading the csv data set using pandas
dataset = pd.read_csv("tips.csv")
sns.lineplot(x='total_bill', y='tip', data=dataset)
plt.show()
import numpy as np
import matplotlib.pyplot as plt
fig = plt.figure()
ax = plt.axes()
x = np.linspace(0, 10, 1000)
ax.plot(x, np.sin(x))
Output:
The matplotlib provides the fill_between() function which is used to fill area around the lines based
on the user defined logic.
import numpy as np
import matplotlib.pyplot as plt
fig = plt.figure()
ax = plt.axes()
x = np.linspace(0, 10, 1000)
ax.plot(x, np.sin(x))
# fill a band of width 0.4 around the curve
ax.fill_between(x, np.sin(x) - 0.2, np.sin(x) + 0.2, alpha=0.3)
Output:
Pie Chart
→A pie chart is a circular graph that is broken down in the segment or slices of pie.
→It is generally used to represent the percentage or proportional data where each slice of pie
represents a particular category. Let's have a look at the below example:
from matplotlib import pyplot as plt
# Pie chart, where the slices will be ordered and plotted counter-clockwise:
Players = 'Rohit', 'Virat', 'Shikhar', 'Yuvraj'
Runs = [45, 30, 15, 10]
explode = (0.1, 0, 0, 0) # it "explode" the 1st slice
fig1, ax1 = plt.subplots()
ax1.pie(Runs, explode=explode, labels=Players, autopct='%1.1f%%',
shadow=True, startangle=90)
ax1.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()
Output:
Scatter plot
→The scatter plots are mostly used for comparing variables when we need to define how much one
variable is affected by another variable.
→The data is displayed as a collection of points. Each point has the value of one variable, which
defines the position on the horizontal axes, and the value of other variable represents the position on
the vertical axis.
Example
from matplotlib import pyplot as plt
from matplotlib import style
style.use('ggplot')
x = [5,7,10]
y = [18,10,6]
x2 = [6,9,11]
y2 = [7,14,17]
plt.scatter(x, y)
plt.scatter(x2, y2, color='g')
plt.title('Epic Info')
plt.ylabel('Y axis')
Visualizing Errors
• Error bars can be added to Matplotlib line plots and graphs. Error is the difference between the
calculated value and the actual value.
• Without error bars, plots give the impression that a measured or calculated value is known to a
high level of precision. The method matplotlib.pyplot.errorbar() draws y versus x as lines and/or
markers with the associated error bars.
• Adding an error bar in Matplotlib is very simple; we just have to supply the value of the
error. We use the command:
plt.errorbar(x, y, yerr = 2, capsize=3)
Where:
x = The data of the X axis.
Y = The data of the Y axis.
yerr = The error value of the Y axis. Each point has its own error value.
xerr = The error value of the X axis.
capsize = The size of the lower and upper lines of the error bar
• A simple example, where we only plot one point. The error is the 10% on the Y axis.
• We plot using the command "plt.errorbar (...)", giving it the desired characteristics.
import matplotlib.pyplot as plt
import numpy as np
x = np.arange(1, 8)
y = np.array([20, 10, 45, 32, 38, 21, 27])
y_error = y * 0.10   # the 10% error
plt.errorbar(x, y, yerr=y_error,
             linestyle="None", fmt="ob", capsize=3, ecolor="k")
plt.show()
• Parameters of errorbar:
a) yerr is the error value at each point.
b) linestyle – here it indicates that we will not plot a connecting line.
c) fmt is the type of marker, in this case a point ("o") in blue ("b").
d) capsize is the size of the lower and upper caps of the error bar.
e) ecolor is the color of the error bar; the default is the marker color.
Output:
• Multiple error bars in Matplotlib: the ability to draw numerous lines in the same plot is
important, and with the same scheme we can draw many error bars in a single graph.
import numpy as np
import matplotlib.pyplot as plt
fig = plt.figure()
x = np.arange(20)
y = 4 * np.sin(x / 20 * np.pi)
yerr = np.linspace(0.06, 0.3, 20)
# draw two error-bar series in the same figure
plt.errorbar(x, y, yerr=yerr, fmt='-o', capsize=3)
plt.errorbar(x, y + 2, yerr=yerr, fmt='-s', capsize=3)
plt.show()
Output:
import numpy as np
import matplotlib.pyplot as plt
# define a function
def func(x, y):
    return np.sin(x) ** 2 + np.cos(y) ** 2
# generate 50 values between 0 and 5
x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 50)
# Generate combination of grids
X, Y = np.meshgrid(x, y)
Z = func(X, Y)
# Draw rectangular contour plot
plt.contour(X, Y, Z, cmap='gist_rainbow_r');
import numpy as np
import matplotlib.pyplot as plt
def f(x, y):
    return np.sin(x) ** 10 + np.cos(10 + y * x) * np.cos(x)
x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 40)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)
plt.imshow(Z, extent=[0, 5, 0, 5], origin='lower', cmap='RdGy')
plt.colorbar()
Histogram
→ A histogram is used for the distribution, whereas a bar chart is used to compare different entities.
A histogram is a type of bar plot that shows the frequency of a number of values compared to a set
of values ranges.
For example, we take data on the different age groups of people and plot a histogram with
respect to the bins. A bin represents a range of values: the data range is divided into a series of
intervals, and the bins are generally created with the same size.
from matplotlib import pyplot as plt
population_age = [21,53,60,49,25,27,30,42,40,1,2,102,95,8,15,105,70,65,55,70,75,60,52,44,43,42,45]
bins = [0,10,20,30,40,50,60,70,80,90,100]
plt.hist(population_age, bins, histtype='bar', rwidth=0.8)
plt.xlabel('age groups')
plt.ylabel('Number of people')
plt.title('Histogram')
plt.show()
Output:
Legend
→Plot legends give meaning to a visualization, assigning labels to the various plot elements.
→Legends are found in maps - describe the pictorial language or symbology of the map. Legends
are used in line graphs to explain the function or the values underlying the different lines of the
graph.
→ Matplotlib has native support for legends. Legends can be placed in various positions: A legend
can be placed inside or outside the chart and the position can be moved. The legend() method adds
the legend to the plot.
To place the legend inside, simply call legend():
import matplotlib.pyplot as plt
import numpy as np
y = [2,4,6,8,10,12,14,16,18,20]
y2 = [10,11,12,13,14,15,16,17,18,19]
x = np.arange(10)
fig = plt.figure()
ax = plt.subplot(111)
ax.plot(x, y, label='y = numbers')
ax.plot(x, y2, label='y2 = other numbers')
plt.title('Legend inside')
ax.legend()
plt.show()
Output:
from polynomials import Polynomial
import numpy as np
import matplotlib.pyplot as plt
p = Polynomial(-0.8, 2.3, 0.5, 1, 0.2)
p_der = p.derivative()
fig, ax = plt.subplots()
X = np.linspace(-2, 3, 50, endpoint=True)
F = p(X)
F_derivative = p_der(X)
ax.plot(X, F, label="p")
ax.plot(X, F_derivative, label="derivation of p")
ax.legend(loc='upper left')
Output:
Subplots
• Subplots mean groups of axes that can exist in a single matplotlib figure. subplots() function in the
matplotlib library, helps in creating multiple layouts of subplots. It provides control over all the
individual plots that are created.
• subplots() without arguments returns a Figure and a single Axes. This is actually the simplest and
recommended way of creating a single Figure and Axes.
fig, ax = plt.subplots()
ax.plot(x,y)
ax.set_title('A single plot')
Output:
• There are 3 different ways (at least) to create plots (called axes) in matplotlib. They are:plt.axes(),
figure.add_axis() and plt.subplots()
• plt.axes(): The most basic method of creating an axes is to use the plt.axes function. It takes an
optional argument giving the axes position in the figure coordinate system: the numbers represent
[left, bottom, width, height], where the coordinate system ranges from 0 at the bottom-left of the
figure to 1 at the top-right of the figure.
• Plot just one figure with (x,y) coordinates: plt.plot(x, y).
• By calling subplot(n, m, k), we subdivide the figure into n rows and m columns and specify that
plotting should be done on subplot number k. Subplots are numbered row by row, from left to
right.
import matplotlib.pyplot as plt
import numpy as np
from math import pi
plt.figure(figsize=(8, 4))   # set dimensions of the figure
x = np.linspace(0, 2 * pi, 100)
for i in range(1, 7):
    plt.subplot(2, 3, i)     # create subplots on a grid with 2 rows and 3 columns
    plt.xticks([])           # set no ticks on x-axis
    plt.yticks([])           # set no ticks on y-axis
    plt.plot(np.sin(x), np.cos(i * x))
    plt.title('subplot(2,3,' + str(i) + ')')
plt.show()
Output:
Example :
import plotly.graph_objects as go
fig = go.Figure()
fig.add_trace(go.Scatter(
    x=[0, 1, 2, 3, 4, 5, 6, 7, 8],
Customization
• A tick is a short line on an axis. For category axes, ticks separate each category. For value axes,
ticks mark the major divisions and show the exact point on an axis that the axis label defines. Ticks
are always the same color and line style as the axis.
• Ticks are the markers denoting data points on axes. Matplotlib's default tick locators and
formatters are designed to be generally sufficient in many common situations. Position and labels of
ticks can be explicitly mentioned to suit specific requirements.
Example :
import numpy as np
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(8, 8))
ax = plt.axes(projection='3d')
ax.grid()
t = np.arange(0, 10 * np.pi, np.pi / 50)
x = np.sin(t)
y = np.cos(t)
ax.set_title('3D Parametric Plot')
# Set axes label
ax.set_xlabel('x',labelpad=20)
ax.set_ylabel('y', labelpad=20)
ax.set_zlabel('t', labelpad=20)
plt.show()
Output:
3D graph plot
→Three-dimension plots can be created by importing the mplot3d toolkit, include with the main
Matplotlib installation:
from mpl_toolkits import mplot3d
When this module is imported in the program, three-dimension axes can be created by passing the
keyword projection='3d' to any of the normal axes creation routines:
Example
from mpl_toolkits import mplot3d
import numpy as np
import matplotlib.pyplot as plt
fig = plt.figure()
ax = plt.axes(projection='3d')
Output:
Example-2:
from mpl_toolkits import mplot3d
import numpy as np
import matplotlib.pyplot as plt
height = np.array([100,110,87,85,65,80,96,75,42,59,54,63,95,71,86])
weight = np.array([105,123,84,85,78,95,69,42,87,91,63,83,75,41,80])
fig = plt.figure()
ax = plt.axes(projection='3d')
# This is used to plot a 3D scatter
ax.scatter3D(height, weight)
plt.title("3D Scatter Plot")
plt.xlabel("Height")
plt.ylabel("Weight")
plt.show()
Output:
plot(x-axis values, y-axis values) – It is used to plot a simple line graph with the x-axis values against
the y-axis values.
show() – It is used to display the graph.
xticks(index, categorical variables) – It is used to set or get the current tick locations and labels of
the x-axis.
xlim(start value, end value) – It is used to set the limit of values of the x-axis.
ylim(start value, end value) – It is used to set the limit of values of the y-axis.
scatter(x-axis values, y-axis values) – It is used to plot a scatter plot with the x-axis values against
the y-axis values.
Heat Maps:
→Heat maps are a type of graphical representation that displays data in a matrix format. The value
of the data point that each matrix cell represents determines its hue. Heatmaps are often used to
visualize the correlation between variables or to identify patterns in time-series data.
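A minimal sketch of a correlation heat map with seaborn is shown below; it reuses the tips.csv dataset from the earlier examples, and the choice of columns is an assumption:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

dataset = pd.read_csv("tips.csv")

# Correlation matrix of the numeric columns, drawn as a heat map
corr = dataset[['total_bill', 'tip', 'size']].corr()
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title("Correlation Heat Map")
plt.show()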
Tree Maps: Tree maps are used to display hierarchical data in a compact format and are useful in
showing the relationship between different levels of a hierarchy.
Box Plots: Box plots are a graphical representation of the distribution of a set of data. In a box plot,