
CS32A3 FOUNDATIONS OF DATA SCIENCE    (L T P C: 3 0 0 3)
AIM
To explore the different techniques in Data Science for drawing useful conclusions
from large and diverse data sets through exploration, prediction, and inference.

OBJECTIVES
• To obtain a comprehensive knowledge of various tools and techniques for data
transformation and visualization
• To learn the probability and probabilistic models of data science
• To learn the basic statistics and hypothesis testing for specific problems
• To learn about the prediction models
Subject Code: CS32A3    Subject Name: FOUNDATIONS OF DATA SCIENCE
Course Outcomes (COs)                                                          Cognitive Skill
CO-01  Understand the fundamental concepts of data science                     R, U
CO-02  Evaluate the data analysis techniques for applications handling         E
       large data
CO-03  Demonstrate the various machine learning algorithms used in the         U, An
       data science process
CO-04  Understand the basic statistics and testing hypothesis for specific     U
       problems and different prediction models
CO-05  Visualize and present the inference using various tools                 A
R - Remember, U - Understand, A - Apply, An - Analyze, E - Evaluate, S - Synthesis
Mapping of Course Outcomes with Programme Outcomes
CO \ PO   PO-A  PO-B  PO-C  PO-D  PO-E  PO-F  PO-G
CO-01      M     M     N     N     N     L     L
CO-02      L     M     N     N     L     L     N
CO-03      M     S     N     N     L     L     L
CO-04      S     M     N     N     N     L     M
CO-05      L     L     N     N     N     L     N
S - Strong, M - Medium, L - Low, N - Not relevant

UNIT I INTRODUCTION 9
What is Data Science? Big Data and Data Science – Datafication - Current landscape of
perspectives - Skill sets needed; Matrices - Matrices to represent relations between data, and
necessary linear algebraic operations on matrices -Approximately representing matrices by
decompositions (SVD and PCA); Statistics: Descriptive Statistics: distributions and probability
- Statistical Inference: Populations and samples - Statistical modeling - probability distributions
- fitting a model - Hypothesis Testing - Intro to R/ Python.
UNIT II DATA PREPROCESSING 9
Data cleaning - Data integration - Data reduction - Data transformation and Data
discretization. Evaluation of classification methods – Confusion matrix, Student's t-tests and
ROC curves - Exploratory Data Analysis - Basic tools (plots, graphs and summary statistics) of
EDA, Philosophy of EDA - The Data Science Process.

UNIT III BASIC MACHINE LEARNING ALGORITHMS 9


Association Rule mining - Linear Regression- Logistic Regression - Classifiers - k-Nearest
Neighbors (k-NN), k-means -Decision tree - Naive Bayes- Ensemble Methods - Random
Forest. Feature Generation and Feature Selection - Feature Selection algorithms - Filters;
Wrappers; Decision Trees; Random Forests.

UNIT IV CLUSTERING 9
Choosing distance metrics - Different clustering approaches - hierarchical agglomerative
clustering, k-means (Lloyd's algorithm), - DBSCAN - Relative merits of each method -
clustering tendency and quality.

UNIT V DATA VISUALIZATION 9


Importing Matplotlib – Line plots – Scatter plots – visualizing errors – density and contour
plots – Histograms – legends – colors – subplots – text and annotation – customization – three
dimensional plotting - Geographic Data with Basemap - Visualization with Seaborn.
TOTAL:45 PERIODS
TEXT BOOKS
1. Cathy O'Neil and Rachel Schutt, “Doing Data Science, Straight Talk From The
Frontline”, O'Reilly, 2014.
2. Jiawei Han, Micheline Kamber and Jian Pei, “Data Mining: Concepts and Techniques”,
Third Edition. ISBN 0123814790, 2011.
3. Mohammed J. Zaki and Wagner Meira Jr., “Data Mining and Analysis: Fundamental
Concepts and Algorithms”, Cambridge University Press, 2014.
4. Matt Harrison, “Learning the Pandas Library: Python Tools for Data Munging,
Analysis, and Visualization”, O'Reilly, 2016.
5. Joel Grus, “Data Science from Scratch: First Principles with Python”, O’Reilly Media,
2015.
6. Wes McKinney, “Python for Data Analysis: Data Wrangling with Pandas, NumPy, and
IPython”, O'Reilly Media, 2012.

REFERENCES
1. Jure Leskovec, Anand Rajaraman and Jeffrey Ullman. Mining of Massive Datasets. v2.1,
Cambridge University Press. 2014.
2. Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. ISBN 0262018020.
2013.
3. Trevor Hastie, Robert Tibshirani and Jerome Friedman. Elements of Statistical Learning,
Second Edition. ISBN 0387952845. 2009.
4. Mohammed J. Zaki and Wagner Meira Jr. Data Mining and Analysis: Fundamental
Concepts and Algorithms. Cambridge University Press. 2014.
5. Jiawei Han, Micheline Kamber and Jian Pei. Data Mining: Concepts and Techniques, Third
Edition. ISBN 0123814790. 2011.
UNIT II DATA PREPROCESSING
Data cleaning - Data integration - Data reduction - Data transformation and Data discretization. Evaluation
of classification methods – Confusion matrix, Student's t-tests and ROC curves - Exploratory Data Analysis
- Basic tools (plots, graphs and summary statistics) of EDA, Philosophy of EDA - The Data Science
Process.

DATA PREPROCESSING
Data preprocessing is an important step. It refers to the cleaning, transforming, and integrating of data in
order to make it ready for analysis. The goal of data preprocessing is to improve the quality of the data and
to make it more suitable for the specific data mining task. In other words, data preprocessing is a data
mining technique used to transform raw data into a useful and efficient format.
Some common steps in data preprocessing include:

Data Cleaning: This involves identifying and correcting errors or inconsistencies in the data, such as
missing values, outliers, and duplicates. Various techniques can be used for data cleaning, such as
imputation, removal, and transformation.
Data Integration: This involves combining data from multiple sources to create a unified dataset. Data
integration can be challenging as it requires handling data with different formats, structures, and semantics.
Techniques such as record linkage and data fusion can be used for data integration.
Data Transformation: This involves converting the data into a suitable format for analysis. Common
techniques used in data transformation include normalization, standardization, and discretization.
Normalization is used to scale the data to a common range, while standardization is used to transform the
data to have zero mean and unit variance. Discretization is used to convert continuous data into discrete
categories.
Data Reduction: This involves reducing the size of the dataset while preserving the important information.
Data reduction can be achieved through techniques such as feature selection and feature extraction. Feature
selection involves selecting a subset of relevant features from the dataset, while feature extraction involves
transforming the data into a lower-dimensional space while preserving the important information.
Data Discretization: This involves dividing continuous data into discrete categories or intervals.
Discretization is often used in data mining and machine learning algorithms that require categorical data.
Discretization can be achieved through techniques such as equal width binning, equal frequency binning,
and clustering.

Data Normalization: This involves scaling the data to a common range, such as between 0 and 1 or -1
and 1. Normalization is often used to handle data with different units and scales. Common normalization
techniques include min-max normalization, z-score normalization, and decimal scaling.

Data Cleaning:
Data cleaning is an essential step in the data mining process and is crucial to the construction of a
model. Data cleaning is the process of correcting or deleting inaccurate, damaged, improperly formatted,
duplicated, or insufficient data from a dataset. Even if results and algorithms appear to be correct, they
are unreliable if the data is inaccurate. There are numerous ways for data to be duplicated or incorrectly
labeled when merging multiple data sources. The data can have many irrelevant and missing parts; to
handle this, data cleaning is done. It involves handling of missing data, noisy data, etc.

(a). Missing Data:


This situation arises when some values are missing in the data. It can be handled in various ways.
Some of them are:
Ignore the tuples:
This approach is suitable only when the dataset we have is quite large and multiple values are missing
within a tuple.

Fill the Missing values:


There are various ways to do this task. You can choose to fill the missing values manually, with the
attribute mean, or with the most probable value.
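
As an illustration of these options, here is a minimal sketch using pandas; the column name 'age' and the toy values are hypothetical and only for demonstration.

import pandas as pd
import numpy as np

# Hypothetical toy dataset with missing values in the 'age' column
df = pd.DataFrame({'age': [25, np.nan, 32, 40, np.nan, 29]})

# Option 1: ignore (drop) the tuples that contain missing values
dropped = df.dropna()

# Option 2: fill the missing values with the attribute mean
filled_mean = df.fillna(df['age'].mean())

# Option 3: fill with the most probable (most frequent) value
filled_mode = df.fillna(df['age'].mode()[0])

print(dropped, filled_mean, filled_mode, sep='\n')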

(b). Noisy Data:


Noisy data is meaningless data that cannot be interpreted by machines. It can be generated due to faulty
data collection, data entry errors, etc. It can be handled in the following ways:
Binning Method:
This method works on sorted data in order to smooth it. The whole data is divided into segments of equal
size and each segment is handled separately. All data in a segment can be replaced by its mean, or the
segment boundary values can be used to complete the task.
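
A minimal sketch of smoothing by bin means follows; the sorted values and the bin size of 3 are hypothetical and chosen only for illustration.

import numpy as np

# Hypothetical sorted data and equal-size bins of 3 values each
data = np.array([4, 8, 9, 15, 21, 21, 24, 25, 28])
bin_size = 3

# Split the sorted data into equal-size segments (bins)
bins = data.reshape(-1, bin_size)

# Smoothing by bin means: replace every value in a bin with the bin mean
smoothed_by_mean = np.repeat(bins.mean(axis=1), bin_size)

print(smoothed_by_mean)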
Regression:
Here data can be smoothed by fitting it to a regression function. The regression used may be linear
(having one independent variable) or multiple (having multiple independent variables).

Clustering:
This approach groups similar data into clusters. Outliers either fall outside the clusters or may go
undetected.
Steps for Cleaning Data
1. Remove duplicate or irrelevant observations
2. Fix structural errors
3. Filter unwanted outliers
4. Handle missing data
Data Integration
Data Integration is a data preprocessing technique that combines data from multiple heterogeneous data
sources into a coherent data store and provides a unified view of the data. These sources may include
multiple data cubes, databases, or flat files.
The data integration approaches are formally defined as a triple <G, S, M>, where
G stands for the global schema,
S stands for the heterogeneous source schemas,
M stands for the mapping between queries over the source and global schemas.
Data integration can be challenging due to the variety of data formats, structures, and semantics used by
different data sources. Different data sources may use different data types, naming conventions, and
schemas, making it difficult to combine the data into a single view. Data integration typically involves a
combination of manual and automated processes, including data profiling, data mapping, data
transformation, and data reconciliation.
There are two major approaches:
Tight Coupling:
This approach involves creating a centralized repository or data warehouse to store the integrated data. The
data is extracted from various sources, transformed and loaded into a data warehouse. Data is integrated in
a tightly coupled manner, meaning that the data is integrated at a high level, such as at the level of the
entire dataset or schema. This approach is also known as data warehousing, and it enables data consistency
and integrity, but it can be inflexible and difficult to change or update.
Here, a data warehouse is treated as an information retrieval component.
In this coupling, data is combined from different sources into a single physical location through the process
of ETL – Extraction, Transformation, and Loading.
Loose Coupling:
This approach involves integrating data at the lowest level, such as at the level of individual data elements
or records. Data is integrated in a loosely coupled manner, meaning that the data is integrated at a low
level, and it allows data to be integrated without having to create a central repository or data warehouse.
This approach is also known as data federation, and it enables data flexibility and easy updates, but it can
be difficult to maintain consistency and integrity across multiple data sources.

Data reduction
Data reduction is a technique used in data mining to reduce the size of a dataset while still preserving the
most important information. This can be beneficial in situations where the dataset is too large to be
processed efficiently, or where the dataset contains a large amount of irrelevant or redundant
information.
Techniques
Data Sampling: This technique involves selecting a subset of the data to work with, rather than using the
entire dataset. This can be useful for reducing the size of a dataset while still preserving the overall trends
and patterns in the data.

Dimensionality Reduction: This technique involves reducing the number of features in the dataset,
either by removing features that are not relevant or by combining multiple features into a single feature.
Data Compression: This technique involves using techniques such as lossy or lossless compression to
reduce the size of a dataset.
Data Discretization: This technique involves converting continuous data into discrete data by
partitioning the range of possible values into intervals or bins.
Feature Selection: This technique involves selecting a subset of features from the dataset that are most
relevant to the task at hand.

Methods of data reduction:


These are explained below.
1. Data Cube Aggregation:
This technique is used to aggregate data into a simpler form. For example, imagine that the information
you gathered for your analysis for the years 2012 to 2014 includes the revenue of your company every
three months; the quarterly figures can be aggregated into annual revenue, reducing the data size while
preserving the information needed for the analysis.
2. Dimension reduction:
Whenever we come across data that is weakly important, we keep only the attributes required for our
analysis. This reduces data size as it eliminates outdated or redundant features.
Step-wise Forward Selection –
The selection begins with an empty set of attributes; at each step, the best of the remaining original
attributes is added to the set based on its relevance (judged, for example, by a p-value in statistics).
Suppose there are the following attributes in the data set in which few attributes are redundant.
Initial attribute Set: {X1, X2, X3, X4, X5, X6}
Initial reduced attribute set: { }
Step-1: {X1}
Step-2: {X1, X2}
Step-3: {X1, X2, X5}
Final reduced attribute set: {X1, X2, X5}
Step-wise Backward Selection –
This selection starts with a set of complete attributes in the original data and at each point, it eliminates
the worst remaining attribute in the set.
Suppose there are the following attributes in the data set in which few attributes are redundant.
Initial attribute Set: {X1, X2, X3, X4, X5, X6}
Initial reduced attribute set: {X1, X2, X3, X4, X5, X6 }
Step-1: {X1, X2, X3, X4, X5}
Step-2: {X1, X2, X3, X5}
Step-3: {X1, X2, X5}
Final reduced attribute set: {X1, X2, X5}

Combination of Forward and Backward Selection –

This combines both procedures: at each step the best remaining attribute can be selected and the worst
one removed, saving time and making the process faster.
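
A hedged sketch of forward and backward selection using scikit-learn's SequentialFeatureSelector is given below; the synthetic dataset and the choice of logistic regression as the estimator are assumptions for illustration (the selector scores attribute subsets by cross-validation rather than p-values).

from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Hypothetical dataset with 6 attributes, only some of which are informative
X, y = make_classification(n_samples=200, n_features=6, n_informative=3, random_state=0)

estimator = LogisticRegression(max_iter=1000)

# Step-wise forward selection: start from an empty set and add the best attribute at each step
forward = SequentialFeatureSelector(estimator, n_features_to_select=3, direction='forward').fit(X, y)

# Step-wise backward selection: start from all attributes and remove the worst at each step
backward = SequentialFeatureSelector(estimator, n_features_to_select=3, direction='backward').fit(X, y)

print("Forward selection kept attributes:", forward.get_support(indices=True))
print("Backward selection kept attributes:", backward.get_support(indices=True))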
3. Data Compression:
The data compression technique reduces the size of files using different encoding mechanisms
(e.g., Huffman encoding and run-length encoding). Compression can be divided into two types, lossless
and lossy, based on whether the original data can be reconstructed exactly.
4. Numerosity Reduction:
In this reduction technique, the actual data is replaced with a mathematical model or a smaller
representation of the data. For parametric methods, only the model parameters need to be stored;
non-parametric methods include clustering, histograms, and sampling.

5. Discretization & Concept Hierarchy Operation:


Data discretization techniques are used to divide attributes of a continuous nature into data with
intervals. Many constant values of an attribute are replaced by labels of small intervals, so that the
mining results are presented in a concise and easily understandable way.
Top-down discretization –
If you first consider one or a couple of points (so-called breakpoints or split points) to divide the whole
range of attribute values and repeat this method recursively on the resulting intervals, then the process
is known as top-down discretization, also known as splitting.
Bottom-up discretization –
If you first consider all the constant values as split points and then discard some of them by merging
neighborhood values into intervals, that process is called bottom-up discretization, also known as merging.
Concept Hierarchies:
It reduces the data size by collecting and then replacing the low-level concepts (such as 43 for age) with
high-level concepts (categorical variables such as middle age or Senior).
For numeric data, the following techniques can be followed:
Binning –
Binning is the process of changing numerical variables into categorical counterparts. The number of
categorical counterparts depends on the number of bins specified by the user.
Histogram analysis –
Like binning, histogram analysis partitions the values of an attribute X into disjoint ranges called
buckets. There are several partitioning rules:
Equal Frequency partitioning: The values are partitioned so that each partition contains roughly the
same number of occurrences from the data set.
Equal Width Partitioning: The value range is partitioned into intervals of a fixed width determined by
the number of bins (e.g., intervals such as 0-20, 20-40, and so on).
Clustering: Grouping similar data together.
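
The equal-frequency and equal-width partitioning rules above can be sketched with pandas as follows; the sample values and the choice of 3 bins are hypothetical.

import pandas as pd

# Hypothetical continuous attribute values
values = pd.Series([2, 5, 7, 9, 13, 15, 18, 35, 40, 58])

# Equal-width partitioning: 3 intervals of equal width over the value range
equal_width = pd.cut(values, bins=3)

# Equal-frequency partitioning: 3 intervals with (roughly) the same number of values each
equal_freq = pd.qcut(values, q=3)

print(equal_width.value_counts(), equal_freq.value_counts(), sep='\n')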

Data transformation
• Data transformation is the process of converting, cleansing, and structuring data into a usable
format that can be analyzed to support decision making processes, and to propel the growth of an
organization.
• Data transformation is used when data needs to be converted to match that of the destination
system. This can occur at two places of the data pipeline.
• The process of data transformation can be handled manually, automated or a combination of
both.
• Transformation is an essential step in many processes, such as data integration, migration,
warehousing and wrangling. The process of data transformation can be:
✓ Constructive, where data is added, copied or replicated
✓ Destructive, where records and fields are deleted
✓ Aesthetic, where certain values are standardized, or
✓ Structural, which includes columns being renamed, moved and combined

Data Transformation Process


• The entire process for transforming data is known as ETL (Extract, Transform, and Load). Through
the ETL process, analysts can convert data to its desired format. Here are the steps involved in the
data transformation process:



• Data Discovery: During the first stage, analysts work to understand and identify data in its source
format. To do this, they will use data profiling tools. This step helps analysts decide what they need
to do to get data into its desired format.
• Data Mapping: During this phase, analysts perform data mapping to determine how individual
fields are modified, mapped, filtered, joined, and aggregated. Data mapping is essential to many
data processes, and one misstep can lead to incorrect analysis and ripple through your entire
organization.
• Data Extraction: During this phase, analysts extract the data from its original source. These may
include structured sources such as databases or streaming sources such as customer log files from
web applications.
• Code Generation and Execution: Once the data has been extracted, analysts need to create code
to complete the transformation. Often, analysts generate this code with the help of data transformation
platforms or tools.
• Review: After transforming the data, analysts need to check it to ensure everything has been
formatted correctly.
• Sending: The final step involves sending the data to its target destination. The target might be a
data warehouse or a database that handles both structured and unstructured data.

Data Transformation Techniques


There are several data transformation techniques that can help structure and clean up the data before
analysis or storage in a data warehouse.

1. Data Smoothing
• Data smoothing is a process used to remove noise from the dataset using some algorithms.
It allows important features present in the dataset to be highlighted and helps in predicting patterns.
When collecting data, it can be manipulated to eliminate or reduce any variance or any other form of
noise. The idea behind data smoothing is that it can identify simple changes to help predict different
trends and patterns.

• Binning: This method splits the sorted data into the number of bins and smoothens the data values
in each bin considering the neighborhood values around it.
• Regression: This method identifies the relation between two attributes so that, if we have
one attribute, it can be used to predict the other attribute.
• Clustering: This method groups similar data values and forms clusters. The values that lie outside
a cluster are known as outliers.

2. Attribute Construction
• In the attribute construction method, new attributes are constructed from the existing attributes to
produce a data set that eases data mining. The new attributes are created and applied to assist the
mining process. This simplifies the original data and makes the mining more efficient.

3. Data Aggregation

• Data aggregation is the method of collecting, storing, and presenting data in a summary format.
The data may be obtained from multiple data sources, which are integrated into a single data
analysis description. This is a crucial step since the accuracy of data analysis insights is highly
dependent on the quantity and quality of the data used. Gathering accurate data of high quality and
in large enough quantity is necessary to produce relevant results. Aggregated data is useful for
everything from decisions concerning financing or the business strategy of the product to pricing,
operations, and marketing strategies.

4. Data Normalization

• Normalizing the data refers to scaling the data values to a much smaller range such as [-1, 1] or
[0.0, 1.0]. There are different methods to normalize the data, such as min-max normalization,
z-score normalization, and decimal scaling.

Consider that we have a numeric attribute A with n observed values v1, v2, v3, ..., vn.
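
Under that setup, a minimal sketch of the three normalization methods named above is given below; the sample values in v and the target range [0.0, 1.0] for min-max normalization are assumptions for illustration.

import numpy as np

# Hypothetical observed values v1, ..., vn of a numeric attribute A
v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max normalization to the range [0.0, 1.0]: v' = (v - min) / (max - min)
min_max = (v - v.min()) / (v.max() - v.min())

# Z-score normalization: v' = (v - mean) / standard deviation
z_score = (v - v.mean()) / v.std()

# Decimal scaling: v' = v / 10^j, with j the smallest integer such that max(|v'|) < 1
j = int(np.floor(np.log10(np.abs(v).max()))) + 1
decimal_scaled = v / (10 ** j)

print(min_max, z_score, decimal_scaled, sep='\n')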

5. Data Discretization
• This is a process of converting continuous data into a set of data intervals. Continuous attribute
values are substituted by small interval labels. This makes the data easier to study and analyze. If a
data mining task handles a continuous attribute, its values can be replaced by the discrete interval
labels, which improves the efficiency of the task. This method is also called a data reduction
mechanism as it transforms a large dataset into a set of categorical data. Discretization also uses
decision tree-based algorithms to produce short, compact, and accurate results when using discrete
values. Data discretization can be classified into two types: supervised discretization, where
the class information is used, and unsupervised discretization, which is classified based on the
direction in which the process proceeds, i.e., a 'top-down splitting strategy' or a 'bottom-up
merging strategy'.

6. Data Generalization

• It converts low-level data attributes to high-level data attributes using concept hierarchy. This
conversion from a lower level to a higher conceptual level is useful to get a clearer picture of the
data. Data generalization can be divided into two approaches:

✓ Data cube process (OLAP) approach.


✓ Attribute-oriented induction (AOI) approach.
For example, age data can be in the form of (20, 30) in a dataset. It is transformed into a higher conceptual
level into a categorical value (young, old).

Evaluation of classification methods


✓ Classification metrics are evaluation measures used to assess the performance of a classification
model. Common metrics include
➢ accuracy (proportion of correct predictions) - The proportion of correct predictions out of
the total predictions.
➢ precision (true positives over total predicted positives) - The proportion of true positive
predictions out of the total positive predictions (precision = true positives / (true positives +
false positives)).
➢ recall (true positives over total actual positives)- The proportion of true positive predictions
out of the total actual positive instances (recall = true positives / (true positives + false
negatives)).
➢ F1 score (harmonic mean of precision and recall) - The harmonic mean of precision and
recall, providing a balance between the two metrics (F1 score = 2 * ((precision * recall) /
(precision + recall))).
➢ area under the receiver operating characteristic curve (AUC-ROC).

Accuracy
➢ Accuracy simply measures how often the classifier predicts correctly. We can define accuracy as
the ratio of the number of correct predictions to the total number of predictions
(accuracy = (TP + TN) / (TP + TN + FP + FN)).

When a model gives an accuracy rate of 99%, you might think the model is performing very well, but
this is not always true: accuracy can be misleading, for example on highly imbalanced datasets.
Confusion Matrix
➢ Confusion Matrix is a performance measurement for the machine learning classification problems
where the output can be two or more classes. It is a table with combinations of predicted and actual
values.
➢ A confusion matrix is a tabular way of visualizing the performance of your prediction model. Each
entry in a confusion matrix denotes the number of predictions made by the model where it classified
the classes correctly or incorrectly.

➢ A confusion matrix is a technique for summarizing the performance of a classification algorithm.
➢ Classification accuracy alone can be misleading if you have an unequal number of observations in
each class or if you have more than two classes in your dataset.
➢ Calculating a confusion matrix can give you a better idea of what your classification model is
getting right and what types of errors it is making.
➢ The matrix displays the number of true positives (TP), true negatives (TN), false positives (FP),
and false negatives (FN) produced by the model on the test data.

➢ True Positive (TP): It refers to the number of predictions where the classifier correctly predicts the
positive class as positive.

➢ True Negative (TN): It refers to the number of predictions where the classifier correctly predicts
the negative class as negative.

➢ False Positive (FP): It refers to the number of predictions where the classifier incorrectly predicts
the negative class as positive.

➢ False Negative (FN): It refers to the number of predictions where the classifier incorrectly predicts
the positive class as negative.

Implementations of Confusion Matrix in Python


Steps:

➢ Import the necessary libraries like NumPy, confusion_matrix from sklearn.metrics, seaborn, and
matplotlib.
➢ Create the NumPy arrays for the actual and predicted labels.
➢ Compute the confusion matrix.
➢ Plot the confusion matrix with the help of the seaborn heatmap.

# Import the necessary libraries
import numpy as np
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Create the NumPy arrays for the actual and predicted labels
actual = np.array(
    ['Dog', 'Dog', 'Dog', 'Not Dog', 'Dog', 'Not Dog', 'Dog', 'Dog', 'Not Dog', 'Not Dog'])
predicted = np.array(
    ['Dog', 'Not Dog', 'Dog', 'Not Dog', 'Dog', 'Dog', 'Dog', 'Dog', 'Not Dog', 'Not Dog'])

# Compute the confusion matrix (rows correspond to actual labels, columns to predicted labels)
cm = confusion_matrix(actual, predicted)

# Plot the confusion matrix as a seaborn heatmap
sns.heatmap(cm,
            annot=True,
            fmt='g',
            xticklabels=['Dog', 'Not Dog'],
            yticklabels=['Dog', 'Not Dog'])
plt.xlabel('Prediction', fontsize=13)
plt.ylabel('Actual', fontsize=13)
plt.title('Confusion Matrix', fontsize=17)
plt.show()

Precision
➢ Precision for a label is defined as the number of true positives divided by the number of predicted
positives.
➢ It explains how many of the correctly predicted cases actually turned out to be positive. Precision
is useful in the cases where False Positive is a higher concern than False Negatives. The importance
of Precision is in music or video recommendation systems, e-commerce websites, etc. where wrong
results could lead to customer churn and this could be harmful to the business.

Recall (Sensitivity)
➢ Recall for a label is defined as the number of true positives divided by the total number of actual
positives.
➢ It explains how many of the actual positive cases we were able to predict correctly with our
model. Recall is a useful metric in cases where False Negative is of higher concern than False
Positive. It is important in medical cases where it doesn’t matter whether we raise a false alarm
but the actual positive cases should not go undetected!


F1 Score
➢ F1 Score is the harmonic mean of precision and recall.
➢ It gives a combined idea about Precision and Recall metrics. It is maximum when Precision is
equal to Recall.

➢ The F1 score punishes extreme values more. F1 Score could be an effective evaluation metric in
the following cases:
➢ When FP and FN are equally costly.
➢ Adding more data doesn’t effectively change the outcome
➢ True Negative is high
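
The metrics discussed so far can be computed with scikit-learn as in the minimal sketch below; the label vectors are hypothetical, with 1 treated as the positive class.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical actual and predicted labels (1 = positive class)
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))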

AUC-ROC
➢ The Receiver Operator Characteristic (ROC) is a probability curve that plots the TPR(True
Positive Rate) against the FPR(False Positive Rate) at various threshold values and separates the
‘signal’ from the ‘noise’.
➢ The Area Under the Curve (AUC) is the measure of the ability of a classifier to distinguish
between classes; it is the area enclosed between the ROC curve and the X-axis.

The greater the AUC, the better the performance of the model at separating the positive and
negative classes across different threshold points. When AUC is equal to 1, the classifier is able
to perfectly distinguish between all positive and negative class points. When AUC is equal to 0,
the classifier would be predicting all negatives as positives and vice versa. When AUC is 0.5,
the classifier is not able to distinguish between the positive and negative classes.

Working of AUC
➢ In a ROC curve, the X-axis shows the False Positive Rate (FPR), and the Y-axis shows the True
Positive Rate (TPR). A higher value on the X-axis indicates a higher number of false positives relative
to true negatives, while a higher Y-axis value indicates a higher number of true positives relative to
false negatives. So, the choice of the threshold depends on the ability to balance between FP and FN.

Log Loss
➢ Log loss (Logistic loss) or Cross-Entropy Loss is one of the major metrics to assess the
performance of a classification problem.

For a single sample with true label y ∈ {0,1} and a probability estimate p = Pr(y = 1), the log loss is:

    L(y, p) = -(y · log(p) + (1 - y) · log(1 - p))
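
A minimal sketch of computing this metric with scikit-learn's log_loss follows; the labels and predicted probabilities are hypothetical.

from sklearn.metrics import log_loss

# Hypothetical true labels and predicted probabilities for the positive class
y_true = [1, 0, 1, 1, 0]
y_prob = [0.9, 0.2, 0.7, 0.6, 0.3]

# Average of -(y*log(p) + (1 - y)*log(1 - p)) over all samples
print("Log loss:", log_loss(y_true, y_prob))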

T-Test
• A t-test is an inferential statistic used to determine if there is a significant difference between the
means of two groups and how they are related. T-tests are used when the data sets follow a normal
distribution and have unknown variances, like the data set recorded from flipping a coin 100 times.
• A t-test is a type of inferential statistic used to determine the significant difference between the
means of two groups, which may be related to certain features. A t-test is used as a hypothesis
testing tool, which allows testing an assumption applicable to a population.
• Degrees of freedom refers to the values in a study that can vary and are essential for assessing the
null hypothesis's importance and validity. The computation of these values usually depends upon
the number of data records available in the sample set.

Types of T-Tests
• There are three types of t-tests we can perform based on the data, such as:

1. One-Sample t-test
In a one-sample t-test, we compare the average of one group against the set average. This set
average can be any theoretical value, or it can be the population mean.

In a nutshell, the formula for the one-sample t-statistic is:

    t = (m - µ) / (s / √n)

Where,
t = t-statistic
m = mean of the group
µ = theoretical value or population mean
s = standard deviation of the group
n = group size or sample size

2. Unpaired or Independent t-test

The unpaired t-test is used to compare the means of two different groups of samples.

For example, suppose we want to compare the average height of male employees to the average
height of female employees; the two groups consist of different individuals. This is where an
unpaired or independent t-test is used.

Here's the formula to calculate the t-statistic for a two-sample t-test:

    t = (mA - mB) / √( S² (1/nA + 1/nB) )

Where,

mA and mB are the means of the two different groups,

nA and nB are the sample sizes,

S² is an estimator of the common variance of the two samples, such as the pooled variance:

    S² = ( Σ(x - mA)² + Σ(x - mB)² ) / (nA + nB - 2)

where the first sum runs over the observations of group A and the second over those of group B.

3. Paired t-test

The paired sample t-test is used when we measure one group at two different times or under two
different conditions and compare the two sets of measurements.
For example, a manager who notices that the productivity of his employees is trending significantly
downwards may conduct a training program for all employees; productivity measured before and after
the training can then be compared with a paired t-test. The formula to calculate the t-statistic for a
paired t-test is:

    t = (m - µ) / (s / √n)

Where,

t = t-statistic
m = mean of the differences between the paired observations
µ = hypothesized mean difference (usually 0)
s = standard deviation of the differences
n = number of pairs


Perform a t-test
For all of the t-tests involving means, you perform the same steps in analysis:

• Define your null (H0) and alternative (Ha) hypotheses before collecting your data.
• Decide on the alpha value (or α value). This involves determining the risk you are willing to take
of drawing the wrong conclusion.
• Check the data for errors.
• Check the assumptions for the test.
• Perform the test and draw your conclusion. All t-tests for means involve calculating a test statistic.
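
A hedged sketch of the three t-tests using SciPy is given below; the small sample arrays (two groups and before/after measurements) are hypothetical and chosen only to illustrate the calls.

import numpy as np
from scipy import stats

# Hypothetical samples
group_a = np.array([12.1, 13.4, 11.8, 12.9, 13.1, 12.5])
group_b = np.array([11.2, 12.0, 11.5, 11.9, 12.3, 11.7])
before = np.array([70, 68, 75, 72, 69, 74])
after = np.array([74, 71, 78, 75, 72, 77])

# One-sample t-test: compare the mean of one group against a theoretical value (here 12.0)
print(stats.ttest_1samp(group_a, popmean=12.0))

# Unpaired / independent t-test: compare the means of two different groups
print(stats.ttest_ind(group_a, group_b))

# Paired t-test: compare the same group measured at two different times
print(stats.ttest_rel(before, after))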

AUC ROC Curve in Machine Learning


ROC: Receiver Operating Characteristics
AUC: Area Under Curve

ROC Curve
• ROC stands for Receiver Operating Characteristics, and the ROC curve is the graphical
representation of the effectiveness of the binary classification model. It plots the true positive rate
(TPR) vs the false positive rate (FPR) at different classification thresholds.

AUC:
• AUC stands for Area Under the Curve and represents the area under the ROC curve.
• It measures the overall performance of the binary classification model. As both TPR and FPR range
between 0 and 1, the area will always lie between 0 and 1, and a greater value of AUC denotes
better model performance.
• It represents the probability with which our model is able to distinguish between the two classes
which are present in our target.



• The ROC curve is a graph that shows the performance of a classification model at all possible
thresholds ( threshold is a particular value beyond which you say a point belongs to a particular
class). The curve is plotted between two parameters

➢ TPR – True Positive Rate


➢ FPR – False Positive Rate
➢ True Positive: Actual Positive and Predicted as Positive
➢ True Negative: Actual Negative and Predicted as Negative
➢ False Positive(Type I Error): Actual Negative but predicted as Positive
➢ False Negative(Type II Error): Actual Positive but predicted as Negative
✓ Specificity measures the proportion of actual negative instances that are correctly identified by
the model as negative. It represents the ability of the model to correctly identify negative
instances.
✓ ROC is nothing but the plot between TPR and FPR across all possible thresholds and AUC is the
entire area beneath this ROC curve.

import numpy as np
from sklearn.metrics import roc_auc_score

# True binary labels and predicted scores for the positive class
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [0.95, 0.90, 0.85, 0.81, 0.78, 0.70]

auc = np.round(roc_auc_score(y_true, y_pred), 3)
print("AUC for our sample data is {}".format(auc))


Exploratory Data Analysis (EDA)


➢ Exploratory Data Analysis (EDA) refers to the method of studying and exploring record sets to
apprehend their predominant traits, discover patterns, locate outliers, and identify relationships
between variables.
➢ Exploratory Data Analysis (EDA) is one of the techniques used for extracting vital features and
trends used by machine learning and deep learning models in Data Science.
The three main types of EDA are univariate, bivariate, and multivariate EDA.

Univariate EDA
✓ It involves looking at a single variable at a time. Univariate EDA can help you understand the
distribution of the data and identify any outliers.
✓ In univariate analysis, only a single variable is examined, so no cause-and-effect relationship is
studied. In bivariate analysis, by contrast, the outcome involves two variables; for example, an
employee's age may be compared with the salary earned or the expenses per month.
✓ The analysis is done on variables that can be numerical or categorical. The result of the
analysis can be represented as numerical values, visualizations, or graphs.
The significant parameters which are estimated from a distribution point of view are as follows:
a) Univariate Non-Graphical
➢ Central Tendency:
This term refers to values located at the data's central position or middle zone. The three generally
estimated parameters of central tendency are mean, median, and mode. Mean is the average of all
values in data, while the mode is the value that occurs the maximum number of times. The Median
is the middle value with equal observations to its left and right.
➢ Range:
The range is the difference between the maximum and minimum value in the data, thus indicating
how much the data is away from the central value on the higher and lower side.
➢ Variance and Standard Deviation:
Two more useful parameters are standard deviation and variance. Variance is a measure of
dispersion that indicates the spread of all data points in a data set. It is the measure of dispersion
mostly used and is the mean squared difference between each data point and mean, while standard
deviation is the square root value of it. The larger the value of standard deviation, the farther the
spread of data, while a low value indicates more values clustering near the mean.
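
These univariate non-graphical measures can be computed directly with pandas, as in the minimal sketch below; the sample values are hypothetical.

import pandas as pd

# Hypothetical univariate sample
s = pd.Series([12, 15, 15, 18, 21, 22, 22, 22, 30, 45])

print("Mean    :", s.mean())
print("Median  :", s.median())
print("Mode    :", s.mode().tolist())
print("Range   :", s.max() - s.min())
print("Variance:", s.var())
print("Std dev :", s.std())

# describe() summarizes most of these measures at once
print(s.describe())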

b) Univariate Graphical
➢ Stem-and-leaf Plots:
This is a very simple but powerful EDA method used to display quantitative data but in a
shortened format. It displays the values in the data set, keeping each observation intact but
separating them as stem (the leading digits) and remaining or trailing digits as leaves. But
histogram is mostly used in its place now.
➢ Histograms (Bar Charts):
These plots are used to display both grouped and ungrouped data. Values of the variable are plotted
on the x-axis, while the number of observations (frequencies) appears on the y-axis. Histograms make
it possible to quickly understand your data, revealing characteristics such as central tendency,
dispersion, and outliers.

✓ Bivariate EDA involves looking at two variables at a time. Bivariate EDA can help you
understand the relationship between two variables and identify any patterns that might exist.

✓ Multivariate EDA involves looking at three or more variables at a time. Multivariate EDA can
help you understand the relationships between several variables and identify any complex
patterns or outliers that might exist.

➢ The multivariate non-graphical exploratory data analysis technique is usually used to
show the connection between two or more variables with the help of either cross-tabulation
or statistics.
➢ For categorical data, an extension of tabulation called cross-tabulation is extremely useful.
For two variables, cross-tabulation is performed by making a two-way table whose column
headings match the levels of one variable and whose row headings match the levels of the
other variable, then filling in the counts of all subjects that share an equivalent pair of levels.
➢ For one categorical variable and one quantitative variable, we can generate statistical
information for the quantitative variable separately for every level of the categorical variable.
We then compare the statistics across the levels of the categorical variable.
Multivariate Graphical
➢ Graphics are used in multivariate graphical data to show the relationships between two or
more variables.
A) Scatter Plot

The essential graphical EDA technique for two quantitative variables is the scatter plot, so
one variable appears on the x-axis and the other on the y-axis and, therefore, the point for
every case in your dataset. This can be used for bivariate analysis.

B) Multivariate Chart

A Multivariate chart is a type of control chart used to monitor two or more interrelated
process variables. This is beneficial in situations such as process control, where engineers
are likely to benefit from using multivariate charts. These charts allow monitoring
multiple parameters together in a single chart.
C) Run Chart

A run chart is a line chart of data drawn over time. In other words, a run chart visually
illustrates the process performance or data values in a time sequence.

D) Bubble Chart

Bubble charts are scatter plots that display multiple circles (bubbles) in a two-dimensional
plot. These are used to assess the relationships between three or more numeric variables.
In a bubble chart, every single dot corresponds to one data point, and the values of the
variables for each point are indicated by different attributes such as horizontal position,
vertical position, dot size, and dot color.

E) Heat Map

A heat map is a colored graphical representation of multivariate data structured as a
matrix of columns and rows. The heat map transforms the correlation matrix into color
coding and represents these coefficients to visualize the strength of correlation among
variables.
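
A small sketch of such a correlation heat map with pandas and seaborn is shown below; the three numeric columns and the random data are hypothetical.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical multivariate numeric data
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'age': rng.integers(20, 60, size=100),
    'salary': rng.normal(50000, 10000, size=100),
    'expenses': rng.normal(30000, 5000, size=100),
})

# Correlation matrix rendered as a color-coded heat map
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Heat Map')
plt.show()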
Steps involved in exploratory data analysis (EDA).

Step 1: Data Collection


The first step in EDA is to generate questions about your data. This depends on collecting the required
data from various sources such as surveys, social media, and customer reviews, to name a few.
Step 2: Finding all Variables and Understanding Them
The next step is to apply visualization techniques to the data to help you answer those questions.
When the analysis process starts, the first focus is on the available data that gives a lot of
information. This information contains changing values about various features or characteristics,
which helps to understand and get valuable insights from them.
Step 3: Cleaning the Dataset
The next step is to clean the data set, which may contain null values and irrelevant information.
These are to be removed so that data contains only those values that are relevant and important
from the target point of view.
Step 4. Identify Correlated Variables
Finding a correlation between variables helps to know how a particular variable is related to
another. The correlation matrix method gives a clear picture of how different variables correlate,
which further helps in understanding vital relationships among them.
Step 5. Choosing the Right Statistical Methods
Depending on the data, categorical or numerical, the size, type of variables, and the purpose of
analysis, different statistical tools are employed. Statistical formulae applied for numerical outputs
give fair information, but graphical visuals are more appealing and easier to interpret.

Step 6. Visualizing and Analyzing Results


Once the analysis is over, the findings are to be observed cautiously and carefully so that proper
interpretation can be made

Exploratory Data Analysis Tools


1. Python
Python is used for different tasks in EDA, such as finding missing values in data collection, data
description, handling outliers, obtaining insights through charts, etc. The syntax for EDA libraries
like Matplotlib, Pandas, Seaborn, NumPy, Altair, and more in Python is fairly simple and easy to
use for beginners. You can find many open-source packages in Python, such as D-Tale, AutoViz,
PandasProfiling, etc., that can automate the entire exploratory data analysis process and save time.

2. R
R programming language is a regularly used option to make statistical observations and analyze
data, i.e., perform detailed EDA by data scientists and statisticians. Like Python, R is also an open-
source programming language suitable for statistical computing and graphics. Apart from the
commonly used libraries like ggplot, Leaflet, and Lattice, there are several powerful R libraries for
automated EDA, such as Data Explorer, SmartEDA, GGally, etc.

3. MATLAB
MATLAB is a well-known commercial tool among engineers since it has a very strong
mathematical calculation ability. Due to this, it is possible to use MATLAB for EDA but it requires
some basic knowledge of the MATLAB programming language.

Advantages of Using EDA


Here are a few advantages of using Exploratory Data Analysis -

1. Gain Insights Into Underlying Trends and Patterns


EDA assists data analysts in identifying crucial trends quickly through data visualizations using
various graphs, such as box plots and histograms. Businesses also expect to make some unexpected
discoveries in the data while performing EDA, which can help improve certain existing business
strategies.

2. Improved Understanding of Variables


Data analysts can significantly improve their comprehension of many factors related to the dataset.
Using EDA, they can extract information such as averages, medians, minimums, and maximums; such
information is required for preprocessing the data appropriately.

3. Better Preprocess Data to Save Time


EDA can assist data analysts in identifying significant mistakes, abnormalities, or missing values
in the existing dataset. Handling the above entities is critical for any organization before beginning
a full study as it ensures correct preprocessing of data and may help save a significant amount of
time by avoiding mistakes later when applying machine learning models.

4. Make Data-driven Decisions


The most significant advantage of employing EDA in an organization is that it helps businesses to
improve their understanding of data. With EDA, they can use the available tools to extract critical
insights and make conclusions, which assist in making decisions based on the insights from the
EDA.

Data Science Process Life Cycle


➢ Data Science is the area of study which involves extracting insights from vast amounts of
data using various scientific methods, algorithms, and processes. It helps you to discover
hidden patterns from the raw data.
➢ The term Data Science has emerged because of the evolution of mathematical statistics, data
analysis, and big data.
➢ Data Science is an interdisciplinary field that allows you to extract knowledge from
structured or unstructured data. Data science enables you to translate a business problem
into a research project and then translate it back into a practical solution.


1. Discovery:
Discovery step involves acquiring data from all the identified internal & external sources, which
helps you answer the business question.
The data can be:
✓ Logs from webservers
✓ Data gathered from social media
✓ Census datasets
✓ Data streamed from online sources using APIs
2. Preparation:
Data can have many inconsistencies, like missing values, blank columns, and incorrect data formats,
which need to be cleaned. You need to process, explore, and condition data before modelling. The
cleaner your data, the better your predictions.
Data Cleaning – Most of the real-world data is not structured and requires cleaning and conversion
into structured data before it can be used for any analysis or modeling.
Exploratory Data Analysis – This is the step in which we try to find the hidden patterns in the
data at hand. Also, we try to analyze different factors which affect the target variable and the extent
to which it does so. How the independent features are related to each other and what can be done
to achieve the desired results all these answers can be extracted from this process as well. This also
gives us a direction in which we should work to get started with the modeling process.
3. Model Planning:
In this stage, you need to determine the method and technique to draw the relation between input
variables. Planning for a model is performed by using different statistical formulas and visualization
tools. SQL analysis services, R, and SAS/access are some of the tools used for this purpose.


4. Model Building:
In this step, the actual model building process starts. Here, the data scientist splits the dataset into
training and testing sets. Techniques like association, classification, and clustering are applied to the
training data set. The model, once prepared, is tested against the “testing” dataset.
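
A minimal sketch of this step with scikit-learn is shown below; the Iris dataset, the 80/20 split, and the decision tree classifier are assumptions chosen only to illustrate the train/test workflow.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load a sample dataset and split it into training and testing sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a classification model on the training set
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Test the prepared model against the "testing" dataset
y_pred = model.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, y_pred))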
5. Operationalize:
You deliver the final baselined model with reports, code, and technical documents in this stage.
Model is deployed into a real-time production environment after thorough testing.
6. Communicate Results
In this stage, the key findings are communicated to all stakeholders. This helps you decide if the
project results are a success or a failure based on the inputs from the model.

Components of Data Science Process


Data Science is a very vast field and to get the best out of the data at hand one has to apply multiple
methodologies and use different tools to make sure the integrity of the data remains intact
throughout the process keeping data privacy in mind. Machine Learning and Data analysis is the
part where we focus on the results which can be extracted from the data at hand. But Data
engineering is the part in which the main task is to ensure that the data is managed properly and
proper data pipelines are created for smooth data flow. If we try to point out the main components
of Data Science, they would be:
Data Analysis – There are times when there is no need to apply advanced deep learning or
complex methods to the data at hand to derive some patterns from it. For this reason, before moving on
to the modeling part, we first perform an exploratory data analysis to get a basic idea of the data
and the patterns available in it; this gives us a direction to work in if we want to apply some
complex analysis methods to our data.
Statistics – It is a natural phenomenon that many real-life datasets follow a normal distribution.
And when we already know that a particular dataset follows some known distribution then most of
its properties can be analyzed at once. Also, descriptive statistics and correlation and covariances
between two features of the dataset help us get a better understanding of how one factor is related
to the other in our dataset.
Data Engineering – When we deal with a large amount of data, we have to make sure that the
data is kept safe from any online threats and that it is easy to retrieve and modify. To ensure that the
data is used efficiently, Data Engineers play a crucial role.
Advanced Computing
Machine Learning – Machine learning has opened new horizons that have helped us build
different advanced applications and methodologies, so that machines become more efficient,
provide a personalized experience to each individual, and perform in an instant tasks that earlier
required heavy human labor and time.
Deep Learning – This is also a part of Artificial Intelligence and Machine Learning, but it is a bit
more advanced than machine learning itself. High computing power and huge corpora of data have
led to the emergence of this field in data science.


UNIT III BASIC MACHINE LEARNING ALGORITHMS


Association Rule mining - Linear Regression- Logistic Regression - Classifiers - k-Nearest Neighbors (k-NN), k-means -
Decision tree - Naive Bayes- Ensemble Methods – Random Forest. Feature Generation and Feature Selection - Feature
Selection algorithms - Filters; Wrappers; Decision Trees; Random Forests.
---------------------------------------------------------------------------------------------------------------------------------------------
Machine Learning
→Machine Learning is a branch of artificial intelligence that develops algorithms by learning the hidden
patterns of datasets and uses them to make predictions on new, similar data, without being explicitly
programmed for each task.
→It allows software applications to become more accurate at predicting outcomes without being explicitly
programmed to do so. Machine learning algorithms use historical data as input to predict new output values. We
are using machine learning in our daily life even without knowing it, for example in Google Maps, Google Assistant, Alexa, etc.
Machine learning Life cycle
Machine learning life cycle is a cyclic process to build an efficient machine learning project. The main
purpose of the life cycle is to find a solution to the problem or project.
Machine learning life cycle involves seven major steps, which are given below:
➢ Gathering Data
➢ Data preparation
➢ Data Wrangling
➢ Analyse Data
➢ Train the model
➢ Test the model
➢ Deployment
Types of Machine Learning
• Supervised Machine Learning
• Unsupervised Machine Learning
• Reinforcement Machine Learning
1. Supervised Machine Learning:
Supervised learning is a type of machine learning in which the algorithm is trained on the labeled
dataset. It learns to map input features to targets based on labeled training data. In supervised learning, the
algorithm is provided with input features and corresponding output labels, and it learns to generalize from this
data to make predictions on new, unseen data.
There are two main types of supervised learning:
• Regression: Regression is a type of supervised learning where the algorithm learns to predict
continuous values based on input features. The output labels in regression are continuous values,
such as stock prices, and housing prices. The different regression algorithms in machine learning
are: Linear Regression, Polynomial Regression, Decision Tree Regression, etc
• Classification: Classification is a type of supervised learning where the algorithm learns to
assign input data to a specific category or class based on input features. The output labels in
classification are discrete values. Classification algorithms can be binary, where the output is one
of two possible classes, or multiclass, where the output can be one of several classes. The different
Classification algorithms in machine learning are: Logistic Regression, Naive Bayes, Decision
Tree, Support Vector Machine (SVM), K-Nearest Neighbors (KNN), etc
2. Unsupervised Machine Learning:
Unsupervised learning is a type of machine learning where the algorithm learns to recognize patterns
in data without being explicitly trained using labeled examples. The goal of unsupervised learning is to
discover the underlying structure or distribution in the data.
There are two main types of unsupervised learning:
• Clustering: Clustering algorithms group similar data points together based on their
characteristics. The goal is to identify groups, or clusters, of data points that are similar to each
other, while being distinct from other groups. Some popular clustering algorithms include K-
means, Hierarchical clustering, and DBSCAN.


• Dimensionality reduction: Dimensionality reduction algorithms reduce the number of input variables in a dataset while preserving as much of the original information as possible. This is useful for reducing the complexity of a dataset and making it easier to visualize and analyze. Some popular dimensionality reduction algorithms include Principal Component Analysis (PCA).

3. Reinforcement Machine Learning


Reinforcement learning is a type of machine learning where an agent learns to interact with an
environment by performing actions and receiving rewards or penalties based on its actions. The goal of
reinforcement learning is to learn a policy, which is a mapping from states to actions, that maximizes the
expected cumulative reward over time.
Need for machine learning:
Machine learning is important because it allows computers to learn from data and improve their performance
on specific tasks without being explicitly programmed. This ability to learn from data and adapt to new
situations makes machine learning particularly useful for tasks that involve large amounts of data, complex
decision-making, and dynamic environments.

Various Applications of Machine Learning


• Automation: Machine learning can work almost entirely autonomously in many fields without the need for human intervention. For example, robots perform the essential process steps in manufacturing plants.
• Finance Industry: Machine learning is growing in popularity in the finance industry. Banks are
mainly using ML to find patterns inside the data but also to prevent fraud.
• Government organization: The government makes use of ML to manage public safety and
utilities. Take the example of China with its massive face recognition. The government
uses Artificial intelligence to prevent jaywalking.
• Healthcare industry: Healthcare was one of the first industries to use machine learning with
image detection.
• Marketing: AI is used broadly in marketing thanks to abundant access to data. Before the age of mass data, researchers developed advanced mathematical tools such as Bayesian analysis to estimate the value of a customer. With the boom of data, marketing departments rely on AI to optimize customer relationships and marketing campaigns.
• Retail industry: Machine learning is used in the retail industry to analyze customer behavior,
predict demand, and manage inventory. It also helps retailers to personalize the shopping
experience for each customer by recommending products based on their past purchases and
preferences.
• Transportation: Machine learning is used in the transportation industry to optimize routes,
reduce fuel consumption, and improve the overall efficiency of transportation systems. It also plays
a role in autonomous vehicles, where ML algorithms are used to make decisions about navigation
and safety.
Limitations of Machine Learning:
1. The primary challenge of machine learning is the lack of data or the lack of diversity in the dataset.
2. A machine cannot learn if there is no data available, and a dataset with little diversity gives the machine a hard time.
3. A machine needs heterogeneous data to learn meaningful insights.
4. An algorithm can rarely extract useful information when there are no or very few variations in the data.
5. It is recommended to have at least 20 observations per group to help the machine learn; too few observations lead to poor evaluation and prediction.


ASSOCIATION RULE LEARNING


Association rule learning is an unsupervised learning technique that checks for the dependency of one data item on another data item and maps these dependencies so that they can be exploited profitably. It tries to find interesting relations or associations among the variables of a dataset, using different rules to discover the interesting relationships between variables in a database. Association rule mining is suitable for non-numeric, categorical data.
Association rule learning works on the concept of an if-then statement, such as: if A then B.

Here the "if" element is called the antecedent, and the "then" element is called the consequent. Relationships in which we find an association between two single items are said to have single cardinality.
Association rule mining is a procedure that aims to observe frequently occurring patterns, correlations, or associations in datasets found in various kinds of databases such as relational databases, transactional databases, and other forms of repositories.
An association rule has 2 parts:
• an antecedent (if) and
• a consequent (then)
An antecedent is something that is found in the data, and a consequent is an item that is found in combination with the antecedent. An example rule is:
"If a customer buys bread, he is 70% likely to buy milk."
In the above association rule, bread is the antecedent and milk is the consequent.
The Association rule is very useful in analyzing datasets.
Support
Support is the frequency of A, or how frequently an itemset appears in the dataset. It is defined as the fraction of the transactions T that contain the itemset X. For an itemset X and a set of transactions T, it can be written as:
Support(X) = (Number of transactions containing X) / (Total number of transactions in T)
Confidence
Confidence indicates how often the rule has been found to be true, i.e., how often the items X and Y occur together in the dataset given that X occurs. It is the ratio of the number of transactions that contain both X and Y to the number of transactions that contain X:
Confidence(X → Y) = Support_count(X ∪ Y) / Support_count(X)

Lift
It measures the strength of a rule and is defined by the formula:
Lift(X → Y) = Support(X ∪ Y) / (Support(X) × Support(Y))
It is the ratio of the observed support to the expected support if X and Y were independent of each other.
It has three possible values:
o If Lift= 1: The probability of occurrence of antecedent and consequent is independent of each other.
o Lift>1: It determines the degree to which the two itemsets are dependent to each other.
o Lift<1: It tells us that one item is a substitute for other items, which means one item has a negative effect
on another.

Association Rule Mining is sometimes referred to as “Market Basket Analysis”, as it was the first application
area of association mining.
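As a minimal illustration in Python (assuming a small made-up list of transactions, not the dataset analysed in the Apriori example later in this unit), the three measures can be computed directly from transaction counts:

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

def support(itemset):
    # fraction of transactions that contain every item of the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y):
    # Support(X ∪ Y) / Support(X)
    return support(X | Y) / support(X)

def lift(X, Y):
    # observed support over the support expected if X and Y were independent
    return support(X | Y) / (support(X) * support(Y))

X, Y = {"bread"}, {"milk"}
print(support(X | Y))    # 0.6
print(confidence(X, Y))  # 0.75
print(lift(X, Y))        # 0.9375

Here the lift is slightly below 1, which illustrates the "substitute item" interpretation given above.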


Types of Association Rule Learning


Association rule learning can be divided into three algorithms:
Apriori Algorithm
This algorithm uses frequent datasets to generate association rules. It is designed to work on the databases that
contain transactions. This algorithm uses a breadth-first search and Hash Tree to calculate the itemset efficiently.
It is mainly used for market basket analysis and helps to understand the products that can be bought together. It
can also be used in the healthcare field to find drug reactions for patients.
Eclat Algorithm
Eclat algorithm stands for Equivalence Class Transformation. This algorithm uses a depth-first search
technique to find frequent itemsets in a transaction database. It performs faster execution than Apriori Algorithm.
F-P Growth Algorithm
The F-P growth algorithm stands for Frequent Pattern, and it is the improved version of the Apriori Algorithm.
It represents the database in the form of a tree structure that is known as a frequent pattern or tree. The purpose
of this frequent tree is to extract the most frequent patterns.

Apriori Algorithm
The Apriori algorithm is used for finding frequent itemsets in a dataset for Boolean association rules. To improve the efficiency of level-wise generation of frequent itemsets, an important property called the Apriori property is used, which reduces the search space.
Apriori Property
Every non-empty subset of a frequent itemset must also be frequent. The key concept behind the Apriori algorithm is the anti-monotonicity of the support measure.
Consider the following transaction dataset (with items I1–I5); we will find the frequent itemsets and generate association rules for it, using a minimum support count of 2 and a minimum confidence of 60%.
Step-1: K=1
(I) Create a table containing the support count of each item present in the dataset, called C1 (the candidate set).
(II) Compare the support count of each candidate item with the minimum support count (here min_support = 2); if the support count of a candidate item is less than min_support, remove that item. This gives us itemset L1.


Step-2: K=2
• Generate candidate set C2 using L1 (this is called the join step). The condition for joining Lk-1 with Lk-1 is that the itemsets should have (K-2) elements in common.
• Check whether all subsets of each itemset are frequent; if not, remove that itemset. (For example, the subsets of {I1, I2} are {I1} and {I2}, which are frequent. Check this for each itemset.)
• Now find the support count of these itemsets by searching the dataset.

(II) Compare the support count of each candidate in C2 with the minimum support count (here min_support = 2); if the support count of a candidate itemset is less than min_support, remove it. This gives us itemset L2.

Step-3:
• Generate candidate set C3 using L2 (join step). The condition for joining Lk-1 with Lk-1 is that the itemsets should have (K-2) elements in common, so here, for L2, the first element should match.
The itemsets generated by joining L2 are {I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I4, I5}, {I2, I3, I5}.
• Check whether all subsets of these itemsets are frequent; if not, remove that itemset. (Here the subsets of {I1, I2, I3} are {I1, I2}, {I2, I3}, {I1, I3}, which are frequent. For {I2, I3, I4}, the subset {I3, I4} is not frequent, so remove it. Similarly check every itemset.)
• Find the support count of the remaining itemsets by searching the dataset.


(II) Compare the support count of each candidate in C3 with the minimum support count (here min_support = 2); if the support count of a candidate itemset is less than min_support, remove it. This gives us itemset L3.

Step-4:
• Generate candidate set C4 using L3 (join step). The condition for joining Lk-1 with Lk-1 (K = 4) is that the itemsets should have (K-2) elements in common, so here, for L3, the first two elements (items) should match.
• Check whether all subsets of these itemsets are frequent. (Here the itemset formed by joining L3 is {I1, I2, I3, I5}, whose subset {I1, I3, I5} is not frequent.) So there is no itemset in C4.
• We stop here because no further frequent itemsets are found.
Confidence –
A confidence of 60% means that 60% of the customers who purchased milk and bread also bought butter.
• Confidence(A → B) = Support_count(A ∪ B) / Support_count(A)
• Taking one frequent itemset as an example, we show the rule generation below.
Itemset {I1, I2, I3} //from L3
So rules can be
[I1^I2]=>[I3] //confidence = sup(I1^I2^I3)/sup(I1^I2) = 2/4*100=50%
[I1^I3]=>[I2] //confidence = sup(I1^I2^I3)/sup(I1^I3) = 2/4*100=50%
[I2^I3]=>[I1] //confidence = sup(I1^I2^I3)/sup(I2^I3) = 2/4*100=50%
[I1]=>[I2^I3] //confidence = sup(I1^I2^I3)/sup(I1) = 2/6*100=33%
[I2]=>[I1^I3] //confidence = sup(I1^I2^I3)/sup(I2) = 2/7*100=28%
[I3]=>[I1^I2] //confidence = sup(I1^I2^I3)/sup(I3) = 2/6*100=33%
So if minimum confidence is 50%, then first 3 rules can be considered as strong association rules.
When the minimum confidence threshold is applied in this way, the rules that satisfy it are the significant (strong) association rules.
Steps for Apriori Algorithm
The Apriori algorithm has the following steps:
• Step 1: Scan the transactional database and set the minimum support and minimum confidence thresholds.
• Step 2: Take all itemsets whose support is greater than the chosen minimum support value.
• Step 3: From these frequent itemsets, find all rules whose confidence is greater than the minimum confidence threshold.
• Step 4: Arrange the resulting rules in decreasing order of strength (e.g., by lift).

REGRESSION ANALYSIS
Regression Analysis is a statistical process for estimating the relationships between the dependent
variables or criterion variables and one or more independent variables or predictors. Regression analysis is
generally used when we deal with a dataset that has the target variable in the form of continuous data.
Regression analysis explains the changes in criteria in relation to changes in select predictors. The conditional
expectation of the criteria is based on predictors where the average value of the dependent variables is given
when the independent variables are changed. Three major uses for regression analysis are determining the
strength of predictors, forecasting an effect, and trend forecasting.
Why do we use Regression Analysis?
→ To analyze the effect of different independent features on the target (dependent) feature.
→ It helps us make decisions that can move the target variable in the desired direction.
→ Regression analysis is heavily based on statistics and hence gives quite reliable results.


→ With the development of the machine learning domain, regression analysis techniques have gained popularity and have evolved far beyond the simple form y = mx + c.
Linear Regression
Regression: It predicts continuous output variables based on independent input variables, for example predicting house prices from parameters such as house age, distance from the main road, location, and area. Linear regression is a type of supervised machine learning algorithm that computes the linear relationship between a dependent variable and one or more independent features. When there is only one independent feature, it is known as univariate (simple) linear regression.
Linear regression is used for predictive analysis. Linear regression is a linear approach for modeling the
relationship between the criterion or the scalar response and the multiple predictors or explanatory variables.
Linear regression focuses on the conditional probability distribution of the response given the values of the
predictors. For linear regression, there is a danger of overfitting. The formula for linear regression is:
Syntax:
y = θx + b
where,
• θ – It is the model weights or parameters
• b – It is known as the bias.
This is the most basic form of regression analysis and is used to model a linear relationship between a single
dependent variable and one or more independent variables.
In regression, a set of records is present with X and Y values, and these values are used to learn a function; if you then want to predict Y from an unknown X, this learned function can be used. In regression we have to find the value of Y, so we require a function that predicts a continuous Y given X as the independent feature.
Here Y is called the dependent or target variable and X is called the independent variable, also known as the predictor of Y.

Linear regression performs the task of predicting a dependent variable value (y) based on a given independent variable (x); hence the name linear regression. For example, X (input) could be the work experience and Y (output) the salary of a person.
Types of Linear Regression
Linear regression can be further divided into two types of the algorithm:
o Simple Linear Regression:
If a single independent variable is used to predict the value of a numerical dependent variable, then
such a Linear Regression algorithm is called Simple Linear Regression.
o Multiple Linear regression:
If more than one independent variable is used to predict the value of a numerical dependent variable,
then such a Linear Regression algorithm is called Multiple Linear Regression.
Linear Regression Line
A linear line showing the relationship between the dependent and independent variables is called a regression
line. A regression line can show two types of relationship:
o Positive Linear Relationship:
If the dependent variable increases on the Y-axis and independent variable increases on X-axis, then
such a relationship is termed as a Positive linear relationship.

o Negative Linear Relationship:


If the dependent variable decreases on the Y-axis and independent variable increases on the X-axis,
then such a relationship is called a negative linear relationship.

Cost function-
o Different values for the weights or coefficients of the line (a0, a1) give different regression lines, and the cost function is used to estimate the values of the coefficients for the best-fit line.
o The cost function optimizes the regression coefficients or weights. It measures how well a linear regression model is performing.
o We can use the cost function to find the accuracy of the mapping function, which maps the input variable to the output variable. This mapping function is also known as the hypothesis function.
For linear regression, we use the Mean Squared Error (MSE) cost function, which is the average of the squared errors between the predicted values and the actual values. For the above linear equation, MSE can be calculated as:
MSE = (1/N) Σ (Yi − (a1xi + a0))²
Where,
N = total number of observations
Yi = actual value
(a1xi + a0) = predicted value
Residuals: The distance between the actual value and the predicted value is called the residual. If the observed points are far from the regression line, the residuals will be large and so the cost function will be high. If the scatter points are close to the regression line, the residuals will be small and hence the cost function will be low.
Gradient Descent:
o Gradient descent is used to minimize the MSE by calculating the gradient of the cost function.
o A regression model uses gradient descent to update the coefficients of the line by reducing the cost
function.
o This is done by randomly selecting initial coefficient values and then iteratively updating them to reach
the minimum cost function.
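The following minimal NumPy sketch (with synthetic data assumed purely for illustration) shows gradient descent updating the coefficients a1 and a0 of the line y = a1x + a0 so that the MSE cost decreases:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 3.0 * x + 2.0 + rng.normal(0, 1, 50)   # data generated around the true line y = 3x + 2

a1, a0 = 0.0, 0.0      # initial coefficients
lr = 0.01              # learning rate
for _ in range(2000):
    y_pred = a1 * x + a0
    error = y_pred - y
    # gradients of MSE = (1/N) * sum((y_pred - y)^2) with respect to a1 and a0
    grad_a1 = 2 * np.mean(error * x)
    grad_a0 = 2 * np.mean(error)
    a1 -= lr * grad_a1
    a0 -= lr * grad_a0

print(a1, a0)          # close to the true values 3 and 2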


Model Performance:
The Goodness of fit determines how the line of regression fits the set of observations. The process of finding
the best model out of various models is called optimization. It can be achieved by below method:
1. R-squared method:
o R-squared is a statistical method that determines the goodness of fit.
o It measures the strength of the relationship between the dependent and independent variables on a scale
of 0-100%.
o A high value of R-squared indicates a small difference between the predicted values and the actual values and hence represents a good model.
o It is also called a coefficient of determination, or coefficient of multiple determination for multiple
regression.
o It can be calculated from the formula below:
R-squared = Explained variation / Total variation = 1 − (Σ(Yi − Ŷi)² / Σ(Yi − Ȳ)²)
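A minimal sketch (with assumed example numbers, purely illustrative) of computing R-squared from actual and predicted values:

import numpy as np

y_actual = np.array([3.0, 5.0, 7.0, 9.0])
y_pred   = np.array([2.8, 5.1, 7.2, 8.9])

ss_res = np.sum((y_actual - y_pred) ** 2)           # residual (unexplained) sum of squares
ss_tot = np.sum((y_actual - y_actual.mean()) ** 2)  # total sum of squares
r_squared = 1 - ss_res / ss_tot
print(r_squared)    # close to 1, indicating a good fit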

Logistic Regression
Logistic regression is a supervised machine learning algorithm mainly used for classification tasks, where the goal is to predict the probability that an instance belongs to a given class. Although it is used for classification, it is called logistic regression because it takes the output of a linear regression function as input and passes it through the logistic (sigmoid) function to estimate a probability.
It is used for predicting the categorical dependent variable using a given set of independent variables.
• Logistic regression predicts the output of a categorical dependent variable. Therefore the outcome
must be a categorical or discrete value.
• Logistic Regression can be used to classify the observations using different types of data and can
easily determine the most effective variables used for the classification.
Logistic Function (Sigmoid Function):
• The sigmoid function is a mathematical function used to map the predicted values to probabilities.
• It maps any real value into another value within a range of 0 and 1.
• The value of the logistic regression must be between 0 and 1, which cannot go beyond this limit, so it forms a curve like the "S" form.
• The S-form curve is called the sigmoid function or the logistic function.
• In logistic regression, we use the concept of the threshold value, which defines the probability of either 0 or 1. Values above the threshold tend to 1, and values below the threshold tend to 0.
Type of Logistic Regression:
On the basis of the categories, Logistic Regression can be classified into three types:
1. Binomial: In binomial Logistic regression, there can be only two possible types of the dependent
variables, such as 0 or 1, Pass or Fail, etc.
2. Multinomial: In multinomial Logistic regression, there can be 3 or more possible unordered types
of the dependent variable, such as “cat”, “dogs”, or “sheep”
3. Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of dependent
variables, such as “low”, “Medium”, or “High”.
Sr.No | Linear Regression | Logistic Regression
1 | Used to predict a continuous dependent variable from a given set of independent variables. | Used to predict a categorical dependent variable from a given set of independent variables.
2 | Used for solving regression problems. | Used for solving classification problems.
3 | We predict the value of continuous variables. | We predict the values of categorical variables.
4 | We find the best-fit line. | We find the S-curve.
5 | Least squares estimation is used for estimation of accuracy. | Maximum likelihood estimation is used for estimation of accuracy.
6 | The output must be a continuous value, such as price, age, etc. | The output must be a categorical value, such as 0 or 1, Yes or No, etc.
7 | Requires a linear relationship between dependent and independent variables. | Does not require a linear relationship.
8 | There may be collinearity between the independent variables. | There should not be collinearity between the independent variables.

Logistic Regression Equation:
The logistic regression equation can be obtained from the linear regression equation. The mathematical steps to get the logistic regression equation are given below:
o We know the equation of the straight line can be written as:
y = b0 + b1x1 + b2x2 + … + bnxn
o In logistic regression, y can only be between 0 and 1, so we divide the above equation by (1 − y):
y / (1 − y)   (which is 0 for y = 0 and infinity for y = 1)
o But we need a range between −infinity and +infinity, so we take the logarithm of the equation, and it becomes:
log[ y / (1 − y) ] = b0 + b1x1 + b2x2 + … + bnxn
The above equation is the final equation for Logistic Regression.
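A minimal scikit-learn sketch (with a small assumed dataset, e.g. hours studied versus pass/fail) of fitting a logistic regression model and reading off the sigmoid probability and the thresholded class:

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])   # single feature, e.g. hours studied
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])                   # categorical outcome: fail (0) / pass (1)

model = LogisticRegression()
model.fit(X, y)

print(model.predict_proba([[4.5]]))   # [P(class 0), P(class 1)] from the sigmoid
print(model.predict([[4.5]]))         # class label using the default 0.5 threshold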


CLASSIFIER
• A classifier in machine learning is an algorithm that automatically orders or categorizes
data into one or more of a set of “classes.”
• One of the most common examples is an email classifier that scans emails to filter them
by class label: Spam or Not Spam.
• A classifier is the algorithm itself – the rules used by machines to classify data.
• A classification model, on the other hand, is the end result of your classifier’s machine
learning.
• The model is trained using the classifier, so that the model, ultimately, classifies your
data.
Types of Classification Algorithms
✓ Decision Tree
✓ Naive Bayes Classifier
✓ K-Nearest Neighbors
✓ Support Vector Machines
✓ Artificial Neural Networks

Decision Tree
➢ Decision Tree is a Supervised learning technique that can be used for both classification
and Regression problems used to build models like the structure of a tree.
➢ It is a tree-structured classifier, where internal nodes represent the features of a dataset,
branches represent the decision rules and each leaf node represents the outcome.
➢ It is a graphical representation for getting all the possible solutions to a problem/decision
based on given conditions.
➢ A decision tree is a tree where each
✓ Node - a feature(attribute)
✓ Branch - a decision(rule)
✓ Leaf - an outcome(categorical or continuous)
➢ It classifies data into finer and finer categories: from “tree trunk,” to “branches,” to
“leaves.” It uses the if-then rule of mathematics to create sub-categories that fit into
broader categories and allows for precise, organic categorization.
For example, this is how a decision tree would categorize individual sports:


➢ In a decision tree, for predicting the class of the given dataset, the algorithm starts from
the root node of the tree. This algorithm compares the values of root attribute with the
record (real dataset) attribute and, based on the comparison, follows the branch and jumps
to the next node.
➢ For the next node, the algorithm again compares the attribute value with the other sub-
nodes and move further.
➢ It continues the process until it reaches the leaf node of the tree. The complete process
can be better understood using the below algorithm:
Step-1: Begin the tree with the root node, says S, which contains the complete dataset.
Step-2: Find the best attribute in the dataset using Attribute Selection Measure (ASM).
Step-3: Divide the S into subsets that contains possible values for the best attributes.
Step-4: Generate the decision tree node, which contains the best attribute.
Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3.
Continue this process until a stage is reached where the nodes cannot be classified further; the final nodes are then called leaf nodes.

Attribute Selection Measures


While implementing a decision tree, the main issue that arises is how to select the best attribute for the root node and for the sub-nodes. To solve this problem there is a technique called the Attribute Selection Measure (ASM). Using this measurement, we can easily select the best attribute for the nodes of the tree. There are two popular techniques for ASM, which are:
✓ Information Gain
✓ Gini Index
1. Information Gain:
✓ Information gain is the measurement of changes in entropy after the segmentation of a
dataset based on an attribute.
✓ It calculates how much information a feature provides us about a class.
✓ According to the value of information gain, we split the node and build the decision tree.
✓ A decision tree algorithm always tries to maximize the value of information gain, and a
node/attribute having the highest information gain is split first. It can be calculated using
the below formula:
Information Gain = Entropy(S) − [(Weighted Avg) × Entropy(each feature)]
Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies
randomness in data. Entropy can be calculated as:
Entropy(s)= -P(yes)log2 P(yes)- P(no) log2 P(no)
Where,
S= Total number of samples
P(yes)= probability of yes
P(no)= probability of no
2. Gini Index:
✓ Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
✓ An attribute with the low Gini index should be preferred as compared to the high Gini
index.


✓ It only creates binary splits, and the CART (Classification and Regression Tree) algorithm uses the Gini index to create binary splits. The Gini index can be calculated using the formula below:
Gini Index = 1 − Σj (Pj)²
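A minimal Python sketch (assumed helper functions, not from any library) of the entropy and Gini index formulas above, applied to a node with 9 "yes" and 5 "no" examples:

from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(entropy([9, 5]))   # ~0.940
print(gini([9, 5]))      # ~0.459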

ID3 algorithm

ID3 algorithm, stands for Iterative Dichotomiser 3, is a classification algorithm that follows
a greedy approach of building a decision tree by selecting a best attribute that
yields maximum Information Gain (IG) or minimum Entropy (H).
Entropy is a measure of the amount of uncertainty in the dataset S. Mathematical
Representation of Entropy is shown here
H(S) = Σ (c ∈ C) [ −p(c) log2 p(c) ]
Where,
• S - The current dataset for which entropy is being calculated (changes every iteration of
the ID3 algorithm).
• C - Set of classes in S {example - C ={yes, no}}
• p(c) - The proportion of the number of elements in class c to the number of elements in
set S.
In ID3, entropy is calculated for each remaining attribute. The attribute with the smallest entropy
is used to split the set S on that particular iteration.
Entropy = 0 implies it is of pure class, that means all are of same category.
Information Gain IG(A) tells us how much uncertainty in S was reduced after splitting set S on
attribute A. Mathematical representation of Information gain is shown here
IG(A, S) = H(S) − Σ (t ∈ T) [ p(t) H(t) ]
Where H(S) – Entropy of set S.
T- subsets created from splitting set S by attribute A such that
S = ∪ (t ∈ T) t
P(t) -proportion of the number of elements in t to the number of elements in set S.
H(t) -Entropy of subset t.
The steps in ID3 algorithm are as follows:
Step 1: Data Preprocessing:
Clean and preprocess the data. Handle missing values and convert categorical variables
into numerical representations if needed.
Step 2: Selecting the Root Node:
Calculate the entropy of the target variable (class labels) based on the dataset. The
formula for entropy is:
Entropy(S) = -Σ (p_i * log2(p_i))
where p_i is the proportion of instances belonging to class i.
Step 3: Calculating Information Gain:
For each attribute in the dataset, calculate the information gain when the dataset is split on that
attribute. The formula for information gain is:

Information Gain(S, A) = Entropy(S) - Σ ((|S_v| / |S|) * Entropy(S_v))


where S_v is the subset of instances for each possible value of attribute A, and |S_v| is the
number of instances in that subset.
Step 4: Selecting the Best Attribute:
Choose the attribute with the highest information gain as the decision node for the tree.
Step 5: Splitting the Dataset:
Split the dataset based on the values of the selected attribute.
Step 6: Repeat the Process:
Recursively repeat steps 2 to 5 for each subset until a stopping criterion is met (e.g., the tree
depth reaches a maximum limit or all instances in a subset belong to the same class).
Solved Example:
Consider the following dataset:
Weather Temperature Humidity Windy Play Tennis?
Sunny Hot High False No
Sunny Hot High True No
Overcast Hot High False Yes
Rainy Mild High False Yes
Rainy Cool Normal False Yes
Rainy Cool Normal True No
Overcast Cool Normal True Yes
Sunny Mild High False No
Sunny Cool Normal False Yes
Rainy Mild Normal False Yes
Sunny Mild Normal True Yes
Overcast Mild High True Yes
Overcast Hot Normal False Yes
Rainy Mild High True No

Step 1: Data Preprocessing:


The dataset does not require any preprocessing, as it is already in a suitable format.
Step 2: Calculating Entropy:
To calculate entropy, we first determine the proportion of positive and negative instances in
the dataset:
• Positive instances (Play Tennis = Yes): 9
• Negative instances (Play Tennis = No): 5
Entropy(S) = -(9/14) * log2(9/14) – (5/14) * log2(5/14) ≈ 0.940
Step 3: Calculating Information Gain:
We calculate the information gain for each attribute (Weather, Temperature, Humidity,
Windy) and choose the attribute with the highest information gain as the root node.
Information Gain(S, Weather) = Entropy(S) – [(5/14) * Entropy(Sunny) + (4/14) *
Entropy(Overcast) + (5/14) * Entropy(Rainy)] ≈ 0.246
Information Gain(S, Temperature) = Entropy(S) – [(4/14) * Entropy(Hot) + (4/14) *
Entropy(Mild) + (6/14) * Entropy(Cool)] ≈ 0.029

Information Gain(S, Humidity) = Entropy(S) – [(7/14) * Entropy(High) + (7/14) * Entropy(Normal)] ≈ 0.152
Information Gain(S, Windy) = Entropy(S) – [(8/14) * Entropy(False) + (6/14) * Entropy(True)] ≈ 0.048
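These figures can be checked with a minimal Python sketch (assumed helper functions) that applies the entropy and information gain formulas to the class counts of each attribute value:

from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, child_counts_list):
    total = sum(parent_counts)
    weighted = sum(sum(child) / total * entropy(child) for child in child_counts_list)
    return entropy(parent_counts) - weighted

# Parent node: 9 Yes / 5 No
# Weather: Sunny (2 Yes, 3 No), Overcast (4, 0), Rainy (3, 2)
print(information_gain([9, 5], [[2, 3], [4, 0], [3, 2]]))   # ~0.246
# Windy: False (6 Yes, 2 No), True (3 Yes, 3 No)
print(information_gain([9, 5], [[6, 2], [3, 3]]))           # ~0.048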
Step 4: Selecting the Best Attribute:
The “Weather” attribute has the highest information gain, so we select it as the root node for
our decision tree.
Step 5: Splitting the Dataset:
We split the dataset based on the values of the “Weather” attribute into three subsets (Sunny,
Overcast, Rainy).
Step 6: Repeat the Process:
Since every instance in the Overcast subset belongs to the same class, that branch becomes a leaf labelled Yes; the remaining subsets are split again on the attribute with the highest information gain until the stopping criterion is met. The decision tree will look like below:

Decision Tree using CART algorithm Solved Example 1


To construct and find the optimal decision tree for the given Play Tennis Data.
Also, predict the class label for the given example…?
Outlook Temp Humidity Windy Play

Sunny Hot High False No

Sunny Hot High True No

Overcast Hot High False Yes

Rainy Mild High False Yes

Rainy Cool Normal False Yes

Rainy Cool Normal True No

Overcast Cool Normal True Yes


Sunny Mild High False No

Sunny Cool Normal False Yes

Rainy Mild Normal False Yes

Sunny Mild Normal True Yes

Overcast Mild High True Yes

Overcast Hot Normal False Yes

Rainy Mild High True No

Outlook Temp Humidity Windy Play

Sunny Hot Normal True ?


Solution:
Start with any attribute ,
• Choose outlook. It can take three values: sunny, overcast, and rainy.
• Start with the sunny value of outlook.
→There are five instances where the outlook is sunny.
→In two of the five instances, the play decision was yes, and in the other three, the decision
was no.
Thus, if the decision rule was that outlook: sunny → no, then three out of five decisions
would be correct, while two out of five such decisions would be incorrect. There are two errors
out of five.
Similarly, we will write all rules for the Outlook attribute.
Outlook value counts:
Overcast (4): Yes = 4, No = 0
Sunny (5):    Yes = 2, No = 3
Rainy (5):    Yes = 3, No = 2
Rules, individual error, and total error for the Outlook attribute:
Attribute  Rule             Error   Total Error
Outlook    Sunny → No       2/5     4/14
           Overcast → Yes   0/4
           Rainy → Yes      2/5


Temp value counts:
Hot (4):  Yes = 2, No = 2
Mild (6): Yes = 4, No = 2
Cool (4): Yes = 3, No = 1
Rules, individual error, and total error for the Temp attribute:
Attribute  Rule          Error   Total Error
Temp       Hot → No      2/4     5/14
           Mild → Yes    2/6
           Cool → Yes    1/4


Humidity value counts:
High (7):   Yes = 3, No = 4
Normal (7): Yes = 6, No = 1
Rules, individual error, and total error for the Humidity attribute:
Attribute  Rule            Error   Total Error
Humidity   High → No       3/7     4/14
           Normal → Yes    1/7


Windy value counts:
False (8): Yes = 6, No = 2
True (6):  Yes = 3, No = 3
Rules, individual error, and total error for the Windy attribute:
Attribute  Rule           Error   Total Error
Windy      True → No      3/6     5/14
           False → Yes    2/8


Consolidated rules, errors for individual attribute values, and the total error of each attribute are given below.
Attribute  Rule             Error   Total Error
Outlook    Sunny → No       2/5     4/14
           Overcast → Yes   0/4
           Rainy → Yes      2/5
Temp       Hot → No         2/4     5/14
           Mild → Yes       2/6
           Cool → Yes       1/4
Humidity   High → No        3/7     4/14
           Normal → Yes     1/7
Windy      False → Yes      2/8     5/14
           True → No        3/6
From the above table, we can notice that the attributes Outlook and Humidity have the same
minimum error that is 4/14.
Hence we consider the individual attribute value errors.
The outlook attribute has one rule which generates zero error that is the rule Overcast → Yes.
Hence we consider the Outlook as the splitting attribute.
Now we build the tree with Outlook as the root node. It has three branches for each possible
value of the outlook attribute. As the rule, Overcast → Yes generates zero error. When the
outlook attribute value is overcast we get the result as Yes. For the remaining two attribute
values we consider the subset of data and continue building the tree. Tree with Outlook as root
node is,


Now, for the left and right subtrees, we write all possible rules and find the total error. Based
on the total error table, we will construct the tree.
Left subtree,

Consolidated rules, errors for individual attributes values, and total error of the attribute are
given below.

From the above table, we can notice that Humidity has the lowest error. Hence Humidity is
considered as the splitting attribute. Also, when Humidity is High the answer is No as it
produces zero errors. Similarly, when Humidity is Normal the answer is Yes, as it produces
zero errors.


Right subtree,

Consolidated rules, errors for individual attributes values, and total error of the attribute are
given below.

From the above table, we can notice that Windy has the lowest error. Hence Windy is
considered as the splitting attribute. Also, when Windy is False the answer is Yes as it
produces zero errors. Similarly, when Windy is True the answer is No, as it produces zero
errors.
The final decision tree for the given Play Tennis data set has Outlook at the root, with the Overcast branch labelled Yes, the Sunny branch split on Humidity (High → No, Normal → Yes), and the Rainy branch split on Windy (False → Yes, True → No).
From this decision tree, the prediction for the new example (Outlook = Sunny, Temp = Hot, Humidity = Normal, Windy = True) is Yes.


Advantages of the Decision Tree


➢ It is simple to understand as it follows the same process which a human follow while
making any decision in real-life.
➢ It can be very useful for solving decision-related problems.
➢ It helps to think about all the possible outcomes for a problem.
➢ There is less requirement of data cleaning compared to other algorithms.
Disadvantages of the Decision Tree
➢ The decision tree contains lots of layers, which makes it complex.
➢ It may have an overfitting issue, which can be resolved using the Random Forest
algorithm.
➢ For more class labels, the computational complexity of the decision tree may increase.

Naïve Bayes algorithm


➢ Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes
theorem and used for solving classification problems.
➢ It is mainly used in text classification that includes a high-dimensional training dataset.
➢ Naïve Bayes Classifier is one of the simple and most effective Classification algorithms
which helps in building the fast machine learning models that can make quick predictions.
➢ It is a probabilistic classifier, which means it predicts on the basis of the probability of an object. Some popular applications of the Naïve Bayes algorithm are spam filtering, sentiment analysis, and classifying articles.

Conditional Probability
• Conditional probability is a fundamental concept in probability theory that measures the
likelihood of an event occurring given that another event has already occurred.
• It helps us understand how the probability of one event is influenced by the presence or
knowledge of another event.
• Conditional probability is denoted as P(A | B), which reads as "the probability of event
A given event B."
Conditional Probability is defined as the probability of any event occurring when another
event has already occurred.
✓ it calculates the probability of one event happening given that a certain condition is
satisfied.
✓ P(A | B): This notation represents the conditional probability of event A occurring given
that event B has already occurred.
Conditional probability is calculated using the formula:
P(A|B) = P(A ∩ B) / P(B)
Where,
→P(A ∩ B) represents the probability of both events A and B occurring simultaneously, and
→P(B) represents the probability of event B occurring.
To calculate the conditional probability, we can use the following step-by-step method:
Step 1: Identify the Events. Let’s call them Event A and Event B.
Step 2: Determine the Probability of Event A i.e., P(A)
Step 3: Determine the Probability of Event B i.e., P(B)
Step 4: Determine the Probability of Event A and B i.e., P(A∩B).

Step 5: Apply the Conditional Probability Formula and calculate the required probability.
Bayes' theorem can be derived using the product rule and the conditional probability of event X given a known event Y:

P(X | Y) = P(Y | X) · P(X) / P(Y)

This relates the probability of X given Y to the probability of Y given X, the prior probability of X, and the probability of Y.
The above equation is called Bayes' Rule or Bayes' Theorem.
✓ P(X|Y) is called as posterior, which we need to calculate. It is defined as updated
probability after considering the evidence.
✓ P(Y|X) is called the likelihood. It is the probability of evidence when hypothesis is true.
✓ P(X) is called the prior probability, probability of hypothesis before considering the
evidence
✓ P(Y) is called marginal probability. It is defined as the probability of evidence under
any consideration.
Hence, Bayes Theorem can be written as:
posterior = likelihood * prior / evidence
Advantages of Naïve Bayes Classifier in Machine Learning:
✓ It is one of the simplest and effective methods for calculating the conditional probability
and text classification problems.
✓ A Naïve Bayes classifier performs better than many other models when the assumption of independent predictors holds true.
✓ It is easier to implement than many other models.
✓ It requires only a small amount of training data to estimate the parameters, which minimizes the training time.
✓ It can be used for binary as well as multi-class classification.
Disadvantages of Naïve Bayes Classifier in Machine Learning:
✓ It relies on the assumption of independent predictors: it implicitly assumes that all attributes are independent or unrelated, but in real life it is rarely feasible to obtain mutually independent attributes.
Example: Predictively Classifying Customers of a Bookstore
We have the following dataset from a bookstore:
Age Income Student Credit_Rating Buys_Book

Youth High No Fair No

Youth High No Excellent No

Middle_aged High No Fair Yes

Senior Medium No Fair Yes


Senior Low Yes Fair Yes

Senior Low Yes Excellent No

Middle_aged Low Yes Excellent Yes

Youth Medium No Fair No

Youth Low Yes Fair Yes

Senior Medium Yes Fair Yes

Youth Medium Yes Excellent Yes

Middle_aged Medium No Excellent Yes

Middle_aged High Yes Fair Yes

Senior Medium No Excellent No


We have attributes like age, income, student, and credit rating. Our class, buys_book, has
two outcomes: Yes or No.

Our goal is to classify based on the following attributes:


X = {age = youth, student = yes, income = medium, credit_rating = fair}.
As we showed earlier, to maximize P(Ci | X), we need to maximize [ P(X | Ci) * P(Ci) ] for
i = 1 and i = 2.
Hence, P(buys_book = yes) = 9/14 = 0.643
P(buys_book = no) = 5/14 = 0.357
P(age = youth | buys_book = yes) = 2/9 = 0.222
P(age = youth | buys_book = no) =3/5 = 0.600
P(income = medium | buys_book = yes) = 4/9 = 0.444
P(income = medium | buys_book = no) = 2/5 = 0.400
P(student = yes | buys_book = yes) = 6/9 = 0.667
P(student = yes | buys_book = no) = 1/5 = 0.200
P(credit_rating = fair | buys_book = yes) = 6/9 = 0.667
P(credit_rating = fair | buys_book = no) = 2/5 = 0.400
Using the above-calculated probabilities, we have
P(X | buys_book = yes) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
Similarly,
P(X | buys_book = no) = 0.600 x 0.400 x 0.200 x 0.400 = 0.019
Which class does Ci provide the maximum P(X|Ci)*P(Ci)? We compute:
P(X | buys_book = yes)* P(buys_book = yes) = 0.044 x 0.643 = 0.028
P(X | buys_book = no)* P(buys_book = no) = 0.019 x 0.357 = 0.007

Comparing the above two, since 0.028 > 0.007,



the Naive Bayes Classifier predicts that the customer with the above-mentioned
attributes will buy a book.
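The same calculation can be reproduced with a minimal Python sketch (probabilities hard-coded from the counts above) that multiplies the prior by the conditional probabilities for each class and picks the larger score:

priors = {"yes": 9 / 14, "no": 5 / 14}

likelihoods = {
    "yes": {"age=youth": 2 / 9, "income=medium": 4 / 9,
            "student=yes": 6 / 9, "credit=fair": 6 / 9},
    "no":  {"age=youth": 3 / 5, "income=medium": 2 / 5,
            "student=yes": 1 / 5, "credit=fair": 2 / 5},
}

scores = {}
for cls in priors:
    score = priors[cls]                  # P(Ci)
    for p in likelihoods[cls].values():  # multiply by P(x_k | Ci) for each attribute
        score *= p
    scores[cls] = score

print(scores)                        # {'yes': ~0.028, 'no': ~0.007}
print(max(scores, key=scores.get))   # 'yes' -> the customer is predicted to buy the book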

K-Means Clustering Algorithm


• K-Means Clustering is an unsupervised learning algorithm that is used to solve the
clustering problems in machine learning or data science.
• It groups the unlabeled dataset into different clusters. Here K defines the number of pre-
defined clusters that need to be created in the process, as if K=2, there will be two clusters,
and for K=3, there will be three clusters, and so on.
• It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each data point belongs to only one group of points with similar properties.
• It allows us to cluster the data into different groups and a convenient way to discover the
categories of groups in the unlabeled dataset on its own without the need for any training.
• It is a centroid-based algorithm, where each cluster is associated with a centroid. The main
aim of this algorithm is to minimize the sum of distances between the data point and their
corresponding clusters.
• The algorithm takes the unlabeled dataset as input, divides it into k clusters, and repeats the process until the cluster assignments no longer improve. The value of k must be predetermined in this algorithm.
The k-means clustering algorithm mainly performs two tasks:
✓ Determines the best value for K center points or centroids by an iterative process.
✓ Assigns each data point to its closest k-center. Those data points which are near to the
particular k-center, create a cluster.

The working of the K-Means algorithm is explained in the below steps:


Step-1: Select the number K to decide the number of clusters.
Step-2: Select random K points or centroids. (It can be other from the input dataset).
Step-3: Assign each data point to their closest centroid, which will form the predefined K
clusters.
Step-4: Calculate the variance and place a new centroid of each cluster.
Step-5: Repeat the third step, i.e., reassign each data point to the new closest centroid of its cluster.
Step-6: If any reassignment occurred, go to Step-4; otherwise go to FINISH.
Step-7: The model is ready.
To choose the value of K number of clusters
Elbow Method


✓ The Elbow method is one of the most popular ways to find the optimal number of clusters.
This method uses the concept of WCSS value. WCSS stands for Within Cluster Sum of
Squares, which defines the total variations within a cluster. The formula to calculate the
value of WCSS (for 3 clusters) is given below:
WCSS = Σ(Pi in Cluster1) distance(Pi, C1)² + Σ(Pi in Cluster2) distance(Pi, C2)² + Σ(Pi in Cluster3) distance(Pi, C3)²
In the above formula of WCSS,
Σ(Pi in Cluster1) distance(Pi, C1)² is the sum of the squared distances between each data point and its centroid within cluster 1, and similarly for the other two terms.
The steps to be followed for the implementation are given below:
✓ Data Pre-processing
✓ Finding the optimal number of clusters using the elbow method
✓ Training the K-means algorithm on the training dataset
✓ Visualizing the clusters
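A minimal scikit-learn sketch (with synthetic blob data assumed) of these steps; the inertia_ attribute reported by KMeans is exactly the WCSS value used by the elbow method:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X)
    wcss.append(km.inertia_)          # within-cluster sum of squares for this k

plt.plot(range(1, 11), wcss, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("WCSS")
plt.show()                            # the "elbow" of this curve suggests the best k (here k = 3)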
K-Nearest Neighbor(KNN) Algorithm for Machine Learning
• K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on
Supervised Learning technique.
• The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can be easily classified into a well-suited category using the K-NN algorithm.
• K-NN algorithm can be used for Regression as well as for Classification but mostly it is
used for the Classification problems.
• K-NN is a non-parametric algorithm, which means it does not make any assumption on
underlying data.
• It is also called a lazy learner algorithm because it does not learn from the training set
immediately instead it stores the dataset and at the time of classification, it performs an
action on the dataset.
• Example: Suppose, we have an image of a creature that looks similar to cat and dog, but
we want to know either it is a cat or dog.


How does K-NN work?


The K-NN working can be explained on the basis of the below algorithm:
Step-1: Select the number K of the neighbors
Step-2: Calculate the Euclidean distance of K number of neighbors
Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
Step-4: Among these k neighbors, count the number of the data points in each category.
Step-5: Assign the new data points to that category for which the number of the neighbor is
maximum.
Step-6: Our model is ready.

• Firstly, we will choose the number of neighbors, so we will choose k = 5.
• Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is the distance between two points, which we have already studied in geometry. For two points (x1, y1) and (x2, y2) it can be calculated as:
d = √((x2 − x1)² + (y2 − y1)²)



• By calculating the Euclidean distance we obtain the nearest neighbors: three nearest neighbors in category A and two nearest neighbors in category B. Since the majority of the five neighbors belong to category A, the new data point is assigned to category A.
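A minimal scikit-learn sketch (with assumed toy 2-D points in two categories A and B) of the same idea: the new point receives the majority class among its k = 5 nearest neighbours under Euclidean distance:

from sklearn.neighbors import KNeighborsClassifier

X = [[1, 1], [1, 2], [2, 1], [2, 2], [8, 8], [8, 9], [9, 8], [9, 9]]
y = ["A", "A", "A", "A", "B", "B", "B", "B"]      # two categories

knn = KNeighborsClassifier(n_neighbors=5)         # Euclidean distance is the default metric
knn.fit(X, y)

print(knn.predict([[3, 3]]))   # ['A'] - the majority of the 5 nearest neighbours are in category A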

Advantages of KNN Algorithm:


✓ It is simple to implement.
✓ It is robust to the noisy training data
✓ It can be more effective if the training data is large.
Disadvantages of KNN Algorithm:
✓ Always needs to determine the value of K which may be complex some time.
✓ The computation cost is high because of calculating the distance between the data points
for all the training samples.

Support Vector Machine Algorithm


• Support Vector Machine or SVM is one of the most popular Supervised Learning
algorithms, which is used for Classification as well as Regression problems.
• The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put the new data point in
the correct category in the future. This best decision boundary is called a hyperplane.


Example: SVM can be understood with the example that we used for the KNN classifier. Suppose we see a strange cat that also has some features of dogs; if we want a model that can accurately identify whether it is a cat or a dog, such a model can be created using the SVM algorithm. We first train our model with lots of images of cats and dogs so that it can learn their different features, and then we test it with this strange creature. The SVM creates a decision boundary between the two classes (cat and dog) and chooses the extreme cases (support vectors) of cats and dogs. On the basis of the support vectors, it will classify the new example as a cat.

SVM algorithm can be used for Face detection, image classification, text categorization, etc.
Types of SVM
SVM can be of two types:
o Linear SVM: Linear SVM is used for linearly separable data, which means if a dataset
can be classified into two classes by using a single straight line, then such data is termed
as linearly separable data, and classifier is used called as Linear SVM classifier.
o Non-linear SVM: Non-Linear SVM is used for non-linearly separated data, which means
if a dataset cannot be classified by using a straight line, then such data is termed as non-
linear data and classifier used is called as Non-linear SVM classifier.
Support Vectors:
The data points or vectors that are the closest to the hyperplane and which affect the position of
the hyperplane are termed as Support Vector. Since these vectors support the hyperplane, hence
called a Support vector.
Margin in Support Vector Machine
We all know the equation of a hyperplane is w.x+b=0 where w is a vector normal to hyperplane
and b is an offset.


To classify a point as negative or positive we need to define a decision rule. We can define the decision rule as:
classify x as positive if w·x + b ≥ 0, and as negative if w·x + b < 0.
Kernels in Support Vector Machine


• The most interesting feature of SVM is that it can even work with a non-linear dataset; for this we use the "kernel trick", which makes it easier to classify the points.

Different Kernel Functions


Some kernel functions which you can use in SVM are given below:
1. Polynomial Kernel
Following is the formula for the polynomial kernel:
K(x, y) = (x·y + 1)^d
Here d is the degree of the polynomial, which we need to specify manually.

Suppose we have two features X1 and X2 and an output variable Y. Using a degree-2 polynomial kernel, the implicit feature space contains the terms X1², X2² and X1·X2 in addition to X1 and X2, so the 2 original dimensions are converted into 5 dimensions.

2. Sigmoid Kernel
We can use it as a proxy for neural networks. The equation is:
K(x, y) = tanh(γ·xᵀy + r)
3. RBF Kernel
The RBF kernel creates non-linear combinations of the features to lift the samples onto a higher-dimensional feature space where a linear decision boundary can separate the classes. It is the most used kernel in SVM classification. The following formula describes it mathematically:
K(X₁, X₂) = exp(−‖X₁ − X₂‖² / (2σ²))
where,
1. ‘σ’ is the variance and our hyperparameter
2. ‖X₁ – X₂‖ is the Euclidean distance between two points X₁ and X₂

4. Bessel function kernel


It is mainly used for eliminating the cross term in mathematical functions. Following is the
formula of the Bessel function kernel:

5. Anova Kernel
It performs well on multidimensional regression problems. The formula for this kernel function
is:

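Since these kernels are applied through a library in practice, here is a minimal sketch that trains SVM classifiers with different kernels using scikit-learn; the moons dataset and the parameter values are assumptions chosen only for illustration:
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# A small non-linearly separable dataset
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    clf = SVC(kernel=kernel, degree=3, gamma="scale", C=1.0)  # degree is only used by 'poly'
    clf.fit(X_train, y_train)
    print(kernel, "accuracy:", round(clf.score(X_test, y_test), 3))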
Advantages of SVM
• SVM works well when the data is linearly separable
• It is effective in high-dimensional spaces
• With the help of the kernel trick, it can handle complex, non-linear problems
• SVM is relatively insensitive to outliers
• It is useful for tasks such as image classification
Disadvantages of SVM
• Choosing a good kernel is not easy
• It does not scale well to very large datasets
• The main SVM hyperparameters are the cost C and gamma; they are not easy to fine-tune, and it is hard to visualize their impact


Ensemble Methods
• Ensemble methods are techniques that aim at improving the accuracy of results in models
by combining multiple models instead of using a single model.
• The combined models increase the accuracy of the results significantly. This has boosted
the popularity of ensemble methods in machine learning.

Main Types of Ensemble Methods


1. Bagging
• Bagging, short for bootstrap aggregating, is mainly applied in classification and regression.
• It typically uses decision trees as base models and increases accuracy by reducing variance to a large extent.
• Bagging consists of two steps: bootstrapping and aggregation (see the code sketch after this list).
✓ Bootstrapping is a sampling technique where samples are drawn from the whole population (set) with replacement. Sampling with replacement makes the selection procedure randomized. The base learning algorithm is then run on each sample.
✓ Aggregation combines all possible outcomes of the prediction, typically by averaging or voting. Without aggregation, predictions would not be accurate because not all outcomes are taken into consideration. The aggregation is therefore based on the probabilities from the bootstrapping procedure or on all outcomes of the predictive models.
2. Boosting
• Boosting is an ensemble technique that learns from the mistakes of previous predictors to make better predictions in the future.
• The technique combines several weak base learners to form one strong learner, thus significantly improving the predictability of models.
• Boosting works by arranging weak learners in a sequence, such that each weak learner learns from the mistakes of the previous learner in the sequence, creating progressively better predictive models.

3. Stacking
• Stacking, another ensemble method, is often referred to as stacked generalization. This technique works by training a second-level model (a meta-learner) to combine the predictions of several other learning algorithms.
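As a brief illustration of the three ensemble types described above, here is a minimal sketch using scikit-learn; the dataset and all parameter values are assumptions chosen only for illustration:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier, StackingClassifier

X, y = load_breast_cancer(return_X_y=True)

bagging = BaggingClassifier(n_estimators=50, random_state=0)    # bootstrap samples + voting
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)  # sequential weak learners
stacking = StackingClassifier(                                  # meta-learner on base predictions
    estimators=[("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
                ("nb", GaussianNB())],
    final_estimator=LogisticRegression(max_iter=1000))

for name, model in [("bagging", bagging), ("boosting", boosting), ("stacking", stacking)]:
    print(name, round(cross_val_score(model, X, y, cv=5).mean(), 3))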

Random Forest Algorithm


✓ Random Forest is a popular machine learning algorithm that belongs to the
supervised learning technique.
✓ It can be used for both Classification and Regression problems in ML.
✓ It is based on the concept of ensemble learning, which is a process of combining
multiple classifiers to solve a complex problem and to improve the performance of
the model.
✓ Random Forest is a classifier that contains a number of decision trees on various
subsets of the given dataset and takes the average to improve the predictive
accuracy of that dataset.
✓ Instead of relying on one decision tree, the random forest takes the prediction from each tree and, based on the majority vote of those predictions, produces the final output.
✓ A greater number of trees in the forest generally leads to higher accuracy and helps prevent overfitting.
The below diagram explains the working of the Random Forest algorithm:

Random Forest algorithm working


➢ Random Forest works in two phases: first, create the random forest by combining N decision trees; second, make a prediction with each tree created in the first phase.
The working process can be explained in the below steps:
Step-1: Choose the number N of decision trees you want to build.
Step-2: Select K random data points (a bootstrap subset) from the training set.
Step-3: Build a decision tree on the selected subset.
Step-4: Repeat Steps 2 and 3 until N trees are built.
Step-5: For a new data point, obtain the prediction of each decision tree and assign the point to the category that wins the majority vote.
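A minimal sketch of this procedure with scikit-learn's RandomForestClassifier might look like the following; the dataset and parameter values are assumptions used only for illustration:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators is the number N of decision trees; each tree is trained on a bootstrap sample
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print("Test accuracy:", forest.score(X_test, y_test))   # prediction = majority vote of the trees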

This combination of multiple models is called an Ensemble. Ensembles use two main methods:
1. Bagging: Creating different training subsets from the sample training data with replacement; the final output is based on majority voting.
2. Boosting: Combining weak learners into a strong learner by creating sequential models such that the final model has the highest accuracy. Examples: AdaBoost, XGBoost.

Bagging: From the principle mentioned above, we can see that Random Forest uses the bagging technique. Bagging is also known as Bootstrap Aggregation. The process begins with the original random data, which is sampled with replacement into subsets known as bootstrap samples; this step is called bootstrapping. The models are then trained individually on these samples, yielding different results. In the last step, all the results are combined and the output is generated based on majority voting; this step is known as aggregation, and the whole procedure is carried out by an ensemble classifier.


Example: Suppose there is a dataset that contains multiple fruit images. So, this dataset is given
to the Random forest classifier. The dataset is divided into subsets and given to each decision
tree. During the training phase, each decision tree produces a prediction result, and when a new
data point occurs, then based on the majority of results, the Random Forest classifier predicts
the final decision. Consider the below image:

Applications of Random Forest


There are mainly four sectors where Random Forest is mostly used:
1. Banking: Banking sector mostly uses this algorithm for the identification of loan risk.
2. Medicine: With the help of this algorithm, disease trends and risks of the disease can be
identified.
3. Land Use: We can identify the areas of similar land use by this algorithm.
4. Marketing: Marketing trends can be identified using this algorithm.
Advantages of Random Forest
o Random Forest is capable of performing both Classification and Regression tasks.
o It is capable of handling large datasets with high dimensionality.
o It enhances the accuracy of the model and prevents the overfitting issue.

Disadvantages of Random Forest


o Although Random Forest can be used for both classification and regression tasks, it is not as well suited to regression tasks.

Feature Generation
• Feature generation is the process of constructing new features from existing ones. The
goal of feature generation is to derive new combinations and representations of our data
that might be useful to the machine learning model.

• A feature (or column) represents a measurable piece of data like name, age or gender.
• It is the basic building block of a dataset.
• The quality of a feature can vary significantly and has an immense effect on model
performance.
• We can improve the quality of a dataset’s features in the pre-processing stage using
processes like Feature Generation and Feature Selection.
• Feature Generation (also known as feature construction, feature extraction or feature
engineering) is the process of transforming features into new features that better relate to
the target.
• This can involve mapping a feature into a new feature using a function like log, or creating
a new feature from one or multiple features using multiplication or addition.
• Feature Generation can improve model performance when there is a feature interaction.
• The generation of new flexible features is important as it allows us to use less complex
models that are faster to run and easier to understand and maintain.

Feature selection:
• Feature selection is a process that chooses a subset of features from the original features
so that the feature space is optimally reduced according to a certain criterion.
• Its goal is to find the best possible set of features for building a machine learning model.


The role of feature selection in machine learning is,


1. To reduce the dimensionality of feature space.
2. To speed up a learning algorithm.
3. To improve the predictive accuracy of a classification algorithm.
4. To improve the comprehensibility of the learning results.
Some popular techniques of feature selection in machine learning are:
• Filter methods
• Wrapper methods
• Embedded methods
Filter Methods
✓ These methods are generally used while doing the pre-processing step.
✓ These methods select features from the dataset irrespective of the use of any machine
learning algorithm.
✓ They are very fast and inexpensive and are very good for removing duplicated, correlated,
redundant features but these methods do not remove multicollinearity. S

Some techniques used are:


• Information Gain – It is defined as the amount of information provided by the
feature for identifying the target value and measures reduction in the entropy values.
Information gain of each attribute is calculated considering the target values for feature
selection.
• Chi-square test – The chi-square method (X²) is generally used to test the relationship between categorical variables. It compares the observed values of different attributes of the dataset with their expected values:
X² = Σ (Observed − Expected)² / Expected
• Fisher’s Score – Fisher’s Score selects each feature independently according to their
scores under Fisher criterion leading to a suboptimal set of features. The larger the
Fisher’s score is, the better is the selected feature.

• Correlation Coefficient – Pearson's correlation coefficient is a measure quantifying the association between two continuous variables and the direction of the relationship, with values ranging from -1 to 1.
• Variance Threshold – It is an approach where all features are removed whose
variance doesn’t meet the specific threshold. By default, this method removes features
having zero variance. The assumption made using this method is higher variance
features are likely to contain more information.
• Mean Absolute Difference (MAD) – This method is similar to variance threshold
method but the difference is there is no square in MAD. This method calculates the
mean absolute difference from the mean value.
• Dispersion Ratio – Dispersion ratio is defined as the ratio of the Arithmetic mean
(AM) to that of Geometric mean (GM) for a given feature. Its value ranges from +1
to ∞ as AM ≥ GM for a given feature. Higher dispersion ratio implies a more relevant
feature.
• Mutual Dependence – This method measures if two variables are mutually
dependent, and thus provides the amount of information obtained for one variable on
observing the other variable. Depending on the presence/absence of a feature, it
measures the amount of information that feature contributes to making the target
prediction.
• Relief – This method measures the quality of attributes by randomly sampling an instance from the dataset, finding its nearest neighbors from the same and the opposite class, and updating each feature's weight according to how well the feature distinguishes the sampled instance from those neighbors.
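As an illustration, here is a minimal sketch that applies two of the filter techniques above, a variance threshold and the chi-square test, using scikit-learn; the dataset, the threshold value and k are assumptions chosen only for illustration:
from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold, SelectKBest, chi2

X, y = load_iris(return_X_y=True)

X_var = VarianceThreshold(threshold=0.2).fit_transform(X)       # drop low-variance features
X_chi = SelectKBest(score_func=chi2, k=2).fit_transform(X, y)   # keep the 2 features most related to y

print(X.shape, X_var.shape, X_chi.shape)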
Wrapper methods:
✓ Wrapper methods, also referred to as greedy algorithms, train the model using a subset of features in an iterative manner.
✓ Based on the conclusions drawn from the previously trained model, features are added to or removed from the subset.
✓ The stopping criteria for selecting the best subset are usually pre-defined by the person training the model, for example when the performance of the model starts to decrease or when a specific number of features has been reached.
✓ The main advantage of wrapper methods over filter methods is that they provide an optimal set of features for training the model, thus resulting in better accuracy than filter methods, but they are computationally more expensive.


Given a predefined classifier, a typical wrapper model will perform the following steps:
Step 1: searching a subset of features,
Step 2: evaluating the selected subset of features by the performance of the classifier,
Step 3: repeating Step 1 and Step 2 until the desired quality is reached.
Some techniques used are:
• Forward selection – This method is an iterative approach where we initially start
with an empty set of features and keep adding a feature which best improves our model
after each iteration. The stopping criterion is till the addition of a new variable does
not improve the performance of the model.
• Backward elimination – This method is also an iterative approach where we
initially start with all features and after each iteration, we remove the least significant
feature. The stopping criterion is till no improvement in the performance of the model
is observed after the feature is removed.
• Bi-directional elimination – This method uses both forward selection and backward
elimination technique simultaneously to reach one unique solution.
• Exhaustive selection – This technique is considered as the brute force approach for
the evaluation of feature subsets. It creates all possible subsets and builds a learning
algorithm for each subset and selects the subset whose model’s performance is best.
• Recursive elimination – This greedy optimization method selects features by recursively considering smaller and smaller sets of features. The estimator is trained on an initial set of features and their importance is obtained (for example from the coef_ or feature_importances_ attribute). The least important features are then removed from the current set of features until we are left with the required number of features.
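As an illustration of a wrapper method, here is a minimal sketch of recursive feature elimination with scikit-learn; the estimator and the number of features to keep are assumptions chosen only for illustration:
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# The estimator is refit repeatedly and the least important features are dropped each round
selector = RFE(estimator=DecisionTreeClassifier(random_state=0), n_features_to_select=10, step=1)
selector.fit(X, y)
print("Selected feature mask:", selector.support_)
print("Feature ranking:", selector.ranking_)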
Embedded methods:
✓ In embedded methods, the feature selection algorithm is blended as part of the learning
algorithm, thus having its own built-in feature selection methods.


✓ Embedded methods address the drawbacks of filter and wrapper methods and merge their advantages: they are fast like filter methods, more accurate than filter methods, and take combinations of features into consideration as well.

Embedded Methods Implementation


Some techniques used are:
• Regularization – This method adds a penalty to different parameters of the machine
learning model to avoid over-fitting of the model. This approach of feature selection
uses Lasso (L1 regularization) and Elastic nets (L1 and L2 regularization). The penalty
is applied over the coefficients, thus bringing down some coefficients to zero. The
features having zero coefficient can be removed from the dataset.
• Tree-based methods – Methods such as Random Forest and Gradient Boosting provide feature importances as a way to select features. Feature importance tells us which features have more impact on the target feature.
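As an illustration of embedded methods, here is a minimal sketch that uses L1 regularization (Lasso) and tree-based feature importances with scikit-learn; the dataset and the alpha value are assumptions chosen only for illustration:
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestRegressor

X, y = load_diabetes(return_X_y=True)

lasso = Lasso(alpha=0.5).fit(X, y)
print("Features kept by Lasso:", np.flatnonzero(lasso.coef_))     # zero-coefficient features are dropped

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print("Tree-based importances:", np.round(forest.feature_importances_, 3))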


UNIT IV CLUSTERING
Choosing distance metrics - Different clustering approaches - hierarchical agglomerative clustering, k-
means (Lloyd's algorithm), - DBSCAN - Relative merits of each method - clustering tendency and quality.

Choosing distance metrics


A distance metric is a function that quantifies how far apart two elements in a dataset are. Distance metrics help algorithms recognize similarities between data points.
Types of Distance Metrics in Machine Learning
➢ Euclidean Distance
➢ Manhattan Distance
➢ Minkowski Distance
➢ Hamming Distance
Euclidean Distance
• Euclidean distance is the shortest distance between any two points in a metric space. Consider two
points x and y in a two-dimensional plane with coordinates (x1, x2) and (y1, y2), respectively.
• It is the square root of the sum of squares of differences between corresponding coordinates of the
two points.
• Mathematically, the Euclidean distance between the points x and y in a two-dimensional plane is given by:
d(x, y) = √((x1 − y1)² + (x2 − y2)²)
Extending to n dimensions, where the points x and y are of the form x = (x1, x2, …, xn) and y = (y1, y2, …, yn), we have the following equation for Euclidean distance:
d(x, y) = √((x1 − y1)² + (x2 − y2)² + … + (xn − yn)²)
Where,
n = number of dimensions
xi, yi = data points
Computing Euclidean Distance in Python
from scipy.spatial import distance
We then initialize two points x and y like so:
x = [3,6,9]
y = [1,0,1]
We can use the euclidean convenience function to find the Euclidean distance between the points x and y:
print(distance.euclidean(x,y))
Output >> 10.198039027185569

Manhattan Distance
Manhattan Distance is the sum of absolute differences between points across all the dimensions.
The Manhattan distance between the points x and y in a two-dimensional plane is given by:
d(x, y) = |x1 − y1| + |x2 − y2|
In n-dimensional space, where each point has n coordinates, the Manhattan distance is given by:
d(x, y) = |x1 − y1| + |x2 − y2| + … + |xn − yn|
where,
n = number of dimensions
xi, yi = data points
Computing Manhattan Distance in Python
from scipy.spatial import distance
x = [3,6,9]
y = [1,0,1]

To compute the Manhattan (or cityblock) distance, we can use the cityblock function:
print(distance.cityblock(x,y))
Output >> 16

Minkowski Distance
Minkowski Distance is the generalized form of the Euclidean and Manhattan distances:
d(x, y) = (Σᵢ |xi − yi|^p)^(1/p)
Here, p represents the order of the norm: p = 1 gives the Manhattan distance and p = 2 gives the Euclidean distance.

Hamming Distance
Hamming distance is a metric for comparing two binary data strings. While comparing two binary strings of
equal length, Hamming distance is the number of bit positions in which the two bits are different.
The Hamming distance between two strings a and b is denoted as d(a, b).
In order to calculate the Hamming distance between two binary strings a and b, we perform their XOR operation, (a ⊕ b), and then count the total number of 1s in the resultant string.
Suppose there are two strings 11011001 and 10011101.
11011001 ⊕ 10011101 = 01000100. Since, this contains two 1s, the Hamming distance, d(11011001,
10011101) = 2.
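The remaining two metrics can also be computed with SciPy. The following minimal sketch reuses the points from the examples above; note that SciPy's hamming function returns the fraction of differing positions, so it is multiplied by the string length to obtain the count:
from scipy.spatial import distance

x = [3, 6, 9]
y = [1, 0, 1]
print(distance.minkowski(x, y, p=3))          # order-3 Minkowski distance

a = [1, 1, 0, 1, 1, 0, 0, 1]                  # bits of 11011001
b = [1, 0, 0, 1, 1, 1, 0, 1]                  # bits of 10011101
print(int(distance.hamming(a, b) * len(a)))   # Hamming distance = 2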

clustering approaches
• Clustering is a set of techniques used to partition data into groups, or clusters.
• Clusters are defined as groups of data objects that are more similar to other objects in their cluster
than to data objects in other clusters.
• The clustering technique is commonly used for statistical data analysis.
• It is an unsupervised learning method, hence no supervision is provided to the algorithm, and it
deals with the unlabeled dataset.
Applications of Clustering in different fields:
Marketing: It can be used to characterize & discover customer segments for marketing purposes.
Biology: It can be used for classification among different species of plants and animals.
Libraries: It is used in clustering different books on the basis of topics and information.
Insurance: It is used to understand customers and their policies and to identify fraud.


City Planning: It is used to make groups of houses and to study their values based on their geographical
locations and other factors present.
Earthquake studies: By learning the earthquake-affected areas we can determine the dangerous zones.
Image Processing: Clustering can be used to group similar images together, classify images based on
content, and identify patterns in image data.
Genetics: Clustering is used to group genes that have similar expression patterns and identify gene networks
that work together in biological processes.
Finance: Clustering is used to identify market segments based on customer behavior, identify patterns in
stock market data, and analyze risk in investment portfolios.
Customer Service: Clustering is used to group customer inquiries and complaints into categories, identify
common issues, and develop targeted solutions.
Manufacturing: Clustering is used to group similar products together, optimize production processes, and
identify defects in manufacturing processes.
Medical diagnosis: Clustering is used to group patients with similar symptoms or diseases, which helps in
making accurate diagnoses and identifying effective treatments.
Fraud detection: Clustering is used to identify suspicious patterns or anomalies in financial transactions,
which can help in detecting fraud or other financial crimes.
Traffic analysis: Clustering is used to group similar patterns of traffic data, such as peak hours, routes, and
speeds, which can help in improving transportation planning and infrastructure.
Social network analysis: Clustering is used to identify communities or groups within social networks,
which can help in understanding social behavior, influence, and trends.
Cybersecurity: Clustering is used to group similar patterns of network traffic or system behavior, which
can help in detecting and preventing cyberattacks.
Climate analysis: Clustering is used to group similar patterns of climate data, such as temperature,
precipitation, and wind, which can help in understanding climate change and its impact on the environment.
Sports analysis: Clustering is used to group similar patterns of player or team performance data, which can
help in analyzing player or team strengths and weaknesses and making strategic decisions.
Crime analysis: Clustering is used to group similar patterns of crime data, such as location, time, and type,
which can help in identifying crime hotspots, predicting future crime trends, and improving crime
prevention strategies.
Benefits of clustering
✓ It helps to visualize high-dimensional data
✓ It further enables data scientists to deal with different types of data like discrete, categorical, and
binary data
✓ It gives structure to unstructured data sets by organizing them into groups
✓ Helps to identify obscure patterns and relationships within a data set
✓ It helps to carry out exploratory data analysis
✓ It can also be used for market segmentation, customer profiling, and more

How does Clustering differ from Classification?


→Classification Algorithms are good techniques to distinguish between groups and classify.
→Classification requires manual labeling of data which is a tiring process when a dataset is huge.
→A Clustering Algorithm does not require labels to further proceed since it is an unsupervised technique.

Clustering Methods:
• Density-Based Methods:
➢ The density-based clustering method connects the highly-dense areas into clusters, and the
arbitrarily shaped distributions are formed as long as the dense region can be connected.


➢ This algorithm does it by identifying different clusters in the dataset and connects the areas
of high densities into clusters.
➢ The dense areas in data space are divided from each other by sparser areas.
➢ These algorithms can face difficulty in clustering the data points if the dataset has varying
densities and high dimensions.
➢ These methods have good accuracy and the ability to merge two clusters. Example DBSCAN
(Density-Based Spatial Clustering of Applications with Noise), OPTICS (Ordering Points to
Identify Clustering Structure), etc.

• Hierarchical Based Methods:


➢ There is no requirement of pre-specifying the number of clusters to be created.
➢ In this technique, the dataset is divided into clusters to create a tree-like structure, which is
also called a dendrogram.
➢ The clusters formed in this method form a tree-type structure based on the hierarchy.
➢ New clusters are formed using the previously formed one. It is divided into two category
✓ Agglomerative (bottom-up approach)
✓ Divisive (top-down approach)

• Partitioning Methods:
It is a type of clustering that divides the data into non-hierarchical groups. It is also known as
the centroid-based method. The most common example of partitioning clustering is the K-Means
Clustering algorithm.
In this type, the dataset is divided into a set of k groups, where K is used to define the number of pre-defined
groups. The cluster center is created in such a way that the distance between the data points of one cluster
is minimum as compared to another cluster centroid.
Common Algorithms used in this method are,
• K-Means
• K-Medoids
• K-Modes


• Grid-based Methods:
In this method, the data space is formulated into a finite number of cells that form a grid-like
structure. All the clustering operations done on these grids are fast and independent of the number
of data objects example STING (Statistical Information Grid), wave cluster, CLIQUE (Clustering
In Quest), etc.

K-Means Clustering
• K-Means Clustering is an Unsupervised Learning algorithm, which groups the unlabeled dataset
into different clusters. Here K defines the number of pre-defined clusters that need to be created in
the process, as if K=2, there will be two clusters, and for K=3, there will be three clusters, and so
on.
• It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each data point belongs to only one group with similar properties.
• It allows us to cluster the data into different groups and provides a convenient way to discover the categories of groups in an unlabeled dataset on its own, without the need for any training.
• It is a centroid-based algorithm, where each cluster is associated with a centroid.
• The main aim of this algorithm is to minimize the sum of distances between the data point and their
corresponding clusters
• The algorithm takes the unlabeled dataset as input, divides the dataset into k-number of clusters,
and repeats the process until it does not find the best clusters. The value of k should be
predetermined in this algorithm.
• The k-means clustering algorithm mainly performs two tasks:

✓ Determines the best value for K center points or centroids by an iterative process.
✓ Assigns each data point to its closest k-center. Those data points which are near to the particular k-
center, create a cluster.
→Hence each cluster has datapoints with some commonalities, and it is away from other clusters.

K-Means clustering takes as input a dataset D of n objects and a parameter k, and then divides D into k groups.
→ Cluster similarity is measured with respect to the mean value of the objects in a cluster, which can be viewed as the cluster’s center.
→ K-Means iteratively relocates the cluster centers by recomputing the mean of each cluster.


✓ Initially, K-Means chooses k cluster centers randomly.


✓ Distance is calculated between each data point and cluster centers (Euclidean distance is
commonly used).
✓ A data point is assigned to a Cluster to which it is very close.
✓ After all the data points are assigned to a cluster the algorithm computes the mean of the cluster
data points and relocates the cluster centers to its corresponding mean of the cluster.
✓ This process is continued until the cluster centers do not change.

How does the K-Means Algorithm Work?


The working of the K-Means algorithm is explained in the below steps:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select random K points or centroids. (It can be other from the input dataset).
Step-3: Assign each data point to their closest centroid, which will form the predefined K clusters.
Step-4: Calculate the variance and place a new centroid of each cluster.
Step-5: Repeat the third step: reassign each data point to the new closest centroid of each cluster.
Step-6: If any reassignment occurs, then go to step-4 else go to FINISH.
Step-7: The model is ready.

→The quality of the cluster assignments is determined by computing the sum of the squared error (SSE) after
the centroids converge, or match the previous iteration’s assignment.
→ The SSE is defined as the sum of the squared Euclidean distances of each point to its closest centroid.
Since this is a measure of error, the objective of k-means is to try to minimize this value.


How to choose the value of K (the number of clusters) in K-means clustering?


The performance of the K-means clustering algorithm depends on the quality of the clusters it forms, so choosing K well is important.
✓ Elbow Method
The Elbow method is one of the most popular ways to find the optimal number of clusters. This method
uses the concept of WCSS value. WCSS stands for Within Cluster Sum of Squares, which defines the
total variations within a cluster. The formula to calculate the value of WCSS (for 3 clusters) is given below:
WCSS = ∑(Pi in Cluster1) distance(Pi, C1)² + ∑(Pi in Cluster2) distance(Pi, C2)² + ∑(Pi in Cluster3) distance(Pi, C3)²
In the above formula of WCSS,
∑(Pi in Cluster1) distance(Pi, C1)² is the sum of the squares of the distances between each data point and its centroid within cluster 1, and the same holds for the other two terms.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
%matplotlib inline

sns.set_style('whitegrid')
plt.style.use('fivethirtyeight')

dataset = pd.read_csv("Data......csv", sep=",")
dataset.head()
dataset.info()

plt.figure(figsize=(10, 9))
sns.scatterplot(x='Annual Income (k$)', y='Spending Score (1-100)', data=dataset, alpha=0.9)

data_x = dataset.iloc[:, 3:5]
x_array = np.array(data_x)
x_scaled = StandardScaler().fit_transform(x_array)   # feature scaling (see the note below)

Sum_of_squared_distances = []
K = range(1, 15)
for k in K:
    km = KMeans(n_clusters=k, n_init=10)
    km = km.fit(x_scaled)
    Sum_of_squared_distances.append(km.inertia_)     # inertia_ is the WCSS/SSE for this k

plt.plot(K, Sum_of_squared_distances, 'bx-')
plt.xlabel('k')
plt.ylabel('SSE')
plt.title('Elbow Method For Optimal k')
plt.show()

→The process of transforming numerical features to use the same scale is known as feature scaling.
Advantages:
• Scalability: it performs well on huge datasets
• K-Means is faster than many other clustering algorithms
Disadvantages:
• K-Means is sensitive to outliers.
• Cluster results vary with the value of k and the initial choice of cluster centers.
• The K-Means algorithm works well only for roughly spherical clusters and fails to perform well on arbitrary shapes of data.


Hierarchical Clustering
➔ Hierarchical clustering is another unsupervised machine learning algorithm, which is used to group unlabeled datasets into clusters; it is also known as hierarchical cluster analysis (HCA).
➔ In this algorithm, the hierarchy of clusters is built in the form of a tree, and this tree-shaped structure is known as the dendrogram.
➔ Dendrograms are tree diagrams frequently used to illustrate the arrangement of the clusters produced
by hierarchical clustering.
➔ A subset of similar data is created in a tree-like structure in which the root node corresponds to the entire dataset and branches are created from the root node to form several clusters.
➔ The optimal number of clusters can be read off the dendrogram: it equals the number of vertical lines that a horizontal cut line passes through.

➔ Hierarchical clustering builds clusters with a predetermined ordering from top to bottom. There are two types of hierarchical clustering: Agglomerative and Divisive.
➔ Divisive Clustering: the type of hierarchical clustering that uses a top-down approach. It starts with one cluster containing all the data and repeatedly splits the least similar cluster into two. Divisive clustering is not commonly used in practice.
➔ Agglomerative Clustering: the type of hierarchical clustering that uses a bottom-up approach. It repeatedly merges the two most similar clusters until only one cluster remains. The following steps describe how agglomerative hierarchical clustering works:

steps to agglomerative hierarchical clustering


➢ Preparing the data


➢ Computing (dis)similarity information between every pair of objects in the data set.
➢ Using linkage function to group objects into hierarchical cluster tree, based on the distance
information generated at step 1. Objects/clusters that are in close proximity are linked together using
the linkage function.
➢ Determining where to cut the hierarchical tree into clusters. This creates a partition of the data.

steps:
✓ Consider each alphabet as a single cluster and calculate the distance of one cluster from all the
other clusters.
✓ In the second step, comparable clusters are merged together to form a single cluster. Let’s say
cluster (B) and cluster (C) are very similar to each other therefore we merge them in the second
step similarly to cluster (D) and (E) and at last, we get the clusters [(A), (BC), (DE), (F)]
✓ We recalculate the proximity according to the algorithm and merge the two nearest clusters([(DE),
(F)]) together to form new clusters as [(A), (BC), (DEF)]
✓ Repeating the same process; The clusters DEF and BC are comparable and merged together to
form a new cluster. We’re now left with clusters [(A), (BCDEF)].
✓ At last, the two remaining clusters are merged together to form a single cluster [(ABCDEF)].
Algorithm :
given a dataset (d1, d2, d3, ..., dN) of size N
# compute the distance matrix
for i = 1 to N:
    # as the distance matrix is symmetric about the primary diagonal,
    # we compute only the lower part of the primary diagonal
    for j = 1 to i:
        dis_mat[i][j] = distance[di, dj]
# each data point starts as a singleton cluster
repeat
    merge the two clusters having minimum distance
    update the distance matrix
until only a single cluster remains

Measure for the distance between two clusters


• How the distance between two clusters is measured is crucial for hierarchical clustering.
• There are various ways to calculate the distance between two clusters, and these decide the rule for merging. These measures are called linkage methods.
• Some of the popular linkage methods are given below:


• Single Linkage: It is the Shortest Distance between the closest points of the clusters. Consider the
below image:

Complete Linkage: It is the farthest distance between the two points of two different clusters. It is one of
the popular linkage methods as it forms tighter clusters than single-linkage.

Average Linkage: It is the linkage method in which the distance between each pair of points (one from each cluster) is added up and then divided by the total number of pairs to calculate the average distance between two clusters. It is also one of the most popular linkage methods.
Centroid Linkage: It is the linkage method in which the distance between the centroid of the clusters is
calculated. Consider the below image:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Import the dataset
df = pd.read_csv("Data......csv", sep=",")
df.head()

# Histogram of annual income
plt.figure(figsize=(8,5))
plt.title("Annual income distribution", fontsize=15)
plt.xlabel("Annual income (k$)", fontsize=13)
plt.grid(True)
plt.hist(df['Annual Income (k$)'], color='blue', edgecolor='k')
plt.show()

# Histogram of spending score
plt.figure(figsize=(8,5))
plt.title("Spending Score distribution", fontsize=15)
plt.xlabel("Spending Score (1-100)", fontsize=14)
plt.grid(True)
plt.hist(df['Spending Score (1-100)'], color='brown', edgecolor='k')
plt.show()

# Scatter plot of the two features
plt.figure(figsize=(11,8))
plt.title("Annual Income and Spending Score Correlation", fontsize=18)
plt.xlabel("Annual Income (k$)", fontsize=14)
plt.ylabel("Spending Score (1-100)", fontsize=14)
plt.grid(True)
plt.scatter(df['Annual Income (k$)'], df['Spending Score (1-100)'], color='green', edgecolor='k', alpha=0.6, s=100)
plt.show()

# Build and plot the dendrogram using Ward linkage
X = df.iloc[:, [3, 4]].values
import scipy.cluster.hierarchy as sch
plt.figure(figsize=(17,10))
plt.title('Dendrogram')
plt.xlabel('Customers')
plt.ylabel('Euclidean distances')
dendrogram = sch.dendrogram(sch.linkage(X, method='ward'))
plt.show()

Pros
• No need to assume a particular number of clusters in advance (unlike k-means)
• The hierarchy may correspond to meaningful taxonomies
Cons
• Once a decision is made to combine two clusters, it cannot be undone
• Too slow for large data sets: O(n² log n)


Agglomerative vs. Divisive Clustering
1. Approach: Agglomerative clustering is a bottom-up approach; divisive clustering is a top-down approach.
2. Process: In agglomerative clustering, each data point starts in its own cluster and the algorithm recursively merges the closest pairs of clusters until a single cluster containing all the data points is obtained. In divisive clustering, all data points start in a single cluster and the algorithm recursively splits the cluster into smaller sub-clusters until each data point is in its own cluster.
3. Cost: Agglomerative clustering is generally more computationally expensive, especially for large datasets, because it requires the calculation of all pairwise distances between data points. Divisive clustering is comparatively less expensive, as it only requires the calculation of distances between sub-clusters.
4. Outliers: Agglomerative clustering can handle outliers better, since outliers can be absorbed into larger clusters; divisive clustering may create sub-clusters around outliers, leading to suboptimal clustering results.
5. Interpretability: Agglomerative clustering tends to produce more interpretable results, since the dendrogram shows the merging process and the user can choose the number of clusters based on the desired level of granularity. Divisive clustering can be more difficult to interpret, since the dendrogram shows the splitting process and the user must choose a stopping criterion to determine the number of clusters.


DBSCAN
• DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density based
clustering algorithm.
• The main concept of DBSCAN algorithm is to locate regions of high density that are separated from
one another by regions of low density.
• DBSCAN can identify clusters in a large spatial dataset by looking at the local density of
corresponding elements.
• The advantage of DBSCAN over the K-Means algorithm is that DBSCAN can determine which data points are noise or outliers; it can identify points that are not part of any cluster (very useful as an outlier detector).
• It is slower than agglomerative clustering and k-means, but it still scales to relatively large datasets.
There are two parameters in DBSCAN: minPoints and eps :
eps: specifies how close points should be to each other to be considered a part of a cluster. It means that if
the distance between two points is lower or equal to this value (eps), these points are considered to be
neighbors.
minPoints: the minimum number of data points to form a dense region/ cluster. For example, if we set
the minPoints parameter as 5, then we need at least 5 points to form a dense region.

Based on the two parameters, the points are classified as Core point, Border point and Noise point.


Core Point
• A point is a core point if it has at least minPoints points within an eps radius around it, i.e. |N(p)| ≥ minPoints.
• Core Point always belongs in a dense region.
• For example, let’s consider ‘p’ is set to be a core point if ‘p’ has ≥ minPoints in an eps radius around
it.
Border Point
• A point is a border point if it has fewer than minPoints within eps, but is in the neighborhood of
a core point.
• For example, p is set to be a border point if ‘p’ is not a core point. i.e ‘p’ has
< minPoints in eps radius. But ‘p’ should belong to the neighborhood ‘q’. Where ‘q’ is a core point.
• p ∈ neighborhood of q and distance(p,q) ≤ eps .
Noise Point
• A noise point is any point that is not a core point or a border point.

Density Edge:
If p and q both are core points and distance between (p,q) ≤ eps then we can connect p, q vertex in a graph
and call it “Density Edge”.
Density Connected Points:
Two points p and q are said to be density connected if both are core points and there exists a path formed by density edges connecting point p to point q.

Steps of DBSCAN Algorithm:


➢ The algorithm starts with an arbitrary point which has not been visited, and its neighborhood information is retrieved using the eps parameter.
➢ If this point contains at least minPoints within its eps neighborhood, cluster formation starts. Otherwise the point is labeled as noise. This point can later be found within the eps neighborhood of a different point and thus be made part of a cluster.
➢ If a point is found to be a core point, then the points within its eps neighborhood are also part of the cluster. So all the points found within the eps neighborhood are added, along with their own eps neighborhoods if they are also core points.
➢ The above process continues until the density-connected cluster is completely found.
➢ The process restarts with a new point, which can become part of a new cluster or be labeled as noise.


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
%matplotlib inline

# Import the dataset
df = pd.read_csv('Data_Customer_Mall.csv', sep=',')
df.head()

# Select and scale the features used for clustering
Clus_dataSet = df[['Annual_Income', 'Spending_Score']]
Clus_dataSet = np.nan_to_num(Clus_dataSet)
Clus_dataSet = np.array(Clus_dataSet, dtype=np.float64)
Clus_dataSet = StandardScaler().fit_transform(Clus_dataSet)

# Compute DBSCAN
db = DBSCAN(eps=0.4, min_samples=5).fit(Clus_dataSet)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
df['Clus_Db'] = labels

# Number of clusters (label -1 marks noise points)
realClusterNum = len(set(labels)) - (1 if -1 in labels else 0)
clusterNum = len(set(labels))

# A sample of the clustered data and the labels found
print(df[['Annual_Income', 'Spending_Score']].head())
print("number of labels: ", set(labels))

Comparison of k-means Clustering and Hierarchical Clustering
1. k-means assigns records to a pre-specified number of clusters to find mutually exclusive clusters of roughly spherical shape based on distance. Hierarchical methods can be either divisive or agglomerative.
2. k-means clustering needs advance knowledge of K, i.e. the number of clusters into which you want to divide your data. In hierarchical clustering one can stop at any number of clusters, finding an appropriate number by interpreting the dendrogram.
3. In k-means one can use the median or mean as a cluster centre to represent each cluster. Agglomerative methods begin with n clusters and sequentially combine similar clusters until only one cluster is obtained; divisive methods work in the opposite direction, beginning with one cluster that includes all the records.
4. k-means methods are normally less computationally intensive and are suited to very large datasets. Hierarchical methods are especially useful when the goal is to arrange the clusters into a natural hierarchy.
5. In k-means clustering, since one starts with a random choice of cluster centres, the results produced by running the algorithm many times may differ. In hierarchical clustering, results are reproducible.
6. k-means clustering is simply a division of the set of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset. A hierarchical clustering is a set of nested clusters that are arranged as a tree.
7. k-means clustering works well when the clusters are hyper-spherical (like a circle in 2D or a sphere in 3D). Hierarchical clustering does not work as well as k-means when the shape of the clusters is hyper-spherical.
8. Advantages of k-means: convergence is guaranteed; it can be specialized to clusters of different sizes and shapes. Advantages of hierarchical clustering: ease of handling any form of similarity or distance and, consequently, applicability to any attribute type.
9. Disadvantages of k-means: the K value is difficult to predict; it does not work well with global clusters. Disadvantage of hierarchical clustering: it requires the computation and storage of an n×n distance matrix, which can be expensive and slow for very large datasets.

Lloyd’s Algorithm

We can improve the distortion in two ways: by changing the points’ cluster assignments and by moving the cluster centers. Specifically, given a fixed set of points X, we are attempting to minimize the distortion function
J(X, C) = Σᵢ ‖xᵢ − c(xᵢ)‖²,
the sum of squared distances from each point to its assigned cluster center, by choosing the best set of cluster centers C.


1. Initialize the centers.
2. Until the algorithm converges:
a. Assign each point to its currently closest cluster center.
b. Move each center to the mean of its currently assigned points.


Step 1 occurs only once, while steps 2(a) and 2(b) alternate until the algorithm converges. Convergence is guaranteed because steps 2(a) and 2(b) both reduce J(X, C), and there is a finite number of ways to partition the n points among k clusters.
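A minimal NumPy sketch of Lloyd's algorithm, following the steps above, might look like this; the function name and the random initialization are assumptions made for illustration, and empty clusters are not handled:
import numpy as np

def lloyd(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]        # step 1: initialize the centers
    for _ in range(n_iter):
        # step 2(a): assign each point to its currently closest cluster center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 2(b): move each center to the mean of its currently assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):                     # converged: J can no longer decrease
            break
        centers = new_centers
    return centers, labels

centers, labels = lloyd(np.random.default_rng(1).random((200, 2)), k=3)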

Comparison of k-means and DBSCAN Clustering
1. In k-means, the clusters formed are more or less spherical or convex in shape and of comparable size. In DBSCAN, the clusters formed can be arbitrary in shape and need not have the same size.
2. k-means clustering is sensitive to the number of clusters specified; in DBSCAN the number of clusters need not be specified.
3. k-means clustering is more efficient for large datasets; DBSCAN cannot efficiently handle high-dimensional datasets.
4. k-means clustering does not work well with outliers and noisy datasets; DBSCAN handles outliers and noisy datasets efficiently.
5. In anomaly detection, k-means causes problems because anomalous points are assigned to the same cluster as "normal" data points, whereas DBSCAN locates regions of high density that are separated from one another by regions of low density.
6. k-means requires one parameter, the number of clusters (K). DBSCAN requires two parameters: a radius (R, the eps value), which determines whether a neighborhood contains enough points to count as a dense area, and minimum points (M, minPoints), the minimum number of data points required in a neighborhood for it to be defined as a cluster.
7. Varying densities of the data points do not affect the k-means clustering algorithm, whereas DBSCAN does not work very well for sparse datasets or for data points with varying density.


How is the quality of a clustering method estimated?


A clustering method is judged by its ability to discover some or all of the hidden patterns in the data. To measure the quality of a clustering, we can use the average silhouette coefficient value of all objects in the data set.
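As an illustration, here is a minimal sketch that computes the average silhouette coefficient for a k-means clustering with scikit-learn; the synthetic data and parameter values are assumptions chosen only for illustration (values close to +1 indicate well-separated clusters):
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
print("Average silhouette coefficient:", round(silhouette_score(X, labels), 3))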

UNIT V DATA VISUALIZATION
Importing Matplotlib – Line plots – Scatter plots – visualizing errors – density and contour plots –
Histograms – legends – colors – subplots – text and annotation – customization – three dimensional
plotting - Geographic Data with Basemap - Visualization with Seaborn.
_______________________________________________________________

DATA VISUALIZATION
• Data visualization may be described as graphically representing data. It is the act of translating data into a visual context, which can be done using charts, plots, animations, infographics, etc.
• Data visualization in data science refers to the process of generating graphical representations of information. These graphical depictions are often known as plots or charts.
• Data visualization is pivotal for effectively communicating insights.
• The purpose of data visualization is to help drive informed decision-making.
Examples of Data Visualization in Data Science
✓ Weather reports: Maps and other plot types are commonly used in weather reports.
✓ Internet websites: Social media analytics websites such as Social Blade and Google
Analytics use data visualization techniques to analyze and compare the performance of
websites.
✓ Astronomy: NASA uses advanced data visualization techniques in its reports and
presentations.
✓ Geography
✓ Gaming industry
✓ Python offers several libraries with features that allow users to create highly customized and interactive plots. Popular ones are:
✓ Matplotlib
✓ Seaborn
✓ Bokeh
✓ Plotly
Importing Matplotlib
• Matplotlib is a cross-platform data visualization and graphical plotting library for Python, built on its numerical extension NumPy.
• It is a comprehensive library for creating static, animated, and interactive visualizations, including 2D array plots.
• Matplotlib contains several submodules, such as pyplot.
• Matplotlib includes a wide range of plots, such as scatter, line, bar, and histogram plots, that can help us delve deeper into trends, behavioral patterns, and correlations.

import matplotlib.pyplot as plt
plt.plot([1,2,3],[5,7,4])
plt.show()
→ We import matplotlib.pyplot and use the alias plt, which is the alias used by convention for this submodule.
→ plt.plot will "draw" this plot in the background.
➔ plt.show(): with that, the graph pops up.


Categories of Data Visualization

Line Charts
→A Line chart is a graph that represents information as a series of data points connected by a
straight line. In line charts, each data point or marker is plotted and connected with a line or curve.
Using Matplotlib
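A minimal sketch of a line chart drawn with Matplotlib might look like this (the data values are hypothetical):
import matplotlib.pyplot as plt

days = [1, 2, 3, 4, 5, 6, 7]                  # hypothetical x values
temperature = [22, 24, 23, 26, 27, 25, 24]    # hypothetical y values

plt.plot(days, temperature, marker='o')       # each data point is plotted and connected by a line
plt.xlabel('Day')
plt.ylabel('Temperature')
plt.title('Line chart with Matplotlib')
plt.show()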


Using Seaborn
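A minimal sketch of the same line chart drawn with Seaborn might look like this (the data values are hypothetical):
import seaborn as sns
import matplotlib.pyplot as plt

days = [1, 2, 3, 4, 5, 6, 7]
temperature = [22, 24, 23, 26, 27, 25, 24]

sns.lineplot(x=days, y=temperature)           # Seaborn draws on top of Matplotlib
plt.xlabel('Day')
plt.ylabel('Temperature')
plt.title('Line chart with Seaborn')
plt.show()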

Setting the Axis, Ticks, Grids


• The axes define the x and y plane of the graphic. The x axis runs horizontally, and the y axis runs
vertically.
• An Axes object is added to a plot layer. An Axes can be thought of as the set of x and y axes that lines and bars are drawn on. It contains child attributes such as axis labels, tick labels, and line thickness.
• The following code shows how to obtain access to the axes for a plot:
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0, 5, 20)
y = x ** 2
fig = plt.figure()
axes = fig.add_axes([0.1, 0.1, 0.8, 0.8])  # left, bottom, width, height (range 0 to 1)
axes.plot(x, y, 'r')
axes.set_xlabel('x')
axes.set_ylabel('y')
axes.set_title('title')
Output:


• A grid can be added to a Matplotlib plot using the plt.grid() command. By default, the grid is turned off. To turn on the grid use:
plt.grid(True)
• The only valid options are plt.grid(True) and plt.grid(False).
Defining the Line Appearance and Working with Line Style
• Line styles help differentiate graphs by drawing the lines in various ways.
• Matplotlib has an additional parameter to control the colour and style of the plot:
plt.plot(xa, ya, 'g')
• This makes the line green. Any of the colours red, green, blue, cyan, magenta, yellow, white or black can be selected just by using the first character of the colour name in lower case (use "k" for black, as "b" means blue).
• To alter the line style, two dashes -- make a dashed line. This can be added to the colour selector, like this:
plt.plot(xa, ya, 'r--')
• Use "-" for a solid line (the default), "-." for dash-dot lines, or ":" for a dotted line. Here is an example:
from matplotlib import pyplot as plt
import numpy as np
xa = np.linspace(0, 5, 20)
ya = xa**2
plt.plot(xa, ya, 'g')
ya = 3*xa
plt.plot(xa, ya, 'r--')
plt.show()
Output:

• Matplotlib single-character colour codes are: 'b' (blue), 'g' (green), 'r' (red), 'c' (cyan),
'm' (magenta), 'y' (yellow), 'k' (black) and 'w' (white).
Adding Markers
• Markers add a special symbol to each data point in a line graph. Unlike line style and color,
markers tend to be a little less susceptible to accessibility and printing issues.
• Basically, Matplotlib tries to use marker identifiers that look similar to the marker itself:
1. Triangle-shaped: v, <, >, ^
2. Cross-like: *, +, 1, 2, 3, 4
3. Circle-like: o, ., h, p, H, 8
• Having differently shaped markers is a great way to distinguish between different groups of data
points. If your control group is all circles and your experimental group is all X's the difference pops
out, even to colorblind viewers.
N = x.size // 3
ax.scatter(x[:N], y[:N], marker="o")
ax.scatter(x[N:2*N], y[N:2*N], marker="x")
ax.scatter(x[2*N:], y[2*N:], marker="s")
• There's no way to specify multiple marker styles in a single scatter() call, but we can separate our
data out into groups and plot each marker style separately. Here we chopped our data up into three
equal groups.
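A self-contained version of the fragment above, assuming synthetic random data split into three groups (the names x, y and N mirror the fragment; the data are illustrative):
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.random(30)
y = rng.random(30)

fig, ax = plt.subplots()
N = x.size // 3                               # split the points into three equal groups
ax.scatter(x[:N], y[:N], marker="o")          # circles
ax.scatter(x[N:2*N], y[N:2*N], marker="x")    # crosses
ax.scatter(x[2*N:], y[2*N:], marker="s")      # squares
plt.show()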

Using Labels, Annotations and Legends


• To fully document your graph, you usually have to resort to labels, annotations, and legends. Each
of these elements has a different purpose, as follows:
1. Label: Make it easy for the viewer to know the name or kind of data illustrated
2. Annotation: Help extend the viewer's knowledge of the data, rather than simply identify it.
3. Legend: Provides cues to make identification of the data group easier.
• The following example shows how to add labels to your graph:
values = [1, 5, 8, 9, 2, 0, 3, 10, 4, 7]
import matplotlib.pyplot as plt
plt.xlabel('Entries')
plt.ylabel('Values')
plt.plot(range(1,11), values)
plt.show()
• Following example shows how to add annotation to a graph:
import matplotlib.pyplot as plt
w = 4
h = 3
d = 70
plt.figure(figsize=(w, h), dpi=d)
plt.axis([0, 5, 0, 5])
x = [0, 3, 5]
y = [1, 4, 3.5]
label_x = 1
label_y = 4
arrow_x = 3
arrow_y = 4
arrow_properties = dict(
    facecolor="black", width=0.5,
    headwidth=4, shrink=0.1)
plt.annotate("maximum", xy=(arrow_x, arrow_y),
             xytext=(label_x, label_y),
             arrowprops=arrow_properties)
plt.plot(x, y)
plt.savefig("out.png")
Output:

Creating a legend
• A legend documents the individual elements of a plot. Each labelled line or series appears in a small
table with its label, so that viewers can tell the series apart.
• In Matplotlib a legend is created by giving each plotted series a label (the label keyword argument)
and then calling the legend() method; several options are available for customizing its appearance and
position.
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(-10, 9, 20)
y = x ** 3
z = x ** 2
figure = plt.figure()
axes = figure.add_axes([0,0,1,1])
axes.plot(x, z, label="Square Function")
axes.plot(x, y, label="Cube Function")
axes.legend()
• In the script above we define two functions: square and cube using x, y and z variables. Next, we
first plot the square function and for the label parameter, we pass the value Square Function.
• This will be the value displayed in the label for square function. Next, we plot the cube function
and pass Cube Function as value for the label parameter.
• The output looks likes this:

Dr. BEN SUJITHA ,CSE/PROF,NICHE


7
CSA23A FOUNDATUIONS OF DATA SCEINCE UNIT 5

Bar Graphs
When you have categorical data, you can represent it with a bar graph. A bar graph plots data with
the help of bars, which represent value on the y-axis and category on the x-axis. Bar graphs use bars
with varying heights to show the data which belongs to a specific category.
Bar Chart
A bar plot or bar chart is a graph that represents categories of data with rectangular bars whose
lengths (or heights) are proportional to the values they represent.
import pandas as pd
import matplotlib.pyplot as plt
# reading the csv data set
dataset = pd.read_csv("tips.csv")
# Plotting a bar chart of total_bill vs tip
plt.bar(dataset['total_bill'], dataset['tip'])
# Giving our plot a title
plt.title("Bar Chart")
# Giving the x and y labels names
plt.xlabel('Total Bill')
plt.ylabel('Tip')

plt.show()
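The example above depends on an external tips.csv file; a self-contained sketch with illustrative category data looks like this:
import matplotlib.pyplot as plt

languages = ['C', 'C++', 'Java', 'Python']
popularity = [10, 15, 20, 35]        # illustrative values

plt.bar(languages, popularity, color='steelblue')
plt.title('Bar Chart')
plt.xlabel('Language')
plt.ylabel('Popularity')
plt.show()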


Seaborn
→Seaborn is a Python library for creating statistical representations based on datasets. It is built on
top of matplotlib and is used to create various visualizations. It's built on top of pandas' data
structures. The library conducts the necessary modeling and aggregation internally to create
insightful visuals.
# importing the required packages
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# reading the csv data set using pandas
dataset = pd.read_csv("tips.csv")
sns.lineplot(x='total_bill', y='tip', data=dataset)
plt.show()


import numpy as np
import matplotlib.pyplot as plt
fig = plt.figure()
ax = plt.axes()
x = np.linspace(0, 10, 1000)
ax.plot(x, np.sin(x))

Output:

Matplotlib provides the fill_between() function, which fills the area between two curves (or around a
line) based on user-defined logic.

import matplotlib.pyplot as plt


import numpy as np
x = np.arange(0.0, 2, 0.01)
y1 = np.sin(2 * np.pi * x)
y2 = 1.2 * np.sin(4 * np.pi * x)
fig, ax = plt.subplots(1, sharex=True)
ax.plot(x, y1, x, y2, color='black')
ax.fill_between(x, y1, y2, where=y2 >= y1, facecolor='blue', interpolate=True)
ax.fill_between(x, y1, y2, where=y2 <= y1, facecolor='red', interpolate=True)
ax.set_title('fill between where')


Output:

Pie Chart
→A pie chart is a circular graph that is broken down into segments or slices of the pie.
→It is generally used to represent percentage or proportional data, where each slice of the pie
represents a particular category. Let's have a look at the example below:
from matplotlib import pyplot as plt
# Pie chart, where the slices will be ordered and plotted counter-clockwise:
Players = 'Rohit', 'Virat', 'Shikhar', 'Yuvraj'
Runs = [45, 30, 15, 10]
explode = (0.1, 0, 0, 0) # it "explode" the 1st slice
fig1, ax1 = plt.subplots()
ax1.pie(Runs, explode=explode, labels=Players, autopct='%1.1f%%',
shadow=True, startangle=90)
ax1.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()
Output:

Scatter plot
→The scatter plots are mostly used for comparing variables when we need to define how much one
variable is affected by another variable.
→The data is displayed as a collection of points. Each point is placed using the values of two
variables: one variable defines the position on the horizontal axis, and the other defines the position on
the vertical axis.
Example
from matplotlib import pyplot as plt
from matplotlib import style
style.use('ggplot')
x = [5,7,10]
y = [18,10,6]
x2 = [6,9,11]
y2 = [7,14,17]
plt.scatter(x, y)
plt.scatter(x2, y2, color='g')
plt.title('Epic Info')
plt.ylabel('Y axis')

plt.xlabel('X axis')
plt.show()
Output:

import matplotlib.pyplot as plt


x = [2, 2.5, 3, 3.5, 4.5, 4.7, 5.0]
y = [7.5, 8, 8.5, 9, 9.5, 10, 10.5]
x1 = [9, 8.5, 9, 9.5, 10, 10.5, 12]
y1 = [3, 3.5, 4.7, 4, 4.5, 5, 5.2]
plt.scatter(x, y, label='high income low saving', color='g')
plt.scatter(x1, y1, label='low income high savings', color='r')
plt.xlabel('saving*100')
plt.ylabel('income*1000')
plt.title('Scatter Plot')
plt.legend()
plt.show()
Output:

Visualizing Errors
• Error bars can be added to Matplotlib line plots and graphs. Error is the difference between the
calculated value and the actual value.
• Without error bars, a plot gives the impression that a measured or computed value is known to a
high level of precision. The method matplotlib.pyplot.errorbar() draws y versus x as
lines and/or markers with attached error bars.
• Adding an error bar in Matplotlib is simple: we only have to supply the value of the error. We use the
command:
plt.errorbar(x, y, yerr=2, capsize=3)
where:
x = the data for the X axis.
y = the data for the Y axis.
yerr = the error value on the Y axis; each point can have its own error value.
xerr = the error value on the X axis.
capsize = the size of the lower and upper caps of the error bar.
• A simple example that plots only one point, with an error of 10 % on the Y axis:

import matplotlib.pyplot as plt
x = 1
y = 20
y_error = 20 * 0.10  # 10% error
plt.errorbar(x, y, yerr=y_error, capsize=3)
plt.show()
Output:

• We plot using the command plt.errorbar(...), giving it the desired characteristics.
import matplotlib.pyplot as plt
import numpy as np
x = np.arange(1, 8)
y = np.array([20, 10, 45, 32, 38, 21, 27])
y_error = y * 0.10  # 10% error
plt.errorbar(x, y, yerr=y_error,
             linestyle="None", fmt="ob", capsize=3, ecolor="k")
plt.show()
• Parameters of errorbar():
a) yerr is the error value for each point.
b) linestyle; here it indicates that we will not draw a connecting line.
c) fmt is the type of marker, in this case a point ("o") in blue ("b").
d) capsize is the size of the lower and upper caps of the error bar.
e) ecolor is the colour of the error bar. The default colour is the marker colour.

Output:

• Multiple error bar lines with Matplotlib errorbar in Python: it is often important to draw several series
in the same plot. Using the scheme below we draw many error bar series in one graph.
import numpy as np
import matplotlib.pyplot as plt
fig = plt.figure()
x = np.arange(20)
y = 4 * np.sin(x / 20 * np.pi)
yerr = np.linspace(0.06, 0.3, 20)
plt.errorbar(x, y + 8, yerr=yerr, label='no limits')
plt.errorbar(x, y + 6, yerr=yerr,
             uplims=True, label='uplims=True')
plt.errorbar(x, y + 4, yerr=yerr,
             uplims=True,
             lolims=True, label='uplims=True, lolims=True')
upperlimits = [True, False] * 10
lowerlimits = [False, True] * 10
plt.errorbar(x, y, yerr=yerr,
             uplims=upperlimits,
             lolims=lowerlimits, label='subsets of limits')
plt.legend(loc='upper left')
plt.title('Example')
plt.show()

Output:

Density and Contour Plots


→It is useful to display three-dimensional data in two dimensions using contours or color-coded
regions.
→A contour plot is a graphical method to visualize the 3-D surface by plotting constant Z slices
called contours in a 2-D format.
The contour plot is formed by:
Vertical axis: Independent variable 2
Horizontal axis: Independent variable 1
Lines: iso-response values, calculated as a function of (x, y).
The independent variables are usually restricted to a regular grid.
The contour plot depicts how the Z values change with respect to the X and Y values.
Types of Contour Plot:
Rectangular Contour plot: A projection of 2D-plot in 2D-rectangular canvas. It is the most
common form of the contour plot.
Polar contour plot: a polar contour plot is drawn using the polar coordinates r and theta. The
response variable is the set of values generated by passing r and theta into the given function, where r
is the distance from the origin and theta is the angle from the positive x-axis (a small sketch follows this list).
Ternary contour plot: Ternary contour plot is used to represent the relationship between 3
explanatory variables and the response variable in the form of a filled triangle.
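As an example of the polar form described above, a minimal sketch assuming a simple response function of r and theta (the function and grid are illustrative, not from the original notes):
import numpy as np
import matplotlib.pyplot as plt

# polar grid: radius r and angle theta
r = np.linspace(0, 1, 50)
theta = np.linspace(0, 2 * np.pi, 100)
R, T = np.meshgrid(r, theta)
Z = R * np.sin(3 * T)        # assumed response function

fig, ax = plt.subplots(subplot_kw={'projection': 'polar'})
ax.contourf(T, R, Z, cmap='RdGy')   # filled contours on polar axes
plt.show()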
There are three Matplotlib functions:
plt.contour for contour plots,
plt.contourf for filled contour plots,
plt.imshow for showing images.
import matplotlib.pyplot as plt
plt.style.use('seaborn-white')   # on newer Matplotlib versions this style is named 'seaborn-v0_8-white'
import numpy as np
# X, Y and Z must be defined before contouring; here we build them from the same grid used in the next example
def func(x, y):
    return np.sin(x) ** 2 + np.cos(y) ** 2
x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 50)
X, Y = np.meshgrid(x, y)
Z = func(X, Y)
contours = plt.contour(X, Y, Z, 3, colors='black')
plt.clabel(contours, inline=True, fontsize=8)
plt.imshow(Z, extent=[0, 5, 0, 5], origin='lower', cmap='RdGy', alpha=0.5)
plt.colorbar();

import numpy as np
import matplotlib.pyplot as plt
# define a function
def func(x, y):
return np.sin(x) ** 2 + np.cos(y) **2
# generate 50 values between 0 and 5
x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 50)
# Generate combination of grids
X, Y = np.meshgrid(x, y)
Z = func(X, Y)
# Draw rectangular contour plot
plt.contour(X, Y, Z, cmap='gist_rainbow_r');

The ax.contour3D() function creates a three-dimensional contour plot.


It requires all the input data to be in the form of two-dimensional regular grids, with the Z-data
evaluated at each point.
from mpl_toolkits import mplot3d
import numpy as np
import matplotlib.pyplot as plt
def f(x, y):
return np.sin(np.sqrt(x ** 2 + y ** 2))
x = np.linspace(-6, 6, 30)
y = np.linspace(-6, 6, 30)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)
fig = plt.figure()
ax = plt.axes(projection='3d')
ax.contour3D(X, Y, Z, 50, cmap='binary')
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('z')
ax.set_title('3D contour')
plt.show()


import numpy as np
import matplotlib.pyplot as plt
def f(x, y):
return np.sin(x) ** 10 + np.cos(10 + y * x) * np.cos(x)
x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 40)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)
plt.imshow(Z, extent=[0, 10, 0, 10], origin='lower', cmap='RdGy')
plt.colorbar()

Histogram
→ A histogram shows the distribution of data, whereas a bar chart is used to compare different entities.
A histogram is a type of bar plot that shows the frequency of values falling into a set of value ranges.
For example, we can take the ages of a group of people and plot a histogram with respect to bins. A bin
represents a range of values; the data are divided into a series of such intervals, and bins are generally
created with the same size.
from matplotlib import pyplot as plt
population_age = [21,53,60,49,25,27,30,42,40,1,2,102,95,8,15,105,70,65,55,70,75,60,52,44,43,42,45]
bins = [0,10,20,30,40,50,60,70,80,90,100]
plt.hist(population_age, bins, histtype='bar', rwidth=0.8)
plt.xlabel('age groups')
plt.ylabel('Number of people')
plt.title('Histogram')
plt.show()

Output:

from matplotlib import pyplot as plt
# Importing the NumPy library
import numpy as np
plt.style.use('fivethirtyeight')
mu = 50
sigma = 7
x = np.random.normal(mu, sigma, size=200)
fig, ax = plt.subplots()
ax.hist(x, 20)
ax.set_title('Histogram')
ax.set_xlabel('bin range')
ax.set_ylabel('frequency')
fig.tight_layout()
plt.show()
Output:

Legend
→Plot legends give meaning to a visualization, assigning labels to the various plot elements.
→Legends are found in maps, where they describe the pictorial language or symbology of the map. In
line graphs, legends explain the function or the values underlying the different lines of the
graph.
→ Matplotlib has native support for legends. Legends can be placed in various positions: A legend
can be placed inside or outside the chart and the position can be moved. The legend() method adds
the legend to the plot.
To place the legend inside, simply call legend():
import matplotlib.pyplot as plt
import numpy as np
y = [2,4,6,8,10,12,14,16,18,20]
y2 = [10,11,12,13,14,15,16,17,18,19]
x = np.arange(10)
fig = plt.figure()
ax = plt.subplot(111)
ax.plot(x, y, label='y = numbers')
ax.plot(x, y2, label='y2 = other numbers')
plt.title('Legend inside')
ax.legend()
plt.show()
Output:


# The Polynomial class below comes from a user-defined helper module named
# "polynomials" (assumed to be available alongside these notes).
from polynomials import Polynomial
import numpy as np
import matplotlib.pyplot as plt
p = Polynomial(-0.8, 2.3, 0.5, 1, 0.2)
p_der = p.derivative()
fig, ax = plt.subplots()
X = np.linspace(-2, 3, 50, endpoint=True)
F = p(X)
F_derivative = p_der(X)
ax.plot(X, F, label="p")
ax.plot(X, F_derivative, label="derivative of p")
ax.legend(loc='upper left')
Output:

Matplotlib legend on bottom


import matplotlib.pyplot as plt
import numpy as np
y1 = [2,4,6,8,10,12,14,16,18,20]
y2 = [10,11,12,13,14,15,16,17,18,19]
x = np.arange(10)
fig = plt.figure()
ax = plt.subplot(111)
ax.plot(x, y1, label='y1 = numbers')
ax.plot(x, y2, label='y2 = other numbers')
plt.title('Legend at the bottom')
ax.legend(loc='upper center', bbox_to_anchor=(0.5, -0.05),
          shadow=True, ncol=2)
plt.show()
Output:


Subplots
• Subplots mean groups of axes that can exist in a single matplotlib figure. subplots() function in the
matplotlib library, helps in creating multiple layouts of subplots. It provides control over all the
individual plots that are created.
• subplots() without arguments returns a Figure and a single Axes. This is actually the simplest and
recommended way of creating a single Figure and Axes.
fig, ax = plt.subplots()
ax.plot(x,y)
ax.set_title('A single plot')
Output:

• There are (at least) three different ways to create plots (called axes) in matplotlib: plt.axes(),
figure.add_axes() and plt.subplots().
• plt.axes(): The most basic method of creating an axes is the plt.axes function. It takes an
optional argument in the figure coordinate system, [left, bottom, width, height], where the figure
coordinate system ranges from 0 at the bottom left of the figure to 1 at the top right of the figure.
• Plot just one figure with (x,y) coordinates: plt.plot(x, y).
• By calling subplot(n, m, k), we subdivide the figure into n rows and m columns and specify that
plotting should be done on subplot number k. Subplots are numbered row by row, from left to
right.
import matplotlib.pyplot as plt
import numpy as np
from math import pi
plt.figure(figsize=(8,4))  # set dimensions of the figure
x = np.linspace(0, 2*pi, 100)
for i in range(1, 7):
    plt.subplot(2, 3, i)   # create subplots on a grid with 2 rows and 3 columns
    plt.xticks([])         # set no ticks on x-axis
    plt.yticks([])         # set no ticks on y-axis
    plt.plot(np.sin(x), np.cos(i*x))
    plt.title('subplot(2,3,' + str(i) + ')')
plt.show()
Output:


Text and Annotation


• When drawing large and complex plots in Matplotlib, we need a way of labelling certain portions
or points of interest on the graph. To do so, Matplotlib provides the "Annotation" feature,
which allows us to plot arrows and text labels on the graphs to give them more meaning.
• There are four important parameters that you must always use with annotate().
a) text: This defines the text label. Takes a string as a value.
b) xy: The place where you want your arrowhead to point to. In other words, the place you want to
annotate. This is a tuple containing two values, x and y.
c) xytext: The coordinates where you want the text to display.
d) arrowprops: A dictionary of key-value pairs which define various properties for the arrow, such
as color, size and arrowhead type.
Example :
import matplotlib.pyplot as plt
import numpy as np
fig, ax = plt.subplots()
x = np.arange(0.0, 5.0, 0.01)
y = np.sin(2 * np.pi * x)
# Annotation
ax.annotate('Local Max',
            xy=(3.3, 1),
            xytext=(3, 1.8),
            arrowprops=dict(facecolor='green',
                            shrink=0.05))
ax.set_ylim(-2, 2)
plt.plot(x, y)
plt.show()
Output:

Example :
import plotly.graph_objects as go
fig = go.Figure()
fig.add_trace(go.Scatter(
    x=[0, 1, 2, 3, 4, 5, 6, 7, 8],
    y=[0, 1, 3, 2, 4, 3, 4, 6, 5]
))
fig.add_trace(go.Scatter(
    x=[0, 1, 2, 3, 4, 5, 6, 7, 8],
    y=[0, 4, 5, 1, 2, 2, 3, 4, 2]
))
fig.add_annotation(x=2, y=5,
                   text="Text annotation with arrow",
                   showarrow=True,
                   arrowhead=1)
fig.add_annotation(x=4, y=4,
                   text="Text annotation without arrow",
                   showarrow=False,
                   yshift=10)
fig.update_layout(showlegend=False)
fig.show()
Output:

Customization
• A tick is a short line on an axis. For category axes, ticks separate each category. For value axes,
ticks mark the major divisions and show the exact point on an axis that the axis label defines. Ticks
are always the same color and line style as the axis.
• Ticks are the markers denoting data points on axes. Matplotlib's default tick locators and
formatters are designed to be generally sufficient in many common situations. Position and labels of
ticks can be explicitly mentioned to suit specific requirements.
• Fig. 5.9.1 shows ticks.

• Ticks come in two types: major and minor.


a) Major ticks separate the axis into major units. On category axes, major ticks are the only ticks
available. On value axes, one major tick appears for every major axis division.
b) Minor ticks subdivide the major tick units. They can only appear on value axes. One minor tick
appears for every minor axis division.
• By default, major ticks appear for value axes. xticks is a method, which can be used to get or to set
the current tick locations and the labels.
• The following program creates a plot with both major and minor tick marks, customized to be
thicker and wider than the default, with the major tick marks point into and out of the plot area.
import numpy as np
import matplotlib.pyplot as plt

# A selection of functions on rn abscissa points for 0 <= x < 1
rn = 100
rx = np.linspace(0, 1, rn, endpoint=False)

def tophat(rx):
    """Top hat function: y = 1 for x < 0.5, y = 0 for x >= 0.5"""
    ry = np.ones(rn)
    ry[rx >= 0.5] = 0
    return ry

# A dictionary of functions to choose from
ry = {'half-sawtooth': lambda rx: rx.copy(),
      'top-hat': tophat,
      'sawtooth': lambda rx: 2 * np.abs(rx - 0.5)}

# Repeat the chosen function nrep times
nrep = 4
x = np.linspace(0, nrep, nrep * rn, endpoint=False)
y = np.tile(ry['top-hat'](rx), nrep)
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(x, y, 'k', lw=2)
# Add a bit of padding around the plotted line to aid visualization
ax.set_ylim(-0.1, 1.1)
ax.set_xlim(x[0] - 0.5, x[-1] + 0.5)
# Customize the tick marks and turn the grid on
ax.minorticks_on()
ax.tick_params(which='major', length=10, width=2, direction='inout')
ax.tick_params(which='minor', length=5, width=2, direction='in')
ax.grid(which='both')
plt.show()
Output:
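Beyond tick_params, the tick positions themselves can be controlled with locator objects from matplotlib.ticker (a small sketch, not from the original notes):
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator

x = np.linspace(0, 10, 100)
fig, ax = plt.subplots()
ax.plot(x, np.sin(x))

ax.xaxis.set_major_locator(MultipleLocator(2))     # a major tick every 2 units
ax.xaxis.set_minor_locator(MultipleLocator(0.5))   # a minor tick every 0.5 units
plt.show()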

Three Dimensional Plotting


• Matplotlib is the most popular choice for data visualization. While initially developed for plotting
2-D charts like histograms, bar charts, scatter plots, line plots, etc., Matplotlib has extended its
capabilities to offer 3D plotting modules as well.
• First import the library :
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
• The first one is a standard import statement for plotting using matplotlib, which you would see for
2D plotting as well. The second import of the Axes3D class is required for enabling 3D projections.
It is, otherwise, not used anywhere else.
• Create figure and axes
fig = plt.figure(figsize=(4,4))
ax = fig.add_subplot(111, projection='3d')
Output:

Example :
fig=plt.figure(figsize=(8,8))
ax=plt.axes(projection='3d')
ax.grid()
t=np.arange(0,10*np.pi,np.pi/50)
x=np.sin(t)
y=np.cos(t)
ax.plot3D(x,y,t)
ax.set_title('3D Parametric Plot')
# Set axes label
ax.set_xlabel('x',labelpad=20)
ax.set_ylabel('y', labelpad=20)
ax.set_zlabel('t', labelpad=20)
plt.show()
Output:

3D graph plot
→Three-dimensional plots can be created by importing the mplot3d toolkit, included with the main
Matplotlib installation:
from mpl_toolkits import mplot3d
When this module is imported in the program, three-dimensional axes can be created by passing the
keyword projection='3d' to any of the normal axes creation routines:
Example
from mpl_toolkits import mplot3d
import numpy as np
import matplotlib.pyplot as plt
fig = plt.figure()
ax = plt.axes(projection='3d')
Output:


Example-2:
from mpl_toolkits import mplot3d
import numpy as np
import matplotlib.pyplot as plt
height = np.array([100,110,87,85,65,80,96,75,42,59,54,63,95,71,86])
weight = np.array([105,123,84,85,78,95,69,42,87,91,63,83,75,41,80])
fig = plt.figure()
ax = plt.axes(projection='3d')
# This is used to plot a 3D scatter
ax.scatter3D(height, weight)
plt.title("3D Scatter Plot")
plt.xlabel("Height")
plt.ylabel("Weight")
plt.show()
Output:

import matplotlib as mpl


from mpl_toolkits.mplot3d import Axes3D
import numpy as np
import matplotlib.pyplot as plt
mpl.rcParams['legend.fontsize'] = 10
fig = plt.figure()
ax = fig.add_subplot(projection='3d')  # fig.gca(projection='3d') was removed in newer Matplotlib versions
theta1 = np.linspace(-4 * np.pi, 4 * np.pi, 100)
z = np.linspace(-2, 2, 100)
r = z**2 + 1
x = r * np.sin(theta1)
y = r * np.cos(theta1)
ax.plot3D(x, y, z, label='parametric curve', color = 'red')
ax.legend()

plt.show()
Output:

Important functions of Matplotlib


Functions and their descriptions:

plot(x-axis values, y-axis values) – Plots a simple line graph of the y-axis values against the x-axis values.
show() – Displays the graph.
title("string") – Sets the title of the plotted graph as specified by the string.
xlabel("string") – Sets the label for the x-axis as specified by the string.
ylabel("string") – Sets the label for the y-axis as specified by the string.
figure() – Controls figure-level attributes.
subplot(nrows, ncols, index) – Adds a subplot to the current figure.
suptitle("string") – Adds a common title to the plotted graph as specified by the string.
subplots(nrows, ncols, figsize) – A simple way to create subplots in a single call; returns a tuple of a figure and a number of axes.
set_title("string") – An axes-level method used to set the title of a subplot.
bar(categorical variables, values, color) – Creates a vertical bar graph.
barh(categorical variables, values, color) – Creates a horizontal bar graph.
legend(loc) – Makes a legend for the graph.
xticks(index, categorical variables) – Gets or sets the current tick locations and labels of the x-axis.
pie(values, categorical variables) – Creates a pie chart.
hist(values, number of bins) – Creates a histogram.
xlim(start value, end value) – Sets the limits of the x-axis values.
ylim(start value, end value) – Sets the limits of the y-axis values.
scatter(x-axis values, y-axis values) – Plots a scatter plot of the y-axis values against the x-axis values.
axes() – Adds axes to the current figure.
set_xlabel("string") – An axes-level method used to set the x-label of the plot, specified as a string.
set_ylabel("string") – An axes-level method used to set the y-label of the plot, specified as a string.
scatter3D(x-axis values, y-axis values) – Plots a three-dimensional scatter plot of the y-axis values against the x-axis values.
plot3D(x-axis values, y-axis values) – Plots a three-dimensional line graph of the y-axis values against the x-axis values.

Heat Maps:
→Heat maps are a type of graphical representation that displays data in a matrix format. The value
of the data point that each matrix cell represents determines its hue. Heatmaps are often used to
visualize the correlation between variables or to identify patterns in time-series data.
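A minimal heat map sketch using Seaborn's heatmap on a random matrix (the data are purely illustrative):
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

data = np.random.rand(8, 8)            # an 8x8 matrix of random values
sns.heatmap(data, cmap='viridis')       # each cell is coloured according to its value
plt.title('Heat Map')
plt.show()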

Tree Maps: Tree maps are used to display hierarchical data in a compact format and are useful in
showing the relationship between different levels of a hierarchy.
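One way to sketch a tree map is with Plotly's Treemap trace (Plotly is already used earlier in these notes for annotations); the labels, parents and values below are illustrative:
import plotly.graph_objects as go

fig = go.Figure(go.Treemap(
    labels=["Electronics", "Phones", "Laptops", "Clothing", "Shirts", "Shoes"],
    parents=["", "Electronics", "Electronics", "", "Clothing", "Clothing"],
    values=[0, 40, 25, 0, 20, 15]       # parent values of 0 let children fill the parent box
))
fig.show()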

Box Plots: Box plots are a graphical representation of the distribution of a set of data. In a box plot,
the median is shown by a line inside the box, while the box itself depicts the interquartile range of the
data. The whiskers extend from the box to the highest and lowest values in the data, excluding outliers.
Box plots help us identify the spread and skewness of the data.
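A minimal box plot sketch with Matplotlib on random data (illustrative only):
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
data = [rng.normal(0, std, 100) for std in (1, 2, 3)]   # three groups with increasing spread

plt.boxplot(data)
plt.xticks([1, 2, 3], ['std=1', 'std=2', 'std=3'])
plt.title('Box Plot')
plt.ylabel('Value')
plt.show()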

Geographic Data with Basemap


• Basemap is a toolkit under the Python visualization library Matplotlib. Its main function is to draw
2D maps, which are important for visualizing spatial data. Basemap itself does not do any plotting,
but provides the ability to transform coordinates into one of 25 different map projections.
• Matplotlib can also be used to plot contours, images, vectors, lines or points in transformed
coordinates. Basemap includes the GSHHS coastline dataset, as well as datasets from GMT for rivers,
states and national boundaries.
• These datasets can be used to plot coastlines, rivers and political boundaries on a map at several
different resolutions. Basemap uses the Geometry Engine - Open Source (GEOS) library under the
hood to clip coastline and boundary features to the desired map projection area. In addition,
basemap provides the ability to read shapefiles.
• Basemap cannot be installed using pip install basemap. If Anaconda is installed, you can install
basemap using conda install basemap.
• Example objects in basemap:
a) contour(): Draw contour lines.
b) contourf(): Draw filled contours.
c) imshow(): Draw an image.
d) pcolor(): Draw a pseudocolor plot.
e) pcolormesh(): Draw a pseudocolor plot (faster version for regular meshes).
f) plot(): Draw lines and/or markers.
g) scatter(): Draw points with markers.
h) quiver(): Draw vectors (a vector/arrow field).
i) barbs(): Draw wind barbs.
j) drawgreatcircle(): Draw a great circle route.
Basemap basic usage:
import warnings
warnings.filterwarnings('ignore')
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
map = Basemap()
map.drawcoastlines()
# plt.show()
plt.savefig('test.png')
Output:


Visualization with Seaborn


• Seaborn is a Python data visualization library based on Matplotlib. It provides a high-level
interface for drawing attractive and informative statistical graphics. Seaborn is an open- source
Python library.
• Seaborn helps you explore and understand your data. Its plotting functions operate on dataframes
and arrays containing whole datasets and internally perform the necessary semantic mapping and
statistical aggregation to produce informative plots.
• It has a dataset-oriented, declarative API: users focus on what the different elements of their plots
mean, rather than on the details of how to draw them.
• Keys features:
a) Seaborn is a statistical plotting library
b) It has beautiful default styles
c) It also is designed to work very well with Pandas data frame objects.
Seaborn works easily with data frames and the Pandas library. The graphs created can also be
customized easily.
• Functionality that seaborn offers:
a) A dataset-oriented API for examining relationships between multiple variables
b) Convenient views onto the overall structure of complex datasets
c) Specialized support for using categorical variables to show observations or aggregate statistics
d) Options for visualizing univariate or bivariate distributions and for comparing them between
subsets of data
e) Automatic estimation and plotting of linear regression models for different kinds of dependent
variables
f) High-level abstractions for structuring multi-plot grids that let you easily build complex
visualizations
g) Concise control over matplotlib figure styling with several built-in themes
h) Tools for choosing color palettes that faithfully reveal patterns in your data.
Plot a Scatter Plot in Seaborn :
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
df = pd.read_csv('worldHappiness2016.csv')
# the column names below are assumed from the 2016 World Happiness dataset
sns.scatterplot(data=df, x="Economy (GDP per Capita)", y="Happiness Score")
plt.show()
Output:


Difference between Matplotlib and Seaborn
