Data Science Chacha

The document provides an overview of data science, its methodologies, applications across various industries, and the differences between data science and business intelligence. It discusses key applications in sectors such as business analytics, healthcare, finance, and environmental science, as well as the rise of Machine Learning as a Service (MLaaS) and the tools available for machine learning. It also covers feature engineering as a crucial step in machine learning, detailing processes such as feature creation, transformation, and extraction.

Shree Mahaveerai Namah

In today’s data-driven world, where information (read: data) is generated at an unprecedented rate, data science has emerged as a vital tool for extracting valuable insights from vast amounts of data. So, what is data science anyway?

I will take you on an exciting adventure through the realm of data science, demystifying its key concepts, methodologies, and applications. Whether you’re a seasoned data professional, an aspiring data scientist, or simply someone intrigued by the power of data, this syllabus will provide you with a comprehensive understanding of data science and its real-world implications, and show you how to get into the field.

What is data science?


Data science combines math and statistics, specialized programming,
advanced analytics, artificial intelligence (AI) and machine learning with specific subject
matter expertise to uncover actionable insights hidden in an organization’s data. These
insights can be used to guide decision making and strategic planning.

What is Data Science Used For?

Data science is a multi-faceted discipline with applications across diverse industries and sectors, leveraging the power of data to drive innovation, inform decision-making, and solve complex problems. Let’s look at some key applications of data science and how industries are leveraging it.

❖ Business Analytics: Data science is crucial in understanding consumer behavior,


optimizing operations, and driving business growth. Businesses can make informed
decisions about product development, marketing strategies, pricing, and resource
allocation by analyzing customer data, market trends, and sales patterns.

❖ Healthcare and Biomedicine: Data science transforms the healthcare industry by


enabling personalized medicine, predictive analytics, and disease prevention. Analyzing
medical records, genomic data, and clinical trials helps identify risk factors, develop
treatment protocols, and improve patient outcomes.

❖ Finance and Banking: The financial sector heavily relies on data science for risk
assessment, fraud detection, and algorithmic trading. Data scientists can develop models
for credit scoring, portfolio management, and identifying potential risks by analyzing
market trends, economic indicators, and customer data.

❖ Social Media and Marketing: Data science is pivotal in social media analytics, helping
businesses understand user behavior, sentiment analysis, and targeted advertising. By
leveraging social media data, companies can enhance their marketing strategies, engage
with customers effectively, and drive brand awareness.

❖ Transportation and Logistics: Data science is utilized to optimize transportation


networks, improve route planning, and enhance supply chain management. Data
scientists can develop algorithms to minimize delivery times, reduce costs, and optimize
resource allocation by analyzing data from sensors, GPS devices, and historical records.

❖ Environmental Science: Data science aids in analyzing environmental data to address


climate change, natural resource management, and sustainable development. By
leveraging data from satellites, weather stations, and environmental sensors, scientists
can model and predict climate patterns, monitor ecosystem health, and develop strategies
for environmental conservation.

❖ Government and Public Policy: Governments increasingly use data science to make
data-driven policy decisions, improve public services, and enhance governance.
Analyzing socioeconomic data, census data, and public health records enables
policymakers to identify societal challenges, allocate resources effectively, and measure
the impact of policy interventions.

What’s the Difference Between Business Intelligence and Data Science?

❖ While business intelligence (BI) and data science share similarities in their data
utilization, key distinctions exist between the two disciplines. Understanding these
differences is crucial for organizations seeking to leverage data effectively. Here’s a brief
overview of how business intelligence and data science differ:

❖ Business intelligence focuses on gathering, analyzing, and visualizing data to provide


insights into past and current business performance. It primarily deals with structured
data from internal systems such as sales, finance, and customer relationship management.
BI tools and techniques enable organizations to generate reports, dashboards, and key
performance indicators (KPIs) for monitoring and reporting on operational metrics.

❖ On the other hand, data science encompasses a broader and more exploratory approach to
data analysis. It involves extracting insights and generating predictive models by
leveraging statistical techniques, machine learning algorithms, and domain expertise.
Data science incorporates structured and unstructured data from various sources,
including internal systems, external APIs, social media, and sensor data.

Key Characteristics of Business Intelligence

❖ Historical Analysis: BI predominantly focuses on historical data analysis to identify


trends, patterns, and performance metrics.
❖ Structured Data: BI relies on structured data from databases and data warehouses, often
sourced from internal systems.

❖ Reporting and Visualization: BI tools excel at generating reports, dashboards, and visual
representations of data for business users to understand and monitor key metrics.

Key Characteristics of Data Science

❖ Predictive and Prescriptive Analytics: Data science aims to uncover actionable insights,
make predictions, and drive informed decision-making using advanced analytical
techniques.

❖ Unstructured and Big Data: Data science deals with structured and unstructured data,
including text, images, and sensor data. It embraces the challenges and opportunities
presented by big data.

❖ Algorithm Development and Optimization: Data scientists develop and optimize


algorithms to solve complex problems, build predictive models, and extract insights from
data.

Machine Learning as a Service (MLaaS)

Machine Learning as a Service is becoming the next big thing as data becomes cheaper, data science becomes more practical and processing power keeps improving. The growing trend of shifting data storage to the cloud, maintaining it there and deriving the best insights from it has found an ally in MLaaS, which provides these solutions at a reduced cost. It essentially helps developers and organisations benefit from machine learning while saving cost and time, without extensive human intervention or additional programming.

Much like SaaS, IaaS and PaaS, MLaaS provides users with a range of tools as part of a cloud computing service, including facial recognition, natural language processing, data visualisation, image recognition and deep learning. It is supported by algorithms such as deep neural networks, convolutional neural networks, Bayesian networks, probabilistic graphical models, restricted Boltzmann machines and pattern recognition, among others.

Here is a rundown of a few machine learning tools that could benefit your organisation:

Amazon Machine Learning services

There is a high level of automation available with Amazon Machine Learning, which offers
visual aids and easy-to-access analytics to make machine learning accessible to developers
without having to learn complex machine learning algorithms and technology. It also offers
companies an easy, highly-scalable-on ramp for interpreting data. It helps businesses build
machine learning models without having to create the code themselves.

Amazon Machine Learning can help generate billions of predictions daily, and serve those
predictions in real-time and at high throughput, claims the company. According to Amazon,
Amazon ML is based on the same proven, highly scalable, machine learning technology used by
Amazon to perform critical functions like supply chain management, fraudulent transaction
identification, and catalog organization. The Amazon ML service uses a pay-as-you-go pricing model with no minimum fees. Data analysis and model building are charged at $0.42 per hour, with separate fees for batch predictions ($0.10 per 1,000 predictions, rounded up to the next 1,000) and real-time predictions ($0.0001 per prediction, rounded up to the nearest penny). Charges for data stored in Amazon S3, Amazon RDS, or Amazon Redshift are billed separately.

Azure Machine Learning Studio

This machine learning service suits both beginners and experienced data scientists. It offers a flexible range of out-of-the-box algorithms. Azure supports a range of operating systems, programming languages, frameworks, databases and devices, and it provides cross-device experiences with support for all major mobile platforms. With the help of an integrated development environment called Machine Learning Studio, developers can also build data models through drag-and-drop gestures and simple data flow diagrams. This not only saves a lot of time but also minimizes coding through ML Studio’s library of sample experiments. Azure ML Studio also offers a huge variety of algorithms, with around 100 methods available to developers. The most popular option in Microsoft Azure Machine Learning Studio is available for free and just requires a Microsoft account; this free access never expires and includes 10 GB of storage, R and Python support and predictive web services. However, the standard ML Studio workspace costs $9.90 and requires an Azure subscription.

Google Cloud Machine Learning Engine

Google Cloud Machine Learning Engine is highly flexible and offers users an easy way to build machine learning models for data of any size and type. The engine is based on the TensorFlow project and is integrated with other Google services such as Google Cloud Dataflow, Google Cloud Storage and Google BigQuery, among others, although the platform is mostly aimed at deep neural network tasks. You can sign up for a free trial to access Google Cloud Machine Learning. There are no initial charges, and once you sign up you get $300 to spend on Google Cloud Platform over the next 12 months. However, once your free trial ends, continued use is chargeable.

IBM’s Watson Machine Learning

Watson Machine Learning runs on IBM’s Bluemix and is capable of both training and scoring. With the training function, developers can use Watson to refine an algorithm so that it can learn from a dataset, while the scoring function helps predict an outcome using a trained model. Watson addresses the needs of both data scientists and developers. Watson’s notebook tool can help researchers learn more about machine learning algorithms. According to a report, Watson is intended to address questions of deployment, operationalization, and even deriving business value from machine learning models.

The visual modelling tools of IBM’s Watson Machine Learning help users quickly identify patterns, gain insights and make decisions faster. Its open source technologies let users keep using their own Jupyter notebooks with Python, R and Scala.

To use the service, you will need to create a Bluemix account for the free trial. After your 30-day free trial ends, you need to choose between the Lite, Standard and Professional plans. Lite is free for up to 5,000 predictions and 5 compute hours; Standard and Professional charge a flat rate per thousand predictions and per compute hour. Standard is available for $0.500 per 1,000 predictions and Professional for $0.400 per 1,000 predictions.

HPE Haven OnDemand

The Haven OnDemand machine learning service provides developers with services and APIs for building applications. There are more than 60 APIs available in Haven, covering features such as face detection, speech recognition, image classification, media analysis, object recognition and scene change detection. It also provides powerful search curation features that enable developers to optimise search results. With the help of these machine learning services, organisations can extract, analyse and index multiple data formats including emails, audio and video archives. The Haven OnDemand pricing plans start at $10 per month.

BigML

BigML is easy to use and offers flexible deployment. It allows data imports from AWS, Microsoft Azure, Google Storage, Google Drive, Dropbox and more. BigML has many features integrated into its web UI, along with a large gallery of free datasets and models. It also has useful clustering algorithms and visualizations, and an anomaly detection feature that helps in spotting pattern anomalies, which can save you time and money.

According to a blog, BigML datasets are very easy to reuse, edit, expand and export. You can easily rename and add descriptions to each of your fields, add new ones (through normalization, discretization, mathematical operations, missing value replacement, etc.), and generate subsets based on sampling or custom filters. Pricing is flexible: you can choose between subscription plans starting at $15 per month for students, and you can perform unlimited tasks on datasets up to 16 MB for free.

There is also a pay-as-you-go option available in BigML. For companies with stringent data
security, privacy or regulatory requirements, BigML offers private deployment that can run on
their preferred cloud provider or ISP.

MLJAR

MLJAR is a ‘human-first platform’ for machine learning and is available in a beta version. It provides a service for developing, prototyping and deploying pattern recognition algorithms, with features such as built-in hyper-parameter search and a single interface for many algorithms. To get started, users upload a dataset and select the input and target attributes, and the service finds a matching ML algorithm. MLJAR is also based on a pay-as-you-go pricing model. Once your 30-day free trial ends, there are different subscription plans for professional developers, startups, businesses and organisations. When you start a subscription, you get 10 free credits.

Arimo
Arimo uses machine learning algorithms and a large computing platform to crunch massive amounts of data in seconds. It describes itself as ‘behavioural AI for IoT’, which learns from past behaviour, predicts future actions and drives superior business outcomes. The service is based on a deep learning architecture that works with time series data to discover patterns of behaviour.

Domino

Domino is a platform that supports modern data analysis workflows. The platform is language agnostic, supporting Python, R, MATLAB, Perl, Julia and shell scripts, among others. Domino serves data scientists, data science managers, IT leaders and executives. This machine learning service can be implemented on-site or in the cloud. Developers can develop, deploy and collaborate using their existing tools and languages, claims the company. It also streamlines knowledge management, with all projects stored, searchable and forkable, and it offers rich functionality in an integrated end-to-end platform for version control and collaboration along with one-click infrastructure scalability, deployment and publishing.

Dataiku Data Science Studio

It is a collaborative data science platform for data scientists, data analysts and engineers to explore, prototype, build and deliver their own data products more efficiently. The platform supports R, Python, Scala, Hive, Pig, Spark and more. It offers a customisable drag-and-drop visual interface at any step of the dataflow prototyping process, and it exposes machine learning technologies such as scikit-learn, MLlib, XGBoost and H2O in a visual user interface.
Types of Data

Data sources can be categorized as follows:

File Data Sources

• Excel: Power BI can connect to Excel workbooks, which may include data in tables or
data models created using Power Query or Power Pivot.

• Text/CSV: Comma-separated values files can be imported directly into Power BI.

• XML: Extensible Markup Language files.

• JSON: JavaScript Object Notation files, useful for web data and APIs.

• Folder: Aggregate data from multiple files in a folder.

• PDF: Extract data from tables within PDF documents.


• Parquet: Columnar storage format files.

• SharePoint Folder: Data stored in SharePoint folders.

Database Data Sources

• SQL Server Database: Connect to Microsoft SQL Server databases.

• Access Database: Microsoft Access database files.

• Oracle Database: Oracle's RDBMS databases.

• IBM Db2 Database: IBM's database systems.

• MySQL Database: Widely used open-source relational database.

• PostgreSQL Database: Advanced open-source relational database.

• SAP HANA Database: In-memory, column-oriented, relational database management


system.

• Amazon Redshift: Data warehouse product which forms part of the larger cloud-
computing platform Amazon Web Services.

• Google BigQuery: Serverless, highly scalable, and cost-effective multi-cloud data


warehouse designed for business agility.

• Snowflake: Cloud-based data warehousing service.

Feature Engineering in machine learning

Feature engineering is a crucial preprocessing step in machine learning that involves


transforming raw data into a format that can be effectively used by machine learning models.
This process enhances the predictive accuracy and decision-making capability of models by
creating, selecting, and transforming relevant features.

Key Processes in Feature Engineering

Feature Creation
Feature creation involves generating new features based on domain knowledge or by observing
patterns in the data. This can significantly improve the performance of a machine learning
model. Types of feature creation include:

• Domain-Specific: Creating features based on business rules or industry standards.


• Data-Driven: Creating features by observing patterns in the data.
• Synthetic: Generating new features by combining existing ones (a short sketch illustrating these three types follows below).
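
The following is a minimal sketch, not from the original text and using hypothetical column names, of the three kinds of feature creation on a toy sales table with pandas:

# Sketch: feature creation on a toy sales table (hypothetical data)
import pandas as pd

df = pd.DataFrame({
    'order_date': pd.to_datetime(['2024-01-05', '2024-02-14', '2024-03-01']),
    'quantity': [2, 5, 1],
    'unit_price': [10.0, 4.0, 25.0]})

# Domain-specific: flag weekend orders (a business rule)
df['is_weekend'] = df['order_date'].dt.dayofweek >= 5

# Data-driven: extract the order month, often useful for seasonality patterns
df['order_month'] = df['order_date'].dt.month

# Synthetic: combine existing columns into a new revenue feature
df['revenue'] = df['quantity'] * df['unit_price']

print(df)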

Feature Transformation

Feature transformation converts features into a more suitable representation for the machine
learning model. This includes:

• Normalization: Rescaling features to have a similar range.


• Scaling: Transforming numerical variables to have a similar scale.
• Encoding: Transforming categorical features into numerical representations, such as one-
hot encoding.

Feature Extraction

Feature extraction creates new features from existing ones to provide more relevant information
to the model. Techniques include:

• Dimensionality Reduction: Reducing the number of features while retaining important


information (e.g., PCA, LDA).
• Feature Combination: Combining existing features to create new ones.
• Feature Aggregation: Aggregating features to create new ones (a brief sketch of combination and aggregation follows below).
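
As a minimal sketch with hypothetical transaction data (not from the original text), feature combination and aggregation can be done with pandas as follows:

# Sketch: feature combination and aggregation (hypothetical data)
import pandas as pd

df = pd.DataFrame({
    'customer_id': [1, 1, 2, 2, 2],
    'amount': [20.0, 35.0, 10.0, 5.0, 50.0],
    'items': [2, 3, 1, 1, 4]})

# Feature combination: price per item derived from two existing columns
df['amount_per_item'] = df['amount'] / df['items']

# Feature aggregation: per-customer statistics merged back as new features
agg = df.groupby('customer_id')['amount'].agg(['mean', 'sum']).add_prefix('amount_').reset_index()
df = df.merge(agg, on='customer_id')

print(df)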

Feature Selection

Feature selection involves selecting a subset of relevant features from the dataset to be used in
the model. This can reduce overfitting, improve model performance, and decrease computational
costs. Methods include:

• Filter Method: Based on the statistical relationship between the feature and the target
variable.
• Wrapper Method: Based on the evaluation of the feature subset using a specific
machine learning algorithm.
• Embedded Method: Feature selection as part of the training process of the algorithm (a filter-method sketch follows below).
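
A minimal sketch of the filter method, using scikit-learn's SelectKBest on the iris data (any feature matrix would do):

# Sketch: filter-method feature selection with SelectKBest
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Keep the 2 features most related to the target according to the ANOVA F-test
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print(X.shape, '->', X_selected.shape)  # (150, 4) -> (150, 2)
print(selector.get_support())           # boolean mask of the kept features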

Feature Scaling

Feature scaling transforms features so that they have a similar scale, which is important for many
machine learning algorithms. Techniques include:

• Min-Max Scaling: Rescaling features to a specific range, such as between 0 and 1.


• Standard Scaling: Rescaling features to have a mean of 0 and a standard deviation of 1.
• Robust Scaling: Rescaling features to be robust to outliers (a short comparison of the three scalers follows below).
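
A minimal sketch comparing the three scalers on a small array containing an outlier (the values are illustrative):

# Sketch: comparing Min-Max, Standard and Robust scaling
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

X = np.array([[1.0], [2.0], [3.0], [100.0]])  # 100 is an outlier

print(MinMaxScaler().fit_transform(X).ravel())    # rescaled into [0, 1]
print(StandardScaler().fit_transform(X).ravel())  # zero mean, unit variance
print(RobustScaler().fit_transform(X).ravel())    # based on median/IQR, less sensitive to the outlier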

Importance of Feature Engineering

Feature engineering is essential because the quality of features used to train machine learning
models heavily influences their performance. By providing more meaningful and relevant
information, feature engineering helps models learn more effectively from the data.

Dummification

When working with categorical variables in machine learning models, it is essential to convert
them into numerical form. This conversion allows the model to understand and utilize the
information contained within these variables. In Python 3 programming, two common
approaches to handle categorical variables in XGBoost models are dummification and encoding.

Dummification, also known as one-hot encoding, is a technique used to convert categorical


variables into binary columns. Each unique value in the categorical variable is transformed into
a separate binary column, where a value of 1 indicates the presence of that particular value and 0
otherwise. This approach allows the XGBoost model to treat each category as an independent
feature.

# Program showing dummification


import pandas as pd
from sklearn.preprocessing import OneHotEncoder
# Create a sample dataframe
info = {'color': ['red', 'blue', 'green', 'red', 'green']}
df = pd.DataFrame(info)
# Apply dummification
encoder = OneHotEncoder()
dummified = encoder.fit_transform(df[['color']])
Output (file: dummification1.py):-
"""
print(info)
{'color': ['red', 'blue', 'green', 'red', 'green']}
print(dummified)
(0, 2) 1.0
(1, 0) 1.0
(2, 1) 1.0
(3, 2) 1.0
(4, 1) 1.0
"""
I create a sample dataframe with a categorical (qualitative) variable ‘color’. I then use the OneHotEncoder from the scikit-learn library to dummify the ‘color’ column. The result encodes each row as three binary indicator columns, one per category (‘color_blue’, ‘color_green’, ‘color_red’ in alphabetical order); printing the sparse result shows only the positions of the 1 values.
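For comparison, pandas offers the same one-hot encoding through get_dummies, which names the resulting columns directly; a minimal sketch:

# Sketch: dummification with pandas get_dummies
import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', 'green', 'red', 'green']})
print(pd.get_dummies(df, columns=['color']))  # columns color_blue, color_green, color_red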
Encoding
Encoding is an alternative approach to handling categorical variables in XGBoost. Instead of creating separate binary columns, encoding assigns a numerical value to each category. This imposes an ordinal relationship between categories, which can be beneficial when the categories genuinely have an order.
#Program to do encoding.
import pandas as pd
from sklearn.preprocessing import LabelEncoder
# Create a sample dataframe
info = {'color': ['red', 'blue', 'green', 'red', 'green']}
df = pd.DataFrame(info)
# Apply encoding
encoder = LabelEncoder()
encoded = encoder.fit_transform(df['color'])
Output(encoding_dummy1.py):-
"""
print(info)
{'color': ['red', 'blue', 'green', 'red', 'green']}
print(encoded)
[2 0 1 2 1]
"""
I use the LabelEncoder from the scikit-learn library to encode the ‘color’ column. The resulting
encoded array will contain numerical values corresponding to each category: 2 for ‘red’, 0 for
‘blue’, and 1 for ‘green’.
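Note that LabelEncoder assigns codes in alphabetical order, which may not match a real ordering. When the categories genuinely have an order, scikit-learn's OrdinalEncoder with an explicit category list preserves it; a minimal sketch with a hypothetical ordered 'size' column:

# Sketch: ordinal encoding with an explicit category order (hypothetical data)
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({'size': ['small', 'large', 'medium', 'small']})
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
print(encoder.fit_transform(df[['size']]).ravel())  # [0. 2. 1. 0.]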
XGBoost stands for eXtreme Gradient Boosting. It is an optimized distributed gradient boosting library designed for efficient and scalable training of machine learning models. XGBoost is an implementation of gradient boosted decision trees, an ensemble learning method that combines the predictions of multiple weak models to produce a stronger prediction.
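
As a minimal sketch (not from the original text, with made-up labels), the one-hot encoded ‘color’ features could be fed to an XGBoost classifier like this, assuming the xgboost package is installed:

# Sketch: training XGBoost on one-hot encoded categorical data (toy labels)
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from xgboost import XGBClassifier

df = pd.DataFrame({'color': ['red', 'blue', 'green', 'red', 'green']})
y = [1, 0, 1, 1, 0]  # hypothetical binary target

X = OneHotEncoder().fit_transform(df[['color']]).toarray()

model = XGBClassifier(n_estimators=10, max_depth=2)
model.fit(X, y)
print(model.predict(X))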

Tools for Feature Engineering

Several tools can help streamline and automate the feature engineering process, including:

• Featuretools: A Python library for automated feature engineering.


• TPOT: An automated machine learning tool that includes feature engineering.
• DataRobot: A machine learning automation platform with feature engineering
capabilities.
• Alteryx: A data preparation and automation tool with feature engineering support.
• H2O.ai: An open-source machine learning platform with automated and manual feature engineering options.

In conclusion, feature engineering is a vital step in the machine learning pipeline that involves creating, transforming, extracting, and selecting features to improve model performance. It requires substantial data analysis and domain knowledge to effectively encode features for different models and datasets.

Python Machine Learning: Scikit-Learn Tutorial


An easy-to-follow scikit-learn tutorial that will help you get started with Python machine learning.

Machine Learning with Python

Machine learning is a branch of computer science that studies the design of algorithms that can learn.

Data Preprocessing Steps In Machine Learning: Major Tasks Involved

Data cleaning, Data transformation, Data reduction, and Data integration are the major steps in
data preprocessing.

Data Cleaning

Data cleaning, one of the major preprocessing steps in machine learning, locates and fixes errors or discrepancies in the data. From duplicates and outliers to missing values, it fixes them all. Methods like transformation, removal, and imputation help ML professionals perform data cleaning seamlessly.
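
A minimal sketch (toy data) of the removal and imputation methods mentioned above, using pandas:

# Sketch: data cleaning with duplicate removal and imputation (toy data)
import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [25, 25, np.nan, 40],
                   'salary': [50000, 50000, 42000, np.nan]})

df = df.drop_duplicates()                                 # remove duplicate rows
df['age'] = df['age'].fillna(df['age'].median())          # impute missing age
df['salary'] = df['salary'].fillna(df['salary'].mean())   # impute missing salary
print(df)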

Data Integration

Data integration is among the major responsibilities of data preprocessing in machine learning. This process integrates (merges) information extracted from multiple sources to create a single dataset. The fact that you need to handle data in multiple forms, formats, and semantics makes data integration a challenging task for many ML developers.
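
A minimal sketch (hypothetical sources) of merging records from two systems into a single dataset with pandas:

# Sketch: integrating two data sources on a shared key (hypothetical data)
import pandas as pd

crm = pd.DataFrame({'customer_id': [1, 2, 3], 'segment': ['A', 'B', 'A']})
sales = pd.DataFrame({'customer_id': [1, 2, 2, 3], 'amount': [10.0, 5.0, 7.5, 20.0]})

merged = sales.merge(crm, on='customer_id', how='left')  # one combined dataset
print(merged)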

Data Transformation
ML programmers must pay close attention to data transformation when it comes to data preprocessing steps. This process entails putting the data in a format suitable for analysis. Normalization, standardization, and discretisation are common data transformation procedures. Standardization transforms data to have a zero mean and unit variance, normalization scales data to a common range, and discretisation converts continuous data into discrete categories.
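
Normalization and standardization are sketched later under feature scaling; discretisation can be done, for example, with pandas.cut (the bin edges below are illustrative):

# Sketch: discretising a continuous attribute into categories
import pandas as pd

ages = pd.Series([5, 17, 25, 42, 67])
bins = pd.cut(ages, bins=[0, 18, 40, 65, 100],
              labels=['child', 'young adult', 'adult', 'senior'])
print(bins)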

Data Reduction

Data reduction is the process of lowering the dataset’s size while maintaining crucial information. It can be accomplished through feature selection and feature extraction algorithms. Feature selection chooses a subset of pertinent attributes from the dataset, while feature extraction translates the data into a lower-dimensional space while keeping the crucial information. For example, if an employee dataset contains attributes such as date of birth and date of joining the company, these can be replaced with derived attributes such as age and years of service (current date minus date of joining).
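
A minimal sketch (hypothetical columns) of replacing a raw date attribute with a derived years-of-service attribute:

# Sketch: data reduction by deriving years of service from a joining date
import pandas as pd

df = pd.DataFrame({'date_of_joining': pd.to_datetime(['2015-06-01', '2020-01-15'])})
today = pd.Timestamp('2024-01-01')  # fixed reference date for reproducibility

df['years_of_service'] = (today - df['date_of_joining']).dt.days / 365.25
df = df.drop(columns=['date_of_joining'])  # the raw date is no longer needed
print(df)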


Definition:

Data curation is the work of creating and overseeing ready-to-use data sets for BI and analytics. It involves tasks such as indexing, cataloging and maintaining data sets and their associated metadata to help users find and access the data.

Data curation is the process of creating, organizing and maintaining data sets so they
can be accessed and used by people looking for information. It involves collecting,
structuring, indexing and cataloging data for users in an organization, group or the
general public. Data can be curated to support business decision-making, academic
needs, scientific research and other purposes.

Why Data Preprocessing in Machine Learning?

When it comes to creating a machine learning model, data preprocessing is the first step, marking the initiation of the process. Typically, real-world data is incomplete, inconsistent, inaccurate (it contains errors or outliers), and often lacks specific attribute values or trends. This is where data preprocessing enters the scenario: it helps to clean, format, and organize the raw data, thereby making it ready to go for machine learning models. Let’s explore the various steps of data preprocessing in machine learning.

What are Data Preprocessing Pipelines?

A data preprocessing pipeline is a series of sequential data transformation steps that are
applied to the raw input data to prepare it for model training and evaluation. These pipelines
help maintain consistency, ensure reproducibility, and enhance the efficiency of the
preprocessing process.

Key Steps in a Data Preprocessing Pipeline

1. Data Cleaning: Handling missing values, outliers, and noisy data.


2. Feature Transformation: Applying transformations like log transformations
or polynomial features.
3. Encoding Categorical Data: Converting categorical variables into
numerical representations.
4. Feature Scaling: Standardizing or normalizing features to bring them to a
similar scale.
5. Feature Selection: Selecting the most relevant features for modeling.

Example

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Load the Iris dataset


iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the initial data


print("Initial data:")
print(X_test[:5])

# Create a pipeline
pipeline = Pipeline([
('imputer', SimpleImputer(strategy='mean')),
('scaler', StandardScaler()),
('pca', PCA(n_components=2)),
('model', LogisticRegression())
])

# Fit and predict using the pipeline


pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

# Transform and print results of preprocessing steps


X_imputed = pipeline.named_steps['imputer'].transform(X_test)
X_scaled = pipeline.named_steps['scaler'].transform(X_imputed)
X_pca = pipeline.named_steps['pca'].transform(X_scaled)

print("Imputed data:")
print(X_imputed[:5])
print("Scaled data:")
print(X_scaled[:5])
print("PCA-transformed data:")
print(X_pca[:5])

# Visualize the first two principal components


plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y_pred, cmap='viridis')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA Visualization with Logistic Regression')
plt.show()

Output: printed arrays for each preprocessing step, followed by the PCA scatter plot (figure not reproduced here).
Scatter matrix plot for iris data

#Program to create scatter-matrix for iris data set

#Program to check prediction by splitting database

import pandas as pd

import numpy as np

from sklearn.datasets import load_iris

import matplotlib.pyplot as plt

iris_data = load_iris()
from sklearn.model_selection import train_test_split

print(iris_data["feature_names"])

print(iris_data["target_names"])

X_train, X_test, y_train, y_test = train_test_split(\

iris_data['data'], iris_data['target'],test_size=0.2, random_state=0)

print("X_train shape: {}".format(X_train.shape))

print("y_train shape: {}".format(y_train.shape))

print("X_test shape: {}".format(X_test.shape))

print("y_test shape: {}".format(y_test.shape))

# create dataframe from data in X_train

# label the columns using the strings in iris_dataset.feature_names

iris_dataframe = pd.DataFrame(X_train, columns=iris_data.feature_names)

# create a scatter matrix from the dataframe, color by y_train

pd.plotting.scatter_matrix(iris_dataframe, c=y_train, figsize=(15, 15), marker='g',\

hist_kwds={'bins': 20}, alpha=.8)


plt.show()

Output of file scatter_matrix1.py

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

['setosa' 'versicolor' 'virginica']

X_train shape: (120, 4)

y_train shape: (120,)

X_test shape: (30, 4)

y_test shape: (30,)


Typical tasks are concept learning, function learning or “predictive modeling”, clustering and
finding predictive patterns. These tasks are learned through available data that were observed
through experiences or instructions, for example.

The hope that comes with this discipline is that incorporating experience into its tasks will eventually improve the learning. But this improvement needs to happen in such a way that the learning itself becomes automatic, so that humans like ourselves don’t need to intervene anymore; that is the ultimate goal.

Today’s scikit-learn tutorial will introduce you to the basics of Python machine learning:

• You'll learn how to use Python and its libraries to explore your data with the help
of matplotlib and Principal Component Analysis (PCA),

• And you'll preprocess your data with normalization and you'll split your data into training
and test sets.

• Next, you'll work with the well-known KMeans algorithm to construct an unsupervised
model, fit this model to your data, predict values, and validate the model that you have
built.

• As an extra, you'll also see how you can also use Support Vector Machines (SVM) to
construct another model to classify your data.

If you’re more interested in an R tutorial, take a look at our Machine Learning with R for
Beginners tutorial.

Alternatively, check out DataCamp's Supervised Learning with scikit-learn and Unsupervised
Learning in Python courses!

Loading Your Data Set

The first step to just about anything in data science is loading your data. This is also the starting
point of this scikit-learn tutorial.

This discipline typically works with observed data. This data might be collected by yourself or
you can browse through other sources to find data sets. But if you’re not a researcher or
otherwise involved in experiments, you’ll probably do the latter.

If you’re new to this and you want to start working on problems of your own, finding these data sets might prove to be a challenge. However, you can typically find good data sets at the UCI Machine Learning Repository or on the Kaggle website. Also, check out this KD Nuggets list of resources.
For now, you should warm up, not worry about finding any data by yourself and just load in
the digits data set that comes with a Python library, called scikit-learn.

Fun fact: did you know the name originates from the fact that this library is a scientific toolbox
built around SciPy? By the way, there is more than just one scikit out there. This scikit contains
modules specifically for machine learning and data mining, which explains the second
component of the library name. :)

To load in the data, you import the module datasets from sklearn. Then, you can use
the load_digits() method from datasets to load in the data:

# Import `datasets` from `sklearn`

from sklearn import datasets

# Load in the `digits` data

digits = datasets.load_digits()

# Print the `digits` data

print(digits)

Note that the datasets module contains other methods to load and fetch popular reference
datasets, and you can also count on this module in case you need artificial data generators. In
addition, this data set is also available through the UCI Repository that was mentioned above:
you can find the data here.

If you would have decided to pull the data from the latter page, your data import would’ve
looked like this:

# Import the `pandas` library as `pd`

import pandas as pd

# Load in the data with `read_csv()`

digits = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/optdigits/optdigits.tra", header=None)

# Print out `digits`

print(digits)
Note that if you download the data like this, the data is already split up in a training and a test
set, indicated by the extensions .tra and .tes. You’ll need to load in both files to elaborate your
project. With the command above, you only load in the training set.

Tip: if you want to know more about importing data with the Python data manipulation library Pandas, consider the Importing Data in Python course.


Explore Your Data

When first starting out with a data set, it’s always a good idea to go through the data description
and see what you can already learn. When it comes to scikit-learn, you don’t immediately have
this information readily available, but in the case where you import data from another source,
there's usually a data description present, which will already be a sufficient amount of
information to gather some insights into your data.

However, those insights alone are not deep enough for the analysis that you are going to perform. You really need to have a good working knowledge of the data set.

Performing an exploratory data analysis (EDA) on a data set like the one that this tutorial now
has might seem difficult.

Where do you start exploring these handwritten digits?

Gathering Basic Information on Your Data


Let’s say that you haven’t checked any data description folder (or maybe you want to double-
check the information that has been given to you).

Then you should start with gathering the basic information.

When you printed out the digits data after having loaded it with the help of the scikit-learn datasets module, you will have noticed that there is already a lot of information available.
You already have knowledge of things such as the target values and the description of your data.
You can access the digits data through the attribute data. Similarly, you can also access the target
values or labels through the target attribute and the description through the DESCR attribute.

To see which keys you have available to already get to know your data, you can just
run digits.keys().

# Get the keys of the `digits` data

print(digits.keys())

# Print out the data

print(digits.data)

# Print out the target values

print(digits.target)

# Print out the description of the `digits` data

print(digits.DESCR)

The next thing that you can (double)check is the type of your data.

If you used read_csv() to import the data, you would have had a data frame that contains just the
data. There wouldn’t be any description component, but you would be able to resort to, for
example, head() or tail() to inspect your data. In these cases, it’s always wise to read up on the
data description folder!

However, this tutorial assumes that you make use of the library's data and the type of
the digits variable is not that straightforward if you’re not familiar with the library. Look at the
print out in the first code chunk. You’ll see that digits actually contains numpy arrays!

This is already quite some important information. But how do you access these arrays?

It’s very easy, actually: you use attributes to access the relevant arrays. Remember that you have already seen which attributes are available when you printed digits.keys(). For instance, you have the data attribute to isolate the data, target to see the target values and DESCR for the description, …

But what then?

The first thing that you should know of an array is its shape. That is, the number of dimensions
and items that is contained within an array. The array’s shape is a tuple of integers that specify
the sizes of each dimension. In other words, if you have a 3d array like this y = np.zeros((2, 3,
4)), the shape of your array will be (2,3,4).

Now let’s try to see what the shape is of these three arrays that you have distinguished
(the data, target and DESCR arrays).

Use first the data attribute to isolate the numpy array from the digits data and then use
the shape attribute to find out more. You can do the same for the target and DESCR. There’s also
the images attribute, which is basically the data in images. You’re also going to test this out.

Check up on this statement by using the shape attribute on the array:

# Import numpy to count the unique labels
import numpy as np

# Isolate the `data` and inspect the shape
digits_data = digits.data
print(digits_data.shape)

# Isolate the target values and inspect the shape
digits_target = digits.target
print(digits_target.shape)

# Print the number of unique labels
number_digits = len(np.unique(digits.target))
print(number_digits)

# Isolate the `images` and inspect the shape
digits_images = digits.images
print(digits_images.shape)

To recap: by inspecting digits.data, you see that there are 1797 samples and that there are 64
features. Because you have 1797 samples, you also have 1797 target values.

But all those target values contain 10 unique values, namely, from 0 to 9. In other words, all
1797 target values are made up of numbers that lie between 0 and 9. This means that the digits
that your model will need to recognize are numbers from 0 to 9.
Lastly, you see that the images data contains three dimensions: there are 1797 instances that are
8 by 8 pixels big. You can visually check that the images and the data are related by reshaping
the images array to two dimensions: digits.images.reshape((1797, 64)).

But if you want to be completely sure, better to check with

print(np.all(digits.images.reshape((1797,64)) == digits.data))

With the numpy method all(), you test whether all array elements along a given axis evaluate
to True. In this case, you evaluate if it’s true that the reshaped images array equals digits.data.
You’ll see that the result will be True in this case.

Visualize Your Data Images With matplotlib

Then, you can take your exploration up a notch by visualizing the images that you’ll be working
with. You can use one of Python’s data visualization libraries, such as matplotlib, for this
purpose:

# Import matplotlib

import matplotlib.pyplot as plt

# Figure size (width, height) in inches

fig = plt.figure(figsize=(6, 6))

# Adjust the subplots

fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)

# For each of the 64 images

for i in range(64):

# Initialize the subplots: add a subplot in the grid of 8 by 8, at the i+1-th position

ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])

# Display an image at the i-th position

ax.imshow(digits.images[i], cmap=plt.cm.binary, interpolation='nearest')


# label the image with the target value

ax.text(0, 7, str(digits.target[i]))

# Show the plot

plt.show()

The code chunk seems quite lengthy at first sight and this might be overwhelming. But, what
happens in the code chunk above is actually pretty easy once you break it down into parts:

• You import matplotlib.pyplot.

• Next, you set up a figure with a figure size of 6 inches wide and 6 inches long. This is
your blank canvas where all the subplots with the images will appear.

• Then you go to the level of the subplots to adjust some parameters: you set the left side of the subplots of the figure to 0, the right side of the subplots of the figure to 1, the bottom to 0 and the top to 1. The height and width of the blank space between the subplots are both set to 0.05. These are merely layout adjustments.

• After that, you start filling up the figure that you have made with the help of a for loop.

• You initialize the subplots one by one, adding one at each position in the grid that is 8 by 8 images big.

• You display each time one of the images at each position in the grid. As a color map, you
take binary colors, which in this case will result in black, gray values and white colors.
The interpolation method that you use is 'nearest', which means that your data is
interpolated in such a way that it isn’t smooth. You can see the effect of the different
interpolation methods here.

• The cherry on the pie is the addition of text to your subplots. The target labels are printed
at coordinates (0,7) of each subplot, which in practice means that they will appear in the
bottom-left of each of the subplots.

• Don’t forget to show the plot with plt.show()!

In the end, you’ll get to see the following:



Output of file data_set4.py on my laptop.

On a more simple note, you can also visualize the target labels with an image, just like this:

# Import matplotlib

import matplotlib.pyplot as plt


# Join the images and target labels in a list

images_and_labels = list(zip(digits.images, digits.target))

# for every element in the list

for index, (image, label) in enumerate(images_and_labels[:8]):

# initialize a subplot of 2X4 at the i+1-th position

plt.subplot(2, 4, index + 1)

# Don't plot any axes

plt.axis('off')

# Display images in all subplots

plt.imshow(image, cmap=plt.cm.gray_r,interpolation='nearest')

# Add a title to each subplot

plt.title('Training: ' + str(label))

# Show the plot

plt.show()

Which will render the following visualization:


Output on my laptop of file data_set5.py
Note that in this case, after you have imported matplotlib.pyplot, you zip the two numpy arrays together and save the result in a variable called images_and_labels. You’ll see that this list contains tuples, each pairing an instance of digits.images with the corresponding digits.target value.

Then, for the first eight elements of images_and_labels (note that the index starts at 0!), you initialize subplots in a grid of 2 by 4 at each position. You turn off the plotting of the axes and display the images in all the subplots with the color map plt.cm.gray_r (which returns all grey colors) and the interpolation method nearest. You give a title to each subplot, and you show it.

Not too hard, huh?

And now you have a very good idea of the data that you’ll be working with!

Visualizing Your Data: Principal Component Analysis (PCA)

But is there no other way to visualize the data?

As the digits data set contains 64 features, this might prove to be a challenging task. You can
imagine that it’s very hard to understand the structure and keep the overview of the digits data. In
such cases, it is said that you’re working with a high dimensional data set.

High dimensionality of data is a direct result of trying to describe the objects via a collection of
features. Other examples of high dimensional data are, for example, financial data, climate data,
neuroimaging, …

But, as you might have gathered already, this is not always easy. In some cases, high
dimensionality can be problematic, as your algorithms will need to take into account too many
features. In such cases, you speak of the curse of dimensionality. Because having a lot of
dimensions can also mean that your data points are far away from virtually every other point,
which makes the distances between the data points uninformative.

Don’t worry, though, because the curse of dimensionality is not simply a matter of counting the
number of features. There are also cases in which the effective dimensionality might be much
smaller than the number of the features, such as in data sets where some features are irrelevant.

In addition, you can also understand that data with only two or three dimensions is easier to
grasp and can also be visualized easily.

That all explains why you’re going to visualize the data with the help of one of the
Dimensionality Reduction techniques, namely Principal Component Analysis (PCA). The idea in
PCA is to find a linear combination of the two variables that contains most of the information.
This new variable or “principal component” can replace the two original variables.
In short, it’s a linear transformation method that yields the directions (principal components) that
maximize the variance of the data. Remember that the variance indicates how far a set of data
points lie apart.

You can easily apply PCA to your data with the help of scikit-learn:

# Import `datasets` from `sklearn`

from sklearn import datasets

from sklearn.decomposition import PCA

import pandas as pd

import matplotlib.pyplot as plt

# Load in the `digits` data

digits = datasets.load_digits()

# Print the `digits` data

#print(digits)

"""# Create a Randomized PCA model that takes two components

randomized_pca = RandomizedPCA(n_components=2)

# Fit and transform the data to the model

reduced_data_rpca = randomized_pca.fit_transform(digits.data)"""

# Create a regular PCA model

pca = PCA(n_components=2)

# Fit and transform the data to the model

reduced_data_pca = pca.fit_transform(digits.data)

# Inspect the shape

print(reduced_data_pca.shape)

output (1797,2)
Tip: the commented-out block above shows a randomized PCA variant (RandomizedPCA in older scikit-learn versions, or PCA(svd_solver='randomized') in newer ones), which can perform better when there is a high number of dimensions. Try swapping it in for the regular PCA model or estimator object and see what the difference is.

Note how you explicitly tell the model to only keep two components. This is to make sure that
you have two-dimensional data to plot. Also, note that you don’t pass the target class with the
labels to the PCA transformation because you want to investigate if the PCA reveals the
distribution of the different labels and if you can clearly separate the instances from each other.

You can now build a scatterplot to visualize the data:

colors = ['black', 'blue', 'purple', 'yellow', 'white', 'red', 'lime', 'cyan', 'orange', 'gray']

for i in range(len(colors)):

x = reduced_data_pca[:, 0][digits.target == i]

y = reduced_data_pca[:, 1][digits.target == i]

plt.scatter(x, y, c=colors[i])

plt.legend(digits.target_names, bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

plt.xlabel('First Principal Component')

plt.ylabel('Second Principal Component')

plt.title("PCA Scatter Plot")

plt.show()

Which looks like this:


Entire program file data_sets7.py

# Import `datasets` from `sklearn`

from sklearn import datasets

from sklearn.decomposition import PCA

import pandas as pd

import matplotlib.pyplot as plt

# Load in the `digits` data

digits = datasets.load_digits()

# Print the `digits` data

#print(digits)

# Create a regular PCA model

pca = PCA(n_components=2)
# Fit and transform the data to the model

reduced_data_pca = pca.fit_transform(digits.data)

# Inspect the shape

print(reduced_data_pca.shape)

colors = ['black', 'blue', 'purple', 'yellow', 'white', 'red', 'lime', 'cyan', 'orange', 'gray']

for i in range(len(colors)):

x = reduced_data_pca[:, 0][digits.target == i]

y = reduced_data_pca[:, 1][digits.target == i]

plt.scatter(x, y, c=colors[i])

plt.legend(digits.target_names, bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

plt.xlabel('First Principal Component')

plt.ylabel('Second Principal Component')

plt.title("PCA Scatter Plot")

plt.show()

Output on my laptop of file data_sets7.py


Again you use matplotlib to visualize the data. It’s good for a quick visualization of what you’re
working with, but you might have to consider something a little bit more fancy if you’re working
on making this part of your data science portfolio.

Also note that the last call to show the plot (plt.show()) is not necessary if you’re working in
Jupyter Notebook, as you’ll want to put the images inline.

What happens in the code chunk above is the following:

1. You put your colors together in a list. Note that you list ten colors, which is equal to the
number of labels that you have. This way, you make sure that your data points can be
colored in according to the labels. Then, you set up a range that goes from 0 to 9. Mind
you that this range is not inclusive! Remember that this is the same for indices of a list,
for example.

2. You set up your x and y coordinates. You take the first or the second column
of reduced_data_pca, and you select only those data points for which the label equals the
index that you’re considering. That means that in the first run, you’ll consider the data
points with label 0, then label 1, … and so on.

3. You construct the scatter plot. Fill in the x and y coordinates and assign a color to the
batch that you’re processing. The first run, you’ll give the color black to all data points,
the next run blue, … and so on.

4. You add a legend to your scatter plot. Use the target_names key to get the right labels for
your data points.

5. Add labels to your x and y axes that are meaningful.

6. Reveal the resulting plot.

Definition

Principal component analysis (PCA) is a linear dimensionality reduction technique with


applications in exploratory data analysis, visualization and data preprocessing.

The data is linearly transformed onto a new coordinate system such that the directions
(principal components) capturing the largest variation in the data can be easily identified.

Implementing PCA in Python with scikit-learn

WHY PCA?

➢ When there are many input attributes, it is difficult to visualize the data. There is a very famous term, the ‘curse of dimensionality’, in the machine learning domain.
➢ Basically, it refers to the fact that a higher number of attributes in a dataset adversely
affects the accuracy and training time of the machine learning model.
➢ Principal Component Analysis (PCA) is a way to address this issue and is used for better
data visualization and improving accuracy.

How does PCA work?

➢ PCA is an unsupervised pre-processing task that is carried out before applying any ML
algorithm. PCA is based on “orthogonal linear transformation” which is a mathematical
technique to project the attributes of a data set onto a new coordinate system. The
attribute which describes the most variance is called the first principal component and is
placed at the first coordinate.
➢ Similarly, the attribute which stands second in describing variance is called a second
principal component and so on. In short, the complete dataset can be expressed in terms
of principal components. Usually, more than 90% of the variance is explained by
two/three principal components.
➢ Principal component analysis, or PCA, thus converts data from high dimensional space to
low dimensional space by selecting the most important attributes that capture maximum
information about the dataset.

Python Implementation:

• To implement PCA in scikit-learn, it is essential to standardize/normalize the data before
applying PCA.
• PCA is imported from sklearn.decomposition. We need to select the required number of
principal components.
• Usually, n_components is chosen to be 2 for better visualization, but the right choice
depends on the data.
• The attributes are passed to the fit and transform methods.
• The values of the principal components can be checked using components_, while the
variance explained by each principal component can be calculated using
explained_variance_ratio_.

Program to explore the breast cancer dataset.

#Another example of PCA


# import all libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#%matplotlib inline
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

#import the breast_cancer dataset


from sklearn.datasets import load_breast_cancer
data=load_breast_cancer()
data.keys()
#Exploratory data
# Check the output classes
print(data['target_names'])

# Check the input attributes


print(data['feature_names'])

Output of python program breast_cancer1.py

['malignant' 'benign']
['mean radius' 'mean texture' 'mean perimeter' 'mean area'
'mean smoothness' 'mean compactness' 'mean concavity'
'mean concave points' 'mean symmetry' 'mean fractal dimension'
'radius error' 'texture error' 'perimeter error' 'area error'
'smoothness error' 'compactness error' 'concavity error'
'concave points error' 'symmetry error' 'fractal dimension error'
'worst radius' 'worst texture' 'worst perimeter' 'worst area'
'worst smoothness' 'worst compactness' 'worst concavity'
'worst concave points' 'worst symmetry' 'worst fractal dimension']

Apply PCA

➢ Standardize the dataset prior to PCA.


➢ Import PCA from sklearn.decomposition.
➢ Choose the number of principal components.

#Another example of PCA


# import all libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
#import the breast_cancer dataset
from sklearn.datasets import load_breast_cancer
data=load_breast_cancer()
data.keys()
#Exploratory data
# Check the output classes
#print(data['target_names'])

# Check the input attributes


#print(data['feature_names'])
# construct a dataframe using pandas
df1=pd.DataFrame(data['data'],columns=data['feature_names'])

# Scale data before applying PCA


scaling=StandardScaler()

# Use fit and transform method


scaling.fit(df1)
Scaled_data=scaling.transform(df1)

# Set the n_components=3 Choosing 3 components.


principal=PCA(n_components=3)
principal.fit(Scaled_data)
x=principal.transform(Scaled_data)
# Check the dimensions of data after PCA
print(x.shape)

Output of file breast_cancer2.py: (569, 3)

We see that the dimensions of x are (569, 3) while the dimensions of the actual data are
(569, 30). Thus, it is clear that with PCA the number of dimensions has been reduced from 30 to 3.
If we choose n_components=2, the dimensions would be reduced to 2.

# Check the values of the eigenvectors

# produced by the principal components

principal.components_

Output :

(569, 3)

principal.components_

array([[ 0.21890244, 0.10372458, 0.22753729, 0.22099499, 0.14258969,


0.23928535, 0.25840048, 0.26085376, 0.13816696, 0.06436335,
0.20597878, 0.01742803, 0.21132592, 0.20286964, 0.01453145,
0.17039345, 0.15358979, 0.1834174 , 0.04249842, 0.10256832,
0.22799663, 0.10446933, 0.23663968, 0.22487053, 0.12795256,
0.21009588, 0.22876753, 0.25088597, 0.12290456, 0.13178394],
[-0.23385713, -0.05970609, -0.21518136, -0.23107671, 0.18611304,
0.15189161, 0.06016535, -0.03476751, 0.19034879, 0.36657546,
-0.10555215, 0.08997968, -0.08945724, -0.15229263, 0.20443045,
0.23271592, 0.19720729, 0.13032155, 0.18384799, 0.28009203,
-0.21986638, -0.0454673 , -0.19987843, -0.21935186, 0.17230435,
0.14359318, 0.09796411, -0.00825725, 0.14188335, 0.27533946],
[-0.00853123, 0.06454988, -0.00931421, 0.02869953, -0.10429186,
-0.07409152, 0.00273378, -0.02556361, -0.04023984, -0.02257411,
0.26848137, 0.37463367, 0.26664535, 0.21600653, 0.30883897,
0.15477987, 0.17646378, 0.22465751, 0.2885842 , 0.21150378,
-0.04750696, -0.0422978 , -0.04854648, -0.01190227, -0.25979759,
-0.23607559, -0.1730574 , -0.17034417, -0.27131263, -0.2327914 ]])

The principal.components_ attribute provides an array in which the number of rows equals the
number of principal components, while the number of columns equals the number of features in
the actual data. We can easily see that there are three rows, as n_components was chosen to be 3,
and each row has 30 columns, as in the actual data.
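One quick way to read these loadings is to wrap principal.components_ in a DataFrame with the original feature names as column labels. This is only a convenience sketch and assumes the data and principal objects from the listing above.

import pandas as pd

# Rows = principal components, columns = original features
loadings = pd.DataFrame(principal.components_,
                        columns=data['feature_names'],
                        index=['PC1', 'PC2', 'PC3'])

# Features contributing most (by absolute weight) to the first component
print(loadings.loc['PC1'].abs().sort_values(ascending=False).head())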

Adding the following code to the above program will plot the data. For a 2-D plot, add:

plt.figure(figsize=(10,10))
plt.scatter(x[:,0], x[:,1], c=data['target'], cmap='plasma')
plt.xlabel('pc1')
plt.ylabel('pc2')
plt.show()

Output of file breast_cancer3.py

The colors show the two output classes of the original dataset, benign and malignant tumors. It is
clear that the principal components show a clear separation between the two output classes.

For three principal components, we need to plot a 3d graph. x[:,0] signifies the first principal
component. Similarly, x[:,1] and x[:,2] represent the second and the third principal component.

# choose projection='3d' for creating a 3d graph

fig = plt.figure(figsize=(10, 10))
axis = fig.add_subplot(111, projection='3d')

# x[:,0]is pc1,x[:,1] is pc2 while x[:,2] is pc3


axis.scatter(x[:,0],x[:,1],x[:,2], c=data['target'],cmap='plasma')
axis.set_xlabel("PC1", fontsize=10)
axis.set_ylabel("PC2", fontsize=10)
axis.set_zlabel("PC3", fontsize=10)
plt.show()

Output from file breast_cancer4.py


Calculate variance ratio

The explained_variance_ratio_ attribute provides an idea of how much variation is explained by
each principal component.

principal.explained_variance_ratio_

array([0.44272026, 0.18971182, 0.09393163])

Principal Axis Method: PCA searches for a linear combination of variables that extracts the
maximum variance from the variables. Once this component is found, it is removed and PCA
searches for another linear combination that explains the maximum proportion of the remaining
variance, which leads to orthogonal factors. In this method, we analyze total variance.

Eigenvector: a non-zero vector that stays parallel to itself after matrix multiplication. Suppose x is
an eigenvector of dimension r of a matrix M of dimension r*r; then Mx and x are parallel, and we
need to solve Mx = λx, where both x and λ are unknown, to obtain the eigenvectors and eigenvalues.
Principal components show both the common and the unique variance of the variables. It is a
variance-focused approach that seeks to reproduce the total variance and correlation with all
components. The principal components are linear combinations of the original variables weighted
by their contribution to explaining the variance in a particular orthogonal dimension.

Eigenvalues: also known as characteristic roots. An eigenvalue measures the variance in all the
variables that is accounted for by that factor. The ratio of eigenvalues is the ratio of the
explanatory importance of the factors with respect to the variables. If a factor's eigenvalue is low,
it contributes little to the explanation of the variables. In simple words, it measures the amount of
variance in the total given dataset accounted for by the factor. We can calculate a factor's
eigenvalue as the sum of its squared factor loadings over all the variables.
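To connect these ideas with the breast cancer example above, you can compute the eigenvalues of the covariance matrix of the standardized data directly with NumPy and compare them with what PCA reports. This is an illustrative sketch, assuming Scaled_data and principal from the earlier listing.

import numpy as np

# Covariance matrix of the standardized data (30 x 30)
cov = np.cov(Scaled_data, rowvar=False)

# Eigen-decomposition; eigh is used because the covariance matrix is symmetric
eig_values, eig_vectors = np.linalg.eigh(cov)

# Sort eigenvalues (and the corresponding eigenvectors) from largest to smallest
order = np.argsort(eig_values)[::-1]
eig_values = eig_values[order]
eig_vectors = eig_vectors[:, order]

# The largest eigenvalues correspond to the variance captured by PC1, PC2, PC3
print(eig_values[:3])
print(principal.explained_variance_)        # should be (nearly) the same values
print(eig_values[:3] / eig_values.sum())    # matches explained_variance_ratio_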

File name scatter_matrix2.py

#Program to create scatter-matrix for breast cancer data set

#Program to check prediction by splitting database using scatter matrix plot

import pandas as pd

import numpy as np

from sklearn.datasets import load_breast_cancer

data=load_breast_cancer()

data.keys()

import matplotlib.pyplot as plt

#Exploratory data

# Check the output classes

print(data['target_names'])

# Check the input attributes

print(data['feature_names'])

# construct a dataframe using pandas

df1=pd.DataFrame(data['data'],columns=data['feature_names'])
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(\

data['data'], data['target'],test_size=0.2, random_state=0)

print("X_train shape: {}".format(X_train.shape))

print("y_train shape: {}".format(y_train.shape))

print("X_test shape: {}".format(X_test.shape))

print("y_test shape: {}".format(y_test.shape))

# create dataframe from data in X_train

# label the columns using the strings in data.feature_names

df1 = pd.DataFrame(X_train, columns=data.feature_names)

# create a scatter matrix from the dataframe, color by y_train

pd.plotting.scatter_matrix(df1, c=y_train, figsize=(15, 15), marker='o',\

hist_kwds={'bins': 20}, alpha=.8)

plt.show()

Output from file scatter_matrix2.py

['malignant' 'benign']

['mean radius' 'mean texture' 'mean perimeter' 'mean area'

'mean smoothness' 'mean compactness' 'mean concavity'

'mean concave points' 'mean symmetry' 'mean fractal dimension'

'radius error' 'texture error' 'perimeter error' 'area error'

'smoothness error' 'compactness error' 'concavity error'


'concave points error' 'symmetry error' 'fractal dimension error'

'worst radius' 'worst texture' 'worst perimeter' 'worst area'

'worst smoothness' 'worst compactness' 'worst concavity'

'worst concave points' 'worst symmetry' 'worst fractal dimension']

X_train shape: (455, 30)

y_train shape: (455,)

X_test shape: (114, 30)

y_test shape: (114,)

Scatter matrix plot


Shree Mahaveerai Namah
Linear regression Model

#Understanding concept of linear regression.

import numpy as np

import matplotlib.pyplot as plt

X = 2 * np.random.rand(100, 1)

y = 4 + 3 * X + np.random.randn(100, 1)

plt.scatter(X,y)

plt.title('Randomly Generated Linear data set')

plt.xlabel('X1')

plt.ylabel('y')

plt.show()

output of file linear_regression1.py


y = θ0 + θ1 × X

This model has two model parameters, θ0 and θ1 . By tweaking these parameters, you can make
your model represent any linear function.

Before you can use your model, you need to define the parameter values θ0 and θ1. How can
you know which values will make your model perform best? To answer this question, you need
to specify a performance measure. You can either define a utility function (or fitness function)
that measures how good your model is, or you can define a cost function that measures how bad
it is. For linear regression problems, people typically use a cost function that measures the
distance between the linear model’s predictions and the training examples; the objective is to
minimize this distance.
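As a concrete illustration of such a cost function, the sketch below computes the mean squared error of the model y = θ0 + θ1 × X for a candidate pair of parameters; the function name is just for illustration and it assumes the X and y arrays generated above.

import numpy as np

def mse_cost(theta0, theta1, X, y):
    # Mean squared error of the predictions theta0 + theta1 * X
    predictions = theta0 + theta1 * X
    return np.mean((predictions - y) ** 2)

print(mse_cost(0.0, 0.0, X, y))   # a bad guess: large cost
print(mse_cost(4.0, 3.0, X, y))   # near the true parameters: cost close to the noise variance (about 1)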

Now let's compute θ-hat using the Normal Equation, θ̂ = (XᵀX)⁻¹ Xᵀ y. We will use the inv() function from
NumPy's linear algebra module (np.linalg) to compute the inverse of a matrix, and the dot()
method for matrix multiplication:
X_b = np.c_[np.ones((100, 1)), X] # add x0 = 1 to each instance
theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)

The actual function that we used to generate the data is y = 4 + 3x1 + Gaussian noise.

Output from file linear_regression1.py


theta_best = [[3.93600944]

[3.04902507]]

We would have hoped for θ0 = 4 and θ1 = 3 instead of θ0 = 3.936 and θ1 = 3.049. Close
enough, but the noise made it impossible to recover the exact parameters of the original function.

Code for file linear_regression1.py

#Understanding concept of linear regression.

import numpy as np

import matplotlib.pyplot as plt

X = 2 * np.random.rand(100, 1)

y = 4 + 3 * X + np.random.randn(100, 1)

X_b = np.c_[np.ones((100, 1)), X] # add x0 = 1 to each instance

theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)

print('theta_best =', theta_best)


plt.scatter(X,y)

plt.title('Randomly Generated Linear data set')

plt.xlabel('X1')

plt.ylabel('y')

plt.show()

Now you can make predictions using θ :

>>> X_new = np.array([[0], [2]])

>>> X_new_b = np.c_[np.ones((2, 1)), X_new] # add x0 = 1 to each instance

>>> y_predict = X_new_b.dot(theta_best)

>>> y_predict
array([[4.21509616],
       [9.75532293]])

Let's plot this model's predictions:

plt.plot(X_new, y_predict, "r-")

plt.plot(X, y, "b.")

plt.axis([0, 2, 0, 15])

plt.show()

Code for file least_square_fit1.py

#Understanding concept of linear regression.


import numpy as np
import matplotlib.pyplot as plt

X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

X_b = np.c_[np.ones((100, 1)), X] # add x0 = 1 to each instance


theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)
print('theta_best =', theta_best)
X_new = np.array([[0], [2]])
X_new_b = np.c_[np.ones((2, 1)), X_new] # add x0 = 1 to each instance
y_predict = X_new_b.dot(theta_best)
print('y_predict =',y_predict)
#array([[4.21509616], [9.75532293]])
#Let’s plot this model’s predictions.
plt.plot(X_new, y_predict, "r-")
plt.legend(['Predictions'])
plt.plot(X, y, "b.")
plt.axis([0, 2, 0, 15])
plt.show()

Output

theta_best = [[3.9519002 ]

[2.95492963]]

y_predict = [[3.9519002 ]

[9.86175946]]
Box plot

Boxplots

A boxplot is a standardized way of displaying the distribution of data based on a five number
summary (“minimum”, first quartile (Q1), median, third quartile (Q3), and “maximum”). It can
tell you about your outliers and what their values are. It can also tell you if your data is
symmetrical, how tightly your data is grouped, and if and how your data is skewed.

The image below is a boxplot.
For some distributions/datasets, you will find that you need more information than the measures
of central tendency (median, mean, and mode).

A box plot is a method for graphically depicting groups of numerical data through their quartiles
(also called quantiles). The box extends from the Q1 to Q3 quartile values of the data, with a
line at the median (Q2). The whiskers extend from the edges of box to show the range of the
data. The position of the whiskers is set by default to 1.5 * IQR (IQR = Q3 - Q1, Inter-quartile
range) from the edges of the box. Outlier points are those past the end of the whiskers. Make a
box-and-whisker plot from DataFrame columns, optionally grouped by some other columns.
Code for box plot

import numpy as np

import pandas as pd

import random

import matplotlib.pyplot as plt

import math

import seaborn as sns

np.random.seed(1234)

df=pd.DataFrame(np.random.randn(10,4),columns=['col1','col2','col3','col4'])

df #DataFrame object with random numbers.

Out[2]:

col1 col2 col3 col4

0 0.471435 -1.190976 1.432707 -0.312652

1 -0.720589 0.887163 0.859588 -0.636524

2 0.015696 -2.242685 1.150036 0.991946

3 0.953324 -2.021255 -0.334077 0.002118

4 0.405453 0.289092 1.321158 -1.546906

5 -0.202646 -0.655969 0.193421 0.553439

6 1.318152 -0.469305 0.675554 -1.817027

7 -0.183109 1.058969 -0.397840 0.337438

8 1.047579 1.045938 0.863717 -0.122092

9 0.124713 -0.322795 0.841675 2.390961

Now the code for boxplot:

boxplot = df.boxplot(column=['col1', 'col2', 'col3'])

Figure 4.1.3.5 Box plot for different columns with grid visible

boxplot = df.boxplot(column=['col1', 'col2'], grid=False)

Figure 4.1.3.6 Box plot for fewer columns than the previous figure, with grid invisible
import numpy as np

import pandas as pd

import random

import matplotlib.pyplot as plt

import math

import seaborn as sns

np.random.seed(1234)

df=pd.DataFrame(np.random.randn(6,2),columns=['col1','col2'])

df['X']=pd.Series(['A','A','A','B','B','B'])

df

Out[2]:

col1 col2 X

0 0.471435 -1.190976 A

1 1.432707 -0.312652 A

2 -0.720589 0.887163 A

3 0.859588 -0.636524 B

4 0.015696 -2.242685 B

5 1.150036 0.991946 B

Now a box plot using the by argument of pandas.

Explanation of the group-by usage in pandas: the DataFrame above has three columns, namely
col1, col2 and X. Column 'X' has two categories, 'A' and 'B'. For the 'A' rows, col1 and col2 hold
numeric data; the boxplot corresponding to this part of col1 is shown on the left side of the
diagram and is labeled accordingly. In the same diagram, the plot for 'B' is shown on the right
side. This grouping feature of pandas is very helpful when doing analytics, and all the usual
box-and-whisker interpretation applies to each plot.
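The grouped box plot described above can be produced with the by argument of DataFrame.boxplot(). A minimal sketch, assuming the df with columns col1, col2 and X from the listing above:

# One panel per numeric column; within each panel, one box per category of 'X'
df.boxplot(column=['col1', 'col2'], by='X')
plt.show()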

How to Interpret a Boxplot

Here is how to read a boxplot. The median is indicated by the vertical line that runs down the
center of the box. In the boxplot below for col1 by ‘A’, the median is between -0.9 and 0.9,
around 0.5. Additionally, boxplots display two common measures of the variability or spread in a
data set.

➢ Range. If you are interested in the spread of all the data, it is represented on a boxplot by
the horizontal distance between the smallest value and the largest value, including any
outliers. In the boxplot below, data values range from about -2.0 (the smallest non-
outlier) to about 1.5 (the largest non-outlier), so the range is 3.5. If you ignore outliers
(there are none in this diagram), the range is illustrated by the distance between the
opposite ends of the whiskers - about 3.5 in the boxplot above.

➢ Interquartile range (IQR=Q3 –Q1). The middle half of a data set falls within the
interquartile range. In a boxplot, the interquartile range is represented by the width of the
box (Q3 minus Q1). In the box plot below for col2 and ‘B’, the interquartile range is
equal to about 0.0 minus -1.5 or about 1.5. And finally, boxplots often provide
information about the shape of a data set. The examples below show some common
patterns.
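These summary numbers can also be checked directly in pandas; a small sketch, again assuming the grouped DataFrame df from above:

# Five-number-style summary per group of column 'X'
print(df.groupby('X')[['col1', 'col2']].describe())

# Interquartile range of col2 within group 'B'
q1 = df.loc[df['X'] == 'B', 'col2'].quantile(0.25)
q3 = df.loc[df['X'] == 'B', 'col2'].quantile(0.75)
print('IQR =', q3 - q1)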

Shree Mahaveerai Namah


Underfitting and Overfitting

In machine learning, we aim to build predictive models that forecast the outcome for a given
input data. To achieve this, we take additional steps to tune the trained model. So, we evaluate
the performance of several candidate models to choose the best-performing one.
However, deciding on the best-performing model is not a straightforward task because selecting
the model with the highest accuracy doesn’t guarantee it’ll generate error-free results in the
future. Hence, we apply train-test splits and cross-validation to estimate the model’s
performance on unseen data.

What Are Underfitting and Overfitting

Overfitting happens when we train a machine learning model that is tuned too closely to the training set.
As a result, the model learns the training data too well, but it can't generate good predictions for
unseen data. An overfitted model produces low-accuracy results for data points unseen in
training and hence leads to non-optimal decisions.

A model unable to produce sensible results on new data is also called “not able to generalize.” In
this case, the model is too complex, and the patterns existing in the dataset are not well
represented. Such a model with high variance overfits.

Overfitting models produce good predictions for data points in the training set but perform
poorly on new samples.

Underfitting occurs when the machine learning model is not well tuned to the training set. The
resulting model does not capture the relationship between input and output well enough.
Therefore, it doesn't produce accurate predictions, even for the training dataset. As a result, an
underfitted model generates poor results that lead to high-error decisions, just like an
overfitted model.

An underfitted model is not complex enough to recognize the patterns in the dataset. Usually, it
has a high bias towards one output value. This is because it considers the variations of the input
data as noise and generates similar outputs regardless of the given input.

When training a model, we want it to fit well to the training data. Still, we want it to
generalize and generate accurate predictions for unseen data, as well. As a result, we don’t
want the resulting model to be on any extreme.

Let's consider a dataset residing on an S-shaped curve, such as a logistic (sigmoid) curve.
Fitting a high-degree polynomial passing through the known points with zero error is always
possible. On the other hand, we can fit a straight line with a high error rate.

The first solution generates an overly complex model and models the implicit noise as well as the
dataset. As a result, we can expect a high error for a new data point on the original S-shaped
curve.

Conversely, the second model is far too simple to capture the relationship between the input and
output. Hence, it will perform poorly on new data, too, as shown in figure below.
(Figure: left, High Variance / Low Bias (overfitting); right, Low Variance / High Bias (underfitting))
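A small illustrative sketch of this contrast (not one of the course files): fit a straight line and a high-degree polynomial to noisy S-shaped data and compare the training and test errors.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-4, 4, 60)).reshape(-1, 1)
y = 1 / (1 + np.exp(-X)).ravel() + rng.normal(0, 0.05, 60)   # noisy S-shaped data

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for degree in (1, 15):   # degree 1 tends to underfit, degree 15 tends to overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(degree,
          mean_squared_error(y_train, model.predict(X_train)),
          mean_squared_error(y_test, model.predict(X_test)))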

Bias

This part of the generalization error is due to wrong assumptions, such as assuming that the data
is linear when it is actually quadratic. A high-bias model is most likely to underfit the training
data.

Variance

This part is due to the model's excessive sensitivity to small variations in the training
data. A model with many degrees of freedom (such as a high-degree polynomial model) is likely
to have high variance, and thus to overfit the training data.

Irreducible error

This part is due to the noisiness of the data itself. The only way to reduce this part of the error is
to clean up the data (e.g., fix the data sources, such as broken sensors, or detect and remove
outliers)

Balancing the errors due to overfitting and underfitting is referred to as the bias-variance tradeoff.
We aim to find a good fitting model in between, i.e. one with low bias and low variance.

Summary:

Overfitting                        | Underfitting
Model is too complex               | Model is not complex enough
Accurate for training set          | Not accurate for training set
Not accurate for validation set    | Not accurate for validation set
Need to reduce complexity          | Need to increase complexity
Reduce number of features          | Increase number of features
Apply regularization               | Reduce regularization
Reduce training                    | Increase training
Add training examples              | Add training examples
Shree Mahaveerai Namah

Structured vs. unstructured data:

Data is the lifeblood of business, and it comes in a huge variety of formats — everything from
strictly formed relational databases to your last post on Facebook. All of that data, in all different
formats, can be sorted into one of two categories: structured or unstructured data.

Structured vs. unstructured data can be understood by considering the who, what, when, where,
and the how of the data:

1. Who will be using the data?


2. What type of data are you collecting?
3. When does the data need to be prepared, before storage or when used?
4. Where will the data be stored?
5. How will the data be stored?

These five questions highlight the fundamentals of both structured and unstructured data, and
allow general users to understand how the two differ. They will also help users understand
nuances like semi-structured data, and guide us as we navigate the future of data in the cloud.

What is structured data?

Structured data is data that has been predefined and formatted to a set structure before
being placed in data storage, which is often referred to as schema-on-write. The best example
of structured data is the relational database: the data has been formatted into precisely defined
fields, such as credit card numbers or address, in order to be easily queried with SQL.

Pros of structured data

There are three key benefits of structured data:

1. Easy use by machine learning algorithms: The largest benefit of structured data
is how easily it can be used by machine learning. The specific and organized nature
of structured data allows for easy manipulation and querying of that data.
2. Easy use by business users: Another benefit of structured data is that it can be
used by an average business user with an understanding of the topic to which the
data relates. There is no need to have an in-depth understanding of various different
types of data or the relationships of that data. It opens up self-service data access to
the business user.
3. Increased access to more tools: Structured data also has the benefit of having
been in use for far longer; historically, it was the only option. Data managers have
more product choices when using structured data because there are more tools that
have been tried and tested for using and analyzing structured data.

Cons of structured data

The cons of structured data are rooted in a lack of data flexibility. Here are some potential
drawbacks to the use of structured data:

1. A predefined purpose limits use. While schema-on-write data definition is a large
benefit of structured data, it is also true that data with a predefined structure can
only be used for its intended purpose. This limits its flexibility and use cases.
2. There are limited storage options. Structured data is generally stored in data
warehouses. Data warehouses are data storage systems with rigid schemas. Any
change in requirements means updating all of that structured data to meet the new
needs. This results in a massive expenditure of resources and time. Some of the cost
can be mitigated by using a cloud-based data warehouse, as this allows for greater
scalability and eliminates the maintenance expenses generated by having
equipment on-premises.

Examples of structured data

Structured data is everywhere. It’s the basis for inventory control systems and ATMs. It can be
human- or machine-generated.

Common examples of machine-generated structured data are weblog statistics and point of sale
data, such as barcodes and quantity. And don’t forget spreadsheets — a classic example of
human-generated structured data.

What is unstructured data?

Unstructured data is data stored in its native format and not processed until used, which is known
as schema-on-read. It comes in a myriad of file formats, including email (semi-structured), social
media posts, presentations, chats, IoT sensor data, and satellite imagery.

Pros of unstructured data

As with the pros and cons of structured data, unstructured data also has strengths and weaknesses
for specific business needs. Some of its benefits include:

1. Freedom of the native format: Because unstructured data is stored in its native
format, the data is not defined until it is needed. This leads to a larger pool of use
cases, because the purpose of the data is adaptable. It allows for preparation and
analysis of only the data needed. The native format also allows for a wider variety
of file formats in the database, because the data that can be stored is not restricted
to a specific format. That means the company has more data to draw from.
2. Faster accumulation rates: Another benefit of unstructured data is in data
accumulation rates. There is no need to predefine the data, which means it can be
collected quickly and easily.
3. Better pricing and scalability: Unstructured data is often stored in cloud data
lakes, which allow for massive storage. Cloud data lakes also allow for pay-as-
you-use storage pricing, which helps cut costs and allows for easy scalability.

Cons of unstructured data

There are also cons to using unstructured data. The biggest challenge is that it requires both
specific expertise and specialized tools in order to be used to its fullest potential.

1. Data science expertise: The largest drawback to unstructured data is that data
science expertise is required to prepare and analyze the data. A standard business
user cannot use unstructured data as-is due to its undefined/non-formatted nature.
Using unstructured data requires understanding the topic or area of the data, but
also of how the data can be related to make it useful.
2. Specialized tools: In addition to the required professional expertise, unstructured
data requires specialized tools to manipulate. Standardized tools are intended for
use with structured data, which leaves a data manager with limited choices in
products — some of which are still in their infancy — for utilizing unstructured
data.

Examples of unstructured data

Unstructured data is qualitative rather than quantitative, which means that it is more
characteristic and categorical in nature.

It lends itself well to use cases such as determining how effective a marketing campaign is, or to
uncovering potential buying trends through social media and review websites. Because it can be
used to detect patterns in chats or suspicious email trends, it’s also very useful to organizations in
assisting with monitoring for policy compliance.

Structured data vs. unstructured data

The difference between structured data and unstructured data comes down to the types of data
that can be used for each, the level of data expertise required to make use of that data, and on-
write data vs. on-read schema.
      | Structured Data                    | Unstructured data

Who   | Self-service access                | Requires data science expertise

What  | Only select data types             | Many varied types conglomerated

When  | Schema-on-write                    | Schema-on-read

Where | Commonly stored in data warehouses | Commonly stored in data lakes

How   | Predefined format                  | Native format

Structured data is highly specific and is stored in a predefined format, where unstructured data is
a compilation of many varied types of data that are stored in their native formats. This means that
structured data takes advantage of schema-on-write and unstructured data employs schema-on-
read.

Structured data is commonly stored in data warehouses and unstructured data is stored in data
lakes. Both have cloud-use potential, but structured data allows for less storage space and
unstructured data requires more.

The last difference may hold the most impact. Structured data can be used by the average
business user, but unstructured data requires data science expertise in order to gain
accurate business intelligence.

What is semi-structured data?

Semi-structured data refers to what would normally be considered unstructured data, but that
includes metadata that identifies certain characteristics. The metadata contains enough
information to enable the data to be more efficiently cataloged, searched, and analyzed than
strictly unstructured data. Think of semi-structured data as in between structured and
unstructured data.

A good example of semi-structured data vs. structured data would be a tab delimited file
containing customer data versus a database containing CRM tables. On the other hand, semi-
structured data has more hierarchy than unstructured data; the tab delimited file is more specific
than a list of comments from a customer’s Instagram.

How is structured data different from unstructured data?

Structured data is:

➢ In the form of numbers and text, in standardized, readable formats.
➢ Typically XML and CSV.
➢ Follows a predefined relational data model.
➢ Stored in a relational database in tables, rows, and columns, with specific labels. Relational
databases use SQL for processing.
➢ Easy to search and use with ample analytics tools available.
➢ Quantitative (has countable elements), easy to group based on attributes or characteristics.

Unstructured data is:

➢ Comes in a variety of shapes and sizes that do not conform to a predefined data model, and
remains in its native format.
➢ Typically DOC, WMV, MPW, MP3, WAV.
➢ Does not have a data model, though it may have hidden structure.
➢ Stored in unstructured raw formats or in a NoSQL database. Many companies use data lakes
to store large volumes of unstructured data that they can then access when needed.
➢ Requires complex search, processing, and analysis before it can be placed in a relational
database.
➢ Qualitative, with subjective information that must be split, stacked, grouped, mined and
patterned to analyze it.
Shree Mahaveerai Namah
Web Scraping

Web scraping is a technique used to extract data from websites. Python provides
several libraries to perform web scraping, including Requests and BeautifulSoup . Here
are the basic steps to perform web scraping using Python:

1. Find the URL of the website you want to scrape.


2. Inspect the page to find the data you want to extract.
3. Write the code to extract the data.
4. Run the code and extract the data.
5. Store the data in the required format.

Here is an example of how to perform web scraping using


the Requests and BeautifulSoup libraries:

import requests

from bs4 import BeautifulSoup

url = 'https://www.example.com'  # You can put the Ismail College URL here.

response = requests.get(url)

soup = BeautifulSoup(response.content, 'html.parser')

data = soup.find_all('div', class_='example-class')

In this example, we first import the requests and BeautifulSoup libraries. We then define the
URL of the website we want to scrape and use the requests library to send a GET request to the
URL. We then use the BeautifulSoup library to parse the HTML content of the response and
extract the data we want. Finally, we store the extracted data in the data variable.
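Continuing the example, steps 4 and 5 (extracting and storing the data) could look roughly like this; the tag and class names are placeholders, so adapt them to the page you actually scrape.

import pandas as pd

# Pull the visible text out of each matched element
rows = [element.get_text(strip=True) for element in data]

# Store the scraped text in a simple CSV file
pd.DataFrame({'text': rows}).to_csv('scraped_data.csv', index=False)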
Shree Mahaveerai Namah
Where To Go Now?

Now that you have even more information about your data and you have a visualization ready, it
does seem a bit like the data points sort of group together, but you also see there is quite some
overlap.

This might be interesting to investigate further.

Do you think that, in a case where you knew that there are 10 possible digits labels to assign to
the data points, but you have no access to the labels, the observations would group or “cluster”
together by some criterion in such a way that you could infer the labels?

Now this is a research question!

In general, when you have acquired a good understanding of your data, you have to decide on the
use cases that would be relevant to your data set. In other words, you think about what your data
set might teach you or what you think you can learn from your data.

From there on, you can think about what kind of algorithms you would be able to apply to your
data set in order to get the results that you think you can obtain.

Tip: the more familiar you are with your data, the easier it will be to assess the use cases for
your specific data set. The same also holds for finding the appropriate machine learning algorithm.

However, when you're first getting started with scikit-learn, you'll see that the number of
algorithms that the library contains is pretty vast and that you might still want additional help
when you're doing the assessment for your data set. That's why the scikit-learn machine learning
map will come in handy.
Note that this map does require you to have some knowledge about the algorithms that are
included in the scikit-learn library. This, by the way, also holds some truth for taking this next
step in your project: if you have no idea what is possible, it will be very hard to decide on what
your use case will be for the data.

As your use case was one for clustering, you can follow the path on the map towards "KMeans".
You'll see that the use case you have just thought about requires you to have more than 50
samples ("check!"), to not have labeled data ("check!"), to know the number of categories that you
want to predict ("check!") and to have fewer than 10K samples ("check!").

But what exactly is the K-Means algorithm?

It is one of the simplest and widely used unsupervised learning algorithms to solve clustering
problems. The procedure follows a simple and easy way to classify a given data set through a
certain number of clusters that you have set before you run the algorithm. This number of
clusters is called k and you select this number at random.

Definition

Then, the k-means algorithm will find the nearest cluster center for each data point and
assign the data point to that cluster.

Once all data points have been assigned to clusters, the cluster centers will be recomputed. In
other words, new cluster centers will emerge from the average of the values of the cluster data
points. This process is repeated until most data points stick to the same cluster. The cluster
membership should stabilize.
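To make these two steps concrete, here is a small NumPy sketch of one assignment-and-update round; it is only an illustration of the idea, not the scikit-learn implementation used later in this section.

import numpy as np

def kmeans_step(X, centers):
    # Distance from every point to every cluster centre
    distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    # Assignment step: each point goes to its nearest centre
    labels = distances.argmin(axis=1)
    # Update step: each centre becomes the mean of the points assigned to it
    # (for simplicity this sketch assumes no cluster ends up empty)
    new_centers = np.array([X[labels == k].mean(axis=0)
                            for k in range(len(centers))])
    return labels, new_centers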

You can already see that, because the k-means algorithm works the way it does, the initial set of
cluster centers that you provide can have a big effect on the clusters that are eventually found.
You can, of course, deal with this effect, as you will see further on.

However, before you can go into making a model for your data, you should definitely take a look
into preparing your data for this purpose.

Preprocessing Your Data

As you have read in the previous section, before modeling your data, you’ll do well by preparing
it first. This preparation step is called “preprocessing”.

Data Normalization

The first thing that we're going to do is preprocess the data. You can standardize the digits data as follows:
# Import

from sklearn.preprocessing import scale

# Apply `scale()` to the `digits` data

data = scale(digits.data)

code for file data_set8.py

# Import `datasets` from `sklearn`

from sklearn import datasets

import pandas as pd

from sklearn.preprocessing import scale

# Load in the `digits` data

digits = datasets.load_digits()

data=scale(digits.data)

#Print the `digits` data

print(data)

By scaling the data, you shift the distribution of each attribute to have a mean of zero and a
standard deviation of one (unit variance).
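You can quickly check this yourself; a tiny sketch, assuming the data array from data_set8.py above:

import numpy as np

# Each attribute should now have mean ~0 and standard deviation ~1
# (all-zero pixel columns in the digits data simply stay at 0)
print(np.round(data.mean(axis=0), 2))
print(np.round(data.std(axis=0), 2))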

Splitting Your Data Into Training And Test Sets

In order to assess your model’s performance later, you will also need to divide the data set into
two parts: a training set and a test set. The first is used to train the system, while the second is
used to evaluate the learned or trained system.

Splitting the dataset


Splitting the dataset is the next step in data preprocessing in machine learning. Every dataset
for Machine Learning model must be split into two separate sets – training set and test set.

Training set denotes the subset of a dataset that is used for training the machine learning
model. Here, you are already aware of the output. A test set, on the other hand, is the subset of
the dataset that is used for testing the machine learning model. The ML model uses the test set
to predict outcomes.

Usually, the dataset is split into 70:30 ratio or 80:20 ratio. This means that you either take
70% or 80% of the data for training the model while leaving out the rest 30% or 20%. The
splitting process varies according to the shape and size of the dataset in question.

In practice, the division of your data set into a test and a training set is disjoint: a common
splitting choice is to take 2/3 of your original data set as the training set, while the 1/3 that
remains composes the test set. Another popular choice is a 75/25 split.

You will do something similar here. You see in the code chunk below that the test_size argument
of the train_test_split() method is set to 0.25, so 25% of the data is held out for testing.

You’ll also note that the argument random_state has the value 42 assigned to it. With this
argument, you can guarantee that your split will always be the same. That is particularly handy if
you want reproducible results.

# Import `train_test_split`

from sklearn.model_selection import train_test_split


# Split the `digits` data into training and test sets

X_train, X_test, y_train, y_test, images_train, images_test = \

train_test_split(data, digits.target, digits.images, test_size=0.25, random_state=42)

Code for file data_set9.py

# Import `datasets` from `sklearn`

from sklearn import datasets

import pandas as pd

from sklearn.preprocessing import scale

# Load in the `digits` data

digits = datasets.load_digits()

data=scale(digits.data)

# Import `train_test_split`

from sklearn.model_selection import train_test_split

# Split the `digits` data into training and test sets

X_train, X_test, y_train, y_test, images_train, images_test = \

train_test_split(data, digits.target, digits.images, test_size=0.25,\

random_state=42)

#Print the `digits` data

print(images_train)

After you have split up your data set into train and test sets, you can quickly inspect the numbers
before you go and model the data:
# Number of training features

n_samples, n_features = X_train.shape

# Print out `n_samples`

print(n_samples)

# Print out `n_features`

print(n_features)

# Number of Training labels

n_digits = len(np.unique(y_train))

# Inspect `y_train`

print(len(y_train))

code for file data_set10.py

# Import `datasets` from `sklearn`

from sklearn import datasets

import pandas as pd

import numpy as np

from sklearn.preprocessing import scale

# Load in the `digits` data

digits = datasets.load_digits()

data=scale(digits.data)

# Import `train_test_split`

from sklearn.model_selection import train_test_split

# Split the `digits` data into training and test sets

X_train, X_test, y_train, y_test, images_train, images_test = \

train_test_split(data, digits.target, digits.images, test_size=0.25,\


random_state=42)

# Number of training features

n_samples, n_features = X_train.shape

# Print out `n_samples`

print(n_samples)

# Print out `n_features`

print(n_features)

# Number of Training labels

n_digits = len(np.unique(y_train))

# Inspect `y_train`

print(len(y_train))

#Print the `digits` data

#print(images_train)

You'll see that the training set X_train now contains 1347 samples, which is 75% of the samples
that the original data set contained (since test_size was set to 0.25), and 64 features, which hasn't
changed. The y_train training set also contains 75% of the labels of the original data set. This
means that the test sets X_test and y_test contain 450 samples.

Clustering The digits Data

After all these preparation steps, you have made sure that all your known (training) data is
stored. No actual model or learning was performed up until this moment.

Now, it’s finally time to find those clusters of your training set. Use KMeans() from
the cluster module to set up your model. You’ll see that there are three arguments that are passed
to this method: init, n_clusters and the random_state.

You might still remember this last argument from before when you split the data into training
and test sets. This argument basically guaranteed that you got reproducible results.
# Import the `cluster` module

from sklearn import cluster

# Create the KMeans model

clf = cluster.KMeans(init='k-means++', n_clusters=10, random_state=42)

# Fit the training data `X_train`to the model

clf.fit(X_train)

The init indicates the method for initialization and even though it defaults to 'k-means++', you
see it explicitly coming back in the code. That means that you can leave it out if you want.

Next, you also see that the n_clusters argument is set to 10. This number not only indicates the
number of clusters or groups you want your data to form, but also the number of centroids to
generate. Remember that a cluster centroid is the middle of a cluster.

Do you also still remember how the previous section described this as one of the possible
disadvantages of the K-Means algorithm?

That is, that the initial set of cluster centers that you provide can have a big effect on the clusters
that are eventually found?

Usually, you try to deal with this effect by trying several initial sets in multiple runs and by
selecting the set of clusters with the minimum sum of the squared errors (SSE). In other words,
you want to minimize the distance of each point in the cluster to the mean or centroid of that
cluster.

By adding the n_init argument to KMeans(), you can determine how many different centroid
configurations the algorithm will try.
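In scikit-learn this sum of squared distances is exposed as the inertia_ attribute; a small sketch of trying several centroid configurations via n_init, assuming X_train from above:

from sklearn import cluster

# Try 10 different centroid seeds; the run with the lowest inertia (SSE) is kept
clf = cluster.KMeans(init='k-means++', n_clusters=10, n_init=10, random_state=42)
clf.fit(X_train)

# Sum of squared distances of the samples to their closest cluster centre
print(clf.inertia_)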

Note again that you don’t want to insert the test labels when you fit the model to your data: these
will be used to see if your model is good at predicting the actual classes of your instances!

You can also visualize the images that make up the cluster centers as follows:

# Import matplotlib

import matplotlib.pyplot as plt


# Figure size in inches

fig = plt.figure(figsize=(8, 3))

# Add title

fig.suptitle('Cluster Center Images', fontsize=14, fontweight='bold')

# For all labels (0-9)

for i in range(10):

# Initialize subplots in a grid of 2X5, at i+1th position

ax = fig.add_subplot(2, 5, 1 + i)

# Display images

ax.imshow(clf.cluster_centers_[i].reshape((8, 8)), cmap=plt.cm.binary)

# Don't show the axes

plt.axis('off')

# Show the plot

plt.show()
Code for file data_set_cluster4.py

# Program for observing centroids of clusters

from sklearn.preprocessing import scale

from sklearn import datasets

import pandas as pd

# Import the `cluster` module

from sklearn import cluster

# Load in the `digits` data

digits = datasets.load_digits()

data=scale(digits.data)

# Import `train_test_split`

from sklearn.model_selection import train_test_split


# Split the `digits` data into training and test sets

X_train, X_test, y_train, y_test, images_train, images_test = \

train_test_split(data, digits.target, digits.images, test_size=0.25,\

random_state=42)

# Create the KMeans model

clf = cluster.KMeans(init='k-means++', n_clusters=10,

random_state=42)

# Fit the training data `X_train`to the model

clf.fit(X_train)

import matplotlib.pyplot as plt

# Figure size in inches

fig = plt.figure(figsize=(8, 3))

# Add title

fig.suptitle('Cluster Center Images', fontsize=14, fontweight='bold')

# For all labels (0-9)

for i in range(10):

# Initialize subplots in a grid of 2X5, at i+1th position

ax = fig.add_subplot(2, 5, 1 + i)

# Display images

ax.imshow(clf.cluster_centers_[i].reshape((8, 8)), cmap=plt.cm.binary)

# Don't show the axes

plt.axis('off')
# Show the plot

plt.show()

output of the above code for file data_set_cluster4.py


The next step is to predict the labels of the test set:

# Predict the labels for `X_test`

y_pred=clf.predict(X_test)

# Print out the first 100 instances of `y_pred`

print(y_pred[:100])

# Print out the first 100 instances of `y_test`

print(y_test[:100])

# Study the shape of the cluster centers

print(clf.cluster_centers_.shape)

By adding the above code, we get the following output (data_set_cluster5.py) in addition to the
cluster centre images.

[5 0 0 4 9 8 2 2 2 9 6 3 7 1 7 6 0 4 2 9 7 9 0 4 2 5 0 9 5 0 7 0 8 7 7 5 0
 7 4 5 5 0 6 0 5 3 0 1 5 2 2 6 0 2 5 1 0 1 1 3 1 7 2 9 7 0 4 1 4 2 0 0 2 7
 4 1 3 2 0 0 0 1 9 2 2 1 5 7 7 0 3 8 3 0 0 8 0 1 7 7]

[6 9 3 7 2 1 5 2 5 2 1 9 4 0 4 2 3 7 8 8 4 3 9 7 5 6 3 5 6 3 4 9 1 4 4 6 9
 4 7 6 6 9 1 3 6 1 3 0 6 5 5 1 9 5 6 0 9 0 0 1 0 4 5 2 4 5 7 0 7 5 9 5 5 4
 7 0 4 5 5 9 9 0 2 3 8 0 6 4 4 9 1 2 8 3 5 2 9 0 4 4]

(10, 64)

In the code chunk above, you predict the values for the test set, which contains 450 samples. You
store the result in y_pred. You also print out the first 100 instances of y_pred and y_test and you
immediately see some results.

In addition, you can study the shape of the cluster centers: you immediately see that there are 10
clusters, each with 64 features.

But this doesn’t tell you much because we set the number of clusters to 10 and you already knew
that there were 64 features.

Maybe a visualization would be more helpful.

Let’s visualize the predicted labels:

# Import `Isomap()`

from sklearn.manifold import Isomap

# Create an isomap and fit the `digits` data to it

X_iso = Isomap(n_neighbors=10).fit_transform(X_train)

# Compute cluster centers and predict cluster index for each sample

clusters = clf.fit_predict(X_train)

# Create a plot with subplots in a grid of 1X2

fig, ax = plt.subplots(1, 2, figsize=(8, 4))

# Adjust layout
fig.suptitle('Predicted Versus Training Labels', fontsize=14, fontweight='bold')

fig.subplots_adjust(top=0.85)

# Add scatterplots to the subplots

ax[0].scatter(X_iso[:, 0], X_iso[:, 1], c=clusters)

ax[0].set_title('Predicted Training Labels')

ax[1].scatter(X_iso[:, 0], X_iso[:, 1], c=y_train)

ax[1].set_title('Actual Training Labels')

# Show the plots

plt.show()

You use Isomap() as a way to reduce the dimensions of your high-dimensional data set digits.
The difference with the PCA method is that the Isomap is a non-linear reduction method.

Output from my laptop of file data_set_cluster6.py


Tip: run the code from above again, but use the PCA reduction method instead of the Isomap to
study the effect of reduction methods yourself.

You will find the solution here:

# Import `PCA()`

from sklearn.decomposition import PCA

# Model and fit the `digits` data to the PCA model

X_pca = PCA(n_components=2).fit_transform(X_train)

# Compute cluster centers and predict cluster index for each sample

clusters = clf.fit_predict(X_train)

# Create a plot with subplots in a grid of 1X2

fig, ax = plt.subplots(1, 2, figsize=(8, 4))


# Adjust layout

fig.suptitle('Predicted Versus Training Labels', fontsize=14, fontweight='bold')

fig.subplots_adjust(top=0.85)

# Add scatterplots to the subplots

ax[0].scatter(X_pca[:, 0], X_pca[:, 1], c=clusters)

ax[0].set_title('Predicted Training Labels')

ax[1].scatter(X_pca[:, 0], X_pca[:, 1], c=y_train)

ax[1].set_title('Actual Training Labels')

# Show the plots

plt.show()

output of file data_set_cluster8.py


Shree Mahaveerai Namah
from sklearn.datasets import fetch_openml

mnist=fetch_openml('mnist_784', version=1)

mnist.keys()

dict_keys(['data', 'target', 'frame', 'categories', 'feature_names', 'target_names', 'DESCR', 'details',


'url'])

X,y=mnist['data'],mnist['target']

X.shape

(70000, 784)

y.shape

(70000,)

#Program for mnist datasets

import matplotlib as mpl

import matplotlib.pyplot as plt

import pandas as pd

import numpy as np

from sklearn.datasets import fetch_openml

mnist=fetch_openml('mnist_784', version=1)
#mnist.keys()

"""dict_keys(['data', 'target', 'frame', 'categories', 'feature_names',

'target_names', 'DESCR', 'details', 'url'])"""

X,y=mnist['data'],mnist['target']

some_digit = X.loc[4]

some_digit_array=some_digit.to_numpy()

some_digit_image = some_digit_array.reshape(28,28)

plt.imshow(some_digit_image)

plt.axis("off")

plt.show()

Output of file mnist1.py


Now we will apply the SGDClassifier to this data.

# Program for classification

# Program for sgd classifier on mnist dataset

import matplotlib as mpl

import matplotlib.pyplot as plt

import pandas as pd

import numpy as np

from sklearn.datasets import fetch_openml

from sklearn.linear_model import SGDClassifier


from sklearn.metrics import confusion_matrix

from sklearn.model_selection import cross_val_predict

from sklearn.metrics import precision_score, recall_score, f1_score

mnist=fetch_openml('mnist_784', version=1)

X,y=mnist['data'],mnist['target']

y = y.astype(np.uint8) # Important Y is string convert it to integer.

X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

y_train_9 = (y_train == 9)

y_test_9 = (y_test == 9)

sgd_clf = SGDClassifier(loss='log_loss', alpha=0.01,\

max_iter=1000, random_state=42)

sgd_clf.fit(X_train, y_train_9)

y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_9, cv=3)

Output of file mnist_sgdclassifier1.py

confusion_matrix(y_train_9, y_train_pred)

array([[52482, 1569],

[ 1770, 4179]], dtype=int64)

precision_score(y_train_9, y_train_pred)# 4179/(4179+1569)

0.7270354906054279

recall_score(y_train_9, y_train_pred)# 4179/(4179+1770)

0.702471003530005

f1_score(y_train_9, y_train_pred)

0.7145421903052065
The F1 score favors classifiers that have similar precision and recall. This is not
always what you want: in some contexts you mostly care about precision, and in other contexts
you really care about recall. For example, if you trained a classifier to detect videos
that are safe for kids, you would probably prefer a classifier that rejects many good videos (low
recall) but keeps only safe ones (high precision), rather than a classifier that has a
much higher recall but lets a few really bad videos show up in your product. On the other hand,
suppose you train a classifier to detect shoplifters in CCTV images: it is probably fine if your
classifier has only 30% precision as long as it has 99% recall (sure, the security guards will get a
few false alerts, but almost all shoplifters will get caught).
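If you want to explore this trade-off yourself, you can compute precision and recall for every possible threshold with precision_recall_curve; a small sketch, assuming the y_train_9 labels and the y_scores decision scores produced in the ROC program that follows:

from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt

precisions, recalls, thresholds = precision_recall_curve(y_train_9, y_scores)

plt.plot(thresholds, precisions[:-1], label='precision')
plt.plot(thresholds, recalls[:-1], label='recall')
plt.xlabel('decision threshold')
plt.legend()
plt.grid(True)
plt.show()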

Program for ROC curve

File stored as mnist_sgdclassifier2.py

# Program for classification

# Program for sgd classifier on mnist dataset

import matplotlib as mpl

import matplotlib.pyplot as plt

import pandas as pd

import numpy as np

from sklearn.datasets import fetch_openml

from sklearn.linear_model import SGDClassifier


from sklearn.metrics import confusion_matrix

from sklearn.model_selection import cross_val_predict

from sklearn.metrics import precision_score, recall_score, f1_score, roc_curve

mnist=fetch_openml('mnist_784', version=1)

X,y=mnist['data'],mnist['target']

y = y.astype(np.uint8) # Important Y is string convert it to integer.

X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

y_train_9 = (y_train == 9)

y_test_9 = (y_test == 9)

sgd_clf = SGDClassifier(loss='log_loss', alpha=0.01,\

max_iter=1000, random_state=42)

sgd_clf.fit(X_train, y_train_9)

y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_9, cv=3)

y_scores = cross_val_predict(sgd_clf, X_train, y_train_9, cv=3, method="decision_function")

fpr, tpr, thresholds = roc_curve(y_train_9, y_scores)

plt.plot(fpr, tpr, linewidth=2, label=None)

plt.plot([0, 1], [0, 1], 'k--') # dashed diagonal

# Add axis labels and grid
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.grid(True)

plt.show()

"""

These are commands to be used in the Python shell:

confusion_matrix(y_train_9, y_train_pred)

array([[52482, 1569],

[ 1770, 4179]], dtype=int64)


precision_score(y_train_9, y_train_pred)# 4179/(4179+1569)

0.7270354906054279

recall_score(y_train_9, y_train_pred)# 4179/(4179+1770)

0.702471003530005

f1_score(y_train_9, y_train_pred)

0.7145421903052065"""

Output of file mnist_sgdclassifier2.py


Shree Mahaveerai Namah

A flexible classification technique that shares close ties with the SGD
Regressor is the SGD Classifier. It works by progressively changing model
parameters in the direction of the steepest descent of a loss function. It is
called "stochastic" because it updates these parameters using a randomly
chosen subset of the training data at every iteration. The SGD Classifier is a
useful tool because of its versatility, especially in situations where real-time
learning is required and big datasets are involved. We will examine the
fundamental ideas of the SGD Classifier in this section, dissecting its key
variables and hyperparameters. We will also discuss its potential drawbacks
and examine its benefits, such as scalability and efficiency. By the end, you
will have a thorough grasp of the SGD Classifier and its crucial role in
data-driven decision-making.
Stochastic Gradient Descent
One popular optimization method in deep learning and machine learning
is stochastic gradient descent (SGD). Large datasets and complicated
models benefit greatly from its training. To minimize a loss function, SGD
updates model parameters iteratively. It differentiates itself as "stochastic" by
employing mini-batches, or random subsets, of the training data in each
iteration, which introduces a degree of randomness while maximizing
computational efficiency. By accelerating convergence, this randomness can
aid in escaping local minima. Modern machine learning algorithms rely
heavily on SGD because, despite its simplicity, it may be quite effective
when combined with regularization strategies and suitable learning rate
schedules.
How Stochastic Gradient Descent Works?
Here's how the SGD process typically works:
• Initialize the model parameters randomly or with some default values.
• Randomly shuffle the training data.
• For each training example: Compute the gradient of the cost function with
respect to the current model parameters using the current example.
• Update the model parameters in the direction of the negative gradient by
a small step size known as the learning rate.
• Repeat this process for a specified number of iterations (epochs).
Stochastic Gradient Descent Algorithm
For machine learning model training, initializing the model parameters (θ) and
selecting a low learning rate (α) are the first steps in performing stochastic
gradient descent (SGD). Next, to add unpredictability, the training data is
shuffled at random. In every iteration, the algorithm analyzes a single
training sample and determines the gradient of the cost function J with
respect to the model's parameters. This gradient represents the size and
direction of the steepest slope. The model is adjusted to minimize the cost
function and produce more accurate predictions by updating θ in the direction
opposite to the gradient. The model can efficiently learn from and adjust
to new information by repeating these iterative steps for every data point.
The cost function,J(\theta) , is typically a function of the difference between
the predicted value h_{\theta}(x) and the actual target y . In regression
problems, it's often the mean squared error; in classification problems, it can
be cross-entropy loss, for example.
For Regression (Mean Squared Error):
Cost Function:
J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)}) - y^{(i)}\right)^2
Gradient (Partial Derivatives):
\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)}) - y^{(i)}\right)x_{j}^{(i)} \quad \text{for } j = 0,\dots,n
Update Parameters
Update the model parameters (θ) based on the gradient and the learning
rate:
\theta = \theta -\alpha * \nabla J(\theta)
where,
• θ: Updated model parameters.
• α: Learning rate.
• ∇J(θ): Gradient vector computed.
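To make this update rule concrete, here is a minimal, self-contained sketch of stochastic gradient descent for a simple linear regression on synthetic data; the data, the learning rate and the number of epochs are arbitrary illustrative choices.

import numpy as np

# Synthetic data: y = 4 + 3*x + noise
rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X[:, 0] + rng.normal(0, 0.5, 100)

# Add a bias column so theta[0] plays the role of the intercept
X_b = np.c_[np.ones((100, 1)), X]

theta = np.zeros(2)          # model parameters, initialized to zero
learning_rate = 0.05         # the step size (alpha / eta)
n_epochs = 50

for epoch in range(n_epochs):
    # Shuffle the data at the start of each epoch
    indices = rng.permutation(len(X_b))
    for i in indices:
        xi, yi = X_b[i], y[i]
        # Gradient of the squared error for a single example
        gradient = (xi @ theta - yi) * xi
        # Step in the direction opposite to the gradient
        theta = theta - learning_rate * gradient

print(theta)   # should end up close to [4, 3]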
What is the SGD Classifier?
The SGD Classifier is a linear classification algorithm that aims to find the
optimal decision boundary (a hyperplane) to separate data points belonging
to different classes in a feature space. It operates by iteratively adjusting the
model's parameters to minimize a cost function, often the cross-entropy loss,
using the stochastic gradient descent optimization technique.
How it Differs from Other Classifiers:
The SGD Classifier differs from other classifiers in several ways:
• Stochastic Gradient Descent: Unlike some classifiers that use closed-
form solutions or batch gradient descent (which processes the entire
training dataset in each iteration), the SGD Classifier uses stochastic
gradient descent. It updates the model's parameters incrementally,
processing one training example at a time or in small mini-batches. This
makes it computationally efficient and well-suited for large datasets.
• Linearity: The SGD Classifier is a linear classifier, meaning it constructs
a linear decision boundary to separate classes. This makes it suitable for
problems where the relationship between features and the target variable
is approximately linear. In contrast, algorithms like decision trees or
support vector machines can capture more complex decision boundaries.
• Regularization: The SGD Classifier allows for the incorporation of L1 or
L2 regularization to prevent overfitting. Regularization terms are added to
the cost function, encouraging the model to have smaller parameter
values. This is particularly useful when dealing with high-dimensional
data.
Common Use Cases in Machine Learning
The SGD Classifier is commonly used in various machine learning tasks and
scenarios:
1. Text Classification: It's often used for tasks like sentiment analysis,
spam detection, and text categorization. Text data is typically high-
dimensional, and the SGD Classifier can efficiently handle large feature
spaces.
2. Large Datasets: When working with extensive datasets, the SGD
Classifier's stochastic nature is advantageous. It allows you to train on
large datasets without the need to load the entire dataset into memory,
making it memory-efficient.
3. Online Learning: In scenarios where data streams in real-time, such as
clickstream analysis or fraud detection, the SGD Classifier is well-suited
for online learning. It can continuously adapt to changing data patterns.
4. Multi-class Classification: The SGD Classifier can be used for multi-
class classification tasks by extending the binary classification approach
to handle multiple classes, often using the one-vs-all (OvA) strategy.
5. Parameter Tuning: The SGD Classifier is a versatile algorithm that can
be fine-tuned with various hyperparameters, including the learning rate,
regularization strength, and the type of loss function. This flexibility allows
it to adapt to different problem domains.
Parameters of Stochastic Gradient Descent Classifier
Stochastic Gradient Descent (SGD) Classifier is a versatile algorithm with
various parameters and concepts that can significantly impact its
performance. Here's a detailed explanation of some of the key parameters
and concepts relevant to the SGD Classifier:
1. Learning Rate (α):
• The learning rate (α) is a crucial hyperparameter that determines the size
of the steps taken during parameter updates in each iteration.
• It controls the trade-off between convergence speed and stability.
• A larger learning rate can lead to faster convergence but may result in
overshooting the optimal solution.
• In contrast, a smaller learning rate may lead to slower convergence but
with more stable updates.
• It's important to choose an appropriate learning rate for your specific
problem.
2. Batch Size:
The batch size defines the number of training examples used in each
iteration or mini-batch when updating the model parameters. There are three
common choices for batch size:
• Stochastic Gradient Descent (batch size = 1): In this case, the model
parameters are updated after processing each training example. This
introduces significant randomness and can help escape local minima but
may result in noisy updates.
• Mini-Batch Gradient Descent (1 < batch size < number of training
examples): Mini-batch SGD strikes a balance between the efficiency of
batch gradient descent and the noise of stochastic gradient descent. It's
the most commonly used variant.
• Batch Gradient Descent (batch size = number of training
examples): In this case, the model parameters are updated using the
entire training dataset in each iteration. While this can lead to more stable
updates, it is computationally expensive, especially for large datasets.
3. Convergence Criteria:
Convergence criteria are used to determine when the optimization process
should stop. Common convergence criteria include:
• Fixed Number of Epochs: You can set a predefined number of epochs,
and the algorithm stops after completing that many iterations through the
dataset.
• Tolerance on the Change in the Cost Function: Stop when the change
in the cost function between consecutive iterations becomes smaller than
a specified threshold.
• Validation Set Performance: You can monitor the performance of the
model on a separate validation set and stop training when it reaches a
satisfactory level of performance.
4. Regularization (L1 and L2):
• Regularization is a technique used to prevent overfitting.
• The SGD Classifier allows you to incorporate L1 (Lasso) and L2 (Ridge)
regularization terms into the cost function.
• These terms add a penalty based on the magnitude of the model
parameters, encouraging them to be small.
• The regularization strength hyperparameter controls the impact of
regularization on the optimization process.
5. Loss Function:
• The choice of the loss function determines how the classifier measures
the error between predicted and actual class labels.
• For binary classification, the cross-entropy loss is commonly used, while
for multi-class problems, the categorical cross-entropy or softmax loss is
typical.
• The choice of the loss function should align with the problem and the
activation function used.
6. Momentum and Adaptive Learning Rates:
To enhance convergence and avoid oscillations, you can use momentum
techniques or adaptive learning rates. Momentum introduces an additional
parameter that smooths the updates and helps the algorithm escape local
minima. Adaptive learning rate methods automatically adjust the learning
rate during training based on the observed progress.
7. Early Stopping:
Early stopping is a technique used to prevent overfitting. It involves
monitoring the model's performance on a validation set during training and
stopping the optimization process when the performance starts to degrade,
indicating overfitting.
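To see how these knobs map onto scikit-learn's SGDClassifier, here is a small illustrative sketch; the dataset is synthetic and the parameter values are arbitrary examples rather than tuned recommendations.

from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

# A small synthetic binary classification problem
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SGDClassifier(
    loss='log_loss',          # logistic loss -> probabilistic linear classifier
    penalty='l2',             # L2 regularization
    alpha=1e-4,               # regularization strength
    learning_rate='optimal',  # learning-rate schedule
    early_stopping=True,      # hold out part of the training data ...
    validation_fraction=0.1,  # ... and stop when its score stops improving
    n_iter_no_change=5,
    max_iter=1000,
    random_state=0,
)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))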
Python Code using SGD to classify the famous Iris Dataset

# Program for SGD classifier

import numpy as np

import matplotlib.pyplot as plt

from sklearn.datasets import load_iris


from sklearn.model_selection import train_test_split

from sklearn.linear_model import SGDClassifier

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

import seaborn as sns

# Load the Iris dataset

data = load_iris()

X, y = data.data, data.target

# Split the data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split( \

X, y, test_size=0.2, random_state=42)

“””This code loads the Iris dataset, which is made up of target labels in y and
features in X. The data is then split 80–20 for training and testing purposes,
with a reproducible random seed of 42. This yields training and testing sets
for both features and labels.”””

# Create an SGD Classifier

clf = SGDClassifier(loss='log_loss', alpha=0.01,

max_iter=1000, random_state=42)

# Train the classifier

clf.fit(X_train, y_train)

“””An SGD Classifier (clf) is instantiated for classification tasks in this code.
Because the classifier is configured to use the log loss (logistic loss)
function, it can be used for both binary and multiclass classification.
Furthermore, to help avoid overfitting, L2 regularization is used with an alpha
parameter of 0.01. To guarantee consistency of results, a random seed of 42
is chosen, and the classifier is programmed to run up to 1000 iterations
during training.”””

# Make predictions

#y_pred = clf.predict(X_test)

pass  # placeholder: the prediction step is shown below

Output of file sgd1.py

y_pred = clf.predict(X_test)

print(y_pred)

[1 0 2 1 1 0 1 2 1 1 1 0 0 0 0 1 2 1 1 2 0 1 0 2 2 2 1 2 0 0] Test size 20%

Using the training data (X_train and y_train), these lines of code train the
SGD Classifier (clf). Following training, the model is applied to generate
predictions on the test data (X_test), which are then saved in the y_pred
variable for a future analysis.

#Evaluate the model

accuracy = accuracy_score(y_test, y_pred)

print(f'Accuracy of the model on test data:{accuracy}')

Accuracy of the model on test data:0.9

“””These lines of code compare the predicted labels (y_pred) with the actual
labels of the test data (y_test) to determine the classification accuracy.”””
This gives 90% accuracy.

If we increase the test data size to 30%, the accuracy rises to about 96%, as shown here:
Test data size 30%

y_pred = clf.predict(X_test)

print(y_pred)

[1 0 2 1 1 0 1 2 1 1 2 0 0 0 0 2 2 1 1 2 0 2 0 2 2 2 2 2 0 0 0 0 1 0 0 2 1 0 0 0 2 2 1 0 0] Test size 30%

#Evaluate the model

accuracy = accuracy_score(y_test, y_pred)

print(f'Accuracy of the model:{accuracy}')

Accuracy of the model: 0.9555555555555556

This gives an accuracy of about 96%.

Shree Mahaveerai Namah


At first sight, the visualization doesn’t seem to indicate that the model works well.

But this needs some further investigation.

What is Confusion Matrix?

A confusion matrix is a table that is often used to describe the performance of a classification
model on a set of data for which the true values are known. It is a 2x2 matrix for binary
classification (though it can be expanded for multi-class problems). The four outcomes are:

• True Positives (TP): The cases in which the model predicted yes (or the positive class),
and the truth is also yes.

• True Negatives (TN): The cases in which the model predicted no (or the negative class),
and the truth is also no.

• False Positives (FP), Type I error: The cases in which the model predicted yes, but the
truth is no.

• False Negatives (FN), Type II error: The cases in which the model predicted no, but the
truth is yes.

The confusion matrix looks like this:

                    Predicted: Yes          Predicted: No
Actual: Yes         True Positive (TP)      False Negative (FN)
Actual: No          False Positive (FP)     True Negative (TN)

Different Metrics derived from Confusion Matrix

Precision: Precision is the ratio of correctly predicted positive observations to the total predicted
positive observations.
Formula: Precision = TP / (TP + FP)

Recall (Sensitivity): Recall is the ratio of correctly predicted positive observations to all
observations that are actually positive.

Formula: Recall = TP / (TP + FN)

F1-Score: The F1 Score is the harmonic mean of Precision and Recall. It balances precision and
recall in a single number.

Formula: F1 Score = 2*(Recall * Precision) / (Recall + Precision)

Accuracy: Accuracy is the most intuitive performance measure. It is simply the ratio of correctly
predicted observations to the total number of observations.

Formula: Accuracy = (TP+TN) / (TP+FP+FN+TN)

Type I error (False Positive rate): This is the situation where you reject the null hypothesis

when it is actually true. In terms of the confusion matrix, it’s when you wrongly predict the

positive class.

Formula: Type I error = FP / (FP + TN)

Type II error (False Negative rate): This is the situation where you fail to reject the null

hypothesis when it is actually false. In terms of the confusion matrix, it’s when you wrongly
predict the negative class.
Formula: Type II error = FN / (FN + TP)
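As a quick sanity check on these formulas, the snippet below computes the metrics by hand from a made-up set of binary labels and compares them with scikit-learn's built-in functions.

from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

# Hypothetical true and predicted labels for a binary problem
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# Counts from the confusion matrix of these labels: TP=3, FN=1, FP=1, TN=5
TP, FN, FP, TN = 3, 1, 1, 5

print(TP / (TP + FP), precision_score(y_true, y_pred))            # precision: 0.75
print(TP / (TP + FN), recall_score(y_true, y_pred))               # recall:    0.75
print(2 * 0.75 * 0.75 / (0.75 + 0.75), f1_score(y_true, y_pred))  # F1:        0.75
print((TP + TN) / 10, accuracy_score(y_true, y_pred))             # accuracy:  0.8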

ROC-AUC Curve:

Receiver Operating Characteristic (ROC) is a probability curve that plots the true positive rate

(sensitivity or recall) against the false positive rate (1 — specificity) at various threshold settings.

Area Under the Curve (AUC) is the area under the ROC curve. If the AUC is high (close to 1), the
model is better at distinguishing between positive and negative classes. An AUC of 0.5 represents

a model that is no better than random.
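As a small illustration (with invented labels and scores), the ROC curve points and the AUC can be obtained directly from scikit-learn.

from sklearn.metrics import roc_auc_score, roc_curve

# Invented true labels and predicted probabilities for the positive class
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.05, 0.7]

fpr, tpr, thresholds = roc_curve(y_true, y_scores)
print('False positive rates:', fpr)
print('True positive rates :', tpr)
print('AUC:', roc_auc_score(y_true, y_scores))  # 1.0 = perfect ranking, 0.5 = random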

https://shivang-ahd.medium.com/all-about-confusion-matrix-preparing-for-interview-questions-fddea115a7ee
Evaluation of Your Clustering Model

And this need for further investigation brings you to the next essential step, which is the
evaluation of your model’s performance. In other words, you want to analyze the degree of
correctness of the model’s predictions.

Let’s print out a confusion matrix:

# Import `metrics` from `sklearn`

from sklearn import metrics

# Print out the confusion matrix with `confusion_matrix()`

print(metrics.confusion_matrix(y_test, y_pred))

Code for confusion matrix data_set_cluster10.py

# Program for observing centroids of clusters

#To predict the labels of the test set: Using Isomap which is non-linear

from sklearn.preprocessing import scale

from sklearn import datasets

import pandas as pd

# Import the `cluster` module

from sklearn import cluster

# Load in the `digits` data

digits = datasets.load_digits()

data=scale(digits.data)

# Import `train_test_split`

from sklearn.model_selection import train_test_split


# Split the `digits` data into training and test sets

X_train, X_test, y_train, y_test, images_train, images_test = \

train_test_split(data, digits.target, digits.images, test_size=0.25,\

random_state=42)

#print(y_test)

# Create the KMeans model

clf = cluster.KMeans(init='k-means++', n_clusters=10,

random_state=42)

# Fit the training data `X_train`to the model

clf.fit(X_train)

import matplotlib.pyplot as plt

y_pred=clf.predict(X_test)

#Let’s print out a confusion matrix:

# Import `metrics` from `sklearn`

from sklearn import metrics

# Print out the confusion matrix with `confusion_matrix()`

print(metrics.confusion_matrix(y_test, y_pred))

# Show the plots

plt.show()
Output of file data_set_cluster10.py

[[ 0 43 0 0 0 0 0 0 0 0]

[ 0 0 0 11 0 0 19 0 7 0]

[ 0 0 2 1 0 1 5 0 9 20]

[34 0 6 0 3 0 0 0 1 2]

[ 0 0 0 2 0 0 1 52 0 0]

[12 0 40 0 0 1 0 0 0 6]

[ 0 1 0 0 0 44 0 0 0 0]

[ 0 0 1 0 35 0 3 0 0 2]

[ 2 0 23 1 0 0 9 0 0 3]

[38 1 4 3 2 0 0 0 0 0]]

At first sight, the results seem to confirm the first impression you gathered from the
visualizations. Remember, though, that KMeans assigns arbitrary cluster indices, so the rows
(true digits) need not line up with the columns (cluster labels). What matters is that several
digits, such as 8 and 9, are spread out over multiple clusters instead of landing cleanly in a
single one. This is not really a success.

You might need to know a bit more about the results than just the confusion matrix.

Let’s try to figure out something more about the quality of the clusters by applying different
cluster quality metrics. That way, you can judge the goodness of fit of the cluster labels to the
correct labels.

from sklearn.metrics import (homogeneity_score, completeness_score,
                             v_measure_score, adjusted_rand_score,
                             adjusted_mutual_info_score, silhouette_score)

print('inertia    homo    compl   v-meas  ARI     AMI     silhouette')

print('%i   %.3f   %.3f   %.3f   %.3f   %.3f   %.3f'
      % (clf.inertia_,
         homogeneity_score(y_test, y_pred),
         completeness_score(y_test, y_pred),
         v_measure_score(y_test, y_pred),
         adjusted_rand_score(y_test, y_pred),
         adjusted_mutual_info_score(y_test, y_pred),
         silhouette_score(X_test, y_pred, metric='euclidean')))

You’ll see that there are quite some metrics to consider:

• The homogeneity score tells you to what extent all of the clusters contain only data points
which are members of a single class.

• The completeness score measures the extent to which all of the data points that are
members of a given class are also elements of the same cluster.

• The V-measure score is the harmonic mean between homogeneity and completeness.

• The adjusted Rand score measures the similarity between two clusterings and considers
all pairs of samples and counting pairs that are assigned in the same or different clusters
in the predicted and true clusterings.

• The Adjusted Mutual Info (AMI) score is used to compare clusters. It measures the
similarity between the data points that are in the clusterings, accounting for chance
groupings and takes a maximum value of 1 when clusterings are equivalent.

• The silhouette score measures how similar an object is to its own cluster compared to
other clusters. The silhouette score ranges from -1 to 1, where a higher value indicates
that the object is better matched to its own cluster and worse matched to neighboring
clusters. If many points have a high value, the clustering configuration is good.

You clearly see that these scores aren’t fantastic: for example, you see that the value for the
silhouette score is close to 0, which indicates that the sample is on or very close to the decision
boundary between two neighboring clusters. This could indicate that the samples could have
been assigned to the wrong cluster.

Also the ARI measure seems to indicate that not all data points in a given cluster are similar and
the completeness score tells you that there are definitely data points that weren’t put in the right
cluster.

Clearly, you should consider another estimator to predict the labels for the digits data.
Shree Mahaveerai Namah
Trying Out Another Model: Support Vector Machines

When you recapped all of the information that you gathered out of the data exploration, you saw
that you could build a model to predict which group a digit belongs to without you knowing the
labels. And indeed, you just used the training data and not the target values to build your
KMeans model.

Let’s assume that you depart from the case where you use both the digits training data and the
corresponding target values to build your model.

If you follow the algorithm map, you’ll see that the first model that you meet is the linear SVC(
Support Vector Classifier ). Let’s apply this now to the digits data:

# Import `train_test_split`

from sklearn.model_selection import train_test_split

# Split the data into training and test sets

X_train, X_test, y_train, y_test, images_train, images_test = \

    train_test_split(digits.data, digits.target, digits.images,

                     test_size=0.25, random_state=42)

# Import the `svm` model

from sklearn import svm

# Create the SVC model

svc_model = svm.SVC(gamma=0.001, C=100., kernel='linear')

# Fit the data to the SVC model

svc_model.fit(X_train,y_train)

You see here that you make use of X_train and y_train to fit the data to the SVC model. This is
clearly different from clustering. Note also that in this example, you set the value
of gamma manually. It is possible to automatically find good values for the parameters by using
tools such as grid search and cross validation.
Even though this is not the focus of this topic, you will see how you could have gone about this
if you had made use of grid search to adjust your parameters. You would have done

something like the following.

# Split the `digits` data into two equal sets

X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.5, random_state=0)

# Import GridSearchCV (it lives in `sklearn.model_selection` in current scikit-learn)

from sklearn.model_selection import GridSearchCV

# Set the parameter candidates

parameter_candidates = [
    {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
    {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]

# Create a classifier with the parameter candidates and fit it to the first half of the data

clf = GridSearchCV(estimator=svm.SVC(), param_grid=parameter_candidates, n_jobs=-1)

clf.fit(X_train, y_train)

Next, you apply the grid-search classifier, with the parameter candidates that you have just
created, to the second part of your data set. You then also train a new classifier using the best
parameters found by the grid search and score the result to see whether the parameters found by
the grid search actually work.

# Apply the classifier to the test data, and view the accuracy score

clf.score(X_test, y_test)

# Train and score a new classifier with the grid search parameters

svm.SVC(C=10, kernel='rbf', gamma=0.001).fit(X_train, y_train).score(X_test, y_test)

The parameters indeed work well!

Now what does this new knowledge tell you about the SVC classifier that you had modeled
before you had done the grid search?

Let’s back up to the model that you had made before.

You see that in the SVM classifier, the penalty parameter C of the error term is set to 100.
Lastly, you see that the kernel has been explicitly specified as a linear one. The kernel argument
specifies the kernel type that you're going to use in the algorithm; by default, this is rbf. In
other cases, you can specify others such as linear, poly, …

But what is a kernel exactly?

A kernel is a similarity function, which is used to compute the similarity between the training data
points. When you provide a kernel to an algorithm, together with the training data and the labels,
you will get a classifier, as is the case here. You will have trained a model that assigns new
unseen objects into a particular category. For the SVM, you will typically try to linearly divide
your data points.

However, the grid search tells you that an rbf kernel would have worked better. The penalty
parameter and the gamma were specified correctly.

Tip: try out the classifier with an rbf kernel.

For now, let’s just say you just continue with a linear kernel and predict the values for the test
set:

# Predict the label of `X_test`

print(svc_model.predict(X_test))

# Print `y_test` to check the results

print(y_test)

You can also visualize the images and their predicted labels:
# Import matplotlib

import matplotlib.pyplot as plt

# Assign the predicted values to `predicted`

predicted = svc_model.predict(X_test)

# Zip together the `images_test` and `predicted` values in `images_and_predictions`

images_and_predictions = list(zip(images_test, predicted))

# For the first 4 elements in `images_and_predictions`

for index, (image, prediction) in enumerate(images_and_predictions[:4]):

    # Initialize subplots in a grid of 1 by 4 at positions i+1

    plt.subplot(1, 4, index + 1)

    # Don't show axes

    plt.axis('off')

    # Display images in all subplots in the grid

    plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')

    # Add a title to the plot

    plt.title('Predicted: ' + str(prediction))

# Show the plot

plt.show()

This plot is very similar to the plot that you made when you were exploring the data.
The above discussion is summarized in the following program for the support vector machine.

Code for Support Vector Classification (SVC) file data_set_svm1.py

# Program for observing Support Vector Machine (SVM)with svc

#To predict the labels of the test set:

from sklearn.preprocessing import scale

from sklearn import datasets

import pandas as pd

import matplotlib.pyplot as plt

# Load in the `digits` data

digits = datasets.load_digits()

data=scale(digits.data)

# Import `train_test_split`

from sklearn.model_selection import train_test_split

# Split the `digits` data into training and test sets

X_train, X_test, y_train, y_test, images_train, images_test = \

train_test_split(data, digits.target, digits.images, test_size=0.25,\

random_state=42)
# Import the `svm` model

from sklearn import svm

# Create the SVC model

svc_model = svm.SVC(gamma=0.001, C=100., kernel='linear')

# Fit the data to the SVC model

svc_model.fit(X_train,y_train)

# Print `y_test` to check the results

print(y_test)

# Assign the predicted values to `predicted`

predicted = svc_model.predict(X_test)

print(predicted)

# Zip together the `images_test` and `predicted` values in `images_and_predictions`

images_and_predictions = list(zip(images_test, predicted))

# For the first 4 elements in `images_and_predictions`

for index, (image, prediction) in enumerate(images_and_predictions[:4]):

    # Initialize subplots in a grid of 1 by 4 at positions i+1

    plt.subplot(1, 4, index + 1)

    # Don't show axes

    plt.axis('off')

    # Display images in all subplots in the grid

    plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')

    # Add a title to the plot

    plt.title('Predicted: ' + str(prediction))

# Show the plot

plt.show()
[6 9 3 7 2 1 5 2 5 2 1 9 4 0 4 2 3 7 8 8 4 3 9 7 5 6 3 5 6 3 4 9 1 4 4 6 9
 4 7 6 6 9 1 3 6 1 3 0 6 5 5 1 9 5 6 0 9 0 0 1 0 4 5 2 4 5 7 0 7 5 9 5 5 4
 7 0 4 5 5 9 9 0 2 3 8 0 6 4 4 9 1 2 8 3 5 2 9 0 4 4 4 3 5 3 1 3 5 9 4 2 7
 7 4 4 1 9 2 7 8 7 2 6 9 4 0 7 2 7 5 8 7 5 7 7 0 6 6 4 2 8 0 9 4 6 9 9 6 9
 0 3 5 6 6 0 6 4 3 9 3 9 7 2 9 0 4 5 3 6 5 9 9 8 4 2 1 3 7 7 2 2 3 9 8 0 3
 2 2 5 6 9 9 4 1 5 4 2 3 6 4 8 5 9 5 7 8 9 4 8 1 5 4 4 9 6 1 8 6 0 4 5 2 7
 4 6 4 5 6 0 3 2 3 6 7 1 5 1 4 7 6 8 8 5 5 1 6 2 8 8 9 9 7 6 2 2 2 3 4 8 8
 3 6 0 9 7 7 0 1 0 4 5 1 5 3 6 0 4 1 0 0 3 6 5 9 7 3 5 5 9 9 8 5 3 3 2 0 5
 8 3 4 0 2 4 6 4 3 4 5 0 5 2 1 3 1 4 1 1 7 0 1 5 2 1 2 8 7 0 6 4 8 8 5 1 8
 4 5 8 7 9 8 5 0 6 2 0 7 9 8 9 5 2 7 7 1 8 7 4 3 8 3 5 6 0 0 3 0 5 0 0 4 1
 2 8 4 5 9 6 3 1 8 8 4 2 3 8 9 8 8 5 0 6 3 3 7 1 6 4 1 2 1 1 6 4 7 4 8 3 4
 0 5 1 9 4 5 7 6 3 7 0 5 9 7 5 9 7 4 2 1 9 0 7 5 3 3 6 3 9 6 9 5 0 1 5 5 8
 3 3 6 2 6 5]

[6 9 3 7 2 1 5 2 5 2 1 9 4 0 4 2 3 7 8 8 4 3 9 7 5 6 3 5 6 3 4 9 1 4 4 6 9
 4 7 6 6 9 1 3 6 1 3 0 6 5 5 1 3 5 6 0 9 0 0 1 0 4 5 2 4 5 7 0 7 5 9 3 5 4
 7 0 4 5 5 9 9 0 2 3 8 0 6 4 4 9 1 2 8 3 5 2 9 0 4 4 4 3 5 3 1 3 5 9 4 2 7
 7 4 4 1 9 2 7 8 7 2 6 9 4 0 7 2 7 5 8 7 5 7 9 0 6 6 4 2 8 0 9 4 6 9 9 6 9
 0 5 5 6 6 0 6 4 3 9 3 7 7 2 9 0 4 5 3 6 5 9 9 8 4 2 1 3 7 7 2 2 3 9 8 0 3
 2 2 5 6 9 9 4 1 5 4 2 3 6 4 8 5 9 5 7 8 9 4 8 1 5 4 4 9 6 1 8 6 0 4 5 2 7
 4 6 4 5 6 0 3 2 3 6 7 1 5 1 4 7 6 5 1 5 5 1 0 2 8 8 9 9 7 6 2 2 2 3 4 8 8
 3 6 0 9 7 7 0 1 0 4 5 1 5 3 6 0 4 1 0 0 3 6 5 9 7 3 5 5 9 9 8 5 3 3 2 0 5
 8 3 4 0 2 4 6 4 3 4 5 0 5 2 1 3 1 4 1 1 7 0 1 5 2 1 2 8 7 0 6 4 8 8 5 1 8
 4 5 8 7 9 8 6 0 6 2 0 7 9 8 9 5 2 7 7 1 8 7 4 3 8 3 5 6 0 0 3 0 5 0 0 4 1
 2 8 4 5 9 6 3 1 8 8 4 2 3 8 9 8 8 5 0 6 3 3 7 1 6 4 1 2 1 1 6 4 7 4 8 3 4
 0 5 1 9 4 5 7 6 3 7 0 5 9 7 5 9 7 4 2 1 9 0 7 5 2 3 6 3 9 6 9 5 0 1 5 5 8
 3 3 6 2 6 5]

If you add the following code to the above file, creating data_set_svm2.py:

#import metric

from sklearn import metrics

# Print the classification report of `y_test` and `predicted`

print(metrics.classification_report(y_test, predicted))

# Print the confusion matrix of `y_test` and `predicted`

print(metrics.confusion_matrix(y_test, predicted))

You get the following output from file data_set_svm2.py:


precision recall f1-score support

0 0.98 1.00 0.99 43


1 0.97 1.00 0.99 37
2 0.97 1.00 0.99 38
3 0.96 0.96 0.96 46
4 1.00 1.00 1.00 55
5 0.97 0.97 0.97 59
6 0.98 0.98 0.98 45
7 0.98 0.98 0.98 41
8 1.00 0.95 0.97 38
9 0.98 0.96 0.97 48

accuracy 0.98 450


macro avg 0.98 0.98 0.98 450
weighted avg 0.98 0.98 0.98 450
[[43 0 0 0 0 0 0 0 0 0]
[ 0 37 0 0 0 0 0 0 0 0]
[ 0 0 38 0 0 0 0 0 0 0]
[ 0 0 1 44 0 1 0 0 0 0]
[ 0 0 0 0 55 0 0 0 0 0]
[ 0 0 0 1 0 57 1 0 0 0]
[ 1 0 0 0 0 0 44 0 0 0]
[ 0 0 0 0 0 0 0 40 0 1]
[ 0 1 0 0 0 1 0 0 36 0]
[ 0 0 0 1 0 0 0 1 0 46]]

Only this time, you zip together the images and the predicted values and you only take the first 4
elements of images_and_predictions.

But now the biggest question: how does this model perform?

# Import `metrics`

from sklearn import metrics

# Print the classification report of `y_test` and `predicted`

print(metrics.classification_report(y_test, predicted))

# Print the confusion matrix of `y_test` and `predicted`

print(metrics.confusion_matrix(y_test, predicted))

You clearly see that this model performs a whole lot better than the clustering model that you
used earlier.

You can also see it when you visualize the predicted and the actual labels with the help
of Isomap():

# Import `Isomap()`

from sklearn.manifold import Isomap


# Create an isomap and fit the `digits` data to it

X_iso = Isomap(n_neighbors=10).fit_transform(X_train)

# Predict the labels for the training data with the fitted SVC model

predicted = svc_model.predict(X_train)

# Create a plot with subplots in a grid of 1X2

fig, ax = plt.subplots(1, 2, figsize=(8, 4))

# Adjust the layout

fig.subplots_adjust(top=0.85)

# Add scatterplots to the subplots

ax[0].scatter(X_iso[:, 0], X_iso[:, 1], c=predicted)

ax[0].set_title('Predicted labels')

ax[1].scatter(X_iso[:, 0], X_iso[:, 1], c=y_train)

ax[1].set_title('Actual Labels')

# Add title

fig.suptitle('Predicted versus actual labels', fontsize=14, fontweight='bold')

# Show the plot

plt.show()
This will give you the following scatterplots:

You’ll see that this visualization confirms your classification report, which is very good news. :)

What's Next?

Digit Recognition in Natural Images

Congratulations, you have reached the end of this scikit-learn tutorial, which was meant to
introduce you to Python machine learning! Now it's your turn.

Firstly, make sure you get a hold of DataCamp's scikit-learn cheat sheet.

Next, start your own digit recognition project with different data. One dataset that you can
already use is the MNIST data, which you can download here.

The steps that you can take are very similar to the ones that you have gone through with this
tutorial, but if you still feel that you can use some help, you should check out this page, which
works with the MNIST data and applies the KMeans algorithm.
Working with the digits dataset was the first step in classifying characters with scikit-learn. If
you’re done with this, you might consider trying out an even more challenging problem, namely,
classifying alphanumeric characters in natural images.

A well-known dataset that you can use for this problem is the Chars74K dataset, which contains
more than 74,000 images of the digits 0 to 9 and both the lowercase and uppercase letters of
the English alphabet. You can download the dataset here.

Data Visualization and pandas

Whether you're going to start with the projects that have been mentioned above or not, this is
definitely not the end of your journey of data science with Python. If you choose not to widen
your view just yet, consider deepening your data visualization and data manipulation knowledge.

Don't miss out on our Interactive Data Visualization with Bokeh course to make sure you can
impress your peers with a stunning data science portfolio or our pandas Foundation course, to
learn more about working with data frames in Python.

Scikit-Learn Tutorial: Baseball Analytics Pt 1

Detecting Fake News with Scikit-Learn

Keras Tutorial: Deep Learning in Python


Shree Mahaveerai Namah

Decision Tree

A decision tree is one of the most powerful tools of supervised learning algorithms used for
both classification and regression tasks. It builds a flowchart-like tree structure where each
internal node denotes a test on an attribute, each branch represents an outcome of the test, and
each leaf node (terminal node) holds a class label (Result of algorithm). It is constructed by
recursively splitting the training data into subsets based on the values of the attributes until a
stopping criterion is met, such as the maximum depth of the tree or the minimum number of
samples required to split a node.
During training, the Decision Tree algorithm selects the best attribute to split the data based on
a metric such as entropy or Gini impurity, which measures the level of impurity or randomness
in the subsets. The goal is to find the attribute that maximizes the information gain or the
reduction in impurity after the split.
What is a Decision Tree?
A decision tree is a flowchart-like tree structure where each internal node denotes the feature,
branches denote the rules and the leaf nodes denote the result of the algorithm. It is a
versatile supervised machine-learning algorithm, which is used for both classification and
regression problems. It is one of the very powerful algorithms. And it is also used in Random
Forest to train on different subsets of training data, which makes random forest one of the most
powerful algorithms in machine learning.

➢ Root Node: It is the topmost node in the tree, which represents the complete dataset. It
is the starting point of the decision-making process.

➢ Decision/Internal Node: A node that symbolizes a choice regarding an input feature.


Branching off of internal nodes connects them to leaf nodes or other internal nodes.

➢ Leaf/Terminal Node: A node without any child nodes that indicates a class label or a
numerical value.

➢ Splitting: The process of splitting a node into two or more sub-nodes using a split
criterion and a selected feature.

➢ Branch/Sub-Tree: A subsection of the decision tree starts at an internal node and ends
at the leaf nodes.

➢ Parent Node: The node that divides into one or more child nodes.

➢ Child Node: The nodes that emerge when a parent node is split.

➢ Impurity: A measurement of the target variable’s homogeneity in a subset of data. It


refers to the degree of randomness or uncertainty in a set of examples. The Gini
index and entropy are two commonly used impurity measurements in decision trees for
classifications task

➢ Variance: Variance measures how much the predicted and the target variables vary in
different samples of a dataset. It is used for regression problems in decision
trees. Mean squared error, Mean Absolute Error, friedman_mse, or Half Poisson
deviance are used to measure the variance for the regression tasks in the decision tree.

➢ Information Gain: Information gain is a measure of the reduction in impurity achieved


by splitting a dataset on a particular feature in a decision tree. The splitting criterion is
determined by the feature that offers the greatest information gain, It is used to
determine the most informative feature to split on at each node of the tree, with the goal
of creating pure subsets

➢ Pruning: The process of removing branches from the tree that do not provide any
additional information or lead to overfitting.
Important points related to Entropy:

1. The entropy is 0 when the dataset is completely homogeneous, meaning that each
instance belongs to the same class. It is the lowest entropy indicating no uncertainty in
the dataset sample.

2. when the dataset is equally divided between multiple classes, the entropy is at its
maximum value. Therefore, entropy is highest when the distribution of class labels is
even, indicating maximum uncertainty in the dataset sample.

3. Entropy is used to evaluate the quality of a split. The goal of entropy is to select the
attribute that minimizes the entropy of the resulting subsets, by splitting the dataset into
more homogeneous subsets with respect to the class labels.

4. The highest information gain attribute is chosen as the splitting criterion (i.e., the
reduction in entropy after splitting on that attribute), and the process is repeated
recursively to build the decision tree.
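For reference, these impurity measures can be written down explicitly. For a node whose set of samples S is divided among k classes with proportions p_1, p_2, \dots, p_k:

Entropy(S) = -\sum_{i=1}^{k} p_i \log_2 p_i

Gini(S) = 1 - \sum_{i=1}^{k} p_i^2

Information\;Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|}\, Entropy(S_v)

where S_v is the subset of S for which attribute A takes the value v. As a quick check of points 1 and 2 above: a node split 50/50 between two classes has entropy -0.5\log_2 0.5 - 0.5\log_2 0.5 = 1 (maximum uncertainty for two classes), while a pure node has entropy 0.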

Decision Tree

File decision_tree.2py

import pandas as pd

import numpy as np

import seaborn as sns

import matplotlib.pyplot as plt

from sklearn import tree

from sklearn.metrics import accuracy_score, classification_report

from sklearn.datasets import load_iris

from sklearn.tree import DecisionTreeClassifier

from sklearn.model_selection import train_test_split


import warnings

warnings.filterwarnings('ignore')

iris = load_iris()

iris = sns.load_dataset('iris')

iris.head()

# In seaborn's iris dataset the species are named 'setosa', 'versicolor' and 'virginica'

iris_setosa = iris.loc[iris["species"] == "setosa"]

iris_virginica = iris.loc[iris["species"] == "virginica"]

iris_versicolor = iris.loc[iris["species"] == "versicolor"]

sns.FacetGrid(iris, \

hue="species").map(sns.distplot,"petal_length").add_legend()

sns.FacetGrid(iris,hue="species").map(sns.distplot,"petal_width")\

.add_legend()

sns.FacetGrid(iris,hue="species").map(sns.distplot, \

"sepal_length").add_legend()

plt.show()
File name decision_tree3.py

#Program for plotting decsion tree

import seaborn as sns

import matplotlib.pyplot as plt

from sklearn import tree

import pandas as pd

from sklearn.metrics import accuracy_score, classification_report

from sklearn.datasets import load_iris

from sklearn.tree import DecisionTreeClassifier

from sklearn.model_selection import train_test_split

import warnings
warnings.filterwarnings('ignore')

iris = load_iris()

iris = sns.load_dataset('iris')

iris.head()

# In seaborn's iris dataset the species are named 'setosa', 'versicolor' and 'virginica'

iris_setosa = iris.loc[iris["species"] == "setosa"]

iris_virginica = iris.loc[iris["species"] == "virginica"]

iris_versicolor = iris.loc[iris["species"] == "versicolor"]

X = iris.iloc[:, :-2]

y = iris.species

print(y)

X_train, X_test,y_train, y_test = train_test_split(X, y,test_size=0.33,random_state=42)

treemodel = DecisionTreeClassifier(criterion='entropy',max_depth=2)

treemodel.fit(X_train, y_train)

plt.figure(figsize=(15, 10))

tree.plot_tree(treemodel, filled=True)

ypred = treemodel.predict(X_test)

score = accuracy_score(ypred, y_test)

print(score)

print(classification_report(ypred, y_test))

plt.show()

Output of file decision_tree3.py

decision_tree3.py

0.98
precision recall f1-score support

setosa 1.00 1.00 1.00 19

versicolor 0.93 1.00 0.97 14

virginica 1.00 0.94 0.97 17

accuracy 0.98 50

macro avg 0.98 0.98 0.98 50

weighted avg 0.98 0.98 0.98 50

SOME DEFINITIONS
A scatter diagram, also known as a scatter plot or scatter graph, is a graphical representation of
the relationship between two continuous variables. It displays individual data points on a two-
dimensional graph, with one variable plotted on the x-axis and the other on the y-axis. Each point
represents the simultaneous values of the two variables for a specific observation.

The primary purpose of a scatter diagram is to visually assess the nature and strength of the
relationship between the variables. The pattern of points on the graph can provide insights into
potential correlations, trends, clusters, or outliers in the data.

There are different patterns that may emerge in a scatter diagram:

Positive correlation: Points on the graph tend to form an upward-sloping pattern, indicating that
as one variable increases, the other also tends to increase.

Negative correlation: Points on the graph tend to form a downward-sloping pattern, suggesting
that as one variable increases, the other tends to decrease.

No correlation: Points are scattered randomly, indicating a lack of a discernible pattern or


relationship between the variables.

Scatter diagrams are widely used in various fields, including statistics, data analysis, and
scientific research, to visually explore and understand the relationships within a dataset.

A probability distribution is a statistical function that describes all the possible values and
likelihoods that a random variable can take within a given range from a random experiment.
This range is bounded by the minimum and maximum possible values, but precisely where a
possible value is likely to fall on the probability distribution depends on a number of factors.
These factors include the distribution's mean (average), standard deviation, skewness and
kurtosis. The sum of all probabilities (the total area under the curve) must be 1.

Linear regression is a statistical method used to model the relationship between a dependent
variable and one or more independent variables by fitting a linear equation to the observed data.
The goal is to find the best-fitting straight line (the regression line) that minimizes the sum of
the squared differences between the observed values and the values predicted by the model. This
line can then be used to predict the dependent variable for new, unseen data. In the simple
one-variable case the equation is y = mx + b.
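As a quick illustration with invented data, the least-squares line can be fitted in a couple of lines with NumPy:

import numpy as np

# Invented data roughly following y = 2x + 1 with some noise
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 11.1])

# np.polyfit with degree 1 returns the slope m and intercept b of the least-squares line
m, b = np.polyfit(x, y, 1)
print(m, b)             # slope close to 2, intercept close to 1
print(m * 6.0 + b)      # predict y for a new x value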

Multicollinearity is a statistical phenomenon in regression analysis where two or more
independent variables in a multiple regression model are highly correlated. It can lead to issues
in estimating the individual contribution of each variable to the dependent variable. High
multicollinearity makes it challenging to isolate the effect of each independent variable on the
dependent variable, and it can affect the stability and reliability of the regression coefficients.

Signs of Multicollinearity:

1. High correlation coefficients: Multicollinearity is often indicated by high correlation between
independent variables. A correlation coefficient close to +1 or -1 suggests a strong linear
relationship.
2. Variance Inflation Factor (VIF): VIF is a measure of how much the variance of an estimated
regression coefficient increases when your predictors are correlated. High VIF values (typically
greater than 10) are indicative of multicollinearity.
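As a small sketch (assuming the statsmodels package is available; the data below are invented so that two predictors are nearly collinear), the VIF of each predictor can be computed as follows:

import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Invented predictors: x2 is almost a copy of x1, so both should show high VIFs
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.05, size=200)   # nearly collinear with x1
x3 = rng.normal(size=200)                    # independent predictor

X = add_constant(pd.DataFrame({'x1': x1, 'x2': x2, 'x3': x3}))

# One VIF per column (the constant's VIF is usually ignored)
for i, name in enumerate(X.columns):
    print(name, variance_inflation_factor(X.values, i))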

Advantages of Multicollinearity:

1. Variable Reduction: Multicollinearity can sometimes help identify redundant variables in a
model. If two variables are highly correlated, it may be possible to combine or eliminate one of
them, simplifying the model.

2. Improved Predictive Accuracy: In some cases, multicollinearity may not significantly impact
the predictive accuracy of a model. If the primary goal is prediction rather than understanding
individual variable effects, multicollinearity may be of lesser concern.

Disadvantages of Multicollinearity:

1. Difficulty in interpretation: Multicollinearity makes it difficult to interpret the individual
impact of each variable on the dependent variable because their effects are confounded.

2. Unreliable coefficients: The estimated coefficients of the affected variables become less stable,
and small changes in the data can lead to large changes in the coefficients.

3. Model instability: Multicollinearity can make the model sensitive to small changes in the
data, leading to model instability and potentially inaccurate predictions.

Addressing multicollinearity is crucial for building reliable regression models, and it requires a
combination of statistical techniques and careful consideration of the variables included in the
model.
Hypothesis testing is a statistical method used to make inferences about a population based on a
sample of data. It involves testing a hypothesis or a claim about a population parameter using
sample data. The process typically follows a structured set of steps. Let's break down the key
components and steps involved in hypothesis testing:

Components of Hypothesis Testing:

1.Null Hypothesis (H0):

❖ The null hypothesis is a statement of no effect, no difference, or no change in the


population parameter.
❖ It is denoted by H0.
❖ It is assumed to be true initially.

2. Alternative Hypothesis (Ha or H1):

❖ The alternative hypothesis is the statement that contradicts the null hypothesis.
❖ It is denoted by Ha or H1
❖ It represents what the researcher is trying to establish.

3. Significance Level

❖ The significance level is the probability of rejecting the null hypothesis when it is
actually true.
❖ Commonly used values for α are 0.05, 0.01, and 0.10.

4. Test Statistic:

❖ The test statistic is a numerical value calculated from the sample data to
determine how far the sample result deviates from what is expected under the null
hypothesis.

5. Critical Region or Rejection Region:

• The critical region is the set of all values of the test statistic that leads to the rejection of
the null hypothesis.

6. P-value:

• The p-value is the probability of obtaining a test statistic as extreme as, or more extreme
than, the one observed in the sample, assuming the null hypothesis is true.
• A small p-value indicates evidence against the null hypothesis.

Steps in Hypothesis Testing:

1. Formulate Hypotheses:

• State the null hypothesis (H0) and the alternative hypothesis (Ha).

2. Choose Significance Level (α):

• Decide on the level of significance (α), which represents the probability of making a
Type I error (rejecting a true null hypothesis).

3. Collect and Analyze Data:

• Collect a sample of data and calculate the test statistic.

4. Calculate P-value:

• Calculate the probability of obtaining the observed test statistic or more extreme values
under the null hypothesis.

5. Make a Decision:

• If the p-value is less than or equal to α, reject the null hypothesis.

• If the p-value is greater than α, fail to reject the null hypothesis.

6. Draw a Conclusion:

• State the conclusion in terms of the problem context and interpret the results.
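The steps above can be walked through end to end in a short sketch; the sample values below are invented, and a one-sample t-test from scipy is used as the test statistic, which is just one possible choice.

from scipy import stats

# Step 1: H0: the population mean is 50; Ha: the population mean is not 50
# Step 2: choose a significance level
alpha = 0.05

# Step 3: collect a sample (invented values) and compute the test statistic
sample = [51.2, 49.8, 52.3, 50.9, 48.7, 51.5, 50.2, 52.0, 49.5, 51.1]
result = stats.ttest_1samp(sample, popmean=50)

# Steps 4-6: compare the p-value with alpha and draw a conclusion
print('t =', result.statistic, 'p =', result.pvalue)
if result.pvalue <= alpha:
    print('Reject H0: the sample mean differs significantly from 50')
else:
    print('Fail to reject H0: no significant difference from 50')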

Types of Errors:

• Type I Error (False Positive):

• Rejecting the null hypothesis when it is actually true.

• Type II Error (False Negative):

• Failing to reject the null hypothesis when it is actually false.

Hypothesis testing is a crucial tool in research and decision-making, helping to draw valid
conclusions about populations based on limited sample data.
A null hypothesis is a type of statistical hypothesis that proposes that no statistically significant
difference, effect, or relationship exists in a set of given observations (parameters) of a
population or between different samples of a population. Hypothesis testing is used to assess the
credibility of a hypothesis by using sample data. Sometimes referred to simply as the "null", it is
represented as H0.

The null hypothesis, also known as the conjecture, is used in quantitative analysis to test
theories about markets, investing strategies, or economics, to decide if an idea is true or false.

A test statistic is a crucial element in the framework of statistical hypothesis testing, serving as a
standardized value that enables the determination of the probability of observing the data under
the null hypothesis. It is the bridge between the sample data and the decision-making process
regarding the hypotheses. The construction and interpretation of a test statistic follow a systematic
approach designed to assess the evidence against the null hypothesis.

Definition and purpose

A test statistic is calculated from sample data to test a hypothesis about a population parameter.
The choice of test statistic depends on the hypothesis being tested, the type of data collected,
and the assumed distribution of the data. It quantifies how far the observed data deviate from
what is expected under the null hypothesis; this deviation is then assessed to determine whether
it is significant enough to warrant rejection of H0.

Calculation and interpretation:

The calculation of a test statistic involves specific formulas that incorporate the sample data, the
hypothesized value under H0, and often the sample size. Common types of test statistics include:

Z-statistic: used in tests concerning the mean of a normally distributed population with known
variance.

T-statistic: used in tests concerning the mean of a normally distributed population with unknown
variance.

Chi-square statistic: used in tests of independence in contingency tables and for goodness-of-fit
tests.

F-statistic: used in the analysis of variance (ANOVA) for comparing the means of three or more
groups, and more generally for comparing variances.

Conclusion:

The test statistic is a fundamental concept in statistics that quantifies the evidence against the
null hypothesis. Its careful calculation and interpretation enable researchers to make informed
decisions about the validity of their hypotheses, thereby advancing knowledge in their
respective fields through rigorous scientific methods.
Shree Mahaveerai Namah

What is gradient descent?


Gradient descent is an optimization algorithm which is commonly used to train machine
learning models and neural networks. Training data helps these models learn
over time, and the cost function within gradient descent specifically acts as a barometer, gauging
its accuracy with each iteration of parameter updates. Until the function is close to or equal to
zero, the model will continue to adjust its parameters to yield the smallest possible error.
Once machine learning models are optimized for accuracy, they can be powerful tools for
artificial intelligence (AI) and computer vision applications.

How does gradient descent work?

Before we dive into gradient descent, it may help to review some concepts from linear
regression. You may recall the following formula for the slope of a line, which is y = mx + b,
where m represents the slope and b is the intercept on the y-axis.

You may also recall plotting a scatterplot in statistics and finding the line of best fit, which
required calculating the error between the actual output and the predicted output (y-hat) using the
mean squared error formula. The gradient descent algorithm behaves similarly, but it is based on
a convex function.

The starting point is just an arbitrary point for us to evaluate the performance. From that starting
point, we will find the derivative (or slope), and from there, we can use a tangent line to observe
the steepness of the slope. The slope will inform the updates to the parameters—i.e. the weights
and bias. The slope at the starting point will be steeper, but as new parameters are generated, the
steepness should gradually reduce until it reaches the lowest point on the curve, known as the
point of convergence.

Similar to finding the line of best fit in linear regression, the goal of gradient descent is to
minimize the cost function, or the error between predicted and actual y. In order to do this, it
requires two data points—a direction and a learning rate. These factors determine the partial
derivative calculations of future iterations, allowing it to gradually arrive at the local or global
minimum (i.e. point of convergence).

• Learning rate (also referred to as step size or alpha) is the size of the steps that are
taken to reach the minimum. This is typically a small value, and it is evaluated and
updated based on the behavior of the cost function. High learning rates result in larger
steps but risk overshooting the minimum. Conversely, a low learning rate takes small
steps; while it has the advantage of more precision, the larger number of iterations
compromises overall efficiency, as it takes more time and computation to reach the
minimum.
• The cost (or loss) function measures the difference, or error, between actual y and
predicted y at its current position. This improves the machine learning model's efficacy
by providing feedback to the model so that it can adjust the parameters to minimize the
error and find the local or global minimum. It continuously iterates, moving along the
direction of steepest descent (or the negative gradient) until the cost function is close to
or at zero. At this point, the model will stop learning. Additionally, while the terms, cost
function and loss function, are considered synonymous, there is a slight difference
between them. It’s worth noting that a loss function refers to the error of one training
example, while a cost function calculates the average error across an entire training set.
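To make the interplay between the gradient, the learning rate and the cost function concrete, here is a minimal sketch of gradient descent on a one-dimensional convex cost function; the function and the step size are arbitrary choices for illustration.

# Vanilla gradient descent on the convex cost J(w) = (w - 3)^2
def cost(w):
    return (w - 3) ** 2

def gradient(w):
    return 2 * (w - 3)       # derivative of the cost with respect to w

w = -5.0                     # arbitrary starting point
learning_rate = 0.1

for step in range(50):
    w = w - learning_rate * gradient(w)   # move against the gradient

print(w, cost(w))            # w approaches 3, where the cost is (close to) zero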

Types of gradient descent

There are three types of gradient descent learning algorithms: batch gradient descent, stochastic
gradient descent and mini-batch gradient descent.

Batch gradient descent

Batch gradient descent sums the error for each point in a training set, updating the model only
after all training examples have been evaluated. This process is referred to as a training epoch.

While this batching provides computation efficiency, it can still have a long processing time for
large training datasets as it still needs to store all of the data into memory. Batch gradient descent
also usually produces a stable error gradient and convergence, but sometimes that convergence
point isn’t the most ideal, finding the local minimum versus the global one.
Stochastic gradient descent

Stochastic gradient descent (SGD) processes one training example per update: within each
training epoch it updates the model's parameters for each example in the dataset, one at a time.
Since you only need to hold one training example in memory at a time, it is memory-efficient.
While these frequent updates can offer more detail and speed, they can result in a loss of
computational efficiency when compared to batch gradient descent. The frequent updates also
produce noisy gradients, but this can be helpful in escaping a local minimum and finding the
global one.

Mini-batch gradient descent

Mini-batch gradient descent combines concepts from both batch gradient descent and stochastic
gradient descent. It splits the training dataset into small batch sizes and performs updates on each
of those batches. This approach strikes a balance between the computational efficiency
of batch gradient descent and the speed of stochastic gradient descent.
Challenges with gradient descent

While gradient descent is the most common approach for optimization problems, it does come
with its own set of challenges. Some of them include:

Local minima and saddle points

For convex problems, gradient descent can find the global minimum with ease, but as nonconvex
problems emerge, gradient descent can struggle to find the global minimum, where the model
achieves the best results.

Recall that when the slope of the cost function is at or close to zero, the model stops learning. A
few scenarios beyond the global minimum can also yield this slope, which are local minima and
saddle points. Local minima mimic the shape of a global minimum, where the slope of the cost
function increases on either side of the current point. However, with saddle points, the negative
gradient only exists on one side of the point, reaching a local maximum on one side and a local
minimum on the other. Its name is inspired by that of a horse's saddle.

Noisy gradients can help the gradient escape local minimums and saddle points.

Vanishing and Exploding Gradients

In deeper neural networks, particularly recurrent neural networks, we can also encounter two other
problems when the model is trained with gradient descent and backpropagation.

• Vanishing gradients: This occurs when the gradient is too small. As we move
backwards during backpropagation, the gradient continues to become smaller, causing the
earlier layers in the network to learn more slowly than later layers. When this happens,
the weight parameters update until they become insignificant—i.e. 0—resulting in an
algorithm that is no longer learning.
• Exploding gradients: This happens when the gradient is too large, creating an unstable
model. In this case, the model weights will grow too large, and they will eventually be
represented as NaN. One solution to this issue is to leverage a dimensionality reduction
technique, which can help to minimize complexity within the model.

1. Introduction (by Robert Kwiatkowski)

Gradient descent (GD) is an iterative first-order optimisation algorithm, used to find a local
minimum/maximum of a given function. This method is commonly used in machine
learning (ML) and deep learning (DL) to minimise a cost/loss function (e.g. in a linear
regression or least squares fitting). Due to its importance and ease of implementation, this algorithm
is usually taught at the beginning of almost all machine learning courses.

However, its use is not limited to ML/DL only, it’s widely used also in areas like:

❖ control engineering (robotics, chemical, etc.)


❖ computer games

❖ mechanical engineering

We will do a deep dive into the math, implementation and behaviour of first-order gradient
descent algorithm. We will navigate the custom (cost) function directly to find its minimum. That
means there will be no underlying data like in typical ML tutorials — we will be more flexible
regarding a function’s shape.

This method was proposed long before the era of modern computers by Augustin-Louis Cauchy
in 1847. Since that time, there was a significant development in computer science and numerical
methods. That led to numerous improved versions of Gradient Descent. However, we’re going to
use a basic/vanilla version implemented in Python.

2. Function requirements

Gradient descent algorithm does not work for all functions. There are two specific requirements.
A function has to be:

❖ differentiable

❖ convex

First, what does it mean it has to be differentiable? If a function is differentiable it has a


derivative for each point in its domain — not all functions meet these criteria. First, let’s see some
examples of functions meeting this criterion:

Examples of differentiable functions; Image by author

Typical non-differentiable functions have a step or a cusp or a discontinuity:


Examples of non-differentiable functions;

Next requirement — the function has to be convex. For a univariate function, this means that the
line segment connecting two of the function's points lies on or above its curve (it does not cross it). If it
does cross the curve, it means that the function has a local minimum which is not a global one.

Mathematically, for two points x₁, x₂ lying on the function's curve, this condition is expressed as:

f(\lambda x_1 + (1-\lambda)x_2) \leq \lambda f(x_1) + (1-\lambda) f(x_2)

where λ denotes a point's location on the section line and its value has to be between 0 (left point)
and 1 (right point), e.g. λ=0.5 means a location in the middle.

Below there are two functions with exemplary section lines.

Exemplary convex and non-convex functions;


Another way to check mathematically whether a univariate function is convex is to calculate the
second derivative and check that its value is never smaller than 0 (if it is always strictly bigger
than 0, the function is strictly convex).

Let’s do a simple example

Let’s investigate a simple quadratic function given by:

Its first and second derivatives are:

Because the second derivative is always bigger than 0, our function is strictly convex.
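The quadratic itself and its derivatives did not carry over from the original figures. As a concrete stand-in (my own choice, matching the quadratic implemented later in the code of Section 5):

f(x) = x² - 4x + 1,  f′(x) = 2x - 4,  f″(x) = 2 > 0 for every x,

so this function is indeed strictly convex.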

It is also possible to use quasi-convex functions with a gradient descent algorithm. However,
they often have so-called saddle points (also called minimax points) where the algorithm can get
stuck (we will demonstrate this later). An example of a quasi-convex function is:
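The formula is not shown here. A function consistent with the derivative analysis that follows (an assumption on my part; the additive constant is arbitrary) is:

f(x) = x⁴ - 2x³ + 2, with f′(x) = 4x³ - 6x² = 2x²(2x - 3) and f″(x) = 12x² - 12x = 12x(x - 1).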

Let’s stop here for a moment. We see that the first derivative equal zero at x=0 and x=1.5. This places are
candidates for function’s extrema (minimum or maximum )— the slope is zero there. But first we have to
check the second derivative first.

The value of this expression is zero for x=0 and x=1. These locations are called inflexion points,
places where the curvature changes sign, meaning the function changes from convex to concave
or vice versa. By analysing this equation we conclude that:

• for x<0: function is convex


• for 0<x<1: function is concave (the 2nd derivative < 0)

• for x>1: function is convex again

Now we see that the point x=0 has both its first and second derivative equal to zero, meaning it is
a saddle point, while the point x=1.5 is the global minimum.

Let’s look at the graph of this function. As calculated before a saddle point is at x=0 and
minimum at x=1.5.

Semi-convex function with a saddle point;

For multivariate functions, the most appropriate way to check whether a point is a saddle point is
to calculate the Hessian matrix, which involves slightly more complex calculations and is beyond
the scope of this syllabus.

An example of a saddle point in a bivariate function is shown below.


3. Gradient

Before jumping into the code, one more thing has to be explained: what a gradient is. Intuitively,
it is the slope of a curve at a given point in a specified direction.

In the case of a univariate function, it is simply the first derivative at a selected point. In the
case of a multivariate function, it is a vector of derivatives in each main direction (along the
variable axes). Because each of these is a slope along only one axis, ignoring the others, they are
called partial derivatives.

A gradient for an n-dimensional function f(x) at a given point p is defined as follows:
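The formula itself is not reproduced here; the standard definition being referred to is:

∇f(p) = [ ∂f/∂x₁(p), ∂f/∂x₂(p), …, ∂f/∂xₙ(p) ]ᵀ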

Nabla or Del

The upside-down triangle is the so-called nabla symbol, and you read it as "del". To better
understand how to calculate a gradient, let's do a hand calculation for an exemplary 2-dimensional function below.
3D plot;

Let’s assume we are interested in a gradient at point p(10,10):

so consequently:

By looking at these values we conclude that the slope is twice as steep along the y axis.
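The intermediate formulas did not survive extraction. As a sketch, assuming the exemplary bivariate function is f(x, y) = 0.5x² + y² (my assumption, chosen to be consistent with the conclusion above):

∂f/∂x = x and ∂f/∂y = 2y, so at p(10, 10) the gradient is ∇f(10, 10) = [10, 20]ᵀ, and the slope along y (20) is indeed twice the slope along x (10).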

4. Gradient Descent Algorithm

The Gradient Descent algorithm iteratively calculates the next point using the gradient at the
current position, scales it (by a learning rate) and subtracts the obtained value from the current
position (makes a step). It subtracts the value because we want to minimise the function (to
maximise it, we would add it instead). This process can be written as:
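Written out, this is the standard update rule the sentence above describes:

pₙ₊₁ = pₙ - η·∇f(pₙ)

where pₙ is the current position, η is the learning rate and ∇f(pₙ) is the gradient evaluated at pₙ.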

There’s an important parameter η which scales the gradient and thus controls the step size. In
machine learning, it is called learning rate and have a strong influence on performance.

• The smaller the learning rate, the longer GD takes to converge, and it may reach the maximum
number of iterations before reaching the optimum point.

• If the learning rate is too big, the algorithm may not converge to the optimal point (it jumps
around it) or may even diverge completely. (A short worked illustration of both effects follows this list.)
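As a quick check of these claims, here is a minimal worked case of my own (using f(x) = x², which is not one of the functions from this section). The GD update becomes

xₙ₊₁ = xₙ - η·2xₙ = (1 - 2η)xₙ,  so  xₙ = (1 - 2η)ⁿ x₀.

The iterates converge to the minimum at 0 exactly when |1 - 2η| < 1, i.e. 0 < η < 1: for a very small η convergence is slow, for η close to 1 the iterates jump from one side of the minimum to the other, and for η > 1 they diverge.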

In summary, Gradient Descent method’s steps are:

1. choose a starting point (initialisation)

2. calculate gradient at this point

3. make a scaled step in the opposite direction to the gradient (objective: minimise)

4. repeat points 2 and 3 until one of the criteria is met:

• maximum number of iterations reached

• step size is smaller than the tolerance (due to scaling or a small gradient).

Below, there’s an exemplary implementation of the Gradient Descent algorithm (with steps
tracking):

import numpy as np
from typing import Callable

def gradient_descent(start: float, gradient: Callable[[float], float],
                     learn_rate: float, max_iter: int, tol: float = 0.01):
    x = start
    steps = [start]  # history tracking
    for _ in range(max_iter):
        diff = learn_rate * gradient(x)
        if np.abs(diff) < tol:
            break
        x = x - diff
        steps.append(x)  # history tracking

    return steps, x

This function takes 5 parameters:

1. starting point [float] - in our case, we define it manually but in practice, it is often a random
initialisation

2. gradient function [object] - a function calculating the gradient, which has to be specified
beforehand and passed to the GD function

3. learning rate [float] - scaling factor for step sizes

4. maximum number of iterations [int]

5. tolerance [float] to conditionally stop the algorithm (in this case a default value is 0.01)

5. Example 1 — a quadratic function

Let’s take a simple quadratic function defined as:

Because it is an univariate function a gradient function is:

Let’s write these functions in Python:

def func1(x: float):
    return x**2 - 4*x + 1

def gradient_func1(x: float):
    return 2*x - 4

For this function, by taking a learning rate of 0.1 and a starting point at x = 9, we can easily
calculate each step by hand. Let's do it for the first 3 steps:
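With f′(x) = 2x - 4, a learning rate of 0.1 and the starting point x₀ = 9, the first three updates work out to:

x₁ = 9 - 0.1·(2·9 - 4) = 9 - 1.4 = 7.6
x₂ = 7.6 - 0.1·(2·7.6 - 4) = 7.6 - 1.12 = 6.48
x₃ = 6.48 - 0.1·(2·6.48 - 4) = 6.48 - 0.896 = 5.584

Each step is smaller than the previous one because the gradient shrinks as we approach the minimum at x = 2.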

#Program to calculate Gradient Descent

import numpy as np
from typing import Callable

def gradient_descent(start: float, gradient: Callable[[float], float],
                     learn_rate: float, max_iter: int, tol: float = 0.01):
    x = start
    steps = [start]  # history tracking
    for _ in range(max_iter):
        diff = learn_rate * gradient(x)
        if np.abs(diff) < tol:
            break
        x = x - diff
        steps.append(x)  # history tracking
    return steps, x

def func1(x: float):
    return x**2 - 4*x + 1

def gradient_func1(x: float):
    return 2*x - 4

The Python function is called by passing a starting point, the gradient function, a learning rate
and the stopping criteria; the output of the program is the history of visited points together with
the final converged value. For example:
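A minimal driver, sketched with the learning rate of 0.1 and starting point x = 9 quoted above (the value max_iter=100 is my own choice, not taken from the original):

history, result = gradient_descent(start=9, gradient=gradient_func1,
                                   learn_rate=0.1, max_iter=100, tol=0.01)
print(result)        # ends close to the true minimum at x = 2 (within tolerance)
print(len(history))  # number of points visited, including the start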

The animation below shows the steps taken by the GD algorithm for learning rates of 0.1 and 0.8.
As you can see, for the smaller learning rate the steps get gradually smaller as the algorithm
approaches the minimum. For the bigger learning rate, it jumps from one side to the other before
converging.

First 10 steps taken by GD for small and big learning rate;


Image by Robert Kwiatkowski

Trajectories, number of iterations and the final converged result (within tolerance) for various
learning rates are shown below:

Results for various learning rates; Image by author

6. Example 2 — a function with a saddle point

Now let’s see how the algorithm will cope with a semi-convex function we investigated
mathematically before.

Below are the results for two learning rates and two different starting points.

GD trying to escape from a saddle point; Image by author

Below is an animation for a learning rate of 0.4 and a starting point of x = -0.5.
Animation of GD trying to escape from a saddle point; Image by author

Now you can see that the existence of a saddle point poses a real challenge for first-order
gradient descent algorithms like GD, and obtaining a global minimum is not guaranteed. Second-
order algorithms deal with these situations better (e.g. the Newton-Raphson method).

Investigating saddle points and how to escape from them is a subject of ongoing research, and
various solutions have been proposed. For example, Chi Jin and M. Jordan proposed a Perturbed
Gradient Descent algorithm.

7. Summary

We have learned how the Gradient Descent algorithm works, when it can be used, and what some
of the common challenges are when using it. I hope this will be a good starting point for exploring
more advanced gradient-based optimisation methods like Momentum, Nesterov (Accelerated)
Gradient Descent, RMSprop and ADAM, or second-order methods like the Newton-Raphson algorithm.
