Data Science Chacha
This course will take you on an exciting adventure through the realm of data science, demystifying
its key concepts, methodologies, and applications. Whether you're a seasoned data professional,
an aspiring data scientist, or simply someone intrigued by the power of data, this syllabus will
give you a comprehensive understanding of data science and its real-world implications,
and show you how to get into the field.
Data science is a multi-faceted discipline with applications across diverse industries and
sectors, leveraging the power of data to drive innovation, inform decision-making, and solve
complex problems. Let's look at some key ways industries are applying data science.
❖ Finance and Banking: The financial sector heavily relies on data science for risk
assessment, fraud detection, and algorithmic trading. Data scientists can develop models
for credit scoring, portfolio management, and identifying potential risks by analyzing
market trends, economic indicators, and customer data.
❖ Social Media and Marketing: Data science is pivotal in social media analytics, helping
businesses understand user behavior, sentiment analysis, and targeted advertising. By
leveraging social media data, companies can enhance their marketing strategies, engage
with customers effectively, and drive brand awareness.
❖ Government and Public Policy: Governments increasingly use data science to make
data-driven policy decisions, improve public services, and enhance governance.
Analyzing socioeconomic data, census data, and public health records enables
policymakers to identify societal challenges, allocate resources effectively, and measure
the impact of policy interventions.
❖ While business intelligence (BI) and data science share similarities in their data
utilization, key distinctions exist between the two disciplines. Understanding these
differences is crucial for organizations seeking to leverage data effectively. Here’s a brief
overview of how business intelligence and data science differ:
❖ Business intelligence typically focuses on reporting and describing what has already happened
using structured, historical data. Data science, on the other hand, encompasses a broader and more exploratory approach to
data analysis. It involves extracting insights and generating predictive models by
leveraging statistical techniques, machine learning algorithms, and domain expertise.
Data science incorporates structured and unstructured data from various sources,
including internal systems, external APIs, social media, and sensor data.
❖ Reporting and Visualization: BI tools excel at generating reports, dashboards, and visual
representations of data for business users to understand and monitor key metrics.
❖ Predictive and Prescriptive Analytics: Data science aims to uncover actionable insights,
make predictions, and drive informed decision-making using advanced analytical
techniques.
❖ Unstructured and Big Data: Data science deals with structured and unstructured data,
including text, images, and sensor data. It embraces the challenges and opportunities
presented by big data.
Machine Learning as a Service (MLaaS) is becoming the next big thing as data gets cheaper, data
science becomes more accessible, and processing power keeps improving. The growing trend of shifting
data storage to the cloud, maintaining it there, and deriving the best insights from it has found an ally
in MLaaS, which provides these solutions at a reduced cost. It essentially helps developers and
organisations benefit from machine learning while saving cost and time, without constant human
intervention and additional programming.
Much like SaaS, IaaS and PaaS, MLaaS provides users with a range of tools as part of a cloud
computing service, including facial recognition, natural language processing, data
visualisation, image recognition and deep learning. It is supported by algorithms such as deep
neural networks, convolutional neural networks, Bayesian networks, probabilistic graphical
models, restricted Boltzmann machines and pattern recognition, among others.
Here is a rundown of a few machine learning services that can benefit your organisation:
Amazon Machine Learning offers a high level of automation, with visual aids and easy-to-access
analytics that make machine learning accessible to developers without their having to learn
complex machine learning algorithms and technology. It also offers companies an easy, highly
scalable on-ramp for interpreting data, helping businesses build machine learning models
without having to write the code themselves.
Amazon Machine Learning can help generate billions of predictions daily, and serve those
predictions in real-time and at high throughput, claims the company. According to Amazon,
Amazon ML is based on the same proven, highly scalable, machine learning technology used by
Amazon to perform critical functions like supply chain management, fraudulent transaction
identification, and catalog organization. The Amazon ML service is based on a pay-as-you-go
pricing model, with no minimum fees. Data analysis and model building are charged at $0.42 per
hour, with separate fees for batch predictions ($0.10 per 1,000 predictions, rounded up to the
next 1,000) and real-time predictions ($0.0001 per prediction, rounded up to the nearest penny).
Charges for data stored in Amazon S3, Amazon RDS, or Amazon Redshift are billed separately.
This machine learning service is aimed at both beginners and experienced data scientists. It offers
a range of tools and flexible out-of-the-box algorithms. Azure supports a range of operating
systems, programming languages, frameworks, databases and devices, and it also provides
cross-device experiences with support for all major mobile platforms. With the help of an
integrated development environment called Machine Learning Studio, developers can build data
models through drag-and-drop gestures and simple data flow diagrams. This not only saves a lot
of time but also minimises coding through ML Studio's library of sample experiments.
Azure ML Studio also offers a huge variety of algorithms, with around 100 methods available to developers.
The most popular option in Microsoft Azure Machine Learning Studio is available for free and
requires only a Microsoft account. This free tier never expires and includes 10 GB of storage,
R and Python support, and predictive web services. The standard workspace of ML Studio,
however, costs $9.90 per month and requires an Azure subscription.
Google Cloud Machine Learning Engine is highly flexible and offers users an easy way to build
machine learning models for data of any size and type. The engine is based on the TensorFlow
project and is integrated with other Google services such as Google Cloud Dataflow, Google
Cloud Storage and Google BigQuery, among others, although the platform is mostly aimed at
deep neural network tasks. You can sign up for a free trial to access Google Cloud Machine
Learning: there are no initial charges, and once you sign up you get $300 to spend on Google
Cloud Platform over the next 12 months. Once your free trial ends, however, you have to pay
for a subscription.
Watson Machine Learning runs on IBM's Bluemix and is capable of both training and scoring.
With the training function, developers can use Watson to refine an algorithm so that it can learn
from a dataset, while the scoring function helps predict an outcome using a trained model.
Watson addresses the needs of both data scientists and developers. The notebook tool in Watson
can help researchers learn more about machine learning algorithms. According to a report,
Watson is intended to address questions of deployment, operationalization, and even deriving
business value from machine learning models.
The visual modelling tools of IBM's Watson Machine Learning help users quickly identify
patterns, gain insights and make decisions faster. Its open source technologies allow users to
keep using their own Jupyter notebooks with Python, R and Scala.
To use the service, you need to create an account with Bluemix for the free trial. After your
30-day free trial ends, you need to choose between the Lite, Standard and Professional plans.
Lite is available for free, with up to 5,000 predictions and 5 compute hours. Standard and
Professional charge a flat rate per thousand predictions plus the total number of compute hours:
Standard is priced at $0.500 per 1,000 predictions and Professional at $0.400 per 1,000
predictions.
The Haven OnDemand machine learning service provides developers with services and APIs for
building applications. There are more than 60 APIs available in Haven, covering features such as
face detection, speech recognition, image classification, media analysis, object recognition and
scene change detection. It also provides powerful search curation features that enable developers
to optimise search results. With the help of this machine learning service, organisations can
extract, analyse and index multiple data formats, including emails and audio and video archives.
Haven OnDemand pricing plans start at $10 per month.
BigML
BigML is easy to use and flexible to deploy. It allows data imports from AWS, Microsoft Azure,
Google Storage, Google Drive, Dropbox and more. BigML has many features integrated into its
web UI, along with a large gallery of free datasets and models. It also provides useful clustering
algorithms and visualizations, and an anomaly detection feature that helps detect pattern
anomalies, which can save you time and money.
According to a blog post, BigML datasets are very easy to reuse, edit, expand and export. You can
easily rename and add descriptions to each one of your fields, add new ones (through
normalization, discretization, mathematical operations, missing-value replacement, etc.), and
generate subsets based on sampling or custom filters. Pricing is flexible: you can choose between
subscription plans, starting at $15 per month for students, and you can perform unlimited tasks
on datasets up to 16 MB for free.
There is also a pay-as-you-go option available in BigML. For companies with stringent data
security, privacy or regulatory requirements, BigML offers private deployment that can run on
their preferred cloud provider or ISP.
MLJAR
MLJAR is a ‘human-first platform’ for machine learning and is available in beta version. It
provides a service for development, prototyping and deploying pattern recognition algorithms. It
provides features like built-in hyper-parameters search, one interface for many algorithms,
among others. To get started, users upload a dataset, select the input and target attributes, and
the service finds a matching ML algorithm. MLJAR is also based on a pay-as-you-go pricing
model. Once your 30-day free trial ends, there are different subscription plans for professional
developers, startups, businesses and organisations. When you start a subscription, you get 10
free credits.
Arimo
Arimo uses machine learning algorithms and a large computing platform to crunch massive
amounts of data in seconds. It describes itself as 'behavioural AI for IoT', which learns from past
behaviour, predicts future actions and drives superior business outcomes. The service is based on
a deep learning architecture that works with time series data to discover patterns of behaviour.
Domino
Domino is a platform that supports the modern data analysis workflow. The platform is language
agnostic, supporting Python, R, MATLAB, Perl, Julia and shell scripts, among others. Domino is
aimed at data scientists, data science managers, IT leaders and executives, and the service can be
deployed on-site or in the cloud. Developers can develop, deploy and collaborate using their
existing tools and languages, claims the company. It also streamlines knowledge management,
with all projects stored, searchable and forkable. It offers rich functionality in an integrated
end-to-end platform for version control and collaboration, along with one-click infrastructure
scalability, deployment and publishing.
It is a collaborative data science platform for data scientists, data analysts and engineers to
explore, prototype, build and deliver their own data products more efficiently. The platform
supports R, Python, Scala, Hive, Pig, Spark and more, and offers a customisable drag-and-drop
visual interface at every step of the dataflow prototyping process. It provides machine learning
technologies such as scikit-learn, MLlib, XGBoost and H2O, among others, in a visual user interface.
Types of Data
• Excel: Power BI can connect to Excel workbooks, which may include data in tables or
data models created using Power Query or Power Pivot.
• Text/CSV: Comma-separated values files can be imported directly into Power BI.
• JSON: JavaScript Object Notation files, useful for web data and APIs.
• Amazon Redshift: Data warehouse product which forms part of the larger cloud-
computing platform Amazon Web Services.
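Power BI connects to these sources through its own interface; purely as an illustration in this course's Python setting, the same file-based formats can also be read with pandas. A minimal sketch with hypothetical file names (reading Excel additionally requires an engine such as openpyxl):
import pandas as pd

# Hypothetical file names used only for illustration
excel_df = pd.read_excel("sales.xlsx", sheet_name="Sheet1")  # Excel workbook
csv_df = pd.read_csv("sales.csv")                            # Text/CSV file
json_df = pd.read_json("sales.json")                         # JSON file

print(excel_df.head(), csv_df.head(), json_df.head(), sep="\n")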
Feature Creation
Feature creation involves generating new features based on domain knowledge or by observing
patterns in the data. This can significantly improve the performance of a machine learning
model. Types of feature creation include:
Feature Transformation
Feature transformation converts features into a more suitable representation for the machine
learning model, for example through normalization, standardization, or log and power transformations.
Feature Extraction
Feature extraction creates new features from existing ones to provide more relevant information
to the model. Techniques include dimensionality reduction methods such as Principal Component
Analysis (PCA), discussed later in this course.
Feature Selection
Feature selection involves selecting a subset of relevant features from the dataset to be used in
the model. This can reduce overfitting, improve model performance, and decrease computational
costs. Methods include:
• Filter Method: Based on the statistical relationship between the feature and the target
variable.
• Wrapper Method: Based on the evaluation of the feature subset using a specific
machine learning algorithm.
• Embedded Method: Feature selection performed as part of the training process of the algorithm.
Feature Scaling
Feature scaling transforms features so that they have a similar scale, which is important for many
machine learning algorithms. Techniques include min-max scaling and standardization (z-score scaling).
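As a small illustration of the scaling techniques named above, here is a minimal scikit-learn sketch with a made-up feature matrix:
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up feature matrix: two features on very different scales
X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0], [4.0, 1000.0]])

# Min-max scaling maps each feature to the [0, 1] range
print(MinMaxScaler().fit_transform(X))

# Standardization rescales each feature to zero mean and unit variance
print(StandardScaler().fit_transform(X))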
Feature engineering is essential because the quality of features used to train machine learning
models heavily influences their performance. By providing more meaningful and relevant
information, feature engineering helps models learn more effectively from the data.
Dummification
When working with categorical variables in machine learning models, it is essential to convert
them into numerical form. This conversion allows the model to understand and utilize the
information contained within these variables. In Python 3 programming, two common
approaches to handle categorical variables in XGBoost models are dummification and encoding.
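A minimal sketch of both approaches on a made-up DataFrame; pd.get_dummies performs the dummification, and pandas category codes give a simple integer encoding:
import pandas as pd

# Made-up categorical data for illustration
df = pd.DataFrame({"city": ["Pune", "Delhi", "Pune", "Mumbai"]})

# Dummification: one binary (0/1) column per category
dummies = pd.get_dummies(df, columns=["city"])
print(dummies)

# Encoding: a single integer code per category
df["city_code"] = df["city"].astype("category").cat.codes
print(df)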
Several tools can help streamline and automate the feature engineering process, including
Featuretools and the preprocessing utilities in scikit-learn.
In conclusion, feature engineering is a vital step in the machine learning pipeline that involves
creating, transforming, extracting, and selecting features to improve model performance. It
requires substantial data analysis and domain knowledge to effectively encode features for
different models and datasets.
Machine learning is a branch of computer science that studies the design of algorithms that can
learn.
Data cleaning, Data transformation, Data reduction, and Data integration are the major steps in
data preprocessing.
Data Cleaning
Data cleaning, one of the major preprocessing steps in machine learning, locates and fixes errors
or discrepancies in the data. From duplicates and outliers to missing numbers, it fixes them all.
Methods like transformation, removal, and imputation help ML professionals perform data
cleaning seamlessly.
Data Integration
Data integration is among the major responsibilities of data preprocessing in machine learning.
This process integrates (merges) information extracted from multiple sources to outline and
create a single dataset. The fact that you need to handle data in multiple forms, formats, and
semantics makes data integration a challenging task for many ML developers.
Data Transformation
ML programmers must pay close attention to data transformation when it comes to data
preprocessing steps. This process entails putting the data in a format that will allow for analysis.
Normalization, standardization, and discretisation are common data transformation procedures.
Normalization scales data to a common range, standardization transforms data to have zero mean
and unit variance, and discretisation converts continuous data into discrete categories.
Data Reduction
Data reduction is the process of lowering the dataset’s size while maintaining crucial
information. Through the use of feature selection and feature extraction algorithms, data
reduction can be accomplished. While feature extraction entails translating the data into a lower-
dimensional space while keeping the crucial information, feature selection requires choosing a
subset of pertinent characteristics from the dataset. For example, if employee data contains
attributes such as date of birth (DOB) and date of joining the company, the two date attributes
can be replaced with derived attributes such as years of service (the current date minus the date
of joining).
Definition:
Data curation is the work of creating and overseeing ready-to-use data sets for BI and analytics.
It involves tasks such as indexing, cataloging and maintaining data sets and their associated
metadata to help users find and access the data.
Data curation is the process of creating, organizing and maintaining data sets so they
can be accessed and used by people looking for information. It involves collecting,
structuring, indexing and cataloging data for users in an organization, group or the
general public. Data can be curated to support business decision-making, academic
needs, scientific research and other purposes.
A data preprocessing pipeline is a series of sequential data transformation steps that are
applied to the raw input data to prepare it for model training and evaluation. These pipelines
help maintain consistency, ensure reproducibility, and enhance the efficiency of the
preprocessing process.
Example
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Load the iris data and split it into training and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create a pipeline: impute missing values, scale, reduce dimensions, then classify
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2)),
    ('model', LogisticRegression())
])

# Fit the full pipeline on the training data
pipeline.fit(X_train, y_train)

# Inspect the output of each preprocessing step on the training data
X_imputed = pipeline.named_steps['imputer'].transform(X_train)
X_scaled = pipeline.named_steps['scaler'].transform(X_imputed)
X_pca = pipeline.named_steps['pca'].transform(X_scaled)

print("Imputed data:")
print(X_imputed[:5])
print("Scaled data:")
print(X_scaled[:5])
print("PCA-transformed data:")
print(X_pca[:5])
Output :
Scatter matrix plot for iris data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_data = load_iris()
print(iris_data["feature_names"])
print(iris_data["target_names"])

# Scatter matrix of the four iris features, colored by target class
iris_df = pd.DataFrame(iris_data["data"], columns=iris_data["feature_names"])
pd.plotting.scatter_matrix(iris_df, c=iris_data["target"], figsize=(8, 8))
plt.show()
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
The hope that comes with this discipline is that incorporating experience into its tasks will
eventually improve the learning. The ultimate goal is for this improvement to happen in such a
way that the learning itself becomes automatic, so that humans like ourselves no longer need to
intervene.
Today’s scikit-learn tutorial will introduce you to the basics of Python machine learning:
• You'll learn how to use Python and its libraries to explore your data with the help
of matplotlib and Principal Component Analysis (PCA),
• And you'll preprocess your data with normalization and you'll split your data into training
and test sets.
• Next, you'll work with the well-known KMeans algorithm to construct an unsupervised
model, fit this model to your data, predict values, and validate the model that you have
built.
• As an extra, you'll also see how you can use Support Vector Machines (SVM) to construct
another model to classify your data.
If you’re more interested in an R tutorial, take a look at our Machine Learning with R for
Beginners tutorial.
Alternatively, check out DataCamp's Supervised Learning with scikit-learn and Unsupervised
Learning in Python courses!
The first step in just about anything in data science is loading in your data. This is also the starting
point of this scikit-learn tutorial.
This discipline typically works with observed data. This data might be collected by yourself or
you can browse through other sources to find data sets. But if you’re not a researcher or
otherwise involved in experiments, you’ll probably do the latter.
If you're new to this and you want to start working on problems on your own, finding these data sets might
prove to be a challenge. However, you can typically find good data sets at the UCI Machine
Learning Repository or on the Kaggle website. Also, check out this KD Nuggets list with
resources.
For now, you should warm up, not worry about finding any data by yourself and just load in
the digits data set that comes with a Python library, called scikit-learn.
Fun fact: did you know the name originates from the fact that this library is a scientific toolbox
built around SciPy? By the way, there is more than just one scikit out there. This scikit contains
modules specifically for machine learning and data mining, which explains the second
component of the library name. :)
To load in the data, you import the module datasets from sklearn. Then, you can use
the load_digits() method from datasets to load in the data:
from sklearn import datasets

digits = datasets.load_digits()
print(digits)
Note that the datasets module contains other methods to load and fetch popular reference
datasets, and you can also count on this module in case you need artificial data generators. In
addition, this data set is also available through the UCI Repository that was mentioned above:
you can find the data here.
If you had decided to pull the data from the latter page, your data import would have looked
like this:
import pandas as pd
digits = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/optdigits/optdigits.tra", header=None)
print(digits)
Note that if you download the data like this, the data is already split up in a training and a test
set, indicated by the extensions .tra and .tes. You’ll need to load in both files to elaborate your
project. With the command above, you only load in the training set.
Tip: if you want to know more about importing data with the Python data manipulation library
Pandas, consider Importing Data in Python .
When first starting out with a data set, it’s always a good idea to go through the data description
and see what you can already learn. When it comes to scikit-learn, you don’t immediately have
this information readily available, but in the case where you import data from another source,
there's usually a data description present, which will already be a sufficient amount of
information to gather some insights into your data.
However, these insights are often not deep enough for the analysis that you are going to
perform. You really need to have a good working knowledge of the data set.
Performing an exploratory data analysis (EDA) on a data set like the one that this tutorial now
has might seem difficult.
When you printed out the digits data after having loaded it with the help of the scikit-
learn datasets module, you will have noticed that there is already a lot of information available.
You already have knowledge of things such as the target values and the description of your data.
You can access the digits data through the attribute data. Similarly, you can also access the target
values or labels through the target attribute and the description through the DESCR attribute.
To see which keys you have available to already get to know your data, you can just
run digits.keys().
print(digits.keys())
print(digits.data)
print(digits.target)
print(digits.DESCR)
The next thing that you can (double)check is the type of your data.
If you used read_csv() to import the data, you would have had a data frame that contains just the
data. There wouldn’t be any description component, but you would be able to resort to, for
example, head() or tail() to inspect your data. In these cases, it’s always wise to read up on the
data description folder!
However, this tutorial assumes that you make use of the library's data and the type of
the digits variable is not that straightforward if you’re not familiar with the library. Look at the
print out in the first code chunk. You’ll see that digits actually contains numpy arrays!
This is already quite some important information. But how do you access these arrays?
It’s very easy, actually: you use attributes to access the relevant arrays.
Remember that you have already seen which attributes are available when you
printed digits.keys(). For instance, you have the data attribute to isolate the data, target to see the
target values and DESCR for the description, …
The first thing that you should know of an array is its shape. That is, the number of dimensions
and items that is contained within an array. The array’s shape is a tuple of integers that specify
the sizes of each dimension. In other words, if you have a 3d array like this y = np.zeros((2, 3,
4)), the shape of your array will be (2,3,4).
Now let’s try to see what the shape is of these three arrays that you have distinguished
(the data, target and DESCR arrays).
Use first the data attribute to isolate the numpy array from the digits data and then use
the shape attribute to find out more. You can do the same for the target and DESCR. There’s also
the images attribute, which is basically the data in images. You’re also going to test this out.
digits_data = digits.data
print(digits_data.shape)
digits_target = digits.target
print(digits_target.shape)
number_digits = len(np.unique(digits.target))
print(number_digits)
digits_images = digits.images
print(digits_images.shape)
To recap: by inspecting digits.data, you see that there are 1797 samples and that there are 64
features. Because you have 1797 samples, you also have 1797 target values.
But all those target values contain 10 unique values, namely, from 0 to 9. In other words, all
1797 target values are made up of numbers that lie between 0 and 9. This means that the digits
that your model will need to recognize are numbers from 0 to 9.
Lastly, you see that the images data contains three dimensions: there are 1797 instances that are
8 by 8 pixels big. You can visually check that the images and the data are related by reshaping
the images array to two dimensions: digits.images.reshape((1797, 64)).
print(np.all(digits.images.reshape((1797,64)) == digits.data))
With the numpy method all(), you test whether all array elements along a given axis evaluate
to True. In this case, you evaluate if it’s true that the reshaped images array equals digits.data.
You’ll see that the result will be True in this case.
Then, you can take your exploration up a notch by visualizing the images that you’ll be working
with. You can use one of Python’s data visualization libraries, such as matplotlib, for this
purpose:
# Import matplotlib
import matplotlib.pyplot as plt

# Set up a figure of 6x6 inches and adjust the spacing between the subplots
fig = plt.figure(figsize=(6, 6))
fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.005, wspace=0.05)
for i in range(64):
    # Initialize the subplots: add a subplot in the grid of 8 by 8, at the i+1-th position
    ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])
    # Display the image with a binary color map and label it with its target value
    ax.imshow(digits.images[i], cmap=plt.cm.binary, interpolation='nearest')
    ax.text(0, 7, str(digits.target[i]))
plt.show()
The code chunk seems quite lengthy at first sight and this might be overwhelming. But, what
happens in the code chunk above is actually pretty easy once you break it down into parts:
• Next, you set up a figure with a figure size of 6 inches wide and 6 inches long. This is
your blank canvas where all the subplots with the images will appear.
• Then you go to the level of the subplots to adjust some parameters: you set the left side of
the subplots of the figure to 0, the right side of the subplots of the figure to 1, the bottom
to 0 and the top to 1. The height of the blank space between the subplots is set at 0.005 and
the width is set at 0.05. These are merely layout adjustments.
• After that, you start filling up the figure that you have made with the help of a for loop.
• You initialize the subplots one by one, adding one at each position in the grid that
is 8 by 8 images big.
• You display each time one of the images at each position in the grid. As a color map, you
take binary colors, which in this case will result in black, gray values and white colors.
The interpolation method that you use is 'nearest', which means that your data is
interpolated in such a way that it isn’t smooth. You can see the effect of the different
interpolation methods here.
• The cherry on the pie is the addition of text to your subplots. The target labels are printed
at coordinates (0,7) of each subplot, which in practice means that they will appear in the
bottom-left of each of the subplots.
On a simpler note, you can also visualize the target labels with an image, just like this:
# Import matplotlib
import matplotlib.pyplot as plt

# Join the images and the target labels in a list
images_and_labels = list(zip(digits.images, digits.target))
# For the first 8 elements, plot the image with its label as the title
for index, (image, label) in enumerate(images_and_labels[:8]):
    plt.subplot(2, 4, index + 1)
    plt.axis('off')
    plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    plt.title('Training: ' + str(label))
plt.show()
Then, you say that for the first eight elements of images_and_labels (note that the index starts at
0!), you initialize subplots in a grid of 2 by 4 at each position. You turn off the plotting of the
axes and you display images in all the subplots with the color map plt.cm.gray_r (which returns all
grey colors) and the interpolation method nearest. You give a title to each subplot, and
you show it.
And now you have a very good idea of the data that you'll be working with!
As the digits data set contains 64 features, this might prove to be a challenging task. You can
imagine that it’s very hard to understand the structure and keep the overview of the digits data. In
such cases, it is said that you’re working with a high dimensional data set.
High dimensionality of data is a direct result of trying to describe the objects via a collection of
features. Other examples of high dimensional data are, for example, financial data, climate data,
neuroimaging, …
But, as you might have gathered already, this is not always easy. In some cases, high
dimensionality can be problematic, as your algorithms will need to take into account too many
features. In such cases, you speak of the curse of dimensionality. Having a lot of
dimensions can also mean that your data points are far away from virtually every other point,
which makes the distances between the data points uninformative.
Don't worry, though, because the curse of dimensionality is not simply a matter of counting the
number of features. There are also cases in which the effective dimensionality might be much
smaller than the number of the features, such as in data sets where some features are irrelevant.
In addition, you can also understand that data with only two or three dimensions is easier to
grasp and can also be visualized easily.
That all explains why you’re going to visualize the data with the help of one of the
Dimensionality Reduction techniques, namely Principal Component Analysis (PCA). The idea in
PCA is to find linear combinations of the original variables that contain most of the information.
These new variables, or "principal components", can replace the original variables.
In short, it’s a linear transformation method that yields the directions (principal components) that
maximize the variance of the data. Remember that the variance indicates how far a set of data
points lie apart.
You can easily apply PCA to your data with the help of scikit-learn:
from sklearn import datasets
from sklearn.decomposition import PCA

digits = datasets.load_digits()

# Create a randomized PCA model that takes two components
# (RandomizedPCA was merged into PCA as svd_solver='randomized' in recent scikit-learn versions)
randomized_pca = PCA(n_components=2, svd_solver='randomized')
# Fit and transform the data to the model
reduced_data_rpca = randomized_pca.fit_transform(digits.data)

# Create a regular PCA model that takes two components
pca = PCA(n_components=2)
# Fit and transform the data to the model
reduced_data_pca = pca.fit_transform(digits.data)

print(reduced_data_pca.shape)
Output: (1797, 2)
Tip: the randomized solver is used here because it performs better when there's a high number of
dimensions (in recent scikit-learn versions the old RandomizedPCA class is available as PCA with
svd_solver='randomized'). Try replacing the randomized PCA estimator with a regular PCA model
and see what the difference is.
Note how you explicitly tell the model to only keep two components. This is to make sure that
you have two-dimensional data to plot. Also, note that you don’t pass the target class with the
labels to the PCA transformation because you want to investigate if the PCA reveals the
distribution of the different labels and if you can clearly separate the instances from each other.
colors = ['black', 'blue', 'purple', 'yellow', 'white', 'red', 'lime', 'cyan', 'orange', 'gray']
for i in range(len(colors)):
    x = reduced_data_rpca[:, 0][digits.target == i]
    y = reduced_data_rpca[:, 1][digits.target == i]
    plt.scatter(x, y, c=colors[i])
# Add a legend with the digit names and label the principal component axes
plt.legend(digits.target_names, bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.title("PCA Scatter Plot")
plt.show()
# The same scatter plot using the regular PCA model from the tip above
pca = PCA(n_components=2)
# Fit and transform the data to the model
reduced_data_pca = pca.fit_transform(digits.data)
print(reduced_data_pca.shape)

colors = ['black', 'blue', 'purple', 'yellow', 'white', 'red', 'lime', 'cyan', 'orange', 'gray']
for i in range(len(colors)):
    x = reduced_data_pca[:, 0][digits.target == i]
    y = reduced_data_pca[:, 1][digits.target == i]
    plt.scatter(x, y, c=colors[i])
plt.show()
Also note that the last call to show the plot (plt.show()) is not necessary if you’re working in
Jupyter Notebook, as you’ll want to put the images inline.
1. You put your colors together in a list. Note that you list ten colors, which is equal to the
number of labels that you have. This way, you make sure that your data points can be
colored in according to the labels. Then, you set up a range that goes from 0 to 9. Mind
you that this range is not inclusive! Remember that this is the same for indices of a list,
for example.
2. You set up your x and y coordinates. You take the first or the second column
of reduced_data_pca, and you select only those data points for which the label equals the
index that you’re considering. That means that in the first run, you’ll consider the data
points with label 0, then label 1, … and so on.
3. You construct the scatter plot. Fill in the x and y coordinates and assign a color to the
batch that you’re processing. The first run, you’ll give the color black to all data points,
the next run blue, … and so on.
4. You add a legend to your scatter plot. Use the target_names key to get the right labels for
your data points.
Definition
The data is linearly transformed onto a new coordinate system such that the directions
(principal components) capturing the largest variation in the data can be easily identified.
WHY PCA?
➢ When there are many input attributes, it is difficult to visualize the data. There is a very
famous term for this in the machine learning domain: the 'curse of dimensionality'.
➢ Basically, it refers to the fact that a higher number of attributes in a dataset adversely
affects the accuracy and training time of the machine learning model.
➢ Principal Component Analysis (PCA) is a way to address this issue and is used for better
data visualization and improving accuracy.
➢ PCA is an unsupervised pre-processing task that is carried out before applying any ML
algorithm. PCA is based on “orthogonal linear transformation” which is a mathematical
technique to project the attributes of a data set onto a new coordinate system. The
attribute which describes the most variance is called the first principal component and is
placed at the first coordinate.
➢ Similarly, the attribute which stands second in describing variance is called a second
principal component and so on. In short, the complete dataset can be expressed in terms
of principal components. Usually, more than 90% of the variance is explained by
two/three principal components.
➢ Principal component analysis, or PCA, thus converts data from high dimensional space to
low dimensional space by selecting the most important attributes that capture maximum
information about the dataset.
Python Implementation (the full loading code appears at the end of this section; printing the target and feature names gives):
['malignant' 'benign']
['mean radius' 'mean texture' 'mean perimeter' 'mean area'
'mean smoothness' 'mean compactness' 'mean concavity'
'mean concave points' 'mean symmetry' 'mean fractal dimension'
'radius error' 'texture error' 'perimeter error' 'area error'
'smoothness error' 'compactness error' 'concavity error'
'concave points error' 'symmetry error' 'fractal dimension error'
'worst radius' 'worst texture' 'worst perimeter' 'worst area'
'worst smoothness' 'worst compactness' 'worst concavity'
'worst concave points' 'worst symmetry' 'worst fractal dimension']
Apply PCA

from sklearn.decomposition import PCA
# Use three components; df1 is the feature DataFrame built in the implementation code below
principal = PCA(n_components=3)
x = principal.fit_transform(df1)
print(x.shape)

Output :
(569, 3)

principal.components_
The principal.components_ attribute provides an array in which the number of rows is the number of
principal components, while the number of columns equals the number of features in the actual
data. We can easily see that there are three rows, as n_components was chosen to be 3, and each
row has 30 columns, as in the actual data.
Adding the following code to the code above will plot the data. For a 2-D plot, add the following:
plt.figure(figsize=(10,10))
plt.scatter(x[:,0],x[:,1],c=data['target'],cmap='plasma')
plt.xlabel('pc1')
plt.ylabel('pc2')
plt.show()
The colors show the two output classes of the original dataset: benign and malignant tumors. It is
clear that the principal components show a clear separation between the two output classes.
For three principal components, we need to plot a 3d graph. x[:,0] signifies the first principal
component. Similarly, x[:,1] and x[:,2] represent the second and the third principal component.
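A minimal sketch of that 3-D plot, assuming the x array and data object from the PCA code in this section:
from mpl_toolkits.mplot3d import Axes3D  # needed on older matplotlib versions for the 3-D projection
import matplotlib.pyplot as plt

# 3-D scatter of the first three principal components, colored by target class
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(x[:, 0], x[:, 1], x[:, 2], c=data['target'], cmap='plasma')
ax.set_xlabel('pc1')
ax.set_ylabel('pc2')
ax.set_zlabel('pc3')
plt.show()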
The proportion of variance explained by each component can be checked with:
principal.explained_variance_ratio_
Principal Axis Method: PCA searches for a linear combination of variables that extracts the
maximum variance from the variables. Once this component is found, PCA removes the variance
it explains and searches for another linear combination that explains the maximum proportion of
the remaining variance, which leads to orthogonal factors. In this method, we analyze total
variance.
Eigenvector: An eigenvector is a non-zero vector that stays parallel to itself after matrix
multiplication. Suppose x is an eigenvector of an r*r matrix M; then Mx and x are parallel. To
find the eigenvectors and eigenvalues we solve Mx = λx, where both the vector x and the scalar
λ (the eigenvalue) are unknown.
In terms of eigenvectors, principal components capture both the common and the unique
variance of the variables. PCA is a variance-focused approach that seeks to reproduce the total
variance and correlation with all components. The principal components are essentially linear
combinations of the original variables, weighted by their contribution to explaining the variance
in a particular orthogonal dimension.
Eigenvalues: Eigenvalues are also known as characteristic roots. An eigenvalue measures the
variance in all the variables that is accounted for by its factor. The ratio of eigenvalues is the
ratio of the explanatory importance of the factors with respect to the variables: if a factor's
eigenvalue is low, it contributes little to the explanation of the variables. In simple words, it
measures the amount of variance in the total dataset accounted for by the factor. A factor's
eigenvalue can be calculated as the sum of its squared factor loadings over all the variables.
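To connect these ideas to code, the sketch below (with a made-up matrix) shows how NumPy computes the eigenvalues and eigenvectors of a covariance matrix; the principal components point along these eigenvectors:
import numpy as np

# Made-up two-feature data set for illustration
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])

# Center the data and compute its covariance matrix
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# Eigenvalues measure the variance along each eigenvector (principal direction)
eigenvalues, eigenvectors = np.linalg.eig(cov)
print("Eigenvalues:", eigenvalues)
print("Eigenvectors:\n", eigenvectors)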
# Python implementation: load the breast cancer data and build the feature DataFrame
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
print(data.keys())

# Exploratory data
print(data['target_names'])
print(data['feature_names'])

# Put the 30 features into a DataFrame
df1 = pd.DataFrame(data['data'], columns=data['feature_names'])
['malignant' 'benign']
import numpy as np
import matplotlib.pyplot as plt

X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

plt.scatter(X, y)
plt.xlabel('X1')
plt.ylabel('y')
plt.show()
This model has two model parameters, θ0 and θ1 . By tweaking these parameters, you can make
your model represent any linear function.
Before you can use your model, you need to define the parameter values θ0 and θ1. How can
you know which values will make your model perform best? To answer this question, you need
to specify a performance measure. You can either define a utility function (or fitness function)
that measures how good your model is, or you can define a cost function that measures how bad
it is. For linear regression problems, people typically use a cost function that measures the
distance between the linear model’s predictions and the training examples; the objective is to
minimize this distance.
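For reference, the linear model and the mean squared error (MSE) cost that is typically minimized for linear regression can be written as (a standard formulation, stated here for clarity):

\[
\hat{y} = h_{\theta}(x) = \theta_0 + \theta_1 x,
\qquad
\mathrm{MSE}(\theta) = \frac{1}{m} \sum_{i=1}^{m} \bigl(\theta_0 + \theta_1 x^{(i)} - y^{(i)}\bigr)^2
\]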
Now let’s compute θ-hat using the Normal Equation. We will use the inv() function from
NumPy’s Linear Algebra module (np.linalg) to compute the inverse of a matrix, and the dot()
method for matrix multiplication:
X_b = np.c_[np.ones((100, 1)), X] # add x0 = 1 to each instance
theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)
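The closed-form solution that this code computes is the Normal Equation (a standard result, stated here for reference), where X is the matrix X_b with the added bias column and y is the vector of targets:

\[
\hat{\theta} = \bigl(\mathbf{X}^{\top}\mathbf{X}\bigr)^{-1} \mathbf{X}^{\top} \mathbf{y}
\]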
The actual function that we used to generate the data is y = 4 + 3x1 + Gaussian noise.
theta_best = [[3.93600944]
[3.04902507]]
We would have hoped for θ0 = 4 and θ1 = 3 instead of θ0 = 3.936 and θ1 = 3.049. Close
enough, but the noise made it impossible to recover the exact parameters of the original function.
import numpy as np
import matplotlib.pyplot as plt

X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

# Add x0 = 1 to each instance and compute theta-hat with the Normal Equation
X_b = np.c_[np.ones((100, 1)), X]
theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)
print("theta_best =", theta_best)

# Predict at x = 0 and x = 2 and plot the model's predictions over the data
X_new = np.array([[0], [2]])
X_new_b = np.c_[np.ones((2, 1)), X_new]
y_predict = X_new_b.dot(theta_best)
print("y_predict =", y_predict)

plt.plot(X_new, y_predict, "r-")
plt.plot(X, y, "b.")
plt.xlabel('X1')
plt.ylabel('y')
plt.axis([0, 2, 0, 15])
plt.show()
Output
theta_best = [[3.9519002 ]
[2.95492963]]
y_predict = [[3.9519002 ]
[9.86175946]]
Box plot
A boxplot is a standardized way of displaying the distribution of data based on a five number
summary (“minimum”, first quartile (Q1), median, third quartile (Q3), and “maximum”). It can
tell you about your outliers and what their values are. It can also tell you if your data is
symmetrical, how tightly your data is grouped, and if and how your data is skewed.
For some distributions/datasets, you will find that you need more information than the measures
of central tendency (median, mean, and mode).
A box plot is a method for graphically depicting groups of numerical data through their quartiles.
The box extends from the Q1 to Q3 quartile values of the data, with a
line at the median (Q2). The whiskers extend from the edges of box to show the range of the
data. The position of the whiskers is set by default to 1.5 * IQR (IQR = Q3 - Q1, Inter-quartile
range) from the edges of the box. Outlier points are those past the end of the whiskers. Make a
box-and-whisker plot from DataFrame columns, optionally grouped by some other columns.
Code for box plot
import numpy as np
import pandas as pd
import random
import matplotlib.pyplot as plt
import math
import seaborn as sns

np.random.seed(1234)
df = pd.DataFrame(np.random.randn(10, 4), columns=['col1', 'col2', 'col3', 'col4'])

boxplot = df.boxplot(column=['col1', 'col2', 'col3'])
Figure 4.1.3.5 Box plot for different columns with grid visible
boxplot = df.boxplot(column=['col1', 'col2'], grid=False)
Figure 4.1.3.6 Box plot for fewer columns than in the previous figure, with the grid invisible
import numpy as np
import pandas as pd
import random
import matplotlib.pyplot as plt
import math
import seaborn as sns
np.random.seed(1234)
df=pd.DataFrame(np.random.randn(6,2),columns=['col1','col2'])
df['X']=pd.Series(['A','A','A','B','B','B'])
df
Out[2]:
col1 col2 X
0 0.471435 -1.190976 A
1 1.432707 -0.312652 A
2 -0.720589 0.887163 A
3 0.859588 -0.636524 B
4 0.015696 -2.242685 B
5 1.150036 0.991946 B
Explanation of groupby usage in pandas: the DataFrame above has three columns, namely col1,
col2 and X. Column 'X' has two categories, 'A' and 'B'. For the 'A' rows, col1 and col2 hold
numeric data; the boxplot of col1 corresponding to this part is shown on the left side of the
diagram and labeled accordingly, while the boxplot for 'B' is shown on the right side of the same
diagram. This is a very helpful feature in pandas for doing analytics, and all the usual statistical
interpretation of a box-and-whisker plot applies to each box in the plot, as sketched below.
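A minimal sketch of that grouped boxplot, reusing the df built in the code block above:
# One box per category in column 'X' (A and B), for each of col1 and col2
df.boxplot(column=['col1', 'col2'], by='X', grid=False)
plt.show()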
Here is how to read a boxplot. The median is indicated by the line that runs across the middle of
the box. In the boxplot below for col1 by 'A', the median is between -0.9 and 0.9,
around 0.5. Additionally, boxplots display two common measures of the variability or spread in a
data set.
➢ Range. If you are interested in the spread of all the data, it is represented on a boxplot by
the distance between the smallest value and the largest value, including any outliers. In
the boxplot for col1 by 'A', data values range from about -2.0 (the smallest non-outlier)
to about 1.5 (the largest non-outlier), so the range is about 3.5. If you ignore outliers
(there are none in this diagram), the range is illustrated by the distance between the
opposite ends of the whiskers, also about 3.5 in this boxplot.
➢ Interquartile range (IQR=Q3 –Q1). The middle half of a data set falls within the
interquartile range. In a boxplot, the interquartile range is represented by the width of the
box (Q3 minus Q1). In the box plot below for col2 and ‘B’, the interquartile range is
equal to about 0.0 minus -1.5 or about 1.5. And finally, boxplots often provide
information about the shape of a data set. The examples below show some common
patterns.
In machine learning, we aim to build predictive models that forecast the outcome for a given
input data. To achieve this, we take additional steps to tune the trained model. So, we evaluate
the performance of several candidate models to choose the best-performing one.
However, deciding on the best-performing model is not a straightforward task because selecting
the model with the highest accuracy doesn’t guarantee it’ll generate error-free results in the
future. Hence, we apply train-test splits and cross-validation to estimate the model’s
performance on unseen data.
Overfitting happens when we train a machine learning model that is too closely tuned to the
training set. As a result, the model learns the training data too well, but it can't generate good
predictions for unseen data. An overfitted model produces low-accuracy results for data points
not seen in training and hence leads to non-optimal decisions.
A model unable to produce sensible results on new data is also said to be "unable to generalize."
In this case, the model is too complex and fits the noise in the training data rather than only the
underlying patterns. Such a model has high variance and overfits.
Overfitting models produce good predictions for data points in the training set but perform
poorly on new samples.
Underfitting occurs when the machine learning model is not well-tuned to the training set. The
resulting model is not capturing the relationship between input and output well enough.
Therefore, it doesn't produce accurate predictions, even for the training dataset. As a result, an
underfitted model generates poor results that lead to high-error decisions, just like an
overfitted model.
An underfitted model is not complex enough to recognize the patterns in the dataset. Usually, it
has a high bias towards one output value. This is because it considers the variations of the input
data as noise and generates similar outputs regardless of the given input.
When training a model, we want it to fit well to the training data. Still, we want it to
generalize and generate accurate predictions for unseen data, as well. As a result, we don’t
want the resulting model to be on any extreme.
Let's consider a dataset whose points lie on an S-shaped curve, such as a logistic curve.
Fitting a high-order polynomial that passes through the known points with zero error is always
possible. On the other hand, we can fit a straight line, which will have a high error rate.
The first solution generates an overly complex model and models the implicit noise as well as the
dataset. As a result, we can expect a high error for a new data point on the original S-shaped
curve.
Conversely, the second model is far too simple to capture the relationship between the input and
output. Hence, it will perform poorly on new data, too, as shown in figure below.
Figure: an overfitted model (high variance, low bias) versus an underfitted model (low variance, high bias).
Bias
This part of the generalization error is due to wrong assumptions, such as assuming that the data
is linear when it is actually quadratic. A high-bias model is most likely to underfit the training
data.
Variance
This part is due to the model's excessive sensitivity to small variations in the training
data. A model with many degrees of freedom (such as a high-degree polynomial model) is likely
to have high variance, and thus to overfit the training data.
Irreducible error
This part is due to the noisiness of the data itself. The only way to reduce this part of the error is
to clean up the data (e.g., fix the data sources, such as broken sensors, or detect and remove
outliers)
Reducing the error from overfitting or underfitting is referred to as the bias-variance tradeoff.
We aim to find a good fitting model in between, i.e. low bias, low variance.
Summary:

Overfitting                     | Underfitting
--------------------------------|---------------------------------
Model is too complex            | Model is not complex enough
Accurate for training set       | Not accurate for training set
Not accurate for validation set | Not accurate for validation set
Need to reduce complexity       | Need to increase complexity
Reduce number of features       | Increase number of features
Apply regularization            | Reduce regularization
Reduce training                 | Increase training
Add training examples           | Add training examples
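As an illustration of the summary above, the sketch below (on made-up noisy data) compares an underfit degree-1 model with a high-degree polynomial model that tends to overfit; the high-degree model's validation error is typically much worse than its training error:
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Made-up noisy non-linear data
rng = np.random.RandomState(0)
X = np.sort(rng.rand(60, 1) * 6, axis=0)
y = np.sin(X).ravel() + rng.randn(60) * 0.2

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

for degree in (1, 15):  # degree 1 underfits, degree 15 tends to overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print("degree", degree,
          "train MSE:", mean_squared_error(y_train, model.predict(X_train)),
          "validation MSE:", mean_squared_error(y_val, model.predict(X_val)))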
Shree Mahaveerai Namah
Data is the lifeblood of business, and it comes in a huge variety of formats — everything from
strictly formed relational databases to your last post on Facebook. All of that data, in all different
formats, can be sorted into one of two categories: structured or unstructured data.
Structured vs. unstructured data can be understood by considering the who, what, when, where,
and how of the data.
These five questions highlight the fundamentals of both structured and unstructured data, and
allow general users to understand how the two differ. They will also help users understand
nuances like semi-structured data, and guide us as we navigate the future of data in the cloud.
Structured data is data that has been predefined and formatted to a set structure before
being placed in data storage, which is often referred to as schema-on-write. The best example
of structured data is the relational database: the data has been formatted into precisely defined
fields, such as credit card numbers or address, in order to be easily queried with SQL.
1. Easy use by machine learning algorithms: The largest benefit of structured data
is how easily it can be used by machine learning. The specific and organized nature
of structured data allows for easy manipulation and querying of that data.
2. Easy use by business users: Another benefit of structured data is that it can be
used by an average business user with an understanding of the topic to which the
data relates. There is no need to have an in-depth understanding of various different
types of data or the relationships of that data. It opens up self-service data access to
the business user.
3. Increased access to more tools: Structured data also has the benefit of having
been in use for far longer; historically, it was the only option. Data managers have
more product choices when using structured data because there are more tools that
have been tried and tested for using and analyzing structured data.
The cons of structured data are rooted in a lack of data flexibility: because the structure is
defined before the data is stored, the data can generally only be used for its originally intended
purpose, and changing that structure later takes effort.
Structured data is everywhere. It’s the basis for inventory control systems and ATMs. It can be
human- or machine-generated.
Common examples of machine-generated structured data are weblog statistics and point of sale
data, such as barcodes and quantity. And don’t forget spreadsheets — a classic example of
human-generated structured data.
Unstructured data is data stored in its native format and not processed until used, which is known
as schema-on-read. It comes in a myriad of file formats, including email (semi-structured), social
media posts, presentations, chats, IoT sensor data, and satellite imagery.
As with the pros and cons of structured data, unstructured data also has strengths and weaknesses
for specific business needs. Some of its benefits include:
1. Freedom of the native format: Because unstructured data is stored in its native
format, the data is not defined until it is needed. This leads to a larger pool of use
cases, because the purpose of the data is adaptable. It allows for preparation and
analysis of only the data needed. The native format also allows for a wider variety
of file formats in the database, because the data that can be stored is not restricted
to a specific format. That means the company has more data to draw from.
2. Faster accumulation rates: Another benefit of unstructured data is in data
accumulation rates. There is no need to predefine the data, which means it can be
collected quickly and easily.
3. Better pricing and scalability: Unstructured data is often stored in cloud data
lakes, which allow for massive storage. Cloud data lakes also allow for pay-as-
you-use storage pricing, which helps cut costs and allows for easy scalability.
There are also cons to using unstructured data. The biggest challenge is that it requires both
specific expertise and specialized tools in order to be used to its fullest potential.
1. Data science expertise: The largest drawback to unstructured data is that data
science expertise is required to prepare and analyze the data. A standard business
user cannot use unstructured data as-is due to its undefined/non-formatted nature.
Using unstructured data requires understanding the topic or area of the data, but
also of how the data can be related to make it useful.
2. Specialized tools: In addition to the required professional expertise, unstructured
data requires specialized tools to manipulate. Standardized tools are intended for
use with structured data, which leaves a data manager with limited choices in
products — some of which are still in their infancy — for utilizing unstructured
data.
Unstructured data is qualitative rather than quantitative, which means that it is more
characteristic and categorical in nature.
It lends itself well to use cases such as determining how effective a marketing campaign is, or to
uncovering potential buying trends through social media and review websites. Because it can be
used to detect patterns in chats or suspicious email trends, it’s also very useful to organizations in
assisting with monitoring for policy compliance.
The difference between structured data and unstructured data comes down to the types of data
that can be used for each, the level of data expertise required to make use of that data, and
schema-on-write versus schema-on-read.
Structured Data vs. Unstructured Data
Structured data is highly specific and is stored in a predefined format, where unstructured data is
a compilation of many varied types of data that are stored in their native formats. This means that
structured data takes advantage of schema-on-write and unstructured data employs schema-on-
read.
Structured data is commonly stored in data warehouses and unstructured data is stored in data
lakes. Both have cloud-use potential, but structured data allows for less storage space and
unstructured data requires more.
The last difference may hold the most impact. Structured data can be used by the average
business user, but unstructured data requires data science expertise in order to gain
accurate business intelligence.
Semi-structured data refers to what would normally be considered unstructured data, but that
includes metadata that identifies certain characteristics. The metadata contains enough
information to enable the data to be more efficiently cataloged, searched, and analyzed than
strictly unstructured data. Think of semi-structured data as in between structured and
unstructured data.
A good example of semi-structured data vs. structured data would be a tab delimited file
containing customer data versus a database containing CRM tables. On the other hand, semi-
structured data has more hierarchy than unstructured data; the tab delimited file is more specific
than a list of comments from a customer’s Instagram.
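To make this concrete, here is a small, hypothetical illustration of semi-structured data in Python: a JSON record whose metadata fields (author, timestamp) can be catalogued and searched like structured data, while the comment it wraps remains free-form text. The record itself is invented for the example.
import json

# A hypothetical semi-structured record: free-form text wrapped in descriptive metadata
record = '''{
    "author": "customer_42",
    "timestamp": "2024-01-15T10:30:00Z",
    "comment": "Loved the product, but delivery took almost two weeks..."
}'''

data = json.loads(record)
# The metadata fields are easy to index and query; the comment is still unstructured text
print(data["author"], data["timestamp"])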
Side by side, the two types of data compare as follows:
Structured data:
➢ In the form of numbers and text, in standardized, readable formats.
➢ Typically XML and CSV.
➢ Follows a predefined relational data model.
➢ Stored in a relational database in tables, rows, and columns, with specific labels. Relational databases use SQL for processing.
➢ Easy to search and use, with ample analytics tools available.
➢ Quantitative (has countable elements), easy to group based on attributes or characteristics.
Unstructured data:
➢ Comes in a variety of shapes and sizes that do not conform to a predefined data model and remains in its native format.
➢ Typically DOC, WMV, MPW, MP3, WAV.
➢ Does not have a data model, though it may have hidden structure.
➢ Stored in raw formats or in a NoSQL database. Many companies use data lakes to store large volumes of unstructured data that they can then access when needed.
➢ Requires complex search, processing, and analysis before it can be placed in a relational database.
➢ Qualitative, with subjective information that must be split, stacked, grouped, mined, and patterned to analyze it.
Shree Mahaveerai Namah
Web Scraping
Web scraping is a technique used to extract data from websites. Python provides
several libraries to perform web scraping, including Requests and BeautifulSoup . Here
are the basic steps to perform web scraping using Python:
import requests
response = requests.get(url)
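The two lines above only fetch the page. A fuller sketch of the workflow described below might look like the following; the URL and the choice of <h2> tags are placeholder assumptions, not part of the original example.
import requests
from bs4 import BeautifulSoup

url = "https://example.com"   # placeholder URL, replace with the site you want to scrape
response = requests.get(url)

# Parse the HTML content of the response
soup = BeautifulSoup(response.text, "html.parser")

# Extract the data we want (here: the text of every <h2> heading) and store it
data = [tag.get_text(strip=True) for tag in soup.find_all("h2")]
print(data)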
In this example, we first import the requests and BeautifulSoup libraries. We then define the
URL of the website we want to scrape and use the requests library to send a GET request to the
URL. We then use the BeautifulSoup library to parse the HTML content of the response and
extract the data we want. Finally, we store the extracted data in the data variable.
Shree Mahaveerai Namah
Where To Go Now?
Now that you have even more information about your data and you have a visualization ready, it
does seem a bit like the data points sort of group together, but you also see there is quite some
overlap.
Do you think that, in a case where you knew that there are 10 possible digits labels to assign to
the data points, but you have no access to the labels, the observations would group or “cluster”
together by some criterion in such a way that you could infer the labels?
In general, when you have acquired a good understanding of your data, you have to decide on the
use cases that would be relevant to your data set. In other words, you think about what your data
set might teach you or what you think you can learn from your data.
From there on, you can think about what kind of algorithms you would be able to apply to your
data set in order to get the results that you think you can obtain.
Tip: the more familiar you are with your data, the easier it will be to assess the use cases for your specific data set. The same also holds for finding the appropriate machine learning algorithm.
However, when you're first getting started with scikit-learn, you'll see that the number of algorithms the library contains is pretty vast and that you might still want additional help when you're doing the assessment for your data set. That's why the scikit-learn machine learning map will come in handy.
Note that this map does require you to have some knowledge about the algorithms that are
included in the scikit-learn library. This, by the way, also holds some truth for taking this next
step in your project: if you have no idea what is possible, it will be very hard to decide on what
your use case will be for the data.
As your use case was one for clustering, you can follow the path on the map towards “KMeans”. You’ll see that the use case that you have just thought about requires you to have more than 50 samples (“check!”), to not have labeled data (“check!”), to know the number of categories that you want to predict (“check!”) and to have fewer than 10K samples (“check!”).
It is one of the simplest and most widely used unsupervised learning algorithms for solving clustering problems. The procedure follows a simple and easy way to classify a given data set through a certain number of clusters that you fix before you run the algorithm. This number of clusters is called k; it is not learned from the data but chosen by you up front.
Definition
Then, the k-means algorithm will find the nearest cluster center for each data point and assign the data point to that cluster.
Once all data points have been assigned to clusters, the cluster centers will be recomputed. In
other words, new cluster centers will emerge from the average of the values of the cluster data
points. This process is repeated until most data points stick to the same cluster. The cluster
membership should stabilize.
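As a rough illustration of this assign-and-recompute loop, here is a minimal NumPy sketch; the toy data, the choice of two clusters, and the fixed number of passes are assumptions made only for the example.
import numpy as np

def kmeans_step(X, centers):
    # Assign each point to its nearest center, then recompute every center
    # as the mean of the points assigned to it
    distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    new_centers = np.array([X[labels == j].mean(axis=0) for j in range(len(centers))])
    return labels, new_centers

# Toy 2-D data: two groups around (0, 0) and (5, 5)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])

# Start from two randomly chosen data points as the initial centers
centers = X[rng.choice(len(X), size=2, replace=False)]
for _ in range(10):   # repeat until the assignments stabilize (10 passes is plenty here)
    labels, centers = kmeans_step(X, centers)
print(centers)        # should end up near (0, 0) and (5, 5)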
You can already see that, because the k-means algorithm works the way it does, the initial set of cluster centers that you start from can have a big effect on the clusters that are eventually found.
You can, of course, deal with this effect, as you will see further on.
However, before you can go into making a model for your data, you should definitely take a look
into preparing your data for this purpose.
As you have read in the previous section, before modeling your data, you’ll do well by preparing
it first. This preparation step is called “preprocessing”.
Data Normalization
The first thing that we’re going to do is preprocess the data. You can standardize the data with the scale() function from sklearn.preprocessing:
# Import the modules and standardize the `digits` data
import pandas as pd
from sklearn import datasets
from sklearn.preprocessing import scale

digits = datasets.load_digits()
data = scale(digits.data)
print(data)
By scaling the data, you shift the distribution of each attribute to have a mean of zero and a
standard deviation of one (unit variance).
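A quick way to convince yourself of this is to scale a tiny, made-up array and check the column means and standard deviations afterwards:
import numpy as np
from sklearn.preprocessing import scale

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])
X_scaled = scale(X)

print(X_scaled.mean(axis=0))   # approximately [0. 0.]
print(X_scaled.std(axis=0))    # approximately [1. 1.]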
In order to assess your model’s performance later, you will also need to divide the data set into
two parts: a training set and a test set. The first is used to train the system, while the second is
used to evaluate the learned or trained system.
Training set denotes the subset of a dataset that is used for training the machine learning
model. Here, you are already aware of the output. A test set, on the other hand, is the subset of
the dataset that is used for testing the machine learning model. The ML model uses the test set
to predict outcomes.
Usually, the dataset is split into 70:30 ratio or 80:20 ratio. This means that you either take
70% or 80% of the data for training the model while leaving out the rest 30% or 20%. The
splitting process varies according to the shape and size of the dataset in question.
In practice, the division of your data set into a test set and a training set is disjoint: a common splitting choice is to take 2/3 of your original data set as the training set, while the remaining 1/3 composes the test set.
You will do something similar here. In the code chunk below, you see that the test_size argument of the train_test_split() method is set to 0.25, so a quarter of the data is held out for testing.
You’ll also note that the argument random_state has the value 42 assigned to it. With this
argument, you can guarantee that your split will always be the same. That is particularly handy if
you want reproducible results.
# Import `train_test_split` and split the (scaled) data, the labels and the images
import pandas as pd
from sklearn import datasets
from sklearn.preprocessing import scale
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
data = scale(digits.data)

X_train, X_test, y_train, y_test, images_train, images_test = train_test_split(
    data, digits.target, digits.images, test_size=0.25, random_state=42)

print(images_train)
After you have split up your data set into train and test sets, you can quickly inspect the numbers
before you go and model the data:
# Inspect the training and test sets before modeling
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.preprocessing import scale
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
data = scale(digits.data)

X_train, X_test, y_train, y_test, images_train, images_test = train_test_split(
    data, digits.target, digits.images, test_size=0.25, random_state=42)

# Number of training samples and features
n_samples, n_features = X_train.shape
print(n_samples)
print(n_features)

# Number of distinct labels in the training set
n_digits = len(np.unique(y_train))

# Inspect `y_train`
print(len(y_train))
#print(images_train)
You’ll see that the training set X_train now contains 1347 samples, which is roughly three quarters of the samples that the original data set contained, and 64 features, which hasn’t changed.
The y_train training set also contains three quarters of the labels of the original data set. This means that the test sets X_test and y_test contain 450 samples.
After all these preparation steps, you have made sure that all your known (training) data is
stored. No actual model or learning was performed up until this moment.
Now, it’s finally time to find those clusters of your training set. Use KMeans() from
the cluster module to set up your model. You’ll see that there are three arguments that are passed
to this method: init, n_clusters and the random_state.
You might still remember this last argument from before when you split the data into training
and test sets. This argument basically guaranteed that you got reproducible results.
# Import the `cluster` module
from sklearn import cluster

clf = cluster.KMeans(init='k-means++', n_clusters=10, random_state=42)
clf.fit(X_train)
The init indicates the method for initialization and even though it defaults to ‘k-means++’, you see it explicitly coming back in the code. That means that you can leave it out if you want.
Next, you also see that the n_clusters argument is set to 10. This number not only indicates the
number of clusters or groups you want your data to form, but also the number of centroids to
generate. Remember that a cluster centroid is the middle of a cluster.
Do you also still remember how the previous section described this as one of the possible
disadvantages of the K-Means algorithm?
That is, that the initial set of cluster centers that you give up can have a big effect on the clusters
that are eventually found?
Usually, you try to deal with this effect by trying several initial sets in multiple runs and by
selecting the set of clusters with the minimum sum of the squared errors (SSE). In other words,
you want to minimize the distance of each point in the cluster to the mean or centroid of that
cluster.
By adding the n_init argument to KMeans(), you can determine how many different centroid configurations the algorithm will try.
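A minimal sketch of that idea, assuming the X_train from earlier; the value n_init=10 is only an illustrative choice:
from sklearn import cluster

# Try 10 different centroid initializations and keep the run with the lowest
# inertia (the within-cluster sum of squared errors)
clf = cluster.KMeans(init='random', n_clusters=10, n_init=10, random_state=42)
clf.fit(X_train)
print(clf.inertia_)   # SSE of the best of the 10 runs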
Note again that you don’t want to insert the test labels when you fit the model to your data: these
will be used to see if your model is good at predicting the actual classes of your instances!
You can also visualize the images that make up the cluster centers as follows:
# Import matplotlib
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(8, 3))
# Add title
fig.suptitle('Cluster Center Images', fontsize=14, fontweight='bold')
for i in range(10):
    ax = fig.add_subplot(2, 5, 1 + i)
    # Display images: each cluster center is a 64-value vector reshaped to 8x8
    ax.imshow(clf.cluster_centers_[i].reshape((8, 8)), cmap=plt.cm.binary)
    plt.axis('off')
plt.show()
Code for file data_set_cluster4.py
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import cluster, datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale

digits = datasets.load_digits()
data = scale(digits.data)

# Import `train_test_split` and split the data, labels and images
X_train, X_test, y_train, y_test, images_train, images_test = train_test_split(
    data, digits.target, digits.images, test_size=0.25, random_state=42)

# Set up and fit the KMeans model
clf = cluster.KMeans(init='k-means++', n_clusters=10, random_state=42)
clf.fit(X_train)

# Add title
fig = plt.figure(figsize=(8, 3))
fig.suptitle('Cluster Center Images', fontsize=14, fontweight='bold')
for i in range(10):
    ax = fig.add_subplot(2, 5, 1 + i)
    # Display images
    ax.imshow(clf.cluster_centers_[i].reshape((8, 8)), cmap=plt.cm.binary)
    plt.axis('off')
# Show the plot
plt.show()
If you want to see another example that visualizes the data clusters and their centers, go here.
y_pred=clf.predict(X_test)
print(y_pred[:100])
print(y_test[:100])
print(clf.cluster_centers_.shape)
[5 0 0 4 9 8 2 2 2 9 6 3 7 1 7 6 0 4 2 9 7 9 0 4 2 5 0 9 5 0 7 0 8 7 7 5 0
7 4 5 5 0 6 0 5 3 0 1 5 2 2 6 0 2 5 1 0 1 1 3 1 7 2 9 7 0 4 1 4 2 0 0 2 7
4 1 3 2 0 0 0 1 9 2 2 1 5 7 7 0 3 8 3 0 0 8 0 1 7 7]
[6 9 3 7 2 1 5 2 5 2 1 9 4 0 4 2 3 7 8 8 4 3 9 7 5 6 3 5 6 3 4 9 1 4 4 6 9
4 7 6 6 9 1 3 6 1 3 0 6 5 5 1 9 5 6 0 9 0 0 1 0 4 5 2 4 5 7 0 7 5 9 5 5 4
7 0 4 5 5 9 9 0 2 3 8 0 6 4 4 9 1 2 8 3 5 2 9 0 4 4]
(10, 64)
In the code chunk above, you predict the values for the test set, which contains 450 samples. You
store the result in y_pred. You also print out the first 100 instances of y_pred and y_test and you
immediately see some results.
In addition, you can study the shape of the cluster centers: you immediately see that there are 10
clusters with each 64 features.
But this doesn’t tell you much because we set the number of clusters to 10 and you already knew
that there were 64 features.
# Import `Isomap()`
from sklearn.manifold import Isomap
import matplotlib.pyplot as plt

X_iso = Isomap(n_neighbors=10).fit_transform(X_train)
# Compute cluster centers and predict cluster index for each sample
clusters = clf.fit_predict(X_train)

# Plot the predicted clusters next to the actual training labels
fig, ax = plt.subplots(1, 2, figsize=(8, 4))
ax[0].scatter(X_iso[:, 0], X_iso[:, 1], c=clusters)
ax[1].scatter(X_iso[:, 0], X_iso[:, 1], c=y_train)
# Adjust layout
fig.suptitle('Predicted Versus Training Labels', fontsize=14, fontweight='bold')
fig.subplots_adjust(top=0.85)
plt.show()
You use Isomap() as a way to reduce the dimensions of your high-dimensional data set digits.
The difference with the PCA method is that the Isomap is a non-linear reduction method.
# Import `PCA()`
from sklearn.decomposition import PCA

X_pca = PCA(n_components=2).fit_transform(X_train)
# Compute cluster centers and predict cluster index for each sample
clusters = clf.fit_predict(X_train)
# Plot the predicted clusters next to the actual training labels
fig, ax = plt.subplots(1, 2, figsize=(8, 4))
ax[0].scatter(X_pca[:, 0], X_pca[:, 1], c=clusters)
ax[1].scatter(X_pca[:, 0], X_pca[:, 1], c=y_train)
fig.subplots_adjust(top=0.85)
plt.show()
from sklearn.datasets import fetch_openml

mnist = fetch_openml('mnist_784', version=1)
mnist.keys()
X,y=mnist['data'],mnist['target']
X.shape
(70000, 784)
y.shape
(70000,)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml

mnist = fetch_openml('mnist_784', version=1)
#mnist.keys()
X,y=mnist['data'],mnist['target']
some_digit = X.loc[4]
some_digit_array=some_digit.to_numpy()
some_digit_image = some_digit_array.reshape(28,28)
plt.imshow(some_digit_image)
plt.axis("off")
plt.show()
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

mnist = fetch_openml('mnist_784', version=1)
X, y = mnist['data'], mnist['target']
# Conventional MNIST split: the first 60,000 images form the training set
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
# Binary target: "is this digit a 9?" (the labels are strings, hence '9')
y_train_9 = (y_train == '9')
y_test_9 = (y_test == '9')

sgd_clf = SGDClassifier(max_iter=1000, random_state=42)
sgd_clf.fit(X_train, y_train_9)
# Cross-validated predictions on the training set
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_9, cv=3)

confusion_matrix(y_train_9, y_train_pred)
# array([[52482, 1569],
precision_score(y_train_9, y_train_pred)   # 0.7270354906054279
recall_score(y_train_9, y_train_pred)      # 0.702471003530005
f1_score(y_train_9, y_train_pred)
# 0.7145421903052065
The F1 score favors classifiers that have similar precision and recall. This is not always what you want: in some contexts you mostly care about precision, and in other contexts you really care about recall. For example, if you trained a classifier to detect videos that are safe for kids, you would probably prefer a classifier that rejects many good videos (low recall) but keeps only safe ones (high precision), rather than a classifier that has a much higher recall but lets a few really bad videos show up in your product. On the other hand, suppose you train a classifier to detect shoplifters on CCTV images: it is probably fine if your classifier has only 30% precision as long as it has 99% recall (sure, the security guards will get a few false alerts, but almost all shoplifters will get caught).
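One way to act on this trade-off is to threshold the classifier's decision scores yourself instead of calling predict(). The sketch below assumes the sgd_clf and y_train_9 defined earlier; the cv=3 and the threshold of 3000 are illustrative values, not taken from the text.
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_score, recall_score

# Decision scores instead of hard class predictions
y_scores = cross_val_predict(sgd_clf, X_train, y_train_9, cv=3,
                             method='decision_function')

# Raising the threshold above 0 trades recall for precision
threshold = 3000
y_pred_high_precision = (y_scores > threshold)
print(precision_score(y_train_9, y_pred_high_precision))
print(recall_score(y_train_9, y_pred_high_precision))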
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

mnist = fetch_openml('mnist_784', version=1)
X, y = mnist['data'], mnist['target']
# Conventional MNIST split: the first 60,000 images form the training set
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
y_train_9 = (y_train == '9')
y_test_9 = (y_test == '9')

sgd_clf = SGDClassifier(max_iter=1000, random_state=42)
sgd_clf.fit(X_train, y_train_9)
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_9, cv=3)
plt.show()
"""
confusion_matrix(y_train_9, y_train_pred)
array([[52482, 1569],
precision_score(y_train_9, y_train_pred)  -> 0.7270354906054279
recall_score(y_train_9, y_train_pred)     -> 0.702471003530005
f1_score(y_train_9, y_train_pred)
0.7145421903052065"""
A flexible classification technique that shares close ties with the SGD
Regressor is the SGD Classifier. It works by progressively changing model
parameters along the direction of steepest descent of the loss function (the
negative gradient). Its capacity to update these parameters with a randomly
chosen subset of the training data for every iteration is what distinguishes
it as "stochastic". The SGD
Classifier is a useful tool because of its versatility, especially in situations
where real-time learning is required and big datasets are involved. We will
examine the fundamental ideas of the SGD Classifier in this post, dissecting
its key variables and hyperparameters. We will also discuss any potential
drawbacks and examine its benefits, such as scalability and efficiency. You
will have a thorough grasp of the SGD Classifier and its crucial role in the
field of data-driven decision-making by the time this journey is over.
Stochastic Gradient Descent
One popular optimization method in deep learning and machine learning
is stochastic gradient descent (SGD). Large datasets and complicated
models benefit greatly from its training. To minimize a loss function, SGD
updates model parameters iteratively. It differentiates itself as "stochastic" by
employing mini-batches, or random subsets, of the training data in each
iteration, which introduces a degree of randomness while maximizing
computational efficiency. By accelerating convergence, this randomness can
aid in escaping local minima. Modern machine learning algorithms rely
heavily on SGD because, despite its simplicity, it may be quite effective
when combined with regularization strategies and suitable learning rate
schedules.
How Stochastic Gradient Descent Works?
Here's how the SGD process typically works:
• Initialize the model parameters randomly or with some default values.
• Randomly shuffle the training data.
• For each training example: Compute the gradient of the cost function with
respect to the current model parameters using the current example.
• Update the model parameters in the direction of the negative gradient by
a small step size known as the learning rate.
• Repeat this process for a specified number of iterations (epochs).
Stochastic Gradient Descent Algorithm
For machine learning model training, initializing model parameters (θ) and
selecting a low learning rate (α) are the first steps in performing stochastic
gradient descent (SGD). Next, to add unpredictability, the training data is
jumbled at random. Every time around, the algorithm analyzes a single
training sample and determines the cost function's gradient (J) in relation to
the model's parameters. The size and direction of the steepest slope are
represented by this gradient. The model is adjusted to minimize the cost
function and provide predictions that are more accurate by updating θ in the
gradient's opposite direction. The model can efficiently learn from and adjust
to new information by going through these iterative processes for every data
point.
The cost function, J(θ), is typically a function of the difference between the predicted value h_θ(x) and the actual target y. In regression problems, it is often the mean squared error; in classification problems, it can be the cross-entropy loss, for example.
For Regression (Mean Squared Error):
Cost Function:
J(θ) = (1 / 2m) * Σ_{i=1..m} (h_θ(x^(i)) - y^(i))²
Gradient (Partial Derivatives):
∂J(θ)/∂θ_j = (1 / m) * Σ_{i=1..m} (h_θ(x^(i)) - y^(i)) * x_j^(i),   for j = 0, 1, …, n
Update Parameters
Update the model parameters (θ) based on the gradient and the learning
rate:
θ = θ - α * ∇J(θ)
where,
• θ: Updated model parameters.
• α: Learning rate.
• ∇J(θ): Gradient of the cost function with respect to θ.
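A minimal from-scratch sketch of this update rule for a linear regression problem is shown below; the toy data, the learning rate, and the number of epochs are arbitrary choices made for the example.
import numpy as np

# Toy data for y ≈ 1 + 2x, with a column of ones for the intercept term
rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.uniform(0, 1, 100)]
y = X @ np.array([1.0, 2.0]) + rng.normal(0, 0.1, 100)

theta = np.zeros(2)   # initialize the model parameters
alpha = 0.1           # learning rate

for epoch in range(50):
    for i in rng.permutation(len(X)):              # shuffle the data each epoch
        gradient = (X[i] @ theta - y[i]) * X[i]    # gradient of the per-sample squared error
        theta = theta - alpha * gradient           # update: theta = theta - alpha * grad
print(theta)   # should end up close to [1, 2]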
What is the SGD Classifier?
The SGD Classifier is a linear classification algorithm that aims to find the
optimal decision boundary (a hyperplane) to separate data points belonging
to different classes in a feature space. It operates by iteratively adjusting the
model's parameters to minimize a cost function, often the cross-entropy loss,
using the stochastic gradient descent optimization technique.
How it Differs from Other Classifiers:
The SGD Classifier differs from other classifiers in several ways:
• Stochastic Gradient Descent: Unlike some classifiers that use closed-
form solutions or batch gradient descent (which processes the entire
training dataset in each iteration), the SGD Classifier uses stochastic
gradient descent. It updates the model's parameters incrementally,
processing one training example at a time or in small mini-batches. This
makes it computationally efficient and well-suited for large datasets.
• Linearity: The SGD Classifier is a linear classifier, meaning it constructs
a linear decision boundary to separate classes. This makes it suitable for
problems where the relationship between features and the target variable
is approximately linear. In contrast, algorithms like decision trees or
support vector machines can capture more complex decision boundaries.
• Regularization: The SGD Classifier allows for the incorporation of L1 or
L2 regularization to prevent overfitting. Regularization terms are added to
the cost function, encouraging the model to have smaller parameter
values. This is particularly useful when dealing with high-dimensional
data.
Common Use Cases in Machine Learning
The SGD Classifier is commonly used in various machine learning tasks and
scenarios:
1. Text Classification: It's often used for tasks like sentiment analysis,
spam detection, and text categorization. Text data is typically high-
dimensional, and the SGD Classifier can efficiently handle large feature
spaces.
2. Large Datasets: When working with extensive datasets, the SGD
Classifier's stochastic nature is advantageous. It allows you to train on
large datasets without the need to load the entire dataset into memory,
making it memory-efficient.
3. Online Learning: In scenarios where data streams in real-time, such as
clickstream analysis or fraud detection, the SGD Classifier is well-suited
for online learning. It can continuously adapt to changing data patterns.
4. Multi-class Classification: The SGD Classifier can be used for multi-
class classification tasks by extending the binary classification approach
to handle multiple classes, often using the one-vs-all (OvA) strategy.
5. Parameter Tuning: The SGD Classifier is a versatile algorithm that can
be fine-tuned with various hyperparameters, including the learning rate,
regularization strength, and the type of loss function. This flexibility allows
it to adapt to different problem domains.
Parameters of Stochastic Gradient Descent Classifier
Stochastic Gradient Descent (SGD) Classifier is a versatile algorithm with
various parameters and concepts that can significantly impact its
performance. Here's a detailed explanation of some of the key parameters
and concepts relevant to the SGD Classifier:
1. Learning Rate (α):
• The learning rate (α) is a crucial hyperparameter that determines the size
of the steps taken during parameter updates in each iteration.
• It controls the trade-off between convergence speed and stability.
• A larger learning rate can lead to faster convergence but may result in
overshooting the optimal solution.
• In contrast, a smaller learning rate may lead to slower convergence but
with more stable updates.
• It's important to choose an appropriate learning rate for your specific
problem.
2. Batch Size:
The batch size defines the number of training examples used in each
iteration or mini-batch when updating the model parameters. There are three
common choices for batch size:
• Stochastic Gradient Descent (batch size = 1): In this case, the model
parameters are updated after processing each training example. This
introduces significant randomness and can help escape local minima but
may result in noisy updates.
• Mini-Batch Gradient Descent (1 < batch size < number of training
examples): Mini-batch SGD strikes a balance between the efficiency of
batch gradient descent and the noise of stochastic gradient descent. It's
the most commonly used variant.
• Batch Gradient Descent (batch size = number of training
examples): In this case, the model parameters are updated using the
entire training dataset in each iteration. While this can lead to more stable
updates, it is computationally expensive, especially for large datasets.
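The mini-batch variant described above maps onto scikit-learn's partial_fit method, which updates the model one batch at a time instead of refitting on the whole dataset. The sketch below uses the Iris data and a batch size of 16 purely as an illustration.
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
clf = SGDClassifier(random_state=42)

# Feed the data in shuffled mini-batches of 16; the full set of classes
# must be declared on the first partial_fit call
rng = np.random.default_rng(42)
order = rng.permutation(len(X))
for start in range(0, len(X), 16):
    batch = order[start:start + 16]
    clf.partial_fit(X[batch], y[batch], classes=np.unique(y))
print(clf.score(X, y))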
3. Convergence Criteria:
Convergence criteria are used to determine when the optimization process
should stop. Common convergence criteria include:
• Fixed Number of Epochs: You can set a predefined number of epochs,
and the algorithm stops after completing that many iterations through the
dataset.
• Tolerance on the Change in the Cost Function: Stop when the change
in the cost function between consecutive iterations becomes smaller than
a specified threshold.
• Validation Set Performance: You can monitor the performance of the
model on a separate validation set and stop training when it reaches a
satisfactory level of performance.
4. Regularization (L1 and L2):
• Regularization is a technique used to prevent overfitting.
• The SGD Classifier allows you to incorporate L1 (Lasso) and L2 (Ridge)
regularization terms into the cost function.
• These terms add a penalty based on the magnitude of the model
parameters, encouraging them to be small.
• The regularization strength hyperparameter controls the impact of
regularization on the optimization process.
5. Loss Function:
• The choice of the loss function determines how the classifier measures
the error between predicted and actual class labels.
• For binary classification, the cross-entropy loss is commonly used, while
for multi-class problems, the categorical cross-entropy or softmax loss is
typical.
• The choice of the loss function should align with the problem and the
activation function used.
6. Momentum and Adaptive Learning Rates:
To enhance convergence and avoid oscillations, you can use momentum
techniques or adaptive learning rates. Momentum introduces an additional
parameter that smooths the updates and helps the algorithm escape local
minima. Adaptive learning rate methods automatically adjust the learning
rate during training based on the observed progress.
7. Early Stopping:
Early stopping is a technique used to prevent overfitting. It involves
monitoring the model's performance on a validation set during training and
stopping the optimization process when the performance starts to degrade,
indicating overfitting.
Python Code using SGD to classify the famous Iris Dataset
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

data = load_iris()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
"""This code loads the Iris dataset, which is made up of target labels in y and
features in X. The data is then split 80-20 for training and testing purposes,
with a reproducible random seed of 42. This yields training and testing sets
for both features and labels."""
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(loss='log_loss', penalty='l2', alpha=0.01, max_iter=1000, random_state=42)
clf.fit(X_train, y_train)
"""An SGD Classifier (clf) is instantiated for classification tasks in this code.
Because the classifier is configured to use the log loss (logistic loss)
function, it can be used for both binary and multiclass classification.
Furthermore, to help avoid overfitting, L2 regularization is used with an alpha
parameter of 0.01. To guarantee consistency of results, a random seed of 42
is chosen, and the classifier is programmed to run up to 1000 iterations
during training."""
# Make predictions on the test data
y_pred = clf.predict(X_test)
print(y_pred)
Using the training data (X_train and y_train), these lines of code train the
SGD Classifier (clf). Following training, the model is applied to generate
predictions on the test data (X_test), which are then saved in the y_pred
variable for a future analysis.
"""These lines of code compare the predicted labels (y_pred) with the actual
labels of the test data (y_test) to determine the classification accuracy."""
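Those lines are not reproduced above; a minimal version, assuming scikit-learn's accuracy_score, would be:
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)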
This gives 90% accuracy.
If we increase the test data size to 30%, the accuracy comes to 96%, as shown here (test data size 30%):
y_pred = clf.predict(X_test)
print(y_pred)
A confusion matrix is a table that is often used to describe the performance of a classification
model on a set of data for which the true values are known. It is a 2x2 matrix for binary
classification (though it can be expanded for multi-class problems). The four outcomes are:
• True Positives (TP): The cases in which the model predicted yes (or the positive class),
and the truth is also yes.
• True Negatives (TN): The cases in which the model predicted no (or the negative class),
and the truth is also no.
• False Positives (FP), Type I error: The cases in which the model predicted yes, but the
truth is no.
• False Negatives (FN), Type II error: The cases in which the model predicted no, but the
truth is yes.
Precision: Precision is the ratio of correctly predicted positive observations to the total predicted positive observations.
Formula: Precision = TP / (TP + FP)
Recall (Sensitivity): Recall is the ratio of correctly predicted positive observations to all observations in the actual positive class.
Formula: Recall = TP / (TP + FN)
F1-Score: The F1 Score is the harmonic mean of Precision and Recall. It tries to find the balance between precision and recall.
Formula: F1 = 2 * (Precision * Recall) / (Precision + Recall)
Accuracy: Accuracy is the most intuitive performance measure. It is simply the ratio of correctly predicted observations to the total observations.
Formula: Accuracy = (TP + TN) / (TP + TN + FP + FN)
Type I error (False Positive rate): This is the situation where you reject the null hypothesis when it is actually true. In terms of the confusion matrix, it's when you wrongly predict the positive class.
Formula: Type I error = FP / (FP + TN)
Type II error (False Negative rate): This is the situation where you fail to reject the null hypothesis when it is actually false. In terms of the confusion matrix, it's when you wrongly predict the negative class.
Formula: Type II error = FN / (FN + TP)
ROC-AUC Curve:
Receiver Operating Characteristic (ROC) is a probability curve that plots the true positive rate
(sensitivity or recall) against the false positive rate (1 — specificity) at various threshold settings.
Area Under the Curve (AUC) is the area under the ROC curve. If the AUC is high (close to 1), the model is better at distinguishing between positive and negative classes. An AUC of 0.5 represents a model with no discriminative ability, equivalent to random guessing.
Reference: https://shivang-ahd.medium.com/all-about-confusion-matrix-preparing-for-interview-questions-fddea115a7ee
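As a small, self-contained illustration of these definitions, the snippet below computes the confusion matrix and the derived metrics for a made-up set of ten binary labels; the numbers are invented for the example.
from sklearn.metrics import (confusion_matrix, precision_score, recall_score,
                             f1_score, accuracy_score, roc_auc_score)

# Hypothetical true labels, hard predictions, and predicted probabilities
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_hat  = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3, 0.95, 0.05]

print(confusion_matrix(y_true, y_hat))   # rows: actual class, columns: predicted class
print(precision_score(y_true, y_hat))    # TP / (TP + FP)
print(recall_score(y_true, y_hat))       # TP / (TP + FN)
print(f1_score(y_true, y_hat))
print(accuracy_score(y_true, y_hat))
print(roc_auc_score(y_true, y_prob))     # uses the scores, not the hard labels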
Evaluation of Your Clustering Model
And this need for further investigation brings you to the next essential step, which is the
evaluation of your model’s performance. In other words, you want to analyze the degree of
correctness of the model’s predictions.
print(metrics.confusion_matrix(y_test, y_pred))
#To predict the labels of the test set: Using Isomap which is non-linear
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import cluster, datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale

digits = datasets.load_digits()
data = scale(digits.data)

# Import `train_test_split` and split the data, labels and images
X_train, X_test, y_train, y_test, images_train, images_test = train_test_split(
    data, digits.target, digits.images, test_size=0.25, random_state=42)
#print(y_test)

clf = cluster.KMeans(init='k-means++', n_clusters=10, random_state=42)
clf.fit(X_train)

y_pred = clf.predict(X_test)
print(metrics.confusion_matrix(y_test, y_pred))
plt.show()
Output of file data_set_cluster10.py
[[ 0 43 0 0 0 0 0 0 0 0]
[ 0 0 0 11 0 0 19 0 7 0]
[ 0 0 2 1 0 1 5 0 9 20]
[34 0 6 0 3 0 0 0 1 2]
[ 0 0 0 2 0 0 1 52 0 0]
[12 0 40 0 0 1 0 0 0 6]
[ 0 1 0 0 0 44 0 0 0 0]
[ 0 0 1 0 35 0 3 0 0 2]
[ 2 0 23 1 0 0 9 0 0 3]
[38 1 4 3 2 0 0 0 0 0]]
At first sight, the results seem to confirm our first thoughts that you gathered from the
visualizations. Only the digit 5 was classified correctly in 41 cases. Also, the digit 8 was
classified correctly in 11 instances. But this is not really a success.
You might need to know a bit more about the results than just the confusion matrix.
Let’s try to figure out something more about the quality of the clusters by applying different
cluster quality metrics. That way, you can judge the goodness of fit of the cluster labels to the
correct labels.
from sklearn.metrics import homogeneity_score, completeness_score, v_measure_score, \
    adjusted_rand_score, adjusted_mutual_info_score, silhouette_score

print('inertia    homo    compl   v-meas   ARI     AMI     silhouette')
print('%i    %.3f    %.3f    %.3f    %.3f    %.3f    %.3f'
      % (clf.inertia_,
         homogeneity_score(y_test, y_pred),
         completeness_score(y_test, y_pred),
         v_measure_score(y_test, y_pred),
         adjusted_rand_score(y_test, y_pred),
         adjusted_mutual_info_score(y_test, y_pred),
         silhouette_score(X_test, y_pred, metric='euclidean')))
• The homogeneity score tells you to what extent all of the clusters contain only data points
which are members of a single class.
• The completeness score measures the extent to which all of the data points that are
members of a given class are also elements of the same cluster.
• The V-measure score is the harmonic mean between homogeneity and completeness.
• The adjusted Rand score measures the similarity between two clusterings and considers
all pairs of samples and counting pairs that are assigned in the same or different clusters
in the predicted and true clusterings.
• The Adjusted Mutual Info (AMI) score is used to compare clusters. It measures the
similarity between the data points that are in the clusterings, accounting for chance
groupings and takes a maximum value of 1 when clusterings are equivalent.
• The silhouette score measures how similar an object is to its own cluster compared to
other clusters. The silhouette score ranges from -1 to 1, where a higher value indicates
that the object is better matched to its own cluster and worse matched to neighboring
clusters. If many points have a high value, the clustering configuration is good.
You clearly see that these scores aren’t fantastic: for example, you see that the value for the
silhouette score is close to 0, which indicates that the sample is on or very close to the decision
boundary between two neighboring clusters. This could indicate that the samples could have
been assigned to the wrong cluster.
Also the ARI measure seems to indicate that not all data points in a given cluster are similar and
the completeness score tells you that there are definitely data points that weren’t put in the right
cluster.
Clearly, you should consider another estimator to predict the labels for the digits data.
Shree Mahaveerai Namah
Trying Out Another Model: Support Vector Machines
When you recapped all of the information that you gathered out of the data exploration, you saw
that you could build a model to predict which group a digit belongs to without you knowing the
labels. And indeed, you just used the training data and not the target values to build your
KMeans model.
Let’s assume that you depart from the case where you use both the digits training data and the
corresponding target values to build your model.
If you follow the algorithm map, you’ll see that the first model that you meet is the linear SVC (Support Vector Classifier). Let’s apply this now to the digits data:
# Import `train_test_split` and the `svm` model, then split the data
from sklearn import svm
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test, images_train, images_test = train_test_split(
    digits.data, digits.target, digits.images, test_size=0.25, random_state=42)

# Create the SVC model, setting the value of `gamma` manually, and fit the training data
svc_model = svm.SVC(gamma=0.001, C=100., kernel='linear')
svc_model.fit(X_train, y_train)
You see here that you make use of X_train and y_train to fit the data to the SVC model. This is
clearly different from clustering. Note also that in this example, you set the value
of gamma manually. It is possible to automatically find good values for the parameters by using
tools such as grid search and cross validation.
Even though this is not the focus of this topic, you will see how you could have gone about it if you had made use of grid search to adjust your parameters. You would have defined the parameter candidates like this:
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV

# Set the parameter candidates
parameter_candidates = [
  {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
  {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]
Next, you use the classifier together with the parameter candidates that you have just created and apply it to the second part of your data set. You also train a new classifier using the best parameters found by the grid search, and you score the result to see if the best parameters that were found in the grid search are actually working.
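The creation and fitting of that grid-search classifier is not shown in the fragment above; a minimal sketch following the scikit-learn API would be:
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn import svm

# Create a classifier object with the parameter candidates and fit the training data
clf = GridSearchCV(estimator=svm.SVC(), param_grid=parameter_candidates, n_jobs=-1)
clf.fit(X_train, y_train)

# Print out the results
print('Best score for training data:', clf.best_score_)
print('Best C:', clf.best_estimator_.C)
print('Best kernel:', clf.best_estimator_.kernel)
print('Best gamma:', clf.best_estimator_.gamma)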
# Apply the classifier to the test data, and view the accuracy score
clf.score(X_test, y_test)

# Train a new classifier using the best parameters found by the grid search
svm.SVC(C=clf.best_estimator_.C, kernel=clf.best_estimator_.kernel,
        gamma=clf.best_estimator_.gamma).fit(X_train, y_train).score(X_test, y_test)
Now what does this new knowledge tell you about the SVC classifier that you had modeled
before you had done the grid search?
You see that in the SVM classifier, the penalty parameter C of the error term is specified at 100.
Lastly, you see that the kernel has been explicitly specified as a linear one. The kernel argument
specifies the kernel type that you’re going to use in the algorithm and by default, this is rbf. In
other cases, you can specify others such as linear, poly, …
A kernel is a similarity function, which is used to compute similarity between the training data
points. When you provide a kernel to an algorithm, together with the training data and the labels,
you will get a classifier, as is the case here. You will have trained a model that assigns new
unseen objects into a particular category. For the SVM, you will typically try to linearly divide
your data points.
However, the grid search tells you that an rbf kernel would’ve worked better. The penalty
parameter and the gamma were specified correctly.
For now, let’s just say you just continue with a linear kernel and predict the values for the test
set:
print(svc_model.predict(X_test))
print(y_test)
You can also visualize the images and their predicted labels:
# Import matplotlib
import matplotlib.pyplot as plt

predicted = svc_model.predict(X_test)
# Zip the test images together with the predicted values and take the first 4 elements
images_and_predictions = list(zip(images_test, predicted))
for index, (image, prediction) in enumerate(images_and_predictions[:4]):
    plt.subplot(1, 4, index + 1)
    plt.axis('off')
    plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    plt.title('Predicted: ' + str(prediction))
plt.show()
This plot is very similar to the plot that you made when you were exploring the data:
Above discussion is summarized in the following program for support vector machine.
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale

digits = datasets.load_digits()
data = scale(digits.data)

# Import `train_test_split` and split the data, labels and images
X_train, X_test, y_train, y_test, images_train, images_test = train_test_split(
    data, digits.target, digits.images, test_size=0.25, random_state=42)

# Import the `svm` model, create the classifier and fit the training data
svc_model = svm.SVC(gamma=0.001, C=100., kernel='linear')
svc_model.fit(X_train, y_train)

print(y_test)
predicted = svc_model.predict(X_test)
print(predicted)

images_and_predictions = list(zip(images_test, predicted))
for index, (image, prediction) in enumerate(images_and_predictions[:4]):
    plt.subplot(1, 4, index + 1)
    plt.axis('off')
    plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    plt.title('Predicted: ' + str(prediction))
# Display images in all subplots in the grid
plt.show()
[6 9 3 7 2 1 5 2 5 2 1 9 4 0 4 2 3 7 8 8 4 3 9 7 5 6 3 5 6 3 4 9 1 4 4 6 9
4 7 6 6 9 1 3 6 1 3 0 6 5 5 1 9 5 6 0 9 0 0 1 0 4 5 2 4 5 7 0 7 5 9 5 5 4
7 0 4 5 5 9 9 0 2 3 8 0 6 4 4 9 1 2 8 3 5 2 9 0 4 4 4 3 5 3 1 3 5 9 4 2 7
7 4 4 1 9 2 7 8 7 2 6 9 4 0 7 2 7 5 8 7 5 7 7 0 6 6 4 2 8 0 9 4 6 9 9 6 9
0 3 5 6 6 0 6 4 3 9 3 9 7 2 9 0 4 5 3 6 5 9 9 8 4 2 1 3 7 7 2 2 3 9 8 0 3
2 2 5 6 9 9 4 1 5 4 2 3 6 4 8 5 9 5 7 8 9 4 8 1 5 4 4 9 6 1 8 6 0 4 5 2 7
4 6 4 5 6 0 3 2 3 6 7 1 5 1 4 7 6 8 8 5 5 1 6 2 8 8 9 9 7 6 2 2 2 3 4 8 8
3 6 0 9 7 7 0 1 0 4 5 1 5 3 6 0 4 1 0 0 3 6 5 9 7 3 5 5 9 9 8 5 3 3 2 0 5
8 3 4 0 2 4 6 4 3 4 5 0 5 2 1 3 1 4 1 1 7 0 1 5 2 1 2 8 7 0 6 4 8 8 5 1 8
4 5 8 7 9 8 5 0 6 2 0 7 9 8 9 5 2 7 7 1 8 7 4 3 8 3 5 6 0 0 3 0 5 0 0 4 1
2 8 4 5 9 6 3 1 8 8 4 2 3 8 9 8 8 5 0 6 3 3 7 1 6 4 1 2 1 1 6 4 7 4 8 3 4
0 5 1 9 4 5 7 6 3 7 0 5 9 7 5 9 7 4 2 1 9 0 7 5 3 3 6 3 9 6 9 5 0 1 5 5 8
3 3 6 2 6 5]
[6 9 3 7 2 1 5 2 5 2 1 9 4 0 4 2 3 7 8 8 4 3 9 7 5 6 3 5 6 3 4 9 1 4 4 6 9
4 7 6 6 9 1 3 6 1 3 0 6 5 5 1 3 5 6 0 9 0 0 1 0 4 5 2 4 5 7 0 7 5 9 3 5 4
7 0 4 5 5 9 9 0 2 3 8 0 6 4 4 9 1 2 8 3 5 2 9 0 4 4 4 3 5 3 1 3 5 9 4 2 7
7 4 4 1 9 2 7 8 7 2 6 9 4 0 7 2 7 5 8 7 5 7 9 0 6 6 4 2 8 0 9 4 6 9 9 6 9
0 5 5 6 6 0 6 4 3 9 3 7 7 2 9 0 4 5 3 6 5 9 9 8 4 2 1 3 7 7 2 2 3 9 8 0 3
2 2 5 6 9 9 4 1 5 4 2 3 6 4 8 5 9 5 7 8 9 4 8 1 5 4 4 9 6 1 8 6 0 4 5 2 7
4 6 4 5 6 0 3 2 3 6 7 1 5 1 4 7 6 5 1 5 5 1 0 2 8 8 9 9 7 6 2 2 2 3 4 8 8
3 6 0 9 7 7 0 1 0 4 5 1 5 3 6 0 4 1 0 0 3 6 5 9 7 3 5 5 9 9 8 5 3 3 2 0 5
8 3 4 0 2 4 6 4 3 4 5 0 5 2 1 3 1 4 1 1 7 0 1 5 2 1 2 8 7 0 6 4 8 8 5 1 8
4 5 8 7 9 8 6 0 6 2 0 7 9 8 9 5 2 7 7 1 8 7 4 3 8 3 5 6 0 0 3 0 5 0 0 4 1
2 8 4 5 9 6 3 1 8 8 4 2 3 8 9 8 8 5 0 6 3 3 7 1 6 4 1 2 1 1 6 4 7 4 8 3 4
0 5 1 9 4 5 7 6 3 7 0 5 9 7 5 9 7 4 2 1 9 0 7 5 2 3 6 3 9 6 9 5 0 1 5 5 8
3 3 6 2 6 5]
# Import `metrics`
from sklearn import metrics

print(metrics.classification_report(y_test, predicted))
print(metrics.confusion_matrix(y_test, predicted))
Only this time, you zip together the images and the predicted values and you only take the first 4
elements of images_and_predictions.
But now the biggest question: how does this model perform?
# Import `metrics`
print(metrics.classification_report(y_test, predicted))
print(metrics.confusion_matrix(y_test, predicted))
You clearly see that this model performs a whole lot better than the clustering model that you
used earlier.
You can also see it when you visualize the predicted and the actual labels with the help
of Isomap():
# Import `Isomap()`
from sklearn.manifold import Isomap

X_iso = Isomap(n_neighbors=10).fit_transform(X_train)
# Predict the labels for the training data with the fitted SVC model
predicted = svc_model.predict(X_train)

# Plot the predicted labels next to the actual labels
fig, ax = plt.subplots(1, 2, figsize=(8, 4))
fig.subplots_adjust(top=0.85)
ax[0].scatter(X_iso[:, 0], X_iso[:, 1], c=predicted)
ax[0].set_title('Predicted labels')
ax[1].scatter(X_iso[:, 0], X_iso[:, 1], c=y_train)
ax[1].set_title('Actual Labels')
# Add title
fig.suptitle('Predicted versus actual labels', fontsize=14, fontweight='bold')
plt.show()
This will give you the following scatterplots:
You’ll see that this visualization confirms your classification report, which is very good news. :)
What's Next?
Congratulations, you have reached the end of this scikit-learn tutorial, which was meant to
introduce you to Python machine learning! Now it's your turn.
Firstly, make sure you get a hold of DataCamp's scikit-learn cheat sheet.
Next, start your own digit recognition project with different data. One dataset that you can
already use is the MNIST data, which you can download here.
The steps that you can take are very similar to the ones that you have gone through with this
tutorial, but if you still feel that you can use some help, you should check out this page, which
works with the MNIST data and applies the KMeans algorithm.
Working with the digits dataset was the first step in classifying characters with scikit-learn. If
you’re done with this, you might consider trying out an even more challenging problem, namely,
classifying alphanumeric characters in natural images.
A well-known dataset that you can use for this problem is the Chars74K dataset, which contains
more than 74,000 images of digits from 0 to 9 and both the lowercase and uppercase letters of the English alphabet. You can download the dataset here.
Whether you're going to start with the projects that have been mentioned above or not, this is
definitely not the end of your journey of data science with Python. If you choose not to widen
your view just yet, consider deepening your data visualization and data manipulation knowledge.
Don't miss out on our Interactive Data Visualization with Bokeh course to make sure you can
impress your peers with a stunning data science portfolio or our pandas Foundation course, to
learn more about working with data frames in Python.
Decision Tree
A decision tree is one of the most powerful tools of supervised learning algorithms used for
both classification and regression tasks. It builds a flowchart-like tree structure where each
internal node denotes a test on an attribute, each branch represents an outcome of the test, and
each leaf node (terminal node) holds a class label (Result of algorithm). It is constructed by
recursively splitting the training data into subsets based on the values of the attributes until a
stopping criterion is met, such as the maximum depth of the tree or the minimum number of
samples required to split a node.
During training, the Decision Tree algorithm selects the best attribute to split the data based on
a metric such as entropy or Gini impurity, which measures the level of impurity or randomness
in the subsets. The goal is to find the attribute that maximizes the information gain or the
reduction in impurity after the split.
What is a Decision Tree?
A decision tree is a flowchart-like tree structure where each internal node denotes the feature,
branches denote the rules and the leaf nodes denote the result of the algorithm. It is a
versatile supervised machine-learning algorithm, which is used for both classification and
regression problems. It is one of the very powerful algorithms. And it is also used in Random
Forest to train on different subsets of training data, which makes random forest one of the most
powerful algorithms in machine learning.
➢ Root Node: It is the topmost node in the tree, which represents the complete dataset. It
is the starting point of the decision-making process.
➢ Leaf/Terminal Node: A node without any child nodes that indicates a class label or a
numerical value.
➢ Splitting: The process of splitting a node into two or more sub-nodes using a split
criterion and a selected feature.
➢ Branch/Sub-Tree: A subsection of the decision tree starts at an internal node and ends
at the leaf nodes.
➢ Parent Node: The node that divides into one or more child nodes.
➢ Child Node: The nodes that emerge when a parent node is split.
➢ Variance: Variance measures how much the predicted and the target variables vary in
different samples of a dataset. It is used for regression problems in decision
trees. Mean squared error, Mean Absolute Error, friedman_mse, or Half Poisson
deviance are used to measure the variance for the regression tasks in the decision tree.
➢ Pruning: The process of removing branches from the tree that do not provide any
additional information or lead to overfitting.
Important points related to Entropy:
1. The entropy is 0 when the dataset is completely homogeneous, meaning that each
instance belongs to the same class. It is the lowest entropy indicating no uncertainty in
the dataset sample.
2. When the dataset is equally divided between multiple classes, the entropy is at its
maximum value. Therefore, entropy is highest when the distribution of class labels is
even, indicating maximum uncertainty in the dataset sample.
3. Entropy is used to evaluate the quality of a split. The goal of entropy is to select the
attribute that minimizes the entropy of the resulting subsets, by splitting the dataset into
more homogeneous subsets with respect to the class labels.
4. The highest information gain attribute is chosen as the splitting criterion (i.e., the
reduction in entropy after splitting on that attribute), and the process is repeated
recursively to build the decision tree.
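The points above rely on the entropy formula H(S) = -Σ p_i * log2(p_i) and on information gain, which is the parent node's entropy minus the weighted entropy of its children. A small illustrative computation (the class counts are made up for the example):
import numpy as np

def entropy(counts):
    # Shannon entropy H = -sum(p * log2(p)) of a list of class counts
    p = np.array(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return -(p * np.log2(p)).sum()

# Hypothetical parent node with 10 samples of class A and 10 of class B
parent = entropy([10, 10])                                  # 1.0 bit: maximum uncertainty
# A candidate split producing children with counts [9, 1] and [1, 9]
children = (10/20) * entropy([9, 1]) + (10/20) * entropy([1, 9])
information_gain = parent - children
print(parent, children, information_gain)                   # gain is roughly 0.531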
Decision Tree
File decision_tree2.py
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

# Load the iris data as a pandas DataFrame via seaborn
iris = sns.load_dataset('iris')
iris.head()

# Plot the class-wise distributions of three of the features
sns.FacetGrid(iris, hue="species").map(sns.distplot, "petal_length").add_legend()
sns.FacetGrid(iris, hue="species").map(sns.distplot, "petal_width").add_legend()
sns.FacetGrid(iris, hue="species").map(sns.distplot, "sepal_length").add_legend()
plt.show()
File name decision_tree3.py
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Load the iris data as a pandas DataFrame via seaborn
iris = sns.load_dataset('iris')
iris.head()

# Features and target
X = iris.iloc[:, :-2]
y = iris.species
print(y)

# Hold out one third of the data for testing (50 samples, matching the report below)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=42)

# Train a decision tree that uses entropy as the split criterion
treemodel = DecisionTreeClassifier(criterion='entropy', max_depth=2)
treemodel.fit(X_train, y_train)

# Visualize the fitted tree
plt.figure(figsize=(15, 10))
tree.plot_tree(treemodel, filled=True)

# Evaluate on the test set
ypred = treemodel.predict(X_test)
score = accuracy_score(ypred, y_test)
print(score)
print(classification_report(ypred, y_test))
plt.show()
Output of decision_tree3.py: the accuracy score is 0.98, and the classification report shows an overall accuracy of 0.98 across the 50 test samples.
SOME DEFINITIONS
A scatter diagram, also known as a scatter plot or scatter graph, is a graphical representation of
the relationship between two continuous variables. It displays individual data points on a two-
dimensional graph, with one variable plotted on the x-axis and the other on the y-axis. Each point
represents the simultaneous values of the two variables for a specific observation.
The primary purpose of a scatter diagram is to visually assess the nature and strength of the
relationship between the variables. The pattern of points on the graph can provide insights into
potential correlations, trends, clusters, or outliers in the data.
Positive correlation: Points on the graph tend to form an upward-sloping pattern, indicating that
as one variable increases, the other also tends to increase.
Negative correlation: Points on the graph tend to form a downward-sloping pattern, suggesting
that as one variable increases, the other tends to decrease.
Scatter diagrams are widely used in various fields, including statistics, data analysis, and
scientific research, to visually explore and understand the relationships within a dataset.
A probability distribution is a statistical function that describes all the possible values and likelihoods that a random variable can take within a given range from a random experiment. This range is bounded between the minimum and maximum possible values, but precisely where a possible value is likely to fall on the probability distribution depends on a number of factors. These factors include the distribution's mean (average), standard deviation, skewness, and kurtosis. The sum of all probabilities (the area under the curve) must equal 1.
Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables by fitting a linear equation to the observed data. The goal is to find the best-fitting straight line (the regression line) that minimizes the sum of the squared differences between the observed values and the values predicted by the model. This line can then be used to predict the dependent variable for new, unseen data. In its simplest form the equation is y = mx + b.
Two common indicators of multicollinearity are:
1. High correlation coefficients: Multicollinearity is often indicated by high correlation between independent variables. A correlation coefficient close to +1 or -1 suggests a strong linear relationship.
2. Variance Inflation Factor (VIF): VIF is a measure of how much the variance of the estimated regression coefficients increases when your predictors are correlated. High VIF values (typically greater than 10) are indicative of multicollinearity.
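A quick, hypothetical check of VIF values with statsmodels might look like this; the data is generated so that x3 is nearly a copy of x1, which should drive both of their VIFs up:
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical predictors: x3 is almost a duplicate of x1
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x1": rng.normal(size=100),
    "x2": rng.normal(size=100),
})
df["x3"] = df["x1"] + rng.normal(scale=0.05, size=100)

# VIF for each predictor column
vif = [variance_inflation_factor(df.values, i) for i in range(df.shape[1])]
print(dict(zip(df.columns, np.round(vif, 1))))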
Multicollinearity is not always purely a problem:
1. Variable reduction: Multicollinearity can sometimes help identify redundant variables in a model; if two variables are highly correlated, it may be possible to combine or eliminate one of them, simplifying the model.
2. Improved predictive accuracy: In some cases, multicollinearity may not significantly impact the predictive accuracy of a model. If the primary goal is prediction rather than understanding individual variable effects, multicollinearity may be of lesser concern.
It does, however, create real difficulties:
2. Unreliable coefficients: The estimated coefficients of the affected variables become less stable, and small changes in the data can lead to large changes in the coefficients.
3. Model instability: Multicollinearity can make the model sensitive to small changes in the data, leading to model instability and potentially inaccurate predictions.
Addressing multicollinearity is crucial for building reliable regression models, and it requires a combination of statistical techniques and careful consideration of the variables included in the model.
Hypothesis testing is a statistical method used to make inferences about a population based on a
sample of data. It involves testing a hypothesis or a claim about a population parameter using
sample data. The process typically follows a structured set of steps. Let's break down the key
components and steps involved in hypothesis testing:
❖ The alternative hypothesis is the statement that contradicts the null hypothesis.
❖ It is denoted by Ha or H1
❖ It represents what the researcher is trying to establish.
3. Significance Level
❖ The significance level is the probability of rejecting the null hypothesis when it is
actually true.
❖ Commonly used values for α are 0.05, 0.01, and 0.10.
4. Test Statistic:
❖ The test statistic is a numerical value calculated from the sample data to
determine how far the sample result deviates from what is expected under the null
hypothesis.
• The critical region is the set of all values of the test statistic that leads to the rejection of
the null hypothesis.
6. P-value:
• The p-value is the probability of obtaining a test statistic as extreme as, or more extreme
than, the one observed in the sample, assuming the null hypothesis is true.
• A small p-value indicates evidence against the null hypothesis.
1. Formulate Hypotheses:
• State the null hypothesis (H0) and the alternative hypothesis (Ha).
• Decide on the level of significance (α), which represents the probability of making a
Type I error (rejecting a true null hypothesis).
4. Calculate P-value:
• Calculate the probability of obtaining the observed test statistic or more extreme values
under the null hypothesis.
5. Make a Decision:
6. Draw a Conclusion:
• State the conclusion in terms of the problem context and interpret the results.
Types of Errors:
Hypothesis testing is a crucial tool in research and decision-making, helping to draw valid
conclusions about populations based on limited sample data.
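As a concrete, hypothetical illustration of these steps, the snippet below runs a one-sample t-test with SciPy against a claimed population mean of 50; the sample itself is simulated for the example.
import numpy as np
from scipy import stats

# Hypothetical sample: does its mean differ from a claimed population mean of 50?
rng = np.random.default_rng(0)
sample = rng.normal(loc=52, scale=5, size=30)

t_stat, p_value = stats.ttest_1samp(sample, popmean=50)
alpha = 0.05
print(t_stat, p_value)
if p_value < alpha:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")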
A null hypothesis is a type of statistical hypothesis that proposes that no statistically significant difference, effect, or relationship exists in a set of given observations (parameters) of a population, or between different samples of a population. Hypothesis testing is used to assess the credibility of a hypothesis by using sample data. Sometimes referred to simply as the "null", it is represented as H0.
The null hypothesis, also known as the conjecture, is used in quantitative analysis to test theories about markets, investing strategies, or economics, to decide if an idea is true or false.
A test statistic is a crucial element in the framework of statistical hypothesis testing, serving as a standardized value that enables the determination of the probability of observing the data under the null hypothesis. It is the bridge between the sample data and the decision-making process regarding the hypotheses. The construction and interpretation of a test statistic follow a systematic approach designed to assess the evidence against the null hypothesis.
A test statistic is calculated from sample data to test a hypothesis about a population parameter. The choice of test statistic depends on the hypothesis being tested, the type of data collected, and the assumed distribution of the data. The statistic measures how far the sample data deviates from what is expected under the null hypothesis, and this deviation is then assessed to determine whether it is significant enough to warrant rejection of H0.
The calculation of a test statistic involves specific formulas that incorporate the sample data, the hypothesized value under H0, and often the sample size. Common types of test statistics include:
Z-statistic: Used in tests concerning the mean of a normally distributed population with known variance.
T-statistic: Used in tests concerning the mean of a normally distributed population with unknown variance.
Chi-square statistic: Used in tests of independence in contingency tables and for goodness-of-fit tests.
F-statistic: Used in analysis of variance (ANOVA) for comparing the means of three or more samples, by comparing the variance between groups with the variance within groups.
Conclusion:
The test statistic is a fundamental concept in statistics that quantifies the evidence against the null hypothesis. Its careful calculation and interpretation enable researchers to make informed decisions about the validity of their hypotheses, thereby advancing knowledge in their respective fields through rigorous scientific methods.
Shree Mahaveerai Namah
Before we dive into gradient descent, it may help to review some concepts from linear
regression. You may recall the following formula for the slope of a line, which is y = mx + b,
where m represents the slope and b is the intercept on the y-axis.
You may also recall plotting a scatterplot in statistics and finding the line of best fit, which
required calculating the error between the actual output and the predicted output (y-hat) using the
mean squared error formula. The gradient descent algorithm behaves similarly, but it is based on
a convex function.
The starting point is just an arbitrary point for us to evaluate the performance. From that starting
point, we will find the derivative (or slope), and from there, we can use a tangent line to observe
the steepness of the slope. The slope will inform the updates to the parameters—i.e. the weights
and bias. The slope at the starting point will be steeper, but as new parameters are generated, the
steepness should gradually reduce until it reaches the lowest point on the curve, known as the
point of convergence.
Similar to finding the line of best fit in linear regression, the goal of gradient descent is to
minimize the cost function, or the error between predicted and actual y. In order to do this, it
requires two data points—a direction and a learning rate. These factors determine the partial
derivative calculations of future iterations, allowing it to gradually arrive at the local or global
minimum (i.e. point of convergence).
• Learning rate (also referred to as step size or the alpha) is the size of the steps that are
taken to reach the minimum. This is typically a small value, and it is evaluated and
updated based on the behavior of the cost function. High learning rates result in larger
steps but risks overshooting the minimum. Conversely, a low learning rate has small step
sizes. While it has the advantage of more precision, the number of iterations
compromises overall efficiency as this takes more time and computations to reach the
minimum.
• The cost (or loss) function measures the difference, or error, between actual y and
predicted y at its current position. This improves the machine learning model's efficacy
by providing feedback to the model so that it can adjust the parameters to minimize the
error and find the local or global minimum. It continuously iterates, moving along the
direction of steepest descent (or the negative gradient) until the cost function is close to
or at zero. At this point, the model will stop learning. While the terms cost function and loss function are often used interchangeably, there is a slight difference between them: a loss function refers to the error of a single training example, while a cost function is the average error across the entire training set. A minimal sketch of a single gradient-descent update follows this list.
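As a minimal sketch of these ideas (the data and learning rate below are illustrative assumptions, not taken from the text), the following code computes the mean-squared-error cost for a simple linear model and performs a single gradient-descent update of the slope m and intercept b:

# One gradient-descent update for y = m*x + b using the MSE cost (illustrative values).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 8.8])      # made-up data roughly following y = 2x + 1

m, b = 0.0, 0.0                         # initial parameters
learning_rate = 0.01                    # alpha / step size

y_hat = m * x + b                       # predictions at the current parameters
cost = np.mean((y - y_hat) ** 2)        # cost: average squared error over the training set

dm = -2 * np.mean(x * (y - y_hat))      # partial derivative of the cost w.r.t. m
db = -2 * np.mean(y - y_hat)            # partial derivative of the cost w.r.t. b

m -= learning_rate * dm                 # step in the direction of steepest descent
b -= learning_rate * db
print(cost, m, b)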
There are three types of gradient descent learning algorithms: batch gradient descent, stochastic
gradient descent and mini-batch gradient descent.
Batch gradient descent sums the error for each point in the training set and updates the model only after all training examples have been evaluated; this process is referred to as a training epoch. While this batching provides computational efficiency, it can still mean long processing times for large training datasets, since all of the data must be held in memory. Batch gradient descent also usually produces a stable error gradient and convergence, but sometimes that convergence point is not ideal, settling at a local minimum rather than the global one.
Stochastic gradient descent
Stochastic gradient descent (SGD) runs a training epoch for each example within the dataset, updating the parameters one training example at a time. Since only one training example needs to be held at a time, it is easy to store in memory. While these frequent updates can offer more detail and speed, they can reduce computational efficiency compared with batch gradient descent. The frequent updates also produce noisy gradients, but this noise can be helpful for escaping a local minimum and finding the global one.
Mini-batch gradient descent combines concepts from both batch gradient descent and stochastic
gradient descent. It splits the training dataset into small batch sizes and performs updates on each
of those batches. This approach strikes a balance between the computational efficiency
of batch gradient descent and the speed of stochastic gradient descent.
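To make the three variants concrete, here is a minimal mini-batch gradient descent sketch for fitting y = m·x + b; the synthetic data, batch size, learning rate and number of epochs are all illustrative choices. Setting batch_size to the full dataset size would give batch gradient descent, and setting it to 1 would give stochastic gradient descent.

# Mini-batch gradient descent on synthetic data generated around y = 2x + 1.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=200)

m, b = 0.0, 0.0
learning_rate, batch_size, epochs = 0.01, 16, 100

for _ in range(epochs):                       # one epoch = one full pass over the data
    order = rng.permutation(len(x))           # shuffle before forming mini-batches
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]
        xb, yb = x[idx], y[idx]
        y_hat = m * xb + b
        dm = -2 * np.mean(xb * (yb - y_hat))  # MSE gradients on this mini-batch only
        db = -2 * np.mean(yb - y_hat)
        m -= learning_rate * dm
        b -= learning_rate * db

print(round(m, 2), round(b, 2))               # should end up near 2 and 1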
Challenges with gradient descent
While gradient descent is the most common approach for optimization problems, it does come
with its own set of challenges. Some of them include:
For convex problems, gradient descent can find the global minimum with ease, but for nonconvex problems it can struggle to find the global minimum, where the model achieves the best results.
Recall that when the slope of the cost function is at or close to zero, the model stops learning. A few scenarios beyond the global minimum can also yield this slope: local minima and saddle points. Local minima mimic the shape of a global minimum, where the slope of the cost function increases on either side of the current point. With saddle points, however, the negative gradient only exists on one side of the point, which reaches a local maximum on one side and a local minimum on the other. Its name is inspired by the shape of a horse’s saddle.
Noisy gradients can help gradient descent escape local minima and saddle points.
In deeper neural networks, particularly recurrent neural networks, we can also encounter two other problems when the model is trained with gradient descent and backpropagation.
• Vanishing gradients: This occurs when the gradient is too small. As we move
backwards during backpropagation, the gradient continues to become smaller, causing the
earlier layers in the network to learn more slowly than later layers. When this happens,
the weight parameters update until they become insignificant—i.e. 0—resulting in an
algorithm that is no longer learning.
• Exploding gradients: This happens when the gradient is too large, creating an unstable model. In this case, the model weights grow too large and are eventually represented as NaN. One solution to this issue is to leverage a dimensionality reduction technique, which can help to minimize complexity within the model. A small numerical illustration of both effects follows this list.
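As a rough numerical illustration of why these problems occur (my own example, not from the text): backpropagation multiplies many per-layer derivative factors together, so factors slightly below 1 shrink towards zero while factors slightly above 1 blow up.

# Repeated multiplication of per-layer gradient factors across 50 layers (illustrative).
layers = 50
vanishing = 0.5 ** layers   # about 8.9e-16: earlier layers receive almost no update signal
exploding = 1.5 ** layers   # about 6.4e8: weights grow huge and can overflow to NaN
print(vanishing, exploding)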
1. Introduction
by Robert Kwiatkowski
Gradient descent (GD) is an iterative first-order optimisation algorithm, used to find a local
minimum/maximum of a given function. This method is commonly used in machine
learning (ML) and deep learning (DL) to minimise a cost/loss function (e.g. in a linear
regression or least squares fitting). Due to its importance and ease of implementation, this algorithm
is usually taught at the beginning of almost all machine learning courses.
However, its use is not limited to ML/DL; it is also widely used in areas like:
❖ mechanical engineering
We will do a deep dive into the math, implementation and behaviour of the first-order gradient
descent algorithm. We will navigate a custom (cost) function directly to find its minimum. That
means there will be no underlying data like in typical ML tutorials — we will be more flexible
regarding a function’s shape.
This method was proposed long before the era of modern computers by Augustin-Louis Cauchy
in 1847. Since then, there has been significant development in computer science and numerical methods, which has led to numerous improved versions of gradient descent. However, we’re going to
use a basic/vanilla version implemented in Python.
2. Function requirements
The gradient descent algorithm does not work for all functions. There are two specific requirements.
A function has to be:
❖ differentiable
❖ convex
The first requirement is straightforward: a function is differentiable if it has a derivative at every point of its domain (no jumps, cusps or other discontinuities). The next requirement is that the function has to be convex. For a univariate function, this means that the line segment connecting any two points on the function’s curve lies on or above the curve (it does not cross it). If it does cross, the function has a local minimum which is not a global one.
Mathematically, for two points x₁, x₂ lying on the function’s curve, this condition is expressed as:
f(λx₁ + (1 − λ)x₂) ≤ λf(x₁) + (1 − λ)f(x₂)
where λ denotes a point’s location on the section line and its value has to be between 0 (point x₂) and 1 (point x₁), e.g. λ = 0.5 means a location halfway between the two points.
Another way to check this for a twice-differentiable univariate function is to compute the second derivative: if it is always greater than 0, the function is strictly convex.
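As a quick sanity check (my own sketch, using f(x) = x² as the example convex function), the convexity condition can be verified numerically for many random pairs of points and values of λ:

# Numerical check of f(lam*x1 + (1 - lam)*x2) <= lam*f(x1) + (1 - lam)*f(x2) for f(x) = x**2.
import numpy as np

def f(x):
    return x**2

rng = np.random.default_rng(1)
all_hold = True
for _ in range(10_000):
    x1, x2 = rng.uniform(-10, 10, size=2)
    lam = rng.uniform(0, 1)
    lhs = f(lam * x1 + (1 - lam) * x2)
    rhs = lam * f(x1) + (1 - lam) * f(x2)
    all_hold = all_hold and (lhs <= rhs + 1e-12)  # tiny tolerance for floating point
print(all_hold)  # True: no random pair violates the inequality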
It is also possible to use quasi-convex functions with a gradient descent algorithm. However, they often have so-called saddle points (also called minimax points) where the algorithm can get stuck (as demonstrated later). An example of a quasi-convex function is f(x) = x⁴ − 2x³ + 2 (the constant term does not affect the analysis).
Let’s stop here for a moment. The first derivative, f′(x) = 4x³ − 6x² = 2x²(2x − 3), equals zero at x = 0 and x = 1.5. These points are candidates for the function’s extrema (minimum or maximum) — the slope is zero there. But first we have to check the second derivative.
The second derivative, f″(x) = 12x² − 12x = 12x(x − 1), is zero for x = 0 and x = 1. These locations are inflection points — places where the curvature changes sign, i.e. where the function changes from convex to concave or vice versa. Analysing this expression, we conclude that the function is convex for x < 0 and for x > 1, and concave for 0 < x < 1.
Now we see that the point x = 0 has both the first and second derivative equal to zero, meaning it is a saddle point, while the point x = 1.5 is the global minimum. The graph of this function confirms this: the saddle point is at x = 0 and the minimum at x = 1.5.
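Since the plots themselves are not reproduced here, a short SymPy check confirms these points; the function f(x) = x⁴ − 2x³ + 2 is the assumed example from above (the constant term is irrelevant to the derivatives).

# Symbolic check of the critical and inflection points with SymPy.
import sympy as sp

x = sp.symbols('x')
f = x**4 - 2*x**3 + 2                 # assumed quasi-convex example

f1 = sp.diff(f, x)                    # 4*x**3 - 6*x**2
f2 = sp.diff(f, x, 2)                 # 12*x**2 - 12*x

print(sp.solve(sp.Eq(f1, 0), x))      # [0, 3/2]  -> candidate extrema
print(sp.solve(sp.Eq(f2, 0), x))      # [0, 1]    -> inflection points
print(f2.subs(x, sp.Rational(3, 2)))  # 9 > 0, so x = 1.5 is a minimum; at x = 0 both derivatives vanish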
For multivariate functions, the most appropriate way to check whether a critical point is a saddle point is to compute the Hessian matrix, which involves somewhat more complex calculations and is beyond the scope of this syllabus.
Before jumping into code, one more thing has to be explained — what a gradient is. Intuitively, it is the slope of a curve at a given point in a specified direction.
In the case of a univariate function, it is simply the first derivative at a selected point. In the
case of a multivariate function, it is a vector of derivatives in each main direction (along
variable axes). Because each of these is the slope along one axis only, ignoring the others, they are called partial derivatives.
Nabla or Del
The upside-down triangle is the so-called nabla symbol, read “del”; ∇f denotes the gradient of f. To see how it is calculated, the original article works through a two-variable example function (shown there as a 3D plot) and evaluates both partial derivatives at a chosen point. Comparing the two values, we conclude that the slope of that example function is twice as steep along the y axis as along the x axis.
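Because the example function was shown only as an image, here is a minimal sketch with an assumed function f(x, y) = 0.5x² + y² (my own choice, consistent with the “twice as steep along y” conclusion), computing its partial derivatives analytically and checking them numerically:

# Gradient of the assumed function f(x, y) = 0.5*x**2 + y**2: df/dx = x, df/dy = 2*y.
def f(x: float, y: float) -> float:
    return 0.5 * x**2 + y**2

def gradient_f(x: float, y: float):
    return (x, 2 * y)                 # analytical partial derivatives

h = 1e-6                              # central finite differences as a numerical check
x0, y0 = 10.0, 10.0
df_dx = (f(x0 + h, y0) - f(x0 - h, y0)) / (2 * h)
df_dy = (f(x0, y0 + h) - f(x0, y0 - h)) / (2 * h)
print(gradient_f(x0, y0), (round(df_dx, 3), round(df_dy, 3)))  # (10.0, 20.0) both ways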
The Gradient Descent algorithm iteratively calculates the next point using the gradient at the current position, scales it (by a learning rate) and subtracts the obtained value from the current position (makes a step). It subtracts the value because we want to minimise the function (to maximise it we would add). This process can be written as:
pₙ₊₁ = pₙ − η∇f(pₙ)
There’s an important parameter η which scales the gradient and thus controls the step size. In machine learning it is called the learning rate and it has a strong influence on performance.
• The smaller the learning rate, the longer GD takes to converge, and it may reach the maximum number of iterations before reaching the optimum point.
• If the learning rate is too big, the algorithm may not converge to the optimal point (it jumps around) or may even diverge completely.
In summary, the gradient descent algorithm can be described by the following steps:
1. choose a starting point (initialisation)
2. calculate the gradient at this point
3. make a scaled step in the opposite direction to the gradient (objective: minimise)
4. repeat steps 2 and 3 until one of the stopping criteria is met:
• the maximum number of iterations is reached
• the step size is smaller than the tolerance (due to scaling or a small gradient)
Below, there’s an exemplary implementation of the Gradient Descent algorithm (with steps
tracking):
import numpy as np
from typing import Callable

def gradient_descent(start: float, gradient: Callable[[float], float],
                     learn_rate: float, max_iter: int, tol: float = 0.01):
    x = start
    steps = [start]                   # history tracking
    for _ in range(max_iter):
        diff = learn_rate * gradient(x)
        if np.abs(diff) < tol:        # stop when the step becomes smaller than the tolerance
            break
        x = x - diff                  # step against the gradient
        steps.append(x)               # history tracking
    return steps, x
This function takes five parameters:
1. starting point [float] - in our case we define it manually, but in practice it is often a random initialisation
2. gradient function [object] - a function calculating the gradient, which has to be specified beforehand and passed to the GD function
3. learning rate [float] - the factor that scales the gradient to set the step size
4. maximum number of iterations [int] - an upper bound on how many steps are taken
5. tolerance [float] - to conditionally stop the algorithm (here the default value is 0.01)
As an example, take the quadratic function f(x) = x² − 4x + 1, whose gradient is f′(x) = 2x − 4:
def func1(x: float):
    return x**2 - 4*x + 1

def gradient_func1(x: float):
    return 2*x - 4
For this function, by taking a learning rate of 0.1 and a starting point at x = 9 we can easily calculate each step by hand. Let’s do it for the first 3 steps:
x₁ = 9 − 0.1·(2·9 − 4) = 9 − 1.4 = 7.6
x₂ = 7.6 − 0.1·(2·7.6 − 4) = 7.6 − 1.12 = 6.48
x₃ = 6.48 − 0.1·(2·6.48 − 4) = 6.48 − 0.896 = 5.584
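Putting the pieces together, the gradient function can be passed to the gradient_descent implementation above; the max_iter value below is an arbitrary choice for illustration.

# Run the gradient_descent function defined earlier on the example gradient.
history, x_min = gradient_descent(start=9.0, gradient=gradient_func1,
                                  learn_rate=0.1, max_iter=100)
print(history[:4])  # [9.0, 7.6, 6.48, 5.584] -- matches the hand calculation
print(x_min)        # close to 2.0, the true minimum of x**2 - 4*x + 1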
The animation below shows the steps taken by the GD algorithm for learning rates of 0.1 and 0.8. As you can see, for the smaller learning rate the steps get gradually smaller as the algorithm approaches the minimum. For the bigger learning rate, it jumps from one side to the other before converging.
Trajectories, number of iterations and the final converged result (within tolerance) for various
learning rates are shown below:
Now let’s see how the algorithm copes with the quasi-convex function we investigated mathematically before.
Below are the results for two learning rates and two different starting points.
Below is an animation for a learning rate of 0.4 and a starting point x = −0.5 (animation of GD trying to escape from a saddle point; image by author).
Now you can see that the existence of a saddle point poses a real challenge for first-order gradient descent algorithms like GD, and obtaining a global minimum is not guaranteed. Second-order algorithms deal with these situations better (e.g. the Newton-Raphson method).
The investigation of saddle points and how to escape from them is a subject of ongoing research, and various solutions have been proposed. For example, Chi Jin and M. Jordan proposed a Perturbed Gradient Descent algorithm.
7. Summary
We have learned how the Gradient Descent algorithm works, when it can be used, and what some common challenges are when using it. I hope this will be a good starting point for exploring more advanced gradient-based optimisation methods like Momentum, Nesterov (Accelerated) Gradient Descent, RMSprop and ADAM, or second-order ones like the Newton-Raphson algorithm.