What is Machine Learning?
Machine Learning is a branch of artificial intelligence that develops algorithms by learning the hidden patterns of datasets and uses them to make predictions on new, similar data, without being explicitly programmed for each task.
Traditional machine learning combines data with statistical tools to predict an output that can be used to make actionable insights.
Machine learning is used in many different applications, from image and speech recognition to natural language processing, recommendation systems, fraud detection, portfolio optimization, automated tasks, and so on. Machine learning models are also used to power autonomous vehicles, drones, and robots, making them more intelligent and adaptable to changing environments.
History of the machine learning -
The early days -
Machine learning history starts in 1943 with the first mathematical model of neural networks, presented in the scientific paper "A logical calculus of the ideas immanent in nervous activity" by Walter Pitts and Warren McCulloch.
Then, in 1949, the book The Organization of Behavior by Donald Hebb was published. The book had theories on how behavior relates to neural networks and brain activity and would go on to become one of the monumental pillars of machine learning development.
In 1950, Alan Turing created the Turing Test to determine if a computer has real intelligence. To pass the test, a computer must be able to fool a human into believing it is also human.
Machine learning later shifted to a data-driven approach. Scientists began creating programs for computers to analyze large amounts of data and draw conclusions — or "learn" — from the results. And in 1997, IBM's Deep Blue shocked the world by beating the world champion at chess.
The term "deep learning" was coined in 2006 by Geoffrey Hinton to explain new algorithms that let computers "see" and distinguish objects and text in images and videos.
Four years later, in 2010, Microsoft revealed that their Kinect technology could track 20 human features at a rate of 30 times per second, allowing people to interact with the computer via movements and gestures. The following year, IBM's Watson beat its human competitors at Jeopardy.
Google Brain was developed in 2011, and its deep neural network could learn to discover and categorize objects much the way a cat does. The following year, the tech giant's X Lab developed a machine learning algorithm that is able to autonomously browse YouTube videos to identify the videos that contain cats.
In 2014, Facebook developed DeepFace, a software algorithm that is able to recognize or verify individuals in photos to the same level as humans can.
2015 - Present day
Amazon launched its own machine learning platform in 2015. Microsoft also created the Distributed Machine Learning Toolkit, which enabled the efficient distribution of machine learning problems across multiple computers.
Then more than 3,000 AI and robotics researchers, endorsed by Stephen Hawking, Elon Musk and Steve Wozniak (among many others), signed an open letter warning of the danger of autonomous weapons which select and engage targets without human intervention.
2. Automation: This is one of the significant applications of machine learning that helps to make systems automated. It helps machines to perform repetitive tasks without human intervention. As a machine learning engineer or data scientist, you have the responsibility to solve a given task multiple times with no errors. However, this is not practically possible for humans. Hence, machine learning provides various models to automate the process, with the capability of performing iterative tasks in less time.
3. Banking and Finance: Machine learning is a subset of AI that uses statistical models to make accurate predictions. In the banking and finance sector, machine learning helps in many ways, such as fraud detection, portfolio management, risk management, chatbots, document analysis, high-frequency trading, mortgage underwriting, AML detection, anomaly detection, credit risk scoring, KYC processing, etc. Hence, machine learning is widely applied in the banking and finance sector to reduce error as well as time.
4. Transportation and Traffic Prediction: This is one of the most common applications of machine learning, widely used by individuals in their daily routine. It helps to ensure highly secured routes, generate accurate ETAs, predict vehicle breakdowns, support driving prescriptive analytics, etc. Although machine learning has solved many transportation problems, it still requires more improvement. Statistical machine learning algorithms help to build a smart transportation system. Further, deep learning explores the complex interactions of roads, highways, traffic, environmental elements, crashes, etc. Hence, machine learning technology has improved daily traffic management as well as the collection of traffic data to predict insights into routes and traffic.
5. Image Recognition: It is one of the most common applications of machine learning, used to detect images over the internet. Various social media sites such as Facebook use image recognition for tagging the images of your Facebook friends with its feature named auto friend tagging suggestion. Further, nowadays almost all mobile devices come with exciting face detection features. Using this feature, you can secure your mobile data with face unlocking, so if anyone tries to access your mobile device, they cannot open it without face recognition.
6. Speech Recognition: Speech recognition is one of the biggest achievements of machine learning applications. It enables users to search content without writing text or, in other words, to 'search by voice'. It can search content/products on YouTube, Google, Amazon, etc., by your voice. This technology is referred to as speech recognition. It is a process of converting voice instructions into text; hence it is also known as 'Speech to text' or 'Computer speech recognition'. Some important examples of speech recognition are Google Assistant, Siri, Cortana, Alexa, etc.
7. Product Recommendation: It is one of the biggest achievements made by machine learning, which helps various e-commerce and entertainment companies like Flipkart, Amazon, Netflix, etc., to digitally advertise their products over the internet. When anyone searches for a product, they start getting advertisements for the same product while surfing the internet on the same browser. This is possible through machine learning algorithms that work on users' interests or past experience and accordingly recommend products. For example, when we search for a laptop on the Amazon platform, we also start seeing many other laptops having the same categories and criteria. Similarly, when we use Netflix, we find some recommendations for entertainment series, movies, etc. Hence, this is also possible by machine learning algorithms.
8. Virtual Personal Assistance: This feature helps us in many ways, such as searching content using voice instructions, calling a number using voice, searching a contact in your mobile, playing music, opening an email, scheduling an appointment, etc. Nowadays, you have all seen advertising like "Alexa! Play the Music"; this is also done with the help of machine learning. Google Assistant, Alexa, Cortana, Siri, etc., are a few common applications of machine learning. These virtual personal assistants record our voice instructions, send them over to a server on the cloud, decode them using ML algorithms, and act accordingly.
9. Email Spam and Malware detection & Filtering: Machine learning also helps us in filtering emails into different categories such as spam, important, general, etc. In this way, users can easily identify whether an email is useful or spam. This is also possible through machine learning algorithms such as Multi-Layer Perceptron, Decision Tree, and the Naïve Bayes classifier. Content filter, header filter, rules-based filter, permission filter, general blacklist filter, etc., are some important spam filters used by Google.
10. Self-driving cars: This is one of the most exciting applications of machine learning. Machine learning plays a vital role in the manufacturing of self-driving cars. It uses an unsupervised learning method to train car models to detect people and objects while driving. Tata and Tesla are popular car manufacturing companies working on self-driving cars. Hence, it is a big revolution in the technological era, which is also achieved with the help of machine learning.
11. Credit card fraud detection: Credit card frauds have become very easy targets for online hackers. As the culture of online/digital payments increases, the risk of credit/debit card fraud increases in parallel. Machine learning also helps developers to detect and analyze frauds in online transactions. It enables fraud detection methods for streaming transaction data, with the objective of analyzing the past transaction details of customers and extracting their behavioral patterns. Further, cardholders are clustered into various categories based on their transaction amounts so that the behavioral pattern of each group can be extracted. Hence, credit card fraud detection is an approach that uses the aggregation strategy and feedback mechanism of machine learning.
12. Stock Marketing and Trading: Machine learning also helps in the stock marketing and trading sector, where it uses historical trends or past experience for predicting market risk. As share trading carries market risk, machine learning reduces it to some extent and predicts outcomes against market risk. Machine learning's long short-term memory neural network is used for the prediction of stock market trends.
13. Language Translation: The use of machine learning can be seen in language translation. It uses sequence-to-sequence learning algorithms for translating one language into another. Further, it also uses image recognition techniques to identify text from one language to another. Similarly, Google's GNMT (Google Neural Machine Translation) provides this feature, which is a neural machine translation system that translates text into our familiar language, and it is called automatic translation.
> Diagrammatic representation of machine-learning methods, datasets and methods of validation used in prediction of PPIs: (A) classification of different machine-learning methods into supervised and unsupervised approaches; (B) training, testing and blind datasets for k-fold cross-validation; (C) training and testing datasets for bootstrap validation.
A typical machine learning project involves the following steps:
1) Problem Definition
2) Data Collection
3) Data Cleaning and Pre-processing
4) Exploratory Data Analysis (EDA)
5) Feature Engineering and Selection
6) Model Selection
7) Model Training
8) Model Evaluation and Tuning
9) Model Deployment
10) Model Monitoring and Maintenance
Below are some more real-world applications of machine learning:
2) Speech Recognition
Speech recognition is a process of converting voice instructions into text, and it is also known as "Speech to text" or "Computer speech recognition." At present, machine learning algorithms are widely used in various applications of speech recognition. Google Assistant, Siri, Cortana, and Alexa use speech recognition technology to follow voice instructions.
3) Traffic prediction
If we want to visit a new place, we take the help of Google Maps, which shows us the correct path with the shortest route and predicts the traffic conditions.
It predicts the traffic conditions, such as whether traffic is cleared, slow-moving, or heavily congested, with the help of two things:
* Real-time location of the vehicle from the Google Maps app and sensors
* Average time taken on past days at the same time.
Everyone who is using Google Maps is helping this app to get better. It takes information from the user and sends it back to its database to improve the performance.
4) Product recommendations
Machine learning is widely used by various e-commerce and entertainment companies such as Amazon, Netflix, etc., for product recommendation to the user. Whenever we search for some product on Amazon, we start getting an advertisement for the same product while surfing the internet on the same browser, and this is because of machine learning.
Google understands the user's interest using various machine learning algorithms and suggests products as per customer interest.
Similarly, when we use Netflix, we find some recommendations for entertainment series, movies, etc., and this is also done with the help of machine learning.
5) Self-driving cars
One of the most exciting applications of machine learning is self-driving cars. Machine learning plays a significant role in self-driving cars. Tesla, the most popular car manufacturing company, is working on self-driving cars. It uses an unsupervised learning method to train the car models to detect people and objects while driving.
6) Email Spam and Malware Filtering
Whenever we receive a new email, it is filtered automatically as important, normal, or spam. We always receive important mail in our inbox with the important symbol and spam emails in our spam box, and the technology behind this is machine learning. Below are some spam filters used by Gmail:
*Content Filter
*Header filter
*General blacklists filter
*Rules-based filters
*Permission filters
Some machine learning algorithms such as Multi-Layer Perceptron, Decision Tree, and the Naïve Bayes classifier are used for email spam filtering and malware detection.
7) Virtual Personal Assistant
We have various virtual personal assistants such as Google Assistant, Alexa, Cortana, and Siri. As the name suggests, they help us in finding information using our voice instructions. These assistants can help us in various ways just by our voice instructions, such as playing music, calling someone, opening an email, scheduling an appointment, etc.
These virtual assistants use machine learning algorithms as an important part. These assistants record our voice instructions, send them over to a server on the cloud, decode them using ML algorithms, and act accordingly.
8) Online Fraud Detection
Machine learning is making our online transactions safe and secure by detecting fraudulent transactions. Whenever we perform an online transaction, there may be various ways that a fraudulent transaction can take place, such as fake accounts, fake IDs, and stealing money in the middle of a transaction. To detect this, a Feed Forward Neural Network helps us by checking whether it is a genuine transaction or a fraud transaction.
For each genuine transaction, the output is converted into some hash values, and these values become the input for the next round. For each genuine transaction there is a specific pattern, which changes for a fraud transaction; hence, the network detects it and makes our online transactions more secure.
9) Stock Market trading
Machine learning is widely used in stock market trading. In the stock market, there is always a risk of ups and downs in shares, so for this, machine learning's long short-term memory neural network is used for the prediction of stock market trends.
10) Medical Diagnosis
In medical science, machine learning is used for disease diagnosis. With this, medical technology is growing very fast and is able to build 3D models that can predict the exact position of lesions in the brain.
It helps in finding brain tumors and other brain-related diseases easily.
11) Automatic Language Translation
Nowadays, if we visit a new place and we are not aware of the language, then it is not a problem at all, as machine learning also helps us here by converting the text into our known languages. Google's GNMT (Google Neural Machine Translation) provides this feature, which is a neural machine translation system that translates the text into our familiar language, and it is called automatic translation.
The technology behind automatic translation is a sequence-to-sequence learning algorithm, which is used with image recognition and translates the text from one language to another language.
Unit – 2
Dimensionality Reduction
> Dimensionality reduction is the process of reducing the number of features (or dimensions) in a dataset while retaining as much information as possible. This can be done for a variety of reasons, such as to reduce the complexity of a model, to improve the performance of a learning algorithm, or to make it easier to visualize the data. There are several techniques for dimensionality reduction, including principal component analysis (PCA), singular value decomposition (SVD), and linear discriminant analysis (LDA). Each technique uses a different method to project the data onto a lower-dimensional space while preserving important information.
> Dimensionality reduction is a technique used to reduce the number of features in a dataset while retaining as much of the important information as possible. In other words, it is a process of transforming high-dimensional data into a lower-dimensional space that still preserves the essence of the original data.
In machine learning, high-dimensional data refers to data with a large number of features or variables. The curse of dimensionality is a common problem in machine learning, where the performance of the model deteriorates as the number of features increases. This is because the complexity of the model increases with the number of features, and it becomes more difficult to find a good solution. In addition, high-dimensional data can also lead to overfitting, where the model fits the training data too closely and does not generalize well to new data.
Dimensionality reduction can help to mitigate these problems by reducing the complexity of the model and improving its generalization performance. There are two main approaches to dimensionality reduction: feature selection and feature extraction.
1) Feature Selection:
Feature selection involves selecting a subset of the original features that are most relevant to the problem at hand. The goal is to reduce the dimensionality of the dataset while retaining the most important features. There are several methods for feature selection, including filter methods, wrapper methods, and embedded methods. Filter methods rank the features based on their relevance to the target variable, wrapper methods use the model performance as the criterion for selecting features, and embedded methods combine feature selection with the model training process.
2) Feature Extraction:
Feature extraction involves creating new features by combining or transforming the original features. The goal is to create a set of features that captures the essence of the original data in a lower-dimensional space. There are several methods for feature extraction, including principal component analysis (PCA), linear discriminant analysis (LDA), and t-distributed stochastic neighbor embedding (t-SNE). PCA is a popular technique that projects the original features onto a lower-dimensional space while preserving as much of the variance as possible.
* Row Vector
A matrix having only one row is called a row vector.
* Column Vector
A matrix having only one column is called a column vector.
How to represent a dataset
A dataset is a collection of data in which data is arranged in some order. A dataset can contain any data, from a series of an array to a database table. The below table shows an example of a dataset:
Country    Age    Salary    Purchased
India      38     48000     No
France     43     45000     Yes
France     48     65000     No
Germany    40               Yes
India      35     58000     Yes
A tabular dataset can be understood as a database table or matrix, where each column corresponds to a particular variable, and each row corresponds to a record of the dataset. The most supported file type for a tabular dataset is the "Comma Separated Values" (CSV) file. But to store "tree-like data," we can use the JSON file more efficiently.
Types of data in datasets
1) Numerical data: Such as house price, temperature, etc.
2) Categorical data: Such as Yes/No, True/False, Blue/Green, etc.
3) Ordinal data: These data are similar to categorical data but can be measured on the basis of comparison.
Note: A real-world dataset is of huge size, which is difficult to manage and process at the initial level. Therefore, to practice machine learning algorithms, we can use any dummy dataset.
Types of datasets
Machine learning covers different domains, each requiring specific kinds of datasets. A few common types of datasets used in machine learning include:
1) Image Datasets:
Image datasets contain a collection of images and are normally used in computer vision tasks such as image classification, object detection, and image segmentation.
Examples:
*ImageNet
*CIFAR-10
*MNIST
2) Text Datasets:
Text datasets consist of textual data, like articles, books, or social media posts. These datasets are used in NLP tasks like sentiment analysis, text classification, and machine translation.
Examples:
*Gutenberg Task dataset
*IMDb film reviews dataset
3) Time Series Datasets:
Time series datasets consist of data points collected over time. They are generally used in forecasting, anomaly detection, and trend analysis.
Examples:
*Stock exchange data
*Climate data
*Sensor readings
4) Tabular Datasets:
Tabular datasets are structured data organized in tables or spreadsheets. They contain rows representing instances or samples and columns representing features or attributes. Tabular datasets are used for tasks like regression and classification. The dataset given earlier in the article is an example of a tabular dataset.
Need of Dataset
*Properly prepared and pre-processed datasets are important for machine learning projects.
*They provide the foundation for training accurate and reliable models. However, working with large datasets can present challenges in terms of management and processing.
*To address these challenges, efficient data management and processing techniques are needed.
Data Preprocessing in Machine Learning
Data preprocessing is a process of preparing the raw data and making it suitable for a machine learning model. It is the first and crucial step while creating a machine learning model.
When creating a machine learning project, it is not always the case that we come across clean and formatted data. And while doing any operation with data, it is mandatory to clean it and put it in a formatted way. So for this, we use the data preprocessing task.
It involves the below steps:
*Getting the dataset
*Importing libraries
*Importing datasets
*Finding Missing Data
*Encoding Categorical Data
*Splitting dataset into training and test set
*Feature scaling
1) Get the Dataset
To create a machine learning model, the first thing we require is a dataset, as a machine learning model completely works on data. The collected data for a particular problem in a proper format is known as the dataset.
The dataset may be in different formats for different purposes. For example, if we want to create a machine learning model for a business purpose, the dataset will be different from the dataset required for a liver patient. So each dataset is different from another dataset. To use the dataset in our code, we usually put it into a CSV file. However, sometimes we may also need to use an HTML or xlsx file.
2) Importing Libraries
In order to perform data preprocessing using Python, we need to import some predefined Python libraries. These libraries are used to perform some specific jobs. There are three specific libraries that we will use for data preprocessing, which are:
Numpy: The Numpy Python library is used for including any type of mathematical operation in the code. It is the fundamental package for scientific calculation in Python. It also supports adding large, multidimensional arrays and matrices. So, in Python, we can import it as:
import numpy as nm
Here we have used nm, which is a short name for Numpy, and it will be used in the whole program.
Matplotlib: The second library is matplotlib, which is a Python 2D plotting library, and with this library, we need to import a sub-library, pyplot. This library is used to plot any type of chart in Python for the code. It will be imported as below:
import matplotlib.pyplot as mpt
Here we have used mpt as a short name for this library.
Pandas: The last library is the Pandas library, which is one of the most famous Python libraries and is used for importing and managing the datasets. It is an open-source data manipulation and analysis library. It will be imported as below:
import pandas as pd
Here, we have used pd as a short name for this library.
3) Importing the Datasets
Now we need to import the datasets which we have collected for our machine learning project. But before importing a dataset, we need to set the current directory as a working directory. To set a working directory in Spyder IDE, we need to follow the below steps:
*Save your Python file in the directory which contains the dataset.
*Go to the File explorer option in Spyder IDE, and select the required directory.
*Click on the F5 button or the run option to execute the file.
Here, in the below image, we can see the Python file along with the required dataset. Now, the current folder is set as a working directory.
Now to import the dataset, we will use the read_csv() function of the pandas library, which is used to read a csv file and perform various operations on it. Using this function, we can read a csv file locally as well as through a URL.
We can use the read_csv function as below:
data_set= pd.read_csv('Dataset.csv')
Here, data_set is the name of the variable to store our dataset, and inside the function, we have passed the name of our dataset. Once we execute the above line of code, it will successfully import the dataset in our code. We can also check the imported dataset by clicking on the section variable explorer, and then double clicking on data_set. Consider the below image:
4) Handling Missing Data:
The next step of data preprocessing is to handle missing data in the datasets. If our dataset contains some missing data, then it may create a huge problem for our machine learning model. Hence it is necessary to handle missing values present in the dataset.
Ways to handle missing data:
There are mainly two ways to handle missing data, which are:
By deleting the particular row: The first way is commonly used to deal with null values. In this way, we just delete the specific row or column which consists of null values. But this way is not so efficient, and removing data may lead to loss of information, which will not give an accurate output.
By calculating the mean: In this way, we will calculate the mean of the column or row which contains any missing value and will put it in place of the missing value. This strategy is useful for features which have numeric data such as age, salary, year, etc. Here, we will use this approach. Consider the below output:
To handle missing values, we will use the Scikit-learn library in our code, which contains various utilities for building machine learning models. Here we will use the Imputer class.
array([['India', 38.0, 68000.0],
['France', 43.0, 45000.0],
['Germany', 30.0, 54000.0],
['France', 48.0, 65000.0],
['Germany', 40.0, 65222.22222222222],
['India', 35.0, 58000.0],
['Germany', 41.111111111111114, 53000.0],
['France', 49.0, 79000.0],
['India', 50.0, 88000.0],
['France', 37.0, 77000.0]], dtype=object)
As we can see in the above output, the missing values have been replaced with the means of the rest of the column values.
5) Encoding Categorical Data:
Categorical data is data which has some categories, such as, in our dataset, the two categorical variables Country and Purchased.
Since a machine learning model completely works on mathematics and numbers, if our dataset has a categorical variable, then it may create trouble while building the model. So it is necessary to encode these categorical variables into numbers.
Firstly, we will convert the country variables into categorical data. To do this, we will use the LabelEncoder() class from the preprocessing library.
* Standardization (Z-Score Normalization)
Standardization or Z-Score Normalization is the transformation of features by subtracting the mean and dividing by the standard deviation. This is often called the Z-score:
X_new = (X - mean) / Std
Standardization can be helpful in cases where the data follows a Gaussian distribution. However, this does not have to be necessarily true. Geometrically speaking, it translates the data so that the mean vector of the original data moves to the origin, and it squishes or expands the points so that the standard deviation becomes 1. We are just changing the mean and standard deviation to those of a standard normal distribution, which is still normal, and thus the shape of the distribution is not affected.
Standardization does not get affected by outliers because there is no predefined range of transformed features.
* Covariance of a data matrix
Covariance matrices represent the covariance values of each pair of variables in multivariate data. These values show the distribution magnitude and direction of multivariate data in a multidimensional space and can allow you to gather information about how data spreads among two dimensions.
A covariance matrix is a type of matrix used to describe the covariance values between two items in a random vector. It is also known as the variance-covariance matrix because the variance of each element is represented along the matrix's major diagonal and the covariance is represented among the non-diagonal elements. A covariance matrix is usually a square matrix. It is also positive semi-definite and symmetric. This matrix comes in handy when it comes to stochastic modeling and Principal Component Analysis.
* Column standardization
Geometrically, column standardization means squishing the data points such that the mean vector comes to the origin and the variance (by either squishing or expanding) on any axis becomes 1 in the transformed space. Column standardization is often called mean centering and variance scaling (squishing/expanding).
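As a minimal, self-contained sketch of the preprocessing steps described above (imputing missing values with the column mean, encoding the categorical Country and Purchased columns, splitting, and standardizing the numeric features), the snippet below uses the current scikit-learn API. Note that the older Imputer class mentioned above has been replaced by SimpleImputer in recent scikit-learn versions, and the small inline dataset here is only illustrative, not the original Dataset.csv.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split

# A small illustrative dataset with the same columns as the example table above
data_set = pd.DataFrame({
    'Country': ['India', 'France', 'France', 'Germany', 'India'],
    'Age': [38, 43, 48, 40, 35],
    'Salary': [48000, 45000, 65000, np.nan, 58000],
    'Purchased': ['No', 'Yes', 'No', 'Yes', 'Yes'],
})

x = data_set.iloc[:, :-1].values   # independent variables (Country, Age, Salary)
y = data_set.iloc[:, -1].values    # dependent variable (Purchased)

# Handling missing data: replace missing numeric values with the column mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
x[:, 1:3] = imputer.fit_transform(x[:, 1:3])

# Encoding categorical data: convert Country and Purchased into numbers
x[:, 0] = LabelEncoder().fit_transform(x[:, 0])
y = LabelEncoder().fit_transform(y)

# Splitting the dataset into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

# Feature scaling: standardize columns to zero mean and unit variance (Z-score)
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
print(x_train)

# Covariance matrix of the standardized numeric features (Age, Salary)
print(np.cov(x_train[:, 1:3].astype(float), rowvar=False))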
Principal Component Analysis (PCA)
As the number of features or dimensions in a dataset increases, the amount of data required to obtain a statistically significant result increases exponentially. This can lead to issues such as overfitting, increased computation time, and reduced accuracy of machine learning models; this is known as the curse of dimensionality, the set of problems that arise while working with high-dimensional data.
As the number of dimensions increases, the number of possible combinations of features increases exponentially, which makes it computationally difficult to obtain a representative sample of the data, and it becomes expensive to perform tasks such as clustering or classification. Additionally, some machine learning algorithms can be sensitive to the number of dimensions, requiring more data to achieve the same level of accuracy as lower-dimensional data.
To address the curse of dimensionality, feature engineering techniques are used, which include feature selection and feature extraction. Dimensionality reduction is a type of feature extraction technique that aims to reduce the number of input features while retaining as much of the original information as possible.
The Principal Component Analysis (PCA) technique was introduced by the mathematician Karl Pearson in 1901. It works on the condition that while the data in a higher-dimensional space is mapped to data in a lower-dimensional space, the variance of the data in the lower-dimensional space should be maximum.
*Principal Component Analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of correlated variables into a set of uncorrelated variables. PCA is the most widely used tool in exploratory data analysis and in machine learning for predictive models.
*Principal Component Analysis (PCA) is an unsupervised learning algorithm technique used to examine the interrelations among a set of variables. It is also known as a general factor analysis where regression determines a line of best fit.
*The main goal of Principal Component Analysis (PCA) is to reduce the dimensionality of a dataset while preserving the most important patterns or relationships between the variables, without any prior knowledge of the target variables.
Principal Component Analysis (PCA) is used to reduce the dimensionality of a data set by finding a new set of variables, smaller than the original set of variables, retaining most of the sample's information, and useful for the regression and classification of data.
Principal Component Analysis (PCA) is a technique for dimensionality reduction that identifies a set of orthogonal axes, called principal components, that capture the maximum variance in the data. The principal components are linear combinations of the original variables in the dataset and are ordered in decreasing order of importance. The total variance captured by all the principal components is equal to the total variance in the original dataset.
*The first principal component captures the most variation in the data, while the second principal component captures the maximum variance that is orthogonal to the first principal component, and so on.
*Principal Component Analysis can be used for a variety of purposes, including data visualization, feature selection, and data compression. In data visualization, PCA can be used to plot high-dimensional data in two or three dimensions, making it easier to interpret. In feature selection, PCA can be used to identify the most important variables in a dataset. In data compression, PCA can be used to reduce the size of a dataset without losing important information.
*In Principal Component Analysis, it is assumed that the information is carried in the variance of the features, that is, the higher the variation in a feature, the more information that feature carries.
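To make the idea concrete, here is a small sketch of PCA with scikit-learn: the features are first standardized (as discussed above) and then projected onto two principal components. The built-in Iris dataset is used only as a convenient illustration.
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

x, y = load_iris(return_X_y=True)          # 150 samples, 4 features

# Standardize so that every feature has zero mean and unit variance
x_std = StandardScaler().fit_transform(x)

# Project the 4-dimensional data onto the 2 orthogonal directions of maximum variance
pca = PCA(n_components=2)
x_pca = pca.fit_transform(x_std)

print(x_pca.shape)                         # (150, 2)
print(pca.explained_variance_ratio_)       # fraction of total variance captured by each component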
Unit 3
Supervised Learning
Supervised Machine Learning
Supervised learning is the type of machine learning in which machines are trained using well "labelled" training data, and on the basis of that data, machines predict the output. Labelled data means some input data is already tagged with the correct output.
In supervised learning, the training data provided to the machines works as the supervisor that teaches the machines to predict the output correctly. It applies the same concept as a student learning under the supervision of a teacher.
Supervised learning is a process of providing input data as well as correct output data to the machine learning model. The aim of a supervised learning algorithm is to find a mapping function to map the input variable (x) with the output variable (y).
In the real world, supervised learning can be used for risk assessment, image classification, fraud detection, spam filtering, etc.
Suppose we have a dataset of different types of shapes which includes square, rectangle, triangle, and polygon. Now the first step is that we need to train the model for each shape.
o If the given shape has four sides, and all the sides are equal, then it will be labelled as a Square.
o If the given shape has three sides, then it will be labelled as a triangle.
o If the given shape has six equal sides, then it will be labelled as a hexagon.
Now, after training, we test our model using the test set, and the task of the model is to identify the shape. The machine is already trained on all types of shapes, and when it finds a new shape, it classifies the shape on the basis of the number of sides, and predicts the output.
1. Regression
Regression algorithms are used if there is a relationship between the input variable and the output variable. They are used for the prediction of continuous variables, such as weather forecasting, market trends, etc. Below are some popular regression algorithms which come under supervised learning:
o Linear Regression
o Regression Trees
o Non-Linear Regression
o Bayesian Linear Regression
o Polynomial Regression
2. Classification
Classification algorithms are used when the output variable is categorical, which means there are two classes such as Yes-No, Male-Female, True-False, etc., as in spam filtering. Popular classification algorithms include:
o Random Forest
o Decision Trees
o Logistic Regression
o Support Vector Machines
K-Nearest Neighbor (KNN) Algorithm for Machine Learning
o K-Nearest Neighbour is one of the simplest machine learning algorithms based on the supervised learning technique.
o The K-NN algorithm assumes the similarity between the new case/data and the available cases and puts the new case into the category that is most similar to the available categories.
o The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can be easily classified into a well-suited category by using the K-NN algorithm.
o The K-NN algorithm can be used for Regression as well as for Classification, but mostly it is used for Classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any assumption about the underlying data.
o It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset, and at the time of classification, it performs an action on the dataset.
o The KNN algorithm at the training phase just stores the dataset, and when it gets new data, it classifies that data into a category that is most similar to the new data.
o Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, but we want to know whether it is a cat or a dog. For this identification, we can use the KNN algorithm, as it works on a similarity measure. Our KNN model will find the features of the new data set that are similar to the cat and dog images, and based on the most similar features it will put it in either the cat or dog category.
Suppose we have a new data point and we need to put it in the required category. Consider the below image:
o Firstly, we will choose the number of neighbors, so we will choose k=5.
o Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is the distance between two points, which we have already studied in geometry. It can be calculated as:
d(A, B) = sqrt((x2 - x1)^2 + (y2 - y1)^2)
o By calculating the Euclidean distance we got the nearest neighbors, with three nearest neighbors in category A and two nearest neighbors in category B. Consider the below image:
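A minimal sketch of the K-NN classification described above, using scikit-learn with k=5 and the default Euclidean distance; the built-in Iris dataset is only a stand-in for the cat/dog example.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

x, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

# k = 5 neighbours, Euclidean (Minkowski with p=2) distance
knn = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
knn.fit(x_train, y_train)

# Each new point is assigned to the class most common among its 5 nearest neighbours
print(knn.score(x_test, y_test))   # classification accuracy on the test set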
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to
determine the probability of a hypothesis with prior knowledge. It depends on the
conditional probability.
o The formula for Bayes' theorem is given as:
P(A|B) = P(B|A) * P(A) / P(B)
Where,
P(A|B) is Posterior probability: Probability of hypothesis A given the observed event B.
P(B|A) is Likelihood probability: Probability of the evidence given that the probability of a hypothesis is true.
P(A) is Prior probability: Probability of the hypothesis before observing the evidence.
P(B) is Marginal probability: Probability of the evidence.
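As a small worked example of the formula, with made-up numbers purely for illustration: suppose 20% of emails are spam, the word "offer" appears in 60% of spam emails and in 5% of non-spam emails; Bayes' theorem then gives the probability that an email containing "offer" is spam.
p_spam = 0.20                 # P(A): prior probability of spam (assumed)
p_offer_given_spam = 0.60     # P(B|A): likelihood (assumed)
p_offer_given_ham = 0.05      # P(B|not A) (assumed)

# P(B): marginal probability of seeing the word "offer"
p_offer = p_offer_given_spam * p_spam + p_offer_given_ham * (1 - p_spam)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_spam_given_offer = p_offer_given_spam * p_spam / p_offer
print(round(p_spam_given_offer, 3))   # 0.75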
o The Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes' theorem and used for solving classification problems.
o It is mainly used in text classification that includes a high-dimensional training dataset.
o The Naïve Bayes Classifier is one of the simple and most effective classification algorithms, which helps in building fast machine learning models that can make quick predictions.
o It is a probabilistic classifier, which means it predicts on the basis of the probability of an object.
o Some popular examples of the Naïve Bayes algorithm are spam filtration, sentiment analysis, and classifying articles.
Why is it called Naïve Bayes?
The Naïve Bayes algorithm is comprised of two words, Naïve and Bayes, which can be described as:
o Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features. For example, if a fruit is identified on the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple. Hence each feature individually contributes to identifying that it is an apple, without depending on the others.
o Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.
Advantages of Naïve Bayes Classifier:
o Naïve Bayes is one of the fast and easy ML algorithms to predict the class of a dataset.
o It can be used for Binary as well as Multi-class Classifications.
o It performs well in Multi-class predictions as compared to the other algorithms.
o It is the most popular choice for text classification problems.
Disadvantages of Naïve Bayes Classifier:
o Naive Bayes assumes that all features are independent or unrelated, so it cannot learn the relationship between features.
Applications of Naïve Bayes Classifier:
o It is used for Credit Scoring.
o It is used in medical data classification.
o It can be used in real-time predictions because the Naïve Bayes Classifier is an eager learner.
o It is used in text classification such as Spam filtering and Sentiment analysis.
Types of Naïve Bayes Model:
There are three types of Naive Bayes model, which are given below:
o Gaussian: The Gaussian model assumes that features follow a normal distribution. This means if predictors take continuous values instead of discrete ones, then the model assumes that these values are sampled from the Gaussian distribution.
o Multinomial: The Multinomial Naïve Bayes classifier is used when the data is multinomially distributed. It is primarily used for document classification problems; that is, deciding which category a particular document belongs to, such as Sports, Politics, Education, etc. The classifier uses the frequency of words for the predictors.
o Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier, but the predictor variables are independent Boolean variables, such as whether a particular word is present or not in a document. This model is also famous for document classification tasks.
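A hedged sketch of a Multinomial Naïve Bayes text classifier of the kind mentioned above (spam filtering): word counts are used as features, and the tiny hand-written message list is purely illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training messages (illustrative only) with labels: 1 = spam, 0 = not spam
messages = [
    "win a free prize now",
    "limited offer, claim your reward",
    "meeting rescheduled to monday",
    "please review the attached report",
]
labels = [1, 1, 0, 0]

# Multinomial NB works on word-frequency features
vectorizer = CountVectorizer()
x = vectorizer.fit_transform(messages)

model = MultinomialNB()
model.fit(x, labels)

new_message = ["claim your free reward now"]
print(model.predict(vectorizer.transform(new_message)))   # likely [1] -> spam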
There are specialized terms associated with decision trees that denote various components and facets of the tree structure and decision-making procedure:
Root Node: A decision tree's root node, which represents the original choice or feature from which the tree branches, is the highest node.
Internal Nodes (Decision Nodes): Nodes in the tree whose choices are determined by the values of particular attributes. There are branches on these nodes that go to other nodes.
Leaf Nodes (Terminal Nodes): The branches' termini, where choices or forecasts are decided upon. There are no more branches on leaf nodes.
Branches (Edges): Links between nodes that show how decisions are made in response to particular circumstances.
Splitting: The process of dividing a node into two or more sub-nodes based on a decision criterion. It involves selecting a feature and a threshold to create subsets of data.
Parent Node: A node that is split into child nodes. The original node from which a split originates.
Decision Criterion: The rule or condition used to determine how the data should be split at a decision node. It involves comparing feature values against a threshold.
Pruning: The process of removing branches or nodes from a decision tree to improve its generalisation and prevent overfitting.
Linear Regression in Machine Learning
Linear regression is one of the easiest and most popular machine learning algorithms. It is a statistical method that is used for predictive analysis. Linear regression makes predictions for continuous/real or numeric variables such as sales, salary, age, product price, etc.
The linear regression algorithm shows a linear relationship between a dependent (y) variable and one or more independent (x) variables, hence it is called linear regression. Since linear regression shows the linear relationship, it finds how the value of the dependent variable changes according to the value of the independent variable.
The linear regression model provides a sloped straight line representing the relationship between the variables. Consider the below image:
A linear line showing the relationship between the dependent and independent variables is called a regression line. A regression line can show two types of relationship: a positive linear relationship and a negative linear relationship.
Mathematically, we can represent a linear regression as:
y = a0 + a1x + ε
The values for the x and y variables are training datasets for the linear regression model representation.
Linear regression can be further divided into two types of algorithm: Simple Linear Regression and Multiple Linear Regression.
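A minimal sketch of simple linear regression (y = a0 + a1x + ε) using scikit-learn; the small salary-versus-experience data below is made up for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data: years of experience (x) vs. salary (y)
x = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([30000, 35000, 41000, 46000, 52000, 57000])

model = LinearRegression()
model.fit(x, y)

print(model.intercept_)        # a0: where the regression line crosses the y-axis
print(model.coef_[0])          # a1: slope of the regression line
print(model.predict([[7]]))    # predicted salary for 7 years of experience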
o In logistic regression, the output y lies between 0 and 1, but we need a range between -[infinity] and +[infinity], so we take the logarithm of the odds and the equation becomes: log(y / (1 - y)) = b0 + b1x1 + b2x2 + ... + bnxn
Support Vector Machine (SVM) is one of the most popular supervised learning algorithms, used for classification as well as regression problems. The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space into classes. The extreme points or vectors that help in creating this boundary are called support vectors, and hence the algorithm is termed a Support Vector Machine. Consider the below diagram in which there are two different categories that are classified using a decision boundary or hyperplane:
Example: SVM can be understood with the example that we have used in the KNN classifier. Suppose we see a strange cat that also has some features of dogs; if we want a model that can accurately identify whether it is a cat or a dog, such a model can be created by using the SVM algorithm. We will first train our model with lots of images of cats and dogs so that it can learn about the different features of cats and dogs, and then we test it with this strange creature. As the support vector machine creates a decision boundary between these two classes of data (cat and dog) and chooses extreme cases (support vectors), it will see the extreme cases of cat and dog. On the basis of the support vectors, it will classify it as a cat. Consider the below diagram:
The SVM algorithm can be used for face detection, image classification, text categorization, etc.
Types of SVM
o Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset can be classified into two classes by using a single straight line, then such data is termed linearly separable data, and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separated data, which means that if a dataset cannot be classified by using a straight line, then such data is termed non-linear data, and the classifier used is called a Non-linear SVM classifier.
Linear SVM:
The working of the SVM algorithm can be understood by using an example. Suppose we have a dataset that has two tags (green and blue), and the dataset has two features x1 and x2. We want a classifier that can classify the pair (x1, x2) of coordinates as either green or blue. Consider the below image:
Since it is a 2-d space, by just using a straight line we can easily separate these two classes. But there can be multiple lines that can separate these classes. Consider the below image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or region is called a hyperplane. The SVM algorithm finds the closest points of the lines from both the classes. These points are called support vectors. The distance between the vectors and the hyperplane is called the margin. And the goal of SVM is to maximize this margin. The hyperplane with maximum margin is called the optimal hyperplane.
Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for non-linear data, we cannot draw a single straight line. Consider the below image:
So to separate these data points, we need to add one more dimension. For linear data, we have used two dimensions x and y, so for non-linear data, we will add a third dimension z. It can be calculated as:
z = x^2 + y^2
By adding the third dimension, the sample space will become as in the below image:
Since we are in 3-d space, it looks like a plane parallel to the x-axis. If we convert it to 2-d space with z = 1, then it will become as:
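A small hedged sketch of the two SVM variants discussed above: a linear kernel for linearly separable data and an RBF kernel (which implicitly adds extra dimensions, much like the z = x^2 + y^2 trick) for non-linear data. The generated blob/circle data is illustrative only.
from sklearn.datasets import make_blobs, make_circles
from sklearn.svm import SVC

# Linearly separable data -> Linear SVM
x_lin, y_lin = make_blobs(n_samples=100, centers=2, random_state=0)
linear_svm = SVC(kernel='linear')
linear_svm.fit(x_lin, y_lin)
print(linear_svm.support_vectors_.shape)   # the extreme points defining the margin

# Non-linearly separable data (concentric circles) -> Non-linear SVM with RBF kernel
x_circ, y_circ = make_circles(n_samples=100, factor=0.3, noise=0.05, random_state=0)
rbf_svm = SVC(kernel='rbf')
rbf_svm.fit(x_circ, y_circ)
print(rbf_svm.score(x_circ, y_circ))       # training accuracy on the circular data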
UNIT IV
What is Unsupervised Learning?
As the name suggests, unsupervised learning is a machine learning technique in which models are not supervised using a training dataset. Instead, the models themselves find the hidden patterns and insights from the given data. It can be compared to the learning which takes place in the human brain while learning new things. It can be defined as:
Unsupervised learning is a type of machine learning in which models are trained using an unlabeled dataset and are allowed to act on that data without any supervision.
Unsupervised learning cannot be directly applied to a regression or classification problem because, unlike supervised learning, we have the input data but no corresponding output data. The goal of unsupervised learning is to find the underlying structure of the dataset, group that data according to similarities, and represent that dataset in a compressed format.
Example: Suppose the unsupervised learning algorithm is given an input dataset containing images of different types of cats and dogs. The algorithm is never trained upon the given dataset, which means it does not have any idea about the features of the dataset. The task of the unsupervised learning algorithm is to identify the image features on its own. The unsupervised learning algorithm will perform this task by clustering the image dataset into groups according to the similarities between images.
Types of Unsupervised Learning Algorithm:
The unsupervised learning algorithm can be further categorized into two types of problems:
o Clustering: Clustering is a method of grouping the objects into clusters such that objects with most similarities remain in a group and have less or no similarities with the objects of another group. Cluster analysis finds the commonalities between the data objects and categorizes them as per the presence and absence of those commonalities.
o Association: An association rule is an unsupervised learning method which is used for finding the relationships between variables in a large database. It determines the set of items that occur together in the dataset. Association rules make marketing strategy more effective; for example, people who buy an X item (suppose bread) also tend to purchase a Y item (butter/jam). A typical example of an association rule is Market Basket Analysis.
What is K-means Clustering?
Unsupervised machine learning is the process of teaching a computer to use unlabeled, unclassified data and enabling the algorithm to operate on that data without supervision. Without any previous data training, the machine's job in this case is to organize unsorted data according to parallels, patterns, and variations.
K-means clustering assigns data points to one of K clusters depending on their distance from the centers of the clusters. It starts by randomly assigning the cluster centroids in the space. Then each data point is assigned to one of the clusters based on its distance from the centroid of the cluster. After assigning each point to one of the clusters, new cluster centroids are computed. This process runs iteratively until it finds good clusters. In the analysis, we assume that the number of clusters is given in advance and we have to put points into one of the groups.
In some cases, K is not clearly defined, and we have to think about the optimal number of K. K-means clustering performs best when the data is well separated. When data points overlap, this clustering is not suitable. K-means is faster compared to other clustering techniques. It provides strong coupling between the data points. K-means clusters do not provide clear information regarding the quality of the clusters. Different initial assignments of cluster centroids may lead to different clusters. Also, the K-means algorithm is sensitive to noise and may get stuck in local minima.
What is the objective of k-means clustering?
The goal of clustering is to divide the population or set of data points into a number of groups so that the data points within each group are more comparable to one another and different from the data points within the other groups. It is essentially a grouping of things based on how similar and different they are to one another.
How k-means clustering works?
We are given a data set of items, with certain features, and values for these features (like a vector). The task is to categorize those items into groups. To achieve this, we will use the K-means algorithm, an unsupervised learning algorithm. 'K' in the name of the algorithm represents the number of groups/clusters we want to classify our items into.
(It will help if you think of items as points in an n-dimensional space.) The algorithm will categorize the items into k groups or clusters of similarity. To calculate that similarity, we will use the Euclidean distance as a measurement.
The algorithm works as follows:
1. First, we randomly initialize k points, called means or cluster centroids.
2. We categorize each item to its closest mean, and we update the mean's coordinates, which are the averages of the items categorized in that cluster so far.
3. We repeat the process for a given number of iterations and at the end, we have our clusters.
The "points" mentioned above are called means because they are the mean values of the items categorized in them. To initialize these means, we have a lot of options. An intuitive method is to initialize the means at random items in the data set. Another method is to initialize the means at random values between the boundaries of the data set (if for a feature x, the items have values in [0,3], we will initialize the means with values for x in [0,3]).
The above algorithm in pseudocode is as follows:
Initialize k means with random values
--> For a given number of iterations:
    --> Assign each item to its closest mean, using the Euclidean distance
    --> Update each mean to the average of the items assigned to it
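The same algorithm in a runnable form, using scikit-learn's KMeans rather than the hand-written pseudocode above; the generated blob data and K=3 are assumptions made purely for illustration.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative data: 300 points around 3 well-separated centers
x, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# n_init restarts the algorithm with different random centroids,
# since different initial assignments may lead to different clusters
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(x)

print(kmeans.cluster_centers_)   # final centroids (the "means")
print(labels[:10])               # cluster assigned to the first 10 points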
What is Boosting
Boosting is an ensemble modeling technique that attempts to build a strong classifier from a number of weak classifiers. It is done by building a model using weak models in series. Firstly, a model is built from the training data. Then a second model is built which tries to correct the errors present in the first model. This procedure is continued, and models are added until either the complete training data set is predicted correctly or the maximum number of models is added.
Boosting vs. Bagging:
* The main aim of boosting is to decrease bias, not variance; the main aim of bagging is to decrease variance, not bias.
* In boosting, new models are influenced by the accuracy of previous models; in bagging, all the models are independent of each other.
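For illustration, here is a minimal boosting sketch with scikit-learn's AdaBoostClassifier, which builds shallow decision trees (weak learners) in series so that each new tree focuses on the examples the previous ones got wrong; the dataset and parameters are assumptions, not part of the notes above.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

x, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

# 100 weak learners added one after another, each correcting its predecessors
booster = AdaBoostClassifier(n_estimators=100, random_state=0)
booster.fit(x_train, y_train)

print(booster.score(x_test, y_test))   # accuracy of the boosted ensemble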
Note: To better understand the Random Forest Algorithm, you should have knowledge of the
Decision Tree Algorithm.
Assumptions for Random Forest
Since the random forest combines multiple trees to predict the class of the dataset, it is
possible that some decision trees may predict the correct output, while others may not. But
together, all the trees predict the correct output. Therefore, below are two assumptions for a
better Random forest classifier:
o There should be some actual values in the feature variable of the dataset so that the
classifier can predict accurate results rather than a guessed result.
o The predictions from each tree must have very low correlations.
Why use Random Forest?
Below are some points that explain why we should use the Random Forest algorithm:
o It takes less training time as compared to other algorithms.
o It predicts output with high accuracy, and even for a large dataset it runs efficiently.
Applications of Random Forest
There are mainly four sectors where Random Forest is mostly used:
1. Banking: The banking sector mostly uses this algorithm for the identification of loan risk.
2. Medicine: With the help of this algorithm, disease trends and risks of the disease can be identified.
3. Land Use: We can identify areas of similar land use by this algorithm.
4. Marketing: Marketing trends can be identified using this algorithm.
Advantages of Random Forest
o Random Forest is capable of performing both Classification and Regression tasks.
o It is capable of handling large datasets with high dimensionality.
o It enhances the accuracy of the model and prevents the overfitting issue.
Disadvantages of Random Forest
o Although random forest can be used for both classification and regression tasks, it is not very suitable for Regression tasks.
Python Implementation of Random Forest Algorithm
Now we will implement the Random Forest algorithm using Python. For this, we will use the same dataset "user_data.csv", which we have used in previous classification models. By using the same dataset, we can compare the Random Forest classifier with other classification models such as Decision Tree Classifier, KNN, SVM, Logistic Regression, etc.
Implementation steps are given below:
o Data Pre-processing step
o Fitting the Random Forest algorithm to the Training set
o Predicting the test result
o Test accuracy of the result (creation of a confusion matrix)
o Visualizing the test set result
Accuracy
This is probably the simplest performance metric. It is defined as:
Accuracy = Number of correctly classified points / Total number of points
The accuracy value lies between 0 and 1. If the value is closer to 0, it is considered bad performance, whereas if the value is closer to 1, then it is considered good performance. It is one of the simplest and easiest to understand metrics.
Let's understand this metric using an example:
Assume we have already trained our model using training data. Now, we want to use the test data and check how accurate the predictions are. Let's say we have a classification problem with 100 data points in our test data set. The objective is to classify whether a point is positive or negative. Assume that out of the 100 points, we have 60 positive points and 40 negative points (note that these are the original/actual class labels). Now, when we feed this test data to our model, suppose we get the below output:
Based on the above example, our model has misclassified 7 points as negative and 5 points as positive. So the overall number of misclassified points = 7 + 5 = 12.
So the accuracy of the model can be calculated as:
Accuracy = (100 - 12) / 100 = 0.88
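Since the original "user_data.csv" file is not included in these notes, the sketch below follows the same implementation steps with scikit-learn's built-in breast-cancer dataset standing in for it; the number of trees and other parameters are assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

# 1. Data pre-processing step (a built-in dataset stands in for user_data.csv)
x, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

# 2. Fitting the Random Forest algorithm to the training set (10 trees)
classifier = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)
classifier.fit(x_train, y_train)

# 3. Predicting the test result
y_pred = classifier.predict(x_test)

# 4. Test accuracy of the result (creation of the confusion matrix)
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))   # correctly classified points / total points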
What is a Confusion Matrix?
A confusion matrix is a matrix that summarizes the performance of a machine learning model on a set of test data. It is a means of displaying the number of accurate and inaccurate instances based on the model's predictions. It is often used to measure the performance of classification models, which aim to predict a categorical label for each input instance.
The matrix displays the number of instances produced by the model on the test data.
* True positives (TP): occur when the model accurately predicts a positive data point.
* True negatives (TN): occur when the model accurately predicts a negative data point.
* False positives (FP): occur when the model predicts a positive data point incorrectly.
* False negatives (FN): occur when the model mispredicts a negative data point.
Why do we need a Confusion Matrix?
A confusion matrix gives a more detailed picture of performance than accuracy alone: it shows not just how many predictions were wrong, but what kind of errors (false positives or false negatives) the model is making.
3. Recall
Recall measures the effectiveness of a classification model in identifying all relevant instances from a dataset. It is the ratio of the number of true positive (TP) instances to the sum of true positive and false negative (FN) instances.
Recall = TP / (TP + FN)
Note: We use precision when we want to minimize false positives, crucial in scenarios like spam email detection where misclassifying a non-spam message as spam is costly. And we use recall when minimizing false negatives is essential, as in medical diagnoses, where identifying all actual positive cases is critical, even if it results in some false positives.
4. F1-Score
The F1-score is used to evaluate the overall performance of a classification model. It is the harmonic mean of precision and recall:
F1-Score = 2 * Precision * Recall / (Precision + Recall)
An ROC curve plots TPR vs. FPR at different classification thresholds. Lowering the classification threshold classifies more items as positive, thus increasing both false positives and true positives. The following figure shows a typical ROC curve.
AUC (Area under the ROC Curve)
AUC provides an aggregate measure of performance across all possible classification thresholds. One way of interpreting AUC is as the probability that the model ranks a random positive example more highly than a random negative example. For example, given the following examples, which are arranged from left to right in ascending order of logistic regression predictions.
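A short sketch tying the metrics above together on an assumed set of true and predicted labels (purely illustrative): scikit-learn computes the confusion matrix, precision, recall, F1-score, and ROC AUC directly.
from sklearn.metrics import (confusion_matrix, precision_score, recall_score,
                             f1_score, roc_auc_score)

# Assumed ground-truth labels and model outputs, purely for illustration
y_true   = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred   = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]                        # hard class predictions
y_scores = [0.9, 0.2, 0.8, 0.4, 0.1, 0.7, 0.6, 0.3, 0.95, 0.05]  # predicted probabilities

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, tn, fp, fn)                      # 4 4 1 1

print(recall_score(y_true, y_pred))        # TP / (TP + FN) = 4/5 = 0.8
print(precision_score(y_true, y_pred))     # TP / (TP + FP) = 4/5 = 0.8
print(f1_score(y_true, y_pred))            # harmonic mean of precision and recall = 0.8
print(roc_auc_score(y_true, y_scores))     # area under the ROC curve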