ML Unit-I Notes
Machine Learning
Hottest trend in today’s market
Around 40% of application development is based on an ML core.
Biggest confusion: AI vs ML vs Deep Learning
Machine Learning
The term Machine Learning was coined by Arthur Samuel, an American pioneer, in 1959.
ML is the use of data and algorithms to imitate the way humans learn, gradually improving its accuracy.
Automatically learn and improve from past experience without being explicitly programmed.
The choice of algorithm depends on what type of data we have and what kind of task we are trying to automate.
Traditional Programming: we feed in DATA (input) + PROGRAM (logic), run it on the machine, and get the output.
Machine Learning: we feed in DATA (input) + OUTPUT, run it on the machine during training, and the machine creates its own program (logic), which can be evaluated during testing.
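To make the contrast concrete, here is a minimal sketch (assuming Python with scikit-learn, and a temperature example invented for illustration) of a hand-written rule versus a rule learned from input-output pairs:

    # Traditional programming: DATA + PROGRAM (hand-written logic) -> OUTPUT
    def classify_by_rule(temperature_c):
        return "hot" if temperature_c > 30 else "not hot"   # the programmer encodes the logic

    # Machine learning: DATA + OUTPUT -> the machine builds its own logic during training
    from sklearn.tree import DecisionTreeClassifier
    X = [[10], [20], [25], [31], [35], [40]]                      # input data
    y = ["not hot", "not hot", "not hot", "hot", "hot", "hot"]    # known outputs
    model = DecisionTreeClassifier().fit(X, y)                    # training creates the "program"
    print(classify_by_rule(33), model.predict([[33]]))            # both evaluated on new data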
How does Machine Learning Work?
A Machine Learning system learns from historical data, builds the prediction models, and whenever it
receives new data, predicts the output for it.
The accuracy of the predicted output depends on the amount of data: a large amount of data helps to build a better model, which predicts the output more accurately.
So instead of writing code for it, we just need to feed the data to generic algorithms, and with the help of these algorithms, the machine builds the logic as per the data and predicts the output.
Machine learning has changed our way of thinking about problems. The working of a machine learning algorithm can be summarized as a block diagram: past data → train the model → learned logic/model → new data → predicted output.
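As one illustration of this train-then-predict flow, and of accuracy improving with more data, here is a small sketch on synthetic data invented for this example (Python with NumPy and scikit-learn assumed):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(2000, 5))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)              # hidden pattern the model must learn
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

    for n in (20, 200, 1000):                            # grow the amount of "historical" data
        model = LogisticRegression().fit(X_train[:n], y_train[:n])
        print(n, "samples -> test accuracy:", model.score(X_test, y_test))

Typically the test accuracy rises as the model sees more training samples, which is the point made above.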
Machine Learning Life Cycle
The machine learning life cycle (MLLC) is a cyclic process.
To train the machine, we need data; that is why the MLLC starts with gathering data.
Step-1: Gathering Data:
first step of the machine learning life cycle.
The quantity and quality of the collected data will determine the efficiency of the output.
The more will be the data, the more accurate will be the prediction.
This step includes the below tasks:
Goal-Identify various Data Sources like Files, DB, Internet or Mobile devices
Collect Data
We get a coherent set of data, also called as a dataset. It will be used in further steps.
Step-2: Data Preparation:
Data exploration:
Building models
Applications / Uses of ML
1. Image Recognition:
Image recognition is used to identify objects, persons, places, and digital images. A popular use case is the automatic friend-tagging suggestion on Facebook, which is based on the Facebook project named "Deep Face", responsible for face recognition and person identification in a picture.
You might have used IMDB ratings, Google Photos (which recognizes faces), or Google Lens (where an ML image-text recognition model can extract text from the images you feed in).
2. Speech Recognition:
Speech recognition is a process of converting voice instructions into text.
While using Google, we get an option of "Search by voice," it comes under speech recognition, and it's a
popular application of machine learning.
Google assistant, Apple Siri, Cortana, and Amazon Alexa are using speech recognition technology to
follow the voice instructions.
3.Traffic Detection:
If we want to visit a new place, we take the help of Google Maps, which shows us the correct path with the shortest route and predicts the traffic conditions – whether traffic is clear, slow-moving, or heavily congested.
It does this using the real-time location of vehicles from the Google Maps app and sensors.
Everyone who uses Google Maps is helping the app to get better: it takes information from the user and sends it back to its database to improve performance.
4. Product Recommendations:
Whenever we search for some product on Amazon, we start getting advertisements for the same product while surfing the internet in the same browser, and this is because of machine learning.
5. Self-driving Cars:
Popular car manufacturing companies like Tesla, Google, and Uber are working on self-driving cars.
An unsupervised learning method is used to train the car models to detect people and objects while driving.
6.Email Spam and Malware Filtering:
Whenever we receive a new email, it is filtered automatically as important (in the Inbox with an important symbol), normal, or spam (Spam box).
Gmail categorizes e-mail as Social, Promotions, Updates, or Forums using text classification, which is a part of ML.
Some ML algorithms such as Multi-Layer Perceptron, Decision tree, and Naïve Bayes
classifier are used for email spam filtering and malware detection.
Below are some spam filters used by Gmail:
Content Filter
Header filter
General blacklists filter
Rules-based filters
Permission filters
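Since Naïve Bayes is named above as a spam-filtering algorithm, here is a minimal content-filter sketch (Python with scikit-learn assumed; the tiny message set is invented for illustration):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    messages = ["win a free prize now", "lowest price guaranteed, click here",
                "meeting rescheduled to monday", "please review the attached report"]
    labels = ["spam", "spam", "normal", "normal"]

    spam_filter = make_pipeline(CountVectorizer(), MultinomialNB())   # bag-of-words + Naive Bayes
    spam_filter.fit(messages, labels)                                 # train the content filter
    print(spam_filter.predict(["free prize, click now"]))             # -> ['spam']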
7. Virtual Personal Assistant:
We have various virtual personal assistants such as Google assistant, Amazon Alexa, Cortana, Apple Siri.
As the name suggests, they help us in finding the information using our voice instruction.
These assistants can help us in various ways just by our voice instructions such as Play music, call
someone, Open an email, Scheduling an appointment, etc.
8.Online Fraud Detection:
Machine learning makes online transactions safe and secure by detecting fraudulent transactions.
Whenever we perform an online transaction, fraud can take place in various ways, such as fake accounts, fake IDs, and stealing money in the middle of a transaction.
To detect this, a feed-forward neural network helps us by checking whether a transaction is genuine or fraudulent.
For each genuine transaction, the output is converted into hash values, and these values become the input for the next round.
For each genuine transaction there is a specific pattern, which changes for a fraudulent transaction; detecting this makes our online transactions more secure.
9.Stock Market Trading:
In the stock market, there is always a risk of ups and downs in share prices, so machine learning's Long Short-Term Memory (LSTM) neural network is used for the prediction of stock market trends.
10.Medical Diagnosis:
With machine learning, medical technology is growing very fast and is able to build 3D models that can predict the exact position of lesions in the brain.
Researchers and scientists have prepared models to train machines to detect cancer just by looking at slide (cell) images. For humans this task would take a lot of time, but machines can now predict the chances of having or not having cancer with reasonable accuracy, and doctors only have to confirm the result.
How is this possible? All that is required is:
A high-computation machine,
A large amount of good-quality image data,
An ML model with good algorithms to achieve state-of-the-art results.
11.Automatic Language Translation:
Nowadays, if we visit a new place and are not aware of the language, it is not a problem at all: machine learning helps us by converting the text into a language we know.
Google's GNMT (Google Neural Machine Translation) provides this feature; it is a neural machine translation system that translates text into our familiar language, and this is called automatic translation.
The technology behind automatic translation is a sequence-to-sequence learning algorithm, which is used with image recognition to translate text from one language to another.
12. Targeted Advertisement (Online Ads Systems):
Traditionally, the advertisement was only done using newspapers, magazines and radio
But now technology has made us smart enough to do targeted advertisement (online ad systems), which is a far more efficient way to reach the most receptive audience.
13. Training the student during Exams:
While preparing for the exams students don’t actually cram (learn quickly) the subject but try to learn it
with complete understanding.
Before the examination, they feed their machine (brain) with a good amount of high-quality data
(questions and answers from different books or teachers notes or online video lectures).
Actually, they are training their brain with input as well as output. Gradually, their performance keeps on increasing, and they gain more confidence with the adopted approach.
That is exactly how models are built: train the machine with data (both inputs and outputs are given to the model), and when the time comes, test it on data (with inputs only) and obtain the model's score by comparing its answers with the actual outputs, which were not fed during training.
Types of Machine Learning
Supervised Learning
Sample labelled data is provided to the machine learning system in order to train it, and on that basis, it predicts the output.
The labelled data means some input data is already tagged with the correct output.
The goal of supervised learning is to map input data with the output data.
It first creates a model using labelled data, and then tests the model by providing sample data to check whether it predicts the exact output or not.
The training data acts as a supervisor that teaches the machine to predict the output correctly.
How does Supervised Learning Work?
Suppose we have a dataset of different types of shapes which includes square, rectangle, triangle, and
Polygon. Now the first step is that we need to train the model for each shape.
If the given shape has four sides, and all the sides are equal, then it will be labelled as a Square.
If the given shape has three sides, then it will be labelled as a triangle.
If the given shape has six equal sides then it will be labelled as hexagon.
Now, after training, we test our model using the test set, and the task of the model is to identify the
shape.
The machine is already trained on all types of shapes, and when it finds a new shape, it classifies the shape on the basis of the number of sides and predicts the output.
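The shape example can be sketched in a few lines (Python with scikit-learn assumed; the feature encoding [number of sides, all sides equal?] is an assumption made for illustration):

    from sklearn.tree import DecisionTreeClassifier

    X = [[4, 1], [4, 1], [3, 0], [3, 1], [6, 1], [6, 1]]   # [num_sides, all_sides_equal]
    y = ["Square", "Square", "Triangle", "Triangle", "Hexagon", "Hexagon"]

    model = DecisionTreeClassifier().fit(X, y)             # training on labelled shapes
    print(model.predict([[4, 1], [3, 0]]))                 # new shapes -> predicted shape names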
Steps Involved in Supervised Learning:
First Determine the type of training dataset
Split the training dataset into training dataset, test dataset, and validation dataset.
Determine the input features of the training dataset, which should have enough knowledge so that the
model can accurately predict the output.
Determine the suitable algorithm for the model, such as support vector machine, decision tree, etc.
Execute the algorithm on the training dataset. Sometimes we also need a validation dataset as a control parameter; it is a subset of the training dataset.
Evaluate the accuracy of the model by providing the test set. If the model predicts the correct output, it means our model is accurate.
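A compact sketch of these steps (Python with scikit-learn assumed; the Iris dataset and the SVM choice are assumptions made for illustration):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC
    from sklearn.metrics import accuracy_score

    X, y = load_iris(return_X_y=True)                      # dataset: input features + labels
    X_train, X_test, y_train, y_test = train_test_split(   # split into training and test sets
        X, y, test_size=0.2, random_state=42)
    model = SVC().fit(X_train, y_train)                    # choose an algorithm and run it on the training set
    print(accuracy_score(y_test, model.predict(X_test)))   # evaluate accuracy on the test set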
Types of Supervised Learning
A] Classification algorithms:
The output variable is categorical, which means there are two classes such as Yes-No, Male-Female, True-False, etc. For example:
Random Forest
Decision Trees
Logistic Regression
B] Regression algorithms:
Used for the prediction of continuous variables, such as weather forecasting, market trends, etc. For example:
Linear Regression
Regression Trees
Non-Linear Regression
Polynomial Regression
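A short sketch contrasting the two families on synthetic data invented for this example (Python with scikit-learn assumed):

    import numpy as np
    from sklearn.linear_model import LogisticRegression, LinearRegression

    rng = np.random.default_rng(1)
    X = rng.uniform(0, 10, size=(100, 1))
    y_class = (X[:, 0] > 5).astype(int)             # categorical output: class 0 or 1
    y_reg = 3.0 * X[:, 0] + rng.normal(size=100)    # continuous output

    clf = LogisticRegression().fit(X, y_class)      # classification algorithm
    reg = LinearRegression().fit(X, y_reg)          # regression algorithm
    print(clf.predict([[7.0]]), reg.predict([[7.0]]))   # predicted class vs predicted value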
Advantages of Supervised Learning
With the help of supervised learning, the model can predict the output on the basis of
prior experiences.
In supervised learning, we can have an exact idea about the classes of objects.
Supervised learning model helps us to solve various real-world problems such as fraud
detection, spam filtering, etc.
Disadvantages of Supervised Learning
Supervised learning models are not suitable for handling complex tasks.
Supervised learning cannot predict the correct output if the test data is different from the
training dataset.
In supervised learning, we need enough knowledge about the classes of objects.
Applications of Supervised Learning:
Fraud Detection
Spam Detection
Speech Recognition
Biometric Attendance
Unsupervised Learning
"Unsupervised learning is a type of machine learning in which models are trained using an unlabelled dataset and are allowed to act on that data without any supervision."
Goal: find the underlying structure of the dataset, group/cluster the data according to similarities, and represent the dataset in a compressed format.
We do not have any prior idea about the features of the dataset; the task of the unsupervised learning algorithm is to identify the features (for example, image features) on its own.
Why use Unsupervised Learning?
It is helpful for finding useful insights from the data.
It is much like how a human learns to think through their own experiences, which makes it closer to real AI.
It works on unlabelled and uncategorized data, which makes unsupervised learning more important.
In the real world, we do not always have input data with the corresponding output, so to solve such cases we need unsupervised learning.
Types of Unsupervised Learning
Clustering:
Clustering is a method of grouping objects into clusters such that objects with the most similarities remain in one group and have few or no similarities with the objects of another group.
Association:
Association rule learning makes marketing strategy more effective: for example, people who buy item X (say, bread) also tend to purchase item Y (butter/jam).
Typical examples of association rules are Market Basket Analysis, Web Usage Mining, Continuous Production, etc.
The main aim is to find the dependency of one data item on another data item and map those variables accordingly so that maximum profit can be generated.
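A minimal clustering sketch (Python with scikit-learn assumed; the two-blob dataset is invented for illustration):

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(2)
    X = np.vstack([rng.normal(0, 0.5, (50, 2)),        # unlabelled data containing
                   rng.normal(5, 0.5, (50, 2))])       # two hidden groups

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(kmeans.labels_[:5], kmeans.labels_[-5:])     # cluster assignments found without any labels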
Unsupervised Learning Algorithms
K-means clustering
KNN (k-nearest neighbours)
Hierarchical clustering
Anomaly detection
Neural Networks
Principal Component Analysis
Independent Component Analysis
Apriori algorithm
Singular Value Decomposition
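As a sketch of one listed algorithm, here is Principal Component Analysis compressing a dataset from 4 features to 2 (Python with scikit-learn assumed; the Iris data is used only for illustration):

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA

    X, _ = load_iris(return_X_y=True)              # labels are ignored: unsupervised setting
    X_2d = PCA(n_components=2).fit_transform(X)    # compressed representation of the data
    print(X.shape, "->", X_2d.shape)               # (150, 4) -> (150, 2)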
Advantages of Unsupervised Learning
Unsupervised learning can be used for more complex tasks, because it does not require labelled input data.
Disadvantages of Unsupervised Learning
The result of an unsupervised learning algorithm might be less accurate, as the input data is not labelled and the algorithm does not know the exact output in advance.
Applications of Unsupervised Learning
Network Analysis: used for identifying plagiarism and copyright issues through network analysis of text data in scholarly articles.
Recommendation Systems: For building recommendation applications for different web applications and
e-commerce websites.
Anomaly Detection: Identify unusual data points within the dataset. It is used to discover fraudulent
transactions.
Unsupervised learning in healthcare: categorizing MRI data into normal or abnormal images.
Supervised vs Unsupervised Learning
A supervised learning model takes direct feedback to check whether it is predicting the correct output or not, whereas an unsupervised learning model does not take any feedback.
A supervised learning model predicts the output; an unsupervised learning model finds the hidden patterns in data.
In supervised learning, input data is provided to the model along with the output; in unsupervised learning, only input data is provided to the model.
The goal of supervised learning is to train the model so that it can predict the output when it is given new data; the goal of unsupervised learning is to find the hidden patterns and useful insights from the unknown dataset.
Supervised learning needs supervision to train the model; unsupervised learning does not need any supervision.
Supervised learning is categorized into Classification and Regression problems; unsupervised learning is classified into Clustering and Association problems.
Supervised learning can be used for cases where we know the inputs as well as the corresponding outputs; unsupervised learning can be used for cases where we have only input data and no corresponding output data.
A supervised learning model produces an accurate result; an unsupervised learning model may give a less accurate result compared to supervised learning.
Supervised learning is not close to true Artificial Intelligence, as we first train the model for each piece of data and only then can it predict the correct output; unsupervised learning is closer to true Artificial Intelligence, as it learns similarly to how a child learns daily routine things from experience.
Supervised learning includes algorithms such as Linear Regression, Logistic Regression, Support Vector Machine, Multi-class Classification, Decision Tree, Bayesian Logic, etc.; unsupervised learning includes algorithms such as Clustering, KNN, and the Apriori algorithm.
Semi-supervised Learning
Represents the intermediate ground between Supervised and Unsupervised learning algorithms.
It uses a combination of a small amount of labelled data and a huge amount of unlabelled data during the training period.
Fully supervised learning requires hand-labelling by ML specialists or data scientists and incurs a high processing cost.
Example:
Supervised learning is where a student is under the supervision of an instructor at home and college.
Further, if that student is self-analyzing the same concept without any help from the instructor, it comes
under unsupervised learning.
Under semi-supervised learning, the student revises the concept himself after analyzing it under the guidance of an instructor at college.
Working of Semi-supervised Learning
Firstly, it trains the model with a small amount of labelled training data, similar to supervised learning. Training continues until the model gives accurate results.
In the next step, the algorithm uses the unlabelled dataset with pseudo labels; at this point the results may not be accurate.
Now, the labels from labelled training data and pseudo labels data are linked together.
The input data in labelled training data and unlabelled training data are also linked.
In the end, again train the model with the new combined input as did in the first step. It will reduce
errors and improve the accuracy of the model.
The process can combine various neural network models and training ways.
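A sketch of the pseudo-labelling steps described above, on synthetic data invented for this example (Python with scikit-learn assumed; real systems often combine this with neural networks):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(3)
    X = rng.normal(size=(1000, 4))
    y = (X[:, 0] + X[:, 2] > 0).astype(int)

    X_lab, y_lab = X[:50], y[:50]                   # small labelled set
    X_unlab = X[50:]                                # large unlabelled set (labels assumed unknown)

    model = LogisticRegression().fit(X_lab, y_lab)  # step 1: train on the labelled data
    pseudo = model.predict(X_unlab)                 # step 2: pseudo labels for the unlabelled data

    X_all = np.vstack([X_lab, X_unlab])             # steps 3-4: link labelled and pseudo-labelled data
    y_all = np.concatenate([y_lab, pseudo])
    model = LogisticRegression().fit(X_all, y_all)  # step 5: retrain on the combined data
    print(model.score(X, y))                        # illustrative accuracy on the full synthetic set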
Semi-supervised Learning
Advantages:
It is highly efficient.
Disadvantages:
Accuracy is low.
Applications of Semi-supervised Learning
Speech Analysis: labelling audio data is a very difficult task that requires many human resources; this problem can be naturally overcome by applying a semi-supervised learning model.
Protein sequence classification: DNA strands are large and labelling them requires active human intervention, so semi-supervised models have grown rapidly in this field.
Text document classification: it would be very infeasible / inconvenient to find a large amount of labelled text data, so semi-supervised learning is an ideal model to overcome this.
Reinforcement Learning
Reinforcement learning is a feedback-based machine learning technique.
In it, an agent learns to behave in an environment by performing actions and seeing the results of those actions (a feedback system without any labelled data). Learning is a process of trial and error, based on experience.
Since there is no labelled data, so the agent is bound to learn by its experience only.
For each good action, the agent gets positive feedback, and for each bad action, the agent gets
negative feedback or penalty.
The agent interacts with the environment and explores it by itself. The primary goal of an agent in
reinforcement learning is to improve the performance by getting the maximum positive rewards.
Reinforcement Learning
Example: Suppose there is an AI agent present within a maze environment, and his goal is to find the
diamond.
The agent interacts with the environment by performing some actions
Based on those actions, the state of the agent gets changed
Receives a reward or penalty as feedback.
The agent continues doing these three things (take action, change state/remain in the same state, and
get feedback), and by doing these actions, he learns and explores the environment.
The agent learns which actions lead to positive feedback (rewards) and which actions lead to negative feedback (penalties). For a positive reward the agent gets a positive point, and as a penalty it gets a negative point.
To understand the working process of the RL, we need to consider two main things:
Environment: It can be anything such as a room, maze, football ground, etc.
Agent: An intelligent agent such as AI robot.
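A toy sketch of this action / state / reward loop, using Q-learning (one common RL algorithm, chosen here only for illustration) on an invented 1-D "maze" with states 0..4 and the diamond at state 4 (Python with NumPy assumed):

    import numpy as np

    n_states, actions = 5, [-1, +1]                     # actions: move left / move right
    Q = np.zeros((n_states, len(actions)))              # value of each (state, action) pair
    alpha, gamma, eps = 0.5, 0.9, 0.2
    rng = np.random.default_rng(0)

    for episode in range(200):
        s = 0                                           # the agent starts at the left end
        while s != n_states - 1:                        # until the diamond is reached
            a = rng.integers(2) if rng.random() < eps else int(Q[s].argmax())
            s_next = min(max(s + actions[a], 0), n_states - 1)
            r = 1.0 if s_next == n_states - 1 else -0.01                # reward for the diamond, small penalty otherwise
            Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])  # learn from the feedback
            s = s_next

    print(Q.argmax(axis=1))                             # learned policy: mostly action 1 ("move right")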
Key Features of Reinforcement Learning
In RL, the agent is not instructed about the environment and what actions need to be taken.
The agent takes the next action and changes states according to the feedback of the previous
action.
The environment is stochastic, and the agent needs to explore it to get the maximum positive reward.
Applications of Reinforcement Learning
Robotics:
RL is used in Robot navigation, Robo-soccer, walking, juggling, etc.
Control:
RL can be used for adaptive control, such as factory processes, admission control in telecommunication, and helicopter piloting.
Game Playing:
RL can be used in Game playing such as tic-tac-toe, chess, etc.
Chemistry:
RL can be used for optimizing the chemical reactions.
Business:
RL is now used for business strategy planning.
Manufacturing:
In various automobile manufacturing companies, the robots use deep reinforcement learning to pick
goods and put them in some containers.
Finance Sector:
The RL is currently used in the finance sector for evaluating trading strategies.
Reinforcement Learning in Bank
Create ‘Next Best Offer’ Model for call Centre
Reinforcement Learning in Healthcare
Allocate Scarce Medical Resources to Handle Different ER Cases
Reinforcement Learning in Retail
Reduce Excess Stock with Dynamic Pricing
Batch Learning / Offline Learning
Training of machine learning models in a batch manner.
They typically construct models using the full training set, which are then placed into production. If
we want batch learning algorithms to learn from new data as it arrives, we must construct a new model
from scratch on the full training set and the new data.
If the amount of data is huge, training on the full data may incur a high cost of computing resources
(i.e., CPU, memory space, disk space, disk I/O, network I/O, etc.).
The models trained using batch learning are moved into production only at regular intervals based
on the performance of models trained with new data.
If our system does not need to adapt to rapidly changing data, then the batch learning approach
may be good enough
The process of training, evaluation and testing is simple and straightforward, and often leads to better results than online methods.
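A sketch of the batch-learning workflow on synthetic data invented for this example (Python with NumPy and scikit-learn assumed): when new data arrives, the model is retrained from scratch on the old data plus the new data.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(4)
    X_full = rng.normal(size=(5000, 3)); y_full = (X_full[:, 0] > 0).astype(int)
    model = LogisticRegression().fit(X_full, y_full)        # trained offline, then placed into production

    X_new = rng.normal(size=(500, 3)); y_new = (X_new[:, 0] > 0).astype(int)
    X_retrain = np.vstack([X_full, X_new])                  # must combine the full old set with the new data
    y_retrain = np.concatenate([y_full, y_new])
    model = LogisticRegression().fit(X_retrain, y_retrain)  # full retraining: the costly step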
Disadvantages of Batch Learning
The main disadvantage of batch learning is that it takes a lot of time and resources to re-train the model.
In other words, the system is incapable of learning incrementally from a stream of data, e.g. stock prices.
Online Learning /Incremental Learning:
Trains a model incrementally from a stream of incoming data.
It is fast and cheap, and executes with constant (or at least sub-linear) time and space complexity.
Training happens in an incremental manner by continuously feeding new data on the fly as it arrives, or in small groups (mini-batches).
Receive data as a continuous flow (e.g., stock prices) and need to adapt to change rapidly or
autonomously.
Hence they usually do not require a lot of computing resources; online algorithms achieve this because they do not need the full data to train a model.
Once an online learning system has learned about new data instances, it does not need them anymore, so you can discard them (unless you want to be able to roll back to a previous state and "replay" the data). This can save a huge amount of space.
Online learning algorithms can also be used to train systems on huge datasets that cannot fit in one
machine’s main memory (this is also called out-of-core learning).
Key aspect – learning rate: the rate at which you want your machine learning system to adapt to new data is called the learning rate.
A system with a high learning rate adapts rapidly to new data but also tends to forget old data quickly, while a system with a low learning rate will behave more like batch learning.
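A sketch of incremental training on a simulated data stream (Python with scikit-learn's SGDClassifier assumed; the data and the eta0 value are invented for illustration, with eta0 playing the learning-rate role):

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    model = SGDClassifier(learning_rate="constant", eta0=0.01)    # fixed learning rate
    rng = np.random.default_rng(5)
    classes = np.array([0, 1])

    for step in range(100):                                       # data arriving as a continuous stream
        X_batch = rng.normal(size=(32, 3))
        y_batch = (X_batch[:, 0] > 0).astype(int)
        model.partial_fit(X_batch, y_batch, classes=classes)      # incremental update; the batch can then be discarded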
Big disadvantage of online learning:
If the system is fed with bad data, it will have bad performance and the user will see the impact instantly.
Measures:
Put appropriate filters in place to ensure that the data being fed is of high quality.
Monitor the performance of the machine learning system very closely.
Main Challenges of Machine Learning:
Data Collection
Less amount of training data
Non-representative training data
Poor quality of data
Irrelevant / unwanted features
Overfitting and underfitting the training data
Offline learning & deployment of the model
1.Data Collection:
60% of the work of a data scientist lies in collecting the data.
For beginners to experiment with machine learning, they can easily find data from Kaggle, UCI ML
Repository, etc.
Kaggle:
World's largest data science community with powerful tools and resources to help you achieve your
data science goals.
It allows users to find and publish datasets, explore and build models in a web-based data science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges.
It offers a no-setup, customizable, Jupyter Notebooks environment.
Access free GPUs and a huge repository of community published data & code.
UCI ML Repository
The UCI Machine Learning Repository is a collection of databases, domain theories, and data
generators that are used by the machine learning community for the empirical analysis of
machine learning algorithms.
To implement real case scenarios, you need to collect the data
For solving business problems you need to attain data from clients (here ML engineers need to
coordinate with domain experts to collect the data).
Once the data is collected, we need to structure it and store it in a database. This requires knowledge of Big Data; a data engineer plays a major role here.
2.Less amount of Training Data:
Once the data is collected you need to validate if the quantity is sufficient for the use case
The two important things we do while doing a machine learning project are
Selecting a learning algorithm
Training the model using some of the acquired data.
A child may distinguish an animal from only a few samples by identifying shapes, colours, or other features, but a machine learning model requires thousands of examples for even simple problems.
For complex problems like Image Classification and Speech Recognition, it may require data in a count of
millions.
Therefore, one thing is clear. We need to train a model with SUFFICIENT DATA.
3.Non representative Training Data:
The training data should be representative of the new cases to generalize well i.e., the training data
should cover all the cases that occurred and that is going to occur.
By using a non-representative training set, the trained model is not likely to make accurate predictions.
If the number of training samples is too low, we get sampling noise (unrepresentative data by chance); even very large samples can suffer from sampling bias if the sampling strategy is defective.
A popular case of sampling bias occurred during the US Presidential election in 1936 (Landon against Roosevelt): a very large poll was conducted by the Literary Digest by sending mail to around ten million people, of whom 2.4 million answered, and it predicted with high confidence that Landon would get 57% of the votes. However, Roosevelt won with 62% of the votes.
The problem was in the sampling method: to get the addresses for conducting the poll, the Literary Digest used magazine subscriber lists, club membership lists, and the like, which tend to include wealthier individuals who were more likely to vote Republican (hence Landon). Non-response bias also comes into the picture, as only about 25% of the people answered the poll.
To make accurate predictions without any drifts, the training datasets must be representative.
4.Poor Quality of Data:
In reality, we don't directly start training the model; analyzing the data is the most important step. The data we collected might not be ready for training: some samples may be abnormal, having outliers or missing values, for instance.
In these cases,
We don’t want our system to make false predictions, right? So the quality of data is very important to get
accurate results.
Data pre-processing needs to be done by filtering missing values and extracting & rearranging what the model needs.
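A small pre-processing sketch for missing values (Python with pandas assumed; the tiny table is invented for illustration):

    import pandas as pd

    df = pd.DataFrame({"age": [25, None, 40, 35],
                       "salary": [50000, 60000, None, 52000]})

    df_drop = df.dropna()                             # option 1: filter out rows with missing values
    df_fill = df.fillna(df.mean(numeric_only=True))   # option 2: fill gaps with the column mean
    print(df_fill)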
5.Irrelevant or unwanted Features:
If the training data contains a large number of irrelevant features and not enough relevant features, the machine learning system will not give the expected results.
One of the important aspects required for the success of a machine learning project is the selection of good
features to train the model also known as Feature Selection.
Let’s say we are working on a project to predict the number of hours a person needs to exercise based on the
input features that we collected — age, gender, weight, height, and location (i.e., where he/she lives).
Among these 5 features, location value might not impact our output function. This is an irrelevant feature; we
know that we can have better results without this feature.
Also, we can combine two features to produce a more useful one i.e., Feature Extraction. In our example, we
can produce a feature called BMI by eliminating weight and height. We can apply transformations on the dataset
too.
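A sketch of the exercise example (Python with pandas assumed; the column names and sample rows are invented for illustration):

    import pandas as pd

    df = pd.DataFrame({"age": [25, 40], "gender": ["F", "M"],
                       "weight_kg": [60, 85], "height_m": [1.65, 1.80],
                       "location": ["Pune", "Mumbai"]})

    df = df.drop(columns=["location"])                  # feature selection: drop the irrelevant feature
    df["bmi"] = df["weight_kg"] / df["height_m"] ** 2   # feature extraction: combine weight and height into BMI
    df = df.drop(columns=["weight_kg", "height_m"])     # BMI replaces the two original features
    print(df)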
6.Overfitting the Training Data:
Say you visited a restaurant in a new city. You looked at the menu to order something and found that the
cost or bill is too high. You might be tempted to say that ‘all the restaurants in the city are too costly and
not affordable’. Overgeneralizing is something we do very frequently, and surprisingly, machines can fall into the same trap; in ML, we call it overfitting.
Overfitting occurs when the model is excessively complex compared to the amount and noisiness of the training data. We can avoid it by preferring a simpler model (for example, a linear model rather than a high-degree polynomial), gathering more training data, or reducing the noise in the data.
7. Underfitting the Training Data:
Underfitting is the opposite of overfitting. It generally happens when we have too little information to construct an accurate model, or when we attempt to build a linear model with non-linear information.
Main options to reduce underfitting are:
Feature Engineering — feeding better features to the learning algorithm.
Removing noise from the data.
Increasing parameters and selecting a powerful model.
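One common way to spot overfitting or underfitting is to compare training and test accuracy: a large gap suggests overfitting, while low scores on both suggest underfitting. A sketch on synthetic data invented for this example (Python with scikit-learn assumed):

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(6)
    X = rng.normal(size=(300, 5))
    y = (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)     # noisy labels
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    for depth in (1, None):                                        # very simple tree vs unrestricted tree
        tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
        print("depth", depth, "train", round(tree.score(X_tr, y_tr), 2),
              "test", round(tree.score(X_te, y_te), 2))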
8. Offline Learning & Deployment of Model:
Deployment (bringing applications into production) has become one of the biggest challenges, due to:
Lack of skills and practice in deployment, and dependency issues
Low understanding of the underlying models on the business side
Poor understanding of the business problem, and unstable models
Generally, many developers collect data from websites like Kaggle and start training the model. But in reality, we need to build a data-collection source of our own, and that data varies dynamically.
It is always preferred to build a pipeline to collect, analyze, build/train, test & validate the dataset
for any machine learning project and train the model in batches.
Here the data might drift as it changes dynamically. So offline learning or batch learning is not preferred.