Machine Learning Semester Paper
Ques : What is Machine Learning?
Ans : Machine Learning is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns, and make decisions with minimal human intervention. Large amounts of data can be used to create much more accurate Machine Learning algorithms that are actually viable in industry. As a result, Machine Learning has become a buzzword in the industry despite having existed for a long time.
The machine learning process starts with feeding the machines good-quality data and then training them by building various machine learning models using the data and different algorithms. The choice of algorithm depends on the type of data we have and the kind of task we are trying to automate.
As for the formal definition of Machine Learning, we can say that a Machine Learning
algorithm learns from experience E with respect to some type of task T and performance
measure P, if its performance at tasks in T, as measured by P, improves with experience E.
For example, if a Machine Learning algorithm is used to play chess, then the experience E is playing many games of chess, the task T is playing chess against many players, and the performance measure P is the probability that the algorithm will win a game of chess.
Machine Learning comes into action in many areas of everyday life. Some of the most common examples are:
• Image Recognition
• Speech Recognition
• Recommender Systems
• Fraud Detection
• Self Driving Cars
• Medical Diagnosis
• Stock Market Trading
• Virtual Try On
Image Recognition
Image Recognition is one of the reasons behind the boom in the field of Deep Learning. What started as classifying images of cats and dogs has now evolved to Face Recognition and real-world use cases built on it, such as employee attendance tracking.
Speech Recognition
Most of us have come across speech-recognition-based smart assistants like Alexa and Siri and have used them to communicate. In the backend, these assistants are based on Speech Recognition systems, which are designed to convert voice instructions into text.
Recommender Systems
As our world becomes more and more digitalized, nearly every tech giant tries to provide customized services to its users. This is possible because of recommender systems, which can analyze a user's preferences and search history and, based on that, recommend content or services to them.
Fraud Detection
In today's world, most things have been digitalized: from buying a toothbrush to making transactions of millions of dollars, everything is accessible and easy to use. But with this process of digitization, cases of fraudulent transactions and fraudulent activities have increased. Identifying them is not easy, but machine learning systems are very efficient at these tasks.
Medical Diagnosis
If you are a machine learning practitioner, or even a student, you must have heard about projects like breast cancer classification, Parkinson's disease classification, pneumonia detection, and many more health-related tasks, which are performed by machine learning models with more than 90% accuracy.
Ethical concerns
There are, of course, many advantages to trusting algorithms. Humanity has benefited from
relying on computer algorithms to automate processes, analyze large amounts of data, and
make complex decisions. However, trusting algorithms has its drawbacks. Algorithms can be
subject to bias at any level of development. And since algorithms are developed and trained
by humans, it’s nearly impossible to eliminate bias.
Many ethical questions still remain unanswered. For example, who is to blame if something
goes wrong? Let’s take the most obvious example — self-driving cars.
Deterministic problems
ML is a powerful technology well suited to many domains, including weather forecasting and climate and atmospheric research. ML models can be used to help calibrate and correct the sensors that measure environmental indicators like temperature, pressure, and humidity.
Models can be programmed, for example, to simulate weather and emissions into the
atmosphere to forecast pollution. Depending on the amount of data and the complexity of the
model, this can be computationally intensive and take up to a month.
However, neural networks do not understand the physics of a weather system, nor do they understand its laws. For example, ML can make predictions, but the calculated values of intermediate fields such as density can be negative, which is impossible under the laws of physics. AI does not recognize cause-and-effect relationships: the neural network finds a connection between input and output data but cannot explain why they are connected.
Lack of Data
Neural networks are complex architectures and require enormous amounts of training data to
produce viable results. As the size of a neural network’s architecture grows, so does its data
requirement. In such cases, some may decide to reuse the data, but this will never bring good
results.
Another problem is related to the lack of quality data. This is not the same as simply not
having data. Let’s say your neural network requires more data, and you give it a sufficient
quantity, but you give it poor quality data. This can significantly reduce the model’s
accuracy.
Lack of interpretability
One significant problem with deep learning algorithms is interpretability. Let’s say you work
for a financial firm, and you need to build a model to detect fraudulent transactions. In this
case, your model should be able to justify how it classifies transactions. A deep learning algorithm may have good accuracy and responsiveness for this task but may not be able to justify or explain its decisions.
Or maybe you work for an AI consulting firm. You want to offer your services to a client that
uses only traditional statistical methods. AI models can be powerless if they cannot be
interpreted, and the process of human interpretation involves nuances that go far beyond
technical skill.
Lack of reproducibility
Lack of reproducibility in ML is a complex and growing issue exacerbated by a lack of code
transparency and model testing methodologies. Research labs develop new models that can
be quickly deployed in real-world applications. However, even if the models are developed to
take into account the latest research advances, they may not work in real cases.
Reproducibility can help different industries and professionals implement the same model
and discover solutions to problems faster.
4. Reinforcement Learning
1. Supervised Learning
Example: Consider a scenario where you have to build an image classifier to differentiate between cats and dogs. If you feed the algorithm a dataset of labelled images of dogs and cats, the machine will learn to classify a dog or a cat from these labelled images. When we input new dog or cat images that it has never seen before, it will use the learned model to predict whether the image shows a dog or a cat. This is how supervised learning works, and this particular task is image classification.
There are two main categories of supervised learning that are mentioned below:
• Classification
• Regression
Classification
Classification deals with predicting categorical target variables, which represent discrete classes or labels. For instance, classifying emails as spam or not spam, or predicting whether a patient has a high risk of heart disease. Some common classification algorithms are:
• Logistic Regression
• Random Forest
• Decision Tree
• Naive Bayes
Regression
Regression, on the other hand, deals with predicting continuous target variables, which represent numerical values. For example, predicting the price of a house based on its size, location, and amenities, or forecasting the sales of a product. Regression algorithms learn to map the input features to a continuous numerical value. Some common regression algorithms are:
• Linear Regression
• Polynomial Regression
• Ridge Regression
• Lasso Regression
• Decision tree
• Random Forest
Advantages of Supervised Machine Learning
• Supervised Learning models can have high accuracy as they are trained on labelled data.
• It can often make use of pre-trained models, which saves time and resources compared with developing new models from scratch.
Disadvantages of Supervised Machine Learning
• It has limitations in recognizing patterns and may struggle with unseen or unexpected patterns that are not present in the training data.
Applications of Supervised Machine Learning
• Natural language processing: extracting information from text, such as sentiment, entities, and relationships.
2. Unsupervised Learning
Consider that you have a dataset that contains information about the purchases customers made from a shop. Through clustering, the algorithm can group customers with similar purchasing behaviour, revealing potential customer segments without predefined labels. This type of information can help businesses target customers as well as identify outliers.
There are two main categories of unsupervised learning that are mentioned below:
• Clustering
• Association
Clustering
Clustering is the process of grouping data points into clusters based on their similarity. This technique is useful for identifying patterns and relationships in data without the need for labeled examples. Some common clustering algorithms are:
• Mean-shift algorithm
• DBSCAN Algorithm
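To make this concrete, here is a minimal sketch (assuming scikit-learn and NumPy are installed) that runs DBSCAN, one of the algorithms listed above, on a tiny made-up purchase dataset; the numbers and parameter values are purely illustrative.

# A minimal clustering sketch using scikit-learn's DBSCAN on toy purchase data.
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical 2-D purchase data: (amount spent, items bought)
X = np.array([[10, 1], [12, 1], [11, 2],
              [95, 8], [100, 9], [98, 7],
              [500, 2]])                     # a far-away, outlier-like record

model = DBSCAN(eps=10, min_samples=2)        # eps and min_samples chosen for this toy data
labels = model.fit_predict(X)                # -1 marks points treated as noise
print(labels)                                # two clusters plus one noise point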
Association
Association rule learning is a technique for discovering relationships between items in a dataset, such as products that are frequently bought together. Some common association rule learning algorithms are:
• Apriori Algorithm
• Eclat
• FP-growth Algorithm
Advantages of Unsupervised Machine Learning
• It helps to discover hidden patterns and various relationships within the data.
• It is used for tasks such as customer segmentation, anomaly detection, and data exploration.
• It does not require labeled data and reduces the effort of data labeling.
Disadvantages of Unsupervised Machine Learning
• Without using labels, it may be difficult to predict the quality of the model’s output.
• Cluster Interpretability may not be clear and may not have meaningful interpretations.
Applications of Unsupervised Machine Learning
• Feature extraction: techniques such as autoencoders and dimensionality reduction can be used to extract meaningful features from raw data.
• Dimensionality reduction: reduce the dimensionality of data while preserving its essential information.
3. Semi-Supervised Learning
Semi-supervised learning is a machine learning approach that sits between supervised and unsupervised learning, so it uses both labelled and unlabelled data. It is particularly useful when obtaining labelled data is costly, time-consuming, or resource-intensive, or when labelling the data requires specialist skills and resources in order to train or learn from it.
We use these techniques when only a small portion of the data is labeled and the large remaining portion is unlabeled. We can use unsupervised techniques to predict labels and then feed these labels to supervised techniques. This approach is mostly applicable to image datasets, where usually not all images are labeled.
Example: Consider that we are building a language translation model; obtaining labeled translations for every sentence pair can be resource-intensive. Semi-supervised learning allows the model to learn from both labeled and unlabeled sentence pairs, making it more accurate. This technique has led to significant improvements in the quality of machine translation services.
• It still requires some labeled data that might not always be available or easy to obtain.
Reinforcement Learning lets an agent learn by interacting with an environment and receiving rewards or penalties for its actions. Some common applications of Reinforcement Learning are:
• Game Playing: RL can teach agents to play games, even complex ones.
• Autonomous Vehicles: RL can help self-driving cars navigate and make decisions.
• Natural Language Processing (NLP): RL can be used in dialogue systems and chatbots.
Ques : Online Machine Learning vs Offline Machine Learning?
Ans :
Online machine learning is a type of machine learning where data is acquired sequentially
and is utilized to update the best predictor for future data at each step.
In other words, online machine learning means that learning takes place as data becomes
available. With online learning, the learning algorithm’s parameters are updated after
learning from each individual training instance. In online learning, each learning step is quick
and cheap, and the model can learn from new information in real-time as it arrives.
Online learning is ideal for machine learning systems that receive data as a continuous flow
and need to be able to adapt to rapidly changing conditions.
An example of one of these systems might be one that predicts the weather or analyses stock
prices. This type of machine learning is also an ideal option if computing resources are a
factor – when an online model has learned from new data instances, it no longer needs to use
them and can therefore discard them. This can save a huge amount of storage space.
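As a rough illustration of this instance-by-instance updating, the sketch below (an assumption, not taken from the original) uses scikit-learn's SGDClassifier, whose partial_fit method updates the model one small batch at a time so each instance can be discarded after it has been learned from; the streamed data here is synthetic.

# Minimal online-learning sketch: the model is updated as each instance "arrives".
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier()                        # a linear model trained by stochastic gradient descent
classes = np.array([0, 1])                     # all classes must be declared up front for partial_fit

rng = np.random.default_rng(0)
for step in range(100):                        # each iteration mimics new data arriving
    X_new = rng.normal(size=(1, 2))            # one fresh instance with two features
    y_new = np.array([int(X_new[0, 0] + X_new[0, 1] > 0)])
    model.partial_fit(X_new, y_new, classes=classes)   # quick, cheap update; instance can then be discarded

print(model.predict(np.array([[1.0, 1.0]])))   # expected: class 1 for a clearly positive point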
Offline :
While online learning does have its uses, traditional machine learning is performed offline
using the batch learning method.
In batch learning, data is accumulated over a period of time. The machine learning model is
then trained with this accumulated data from time to time in batches. It is the direct opposite
of online learning because the model is unable to learn incrementally from a stream of live
data. In batch learning, the machine learning algorithm updates its parameters only after
consuming batches of new data.
Differences :
1. How weights and parameters are updated
In offline batch learning, the weights and parameters are changed only after a whole batch of data has been consumed. In an online machine learning process, however, the changes of weights and parameters at a given step depend on the individual example that is being shown. If the model has already been deployed, the model's current state might also be a factor.
2. Training time required
Offline batch learning is generally a lot faster than online machine learning, because the whole dataset is used at once to modify the model's weights and parameters, rather than updating them after every individual instance.
That said, the sheer size of modern big data streams means that it can be a time-consuming
and sometimes impossible task to feed all available data into an offline model. In this
situation, engineers can either opt for online machine learning or feed the model with data
incrementally.
3. Cost
Offline machine learning is also often cheaper than online machine learning. This is because in online machine learning, the model obtains and tunes its parameters as new data becomes available in real time.
Online machine learning is an ongoing, continuous process that requires a constant input of
data. This is because model refinement and improvement can only be carried out when the
model is being fed this data.
The computational power required for online machine learning is therefore higher than
offline batch learning which in contrast requires fewer computations.
4. Use in production
Online machine learning models are a lot harder to manage in a production environment. This
is because online learning models churn through large amounts of data in real-time and learn
from them. This has an immediate impact on the machine learning model and the solution it
powers and can affect the overall performance of it, a problem known as concept drift.
While this can be controlled, e.g. by filtering out “bad data”, doing so requires a larger resource input and leads to higher costs.
With batch learning, changes to the model are only reflected when updated models that have
been trained with new data are manually pushed to production.
5. Limits in scalability
A purely online model can be difficult to deploy in a way that’s scalable. This is because
when a model is purely online, its health and performance must be constantly monitored, as
must the health of the system that's sending data to the model. It is somewhat telling that even large multinational companies that have the resources to do this often choose not to.
It’s also more difficult to get an algorithm to behave in the desired way on a purely online,
automatic basis.
Ques : Random data sampling vs Stratified data sampling ?
Ans : The method of collecting data from a sample, i.e., a group of items selected from the population, and examining it to draw conclusions is known as the Sampling Method. This method is even used in people's day-to-day lives. For example, a cook takes a spoonful of pulses to check whether the whole pot is evenly cooked. The sampling method of collecting data is suitable for a large population and for cases when the investigator does not require a high level of accuracy. It is also preferred by investigators when they do not need an intensive examination of items.
Methods of Sampling:
1. Random Sampling
2. Purposive or Deliberate Sampling
3. Stratified or Mixed Sampling
4. Systematic Sampling
1. Random Sampling
As the name suggests, in this method of sampling, the data is collected at random. It means
that every item of the universe has an equal chance of getting selected for the investigation
purpose. In other words, each item has an equal probability of being in the sample, which
makes the method impartial. As there is no control of the investigator in selecting the sample,
the random sampling method is used for homogeneous items. As there is no tool or a large
number of people required for collecting data through random sampling, this method is
economical. There are two ways of collecting data through the random sampling method.
These are the Lottery Method and Tables of Random Numbers.
3. Stratified or Mixed Sampling
In stratified sampling, the population is first divided into homogeneous groups, or strata, and items are then selected from each stratum. For example, there are 60 students in Class 10th. Out of these 60 students, 10 opted for Arts and Humanities, 30 opted for Commerce, and 20 opted for Science in Class 11th. It means that the population of 60 students is divided into three strata, viz. Arts and Humanities, Commerce, and Science, containing 10, 30, and 20 students, respectively. Now, for investigation purposes, some items will be proportionately selected from each of the strata so that the items forming the sample represent the entire population. Besides, an investigator can even select items from the different strata unproportionately.
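The contrast between the two methods can be sketched in code. The example below (assuming pandas is available) draws both a simple random sample and a proportionate stratified sample from the 60-student population described above; the sample sizes are illustrative.

# Random sampling vs stratified sampling on the 60-student example.
import pandas as pd

students = pd.DataFrame({
    "stream": ["Arts"] * 10 + ["Commerce"] * 30 + ["Science"] * 20
})

# Simple random sample: every student has an equal chance of selection.
random_sample = students.sample(n=12, random_state=1)

# Stratified (proportionate) sample: 20% drawn separately from each stream.
stratified_sample = students.groupby("stream").sample(frac=0.2, random_state=1)

print(random_sample["stream"].value_counts())       # proportions may drift from 10/30/20
print(stratified_sample["stream"].value_counts())   # exactly 2 Arts, 6 Commerce, 4 Science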
Ques : Mean, Median and Mode?
Ans : In Machine Learning (and in mathematics) there are often three values that interest us:
Mean
The mean value is the average value.
To calculate the mean, find the sum of all values, and divide the sum by the number of
values:
(99+86+87+88+111+86+103+87+94+78+77+85+86) / 13 = 89.77
Median
The median value is the value in the middle, after you have sorted all the values:
77, 78, 85, 86, 86, 86, 87, 87, 88, 94, 99, 103, 111
Here the median is 87, the 7th of the 13 sorted values. It is important that the numbers are sorted before you can find the median.
If there is an even number of values, there are two numbers in the middle; in that case, divide the sum of those two numbers by two:
77, 78, 85, 86, 86, 86, 87, 87, 94, 98, 99, 103
(86 + 87) / 2 = 86.5
Mode
The Mode value is the value that appears the most number of times:
99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86 = 86
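These three values can be checked with Python's built-in statistics module, as in the minimal sketch below using the same thirteen numbers.

# Mean, median, and mode of the values used above.
import statistics

values = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]

print(round(statistics.mean(values), 2))   # 89.77
print(statistics.median(values))           # 87 (the middle of the 13 sorted numbers)
print(statistics.mode(values))             # 86 (appears three times)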
Ques : ANI VS ASI VS AGI ?
Ans : Three Types (Stages) of AI - Based on Capabilities
There are various ways to create AI, depending on what we want to achieve with it and how
we will measure its success. It ranges from extremely rare and complex systems, such as self-
driving cars and robotics, to parts of our everyday lives, such as facial recognition, machine
translation, and email categorization. The path you choose will depend on what your AI goals
are and how well you understand the intricacies and feasibility of various approaches.
AI technologies are categorized according to their ability to mimic human traits, the
techniques they use to do so, their real-world applications, and theory of mind. Using these
characteristics as a reference, all AI systems — real and hypothetical — fall into one of three
categories:
Today, we have three different variants of AI technology; ANI, AGI, and ASI. These are the
three stages in which AI can evolve. We have only achieved narrow AI so far.
In contrast to strong AI, which can learn to perform any task humans do, weak AI (or narrow
AI) is limited to one or a few specific tasks. This is the kind of artificial intelligence we
currently have. In fact, deep learning, loosely modeled on the human brain (and often compared to it), has very limited capabilities and is nowhere near what a human child's brain can do. This is not a bad thing.
In fact, narrow AI can focus on specific tasks and do them better than humans. For example, feed
a deep learning algorithm enough pictures of skin cancer and it will be better at spotting skin
cancer than an experienced doctor. This does not mean that deep learning will replace
doctors. You need intuition, abstract thinking, and more skills to decide what is best for your
patients.
General AI (AGI) is only theoretical at this point. This is the AI that writers have imagined for years in sci-fi stories. Ultimately, when we achieve AGI, machines will have consciousness and decision-making capabilities, i.e., full human cognitive abilities. These machines will not require human input or programming to function. For all intents and purposes, this will be a time when machines act, feel, respond and think like humans. We can say that a strong AI has a mind of its own and is capable of doing whatever it wants to do, like any human being.
Unlike narrow AI, which classifies data and finds patterns, general AI uses clustering and
association when processing data. AGI will also be self-aware. However, like a child, AI
must learn through experience, improving knowledge and skills over time.
But while all of this talent is focused on finding a way to create a powerful AI that can
compete with the human brain, we are missing a lot of opportunities and failing to address the
threats posed by current weak AI technologies.
ASI is a futuristic concept and idea of artificial intelligence replacing human intelligence
capabilities. For ASI to become a reality, computational programs must surpass human
intelligence in all parameters and environments. ASI will only become a reality when AI
becomes smarter than humans.
ASI, with its futuristic halo, seems far removed from the present stage of human evolution; this so-called sentient variant of AI illustrates the conceptual limits of AI technology and makes promises that it will not deliver, or that will at best become a reality decades from now.
If ASI becomes possible and becomes a reality, the role of humans in decision-making, the
arts and humanities, and emotional understanding of all aspects of life could be at odds with
the rise of machines.
Ques : Balanced vs Imbalanced dataset?
Ans : A classification data set with skewed class proportions is called imbalanced. Classes
that make up a large proportion of the data set are called majority classes. Those that make
up a smaller proportion are minority classes.
Case 1:
If there are 900 ‘Yes’ and 100 ‘No’, then it represents an imbalanced dataset, as there is a highly unequal distribution of the two classes.
Case 2:
If there are 550 ‘Yes’ and 450 ‘No’ then it represents a Balanced dataset as there is
approximately equal distribution of the two classes.
Problems with an imbalanced dataset:
• Algorithms may get biased towards the majority class and thus tend to predict the output as the majority class.
• Minority class observations look like noise to the model and are ignored by the model.
• An imbalanced dataset gives a misleading accuracy score.
Techniques to handle an imbalanced dataset:
• Under Sampling :
In this technique, we reduce the sample size of Majority class and try to match it with
the sample size of Minority Class.
Example :
Let’s take an imbalanced training dataset with 1000 records.
Pros :
Cons :
Note : Under Sampling should only be done when we have a huge number of records.
• Over Sampling :
In this technique, we increase the sample size of Minority class by replication and try to
match it with the sample size of Majority Class.
Example :
Let’s take the same imbalanced training dataset with 1000 records.
Pros :
Cons :
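A minimal sketch of both techniques, assuming scikit-learn and pandas are available, on a hypothetical 900 'Yes' / 100 'No' dataset like the one in Case 1; the record counts are illustrative.

# Random under-sampling and over-sampling with sklearn.utils.resample.
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({"label": ["Yes"] * 900 + ["No"] * 100})
majority = df[df["label"] == "Yes"]
minority = df[df["label"] == "No"]

# Under-sampling: shrink the majority class down to the minority size (100 each).
majority_down = resample(majority, replace=False, n_samples=len(minority), random_state=1)
balanced_under = pd.concat([majority_down, minority])

# Over-sampling: replicate minority records up to the majority size (900 each).
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=1)
balanced_over = pd.concat([majority, minority_up])

print(balanced_under["label"].value_counts())
print(balanced_over["label"].value_counts())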
RL algorithms can be mainly divided into two categories – model-based and model-free.
Model-based, as it sounds, has an agent trying to understand its environment and creating a
model for it based on its interactions with this environment. In such a system, preferences take priority over the consequences of the actions, i.e. the greedy agent will always try to perform the action that gets the maximum reward, irrespective of what that action may cause.
On the other hand, model-free algorithms seek to learn the consequences of their actions
through experience via algorithms such as Policy Gradient, Q-Learning, etc. In other words,
such an algorithm will carry out an action multiple times and will adjust the policy (the
strategy behind its actions) for optimal rewards, based on the outcomes.
Think of it this way: if the agent can predict the reward for an action before actually performing it, thereby planning what it should do, the algorithm is model-based. If it instead needs to actually carry out the action to see what happens and learn from it, it is model-free.
This results in different applications for these two classes; e.g., a model-based approach
may be the perfect fit for playing chess or for a robotic arm in the assembly line of a product,
where the environment is static and getting the task done most efficiently is our main
concern. However, in the case of real-world applications such as self-driving cars, a model-
based approach might prompt the car to run over a pedestrian to reach its destination in less
time (maximum reward), but a model-free approach would make the car wait till the road is
clear (optimal way out).
To better understand this, consider an example: building model-free and model-based RL agents for a game of tennis. To build the model, we need an environment in which the policy can be implemented; rather than building the environment ourselves, we would import an existing one for our program.
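Rather than reproducing the tennis environment, the sketch below shows the model-free idea on a deliberately tiny, self-contained example: tabular Q-learning on a five-state corridor where the agent must learn to move right. The environment and parameter values are invented purely for illustration.

# Model-free RL sketch: tabular Q-learning on a 1-D corridor.
import numpy as np

n_states, n_actions = 5, 2            # states 0..4; actions: 0 = move left, 1 = move right
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.1
rng = np.random.default_rng(0)

for episode in range(500):
    s = 0
    while s != n_states - 1:          # state 4 is the goal
        # epsilon-greedy action selection
        a = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(Q[s]))
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == n_states - 1 else 0.0
        # Q-learning update: learn from the observed outcome, no model of the environment
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next

print(np.argmax(Q, axis=1))           # the learned policy chooses "right" in the non-goal states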
As it happens, analyzing and understanding data is a prerequisite for data processing. This is
particularly important because data comes in two different forms: structured and
unstructured. Each data type is accumulated, processed, sorted, and analyzed to derive
valuable information and improve overall decision-making. Both structured and unstructured
data are stored in different databases.
Structured data is well-organized, easy to quantify, well defined, simple to search and analyze
with software in data analytics. Structured data is usually located in a specific field within
files or records. It is easy to place structured data into a standard pattern of set rows, tables,
and columns.
A good example of handling structured data is accessing a hotel database, where all the relevant details of the guests, such as name, contact number, and address, can be retrieved with ease. Such data is structured.
Structured data is typically stored in relational databases (RDBMS). Any information stored in the database can be updated by people or machines and accessed with ease by algorithms or manual search. Structured Query Language (SQL) is the standard tool used to handle structured data, be it locating, adding, deleting, or updating records.
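As a small illustration of how SQL works with this row-and-column structure, the sketch below uses Python's built-in sqlite3 module and a hypothetical hotel guest table; the table name, columns, and values are made up.

# Structured data handled with SQL via an in-memory SQLite database.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE guests (name TEXT, contact TEXT, address TEXT)")
cur.execute("INSERT INTO guests VALUES (?, ?, ?)", ("A. Kumar", "9999999999", "Delhi"))

# Locating, updating, and deleting rows all rely on the same fixed rows and columns.
cur.execute("UPDATE guests SET address = ? WHERE name = ?", ("Mumbai", "A. Kumar"))
cur.execute("SELECT name, contact, address FROM guests")
print(cur.fetchall())

cur.execute("DELETE FROM guests WHERE name = ?", ("A. Kumar",))
conn.close()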
Advantages of Structured Data
The well-organized and quantitative nature of structured data makes it very easy to update, modify, and search the data.
Anyone with basic knowledge of data and its related applications can use structured data.
Structured data facilitates the self-service mode of data access to the user. So, it is not
necessary to have in-depth knowledge of data types and their relationships.
As structured data has been in use for a long time, most tools have been tested for their
efficiency in data analysis.
4. Seamless integrations
Simple and streamlined programs like Excel can be used to store and organize structured
data. Furthermore, several other analytical tools can be linked to Excel for further data
analysis as required.
5. Suitability
Structured data is highly suitable for basic organization and quantitative analysis.
Disadvantages of Structured Data
Structured data lacks versatility. It can be used only with a set vision and cannot deviate from it, as it has a predefined structure.
Structured data is stored in data warehouses with a rigid data storage method. Any change in requirements means updating the existing structured data to accommodate it, which is expensive and time-consuming.
Structured data can offer limited insight as it works on pre-set parameters. It does not provide
the details of how and why the data analytics is carried out.
Unstructured data :
Unstructured data refers to information that is not organized and cannot be accommodated in
a set or defined framework. It can be stored only in its original form until put to use. This
feature is known as schema on read.
The majority of the data we come across is unstructured. Nearly 80% of the enterprise data is
unstructured; this percentage appears to be constantly growing. Unstructured data comes in
various formats like emails, posts on social media platforms, chats, presentations, images,
satellite feeds, and data from IoT sensors.
Naturally, companies that invest time and money in deciphering unstructured data get access
to vital and valuable business intelligence to increase their profits. It can also help them
connect to their customers more efficiently and in a personalized fashion, thereby
contributing to increased profits.
Advantages of Unstructured Data
1. Liberty to stay in the natural form
As unstructured data is accumulated in its original form (native form), it is not defined until
used. This results in a larger reserve pool as the unstructured data can adapt to any data
requirement. It also facilitates data analysts and data scientists to process and analyze only
the required information.
Unstructured data has an impressive accumulation rate. As it does not require pre-set
parameters, it can be gathered easily and quickly.
Cloud data lakes store unstructured data due to their impressive storage capacity. Cloud data
lakes charge on a pay-for-what-you-use basis and are highly cost-effective, flexible, and
scalable.
Disadvantages of Unstructured Data
As mentioned before, you require data science expertise to leverage unstructured data for useful processing and analysis. So, a regular business person or user cannot possibly extract any meaningful information from unstructured data in its crude native form. Processing
unstructured data requires the knowledge of the topic related to the data and the knowledge of
linking the data to make it resourceful. Even more disadvantageous is that there is a shortage
of data science professionals despite the continually growing demand across industries.
Unstructured data requires specialized tools for manipulation besides data science expertise.
Standard data analytics tools are useful and compatible with structured data, and data
engineers only have a limited choice of tools to analyze unstructured data.
Dimensions are features that may be dependent or independent. The concept of dimensions in
context to the curse of dimensionality becomes easier to understand with the help of an
example. Suppose there is a dataset with 100 features. Now let’s assume you intend to build
various separate machine learning models from this dataset. The models can be model-1,
model-2, ..., model-100. The difference between these models is the number of features.
With the increase in the number of features, the model’s accuracy increases. However, after a
specific threshold value, the model’s accuracy will not increase, although the number of
features increases. This is because the model is fed with so much information that it is unable to learn from the genuinely relevant information.
The phenomenon when a machine learning model’s accuracy decreases, although increasing
the number of features after a certain threshold, is called the curse of dimensionality.
As the number of dimensions or features increases, the amount of data needed to generalize
the machine learning model accurately increases exponentially. The increase in dimensions
makes the data sparse, and it increases the difficulty of generalizing the model. More training
data is needed to generalize that model better.
The higher dimensions lead to equidistant separation between points. The higher the
dimensions, the more difficult it will be to sample from because the sampling loses its
randomness.
It becomes harder to collect enough observations when there are plenty of features. High dimensionality makes all observations in the dataset appear roughly equidistant from all other observations. Clustering uses Euclidean distance to measure the similarity between observations, and meaningful clusters cannot be formed if all the distances are roughly equal.
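This "everything becomes equidistant" effect can be observed numerically. The sketch below (using NumPy on random data) prints the ratio of the nearest to the farthest distance from one point to all others; as the number of dimensions grows, the ratio approaches 1.

# Distance concentration in high dimensions.
import numpy as np

rng = np.random.default_rng(0)
for d in [2, 10, 100, 1000]:
    X = rng.random((500, d))                       # 500 random points in d dimensions
    dists = np.linalg.norm(X - X[0], axis=1)[1:]   # distances from the first point to the others
    print(d, round(dists.min() / dists.max(), 3))  # nearest/farthest ratio approaches 1 as d grows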
Note: An underfitting model has high bias and low variance.
Reasons for underfitting:
1. The model is too simple, so it may not be capable of representing the complexities in the data.
2. The input features used to train the model are not adequate representations of the underlying factors influencing the target variable.
3. Early stopping during the training phase (keeping an eye on the loss during training and stopping as soon as it begins to increase).
Gradient descent :
Linear regression is about finding the line of best fit for a dataset. This line can then be used
to make predictions.
Gradient descent is a tool to arrive at the line of best fit
In gradient descent, you start with a random line. Then you change the parameters of the line
(i.e. slope and y-intercept) little by little to arrive at the line of best fit.
How do you know when you arrived at the line of best fit?
For every line you try — line A, line B, line C, etc — you calculate the sum of squares of the
errors. If line B has a smaller value than line A, then line B is a better fit, etc.
Error is your actual value minus your predicted value. The line of best fit minimizes the sum
of the squares of all the errors. In linear regression, the line of best fit computed using the correlation coefficient also happens to be the least-squared-error line. That's why the regression line is called the LEAST SQUARES REGRESSION LINE.
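A minimal sketch of gradient descent fitting such a line is shown below; the toy data, learning rate, and iteration count are chosen only for illustration.

# Gradient descent for a line y = m*x + b, minimizing the sum of squared errors.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 6.2, 7.9, 10.1])   # roughly y = 2x

m, b = 0.0, 0.0                            # start from an arbitrary line (here: zero slope and intercept)
lr = 0.01                                  # learning rate: the size of each little step

for _ in range(5000):
    y_pred = m * x + b
    error = y_pred - y
    # gradients of the sum of squared errors with respect to m and b
    grad_m = 2 * np.sum(error * x)
    grad_b = 2 * np.sum(error)
    m -= lr * grad_m                       # nudge the slope a little
    b -= lr * grad_b                       # nudge the intercept a little

print(round(m, 2), round(b, 2))            # close to the least-squares slope and intercept (about 2.0 and 0.06)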
• Data Collection
• Data Preparation
• Choosing Learning Algorithm
• Training Model
• Evaluating Model
• Predictions
1. Data Collection-
In this stage, data relevant to the problem is gathered from various sources.
2. Data Preparation-
In this stage, the collected data is cleaned, formatted, and organized so that it can be used for training.
3. Choosing Learning Algorithm-
In this stage, a suitable learning algorithm is chosen based on the type of data and the task to be automated.
4. Training Model-
In this stage, the chosen algorithm is trained on the prepared data to build the model.
5. Evaluating Model-
In this stage, the trained model is evaluated on unseen data to measure how well it performs.
6. Predictions-
In this stage:
• The built system is finally used to do something useful in the real world.
• Here, the true value of machine learning is realized.
Pre-processing :
Data preprocessing is a process of preparing the raw data and making it suitable for a
machine learning model. It is the first and crucial step while creating a machine learning
model.
When creating a machine learning project, it is not always the case that we come across clean and formatted data. Real-world data generally contains noise and missing values, and it may be in an unusable format that cannot be fed directly to machine learning models. Data preprocessing is the required task of cleaning the data and making it suitable for a machine learning model, which also increases the model's accuracy and efficiency.
2) Importing Libraries
In order to perform data preprocessing using Python, we need to import some predefined Python libraries. These libraries are used to perform specific jobs, and three of them in particular are used for data preprocessing.
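A minimal sketch of these imports, assuming the three libraries are NumPy, Pandas, and Matplotlib (a common choice; the original list is not reproduced here), together with a hypothetical dataset file:

# Commonly used preprocessing imports (assumed; the original list is not shown here).
import numpy as np                   # numerical operations on arrays
import pandas as pd                  # reading and manipulating the dataset
import matplotlib.pyplot as plt      # plotting / inspecting the data

# dataset = pd.read_csv("data.csv")  # hypothetical file name, shown for context only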
5) Encoding Categorical Data
Since a machine learning model works entirely on mathematics and numbers, a categorical variable in our dataset may create trouble while building the model. So it is necessary to encode these categorical variables into numbers.
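One common way to do this (a sketch, assuming scikit-learn and pandas; the column names are hypothetical) is label encoding for a binary column and dummy/one-hot encoding for a multi-category column:

# Encoding categorical variables into numbers.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"country": ["France", "Spain", "Germany", "Spain"],
                   "purchased": ["No", "Yes", "No", "Yes"]})

df["purchased"] = LabelEncoder().fit_transform(df["purchased"])  # No/Yes -> 0/1
country_dummies = pd.get_dummies(df["country"])                  # one 0/1 column per country

print(country_dummies)
print(df["purchased"].values)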
6) Splitting the Dataset into the Training set and Test set
In machine learning data preprocessing, we divide our dataset into a training set and test set.
This is one of the crucial steps of data preprocessing as by doing this, we can enhance the
performance of our machine learning model.
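A typical split (a sketch using scikit-learn's train_test_split on toy data; an 80/20 split is assumed here) looks like this:

# Splitting the dataset into training and test sets.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)     # toy feature matrix with 10 rows
y = np.array([0, 1] * 5)             # toy labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
print(X_train.shape, X_test.shape)   # (8, 2) (2, 2)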
7) Feature Scaling
Feature scaling is the final step of data preprocessing in machine learning. It is a technique to
standardize the independent variables of the dataset in a specific range.
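One common scaling choice is standardization with scikit-learn's StandardScaler; in the sketch below (toy numbers), the scaler is fitted on the training set only and then reused on the test set:

# Feature scaling: standardize features using training-set statistics.
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
X_test = np.array([[1.5, 250.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # learn mean and std from training data
X_test_scaled = scaler.transform(X_test)         # reuse the same mean and std
print(X_train_scaled)
print(X_test_scaled)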
In the context of machine learning, hyperparameters are configuration variables that are set
before the training process of a model begins. They control the learning process itself, rather
than being learned from the data. Hyperparameters are often used to tune the performance of
a model, and they can have a significant impact on the model’s accuracy, generalization, and
other metrics.
Hyperparameter Tuning techniques
Models can have many hyperparameters and finding the best combination of parameters can
be treated as a search problem. The main strategies for hyperparameter tuning are:
1. GridSearchCV
2. RandomizedSearchCV
3. Bayesian Optimization
1. GridSearchCV
Grid search can be considered as a “brute force” approach to hyperparameter optimization.
We fit the model using all possible combinations after creating a grid of potential discrete
hyperparameter values. We log each set’s model performance and then choose the
combination that produces the best results. This approach is called GridSearchCV, because it
searches for the best set of hyperparameters from a grid of hyperparameter values.
Grid search is an exhaustive approach that can identify the ideal hyperparameter combination, but its slowness is a disadvantage: fitting the model with every potential combination often takes a lot of processing power and time, which might not be available.
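A minimal GridSearchCV sketch is shown below; the model, dataset, and grid values are illustrative, not taken from the original.

# Grid search: try every combination in the grid with cross-validation.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
param_grid = {"n_estimators": [10, 50, 100], "max_depth": [2, 4, None]}

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)                                   # fits all 9 combinations, 3 folds each
print(search.best_params_, round(search.best_score_, 3))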
2. RandomizedSearchCV
As the name suggests, the random search method selects values at random as opposed to the
grid search method’s use of a predetermined set of numbers. Every iteration, random search
attempts a different set of hyperparameters and logs the model’s performance. It returns the
combination that provided the best outcome after several iterations. This approach reduces
unnecessary computation.
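The corresponding RandomizedSearchCV sketch (again with an illustrative model and parameter ranges) only evaluates a fixed number of randomly drawn combinations:

# Random search: sample n_iter combinations instead of trying them all.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)
param_dist = {"n_estimators": list(range(10, 200, 10)),
              "max_depth": [2, 4, 6, 8, None]}

search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                            param_dist, n_iter=10, cv=3, random_state=0)
search.fit(X, y)                                   # fits only 10 randomly chosen combinations
print(search.best_params_, round(search.best_score_, 3))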
3. Bayesian Optimization
Grid search and random search are often inefficient because they evaluate many unsuitable
hyperparameter combinations without considering the previous iterations’ results. Bayesian
optimization, on the other hand, treats the search for optimal hyperparameters as an
optimization problem. It considers the previous evaluation results when selecting the next
hyperparameter combination and applies a probabilistic function to choose the combination
that will likely yield the best results. This method discovers a good hyperparameter
combination in relatively few iterations.
Ques : Regression vs Classification ?
Ans : Regression and Classification algorithms are Supervised Learning algorithms. Both the
algorithms are used for prediction in Machine learning and work with the labeled datasets.
But the difference between both is how they are used for different machine learning
problems.
The main difference between Regression and Classification algorithms is that Regression algorithms are used to predict continuous values such as price, salary, age, etc., whereas Classification algorithms are used to predict/classify discrete values such as Male or Female, True or False, Spam or Not Spam, etc.
Classification:
Classification is a process of finding a function which helps in dividing the dataset into
classes based on different parameters. In Classification, a computer program is trained on the
training dataset and based on that training, it categorizes the data into different classes.
The task of the classification algorithm is to find the mapping function to map the input(x) to
the discrete output(y).
Example: The best example to understand the Classification problem is Email Spam
Detection. The model is trained on the basis of millions of emails on different parameters,
and whenever it receives a new email, it identifies whether the email is spam or not. If the
email is spam, then it is moved to the Spam folder.
Regression:
Regression is a process of finding the correlations between dependent and independent
variables. It helps in predicting the continuous variables such as prediction of Market
Trends, prediction of House prices, etc.
The task of the Regression algorithm is to find the mapping function to map the input
variable(x) to the continuous output variable(y).
Example: Suppose we want to do weather forecasting, so for this, we will use the Regression
algorithm. In weather prediction, the model is trained on the past data, and once the training
is completed, it can easily predict the weather for future days.
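The contrast can be sketched with scikit-learn (synthetic data, illustrative models): a LogisticRegression classifier predicting a discrete label and a LinearRegression model predicting a continuous value.

# Classification vs regression on toy data.
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

X = np.arange(10).reshape(-1, 1).astype(float)

# Classification: discrete output (e.g. spam / not spam encoded as 1 / 0)
y_class = (X.ravel() > 4).astype(int)
clf = LogisticRegression().fit(X, y_class)
print(clf.predict([[2.0], [8.0]]))        # -> [0 1], two discrete class labels

# Regression: continuous output (e.g. a price)
y_reg = 3.0 * X.ravel() + 2.0
reg = LinearRegression().fit(X, y_reg)
print(reg.predict([[12.0]]))              # -> about 38, a continuous value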